One of the main things that sets Clouding apart is our triple-replica system for storing your servers.
With most providers, data is stored locally on the hypervisor itself. In other words, the hypervisor (the physical computer or host) provides the RAM, CPU and disk of the cloud servers it hosts.
However, at Clouding we store the disks of the Cloud Servers in a distributed storage cluster.
Each time a cloud server writes to disk, the data is written simultaneously to 3 different disks, each of them housed in a different storage node. In this way the disk of your server is not tied to any single physical server, but is distributed across multiple storage nodes.
This has multiple advantages and only a couple of small disadvantages.
Advantages
Quick recovery from hardware failures
In the event of a hardware problem in a hypervisor, we can start the servers hosted on that hypervisor on other hypervisors in just a few seconds. If the data were on the hypervisor's local disks, the damaged hardware would have to be replaced first, before the server disks could be accessed and recovered.
Great fault tolerance
With local disks, the usual choices are RAID 5 (in which only one disk can fail), RAID 6 (where two disks may fail) or RAID 10 (where between one and half of the disks may fail, depending on which ones fail). If more disks fail than the RAID level tolerates, a total loss of data occurs.
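The tolerances above can be sketched as a small check. This is only an illustration: the function name is hypothetical, and for RAID 10 it assumes that disks (0,1), (2,3), ... form the mirror pairs, which is one common layout but not the only one.

```python
def raid_survives(level: int, n_disks: int, failed: set[int]) -> bool:
    """Return True if the array still holds all its data after `failed` disks die."""
    if level == 5:
        return len(failed) <= 1      # single parity: one failure tolerated
    if level == 6:
        return len(failed) <= 2      # double parity: two failures tolerated
    if level == 10:
        if n_disks % 2:
            raise ValueError("RAID 10 needs an even number of disks")
        # Assumed layout: disks (0,1), (2,3), ... are mirror pairs.
        pairs = [(i, i + 1) for i in range(0, n_disks, 2)]
        # Data is lost only when both disks of the same pair have failed.
        return not any(a in failed and b in failed for a, b in pairs)
    raise ValueError(f"unsupported RAID level: {level}")


# Half of the disks can fail in RAID 10, as long as no mirror pair loses both.
print(raid_survives(10, 8, {0, 2, 4, 6}))
# Two failures are enough to lose data if they hit the same pair.
print(raid_survives(10, 8, {0, 1}))
```

The RAID 10 case shows why the answer "depends on which ones fail": the number of failed disks alone does not decide the outcome.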
At Clouding, 3 disks in 3 different storage nodes would have to fail at exactly the same time to cause data loss. Even then the loss would be partial, and easily recoverable from the disaster recovery system, since your server's data is stored in 4 MB blocks, with each block placed on different disks. Essentially, the data is not only on 3 specific disks, but distributed among the hundreds of disks that make up our storage.
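To see why a triple failure would only cause a partial loss, here is a minimal sketch of block placement. It is not our production placement algorithm: the node and disk counts, the volume name and the hash-based choice of disks are all assumptions made for the example. Only the 4 MB block size, the 3 replicas and the "one replica per storage node" rule come from the description above.

```python
import hashlib

BLOCK_SIZE = 4 * 1024 * 1024   # 4 MB blocks, as described above
NODES = 10                     # hypothetical number of storage nodes
DISKS_PER_NODE = 20            # hypothetical number of disks per node
REPLICAS = 3


def _score(key: str, node: int) -> bytes:
    """Deterministic pseudo-random score for a (block, node) combination."""
    return hashlib.sha256(f"{key}:{node}".encode()).digest()


def place_block(volume_id: str, block_index: int) -> list[tuple[int, int]]:
    """Pick 3 (node, disk) locations for one block, each on a different node."""
    key = f"{volume_id}:{block_index}"
    # Rank the nodes by score and keep the first three, so nodes never repeat.
    nodes = sorted(range(NODES), key=lambda n: _score(key, n))[:REPLICAS]
    return [
        (n, int.from_bytes(_score(key, n)[-4:], "big") % DISKS_PER_NODE)
        for n in nodes
    ]


def lost_blocks(volume_id: str, volume_bytes: int,
                failed_disks: set[tuple[int, int]]) -> list[int]:
    """Blocks whose three replicas all sit on failed disks."""
    n_blocks = -(-volume_bytes // BLOCK_SIZE)   # ceiling division
    return [
        i for i in range(n_blocks)
        if set(place_block(volume_id, i)) <= failed_disks
    ]


# Worst case for block 0: exactly the three disks holding it fail at once.
failed = set(place_block("vol-1", 0))
lost = lost_blocks("vol-1", 1024**3, failed)    # a 1 GiB volume = 256 blocks
print(f"{len(lost)} of 256 blocks lost")
```

Because every block picks its own three disks, a simultaneous failure of three particular disks hits only the blocks that happen to share exactly that combination, and the rest of the volume remains intact.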
Rapid Replica Recovery
In a normal local RAID, when a disk fails, it must be physically replaced before the data can be replicated back onto it. This means it usually takes many hours from the moment the fault is detected until the disk is changed and replication begins. In addition, the data can be replicated no faster than the maximum speed of the replacement disk, which increases the total recovery time.
Our distributed storage works very differently. When a disk fails, the replicas that were stored on it immediately begin to be re-created across all the other disks in the distributed storage system.
Because we use only solid-state drives, have several hundred disks, and have storage nodes capable of transmitting at up to 160 Gbps, re-replication not only starts on the spot but also completes in a very short time. It runs at tens of Gbps, using all the existing replicas (which are spread over two thirds of the disks) to generate a new replica on the remaining third.
This means that the recovery time from when a replica is lost until another one is available is a few minutes, compared to the hours it usually takes with a normal RAID system.
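A back-of-the-envelope calculation makes the difference concrete. All three figures below are assumptions chosen for the example, not measurements: 1 TB of data on the failed disk, a replacement disk that sustains about 4 Gbps (roughly 500 MB/s), and 20 Gbps as one possible value for "tens of Gbps".

```python
def rebuild_seconds(data_bytes: float, throughput_bits_per_s: float) -> float:
    """Time needed to copy `data_bytes` at the given throughput."""
    return data_bytes * 8 / throughput_bits_per_s


DATA_ON_FAILED_DISK = 1e12   # assumed: 1 TB of replicas on the failed disk
SINGLE_DISK_SPEED = 4e9      # assumed: 4 Gbps (~500 MB/s) for one replacement disk
CLUSTER_SPEED = 20e9         # assumed: 20 Gbps aggregate re-replication rate

raid_copy = rebuild_seconds(DATA_ON_FAILED_DISK, SINGLE_DISK_SPEED)
cluster_copy = rebuild_seconds(DATA_ON_FAILED_DISK, CLUSTER_SPEED)

print(f"local RAID copy:  {raid_copy / 60:.1f} min (plus the wait for a new disk)")
print(f"distributed copy: {cluster_copy / 60:.1f} min (starts immediately)")
```

Under these assumptions the copy alone is five times faster, and the local RAID figure does not yet include the hours spent waiting for the disk to be replaced.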
Data integrity
A normal RAID - even using systems like PatrolRead - only verifies the data occasionally and superficially. That is to say, it does not examine the data itself, but simply verifies that all sectors of the disk are readable.
The storage system we use at Clouding behaves very differently.
Each time data is written, a basic CRC of that data is generated and stored. Each time the data is read in the future, the CRC is recalculated and compared with the stored one. If the two do not match, the system discards that replica and uses another replica whose recalculated CRC matches the stored value.
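The read path described above can be sketched as follows. This is a simplified model rather than our storage system's code: it uses CRC-32 from Python's zlib module as a stand-in for the "basic CRC", and the function names and the in-memory dictionary are hypothetical.

```python
import zlib


def write_block(data: bytes, replicas: int = 3) -> dict:
    """Store several copies of the data plus the CRC computed at write time."""
    return {
        "crc": zlib.crc32(data),
        "copies": [bytearray(data) for _ in range(replicas)],
    }


def read_block(block: dict) -> bytes:
    """Return the first copy whose recalculated CRC matches the stored one."""
    for copy in block["copies"]:
        if zlib.crc32(copy) == block["crc"]:
            return bytes(copy)
        # CRC mismatch: discard this replica and try the next one.
    raise IOError("no replica matches the stored CRC")


block = write_block(b"customer data")
block["copies"][0][0] ^= 0xFF          # simulate silent corruption of replica 0
print(read_block(block))               # served from a healthy replica
```

The reader never sees the corrupted copy: the mismatch is caught on the first replica and the data is served from the second.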
Additionally, the CRCs of all the replicas are compared every day to ensure the consistency of the data.
As a final layer of protection, a weekly bit-by-bit check is made, comparing the entire content of each replica against the two other existing replicas. If one replica differs from the other two, the 2 matching replicas are used to overwrite the mismatched replica. In addition, an error is generated so that our technicians can check the affected disk for possible problems.
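The weekly check amounts to a two-out-of-three vote. The sketch below is only an illustration of that idea with a hypothetical function name: it compares three in-memory replicas, repairs the odd one out, and returns its index so that it can be reported.

```python
def scrub(replicas: list[bytearray]) -> list[int]:
    """Compare three replicas in full and repair the one that disagrees.

    Returns the indices of repaired replicas (empty if all three match).
    """
    a, b, c = replicas
    if a == b == c:
        return []
    if a == b:
        bad = 2
    elif a == c:
        bad = 1
    elif b == c:
        bad = 0
    else:
        raise ValueError("no two replicas agree; manual recovery needed")
    good = replicas[(bad + 1) % 3]      # either of the two matching replicas
    replicas[bad][:] = good             # overwrite the mismatched replica
    return [bad]                        # reported so the disk can be reviewed


replicas = [bytearray(b"abcd"), bytearray(b"abXd"), bytearray(b"abcd")]
repaired = scrub(replicas)
print("repaired replicas:", repaired)
```

After the call all three replicas are identical again, and the returned index tells the technicians which disk to inspect.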
This ensures that the data is always maintained without any corruption.
These checks are carried out both on the storage nodes that hold the disks of the Cloud Servers and on the independent storage nodes that hold the backups and snapshots.
Snapshots and Clones
To take a snapshot of a server on a local RAID, all the data has to be copied to an external storage system, from which it can later be downloaded to restore that snapshot onto another server. All these data transfers, especially with volumes of a certain size, can lead to quite long waits each time such an operation is performed.
At Clouding, because we have centralized storage, we don't need to copy data. We simply instruct our storage system to create a new instance of an existing disk. The distributed storage system is capable of performing this action in a completely transparent manner and without any waiting for the user.
Therefore, at Clouding it is possible to create a server from a snapshot or clone a server in seconds, since we do not have to make copies of large volumes of data.
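One common way for a storage system to create a new disk without copying data is copy-on-write layering. The sketch below assumes that technique purely for illustration; the text above does not specify the mechanism, and the class and method names are hypothetical.

```python
class Volume:
    """A disk made of a private block map on top of an optional frozen parent."""

    def __init__(self, parent=None, blocks=None, read_only=False):
        self.parent = parent
        self.blocks = {} if blocks is None else blocks
        self.read_only = read_only

    def snapshot(self) -> "Volume":
        """Freeze the current contents as a read-only layer, copying no data."""
        frozen = Volume(self.parent, self.blocks, read_only=True)
        self.parent, self.blocks = frozen, {}   # new writes go to a fresh layer
        return frozen

    def write(self, index: int, data: bytes) -> None:
        if self.read_only:
            raise PermissionError("snapshots are read-only")
        self.blocks[index] = data

    def read(self, index: int):
        layer = self
        while layer is not None:                # walk down to older layers
            if index in layer.blocks:
                return layer.blocks[index]
            layer = layer.parent
        return None                             # block was never written


base = Volume()
base.write(0, b"A")
snap = base.snapshot()          # instant, regardless of volume size
clone = Volume(parent=snap)     # a new server disk backed by the snapshot
base.write(0, b"B")             # later changes do not affect the snapshot
clone.write(1, b"C")
print(clone.read(0), base.read(0), clone.read(1))
```

Creating the snapshot and the clone only rearranges references, which is why the operation can finish in seconds even for large volumes.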
Independent Backups
In a local RAID system, in order to make a backup, the hypervisor - or the cloud servers - has to send the backup to the destination. That can cause performance or configuration problems.
However, at Clouding we can read directly from the distributed storage system to perform the backups, so the performance of the hypervisors is not affected. Since we read from hundreds of disks, the backup load is spread out and does not affect storage performance either.
On top of this, the backup system is completely independent of hypervisors or cloud servers, so that it is always carried out regardless of their status.
Disadvantages
Cost
One disadvantage of this system is the cost, since we must provide 3 times the storage contracted by our customers. This, added to our use of only datacenter-grade solid-state drives, significantly increases costs.
The storage nodes are very high-performance machines, with processors that have a large number of cores and very high clock frequencies, as well as a large amount of RAM for caching.
A high-performance, low-latency network is also necessary to ensure that the impact of this system on performance is minimal. That's why at Clouding we use 40G Cisco switches with latencies of a few microseconds per hop.
Performance
Naturally, the extra abstraction layer and the separation of the compute nodes from the storage nodes slightly affect performance.
The Clouding team works continuously to minimize these performance differences, and today we achieve access times and transfer rates that are sufficient for disk-intensive applications.
Transfer rates are very similar to those of a local solid-state disk, and access times remain below 1 millisecond, typically between 500 and 700 microseconds.