Ensuring application Quality of Service (QoS) is an essential part of hosting in the cloud. Legacy spinning disk storage is not designed to handle the performance variability of multi-tenant cloud workloads. Many cloud hosting providers have made the switch from mechanical spinning disks to a full flash-based solid-state drive (SSD) architecture.
SSDs offer much faster reads and writes (IO operations per second, or IOPS) than legacy spinning disks, but raw drive speed alone can only take cloud server hardware so far. Guaranteeing a predictable cloud hosting experience requires a storage architecture built for it from the beginning, starting with three key components:
- an all-flash architecture
- scale-out design
- no reliance on RAID
The rise of RAID
The invention of RAID 30+ years ago was a major advance in data protection, allowing inexpensive disks to store redundant copies of data and rebuild onto a new disk when a failure occurred. RAID has advanced over the years with multiple approaches and parity schemes to maintain relevance as disk capacities have increased dramatically, and some form of RAID is used on virtually all enterprise storage systems today. However, the core problem with traditional RAID can no longer be glossed over: simply put, it cannot guarantee performance when a failure occurs, because a single failed disk degrades the entire array.
The problem with RAID
RAID causes a significant performance penalty and reduced IOPS when a disk fails — often 50% or more — because disk failures cause a 2-5X increase in IO load to the remaining disks. In a simple RAID10 setup, a mirrored disk has to serve double the IO load, plus the additional load of a full disk read to rebuild into a spare. The impact is even greater for parity-based schemes like RAID5 and RAID6, where a read that would have hit a single disk has to hit every disk in the RAID set to rebuild the original data.
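The arithmetic above can be sketched in a few lines. This is an illustrative model only, with assumed IOPS figures and cluster sizes (not measurements from any real system), contrasting a RAID10 mirror pair with replication spread across a wide pool of disks:

```python
# Illustrative sketch with assumed numbers: per-disk IO load after a
# single disk failure, under RAID10 vs. wide distributed replication.

def surviving_mirror_load(load_per_disk):
    """RAID10: the surviving mirror absorbs its dead partner's entire load."""
    return load_per_disk * 2  # doubled, before even counting rebuild reads

def distributed_load(load_per_disk, total_disks):
    """Distributed replication: the failed disk's load spreads over all peers."""
    return load_per_disk * (1 + 1 / (total_disks - 1))

print(surviving_mirror_load(1000))         # 2000 IOPS: a 100% jump
print(round(distributed_load(1000, 100)))  # 1010 IOPS: roughly a 1% jump
```

The wider the pool, the smaller the per-disk bump, which is why redistribution across an entire storage cloud barely registers while a mirror pair falls off a cliff.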
The performance impact from RAID rebuilds has worsened in recent years due to the long rebuild times incurred by multi-terabyte drives. Since traditional RAID rebuilds entirely onto a single new spare drive, the rebuild is bottlenecked by the write speed of that one drive, compounded by the read load placed on the few other drives in the RAID set. Rebuild times of 24 hours or more are now common, and the performance impact is felt the entire time.
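A quick lower-bound estimate shows why single-spare rebuilds take so long. The drive capacity and write speed below are assumed round numbers for illustration, and the result ignores the concurrent production IO that stretches real-world rebuilds even further:

```python
# Back-of-the-envelope sketch with assumed drive specs: rebuilding into a
# single spare funnels the whole drive's contents through one write path.

def raid_rebuild_hours(capacity_tb, spare_write_mb_s):
    """Best-case rebuild time: full capacity / one spare's write speed."""
    capacity_mb = capacity_tb * 1_000_000
    return capacity_mb / spare_write_mb_s / 3600

# An 8 TB drive rebuilding at a sustained 150 MB/s:
print(round(raid_rebuild_hours(8, 150), 1))  # ~14.8 hours, on an idle system
```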
How can a cloud hosting provider possibly meet a performance or QoS guarantee when a single disk failure can literally lead to hours or days of system-wide degraded performance? In a production cloud environment, hearing from your provider that “the RAID array is rebuilding from a failure” is little comfort.
The only option available to cloud hosting companies using legacy storage architectures is to dramatically under-provision, deploying much larger storage arrays in the hope of creating enough of a buffer that the impact of RAID rebuilds goes unnoticed. Those costs, of course, are passed along to the customer.
ServInt’s SolidFire SSD VPS offers another option
ServInt’s SolidFire data protection is a post-RAID distributed replication algorithm. This solution spreads redundant copies of block-level data across all the disks in a massive storage cloud rather than within the limited pool of disks that makes up a RAID array. Data is distributed in such a way that when a disk fails, the IO load it was serving spreads out evenly among every remaining disk in the system, with each disk needing to handle only a few percent more IO, not the double or triple load it would absorb under a RAID configuration.
Furthermore, if there is a disk failure, data is automatically rebuilt in parallel to the free space on all of the remaining disks rather than to a single dedicated spare drive. This self-healing architecture only requires each drive in the system to quickly share 1-2% of its data with its peers, allowing for rebuilds in a matter of seconds or minutes rather than hours or days.
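The same estimate as before, applied to a parallel rebuild, shows where the seconds-to-minutes claim comes from. The SSD capacity and write speed are assumed figures; the share fraction is the 1-2% mentioned above:

```python
# Sketch with assumed SSD specs: in a distributed rebuild every surviving
# drive re-replicates a small slice in parallel, so total time is governed
# by one small slice rather than one whole drive.

def distributed_rebuild_seconds(drive_capacity_gb, share_fraction, drive_write_mb_s):
    """Each drive writes only its small share; all drives work at once."""
    slice_mb = drive_capacity_gb * 1000 * share_fraction
    return slice_mb / drive_write_mb_s

# A 960 GB SSD sharing 1.5% of its data at 400 MB/s:
print(round(distributed_rebuild_seconds(960, 0.015, 400)))  # ~36 seconds
```

Contrast this with the hours-long single-spare rebuild modeled earlier: the work is both smaller per drive and fully parallel.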
The combination of even load redistribution and rapid rebuilds allows ServInt’s SolidFire SSD VPS to continue to guarantee performance and IOPS even when failures occur — without passing massive over-provisioning costs onto customers — something that just isn’t possible with traditional RAID, or even SSD-based RAID.