Intel Patsburg and Software RAID

I just got done reading the “Intel eats crow on software RAID” writeup from The UK Register. On one side I’m really happy to see server based software RAID (or Virtual RAID Adapters, VRAs, as we called them at Ciprico and DotHill) coming into the spotlight again. Performance, especially now with SSD usage on the rise, is definitely one of the strengths of a software RAID solution, which can scale to much faster rates than a hardware RAID adapter in terms of raw IOPS or MB/s. After all, it’s using the power of a 2-3GHz multi-core Intel or AMD CPU coupled to a very fast memory and I/O bus, versus some fixed function, 800MHz-1.2GHz embedded RAID CPU hanging off the PCIe bus.

On the other hand, asking if software RAID is faster than or can replace hardware RAID is not really the right question to be asking here. Sure, software RAID with persistent storage like SSDs is changing the landscape as far as making a pure host based software RAID viable, but for traditional hard disk drives not much has changed. There are a lot of volatile (i.e. gone if the lights go out) storage stages on the way from the application that wrote the data, through the storage IO device (be it a hardware accelerated RAID adapter or simple IO device), through the 32+ MBytes of cache on the drive if you left it enabled, until you actually arrive at the persistent storage media, the platter. Oh, and then there is VMware ESX, which can’t support a conventional software RAID stack yet.
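To make that chain of volatile stages a little more concrete, here is a minimal Python sketch of an application pushing a write through the layers it can actually control. The function name `durable_write` and the file name are made up for illustration. What happens beyond `os.fsync` (adapter cache, drive write cache) depends on the OS, driver, controller and drive settings, which is exactly the part the application cannot see.

```python
import os
import tempfile


def durable_write(path, payload):
    """Write payload and push it through the application and OS caches.

    Hypothetical helper for illustration only.
    """
    with open(path, "wb") as f:
        f.write(payload)      # lands in Python's user-space buffer (volatile)
        f.flush()             # hands it to the OS page cache (still volatile)
        os.fsync(f.fileno())  # asks the OS to push it out to the storage device
    # Anything cached further down (RAID adapter, drive write cache) is
    # outside the application's direct control at this point.


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "journal.bin")
    durable_write(target, b"commit record")
    with open(target, "rb") as f:
        data = f.read()
```

Even with every call above succeeding, a write-back cache further down the stack can still be holding the data when the power goes, which is why the battery or UPS question matters.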

So let’s get some perspective here.

First, as any good RAID vendor will tell you, it’s not so much about software vs hardware RAID, it’s about who is providing your RAID stack, how many “RAID bytes served so far” it has behind it, how good the service and support are and, essentially, how much you trust the vendor offering the software RAID stack. This is where a RAID stack’s “age” and pedigree are important regardless of its implementation. Being a good software RAID provider goes well beyond making it fast. It’s about how robust your solution is and how good your support is when things don’t work right and you need help fixing them. Hard disks (and SSDs are no exception) throw all sorts of curve balls at you, and only the robustness of your RAID vendor’s test and compatibility labs can really filter a lot of this out. It often takes a knowledgeable RAID systems engineer to figure out whether it was, or was not, the fault of the RAID stack in the first place. My deepest respect is for those folks that have to spend their Sundays way into the wee hours of the morning figuring these sorts of things out when the fault defies conventional logic.

Second, on the technology side, RAID is always implemented in software in IT applications, regardless of whether it is host or hardware based. It either runs on the host CPU (software, chipset or host RAID) or on a dedicated CPU on the RAID adapter (hardware RAID), sometimes in host software with some assistance from the hardware (e.g. XOR calculations). Granted, one runs in an unpredictable OS environment and the other in a more closed and predictable embedded one, but they end up doing the same thing in software on different CPUs. While there are cases where software RAID may be sufficient and more affordable as it eliminates much of the hardware cost, there are probably just as many cases where it just doesn’t work well at all. Case in point being VMware ESX (see earlier post on this topic here), where there are no commercially available, bootable software RAID solutions, plus there are fewer general CPU cycles to spare anyhow. So hardware RAID tends to win out here. Also, software RAID doesn’t fully protect your data from a system power loss unless you are protecting the whole server with a dedicated UPS which can do an orderly shutdown of the system in the event of a plant power loss. Then there is the video editing crowd, who may be using their host CPUs for video compression, another case where software RAID often fails for lack of available CPU cycles.
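For readers who haven’t seen it, the XOR calculation mentioned above is the same arithmetic whether it runs on the host CPU or on the adapter. Below is a minimal Python sketch of single-parity (RAID 5 style) XOR over one stripe. The function names and the tiny four-byte blocks are made up for illustration, and this is not any vendor’s actual implementation.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    if not blocks:
        raise ValueError("need at least one block")
    length = len(blocks[0])
    if any(len(b) != length for b in blocks):
        raise ValueError("all blocks must be the same length")
    # zip(*blocks) walks the blocks column by column (one byte from each).
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


def rebuild_missing(surviving_blocks, parity):
    """Recover the single missing data block from the survivors plus parity."""
    return xor_blocks(list(surviving_blocks) + [parity])


# Illustrative stripe: three data blocks, one per (hypothetical) drive.
stripe = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(stripe)

# Pretend the drive holding the second block has failed.
recovered = rebuild_missing([stripe[0], stripe[2]], parity)
assert recovered == stripe[1]
```

The hard part of a RAID stack is not this math but everything around it: error handling, rebuilds, drive quirks and what happens when power drops halfway through a stripe update.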

So, the key questions to be asking about software RAID in my mind are not how fast it can go, but:

  • How robust is the RAID stack in question, i.e. how many “bytes were served” before you got to it, and who else is using it in a mission critical environment?
  • How would my business be impacted by a server power loss running software RAID? Can I live with a UPS to protect the whole server as I have a fast means of getting back to a fully operational level?
  • Who’s going to support it when it goes wrong and how good is this support when it comes to knowing both the RAID stack’s strengths and limitations?
  • Are you comfortable buying a RAID solution from a chip vendor, or would you rather buy from a storage vendor who makes their livelihood from creating highly robust disk array systems? You may be perfectly ok with the former.

All of these will depend on just how important your data is and, more importantly, how quickly you can restore the system to full operation in the event of a hardware failure.