Of course, we all know that. Everyone has bad days where everything seems to go wrong, and businesses are no different.
In our case, the hosting industry relies on thousands of pieces of ever-changing, ever-evolving hardware that need to be married with new standards and updated software. Problems do happen, even to those who plan carefully while building their infrastructure, but how a company responds when problems occur is far more important than whether problems happen at all.
At ServInt, we have always prided ourselves on being the kind of company that plans extensively. That being said, we are still faced with a serious and slightly philosophical question: why do problems occur, and how do we prevent them?
Our industry has always loved to tout how easy it is to be up all the time, immune to problems, and how technology can solve just about anything. Because the industry has been so successful at selling this message, we frequently hear comments like, “Well, if my server has RAID and you have redundant routers, how can anything go wrong?” Many assume, and justifiably so, that if something fails, the redundant resource should take over automatically.
The problem with this argument is that it assumes that there is no price to pay for technological complexity. What we find time and again is that the more redundancy you add to a system, and the more problems that you try to prevent by putting into place additional safety measures, the more possible points of failure you add. With each layer of protection, you add another layer of hardware and software that can fail and another layer of potential device compatibility issues.
What does this mean in real life? Well, let me give you some examples.
In the realm of “redundant” hardware and software, let’s look more closely at RAID. Why is it possible to occasionally lose data on a RAID array? RAID is a technology for grouping physical disks in order to provide various redundancy and performance characteristics that are not available with a single disk. Most of the time it is used to “mirror” data across at least two disks, so that if one disk fails, there is always a copy to recover from. You just pop in a new, compatible disk and the RAID controller rebuilds the volume in the background while your server is still live and servicing requests.
If this is the case, why does data storage failure occur? Usually, one of two things happens. Most often, the hardware itself does not fail, but the server’s operating system does, or more specifically, the filesystem device driver does. The OS corrupts the filesystem, and the RAID device faithfully mirrors the corruption to all data copies. Unfortunately, in most cases there’s not a whole lot you can do. You reboot, run fsck, and hope for the best.
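To make the point concrete, here is a toy sketch of why mirroring does not protect against corruption that happens above the disks. It is not a model of any real RAID controller or filesystem; the MirroredVolume class and the buggy_filesystem_write function are hypothetical names invented for this illustration.

```python
class MirroredVolume:
    """Toy model of a two-disk mirror: every write goes to both disks."""

    def __init__(self) -> None:
        # Each dict stands in for one physical disk: block number -> data.
        self.disks = [{}, {}]

    def write(self, block: int, data: bytes) -> None:
        # The mirror cannot tell good data from bad; it simply copies it.
        for disk in self.disks:
            disk[block] = data

    def read(self, block: int, disk_index: int = 0) -> bytes:
        return self.disks[disk_index][block]


def buggy_filesystem_write(volume: MirroredVolume, block: int, data: bytes) -> None:
    """Hypothetical faulty driver that garbles data before it reaches the volume."""
    corrupted = data[::-1]  # reversing the bytes is a stand-in for corruption
    volume.write(block, corrupted)


volume = MirroredVolume()
buggy_filesystem_write(volume, 0, b"customer-records")
copies = [volume.read(0, i) for i in range(2)]

print(copies[0] == copies[1])            # True: the two copies agree with each other
print(copies[0] == b"customer-records")  # False: neither copy holds the intended data
```

Both disks end up perfectly in sync, which is exactly the problem: the redundancy worked as designed and preserved the damage.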
The other likely scenario is that your RAID controller fails. Your disks in a RAID array may be redundant, but your RAID controller is usually a single point of failure. If it dies, it will probably either send corrupted data to all disks in the array or fail to write the data to every disk, leaving your array internally inconsistent. With time, the inconsistency multiplies, eventually causing the entire array to fall over.
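A little arithmetic shows how much a shared controller limits what mirroring can buy you. The sketch below assumes failures are independent, which real hardware does not guarantee, and the probabilities are made-up numbers chosen only for illustration, not measurements from our systems.

```python
def mirror_with_controller_failure(p_disk: float, p_controller: float) -> float:
    """Chance of losing the volume: both disks fail, or the one shared controller fails.

    Assumes the failures are independent of each other.
    """
    p_both_disks = p_disk ** 2
    return 1 - (1 - p_both_disks) * (1 - p_controller)


# Hypothetical failure probabilities over some fixed period (illustrative only).
P_DISK = 0.03
P_CONTROLLER = 0.02

single_disk = P_DISK
mirror_only = P_DISK ** 2
mirror_plus_controller = mirror_with_controller_failure(P_DISK, P_CONTROLLER)

print(f"single disk:            {single_disk:.4f}")
print(f"mirror, ideal:          {mirror_only:.4f}")
print(f"mirror with controller: {mirror_plus_controller:.4f}")
```

With these assumed numbers, the mirrored pair on its own looks nearly bulletproof, but once the controller is counted, the overall figure can never drop below the controller's own failure probability. The extra layer that provides the redundancy also sets the floor.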
That’s an example of how redundant hardware can fail. What about issues of device interaction and compatibility?
Those issues are just as serious, and just as difficult to solve. In the case of our recent network blip, we had a single component in a high-end routing device fail. Since it was a single component in a greater system, the redundant hardware in the equation could still communicate with the injured device. The injured device didn’t think that it was sufficiently injured to remove itself from duty. In essence, it was sick and decided to come back to work…to no one’s benefit.
That’s what happens when a “smart” device isn’t smart enough. We can’t depend on a single system to self-report all possible failure cases correctly, nor can we expect it to know how potential failures could impact the greater network.
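One common way to guard against a device that misjudges its own health is to check it from the outside as well. The sketch below is only an illustration of that idea, not a description of our monitoring; the Router class, its fields, the device names, and the 1% loss threshold are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Router:
    """Hypothetical router model; the field names are illustrative only."""
    name: str
    self_reported_healthy: bool
    packets_sent: int
    packets_delivered: int


def externally_healthy(router: Router, max_loss: float = 0.01) -> bool:
    """Judge health from observed traffic rather than the device's own status."""
    if router.packets_sent == 0:
        return False  # no evidence that it works, so do not trust it
    loss = 1 - router.packets_delivered / router.packets_sent
    return loss <= max_loss


def should_stay_in_service(router: Router) -> bool:
    # Require both signals: the device's own view and an outside measurement.
    return router.self_reported_healthy and externally_healthy(router)


# Both devices claim to be fine, but only one is actually passing traffic well.
injured = Router("core-a", self_reported_healthy=True,
                 packets_sent=10_000, packets_delivered=7_200)
standby = Router("core-b", self_reported_healthy=True,
                 packets_sent=10_000, packets_delivered=9_990)

print(should_stay_in_service(injured))  # False: 28% loss despite reporting healthy
print(should_stay_in_service(standby))  # True
```

Even a check like this only catches the failure modes someone thought to measure, which is why a person still needs to be watching.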
The previous examples are just a couple of possibilities that illustrate why no amount of technology will prevent all problems. This is where having talented people pays off.
At ServInt, our customers have always enjoyed the performance and value of our services, but that’s not what keeps them here. At the end of the day, they can sleep well at night knowing that we’re here. In the rare case that an issue occurs, they can count on us to take care of them quickly. If a client needs to speak to someone, there is always going to be a real human being to talk to who understands the problem and how to fix it.
This we promise.
Photo by W Robert Howell