Detaching Infrastructure From Physical Hosts: Fantasy vs. Reality

Dead hard drive
Image via http://www.flickr.com/photos/martinlatter/

Cloud computing has brought with it the promise of computer clusters that are easy to scale and yet affordable. There are various clouds out there providing infrastructure as a service, such as Amazon EC2 and Mosso, as well as higher-level platforms like Google App Engine and the newcomer Force.com Sites, to name a few. As a developer, I personally have experience only with Amazon EC2, and I am a devoted fan and user of the entire AWS stack. Nonetheless, I believe that what I have to say here is relevant to all the other platforms as well.

While the cloud and the IaaS model do have many significant advantages over traditional physical hosting, there is one major annoyance still to overcome in this space: your virtual host is still tied to a physical machine. That machine is non-redundant, it has no hot backup, and there is no way to fail over from it transparently and hassle-free once it malfunctions. This is why, from time to time, I get this email from Amazon:

Hello,

We have noticed that one or more of your instances are running on a host degraded due to hardware failure.

i-XXXXXXXX

The host needs to undergo maintenance and will be taken down at XX:XX GMT on XXXX-XX-XX. Your instances will be terminated at this point.

The risk of your instances failing is increased at this point. We cannot determine the health of any applications running on the instances. We recommend that you launch replacement instances and start migrating to them.

Feel free to terminate the instances with the ec2-terminate-instance API when you are done with them.

Let us know if you have any questions.

Sincerely,

The Amazon EC2 Team
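The routine that email demands can be sketched in a few lines. This is a hypothetical helper, not anything Amazon provides: the instance and AMI IDs are made up, and the two steps it emits correspond to the real ec2-run-instances and ec2-terminate-instances command-line tools (the email's "ec2-terminate-instance" refers to the latter).

```python
# Hypothetical planner for the manual migration Amazon recommends: for each
# degraded instance, launch a replacement from the same AMI, then terminate
# the old one. All IDs below are invented for illustration.

def plan_migration(degraded_ids, catalog):
    """Return the ordered EC2 commands needed to replace degraded hosts.

    degraded_ids -- instance IDs named in Amazon's degradation email
    catalog      -- maps every instance ID to the AMI it was launched from
    """
    commands = []
    for instance_id in degraded_ids:
        ami = catalog[instance_id]
        # Launch the replacement first, so capacity never drops...
        commands.append("ec2-run-instances %s" % ami)
        # ...and only terminate the degraded instance afterwards.
        commands.append("ec2-terminate-instances %s" % instance_id)
    return commands

catalog = {"i-11111111": "ami-aaaaaaaa", "i-22222222": "ami-bbbbbbbb"}
for cmd in plan_migration(["i-11111111"], catalog):
    print(cmd)
```

The point of spelling it out is how mechanical it is: every step here is something the platform could do for me, which is exactly the complaint below.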

At this stage, this is one of the greatest shortcomings of EC2 from my point of view. As a customer of EC2, I don’t want to care whether a host has a hardware failure. Why can’t my instance simply be mirrored somewhere else, consistent hot-backup style, and be transparently switched to the backup host when the original host’s hardware fails? I wouldn’t mind paying an extra buck for this service.

In my vision, in a true IaaS cloud there is no coupling between the virtual machine and the physical host. The virtual machine truly floats in the cloud, unbound from the physical realm by means of consistent mirroring across physical hosts.

And you might be thinking “you can implement this on your own on top of the existing infrastructure that EC2 offers”, and “you should be prepared for any instance going poof”. And you are correct: with EC2’s current offering, this is the case. You always have to be prepared for an instance failure (in the last month I had 2 physical host failures out of about 20 hosts, which works out to a monthly failure rate of about 10% (!!)), and you always have to build your architecture so that it fails over gracefully from a single host failure. But were my vision a reality, I wouldn’t have to worry about these things, nor spend time and money on the overhead they incur.
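That 10% per-host figure is worse than it sounds once you run a fleet. A quick back-of-the-envelope calculation, assuming independent failures at the rough rate observed above, shows how fast the odds of at least one failure climb with fleet size:

```python
# Back-of-the-envelope check of the numbers above: if each physical host has
# a ~10% chance of failing in a given month (2 failures out of ~20 hosts),
# what are the odds that at least one host in a fleet of n fails?

def p_any_failure(n_hosts, p_host=0.10):
    """Probability that at least one of n_hosts fails this month,
    assuming independent failures with per-host probability p_host."""
    return 1 - (1 - p_host) ** n_hosts

for n in (1, 5, 20):
    print("%2d hosts -> %.0f%% chance of at least one failure"
          % (n, 100 * p_any_failure(n)))
```

With 20 hosts the chance of at least one failure in a month is close to 90%, which is why "be prepared for any instance going poof" stops being a corner case and becomes the normal operating mode.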

I am not certain whether the other clouds share this shortcoming, but if they do not, their solution might come at the price of less flexibility, which is a major part of EC2 that I am not willing to give up. If that flexibility can be maintained, I would love to see my vision become a reality on EC2.

Hardware Failure Apocalypse

I might know a thing or two about handling servers, configs, deployments and cloud architecture. But when it comes to hardware failure on my own workstation, I become a complete layman.

My Lenovo R61 has never failed me before. It runs a mighty Ubuntu 8.04 with all the components a hacker needs (from a complete LAMP stack, through PDT and a customized build of svn 1.5.1, to Inkscape and xvidcap…). But this time, after the system froze and I rebooted, I just gazed at the terminal during startup and shrieked:

Kernel panic - not syncing: Attempted to kill init!

And a whole bunch of other error messages, each time at a different stage of the boot sequence. This behavior, combined with the fact that the system simply froze without my having made any dramatic changes, makes me think it’s bad RAM or some other hardware component (like here; the disk is of course a candidate as well), though sometimes it seems people get past it by re-installing the kernel.

I don’t know which I’d prefer, a hardware failure or a software failure. I guess RAM failure is the best case: just swap in new RAM. Disk failure might mean data loss, which I certainly don’t want to handle, and recompiling the kernel can be a tedious task, but it’s preferable to losing data and having to re-install the whole system.

And what I asked myself as I rode my bike back home today was: “why can’t I just instantiate a new instance in the cloud from the newest working snapshot of my system? Why is hardware failure in the cloud so easy to deal with, while hardware failure in the office isn’t?”. And I had a vision of everyone working on machines resembling mainframe terminals, running only the bare essentials, with the OS and all the data sitting in the cloud.

That day isn’t far off. But tomorrow it’s back to the lab to (hopefully) have my RAM replaced.