Fido.Net Status Updates

Outage Report – 18.01.08

It has been a very long four days, and I would like to express my sincerest apologies for the disruption in service experienced by those customers affected by the failure of the storage array on vpsh3 on Thursday morning.

We have now all but given up trying to recover any data from the failed array. The drives have been taken away for further investigation, along with the controller; if at some point we are able to recover the missing files, we will of course do so.

This report is extremely technical, and quite long. The abridged, non-technical version is that several components which combine to provide a “fault tolerant” disk drive (one which works around problems and repairs itself when something goes wrong) all failed at the same time, something which statistically is almost impossible. Attempts to repair the array and recover the data have failed, and we have had to restore from the last backup we were able to retrieve from our offsite storage; owing to storage space issues, more recent backups were still on the drive which failed, as we had not yet been able to copy them off to a remote server.

What are we doing to make sure this doesn’t happen again?

We have had orders in for additional servers and additional disks since before Christmas. They are now starting to arrive, albeit far too slowly for our liking, due, I believe, to supply chain problems. As these additional drives and servers arrive we are putting them together and working to ensure that we have more resilience in place, more than sufficient storage, and better checks for potential failure modes so that we can catch them before they happen.

We will also be working to promote more proactively the backup services which are available to all customers, to ensure customers know how to use them and that they download the backups rather than leaving them on the server, which is what generally seems to happen at the moment.

The knowledge base covers this in some detail – search for “Plesk Backup” on our main home page (www.fido.net).
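
If it helps, the sketch below shows one way to pull a backup copy down to your own machine on a schedule. It is an illustration only: the host name, login and file paths are placeholders rather than real Fido.Net systems, and the exact location of the backup archive will depend on your Plesk version.

#!/usr/bin/env python3
# Illustrative sketch only: copy a backup archive off the hosting server so
# that a copy survives a failure of the server's own storage.
# The host, account and paths below are hypothetical placeholders.

import subprocess
from datetime import date

REMOTE_HOST = "user@your-server.example.com"   # placeholder login
REMOTE_FILE = "/var/lib/psa/dumps/latest.tar"  # adjust for your Plesk version
LOCAL_DIR = "/home/you/backups"

def fetch_backup():
    """Copy the remote backup archive to a dated local file using scp."""
    target = f"{LOCAL_DIR}/plesk-backup-{date.today().isoformat()}.tar"
    # scp exits non-zero on failure; check=True turns that into an exception
    subprocess.run(["scp", f"{REMOTE_HOST}:{REMOTE_FILE}", target], check=True)
    return target

if __name__ == "__main__":
    print("Backup saved to", fetch_backup())

Run from cron on your own machine, a script along these lines keeps a dated copy of each backup entirely off the server.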

So, what happened?

Overnight / early on Thursday morning it appears that 3 of the 7 drives in the RAID array shut down / failed. We are still unsure as to why, as the server was well within tolerances with regards to heat and service load, as was the rack in which it was housed in TeleHouse North.

In addition to these 3 drives shutting down, a 4th disk had developed ECC errors. Normally if this happens the RAID controller detects the errors, disables the drive and brings in one of the “hot spares” which sit there just for this purpose.

Two of the drives which had shut down were the hot spares. We believe this to be the result of a bug in the firmware on the hard drive controller. Western Digital have supplied an update for these YS drives, which we are rolling out as quickly as possible.
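
For those unfamiliar with how a hot spare works, the toy sketch below (illustrative Python only, not the behaviour of any particular controller or firmware) shows the normal self-repair path, and why it is defeated when the spares themselves are among the drives that shut down.

# Toy model of hot-spare promotion in a RAID array (illustration only).

def promote_spare(active, spares, failed_drive):
    """Disable the failed drive and swap in the first healthy hot spare.

    active -- names of drives currently in the array
    spares -- list of (name, healthy) tuples for the hot spares
    Returns the promoted spare's name, or None if no healthy spare remains.
    """
    active.remove(failed_drive)          # controller takes the bad drive offline
    for i, (name, healthy) in enumerate(spares):
        if healthy:
            spares.pop(i)
            active.append(name)          # spare joins the array, rebuild begins
            return name
    return None                          # no healthy spare: array stays degraded

# Normal case: one data drive fails and a spare takes over.
print(promote_spare(["d1", "d2", "d3", "d4", "d5"], [("s1", True), ("s2", True)], "d3"))   # s1

# Our failure mode: the hot spares had themselves shut down, so there was
# nothing left to promote and the array could not repair itself.
print(promote_spare(["d1", "d2", "d3", "d4", "d5"], [("s1", False), ("s2", False)], "d3"))  # None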

The failure on vpsh3 was detected at approximately 09:30am; the file system had gone into a degraded, read-only state to protect the integrity of the data on the drives.

The standing advice from both AMCC and Western Digital when a drive shuts down is to power cycle the unit, which should cause the drive to reset – again, not ideal on a mission critical service, or for a RAID/enterprise class drive.

We followed the standing advice, power cycled the server and watched it come back up, at which stage it started to rebuild the array. The rebuild then failed owing to the drive with ECC errors. After consulting 3ware, this drive was shut down and the server rebooted once more. It was at this stage that the array came back as unusable.

We immediately contacted 3ware, initially in the Netherlands (European support), but at this stage we were unable to reach anyone, so we contacted the US instead. They provided a set of diagnostic and rebuild tools which should have rebuilt the array to a point where we could recover the data and then build a new array from new disks. However, this did not work.

By now it was getting on for 8pm on Thursday evening. Having located the latest versions of our own backup files that we could find, we decided to prepare to restore these backups onto vpsh1, vpsh2 and vpsh6, to spread the load as evenly as we could. Whilst one team prepared to restore from backup, another team continued to work on the failed array in the hope that something could be salvaged.

The next 8 hours were spent extracting data from tapes and preparing it for restoration in case the rebuild failed, whilst we continued trying to rebuild the array.

Approximately 2 hours were “lost” as engineers stopped to get some sleep and to wait for the results of the latest recovery process (each rebuild takes over an hour, followed by a further verify). By 6am it was clear that we were not going to be able to recover the data, and we started restoring the backups onto the live servers.

The restoration process took several hours, but customer sites started to appear “online” from approximately 10am, roughly 24 hours after they “vanished”. We are aware that some sites and email accounts were not restored until nearly 5pm; this was due to a number of problems with the restoration of a MySQL database on one of the shared hosting servers.

Whilst all of this was going on, we detected a similar failure mode on vpsh2 and had to immediately stop copying new sites onto this node and start moving the sites already on this server onto vpsh1 and vpsh6 – as well as continuing to restore the remainder of the sites to these nodes.

We have now rebuilt vpsh2 from scratch, and are working on rebuilding vpsh3 at this time. The replacement parts for vpsh4 and vpsh5 are still on back order although we hope to have them within the next 14 days. Additional storage units for our current backup servers are also still on back order, however we hope to have these within the next 7-10 days.

Questions

The one question we have been asked time and time again over the last 48 hours is why the backups were not more recent, or why there were no backups at all.

The fact is that backup is the responsibility of the customer. As well as our basic services, we do run a managed service at additional cost where we take responsibility for a customer’s backups. Those customers with this service were restored fairly promptly from a recent offsite backup.

Customers on the non-managed services were restored from our periodic “system wide” backups which are taken once a month and kept for three months.

Ideally, we would like to take more frequent backups, but the load this adds to the servers would create a cost increase which many of our customers would find unacceptable. Whilst we hope and trust that such a catastrophic data loss will never occur again, we would stress that customers should either ensure that they back up at their own premises or consider upgrading to a managed service.

Jon Morby
Managing Director
