Friday 18th – 06:00
We have worked relentlessly with the manufacturers to try to recover this data; however, after 20 hours we have sadly had to take the view that the data is irretrievable.
Whilst it is our policy that customers are responsible for their own backups, we do take periodic backups of data. However, as the volume of customer data has increased we have had to install larger and larger backup servers, and we were working on a new backup server over the Xmas holidays as the existing server had reached and exceeded its capacity. With this in mind, whilst we have a backup of most of the data, it is older than we would have liked because the new backup server is not yet on stream.
Having worked through the day and most of the night trying to recover the data, we have now taken the view that recovery is most likely not possible, and we are now in the process of restoring the data we do have saved on our internal backups.
Customers who have been keeping backups of their own sites should look to these and upload them once we have completed our restore process, to ensure the latest versions of their sites and content are online.
On behalf of FidoNet I can only apologise for the inconvenience this has caused, and assure you we are working to find ways of mitigating any future failures of this nature.
The failure
It is not yet clear why almost 50% of the drives in the RAID subsystem failed at the same time, although we now believe it to be an issue with a batch of disks supplied by Western Digital. There had previously been problems with a similar batch, which Western Digital attributed to a firmware fault and addressed with an updated firmware revision. This failure appears to be unrelated, however, as the drives failed at the hardware level.
FidoNet and 3ware engineers worked through the night to retrieve data and mappings from the remaining drives, as well as parity information from the failed drives, in an effort to rebuild the array sufficiently to allow us to copy the data off onto a new server. This has not been possible, as the file system corruption is too severe.
As a company we do everything we can to mitigate system failures, which is why we use RAID storage arrays. RAID (level 5) is designed to be fault tolerant and to handle occasional disk failures; however, multiple concurrent failures, while very rare, are difficult to overcome. When a disk fails, the parity data stored on the other disks in the array is used to rebuild the missing data onto a spare drive, and we always ensure we have “hot” spare disks configured in case of just such a failure. In this instance, however, when a disk failed the RAID controller started to rebuild onto the hot spare, and then a further two disks failed concurrently, which simply left the controller with insufficient data to effect a repair.
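For those interested in the underlying mechanics, the short sketch below (Python, purely illustrative and not any of our production tooling; the block values and disk names are made up) shows the single-parity arithmetic RAID 5 relies on, and why one missing disk per stripe can be rebuilt while two concurrent losses cannot.

from functools import reduce

def xor_blocks(*blocks: bytes) -> bytes:
    """XOR equal-length blocks together, byte by byte."""
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks)

# One stripe of a hypothetical four-disk RAID 5 set: three data blocks plus
# one parity block, where parity = d0 XOR d1 XOR d2.
d0, d1, d2 = b"AAAA", b"BBBB", b"CCCC"
parity = xor_blocks(d0, d1, d2)

# Single failure: the lost block is simply the XOR of every surviving block,
# so the controller can rebuild it onto a hot spare.
rebuilt_d1 = xor_blocks(d0, d2, parity)
assert rebuilt_d1 == d1

# Two concurrent failures in the same stripe (say d1 and d2): the single
# parity block gives one equation for two unknowns, so the missing data
# cannot be reconstructed - the situation the controller was left in here.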
Thursday 17th – 23:00
The RAID array on the server appears to be a total loss; however, engineers from 3ware (AMCC) and FidoNet have been working over the last several hours to try to recover information. We anticipate work will continue for several more hours before we either succeed or decide to abandon the recovery.
We plan to review the restoration process and its results at 06:00 at which stage we will make a further decision as to the viability of any data remaining on the storage array.
Server Failure
At approximately 09:45 this morning we detected a serious error with the disk subsystem on vpsh3 (THN). Engineers immediately responded in an attempt to prevent any loss of data or damage to files, and are currently working to repair and restore this server.
Owing to the nature of the failure it will be several hours before we are able to restore this server to full functionality, and we apologise for the inconvenience we know this prolonged outage will cause.
More information will be posted as we have it.
Current estimate is that service will be restored by 10pm Thursday.