Fido.Net Status Updates

RFO: Fido Apps 10th May 2014


As you may have noticed, we have been having some technical problems over the last few days with Fido Apps (which hosts redmail.com, fidonet.com, fido.net and some other large domains).

This started as a result of a hardware failure in a main power distribution unit (PDU) in our TeleHouse facility on Thursday night at approximately 23:20, which took out two switches amongst other equipment.

Initially we thought we had resolved the problems when we replaced the faulty PDU and brought everything back online by 4am Friday morning.

Sadly, it turned out that the file system corruption caused by the storage array being disconnected from the front-end mail stores during the power failure was much deeper than initially evident. Within 24 hours the system’s internal integrity monitors spotted inconsistencies and, as a precaution, switched the storage units associated with apps-store-1 and apps-store-2 into read-only mode.

Engineers were alerted at 05:30 and set about repairing the damaged file stores and the databases which hold metadata relating to emails. After more than six hours of repair and reindexing attempts, it became clear that this approach was simply not going to be practical.

Why wait six hours? In simple terms, each mail store is upwards of 14TB in size. Whilst we use state-of-the-art intelligent journalling file systems which repair quickly under normal circumstances, when the journals themselves become corrupted we have to revert to the old-fashioned way of repairing a disk file system using old-style tools, and these take time to rebuild such large file systems.

By now it was approaching 1pm; nearly eight hours had elapsed since the systems first became unresponsive, and it was time to take more drastic action.

We started building a brand new mail store and had to copy all the mail, attachments, databases and indexes from the old mail store file systems onto these new storage arrays. This took another 12 hours to complete.

Once the data had been migrated across, we performed further integrity checks and repaired the metadata tables which had problems, finally restoring service at approximately 01:30 on Sunday morning.

So, what are we doing to make sure this sort of thing doesn’t happen again?

One of the factors which exacerbated this outage for redmail.com customers was that our distribution algorithms were biased in part by domain name, so a larger concentration of redmail.com accounts sat on the affected mail stores (most were on apps-store-2). We are modifying the hashing algorithms to ensure a more even spread of accounts across all of our mail stores in the future, so that no single domain can be affected in such a way again should a store become corrupted or go offline.
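To illustrate the idea, here is a minimal sketch, assuming the new placement simply hashes the full mailbox address; the store names, hash choice and example accounts are hypothetical rather than our production code:

    import hashlib

    # Hypothetical store names, for illustration only.
    STORES = ["apps-store-1", "apps-store-2", "apps-store-3", "apps-store-4"]

    def store_for(mailbox: str) -> str:
        """Pick a mail store by hashing the full mailbox address.

        Hashing user@domain rather than just the domain means that a
        large domain such as redmail.com is spread across every store
        rather than being concentrated on one or two of them.
        """
        digest = hashlib.sha256(mailbox.lower().encode("utf-8")).digest()
        return STORES[int.from_bytes(digest[:8], "big") % len(STORES)]

    # Example: see which store a few redmail.com accounts would land on.
    for user in ("alice@redmail.com", "bob@redmail.com", "carol@redmail.com"):
        print(user, "->", store_for(user))

With many thousands of accounts, hashing in this style spreads any one domain roughly evenly across all stores, so losing a single store no longer takes out a disproportionate share of a single domain.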

We are also reducing the overall size of each store and adding more stores to compensate. This means that, should we have an issue like this again, repairing a 4TB store should take roughly a quarter of the time that a 16TB store would (theoretically at least).
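As a back-of-the-envelope sketch only (the per-terabyte rate below is a placeholder, and the assumption that repair time scales roughly linearly with store size is just that, an assumption):

    # Illustrative only: assumes repair/rebuild time scales roughly linearly with size.
    def estimated_repair_hours(store_tb: float, hours_per_tb: float) -> float:
        return store_tb * hours_per_tb

    RATE = 0.75  # hypothetical hours per TB, not a measured figure
    big, small = estimated_repair_hours(16, RATE), estimated_repair_hours(4, RATE)
    print(f"16TB: ~{big:.0f}h vs 4TB: ~{small:.0f}h ({small / big:.0%} of the time)")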

  • We already utilise ZFS and high-performance journalling file systems, which are “the last word” in storage.
  • We already utilise battery backup on our storage arrays, and we already have state-of-the-art data centres with N+1 power and both UPS and standby generators.

We are, however, reviewing the distribution of power connections from our internal PDUs so that, should a breaker trip like this again, the arrays (which have dual power supplies) are fed from different breakers. In this instance fsc-2 was set up correctly; however, the switch connecting the storage array to the mail store was fed from the PDU which failed. This broke the connection between the front end and the back end, which in turn led to the corruption in the file system.

Not all of our switches currently have dual/redundant power supplies, and we are looking at ways of resolving this so that, should a switch lose power on one PSU, the other can still carry the load.


We have already opened a new state-of-the-art data centre in London Docklands and are migrating more servers to this location, where we have more power and more space available.

The initial power failure was caused by a breaker failing when load was reduced (an 8 amp draw on a 16 amp breaker dropping to 5 amps). We are still not sure why or how this PDU tripped, and we are in discussions with Raritan (the PDU vendor) to understand the cause so that it does not happen again in the future.


Finally, I can only apologise to all customers affected by this prolonged outage to the Apps service. This is the first outage of the service in more than five years, and we are deeply sorry for the inconvenience it has caused.


Hopefully, as a result of the additional measures being taken to bolster the service, it will be even stronger and more robust in the future. This is certainly our plan!


Jon Morby
Managing Director
FidoNet
