Fido.Net Status Updates

Service Issues (12th April 2017)

We are aware of a number of services which are operating slowly or failing to respond. Engineers are investigating and will post updates in due course.

Affected sites include our support / live chat service; we are aware of this.

Engineers have identified the problem, which relates to one of our centralised storage devices. Repair work is underway and we hope to have an update after 16:00 BST today.

We are implementing workarounds where possible; however, a number of services will remain in a degraded state until the repair work is completed.

16:25 – Repairs continue; however, they have slowed considerably and our previous estimate of 16:00 was obviously incorrect. We are continuing to work on resolving the issue and restoring access to the data store as soon as possible.

18:30 – Repairs continue; however, we have been able to restore a number of affected services. They are likely to be much slower than usual while the rebuild continues. We hope to have full service restored later today or overnight.

Please accept our apologies for this disruption and any inconvenience caused.

RFO (Reason For Outage) – posted at 12:50 on 12th April 2017

So whilst the service has not yet been restored, and won’t be for another 3-4 hours, we have identified the cause of the problem.

After previous issues, we have a strict policy of scripting all work and going through a signoff process where work is checked and double-checked before approval is given to run the various commands.

Engineers had spotted a bottleneck on the storage cluster and had proposed a way to improve its performance. These modifications were being scripted and verified when they were inadvertently executed (loaded to be run at 10 pm but instead executed at 10 am).
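
This is exactly the kind of premature execution that a simple pre-flight check can catch. The sketch below is purely illustrative rather than a description of our actual tooling; the window times and script structure are assumptions. It simply refuses to run scripted work outside an approved late-night maintenance window.

    from datetime import datetime, time

    # Illustrative guard only: refuse to run scripted maintenance outside an
    # assumed 22:00-02:00 window, so work loaded for 10 pm cannot be started
    # by accident at 10 am.
    WINDOW_START = time(22, 0)
    WINDOW_END = time(2, 0)

    def within_maintenance_window(now=None):
        current = (now or datetime.now()).time()
        # The window wraps past midnight, so either side of it counts.
        return current >= WINDOW_START or current <= WINDOW_END

    if __name__ == "__main__":
        if not within_maintenance_window():
            raise SystemExit("Refusing to run: outside the maintenance window")
        # ... the scripted storage changes would follow here ...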

This has resulted in a re-balancing of the storage cluster, which, coupled with the normal daytime load and the previously identified bottleneck, means that a change that should normally run quietly in the background has consumed most of the foreground resources as well as the background resources.
Once initiated, we can’t stop the process, as that would leave the array in an unusable state. We just need to wait for it to complete.

As of 12:50 today (roughly 3 hours in) we are at 98% repair on one storage pool and 49% on the second. This makes me believe that we should start to see normal service restored at around 4 pm today (based purely on the elapsed time so far).
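
For the curious, that 4 pm figure is nothing more sophisticated than a linear extrapolation from the elapsed time: a pool that is 49% rebuilt after roughly three hours should take roughly another three hours to finish. A rough sketch of the arithmetic is below; the start time and pool percentages come from the update above, while the constant rebuild rate is an assumption on my part.

    from datetime import datetime

    def estimate_completion(started_at, now, fraction_done):
        """Linear extrapolation: assumes the rebalance runs at a constant rate."""
        elapsed = now - started_at
        return started_at + elapsed / fraction_done

    started = datetime(2017, 4, 12, 10, 0)   # rebalance kicked off at about 10 am
    update = datetime(2017, 4, 12, 12, 50)   # time of this RFO
    for pool, done in [("pool 1", 0.98), ("pool 2", 0.49)]:
        eta = estimate_completion(started, update, done)
        print(f"{pool}: {done:.0%} complete, estimated finish ~{eta:%H:%M}")

Run as-is, this prints an estimated finish of around 12:53 for the first pool and 15:46 for the second, which squares with the “around 4 pm” estimate.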

There is little we can do to speed this up, and in the meantime a number of virtual machines which mount the storage cluster for key data are struggling and dropping connections. Once the rebalance comes to an end, we anticipate the servers should “unblock” and pick up from where they were previously.

Had this command been executed during our normal weekly maintenance slot at 10 pm, we would have anticipated a small degradation in service, but nothing of the level we have seen today. I’m told that once the rebalance has completed, we should see a major improvement in performance and all should be good going forwards. I remain hopeful but sceptical at the same time.

I can only apologise for the disruption to service caused by this premature start to routine maintenance and promise that we will do better in the future.

Jon Morby
Managing Director
