Windows VPS Server reboot (vz-win-1)

April 16th, 2008

vz-win-1 is currently suffering from memory problems and requires a reboot. This appears to be a memory leak; we are investigating further, but need to reboot now to reclaim the memory.

There will be a brief outage of Windows VPSs hosted on this server whilst the server reboots.

The reboot is expected to commence at 10:45 BST and be completed by 11:00 BST today.

DSL Issues

April 10th, 2008

09:35

Our upstream provider (through BT wholesale) are currently experiencing technical problems with their DSL gateway, which is causing DSL authentication errors.

Engineers are working to resolve this issue as quickly as possible, and more information will be added to this post as soon as we have an update.

09:50

The problem has been resolved and connections are re-establishing. As soon as we have an update or receive a report explaining the cause of the outage, details will be added to this post.

vpsh4 unscheduled reboot

March 13th, 2008

vpsh4 has just suffered a kernel panic and is undergoing a reboot.

We are unsure at this time as to the cause of the panic and will be investigating further.

During the reboot any Virtual Containers housed on vpsh4 will be unavailable.

vz-win-1 unscheduled outage

February 25th, 2008

vz-win-1 started to develop problems just after 11am today owing to what appears to have been a DoS attack causing various Windows services to fail.

We are in the process of restoring this server; however, the initial reboot triggered a Windows Update cycle of patch / reboot / patch steps.

We hope to have this server restored by 1:30pm.

vz-win-1 houses 3 Plesk nodes (psaw-1, psaw-2 and psaw-2), and these nodes will be offline whilst this emergency maintenance completes.

We would like to apologise for any inconvenience caused by this outage and assure customers we are working to restore service as quickly as we can.

** UPDATE **

This server was restored to normal service shortly after 13:40. Apologies to any affected customers.

All Clear

February 22nd, 2008

Following our “at risk” notification earlier today we are pleased to
inform you that the ring is now complete once again.

The error was traced to an internal tail which has now been replaced
with new armoured fibre.

We will continue to monitor the network closely to ensure the new
link is stable.

At Risk Notification

February 22nd, 2008

At risk notification: fibre break - FidoNet ring not currently
redundant

Due to increased error counters on one of the ring links, this
interface has been taken out of service. Traffic is currently
being re-routed via an alternate path.

At this time therefore the fibre ring is ‘at risk’ to any
additional break.

We have an engineer en-route to site to replace the optic and also
test the fibre stretch affected.

We will update you as soon as there is more information to hand.

DSL Network Maintenance

February 9th, 2008

We will be performing maintenance on our DSL links between BT wholesale and the Fido backbone between 01:30 and 03:30 hours on 12th Feb 2008.

Customer connections will be disconnected briefly during this maintenance and may take 5 minutes to reconnect.

During this maintenance window we will be upgrading router firmware and applying security patches.

Plesk 8.3 Roll Out

January 24th, 2008

We are working overnight to roll out Plesk 8.3.0 across our Unix hosting platforms.

This latest update brings a number of performance enhancements and additional features, including multi-domain alias support on web/email domains and the latest Horde webmail client.

As we upgrade each server there may be a brief period where the server becomes unavailable whilst it reboots.

A full list of new features can be found here

Highlights include:

Better support for outsets

Tighter integration with HSPc

Ability to keep copies of forwarded emails

Quota integration with Horde

SRV support in the DNS manager

If you are a VPS customer and wish to have Plesk upgraded to version 8.3.0 please contact support and they will schedule a supervised upgrade.

Backup Service

January 23rd, 2008

One of the things to come out of the recent service issues was that very few customers are running backups of any kind.

We always strongly advise customers to back up their important data, both at home and on their servers. If the worst were to happen, you could find yourself without important documents, files and/or emails.

We include a free backup tool with our hosting packages which allows you to back up your web/email hosting accounts to a file which you can then download and keep safe at home.

In addition to this, we offer a service where, for £5 per GB per month, we will manage and oversee your backups and copy them to a remote storage server for safe keeping. This takes all the pain and heartache out of managing your own backups and lets you get on with your day-to-day work.

In addition to backing up your web site, we can also help by backing up your desktop at home. Using the same service, and a piece of software which costs just £30, you can add your home / desktop computer to the backup roster and be safe in the knowledge that your data is secure.

For more information contact our sales team - sales@fido.net or by phone on 08450 045 045

(prices are subject to VAT)
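As a rough guide to what the managed service costs, the arithmetic can be sketched as below. This is only an illustration: the £5 per GB per month figure is from this post, but the VAT rate of 17.5% and the choice to bill fractional GB pro rata are assumptions made for the example, so please confirm actual pricing with sales.

```python
from decimal import Decimal, ROUND_HALF_UP

PRICE_PER_GB = Decimal("5.00")  # from the post: GBP 5 per GB per month, before VAT
VAT_RATE = Decimal("0.175")     # assumption: the post does not state the VAT rate


def monthly_backup_cost(size_gb, vat_rate=VAT_RATE):
    """Return (net, gross) monthly cost in pounds for size_gb of managed backup."""
    pence = Decimal("0.01")
    # Assumption: fractional GB are charged pro rata rather than rounded up.
    net = (PRICE_PER_GB * Decimal(str(size_gb))).quantize(pence, rounding=ROUND_HALF_UP)
    gross = (net * (1 + vat_rate)).quantize(pence, rounding=ROUND_HALF_UP)
    return net, gross


net, gross = monthly_backup_cost(10)
print(f"10 GB: GBP {net} net, GBP {gross} including VAT, per month")
```

Using Decimal rather than floating point keeps the pence exact when the VAT is applied.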

Outage Report - 18.01.08

January 20th, 2008

It has been a very long 4 days, and I would like to express my sincerest apologies for the disruption in service experienced by those customers who were affected by the failure of the storage array on vpsh3 Thursday morning.

We have now just about given up trying to recover any data from the failed array. The drives have been taken away for further investigation, along with the controller - if at some point we are able to recover the missing files we will of course do so.

This report is extremely technical, and quite long. The abridged “non technical” version is that several components which combine to provide a “fault tolerant” disk drive (meaning it works around problems and repairs itself if it has a problem) all failed at the same time, something which statistically is almost impossible. Attempts to repair and recover the data have failed, and we have had to restore from the last backup we were able to retrieve from our offsite storage. Owing to storage space issues, more recent backups were still on the drive which failed, as we had not been able to copy them off to a remote server.

What are we doing to make sure this doesn’t happen again?

We have had orders in for additional servers and additional disks since before Christmas. They are now starting to arrive, albeit far too slowly for our liking, due to supply chain problems I believe. As these additional drives and servers arrive we are putting them together and working to ensure that we have more resiliency in place, more than sufficient storage, and better checks for potential failure modes to try and catch them before they happen.

We will also be working to promote more proactively the backup services which are available to all customers, to ensure customers know how to use them and that they download the backups rather than leaving them on the server, which is generally what seems to happen at the moment.

The knowledge base does cover this in some detail - search for “Plesk Backup” on our main home page (www.fido.net)

So, what happened?

Overnight / early Thursday morning it appears that 3 of the 7 drives in the RAID array shut down / failed. We are still unsure why, as the server was well within tolerances with regards to heat and service load, as was the rack in which it was housed in TeleHouse North.

In addition to these 3 drives shutting down, a 4th disk had developed ECC errors. Normally if this happens the RAID controller detects the errors, disables the drive and brings in one of the “hot spares” which are sitting there just for this purpose.

2 of the 3 drives which had shut down were the hot spares. We believe this to be a result of a bug in the firmware on the hard drive controller. Western Digital have supplied an update for these YS drives which we are rolling out as quickly as possible.

The failure on vpsh3 was detected at approximately 09:30. The file system had gone into a degraded, read-only state to protect the integrity of the data on the drives.

The standing advice from both AMCC and Western Digital when a drive shuts down is to power cycle the unit, which should cause the drive to reset. Again, this is not ideal on a mission critical service, or from a RAID/enterprise class drive.

We followed the standing advice, power cycled the server and watched it come back up, at which stage it started to rebuild the array. The rebuild then failed owing to the drive with ECC errors. After consulting 3ware, this drive was shut down and the server rebooted once more. It was at this stage that the array came back as unusable.
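For readers who want to see why losing hot spares and having read errors on a surviving drive is such a bad combination, here is a deliberately simplified model. It is a sketch only, not how the 3ware controller actually works: the single-parity (RAID 5) layout, the split of 5 members plus 2 hot spares, and the names ArrayState and can_auto_rebuild are all assumptions made for illustration, and the counts in the usage lines are illustrative too.

```python
from dataclasses import dataclass


@dataclass
class ArrayState:
    members: int          # drives actively holding data and parity
    hot_spares: int       # idle drives waiting to replace a failed member
    fault_tolerance: int  # member failures the RAID level survives (1 for RAID 5)


def can_auto_rebuild(state, failed_members, failed_spares, members_with_read_errors):
    """Return True if the array can rebuild itself back to full redundancy."""
    # More failed members than the RAID level tolerates: data is already lost.
    if failed_members > state.fault_tolerance:
        return False
    # A rebuild reads every surviving member. Once the tolerance is used up,
    # nothing is left to cover an unreadable sector on a survivor.
    if failed_members == state.fault_tolerance and members_with_read_errors > 0:
        return False
    # Each failed member needs a working spare to rebuild onto.
    return state.hot_spares - failed_spares >= failed_members


array = ArrayState(members=5, hot_spares=2, fault_tolerance=1)  # assumed layout

# A single failed member with healthy spares: the normal, recoverable case.
print(can_auto_rebuild(array, failed_members=1, failed_spares=0, members_with_read_errors=0))

# One member down, both spares down, and read errors on a survivor.
print(can_auto_rebuild(array, failed_members=1, failed_spares=2, members_with_read_errors=1))
```

In this model the first case rebuilds and the second does not, which is the general shape of the failure described above.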

We immediately contacted 3ware, initially in the Netherlands (European support), but were unable to reach anyone there, so we contacted the US instead. They provided a set of diagnostic and rebuild tools which should have rebuilt the array to a point where we could recover the data and then build a new array from new disks. However, this did not work.

By now it was getting on for 8pm Thursday evening. At this point, having found the latest versions of our own backup files that we could, we decided to prepare to restore these backups onto vpsh1, vpsh2 and vpsh6 to spread the load as evenly as we could. Whilst 1 team prepared to restore from backup, another team continued to work on the failed array in the hope that something could be salvaged.

The next 8 hours were spent extracting data from tapes and preparing it ready for restoration if the rebuild was a failure / trying to rebuild the array.

Approximately 2 hours were “lost” as engineers stopped to get some sleep - and wait to see the results of the latest recovery process (each rebuild takes over an hour to process, followed by a further verify). By 6am it was pretty obvious that we were not going to be able to recover the data, and we started restoring the backups onto the live servers.

The restoration process took several hours, but customer sites started to appear “online” from approximately 10am, approximately 24 hours after they “vanished”. We are aware that some sites / email accounts were not restored until nearly 5pm; this was due to a number of problems with the restoration of a MySQL database on one of the shared hosting servers.

Whilst all of this was going on, we detected a similar failure mode on vpsh2 and had to immediately stop copying new sites onto this node and start moving all the sites that were already on this server onto vpsh1 and vpsh6, as well as continuing to restore the remainder of the sites to these nodes.

We have now rebuilt vpsh2 from scratch, and are working on rebuilding vpsh3 at this time. The replacement parts for vpsh4 and vpsh5 are still on back order although we hope to have them within the next 14 days. Additional storage units for our current backup servers are also still on back order, however we hope to have these within the next 7-10 days.

Questions

The one question we have been asked time and time again over the last 48 hours is: why were the backups not more recent, or why were there no backups?

The fact is that backup is the responsibility of the customer. As well as our basic services, we do run a managed service at additional cost where we take responsibility for a customer’s backups. Those customers with this service were restored fairly promptly to a recent offsite backup.

Customers on the non-managed services were restored from our periodic “system wide” backups which are taken once a month and kept for three months.
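To make the monthly schedule concrete, the sketch below lists which “system wide” backups would still be held on a given date and how old the newest one would be. The once-a-month, kept-for-three-months policy is from this post; that the backup runs on the 1st of each month is purely an assumption for the example, and the function names are illustrative.

```python
from datetime import date


def monthly_backups_kept(today, months_kept=3):
    """Return the dates of monthly backups still retained on `today`, newest first.

    Assumption: one backup is taken on the 1st of every month.
    """
    kept = []
    year, month = today.year, today.month
    for _ in range(months_kept):
        kept.append(date(year, month, 1))
        month -= 1
        if month == 0:
            year, month = year - 1, 12
    return kept


def newest_backup_age_days(today, months_kept=3):
    """Return how many days of changes the newest retained backup is missing."""
    return (today - monthly_backups_kept(today, months_kept)[0]).days


for backup_date in monthly_backups_kept(date(2008, 1, 17)):
    print(backup_date.isoformat())
print(newest_backup_age_days(date(2008, 1, 31)), "days of changes at risk")
```

With a monthly schedule the newest backup can be missing up to a month of changes, which is why we recommend keeping your own downloaded copies or using the managed service.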

Ideally, we would like to take more frequent backups but the load this adds to the servers would create a cost increase which many of our customers would find unacceptable. Whilst we hope and trust that such a catastrophic data loss will never occur again, we would stress that customers should either ensure that they back up at their own premises or should consider upgrading to a managed service.

Jon Morby
Managing Director