Following Monday’s emergency maintenance and the application of emergency patches to our border routers to overcome a major DoS caused by a malformed optional transitive attribute in the AS4_PATH, we have continued to monitor our network and have noticed a number of new malformed packets attempting to cause a further denial of service.
Our engineers, whilst working on filtering these, diagnosed a further bug within our border routers. In the process of recording this bug to pass on to the developers, they caused a cascade failure which resulted in all 3 border routers failing at approximately 07:45 UTC.
Initially the routers were rebooted; however, they failed again very quickly. To minimise the number of flaps seen by our peers (flaps result in our networks being blocked for an extended period of time: the longer the flaps continue, the longer we remain “dampened”), it was decided that we would downgrade one of the routers and leave the other 2 in a disabled state.
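For readers unfamiliar with dampening, the sketch below illustrates why continued flapping lengthens the block. It is a simplified model in the spirit of BGP route flap damping, not a description of how any particular peer is configured: the penalty, threshold and half-life values are commonly seen defaults used here purely as assumptions, the function names are hypothetical, and the maximum suppress time that real implementations apply is omitted.

```python
import math

# Illustrative parameters only (assumed, not taken from any peer's config).
PENALTY_PER_FLAP = 1000.0   # penalty added each time the route flaps
SUPPRESS_LIMIT = 2000.0     # route is suppressed once the penalty reaches this
REUSE_LIMIT = 750.0         # route is re-advertised once the penalty decays below this
HALF_LIFE_MIN = 15.0        # penalty halves every 15 minutes


def decay(penalty, minutes):
    """Exponentially decay a penalty over the given number of minutes."""
    return penalty * 0.5 ** (minutes / HALF_LIFE_MIN)


def minutes_until_reuse(flap_times):
    """Minutes after the last flap until the route is usable again.

    flap_times is an ascending list of flap timestamps in minutes.
    Returns 0.0 if the route is not suppressed after the last flap.
    """
    penalty = 0.0
    last = None
    suppressed = False
    for t in flap_times:
        if last is not None:
            penalty = decay(penalty, t - last)
            if suppressed and penalty < REUSE_LIMIT:
                suppressed = False
        penalty += PENALTY_PER_FLAP
        if penalty >= SUPPRESS_LIMIT:
            suppressed = True
        last = t
    if not suppressed:
        return 0.0
    # Solve penalty * 0.5 ** (m / HALF_LIFE_MIN) = REUSE_LIMIT for m.
    return HALF_LIFE_MIN * math.log2(penalty / REUSE_LIMIT)


if __name__ == "__main__":
    for flaps in ([0], [0, 1, 2], [0, 1, 2, 3, 4]):
        print(len(flaps), "flaps ->", round(minutes_until_reuse(flaps), 1), "min suppressed")
```

Under these assumed values a single flap does not trigger suppression, while each additional flap in quick succession pushes the penalty higher and so extends the time before the route is accepted again.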
This resulted in further instability between 07:45 and 08:10 whilst the routers were rebooted, firmware downgrades were applied and service was restored, albeit on an earlier release.
The new patch was installed on the first border router at 08:45. At this stage we tested the new patch and found it to be stable. We then set about installing the patch on our 2 other routers, which were still disabled to prevent any further flaps.
At this stage, owing to an error by one of FidoNet’s (senior) staff, the primary (and at that stage only) router was rebooted, causing a total outage for a period of approximately 15-20 minutes.
The secondary and tertiary routers were each upgraded in turn and brought back into service at 09:15 and 10:30 respectively.
Finally, at 12:00 we considered the network to be stable and the emergency maintenance to be completed, at which stage we changed the network status from AT RISK to ALL CLEAR.
During this work we believe we have identified and now fixed the bug which had caused a “black hole” on Monday, and again on Saturday, resulting in portions of the network and certain customers being unavailable even though 2 of the 3 border routers were still live and functioning.
On behalf of FidoNet I would like to apologise for the inconvenience this outage has caused. I am confident we have now resolved the issues of this week. In the meantime, we are investigating alternative hardware and technology to replace or augment our existing routers with a multi-vendor setup, so that a similar event should not be possible again.
Jon Morby
Managing Director