Disaster Recovery: Fear of Landing
In the early hours of Wednesday the 10th of March, a fire broke out at the OVH data centre in Strasbourg.
The fire started in a room in the building known as SBG2 just before 01:00 local time.
All services in all four buildings (SBG1, SBG2, SBG3 and SBG4) were halted while SBG2 blazed with an uncontrolled fire.
Forty-three fire trucks with over a hundred fire fighters were reported as being on site to fight the fire, using a pump boat on the Rhine to supply water. At 03:00, they isolated the site and closed the perimeter. By 04:00, SBG2 was completely destroyed and SBG1 was on fire, with SBG3 under threat.
By the time I woke up that morning, the fire fighters had gained control of the blaze. It took six hours to put out the fire.
#Strasbourg A computer server storage building belonging to @OVHcloud_FR ravaged by fire #DNAinfos https://t.co/EuELb9Nux3 pic.twitter.com/MKlXDInhWP
— Antoine Bonin (@abonin_DNA) March 10, 2021
For the first time in over twenty years, I had no personal presence on the Internet. All of my websites were gone.
Later that evening, OVH updated on Twitter.
Update 5:20pm. Everybody is safe.
Fire destroyed SBG2. A part of SBG1 is destroyed. Firefighters are protecting SBG3. no impact SBG4.
— Octave Klaba (@olesovhcom) March 10, 2021
Fear of Landing is (well, was) hosted in SBG3.
“We recommend to activate your Disaster Recovery Plan” are not words that anyone wants to hear. Especially as Fear of Landing’s disaster recovery plan was based on the off-site backup service supplied by OVH.
No back-ups have been forthcoming. Cliff, my partner and my site admin, had additional back-ups; however they were only of the data, so he spent the day rebuilding servers so he could roll out his back-ups and restore service.
He and two other admins, Mark and Rob, worked tirelessly to bring me back online. Yesterday, Cliff got up shortly after 6am, just missing Mark, who had gone to bed at 4am his time, just minutes earlier. Between the three of them, they restored the websites and the mail servers.
I’m still twitching a bit but mostly ok.
SBG2 was completely destroyed, with no data recovery (where are those back-ups?). Four halls (out of twelve) in SBG1 were destroyed and SBG3’s uninterruptible power source (UPS) was, well, interrupted. Based on reports, Fear of Landing’s server still exists but is not accessible. No word on the off-site back-ups.
The Register points out that three years ago the site suffered a power outage which highlighted design flaws in the location.
The fire comes three years after the group embarked on a €4m-€5m investment plan in the wake of a major outage that left three of the Strasbourg data centres – SBG1, SBG2 and SBG4 – without power for 3.5 hours in November 2017.
[OVH Founder and Chairman] Klaba himself said at the time of the 2017 outage that it was partly because “SBG’s power grid inherited all the design flaws that were the result of the small ambitions initially expected for that location.”
At the time of the 2017 outage, “SBG2’s power grid” was built atop “SBG1’s power grid instead of making them independent of each other”.
I wasn’t the only one affected, of course; OVH is the largest hosting provider in Europe and has many high level customers, including the French government’s vaccination data. Netcraft reported that the fire took out 3.6 million websites across 464,000 distinct domains.
Websites that went offline during the fire included online banks, webmail services, news sites, online shops selling PPE to protect against coronavirus, and several countries’ government websites.
Examples of the latter included websites used by the Polish Financial Ombudsman; the Ivorian DGE; the French Plate-forme des achats de l’Etat; the Welsh Government’s Export Hub; and the UK Government’s Vehicle Certification Agency website.
Closer to home, the European Space Agency and Strasbourg Airport also had servers on the site.
Rust, a popular video game, have announced to their users that all EU servers were lost and no data will be restored.
In the midst of the chaos, I was somewhat bitterly amused by another customer’s response to the situation:
— ACCEIS (@acceis) March 10, 2021
It’s a good thing Cliff didn’t wait for the service to be restored; he and the others worked straight through with the result that here we are, back to normal!
Meanwhile, OVH are rebuilding a new network room in SBG5, with working teams to clean up the site and restore electricity and network services. The latest announcement was that power will be restored to SBG3 on Friday the 19th.
Their website says that if this timeframe is not soon enough, they recommend deploying my infrastructure in another data centre.
There is more information in the OVH statement with a full status report but I can’t get over the idea that, if it were left up to me, Fear of Landing would not have existed for almost two weeks.
OVH Chairman Octave Klaba said in a video that the alarms went off at 00:47 but the first responders were not able to investigate because the smoke was too thick to safely remain in the data centre. Further, the fire department’s thermal camera images showed two UPS systems, UPS7 and UPS8, in the fire.
We had maintenance on UPS7 in the morning. The supplier came and changed a lot of pieces inside of UPS7 and restarted UPS7 in the afternoon. It seems it was working but then in the morning we had the fire.
The data centre cameras apparently have further video footage yet to be analysed and it is hoped that this will give further information about what happened.
There are a lot of questions, of course, starting with how a building of fire-resistant materials and in-built fire suppression capabilities could end up in an uncontrolled fire. However, as we know, battery fires can be very hard to extinguish so it seems possible that a faulty UPS in SBG2 was at the heart of the blaze.
On a personal note, the outage and rushed restoration may have some side effects, so if you have any issues interacting with the site or when emailing me, please forward any error messages or bounces to me so that I can troubleshoot the problem.
And thank you, again, to Cliff, Mark and Rob for their tireless efforts in restoring Fear of Landing (and everything else!) back onto the Internet.
This calamity, strangely enough, does not seem to have attracted widespread attention from the news media.
Actually, judging from the apparent total lack of news I wonder if reporting has not been suppressed.
I am very suspicious about the increasing power of big corporations that have started to take over our way of life. Even our lives!
In another, not related, article I read about the health insurance for GP visits in the UK being taken over by a large American provider. Admittedly, the author is very much biased against the encroachment of corporate entities who are interested only in profit, profit (in the view of this writer) for the sole benefit of their owners and multi-billionaire CEOs. Which, again citing this article, will put effective medical care beyond the reach of many ordinary citizens.
Not long ago Sylvia wrote about a famous (KLM) pilot, Ivan Smirnoff, who in the course and aftermath of WW2 found himself in the USA with his very ill wife. Although he had played a substantial role in assisting the American logistic efforts in the Pacific, the cost of his wife’s medical treatment in America virtually bankrupted him.
Today we are slowly being caught in the stranglehold of American commercial interests. We all (all? I am not) are using Fakebook, Twitter, we hail an Uber, we eat at McYuck or Kentucky, for a cup of coffee we go to Staryucks, we stay in an Airbnb, we buy via Amazon and our TV channels are increasingly controlled by Bezos too – whose MONTHLY income, incidentally, is more than the TOTAL wealth of the erstwhile richest man in the world, John D. Rockefeller.
What do I really want to say with this rant?
We are lured by empty promises by large corporations.
To maximise profits, even basic precautions are skipped: precautions that we were promised had been put in place, but those promises were not kept. No back-ups (only back-up for UPS?), no effective fire-proofing, no separate data storage. No separation of structures housing all that technology.
AND: apparent suppression of the news.
How long did it take for the Russians to admit that there had been a major disaster at Chernobyl?
A not entirely dissimilar denial happened in the USA when a prolonged heatwave caused drought and, combined with high winds, allowed an overloaded and poorly maintained power grid to spark it all into a gigantic firestorm.
What was the reaction of Trump? It was all the people’s own fault for not raking the forest floor.
In my own mind the author of this article, the one about alleged transfer of GP health care insurance in the UK, had exaggerated his arguments.
Now, reading Sylvia’s article, I am not so sure.
AAHH this is starting to become too political.
Even though I am not sure that I got my facts right here, let it stand and see what reactions it will provoke.
We are (well, I am !) looking forward to the next, aviation related, article from Sylvia.
And forget about the shady world of politics for a moment !
I’m not quite sure what connection you are making here but it certainly has been major news in the tech media. I’m not sure mainstream press has a lot of interest in the nuts-and-bolts of webservers.
Well Sylvia, I already admitted to a rant, more designed to provoke responses. But at least here in Ireland there has not been a word in the news about this fire. Which, judging from your article, was a major one and had potentially far-reaching consequences.
So, perhaps wrongly, my first reaction was a suppression of the news.
You, better than most, know how for instance potentially dangerous flaws in the design of certain airliners have been kept a secret for commercial reasons. The FAA seemed to have gone along with it, but that is a different story.
Still working on that story!
It will be interesting to know if this was one of the newer UPS units with lithium-based batteries. I’ve seen marketing selling their acid-free nature, meaning they are better able to be co-located with the actual servers. As this incident demonstrates, that is a bad bad idea, with power best off remote from your servers, where fire might be less of an issue.
Makes me think of the numerous problems with lithium batteries in planes, both cargo and fuselage…
That’s a good question; I wasn’t sure, from what’s been said so far, if the UPS had been replaced in 2017 when they said they needed to upgrade the power matrix on the site. But yes, the fierce blaze immediately made me think of the uncontrolled cargo hold blaze in UPS flight 6.
Another link is the “accident just after maintenance” thread which appears to be present…
Here’s a good article:
https://www.theregister.com/2021/03/12/ovh_restoration_roadmap/
Sorry, can’t edit comment, but the tagline is:
“OVH founder says UPS fixed up day before blaze is early suspect as source of data centre destruction”
Without knowing their exact layout, power draws, and the nature of the work being done, we’ll never know for sure what happened here. There’s definitely been a major breakdown of systems here, either accidentally, or maliciously. (If it did start in a single UPS, the fact the fire suppression system couldn’t put it out or at least hold it till first responders arrived is pretty damning.)
Incidentally, this is one of the reasons why I follow Fear of Landing – failures in many fields (like I.T.) aren’t examined to the depth and breadth that flight failures are. I’ve taken more than a few posts from here and shared them with my I.T. group with the justification that failures based on human errors of overconfidence, lack of oversight, bad communication, etc. are relevant everywhere complex systems are in play – except, in Aviation, there are additional systems to make sure lessons are learned and to highlight things that would get lost in other fields.
Never thought I’d be reading about server infrastructure failures here though. Kudos to the Fear of Landing I.T. Team, both for getting it done when it needed to be done and remembering that you can never have enough backups.
I did think “well, they did say they liked never knowing what the subject was going to be each week” while I was writing this.
Thank you for this!
Thank you for all that you do! I love your posts, and can’t wait to read them when they arrive. Please let us know if there’s anything we can do, as your readers, to support you!
Thank you Angie! Luckily Cliff accepts payment in home-cooked meals, so I’ve got this one covered. :)
You are just doing what you’ve always done: reporting on a major crash.
I’m not surprised there wasn’t widespread media reportage. Servers may be physical but their use is highly abstract and the impact of their loss is hard to explain; since AFAICT there were no deaths, reportage would have been back-burnered by all the other bad news. The first I heard of this was a BBC report that Russia was trying to blame a ~deliberate net outage on this fire.
COLINTD: I hadn’t realized new UPSs were using lithium instead of lead-acid. As an ex-chemist, I see no reason to believe lithium batteries are safer to co-locate; lithium catches fire when someone does little more than breathe on it, or even a battery made of it is bruised (cf cellphone fires). There’s also a limited supply, which should have been reserved for batteries that must be small/light (cellphones, drones, …); since (as you note) co-locating is a bad idea anyway, I hope they rethink both design and materials for future (and repaired) server systems.
Kudos to Sylvia’s IT team for getting the site back up quickly; I did just enough system support (after I stopped doing chemistry) to know it’s not an easy job.
It seems interesting that it is stated that it is a “fire after maintenance was just done”. These types of incidents show how reliant we all are on technology, electricity, computers, internet etc. When the electricity goes out at home, it seriously disrupts our day. I found it odd that in this article it states that the building was of fire-resistant materials and had in-built fire suppression capabilities (Sylvia mentioned this) so how did the fire get to that state? Am looking forward to more of Sylvia’s great articles on aspects of aviation.
OVH have now said that the virtual private servers (that we were using as a name server and part of a database cluster) are “non-recoverable”, despite being in SBG3 (the data centre that didn’t burn).
They have also said that our snapshot backups from SBG3 are irrecoverable and that the main network that the servers were on is irrecoverable. Some services may come back on the 22nd of March, about a fortnight after the fire. Thanks, we’ll live without them.
It’s just as well I trust nobody. Despite having “snapshot” backups with OVH, I had a separate, full backup of the main servers on AWS (Amazon) in Ireland. My main name server was mirrored in AWS in London so we did not lose name service at any time. All the software we use was backed up into git repositories.
The main issue was that the restore time was very slow; I had never expected to need to do a full restore. It took nearly 36 hours to recover 200GB of email and 140GB of web data, along with the databases and other files that support them. Still, nothing was lost and the world did not come to an end, as it did for so many companies.
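As a rough back-of-the-envelope check on those figures, here is a minimal Python sketch. It assumes decimal gigabytes and counts only the 340GB of email and web data (the databases and other files are not included, so the real rate was somewhat higher). The function names and the 10 MB/s comparison are hypothetical, chosen purely for illustration.

```python
def restore_rate_mb_per_s(total_gb: float, hours: float) -> float:
    """Average throughput in MB/s, using decimal units (1 GB = 1000 MB)."""
    return total_gb * 1000 / (hours * 3600)


def restore_hours(total_gb: float, rate_mb_per_s: float) -> float:
    """Hours needed to restore total_gb at a steady rate_mb_per_s."""
    return total_gb * 1000 / rate_mb_per_s / 3600


if __name__ == "__main__":
    total_gb = 200 + 140  # email + web data, figures from the comment above
    rate = restore_rate_mb_per_s(total_gb, 36)
    print(f"{total_gb} GB in 36 h is about {rate:.1f} MB/s")
    # Hypothetical comparison: the same restore at a steady 10 MB/s.
    print(f"At 10 MB/s: about {restore_hours(total_gb, 10):.1f} h")
```

That works out to roughly 2.6 MB/s on average, which is why it can be worth timing a trial restore before one is ever needed.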
Thank you to all for your kind wishes.
Wow. Methinks OVH’s offsite backups either weren’t offsite or weren’t actually backups. :/
To tell a story in Rudy’s fashion :) when I was just starting out in I.T. we had a major site outage because a sysadmin formatted the wrong hard drive. A different sysadmin insisted we were fine, that we had backups, until he checked and realized whoever set up the initial backup forgot to set the recursion switch for directory traversal and so nothing was actually backed up. The dev team I was on spent 36 hours straight rebuilding that site from the dev system copy and thank God the dev server was separate from the production server and we mirrored the prod data fairly regularly.
A lot of changes followed after that episode. A fine example of the Swiss cheese theory of failure. I am very curious how the holes lined up in OVH’s case.
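A missed recursion switch like the one in that story is the kind of gap a routine check can catch. Here is a minimal Python sketch, assuming the backup is a plain directory copy that mirrors the source layout. The function name and the demo files are hypothetical, and this is not what any of the sites mentioned here actually used.

```python
import tempfile
from pathlib import Path


def missing_from_backup(source: Path, backup: Path) -> list[str]:
    """Return relative paths of files under source with no counterpart under backup."""
    missing = []
    # rglob("*") walks the whole tree, so nested directories are included.
    for path in sorted(source.rglob("*")):
        if path.is_file():
            rel = path.relative_to(source)
            if not (backup / rel).is_file():
                missing.append(rel.as_posix())
    return missing


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        src, bak = Path(tmp) / "src", Path(tmp) / "bak"
        (src / "sub").mkdir(parents=True)
        bak.mkdir()
        (src / "top.txt").write_text("top level")
        (src / "sub" / "nested.txt").write_text("nested")
        # Simulate a non-recursive backup: only the top-level file is copied.
        (bak / "top.txt").write_text("top level")
        print(missing_from_backup(src, bak))
```

This only checks that each file exists in the backup, not that its contents match; comparing checksums or doing a trial restore would be the stronger test.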
I will be very interested to see the final report on this one. I do feel that, when I built and ran data centres for Redbus Interhouse, a disaster like this would have been the end of the company. It will be interesting to see if and how OVH survives.
In your case, did nobody comment that the dev system should not have had the live data on it? :)
Heh. It was non-sensitive stuff, no PII, user accounts or anything PCI sensitive – it was mostly news articles and styling info.
It was also 1998 and we were operating in start-up mode; even if we did have PII I’m sure the higher ups would have shrugged their shoulders and put it on the “fix later” list since developing against production data sped up our turn around time.
And that’s assuming they actually cared at all – things were pretty crazy in the middle of the first Internet gold rush. shrug
My “rant” was intended especially to try and focus on the current trend: Profit first and foremost.
The Dreamliner is a crass example. Aircraft delivered with tools left behind, even under the cockpit floor where it could have jammed the rudder pedals. Metal shavings, even in critical places, not cleaned prior to test flying or delivery. This could have led to shorts, and if sharp bits penetrated cables, to fires. Even a ladder was found left behind inside the fin of one!
Some airlines refused to accept aircraft from the Charleston plant, apparently.
According to my son (PhD computer science) not much data was lost in the OVH centre fire; those customers who were affected had apparently opted out of paying for back-ups.
I am most definitely not a computer expert, I am only worried about the profound changes in our society caused by excessive profit-making by the IT industry. Compounded by the covid-19 restrictions.
On the RTE (Irish state broadcaster) news this evening it was mentioned that Dublin City, until just a year ago a lively city centre that attracted not only tourists, but also Irish for shopping or a night out, is now in danger of becoming derelict.
Already the number of unprovoked attacks is rising, so is drugs abuse.
It was mentioned that many of the famous pubs will never reopen again. Shopping? That is being taken over by the likes of Jeff Bezos.
Well, as we are eagerly waiting for this week’s edition, we must give credit to Sylvia and her team to “keep the show on the road”.
Last week’s article only touched the world of aviation in the context of lithium (“li-ion”) batteries in aircraft and the danger of the runaway, uncontrolled fires that can result. Sylvia mentioned UPS flight 6.
The older generation of ni-cad (nickel cadmium) batteries had a similar danger, albeit much less frequent. But a friend who in those days was a captain on the Sud Aviation Caravelle*, the French jet airliner which was the first jet and the staple of the fleet of Dutch charter airline Transavia, told me that he was always a bit wary of the battery compartment right behind the cockpit.
And so Sylvia managed to turn a disaster into a triumph. Her quick reaction and investigative skills provided her readers with a topic that, judging from the number of reactions, was enjoyed by a great many.
Well done, and keep going please!
*The Caravelle was a pioneer of tail-mounted engines in jet airliners.
Its lineage was demonstrated by the adoption of the nose section of the DH Comet.
I feel really bad for all the Rust players as well. I’ve never played it, but I’ve heard that it’s a lot like Minecraft, in that you can build huge structures and cities and stuff. It must have really sucked for those players who built huge fortresses over the years, to hear that all their hard work is just gone! :(