Were you hoping to be a passenger on a British Airways flight over the weekend? So were tens of thousands of other travelers. The sheer logistical disruption (planes in the wrong place, luggage not where it is supposed to be) is taking several days to sort out. And that’s not to mention how long it might take to sort out the most pressing IT issues.
BA has said the problem was caused by a power surge that led to some networking hardware failing, and then the all-important messaging systems failing. But BA was already experiencing significant IT issues on Friday, with very long delays in issuing boarding passes and check-in emails. So there were major problems a day before the surge, which suggests that the surge was a symptom, not the cause, of the problem.
It’s all too easy to imagine someone deciding in desperation to TIO&TIOA (Turn It Off and Turn It On Again). Of course, this could literally have involved someone pulling a plug out of a wall socket, which could easily have led to a power surge and hardware failure, with the initial problem snowballing out of control.
Airlines are as much their IT systems as they are their aircraft. There are a lot of (potentially increasingly creaky) legacy systems to worry about. With such complex systems, there are a lot of possible points of failure from both the technology and business perspectives.
The tech behind British Airways
British Airways’ overall IT system (or rather, what is provided by IAG Global Business Services) is actually a collection of subsystems. BOAC had its own intercontinental messaging system before the official invention of email. So BOAC might have been messaging around the world in the 1960s, but it wasn’t sending anyone any emails. This means a lot of effort over the years has gone into plumbing these subsystems together.
That might sound like a problem with a fairly obvious solution in 2017: the RESTful API. But BA’s system dates from a time long before anyone had even dreamt of RESTful APIs.
In 2017, British Airways has to support legacy services that send emails; early-2000s services that use SOAP APIs; and modern ones that are RESTful. And it has to do all that with an ESB that is not suited to the tasks it undertakes. Given that a lot of the Wise Old Birds who managed to keep things going at BA for the past few decades have retired or been let go, this catastrophe might have been waiting to happen.
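To make the plumbing problem concrete, here is a minimal sketch, in Python, of the kind of adapter that ends up wrapping a legacy SOAP operation behind a modern REST-style call. The endpoint, namespace, and field names are hypothetical; we have no visibility into BA’s actual services. The point is that every shim like this is another moving part, and another potential point of failure.

```python
# Hypothetical REST-style wrapper around a legacy SOAP "flight status" operation.
# The URL, namespace, and field names are illustrative only.
import requests
import xml.etree.ElementTree as ET

LEGACY_SOAP_URL = "https://legacy.example.com/FlightStatusService"  # assumed endpoint

SOAP_TEMPLATE = """<?xml version="1.0" encoding="utf-8"?>
<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
  <soap:Body>
    <GetFlightStatus xmlns="http://example.com/flightstatus">
      <FlightNumber>{flight_number}</FlightNumber>
    </GetFlightStatus>
  </soap:Body>
</soap:Envelope>"""


def get_flight_status(flight_number: str) -> dict:
    """Expose the legacy SOAP operation as a simple JSON-friendly call."""
    envelope = SOAP_TEMPLATE.format(flight_number=flight_number)
    response = requests.post(
        LEGACY_SOAP_URL,
        data=envelope,
        headers={"Content-Type": "text/xml; charset=utf-8",
                 "SOAPAction": "GetFlightStatus"},
        timeout=10,
    )
    response.raise_for_status()

    # Pull the one field we care about out of the SOAP response body.
    root = ET.fromstring(response.text)
    ns = {"fs": "http://example.com/flightstatus"}
    status = root.find(".//fs:Status", ns)
    return {"flight": flight_number,
            "status": status.text if status is not None else "UNKNOWN"}
```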
Here at APIContext, from the reports we compile through our routine testing of public APIs, we can see the benefits of moving to a native multi-homed architecture. But actually migrating to such an architecture is radically non-trivial. It will be interesting to see if IAG chooses to go down that route.
The answer for British Airways?
A native multi-homed architecture uses dynamic load balancing to do away with the need for failover to multiple backend instances when integrating REST services. However, if you are using failover, it’s not ideal to rely on a JMS-based ESB rather than an ESB with native support for HTTP/S. Of course, you wouldn’t choose to start from here, but you are stuck with your legacy systems.
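As a rough illustration of the difference (not BA’s architecture, and with made-up instance URLs), this is what simple client-side failover across multiple backend instances of a REST service looks like when it has to live in application code. In a natively multi-homed setup with dynamic load balancing, this retry loop becomes the routing layer’s job instead.

```python
# Illustrative only: simple client-side failover across several REST backend
# instances. The instance URLs are hypothetical.
import requests

BACKEND_INSTANCES = [
    "https://api-eu-west.example.com",
    "https://api-eu-central.example.com",
    "https://api-us-east.example.com",
]


def get_with_failover(path: str, timeout: float = 3.0) -> requests.Response:
    """Try each backend instance in turn, returning the first healthy response."""
    last_error = None
    for base_url in BACKEND_INSTANCES:
        try:
            response = requests.get(base_url + path, timeout=timeout)
            if response.status_code < 500:
                return response  # got an answer; stop failing over
            last_error = RuntimeError(f"{base_url} returned {response.status_code}")
        except requests.RequestException as exc:
            last_error = exc  # network failure: fall through to the next instance
    raise RuntimeError(f"All backend instances failed: {last_error}")


# Usage (hypothetical path): booking = get_with_failover("/v1/bookings/ABC123").json()
```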
BA’s IT system has now crashed six times in a year, with this latest outage the most serious. It’s probably the case that it’s not the APIs themselves that are the problem, but the kludges needed to try to integrate the APIs with the legacy systems.
API monitoring is only one aspect of monitoring mission-critical, time-sensitive technical systems. It might well be the case that for British Airways, there was nothing wrong with their APIs. This time.
But performance was degrading before the power surge. API monitoring might have picked that up, and presumably other performance-monitoring subsystems did.
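To sketch the principle (this is not any particular monitoring product, and the endpoint and thresholds are assumptions), a monitor only needs to compare recent response times against a historical baseline to flag the kind of degradation that preceded this outage:

```python
# Sketch of latency-degradation detection for an API endpoint. The URL and
# thresholds are illustrative, not taken from any real monitoring setup.
import statistics
import time
import requests

ENDPOINT = "https://api.example.com/health"   # hypothetical endpoint
BASELINE_SAMPLES = 30                         # calls used to establish "normal"
DEGRADATION_FACTOR = 3.0                      # alert if latency exceeds 3x baseline


def measure_latency(url: str) -> float:
    """Return the response time of a single GET call, in seconds."""
    start = time.monotonic()
    requests.get(url, timeout=10)
    return time.monotonic() - start


def build_baseline(url: str) -> float:
    """Median latency over an initial set of calls."""
    return statistics.median(measure_latency(url) for _ in range(BASELINE_SAMPLES))


def check_for_degradation(url: str, baseline: float) -> None:
    """Flag a single call that is dramatically slower than the baseline."""
    latency = measure_latency(url)
    if latency > baseline * DEGRADATION_FACTOR:
        # In a real system this would page someone or open an incident.
        print(f"ALERT: {url} responded in {latency:.2f}s (baseline {baseline:.2f}s)")
```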
It makes sense to monitor your APIs explicitly, to get as much insight as possible into what is going on with this increasingly vital component of the business system. Next time, it might be the APIs themselves.