One-in-15-Million Flight Plan Causes UK Air Traffic Chaos
On Monday, the 28th of August 2023, a “technical glitch” at National Air Traffic Services (NATS) in the UK caused extensive flight delays and cancellations. Monday was the Late Summer Bank Holiday, which takes place on the last Monday in August for a long weekend at the end of the summer season. As a result of the fault, the airspace around the United Kingdom remained open; however, Air Traffic Control could not automatically process flight plans. The increasing stress as they tried to restore the systems and deal with flight plans manually became apparent through the updates from NATS during the day (all times are UK local):
12:10 We are currently experiencing a technical issue and have applied traffic flow restrictions to maintain safety. Engineers are working to find and fix the fault. We apologise for any inconvenience this may cause. Please check with your airline on the status of your flight.
12:40 We are continuing to work hard to resolve the technical issue. To clarify, UK airspace is not closed, we have had to apply air traffic flow restrictions which ensures we can maintain safety.
14:20 This morning’s technical issue is affecting our ability to automatically process flight plans. Until our engineers have resolved this, flight plans are being input manually which means we cannot process them at the same volume, hence we have applied traffic flow restrictions. Our technical experts are looking at all possible solutions to rectify this as quickly as possible.
Our priority is ensuring every flight in the UK remains safe and doing everything we can to minimise the impact. Please contact your airline for information on how this may affect your flight. We are sincerely sorry for the disruption this is causing.
15:15 We have identified and remedied the technical issue affecting our flight planning system this morning. We are now working closely with airlines and airports to manage the flights affected as efficiently as possible. Our engineers will be carefully monitoring the system’s performance as we return to normal operations.
The flight planning issue affected the system’s ability to automatically process flight plans, meaning that flight plans had to be processed manually which cannot be done at the same volume, hence the requirement for traffic flow restrictions. Our priority is always to ensure that every flight in the UK remains safe and we are sincerely sorry for the disruption this is causing. Please contact your airline for information on how this may affect your flight.
Two days later, the technical issue which led to thousands of flight cancellations and delays across the UK was rather vaguely identified as a single flight plan.
Initial investigations into the problem show it relates to some of the flight data we received. Our systems, both primary and the back-ups, responded by suspending automatic processing to ensure that no incorrect safety-related information could be presented to an air traffic controller or impact the rest of the air traffic system. There are no indications that this was a cyber-attack.
The media response was a mix of disbelief and outrage. The Director General of the International Air Transport Association was scathing:
This incident is yet another example of why the passenger rights system isn’t fit for purpose. Airlines will bear significant sums in care and assistance charges, on top of the costs of disruption to crew and aircraft schedules. But it will cost NATS nothing. The UK’s policy makers should take note. The passenger rights system needs to be rebalanced to be fair for all with effective incentives. Until that happens, I fear we will see a continuing failure to improve the reliability, cost efficiency, and environmental performance of air traffic control.
But over on the Professional Pilots Rumour Network (PPRuNe), there were a few industry voices who pointed out that this was not as far-fetched as it might seem. The flight plan for the flight inbound to the UK cannot simply have been malformed: that would have been rejected by Eurocontrol, the Brussels-based intergovernmental organisation that coordinates flight plans and air traffic control above 24,500 feet over Belgium, Luxembourg, the Netherlands and north-west Germany. 80% of their traffic is climbing from or descending to major European airports, including London. Further, flight plans are run through simple filters set up on the UK systems, with a malformed flight plan marked as requiring manual entry. The fact that the flight plan made it past these initial checks and yet was able to disrupt the entire system pointed to a hitherto unknown bug in the complex software used in the UK.
For example, on the 12th of December 2014, a NATS system failure led to all departures being stopped from London Airports as well as flights departing European airports which planned to route through UK airspace. The software engineers may be interested in reading the full final report of the independent enquiry which explains the full context of this “latent software fault that was present from the 1990’s”.
The System Flight Server software determined the number of Controller and Supervisor roles, which were known as Atomic Functions, in order to distribute data to the relevant roles. The maximum number of civil and military Atomic Functions combined was designed to be 193, a number that everyone was aware was a limit. However, when generating a table of the current Controller and Supervisor roles, the check against the maximum size of the table (that is, the maximum number of Atomic Functions) tested against the civil-only limit, which was 151. As it happened, this didn’t much matter on a day-to-day basis, and there was nothing to alert anyone to the fact that this check was against the wrong maximum figure until the 12th of December.
The day before the outage, the system was changed to include further military Controller roles, which meant the (incorrect) limit of 151 was now being exceeded. Then, a controller set his workstation to the wrong mode. Usually, workstations are left “signed on” when unattended so that they are available to be used. However, this controller accidentally set his workstation into Watching Mode, an obsolete mode which staff were apparently regularly selecting by accident. Again, this on its own didn’t much matter. However, Watching Mode triggers the System Flight Server to generate a table of the Atomic Functions (all current Controller and Supervisor roles). The resulting table had 153 entries, which exceeded the permitted maximum table size of 151. This led to a series of system failures that culminated in a complete system collapse.
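The shape of that latent fault is easy to sketch in code. This is a hypothetical reconstruction, not the actual System Flight Server logic; the function names and exception are invented, and only the two limits (151 and 193) come from the report:

```python
# Hypothetical sketch of the 2014 latent fault: the table of Atomic
# Functions was checked against the civil-only limit (151) instead of
# the intended combined civil + military limit (193).

CIVIL_MAX = 151                # the limit the check actually used
CIVIL_AND_MILITARY_MAX = 193   # the limit the designers intended

def build_atomic_function_table(roles):
    """Build the table of current Controller/Supervisor roles."""
    table = []
    for role in roles:
        # BUG: tests against the civil-only limit, not the combined one.
        if len(table) >= CIVIL_MAX:
            raise SystemError("Atomic Function table overflow")
        table.append(role)
    return table

# Fine for years, while fewer than 151 roles were ever configured...
build_atomic_function_table([f"role-{i}" for i in range(150)])

# ...until extra military roles pushed the count to 153 and the
# (wrong) check fired, triggering the cascade of failures.
try:
    build_atomic_function_table([f"role-{i}" for i in range(153)])
except SystemError as e:
    print(e)
```

The check itself was doing its job; it was simply comparing against a constant that no longer matched the design intent, which is why nothing looked wrong until the day the count crossed 151.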
A PPRuNe poster identified as Engineer39 wrote about the practical limits of failure testing:
I was one of the people who signed off the upgrade in 2014 as being OK to implement. We were right as it was safe, just not resilient.
As the linked report here shows, it was all due to the 154th workstation being turned on and crashing the system. Of course some may say “How can NATS be so stupid as to not spot this and test for it?” Well, the test suite has around 90 workstations so there is no way you can turn on 154 stations to test the software past its 153 limit. And no chance of getting time in >100 ATCOs’ schedules to get them all to come in and exercise all the stations even if you had >153 to test. And you can’t test on the live system with that many stations, as of course it’s not possible to find space in the schedule to do this on a system that is live 24/7/365. So it instead relies on software engineers understanding code that was rewritten >10 years before to understand what the 153 number meant. Obviously in that case no one understood it, or if they did, thought it meant active stations and forgot about the ones in a half-on state. It’s not possible to retain all the knowledge from years ago unless no one resigns, retires or is made redundant. And you don’t outsource anything.
All in all I can’t see practically how that incident could have been avoided.
I have no knowledge on this new incident but suspect the causes are all rather similar and that practically it should be possible to eliminate this particular case from happening again, but you can’t say “We will never have a crash again”.
Upgrading it all to a brand new software may help long term but likely there will be more short term disruption due to new bugs introduced.
I think NATS compares very favourably in disruption compared with other ANSPs. But some (small) improvements will hopefully come out of all this.
Another poster, Murty, offered a more specific example of how a flight plan might cause a software disruption. In this case, it is the Aeronautical Fixed Telecommunication Network (AFTN) that causes the problem. The AFTN carries messages between fixed stations, including air navigation services, aviation service providers, airport authorities and government agencies. An AFTN message has three parts: a heading, the message text, and an ending.
The message text is in plain text and ends with an End-of-Message Signal, which is the four characters NNNN.
Murty says that as a result, the UK system trips over any call sign with NNN in it, which they have had to work around. He says that so far, only two of the three aircraft with such callsigns have actually tripped the system:
DC Aviation have a C56X registered DCNNN but it flies under the fixed callsign DCS705
JYNNN, a C172, was delivered to Bournemouth back in 2020 and tripped our system (which was rectified fairly quickly)
MNNNN, a Gulfstream 6: this was registered by a Russian on 16/7/2014 and became a regular visitor to the UK. I became involved in many e-mails with the operator, pointing out that their ID was going to be a problem at all airports, as the Flight Plan could not reach addresses, STOPPING after MNNNN. The owner reregistered the aircraft MNGNG
The fact our system stops after 3 N’s is strange
Other countries have quirks too: the FAA system does not like callsigns beginning with a number, as I was advised by a colleague in the FAA while carrying out an investigation. This is awkward for Barbados (8P-). It may have been noticed at airports with 8P-ASD, a Gulfstream 6, which will often file as “X8PASD”, as does the Malaysian Government 9MNAA A320, which files with a letter.
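Murty’s description suggests a parser that matches the end-of-message signal too loosely. Here is a hypothetical sketch (the functions and message contents are invented; only the NNNN signal and the three-N failure come from his account):

```python
# An AFTN message text ends with the four-character End-of-Message
# signal "NNNN". A scanner that stops at the first run of just three
# Ns will truncate any message containing a callsign like JYNNN.

END_OF_MESSAGE = "NNNN"

def extract_text_buggy(message):
    # BUG: stops at the first run of three Ns, not the full signal.
    idx = message.find("NNN")
    return message[:idx] if idx != -1 else message

def extract_text_fixed(message):
    # Only the four-character end-of-message signal terminates the text.
    idx = message.find(END_OF_MESSAGE)
    return message[:idx] if idx != -1 else message

bad = "FPL-JYNNN-VG ... EGHH" + END_OF_MESSAGE  # callsign with three Ns

print(extract_text_fixed(bad))   # keeps the callsign intact
print(extract_text_buggy(bad))   # truncates mid-callsign, at "FPL-JY"
```

Note that even a correct four-character match would still stop inside a callsign like MNNNN, which contains the full signal; presumably that is why the Gulfstream had to be reregistered rather than worked around.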
But my favourite example of all was an outage of ATC software in Los Angeles in 2014.
The En Route Automation Modernization (ERAM) system was designed by Lockheed Martin. Introduced in 2011, by 2018 the FAA stated that ERAM was managing over 3 million high-altitude en-route aircraft every month.
In April 2014, a flight plan crashed the system, causing Los Angeles Center to suspend all operations and clear the Center’s airspace, including a ground stop at Los Angeles International Airport. 365 flights were cancelled and over 400 delayed during the two-hour outage. At the time, ERAM did not have a dedicated backup system as the FAA believed that the dual channel design provided enough redundancy.
What happened? ERAM not only tracks the current flights but also looks ahead, searching projected course, speed and altitude for potential conflicts between aircraft.
A flight plan was submitted for a military aircraft carrying out a surveillance training mission.
But this wasn’t just any military aircraft. It was the famous high-altitude reconnaissance aircraft U-2, coincidentally also designed and produced by Lockheed some time earlier. The training mission had the aircraft entering and leaving the control zone repeatedly, with a flight plan that came close to the maximum data that ERAM can accept. However, the flight plan didn’t include the altitude. Apparently, a controller at a neighbouring system attempted to add in an altitude of 60,000 feet but made some sort of data entry error. Some sources say that ERAM believed that the U-2 was repeatedly flying through the control zone area at high speed at 7,000 feet. Another source says that ERAM treated the U-2 as flying every altitude between ground level and infinity.
Either way, ERAM attempted to consider all possible conflicts caused by the spy plane racing in and out of Los Angeles airspace. This quickly used all of the available memory. Finally, the flight data memory overloaded and both ERAM channels failed. In actuality, the U-2 was safely above the traffic at 60,000 feet.
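The arithmetic behind the overload is easy to sketch. This toy calculation (all numbers, names and the per-level traffic figure are invented) shows how losing the altitude multiplies the conflict-probe workload:

```python
# Toy model of a conflict probe: work scales with the number of route
# segments, the number of flight levels the aircraft might occupy,
# and the amount of surrounding traffic. Entirely illustrative.

def candidate_conflicts(route_segments, altitude_band, traffic_per_level=50):
    """Count conflict checks: one per segment, per flight level, per target."""
    low, high = altitude_band
    levels = range(low, high + 1, 1000)  # flight levels in 1,000 ft steps
    return len(route_segments) * len(levels) * traffic_per_level

segments = list(range(100))  # a long plan re-entering the zone repeatedly

# With the filed altitude of 60,000 ft, only one band is probed:
print(candidate_conflicts(segments, (60000, 60000)))   # 5000

# With the altitude lost and treated as "every level from the ground
# up", the work (and memory) multiplies by the number of levels:
print(candidate_conflicts(segments, (0, 60000)))       # 305000
```

With a real conflict probe holding state for each candidate pairing, that multiplier is the difference between a routine calculation and exhausting the flight data memory on both channels.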
Earlier this week, NATS released a Major Incident Preliminary Report which sheds some light on what happened. The report explains that aircraft flying through European countries submit flight plans, including aircraft type, speed and routing, to Eurocontrol. If Eurocontrol’s processing system accepts the flight plan, it is sent to all relevant Air Navigation Service Providers, including NATS in the UK.
Within NATS, the data goes to a system called Flight Plan Reception Suite Automated Replacement (FPRSA-R), which converts the data into a format compatible with the UK National Airspace System (NAS). Effectively, this modern system processes and sanitises the flight planning data before passing it on to more critical systems.
On 28 August, an airline submitted a flight plan for a flight departing at 04:00. NATS received this flight plan at 08:32, consistent with the 4-hour rule before the aircraft enters UK airspace. The flight plan was converted from ICAO4444 format to ADEXP, which includes additional waypoints. However, this specific flight plan had two waypoints with the same designator.
Now, I always thought that waypoint names (five-letter designators for specific geographic coordinates used for routing flights) were unique. However, as pilots with more international experience than I have will already know, they are not globally unique. As long as they are “geographically widely spaced,” this was not seen to be an issue. When entering a waypoint for flight planning, pilots simply check the distance to make sure they have selected the right version of the waypoint. It is extremely rare for a single route to include two waypoints that share the same name.
However, whether due to changes in aircraft flight data systems or the result of longer flights, this particular flight plan apparently (and correctly, it seems) had two waypoints with the same identifier. The two waypoints were about 4,000 nautical miles apart and both were outside of UK airspace. FPRSA-R attempted to extract the UK portion of the flight from the flight plan. First, it searches to find the entry waypoint and then it works backwards, parsing from the end of the flight plan, in order to determine the exit waypoint. If the exit waypoint isn’t listed, then the nearest waypoint beyond the UK is chosen as the exit point. However, the system was unable to make sense of two waypoints having the same name, presumably confused by finding an exit waypoint in a context that didn’t make geographic sense.
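As a rough illustration of the failure mode the report describes, here is a hypothetical sketch; the function, route names and exception are all invented, and the real FPRSA-R logic is certainly more involved:

```python
# Hypothetical sketch: the entry point is found by scanning forward,
# but a lookup for the exit designator lands on the FIRST of two
# identically named waypoints, which sits before the entry point.

def uk_portion(route, uk_entry, exit_designator):
    """Extract the UK portion of a route (illustrative only)."""
    entry = route.index(uk_entry)           # forward scan for the entry point
    exit_ = route.index(exit_designator)    # finds the first occurrence
    if exit_ <= entry:
        # The "exit" apparently precedes the entry: geographically nonsensical.
        raise RuntimeError("critical exception: exit precedes entry")
    return route[entry:exit_ + 1]

# Duplicate designator DVL: one occurrence thousands of miles before
# the UK portion, one after it, both outside UK airspace.
route = ["DVL", "AAA", "BBB", "UK1", "UK2", "CCC", "DVL", "DDD"]

try:
    uk_portion(route, "UK1", "DVL")
except RuntimeError as e:
    print(e)   # the system cannot determine a sensible UK portion
```

With unique designators the same lookup behaves perfectly well, which is why the code could process millions of plans before meeting a route where the duplicate mattered.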
When the software failed to logically determine the entry and exit points, it raised a critical exception. As a part of the fail-safe, the primary system went into maintenance mode rather than risk passing incorrect data to air traffic controllers. The back-up system took over the task and (you can probably see this coming) when it attempted to convert the data, it equally could not find the entry and exit points and so it also raised a critical exception and went into maintenance mode.
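The failover behaviour can be sketched the same way: because the backup runs the same software as the primary, the same deterministic input produces the same critical exception. This is an illustrative sketch, not the actual NATS architecture:

```python
# Common-mode failure: identical software on primary and backup means
# a "poison" input that crashes one will crash the other identically.

class FlightPlanProcessor:
    def __init__(self, name):
        self.name = name
        self.in_maintenance = False

    def process(self, plan):
        if plan.get("duplicate_waypoint"):     # the poison input
            self.in_maintenance = True         # fail safe: go offline
            raise RuntimeError(f"{self.name}: critical exception")
        return "processed"

primary = FlightPlanProcessor("primary")
backup = FlightPlanProcessor("backup")
poison = {"duplicate_waypoint": True}

for system in (primary, backup):               # failover hits the backup too
    try:
        system.process(poison)
    except RuntimeError as e:
        print(e)

print(primary.in_maintenance and backup.in_maintenance)  # True: dual loss
```

Redundant hardware protects against hardware faults; it offers no protection at all against a deterministic software fault fed the same data.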
The time elapsed between receiving the problematic flight plan and both systems shutting themselves down was less than twenty seconds.
After both systems shut themselves down, the 24/7 on-site support team attempted to diagnose the issue. They quickly escalated to their second-line support team of on-call experts. When both teams failed to pinpoint the root cause or restore the service, they called in the Technical Design team and the manufacturer of the software, which seems to be Frequentis Comsoft. The manufacturer was able to identify the problematic flight plan and devise a recovery plan. Once the system was reinstated, the next four hours of flight plans were processed in about nine minutes.
The NATS preliminary report states that if a flight plan with these characteristics had previously been filed, it would have caused the same issue. This system was updated in 2018 with new hardware and software and has processed over 15 million flight plans since then without a dual system loss; thus, the chief executive of NATS called it a one-in-15-million flight plan. The good news, said the chief executive, is that the system did what it was designed to do, i.e. fail safely when it receives data that it can’t process.
However, next week’s software changes will hopefully remove the need for a critical exception in similar circumstances. NATS have also promised further investigation to see if the circumstance could have been prevented during the software development cycle.
The Civil Aviation Authority has confirmed that the technical event has been understood and that, if it were to reoccur, it would be fixed quickly without affecting flights. They will oversee the NATS investigation and in addition, they will be initiating an independent review of the technical failure and the response.