Sunday, May 3, 2015

Counter Rollover Bites Boeing 787

Counter rollover is a classic mistake in computer software.  And, it just bit the Boeing 787.

The Problem:

The Boeing 787 aircraft's electrical power control units shut down if powered without interruption for 248 days (a bit over 8 months). In the likely case that all the control units were turned on at about the same time, that means they all shut down at the same time -- potentially in the middle of a flight. Fortunately, the power is usually not left on for 8 continuous months, so apparently this has not actually happened in flight.  But the problem was seen in a long-duration simulation and could happen in a real aircraft. (There are backup power supplies, but do you really want to be relying on them over the middle of an ocean?  I thought not.) The fix is turning off the power and turning it back on every 120 days.

That's right -- the FAA is telling the airlines they have to do a maintenance reboot of their planes every 120 days.

(Sources: NY Times; FAA)


Analysis:

Just for fun, let's do the math and figure out what's going on.
248 days * 24 hours/day * 60 minutes/hour * 60 seconds/minute = 21,427,200 seconds
Hmmm ... what if those systems keep time as a 32-bit signed integer in hundredths of a second? The maximum positive value for such a counter would give:
0x7FFFFFFF = 2147483647 hundredths of a second; 2147483647 / (24*60*60) ≈ 24855, and 24855 / 100 = 248.55 days.
Bingo!

If they had used a 32-bit unsigned integer, it would still overflow, just after twice as long: 497.1 days.
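
The arithmetic above is easy to check with a few lines of Python. This is only a sketch of my guess: the 100 ticks-per-second rate and the 32-bit counter width are assumptions inferred from the 248-day figure, not anything confirmed about the 787 software, and the names here are made up for illustration.

```python
import ctypes

TICKS_PER_SECOND = 100          # assumed: counter counts hundredths of a second
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

def days_until_rollover(max_count: int, ticks_per_second: int = TICKS_PER_SECOND) -> float:
    """Days of continuous uptime before a counter reaches max_count."""
    return max_count / ticks_per_second / SECONDS_PER_DAY

INT32_MAX = 2**31 - 1   # 0x7FFFFFFF, largest signed 32-bit value
UINT32_MAX = 2**32 - 1  # 0xFFFFFFFF, largest unsigned 32-bit value

signed_days = days_until_rollover(INT32_MAX)
unsigned_days = days_until_rollover(UINT32_MAX)

# One tick past the maximum, a 32-bit signed value wraps to the most negative value.
wrapped = ctypes.c_int32(INT32_MAX + 1).value

print(f"signed 32-bit:   {signed_days:.2f} days")
print(f"unsigned 32-bit: {unsigned_days:.2f} days")
print(f"INT32_MAX + 1 wraps to {wrapped}")
```

The wrap to a large negative number is what makes a signed rollover nasty: elapsed time suddenly appears to run backwards by about 497 days.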


Other Examples:

This is not the first time a counter rollover has caused a problem.  Some examples are:

  • IBM: Interface adapters hang after 497 days of uptime [IBM]
  • Windows 95: hang after 49.7 days without reboot, counting in milliseconds [Microsoft]  
  • Hong Kong rail service outage [Blog]
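
The uptime figures in that list fall out of the same kind of arithmetic. A quick sketch, assuming an unsigned 32-bit counter in each case: the millisecond rate for Windows 95 comes from the description above, while the hundredths-of-a-second rate behind the 497-day figure is my assumption, and `rollover_days` is just an illustrative helper.

```python
SECONDS_PER_DAY = 24 * 60 * 60

def rollover_days(bits: int, ticks_per_second: int) -> float:
    """Days until an unsigned counter of the given width wraps to zero."""
    return 2**bits / ticks_per_second / SECONDS_PER_DAY

win95_days = rollover_days(32, 1000)  # counting milliseconds
centi_days = rollover_days(32, 100)   # assumed: counting hundredths of a second

print(f"32-bit millisecond counter:  {win95_days:.1f} days")
print(f"32-bit centisecond counter:  {centi_days:.1f} days")
```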

There are also plenty of date roll-over bugs:
  • Y2K: on 1 January 2000 (overflow of 2-digit year from 99 to 00)   [Wikipedia]
  • GPS: 1024 week rollover on 22 August 1999 [USCG]
  • Year 2038: Unix time will roll over on 19 January 2038 [Wikipedia]
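
The dates in that list can be reproduced from the counter widths. A minimal sketch, assuming a signed 32-bit count of seconds since 1 January 1970 for Unix time and a 10-bit week number counted from 6 January 1980 for GPS; the GPS result is on the GPS time scale and ignores the leap-second offset from UTC.

```python
from datetime import datetime, timedelta, timezone

UNIX_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
GPS_EPOCH = datetime(1980, 1, 6)  # naive: GPS time scale, not UTC

# First second that no longer fits in a signed 32-bit count of seconds.
unix_rollover = UNIX_EPOCH + timedelta(seconds=2**31)

# A 10-bit week number wraps to zero after 1024 weeks.
gps_rollover = GPS_EPOCH + timedelta(weeks=2**10)

# Y2K: a 2-digit year wraps from 99 to 00.
next_two_digit_year = (99 + 1) % 100

print("Unix time rollover:", unix_rollover.isoformat())
print("GPS week rollover: ", gps_rollover.date().isoformat())
print("Year after 99:      %02d" % next_two_digit_year)
```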

There are also somewhat related capacity overflow issues such as 512K day for IPv4 routers.

If you want to dig further, there is a "zoo" of related problems on Wikipedia:  "Time formatting and storage bugs"


2 comments:

  1. The following was sent to me by an engineer who gave me permission to publish it provided I redacted it:

    "I just finished with [a subsystem design for a new aircraft]. So, this particular comment struck home. I’m using an unsigned 32 bit counter, in hundredths of a second, for timekeeping purposes.

    If they are anticipating keeping the electronics on for some extraordinary length of time, why isn’t it in the design specifications?

    That would also lead me into my biggest gripe about industry design practices. They don’t really have qualified “system engineers” guiding the teams technically.

    The current practice is analogous to this method of building a house:
    1. You back a truck full of lumber up to a lot, dump the load.
    2. Pass out blueprints to carpenters, plumbers, electricians, finishing carpenters and tell them to build it.

    There will be problems if housing construction is done this way. You would have 20 different interpretations of the blueprints instead of a single interpretation.

    You have the program manager, hardware lead, software lead. Who is guiding the team technically? Who serves as the single point of contact for technical questions? Typically, no one single person has “the vision”. The fact that these products work is a combination of dumb luck and iterative design changes.

    I mention this issue because I had the good fortune early in my career to work with an astoundingly good system engineer that led the teams technically. Things just work so much better when you have a single point of contact for interpretation of the specifications."

  2. I received the following question/request for clarification:

    248 days * 24 hours/day * 60 minutes/hour * 60 seconds/minute = 21,427,200,
    But I do not understand the bit determination of your term “0x7FFFFFFF”. Is that the hexadecimal id for 21,427,200 and is that number 2147483647?
    i.e., 0x7FFFFFFF = 2147483647 / (24*60*60) = 24855 / 100 = 248.55 days.

    Answer:
    To give a more complete line of reasoning:
    The 248 days reported is: 248 days * 24 hours/day * 60 minutes/hour * 60 seconds/minute = 21,427,200 seconds.
    That is equivalent to 2,142,720,000 hundredths of a second, which in hexadecimal is 0x7FB75000.
    The maximum positive signed 32-bit integer is 0x7FFFFFFF (2,147,483,647 in decimal), and 248 days looks uncomfortably close to that limit.
    In fact, 2147483647 / (24*60*60) ≈ 24855, and 24855 / 100 = 248.55 days.
    So, I'll bet that the crash really happens after 248.55 days, and is explained by a signed 32-bit integer counter rollover.


