Monday, December 26, 2016

Embedded System Security Pitfalls Tutorial (Preview)

Here's a summary video on Security Pitfalls.  (Hint: security via obscurity doesn't make you secure!)

Security Pitfalls Preview [ECR] 

Other pointers on this topic (my blog posts unless otherwise noted):

For more about Edge Case Research and how to subscribe to our video training channel, please see this Blog posting.

Monday, December 19, 2016

A Driver Test For Self-Driving Cars Isn't Enough

I recently read yet another argument that a driving road test should be enough to certify an autonomous vehicle as safe for driving. In general, the idea was that if it's good enough to put a 16 year old on the road, it should be good enough for a self-driving vehicle.  I see this idea enough that it's worth explaining why it's a really bad one.

(Image credit: CC SA 3.0)

Even if we were to assume that a self-driving vehicle is no different than a person (which is clearly NOT true), applying the driving test is only half the driver license formula. The other half is the part about being 16 years old. If a 12 year old is proficient at operating a vehicle, we still don't issue a drivers license. In addition to technical skills and book knowledge, we as a society have imposed a maturity requirement in most states of "being 16." It is typical that you don't get an unrestricted license until you're perhaps 18. And even then you're probably not a great driver at any age until you get some experience. But, we won't even let you on the road under adult supervision at 12!
The maturity requirement is essential.  As a driver we're expected to have the maturity to recognize when something isn't right, to avoid dangerous situations, to bring the vehicle to a safe state when something has gone wrong, to avoid operating when the vehicle system (vehicle + driver) is impaired, to improvise when something weird happens, and to compensate for other drivers who are having a bad day (or are simply suffering from only being 16). Autonomous driving systems might be able to do some of that, or even most of it in the near term. (I would argue that they are especially bad at self-detecting when they don't know what's going on.) But the point is a normal driving test doesn't come close to demonstrating "maturity" if we could even define that term in a rigorous, testable way. It's not supposed to -- that's why licenses require both testing and "being 16."
To be sure, human age is not a perfect correlation to maturity. But as a society we've come to the situation in which this system is working well enough that we're not changing it except for some tweaks every few years for very young and very old drivers who have historically higher mishap rates. But the big point is if a 12 year old demonstrates they are a whiz at vehicle operation and traffic rules, they still don't get a license.  In fact, they don't even get permission to operate on a public road with adult supervision (i.e., no learners permit at 12 in any US state that I know of.)  So why does it make sense to use a human driving test analogy to give a driver license, or even a learner permit, to an autonomous vehicle that was designed in the last few months?  Where's the maturity?
Autonomy advocates argue that encapsulating skill and fleet-wide learning from diverse situations could help cut down the per-driver learning curve. And it could reduce the frequency of poor choices such as impaired driving and distracted driving. If properly implemented, this could all work well and could improve driving safety -- especially for drivers who are prone to impaired and distracted driving. But while it's plausible to argue that autonomous vehicles won't make stupid choices about driving impaired, that is not at all the same thing as saying that they will be mature drivers who can handle unexpected situations and in general display good judgment comparable to a typical, non-impaired human driver. Much of safe driving is not about technical skill, but rather amounts to driving judgment maturity. In other words, saying that autonomous vehicles won't make stupid mistakes does not automatically make them better than human drivers.  
I'd want to at least see an argument of technological maturity as a gate before even getting to a driving skills test. In other words, I want an argument that the car is the equivalent of "being 16" before we even issue the learner permit, let alone the driver license. Suggesting that a driving test is all it takes to put a vehicle on the road means: building a mind-boggling complex software system with technology we really don't understand how to validate, doing an abbreviated acceptance test of basic skills (a few minutes on the road), and then deciding it's fine to put in charge of several thousand pounds of metal, glass, and a human cargo as it hurtles down the highway. (Not to mention the innocent bystanders it puts at risk.)
This is a bad idea for any software.  We know from history that a functional acceptance test doesn't prove something is safe (at most it can prove it is unsafe if a mishap occurs during the test).  Not crashing during a driver exam is to be sure an impressive technical achievement, but on its own it's not the same as being safe!  Simple acceptance testing as the gatekeeper for autonomous vehicles is an even worse idea. For other types of software we have found in practice that you can't understand software quality for life-critical systems without also considering the rigor of the engineering process. Think of good engineering process as a proxy for "being 16 years old." It's the same for self-driving cars. 
(BTW, "life-critical" doesn't mean perfect. It means designed with sufficient engineering rigor to be suitable for its intended criticality. See ISO 26262 or your favorite software safety standard, which is currently NOT required by the government for autonomous vehicles, but should be.)
It should be noted that some think that a few hundred million miles of testing can be a substitute for documenting engineering rigor. That's a different discussion. What this essay is about is saying that a road test -- even an hour long grueling road test -- does not fulfill the operational requirement of "being 16" for issuing a driving license under our current system. I'd prefer a different, more engineering-based method of certifying self-driving vehicles for use on public roads. But if you really want there to be a driver test, please tell me how you plan to argue the part about demonstrating the vehicle, the design team, or some aspect of the development project has the equivalent 16-year-old maturity. If you can't, you're just waving your hands about vehicle safety.

Sunday, December 11, 2016

Better Reboot Your Boeing 787 Every Three Weeks

Crashing after a prolonged up time due to a counter rollover or other problem is a classic mistake in computer software.  And, it just bit the Boeing 787. Again.

The Problem:

The Boeing 787 aircraft has three Flight Control Modules (FCMs) that are the subject of a new FAA Airworthiness Directive.  Based on that sentence alone, you want to make sure whatever that involves gets fixed before you fly on a 787!

The FAA says there is "a report" that all three FCMs can fail at the same time 22 days after they have been rebooted.  If you don't reboot the FCMs the FAA says this "could result in flight control surfaces not moving in response to flight crew inputs for a short time and consequent temporary loss of controllability."  This is FAA-speak for "the airplane could crash." They're telling airlines to reboot the plane every 21 days to avoid this. Hope nobody forgets to do that! (In fairness, I understand it is likely that most planes get rebooted more often than this anyway. But this is not something you want to leave to chance.)

At this point we can only guess at the cause, but the usual suspect is a timer overflow problem. Let's hypothesize that a 32-bit signed integer is counting the passage of time in milliseconds.  So a value of 32700 in that counter is 32.700 seconds.

How long until it overflows 31 bits of counting into the 32nd bit, which is the sign bit?

0x7FFFFFFF = 2147483647 ==> 2147483.647 seconds
2147483.647 seconds * (1 min/60 sec) (1 hr/60 min)(1 day/24 hr) = 24.9 days
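
Here's that same arithmetic as a minimal C sketch. The signed 32-bit millisecond uptime counter is a hypothesis for illustration, not a known fact about the FCM design.

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    /* Hypothetical: a signed 32-bit count of milliseconds since power-up. */
    const int32_t max_count = INT32_MAX;     /* 0x7FFFFFFF = 2147483647 ms */
    double seconds = max_count / 1000.0;     /* 2147483.647 seconds        */
    double days = seconds / 60.0 / 60.0 / 24.0;

    printf("Hypothetical counter overflows after %.1f days\n", days);  /* ~24.9 */

    /* One more tick past INT32_MAX is signed overflow: undefined behavior
       in C. On typical hardware it wraps to a large negative value, which
       breaks any elapsed-time comparison that assumes the counter only
       counts up.                                                          */
    return 0;
}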

Hmm, a bit longer than the 22 days the FAA reports.  Some time spent playing with various multipliers didn't seem to give a likely candidate.  Possible factors if it is a timer rollover would include fixed point math (e.g., time keeping in 256ths of a second) or scaling from a 400 Hz aircraft AC frequency. Or there could be some divided-down crystal oscillator frequency on the FCM that is involved.

Or, it could be something completely different.  Maybe there is memory that records operating parameters periodically, and the system crashes when that memory fills up (for example, logs that get downloaded every maintenance interval, with an expectation that the maintenance interval is more like a few days than a few weeks).

For now the cause is a bit of a mystery to us.  I'll bet the FCM engineers have a pretty good idea at this point. No doubt they'll issue a fix as fast as they can get the FAA to review it.

But the big news is that for the second time, the FAA is telling the airlines they have to do a maintenance reboot of their planes.  Last time it was every 248 days. This time it's every 21 days.

It's bad enough that they have to reboot the infotainment systems once in a while.  For flight controls, this is not good news. This is the kind of problem that should be caught in design reviews.  Always think about what happens if any counter, timer, or data structure overflows.

Other Examples:

This is not the first time long-running software has caused problems beyond the usual memory leaks in everyday applications.  Some examples:

Timer rollover bugs:
  • Another B787 reboot every 22 days [Seattle Times 2020]
  • A350 needs to be rebooted every 149 hours, likely due to a timer overflow bug [Register][EASA]
  • B787 needs to be rebooted every 248 days due to a likely timer overflow bug [Blog][NY Times] [FAA]
  • Air Traffic control loses contact with 400 aircraft due to a 32-bit time rollover in 2004 [IEEE Spectrum]
  • IBM: Interface adapters hang after 497 days of uptime [IBM]
  • Windows 95: hang after 49.7 days without reboot, counting in milliseconds [Microsoft]  (I met the engineers who found that one. And congratulated them on the significant feat of actually getting Windows 95 to run that long without crashing for some other reason!)
  • SSD drives fail after 40,000 or 32,768 hours of operation [BleepingComputer]
  • AMD EPYC Rome CPU chips crash after 1044 days of uptime [Tom's Hardware]
There are also plenty of date roll-over bugs:
  • NASA Deep Impact comet mission terminated unexpectedly when its clock reached 2**32 tenths of a second after Jan 1, 2000 (a time rollover bug). [IEEE Software]
  • Y2K: on 1 January 2000 (overflow of 2-digit year from 99 to 00)   [Wikipedia]
  • GPS: 1024 week rollover on 22 August 1999 [USCG]
  • Year 2038: Unix time will roll over on 19 January 2038 [Wikipedia]
There are also somewhat related capacity overflow issues.
And floating-point roundoff issues (thanks to Dan for reminding me of this one):

  • Patriot Missile mishap after operating for 100 hours without a maintenance reboot [GAO]

If you want to dig further, there is a "zoo" of related problems on Wikipedia:  "Time formatting and storage bugs"

Monday, December 5, 2016

Understandable Code (Coding Style for Humans) (Preview)

Here's a summary video on Human-Understandable Code (Code Readability) which is half of the topic of coding style.



Other pointers on this topic (my blog posts unless otherwise noted):

For more about Edge Case Research and how to subscribe to our video training channel, please see this Blog posting.

Monday, November 14, 2016

Spaghetti Code and Complexity Tutorial

Here's a preview video on Spaghetti Code and Cyclomatic Complexity.  There is also a full version of this video available for free from the Edge Case Research video library (see below for details).

Notes:

Spaghetti Code Preview [ECR].


Full tutorial video: https://vimeo.com/185732981

For more about Edge Case Research and how to subscribe to our video training channel, please see this Blog posting.


Monday, November 7, 2016

Embedded System Software Quality: Why is it so often terrible? What can we do about it?

I had a great time meeting old friends and new folks at ISSRE 2016.

Here are slides from my keynote address:

Embedded System Software Quality:
Why is it so often terrible? What can we do about it?

Failures of embedded system software increasingly make the news. Everyday products we rely upon are suffering from safety issues, security issues, and just plain bugs. While perfection is unrealistic, surely we can improve this situation.  Two key ideas apply: (1) embedded products often aren’t created by computer specialists, and (2) teaching application domain specialists just how to code is more of a problem than a solution.


Monday, October 31, 2016

Security Plan Tutorial

Here's a preview video on creating a Security Plan.  There is also a full version of this video available for free from the Edge Case Research video library (see below for details).

You might also find these pointers useful:

Embedded System Security Plan Preview [ECR].


Full tutorial video: https://vimeo.com/185255673

For more about Edge Case Research and how to subscribe to our video training channel, please see this Blog posting.




Tuesday, October 25, 2016

Peer Review Tutorial

Here's a preview video on Peer Review (about 1 minute of content, plus leader/trailer stuff).  There is also a full version of this video available for free from the Edge Case Research video library (see below for details).

Notes:
  • Peer review finds half the defects for 10% of the project cost. (Blog)
  • But they only work if you actually do the reviews! (Blog)
  • If you're finding more than 50% of your defects in test instead of peer review, then your peer reviews are broken. It's as simple as that. (Blog)
  • Here's a peer review checklist to get you started. (Blog) You should customize it to your needs.


Full tutorial video: https://vimeo.com/181433327

For more about Edge Case Research and how to subscribe to our video training channel, please see this Blog posting.






Monday, October 24, 2016

Training Video Series

Summary: My startup company is launching a series of training videos.  Here's an overview.

-------------------------
(This blog posting is an ad for my startup company ... but it also has links to free training videos!)

Edge Case Research:

I have a startup company that has built a brisk business in design reviews, robustness testing, and software safety.  We're building on the 20 years and 200 design reviews I've done working with industry partners on a variety of embedded systems. We're also building on my long collaboration with my co-founder Mike Wagner and others at the National Robotics Engineering Center doing autonomy validation and safety.

We do a lot of work with autonomous systems (robots, self-driving cars), but also with consumer electronics and industrial controls. In general the idea is that we have a dedicated team of embedded system experts who can help companies with both the technical aspects of their product and their software development processes. Our strengths include design reviews, robustness testing, and software safety.
If we can help you with your embedded product just let us know!

Free Training Videos:

One of the things we have found is that there are a number of common areas in which our customers (and probably many others) could use some help updating technical and process skills.  Thus, we're launching a video training series to help with the topics we most commonly see as we review projects.

To access the full videos for free, visit our video library page here.  If this is successful we'll add more videos over time, possibly requiring registration and/or a subscription for the full library.  But the ones there now are free, and I expect that these particular videos will stay free indefinitely.  Please respect our copyright and substantial investment in creating them by pointing people there to view them instead of trying to download the full length videos.

We also have a YouTube channel with just the previews that you can subscribe to if you want notification of new videos as they come out.

Hope you enjoy these!


Friday, October 14, 2016

Interviews on Self-Driving Vehicle Safety

With all the recent activity on self-driving vehicle safety, I've done a few interviews.  If you are looking for various ways of explaining this complex topic, here are some pointers you might find useful.  The unifying theme is that they all are based at least in part on an interview I did with the journalist.


Sunday, October 2, 2016

Response to DoT Policy on Highly Automated Vehicles

I've prepared a draft response to DoT/NHTSA on their proposed policy for highly automated vehicle safety.

EE Times article that summarizes my response

The topics I cover are:
1. Requiring a safety argument that deals with the challenges of validating machine learning
2. Requiring transparent independence in safety assessment
3. Triggering safety reassessment based on safety integrity, rather than “significant” functionality
4. Requiring assessment of changes that can compromise the triggering of fall-back strategies
5. Characterizing what “reasonable” might mean regarding anticipation of exceptional scenarios
6. Assuring the integrity of data that is likely to be used for crash investigations
7. Diagnostics that encompass non-collision failures of components and end-of-life reliability loss
8. More uniform codification of traffic rule exceptions
9. Ensuring the safety of driver-takeover strategies for SAE Level 2 systems

Document pointers:
Comments either to this blog or to my university e-mail are welcome.  
         koopman - at - cmu.edu
I might not have time to respond in detail to all comments, but I appreciate anything that will help improve this response.

Saturday, October 1, 2016

Op-Ed About Autonomous Vehicle Safety

Yesterday the local newspaper published my op-ed piece on the DoT's draft policy for autonomous vehicle safety.  In the interest of starting a more general public discussion about this important topic, it is reproduced below.  I am working a more comprehensive technical response, and I'll post that here soon.

--------------------------------------------------------------------------

Source:
http://www.post-gazette.com/opinion/Op-Ed/2016/09/30/Safe-self-driving-It-s-not-here-yet/stories/201609300137

Safe self-driving? It’s not here yet
A lot of testing remains before self-driving cars should be widely deployed
Pittsburgh Post-Gazette
September 30, 2016 12:00 AM

By Philip Koopman and Michael Wagner

President Barack Obama this month announced a Department of Transportation draft policy on automated-vehicle safety. Hard-working staff at the Department of Transportation, the National Highway Traffic Safety Administration and the Volpe National Transportation Systems Center did a great job in creating it.

But, while self-driving cars promise improved safety, some critical issues must be addressed to ensure they are at least as safe as human-driven cars and, hopefully, even safer.

Machine learning is the secret sauce for the recent success of self-driving cars and many other amazing computer capabilities. With machine learning, the computer isn’t programmed by humans to follow a fixed recipe as it was in traditional systems. Rather, the computer in effect teaches itself how to distinguish objects such as a bush from a person about to cross a street by learning from example pictures that label bushes and people.

Traditional software safety evaluations check the software’s recipe against its actual behavior. But with machine learning, there is no recipe — just a huge bunch of examples. So it is difficult to be sure how the car will behave when it sees something that differs even slightly from an example.

To take a simple example, perhaps a car will stop for pedestrians in testing because it has learned that heads are round, but it has trouble with unusual hat shapes. There are so many combinations of potential situations that it is nearly impossible to simply test the car until you are sure it is safe.

Statistically speaking, even 100 million miles of driving is not enough to show that these cars are even as safe as an average human driver. The Department of Transportation should address this fundamental problem head on and require rigorous evidence that machine-learning results are sufficiently safe and robust.

A key element of ensuring safety in machinery, except for cars, is independence between the designers and the folks who ensure safety. In airplanes, trains, medical devices and even UL-listed appliances, ensuring safety requires an independent examination of software design. This helps keep things safe even when management feels pressure to ship a product before it’s really ready.

There are private-sector companies that do independent safety certification for many different products, including cars. However, it seems that most car companies don’t use these services. Rather than simply trust car companies to do the right thing, the Department of Transportation should follow the lead of other industries and require independent safety assessments.

Wirelessly transmitted software updates can keep car software current. But will your car still be safe after a software patch? You’ve probably experienced yourself cell phone or personal computer updates that introduce new problems.

The Department of Transportation’s proposed policy requires a safety re-assessment only when a “significant” change is made to a vehicle’s software. What if a car company numbers a major new software update 8.6.1 instead of 9.0 just to avoid a safety audit? This is a huge loophole.

The policy should be changed to require a safety certification for every software change that potentially affects safety. This sounds like a lot of work, but designers have been dealing with this for decades by partitioning vehicle software across multiple independent computers — which guarantees that an update to a non-critical computer won’t affect unchanged critical computers. This is why non-critical updates don’t have to be recertified.

Changing an icon on the radio computer’s display? No problem. Making a “minor” change to the machine-learning data for the pedestrian sensor computer? Sorry, but we want an independent safety check on that before crossing the street in a winter hat.

Philip Koopman is an associate professor in the Department of Electrical and Computer Engineering at Carnegie Mellon University. Michael Wagner is a senior commercialization specialist at CMU’s National Robotics Engineering Center. They are co-founders of Edge Case Research, which specializes in autonomous vehicle safety assurance and embedded software quality. The views expressed here are personal and not necessarily those of CMU.

--------------------------------------------------------------------------
Notes: It is important to emphasize that the context for this piece is mainstream deployment of autonomous vehicles to the general public. It is not intended to address limited deployment with trained safety observer drivers. There are more issues that matter, but this discussion was framed to fit the target length and appeal to a broad audience.

Tuesday, September 13, 2016

How Safe Is The Tesla Autopilot?


It was only a matter of time until there was a fatality involving a "self-driving" car, because no car is likely to be 100% safe.  And it happened in June this year.  This week, Tesla announced a software update involving a number of strategy changes.  https://www.tesla.com/blog/upgrading-autopilot-seeing-world-radar

In this post I'm going to try to take a look at what safety claims can, and cannot, be made for the Tesla Autopilot based on publicly available information.  It's sufficient just to look at what Tesla itself says and assume it is 100% accurate to get a broad idea of where they are, and where they aren't.  Short version: they have a really long way to go to demonstrate they are at least as good as a normal vehicle.

The Math:

First, let's take Tesla's blog posting about the tragic death that occurred in June.
   https://www.tesla.com/blog/tragic-loss
Tesla claims 130 million miles of Autopilot driving, and a US fatality every 94 million miles.

One might be tempted to draw an incorrect conclusion from Tesla's statement that Autopilot is safer than manual driving, because 130 million is larger than 94 million. However, that is far too simplistic an approach.  Rather, the jury is still very much out on whether Tesla Autopilot will prove safer than everyday vehicles.

From a purely statistical approach, the question is: how many miles does Tesla have to drive to demonstrate they are at least as good as a human (a 94 million mile mean time to fatality)?

This tool:  http://reliabilityanalyticstoolkit.appspot.com/mtbf_test_calculator tells you how long you need to test to get a 94M mile Mean Time Between Failure (MTBF) with 95% confidence.  Assuming that a failure is a fatal mishap in this case, you need to test 282M miles with no fatalities to be 95% sure that the mean time is at least 94 million miles.  Yes, it's hard to get that lucky.  But if you do have a mishap, you have to do more testing to distinguish between 130 million miles having been lucky, or just reflecting normal statistical fluctuations in having achieved a target average mishap rate.  If you only have one mishap, you need a total of 446M miles to reach 95% confidence.  If another mishap occurs, that extends to 592M miles for a failure budget of two mishaps, and so on.  It would be no surprise if Tesla needs about 1-2 billion miles of accumulated experience for the statistical fluctuations to settle out, assuming the system actually is as good as a human driver.

Looking at it another way, given the current data (1 mishap in 130 million miles), this tool: http://reliabilityanalyticstoolkit.appspot.com/field_mtbf_calculator tells us that Tesla has only demonstrated an MTBF of 27.4M miles or better at 95% confidence at this point, which is less than a third of the way to break-even with a human.
(Please note, I did NOT say that they are only a third as safe as a human.  What I am saying is that the data available only supports a claim of about 29.1% as good.  Beyond that, the jury is still out.)
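
If you'd like to check those numbers without the web tools, here's a small C sketch that reproduces the same standard chi-square based calculation. The 95th-percentile chi-square values are ordinary table entries; the mileage figures are the ones quoted above.

#include <stdio.h>

int main(void)
{
    /* 95th-percentile chi-square critical values for 2*(failures+1)
       degrees of freedom: the standard values for 0, 1, and 2 failures. */
    const double chi2_95[] = { 5.991, 9.488, 12.592 };

    const double target_mtbf = 94.0e6;   /* miles per fatality, human baseline */
    const double field_miles = 130.0e6;  /* Autopilot miles claimed by Tesla   */

    /* Test exposure needed to demonstrate the target MTBF at 95% confidence. */
    for (int failures = 0; failures <= 2; failures++) {
        double miles_needed = target_mtbf * chi2_95[failures] / 2.0;
        printf("%d mishap(s): about %.0fM miles of testing needed\n",
               failures, miles_needed / 1.0e6);        /* 282M, 446M, 592M */
    }

    /* One-sided 95% lower bound on demonstrated MTBF: 1 mishap in 130M miles. */
    double mtbf_lower = 2.0 * field_miles / chi2_95[1];
    printf("Demonstrated MTBF of at least %.1fM miles, about %.0f%% of the human baseline\n",
           mtbf_lower / 1.0e6, 100.0 * mtbf_lower / target_mtbf);  /* 27.4M, 29% */
    return 0;
}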

In other words, they need a whole lot more testing to make strong claims about safety.  This is because it is really difficult (usually impractical) to test safety into a system.  You usually need to do something more, which is one of the reasons why software safety standards such as ISO 26262 exist.

Threats To Validity:

With any analysis like this, it's important to be clear about the assumptions and possible flaws in analysis, of which there are often many.  Here are some that come to mind.

- The calculations above assume that it is more or less the same software for all 130 million miles.  If Tesla makes a "major" change to software, pretty much all bets are off as to whether the new software will be better or worse in practice unless a safety critical assurance methodology (e.g., ISO 26262) has been used.  (In fact, one can argue that *any* change invalidates previous test data, but that is a fine point beyond what I want to cover here.) Tesla says they're making a dramatic switch to radar as a primary sensor with this version.  That sounds like it could be a major change.  It would be no surprise if this software version resets the clock to zero miles of experience in terms of software field reliability both for better and for worse.

- The Tesla is most definitely not a fully autonomous vehicle.  Even Tesla makes it quite clear in their public announcements that constant driver attention and supervision is required.  So the Tesla experience isn't really about fully self-driving cars per se. Rather, it is about the safety of a human+car partnership (Level 3 autonomy in NHTSA-speak). That includes both the pros (humans can react to weird things the software might get wrong), and the cons (humans easily "drop out" of systems that don't require their attention).  If Tesla moves to Level 4 autonomy later, at best they're going to have to make a very nuanced argument to take credit for Level 3 autonomy experience.

- The calculations assume random independent failures, which in general is probably not a good assumption for software.  Whether it is a valid assumption for this scenario is an interesting question.

(Calculation note:  I just plugged miles instead of hours into the MTBF calculations, because the units don't really matter as long as you are consistent.  If you want to translate to hours and back, 30 mph is not a bad approximation for vehicle speed accounting for city and highway driving.  If you'd like 90% confidence or 99% confidence feel free to plug the numbers into the tools for yourself.)

Thursday, September 1, 2016

Toyota Unintended Acceleration Case Study Talk Update

My Toyota UA talk from a couple years ago has gotten a lot of views, but watching the video was a bit of a pain because of the proprietary streaming format it was in.

I was finally able to convert the files to .mp4.  So now it's a lot more accessible, especially from mobile devices:
Please see my original posting with more details about various other ways to view and download the materials.

Thursday, August 25, 2016

Boehm's Top 10 Software Defect Reduction list

A while back, Barry Boehm & Vic Basili wrote a nice summary of best ways to get better quality software. Their advice still applies to embedded systems today. Below is their list (in bold) with my commentary (the parts not in bold).

1. Finding and fixing a software problem after delivery is often 100 times more expensive than finding and fixing it during the requirements and design phase

Bug fix cost gets worse as your software gets closer to deployment, because you have to not only spend a lot of time tracking down the source of the bug, but also retest the system after the fix.  It is common to find situations in which a bug "escape" into field units costs dramatically more than 100x.  Consider having to recall a fleet of cars to install a bug fix, or do maintenance visits to thousands of sites to manually install a fix.  (Over-the-air fixes have their own problems, but that's a topic for another time.)

2. Current software projects spend about 40 to 50 percent of their effort on avoidable rework.

As the joke goes, the first 90% of the project is spent on design, and the second 90% on debugging.  The cheapest way to debug is to avoid putting the bugs into the software in the first place.  The next best way is to find them at the point of introduction (e.g., peer review of design before code is written) rather than at system test.

3. About 80 percent of avoidable rework comes from 20 percent of the defects

If you have a "bug farm" that's often not because the code is bad, but rather because the underlying requirements and design are bad.  If one module has a lot of bugs you should rewrite the module rather than keep patching it.  However, during the rewrite you might well discover that the problem isn't really the code, but rather a bad design or unclear requirements. Writing new code for a bad design ultimately won't solve the problem.

4. About 80 percent of the defects come from 20 percent of the modules, and about half the modules are defect free. 

In addition to comments for #3 above, modules with high cyclomatic complexity tend to be difficult to test and tend to be more bug-prone.  Keeping a limit on complexity can help with this problem.

5. About 90 percent of the downtime comes from, at most, 10 percent of the defects.

It makes sense to weight testing on the areas that are the highest risk.  For desktop software this often corresponds to the common use cases.  For embedded systems and other mission-critical systems this also means testing failure detection and recovery for high-cost failures.

6. Peer reviews catch 60 percent of the defects.

Yes, really.  Peer reviews catch the majority of defects.  Why aren't you doing them?  (If you are doing them, are they catching at least half the defects?)


7. Perspective-based reviews catch 35 percent more defects than nondirected reviews.

When you do reviews, give each reviewer a checklist with a different set of areas to think about or different role to play.  For example, control flow, data flow, exception handling, testability, coding style.

8. Disciplined personal practices can reduce defect introduction rates by up to 75 percent.

As much fun as it is to be a coding cowboy, on average even the best programmer will introduce fewer bugs by following a methodical engineering practice rather than slinging code. As mentioned above, the cheapest bugs to fix are the ones that never made it into the code.  Beyond this, there are practices such as PSP and TSP that are shown to dramatically improve quality without really changing productivity.

9. All other things being equal, it costs 50 percent more per source instruction to develop high-dependability software products than to develop low-dependability software products. However, the investment is more than worth it if the project involves significant operations and maintenance costs.

In other words, if a product recall puts your company out of business, it's worth investing in good software quality up front.  In my experience if you are already shipping a mission-critical product you're already spending that 50 percent more (and then some), but still shipping defects.  This isn't saying spend even more.  Rather, by doing peer reviews and some other basic software quality practices you can improve quality at the same cost you're already paying.  Testing software into submission is simply not the way to go.

10. About 40 to 50 percent of user programs contain nontrivial defects.

If you have programmable features, your customers will have bugs in what they do.  And don't forget that your financial management spreadsheets are user programs (i.e., can, and often do have bugs).

Items #1 - #10 from:
  Boehm & Basili, Software Defect Reduction Top 10 List, IEEE Computer, Jan. 2001, pp. 135-137.

You can read the original three-page article here:
  https://www.cs.umd.edu/projects/SoftEng/ESEG/papers/82.78.pdf

Monday, May 30, 2016

Top 5 Embedded Software Problem Areas

The biggest problems I see in industry code reviews are code complexity, real time performance, code quality, weak development process, and dependability gaps. Here's an index into blog postings and other sources that explains the problems and how to deal with them.

-----------------------------

Several times a year I fly or drive (or webex) to visit an embedded system design team and give them some feedback on their embedded software. I've done this perhaps 175 times so far (and counting). Every project is different and I ask different questions every time. But the following are the top five areas I've found that need attention in the past few years.  "(Blog)" pointers will send you to my previous blog postings on these topics.

(Don't miss our growing video tutorial library that covers some of these issues.)

How does your project look through the lens of these questions?

(1) How is your code complexity?
  • Is all your code in a single .c file (single huge main.c)?
    • If so, you should break it up into more bite-sized .c and .h files (Blog)
  • Do you have subroutines more than a printed page or so long (more than 50-100 lines of code)?
    • If so, you should improve modularity. (Blog)
  • Do you have "if" statements nested more than 2 or 3 deep?
    • If so, in embedded systems much of the time you should be using a state machine design approach instead of a flow chart (or no chart) design approach. (Book Chapter 13). Or perhaps you need to untangle your exception handling. (Blog)
    • If you have very high cyclomatic complexity you're pretty much guaranteed to have bugs that you won't find in unit test nor peer review. (Blog)
  • Did you follow an appropriate style guideline and use a static analysis tool?
    • Your code should compile with zero warnings for an appropriate warning set. Consider using the MISRA C rule set (Blog) and a good static analysis tool. (Blog)
  • Do you limit variable scope aggressively, or is your code full of global variables?
    • You should have essentially zero global variables. (Blog)  
    • It's not that hard to get rid of globals if you look at things the right way. (Blog) One common pattern is sketched just after this list.
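
As a small illustration of getting rid of a global, here is a minimal sketch of one common pattern: make the variable static to a single module and expose it only through access functions. The names are made up for illustration.

/* Before: a global visible everywhere, so any module can corrupt it:
 *     int g_error_count;
 * After: the variable is private to one .c file, and every access point
 * is explicit, searchable, and easy to review.                          */

#include <stdint.h>

static uint32_t error_count;    /* file scope only; no longer a true global */

void error_count_increment(void)
{
    if (error_count < UINT32_MAX) { error_count++; }   /* avoid rollover */
}

uint32_t error_count_get(void)
{
    return error_count;
}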

(2) How do you know your real time code will meet its deadlines?
  • Did you set up your watchdog timer properly to detect timing problems and task death?
    • The watchdog has to detect the death or hang of each and every task in the system to provide a reasonable level of protection. (Blog) One way to structure this is sketched just after this list.
    • How long to set the watchdog is a little trickier than you might think. (Blog)
  • Do you know the worst case execution time and deadline for all your time-sensitive tasks?
    • Just because the system works sometimes in testing doesn't mean it will work all the time in the field, whether the failure is due to a timing fault or other problems. (Blog)
  • Did you do scheduling math for your system, such as main loop scheduling?
    • You need to actually do the scheduling analysis. (Blog ; Blog)
    • Less than 100% CPU usage does not mean you'll meet deadlines unless you can verify you meet some special conditions, and probably you don't meet those conditions if you didn't know what they were. (Blog)
  • Did you consider worst case blocking time (interrupts disabled and/or longest non-context-switched task situation)?
    • If you have one long-running task that ties up the CPU only once per day, then you'll miss deadlines when it runs once per day.  But perhaps you get lucky on timing most days and don't notice this in testing. (Blog)
  • Did you follow good practices for interrupts?
    • Interrupts should be short -- really short. (Blog) So short you aren't tempted to re-enable interrupts in them. (Blog)
    • There are rules for good interrupts -- follow them! (Blog)
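
To make the watchdog point above concrete, here is a minimal sketch of one common structure: each task sets an aliveness flag, and a low-priority service routine kicks the hardware watchdog only when every task has checked in since the last kick. The task count, flag layout, and the kick_hardware_watchdog() call are illustrative assumptions, not any particular product's API.

#include <stdbool.h>
#include <stdint.h>

#define NUM_TASKS  3u    /* illustrative: e.g., sensor, control, and comms tasks */

static volatile bool task_alive[NUM_TASKS];

/* Each task calls this once per execution period. */
void task_checkin(uint32_t task_id)
{
    if (task_id < NUM_TASKS) { task_alive[task_id] = true; }
}

/* Hypothetical hardware access; on a real MCU this writes the
   watchdog service register.                                    */
extern void kick_hardware_watchdog(void);

/* Runs periodically at low priority, well before the hardware timeout. */
void watchdog_service_task(void)
{
    bool all_alive = true;
    for (uint32_t i = 0u; i < NUM_TASKS; i++) {
        if (!task_alive[i]) { all_alive = false; }
    }

    if (all_alive) {
        /* Only kick when every task has proven it ran since the last check. */
        for (uint32_t i = 0u; i < NUM_TASKS; i++) { task_alive[i] = false; }
        kick_hardware_watchdog();
    }
    /* else: withhold the kick so the hardware watchdog resets the system. */
}
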
(3) How do you know your software quality is good enough?
  • What's your unit test coverage?
    • If you haven't exercised, say, 95% of your code in unit test, you're waiting to find those bugs until later, when it's more expensive to find them. (Blog) (There is an assumption that the remaining 5% are exception cases that should "never" happen, but it's even better to exercise them too.)
    • In general, you should have coverage metrics and traceability for all your testing to make sure you are actually getting what you want out of testing. (Blog)
  • What's your peer review coverage?
    • Peer review finds half the defects for 10% of the project cost. (Blog)  But only if you do the reviews! (Blog)
  • Are your peer reviews finding at least 50% of your defects?
    • If you're finding more than 50% of your defects in test instead of peer review, then your peer reviews are broken. It's as simple as that. (Blog)
    • Here's a peer review checklist to get you started. (Blog)
  • Does your testing include software specific aspects?
    • A product-level test plan is pretty much sure to miss some potentially big software bugs that will come back to bite you in the field. You need a software-specific test plan as well. (Blog)
  • How do you know you are actually following essential design practices, such as protecting shared variables to avoid concurrency bugs?
    • Are you actually checking your code against style guidelines using peer review and static analysis tools? (Blog)
    • Do your style guidelines include not just cosmetics, but also technical practices such as disabling task switching or using a mutex when accessing a shared variable? (Blog) Or avoiding stack overflow? (Blog) A sketch of shared-variable protection appears just below.
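
As an illustration of the shared-variable point, here is a minimal sketch of protecting a multi-byte value that is written by an interrupt service routine and read by the main loop. The critical-section macros are placeholders for whatever your compiler or RTOS provides; the names and values are assumptions for illustration only.

#include <stdint.h>

/* Placeholder critical-section macros -- substitute your compiler
   intrinsics or RTOS calls here (these names are illustrative).    */
#define DISABLE_INTERRUPTS()    /* e.g., an interrupt-disable intrinsic */
#define ENABLE_INTERRUPTS()     /* e.g., an interrupt-enable intrinsic  */

static volatile uint32_t rpm_reading;    /* written only by the ISR */

/* Interrupt service routine: the single writer. */
void speed_sensor_isr(void)
{
    rpm_reading = 1234u;    /* illustrative value from a hardware capture register */
}

/* Main loop reader: a 32-bit value can be updated in mid-read on an
   8- or 16-bit CPU, so copy it inside a critical section.            */
uint32_t get_rpm(void)
{
    uint32_t copy;
    DISABLE_INTERRUPTS();
    copy = rpm_reading;
    ENABLE_INTERRUPTS();
    return copy;
}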

(4) Is your software process methodical and rigorous enough?
  • Do you have a picture showing the steps in your software and problem fix process?
    • If it's just in your head then probably every developer has a different mental picture and you're not all following the same process. (Book Chapter 2)
  • Are there gaps in the process that are causing you pain or leading to problems?
    • Very often technical defects trace back to cutting corners in the development process or skipping review/test steps.
    • Skipping peer reviews and unit test in the hopes that product testing catches all the problems is a dangerous game. In the end, cutting corners on the development process takes at least as long, and the product tends to ship with higher defect rates.
  • Are you doing a real usability analysis instead of just having your engineers wing it?
    • Engineers are a poor proxy for users. Take human usability seriously. (Blog; Book Chapter 15)
  • Do you have configuration management, version control, bug tracking, and other basic software development practices in place?
    • You'd think we would not have to ask. But we find that we do.
    • Do you prioritize bugs based on value to project rather than severity of symptoms? (Blog)
  • Is your test-to-development effort ratio appropriate? Usually you should spend twice as many hours on test+reviews as on creating the design and implementation.
    • Time and again when we poll companies doing a reasonable job on embedded software of decent quality we find the following ratios.  One tester for every developer (1:1 head count ratio).  Two test/review hours (including unit test and peer review) for every development hour (2:1 effort ratio). The companies that go light on test/review usually pay for it with poor code quality. (Blog)
  • Do you have the right amount of paperwork (neither too heavy nor too light)?
    • Yes, you need to have some paper even if you're doing Agile. (Blog) It's like having ballast in a ship. Too little and you capsize. Too much and you sink. (Probably you have too little unless you work on military/aerospace projects.)  And you need the right paper, not just paper for paper's sake.  (Book Chapters 3-4)

(5) What about dependability aspects?
  • Have you considered maintenance issues, such as patch deployment?
    • If your product is not disposable, what happens when you need to update the firmware?
  • Have you done stress testing and other evaluation of robustness?
    • If you sell a lot of units they will see things in the field you never imagined and will (you hope) run without rebooting for years in many cases. What's your plan for testing that? (Blog)
    • Don't forget specialty issues such as EEPROM wearout (Blog), time/date management (Blog), and error detection code selection (Blog).
  • Do your requirements and design address safety and security?
    • Maybe safety (Blog) and security (Blog) don't matter for you, but that's increasingly unlikely. (Blog)
    • Probably if this is the first time you've dealt with safety and security you should either consult an internal expert or get external help. Some critical aspects for safety and security take some experience to understand and get right, such as avoiding security pitfalls (Blog) and eliminating single points of failure. (Blog)
    • And while we're at it, you do have written, complete, and measurable requirements for everything, don't you?  (Book Chapters 5-9)
For another take on these types of issues, see my presentation on Top 43 embedded software risk areas (Blog). There is more to shipping a great embedded system than answering all the above questions. (Blog) And I'm sure everyone has their own list of things they like to look for that can be added. But, if you struggle with the above topics, then getting everything else right isn't going to be enough.

Finally, the point of my book is to explain how to detect and resolve the most common issues I see in design reviews. You can find more details in the book on most of these topics beyond the blog postings. But the above list and links to many of the blog postings I've made since releasing the book should get you started.



Monday, May 16, 2016

Robustness Testing

I have been doing research in the area of robustness testing for many years, and once in a while I have to explain how that approach to testing fits into the bigger umbrella of fault injection and related ideas. Here's a summary of typical approaches (there are many variations and extensions beyond these as you might imagine).  At the end is a description of the robustness testing work my group has been doing over many years.

Mutation Testing:
Goal: Evaluate coverage/effectiveness of an existing test suite. (Also known as "bebugging.")
Approach: Modify System under Test (SuT) with a hypothetical bug and see if an existing test suite finds it.
Narrative: I have a test suite. I wonder how thorough it is?  Let me put a bug (mutation) into my code and see if my test suite finds it. If it finds all the mutations I insert, maybe my test suite is thorough.
Fault Model: Source code bug that is undetected by testing
Strengths: Can find problems even if code was already 100% branch-covered in test suite (e.g., mutate a comparison to > instead of >= to see if test suite exercises the equality case for that comparison)
Limitations: Requires an existing test suite (but, can combine with automated test generation to create additional tests automatically). Effectiveness heavily depends on inserting realistic mutations, which is not necessarily so easy.

Classical Fault Injection Testing:
Goal: Determine robustness of SuT when its code or data is corrupted.
Approach: Corrupt the binary image of the SuT code or corrupt data during run-time to see if system crashes, is unsafe, or tolerates the fault.
Narrative: I have a running system. I wonder what happens if I flip a bit in the data -- does the system crash?  I wonder what happens if I corrupt the compiled code -- does the system act in an unsafe way?
Fault Model: Hardware bit flip or software-based memory corruption.
Strengths: Can find realistic failures caused by single-event upsets, hardware faults, and software defects that corrupt computational state.
Limitations: The fault model is memory and program bit-level corruption. Fault injection testing is useful for high-integrity systems deployed in large volume (they have to survive even very infrequent faults), and essential for aviation and space systems that will see high rates of single event upsets.  But it is arguably a bit excessive for non-safety-critical applications.

Robustness Testing:
Goal: Determine robustness of SuT when it is fed exceptional or unusual values.
Approach: Corrupt the inputs to the SuT during run-time to see if system crashes, is unsafe, or tolerates the fault.
Narrative: I have a running system or subsystem. I wonder what happens if the inputs from other components or sensors have garbage, unusual, or random values -- does the system crash or act in an unsafe way?
Fault Model: Some other module than the SuT has a bug that results in exceptional data being sent to the SuT.
Strengths: Can find realistic failures caused by likely run-time faults such as null pointers, NaNs (Not-a-Number floating point values), corrupted input data, and in general faults in modules that are not the SuT, but rather other software or sensors present in the system that might have bugs that generate exceptional data.
Limitations: The fault model is generally that some other piece of software has a bug and that bug will generate bad data that kills the SuT.  You have to decide how likely that is and whether it's OK in such a case for the SuT to misbehave. We have found many situations in which such test results are important, even in systems that are not safety critical.

Fuzzing:
A classical form of robustness testing is "fuzzing," in which random inputs are tossed into a system to see what happens rather than carefully selected specific input values. My research group's work centers on finding efficient ways to do robustness testing so that fewer tests are needed to find system-killer values.

Ballista:
The Ballista project pioneered efficient robustness testing in the late 1990s, and that line of work continues today with stress testing of robots and autonomous vehicles.

Two key ideas of Ballista are:
  • Have a dictionary of interesting exceptional values so you don't have to stumble onto them by chance (e.g., just try a NULL pointer straight out rather than wait for a random number generator to happen to generate a zero value as a fuzzing input value)
  • Make it easy to generate tests by basing that dictionary on the data types taken by a function call instead of the function being performed.  So we don't care if it is a memory management function or a file write being tested - we just say for example that if it's a memory pointer, let's try NULL as an input value to the function. This gets us excellent scalability and portability across systems we test.
A key benefit of Ballista and other robustness testing approaches is that they look for holes in the code itself, rather than holes in the test cases. Consider that most test coverage approaches (including mutation testing) are interested in testing all the code that is there (which is a good thing!). In contrast, robustness testing goes beyond code coverage to find the places where you should have had code to handle exceptional situations, but that code is missing. In other words, robustness testing often finds bugs due to missing code that should have been there. We find that it is pretty typical for software to be non-robust unless this type of testing has been done to identify such problems.
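
Here is a tiny sketch of the data-type dictionary idea, not the actual Ballista tool: exceptional values are chosen by parameter type, and each test case runs in its own process so a crash is detected rather than killing the harness. The function under test and the dictionary entries are made up for illustration, and the process isolation uses plain POSIX calls.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Function under test -- deliberately non-robust for the demonstration. */
static size_t count_chars(const char *s) { return strlen(s); }

int main(void)
{
    /* Dictionary of exceptional values keyed to the parameter's data type
       ("pointer to character string"), in the spirit of Ballista. A real
       dictionary is much larger.                                          */
    const char *dictionary[] = { NULL, "", "ordinary input" };
    const char *labels[]     = { "NULL pointer", "empty string", "benign string" };

    for (int i = 0; i < 3; i++) {
        pid_t pid = fork();               /* isolate each test case */
        if (pid == 0) {
            (void)count_chars(dictionary[i]);
            exit(0);                      /* child survived this test case */
        }
        int status = 0;
        waitpid(pid, &status, 0);
        printf("%-14s : %s\n", labels[i],
               WIFSIGNALED(status) ? "CRASH (robustness failure)" : "ok");
    }
    return 0;
}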

You can find more about our research at the Stress Tests for Autonomy Architectures (STAA) project page, which includes video of what goes wrong when you stress test a couple robotic systems:

Saturday, April 16, 2016

Challenges in Autonomous Vehicle Testing and Validation

It's a lot of work to demonstrate that self-driving cars will actually work properly.  Testing alone is probably not going to be enough, and probably there will need to be some clever architectural approaches as well.  Last week I gave a talk at the SAE World Congress in Detroit about this and related challenges.

Link to paper 
Link to slides  (also see slideshare version by scrolling down).

Challenges in Autonomous Vehicle Testing and Validation 
     Philip Koopman & Michael Wagner
     Carnegie Mellon University; Edge Case Research LLC
     SAE World Congress, April 14, 2016

Abstract:
Software testing is all too often simply a bug hunt rather than a well considered exercise in ensuring quality. A more methodical approach than a simple cycle of system-level test-fail-patch-test will be required to deploy safe autonomous vehicles at scale. The ISO 26262 development V process sets up a framework that ties each type of testing to a corresponding design or requirement document, but presents challenges when adapted to deal with the sorts of novel testing problems that face autonomous vehicles. This paper identifies five major challenge areas in testing according to the V model for autonomous vehicles: driver out of the loop, complex requirements, non-deterministic algorithms, inductive learning algorithms, and fail operational systems. General solution approaches that seem promising across these different challenge areas include: phased deployment using successively relaxed operational scenarios, use of a monitor/actuator pair architecture to separate the most complex autonomy functions from simpler safety functions, and fault injection as a way to perform more efficient edge case testing. While significant challenges remain in safety-certifying the type of algorithms that provide high-level autonomy themselves, it seems within reach to instead architect the system and its accompanying design process to be able to employ existing software safety approaches.



Wednesday, March 23, 2016

Automotive Remote Keyless Entry Security


Recently there has been another round of reports on the apparent insecurity of remote keyless entry devices -- the electronic key fobs that open your car doors with a button press or even hands-free.  In this case it's not a lot to get excited about as I'll explain, but in general this whole area could use significant improvement because there are some serious concerns.  You'd think that in the 20+ years since I was first involved in this area the industry would get this stuff right on a routine basis, but the available data suggests otherwise. The difference now is that attackers are paying attention to these types of systems.

The latest attack involves a man-in-the-middle intercept that relays signals back and forth between the car and an owner's key fob:
  http://www.wired.com/2016/03/study-finds-24-car-models-open-unlocking-ignition-hack/

Here's why this particular risk might be overblown.  In systems of this type, the signal sent by the car is typically an inductively coupled low frequency signal with a range of about a meter.  That's how the car knows you're actually standing by the door and not sipping a latte at a cafe 10 meters away.  So having a single intercept box near the car won't work.  Typically it's going to be hard to transmit an inductively coupled signal from your parking lot to your bedroom with any reasonably portable intercept device (unless your car is really, really close to your bedroom).  That's most likely why this article says there must be TWO intercept devices: one near the car, and one near the keys.  So I wouldn't worry about someone using this particular attack to break into a car in the middle of the night, because if they can get an intercept box onto your bedroom dresser you have bigger problems.

The more likely attack is someone walking near you in a shopping mall or at an airport who has targeted your specific car.  They can carry an intercept device near your pocket/bag while at the same time putting an intercept device near your car.  No crypto hack required -- it's a classic relay attack.  Sure, it could happen, but a little far fetched for most folks unless you have an exceptionally valuable car.  And really, there are often easier ways to go than this. Attackers might be able to parlay this into a playback attack if the car's crypto is stupid enough -- but at some point they have to get within a meter or so of your physical key to ping it and have a shot at such an attack.

If you have a valuable car and already have your passport and non-contact credit cards in a shielded case, it might be worthwhile putting your keys in a Faraday cage when out of the house (perhaps an Altoids box).  But I'd avoid the freezer at home as both unnecessary and possibly producing condensate inside your device that could ruin it.

The more concerning thing is that devices to break into cars have been around for a while and there is no reason to believe they are based on the attack described in this article.  They could simply be exploiting bad security design, possibly without proximity to the legitimate transmitter.  Example scenarios include badly designed crypto (e.g., Keeloq), badly designed re-synch, or badly designed playback attack protection for RF intercepts when the legitimate user is transmitting on purpose.  Clever variations include blanket jamming and later playback, and jamming one of a pair of messages for later playback. Or broken authentication for OnStar-like remote unlock.  If the system  has too few bits in its code and doesn't use a leaky bucket rate limiting algorithm, you can just use a brute force attack.
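
Since the leaky bucket idea comes up above, here is a minimal sketch of rate limiting code guesses that way: each failed unlock attempt adds to a bucket that drains slowly over time, and a full bucket locks out further attempts until it drains. The capacity, drain rate, and function names are assumptions for illustration, not values from any particular vehicle.

#include <stdbool.h>
#include <stdint.h>

#define BUCKET_CAPACITY   10u    /* illustrative: failed attempts tolerated   */
#define DRAIN_PERIOD_SEC  60u    /* illustrative: one token drains per minute */

static uint32_t bucket_level;
static uint32_t last_drain_time;

/* Call before processing a received unlock code; now_sec is seconds of uptime. */
bool unlock_attempt_allowed(uint32_t now_sec)
{
    uint32_t drained = (now_sec - last_drain_time) / DRAIN_PERIOD_SEC;
    if (drained > 0u) {
        bucket_level = (drained >= bucket_level) ? 0u : (bucket_level - drained);
        last_drain_time += drained * DRAIN_PERIOD_SEC;
    }
    return (bucket_level < BUCKET_CAPACITY);   /* full bucket = likely brute force */
}

/* Call whenever a received code fails authentication. */
void record_failed_attempt(void)
{
    if (bucket_level < BUCKET_CAPACITY) { bucket_level++; }
}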

Here's a video from which you learn both that the problem seems real in practice and that folks like the media, insurance, police, and investigators could do with a bit more education in this area:
   https://www.youtube.com/watch?v=97ceREjpIvI


Note that similar or identical technology is used for garage door openers.

On a related note, there is also some concern about the safety of smart keys regarding compliance with the Federal safety standard for rollaway protection and whether a car can keep running when nobody is in the car.  These concerns are related to the differences between electronic keys and physical keys that go into a traditional ignition switch. There have been some lawsuits and discussions about changing the Federal safety regulations: https://www.federalregister.gov/articles/2011/12/12/2011-31441/federal-motor-vehicle-safety-standards-theft-protection-and-rollaway-prevention

Monday, March 7, 2016

Multiple Returns and Error Checking

Summary: Whether or not to allow multiple returns from a function is a controversial matter, but I recommend having a single return statement AND avoiding use of goto.

Discussion:

If you've been programming robust embedded systems for a while, you've seen code that looks something like this:

int MyRoutine(...)
{ ...
  if(..something fails..) { ret = 0; }
  else 
  { .. do something ...
    if(..somethingelse fails..) {ret = 0;}
    else 
    { .. do something ..
      if(...yetanotherfail..) {ret = 0;}
      else 
      { .. do the computation ...
        ret = value; 
      }
    }
  }
  // single return point
  return(ret);
}

(Note: the "ret = 0" and "ret = value" parts are usually more complex; I'm keeping this simple for the sake of the discussion.)

The general pattern of this code is a series of validity checks, with the desired action finally appearing in the most deeply nested "else."  That innermost "else" is usually the entire point of the routine -- the action you really want to take if all the exception checks pass.  This type of code shows up a lot in robust embedded systems, but it can be difficult to understand.

This problem has been dealt with a number of times and most likely every reasonable idea has been re-invented many times. But when I ran up against it in a recent discussion I did some digging and found out I mostly disagree with a lot of the common wisdom, so here is what I think.

A common way to simplify things is the following, which is the guard clause pattern described by Martin Fowler in his refactoring catalog (an interesting resource with accompanying book if you haven't run into it before.)

int MyRoutine(...)
{ ...
  if(..something fails..)     {return(0);}
  if(..somethingelse fails..) {return(0);}
  if(...yetanotherfail..)     {return(0);}
  // perform default function
  return(value);
}

Reasonable people can (and do!) disagree about how to handle this.   Steve McConnell devotes a few pages of his book Code Complete (Sections 17.1 and 17.3 in the second edition) to this territory, with several alternate suggestions, including the guard clause pattern and a discussion that says maybe a goto is OK sometimes if used carefully.

I have a somewhat different take, possibly because I specialize in high-dependability systems where being absolutely sure you are not getting things wrong can be more important than subjective code elegance. (The brief point of view is: getting code just a little klunky but obviously right and easy to check for correctness is likely to be safe. Getting code to look nice, but with a residual small chance of missing a subtle bug, might kill someone. More on this line of thought can be found in a different posting.)

Here is a sketch of some of the approaches I've seen and my thoughts on them:

(1) Another Language. Use another language that does a better job at this.  Sure, but I'm going to assume that you're stuck with just C.

(2) Guard Clauses.  Use the guard clause pattern with early returns as shown above. The downside of this is that it breaks the single-return-per-function rule that is often imposed on safety critical code. While a very regular structure might work out OK in many cases, things get tricky if the returns are part of code that is more complex. Rather than deal with shades of gray about whether it is OK to have more than one return, I prefer a strict single-return-per-function rule. The reason is that it avoids having to waste a lot of time hashing out when multiple returns are OK and when they aren't -- and the accompanying risk of getting that judgment call wrong. In other words, black-and-white rules take less interpretation.

If you are building a safety critical system, every subjective judgement call as to what is OK and what is not OK is a chance to get it wrong.  (Could you get airtight design rules in place? Perhaps. But you still need to check them accurately after every code modification. If you can't automate the checks, realistically the code quality will likely degrade over time, so it's a path I prefer not to go down.)

(3) Else If Chain.  Use an if/else chain and put the return at the end:

int MyRoutine(...)
{ ...
  if(..something fails..)          {ret=0;}
  else if(..somethingelse fails..) {ret=0;}
  else if(...yetanotherfail..)     {ret=0;}
  else { // perform default function
         ret = value;
        }
  return(ret);
}

Wait, isn't this the same code with flat indenting?  Not quite.  It uses "else if" rather than "else { if". That means that the if statements aren't really nested -- there is really only a single main path through the code (a bunch of checks and the main action in the final code segment), without the possibility of lots of complex conditions in the "if" sidetrack branches.

If your code is flat enough that this works then this is a reasonable way to go. Think of it as guard clauses without the multiple returns.  The regular "else if" structure makes it pretty clear that this is a sequence of alternatives, or often a set of "check that everything is OK and in the end take the action." However, there may be cases in which the code logic is difficult to flatten this way, and in those cases you need something else (keep reading for more ideas).  The trivial case is:

(3a) Simplify To An Else If Chain.  Require that all code be simplified enough that an "else if" chain works.  This is a nice goal, and my first choice of style approaches.  But sometimes you might need a more capable and flexible approach.  Which brings us to the rest of the techniques...

(4) GOTO. Use a "goto" to break out of the flow and jump to the end of the routine.  I have seen many discussion postings saying this is the right thing to do because the code is cleaner than lots of messy if/else structures. At the risk of being flamed for this, I respectfully disagree. The usual argument is that a good programmer exhibiting discipline will do just fine.  But that entirely misses what I consider to be the bigger picture.  I don't care how smart the programmer who wrote the code is (or thinks he is).  I care a lot about whoever has to check and maintain the code. Sure, a good programmer on a good day can use a "goto" and basically emulate a try/throw/catch structure.  But not all programmers are top 10 percentile, and a lot of code is written by newbies who simply don't have enough experience to have acquired mature judgment on such matters. Beyond that, nobody has all good days.

The big issue isn't whether a programmer is likely to get it right. The issue is how hard (and error-prone) it is for a code reviewer and static analysis tools to make sure the programmer got it right (not almost right, or subtly wrong). Using a goto is like pointing a loaded gun at your foot. If you are careful it won't go off. But even a single goto shoves you onto a heavily greased slippery slope. Better not to go there in the first place. Better to find a technique that might seem a little more klunky, but that gets the job done with minimum fuss, low overhead, and minimal chance to make a mistake. Again, see my posting on not getting things wrong.

(Note: this is with respect to unrestricted "goto" commands used by human programmers in C. Generated code might be a different matter, as might be a language where "goto" is restricted.)
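For reference, here is a sketch of the goto error-handling idiom being discussed, with hypothetical checks and a made-up computation standing in for real work.  Even in this tidy form, a reviewer still has to confirm that every goto in the routine jumps forward to the single cleanup label:

#include <stdlib.h>

#define BUF_SIZE 64u   // hypothetical buffer size

int MyRoutine(int input)
{ int ret = 0;
  char *buf = malloc(BUF_SIZE);

  if (buf == NULL)   { goto cleanup; }   // allocation failed
  if (input < 0)     { goto cleanup; }   // hypothetical range check
  if (input > 1000)  { goto cleanup; }   // hypothetical limit check

  // perform default function (placeholder computation)
  buf[0] = (char)input;
  ret = input * 2;

cleanup:
  free(buf);         // free(NULL) is harmless per the C standard
  return(ret);
}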

(5) Longjmp.  Use setjmp/longjmp to set a target and then jump to that target if the list of "if" error checks wants to return early. In the final analysis this is really the moral equivalent of a "goto," although it is a bit better controlled. Moreover, it uses a pointer to code (the setjmp/longjmp variable), so it is an indirect goto, and that in general can be hazardous. I've used setjmp/longjmp and it can be made to work (or at least seem to work), but dealing with pointers to code in your source code is always a dicey proposition. Jumping to a corrupted or uninitialized pointer can easily crash your system (or sometimes worse).  I'd say avoid using this approach.
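As a minimal sketch of what such code typically looks like (hypothetical checks; shown only for illustration, since I recommend against this approach):

#include <setjmp.h>

static jmp_buf error_env;   // saved context used as the jump target

int MyRoutine(int input)
{ volatile int ret = 0;     // volatile is the safe choice for locals
                            // modified between setjmp and longjmp

  if (setjmp(error_env) == 0)
  { // normal path
    if (input < 0)    { longjmp(error_env, 1); }   // hypothetical check
    if (input > 100)  { longjmp(error_env, 2); }   // hypothetical check

    // perform default function (placeholder computation)
    ret = input * 2;
  }
  else
  { // any longjmp above lands here
    ret = 0;
  }
  return(ret);
}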

I've seen discussion forum posts that wrap longjmp-based approaches up in macros to approximate Try/Throw/Catch. I can certainly appreciate the appeal of this approach, and I could see it being made to work. But I worry about things such as whether it will work if the macros get nested, whether reviewers will be aware of any implementation assumptions, and what will happen with static analysis tools on those structures. In high-dependability code if you can't be sure it will work, you shouldn't do it.

(6) Do..While..Break. Out of the classical C approaches I've seen, the one I like the most (beyond else..if chains) is using a "do..while..break" structure:

int MyRoutine(...)
{ ...
  do { //  start error handling code sequence
    if(..something fails..)     {ret=0; break;}
    if(..somethingelse fails..) {ret=0; break;}
    if(...yetanotherfail..)     {ret=0; break;}
    // perform default function
    ret = value;
  } while (0); // end error handling code sequence
  return(ret);
}

This code is a hybrid of the guard clause pattern and the "else if" block pattern. It uses a "break" to skip the rest of the code in the block (jumping to the while(0), which always exits the loop). The while(0) converts this structure from a loop into just a structured block of code, with the ability to branch to the end of the block using the "break" keyword. This code ought to compile to an executable with essentially identical efficiency to code using a goto or an else..if chain. But it puts an important restriction on the goto-like capability -- all the jumps have to point to the end of the do..while, without exception.

What this means in practice is that when reviewing the code (or a change to the code) there is no question as to whether a goto is well behaved or not. There are no gotos, so you can't make an unstructured goto mistake with this approach.

While this toy example looks pretty much the same as the "else if" structure, an important point is that the "break" can be placed anywhere -- even deeply within a nested if statement -- without raising questions as to what happens when the "break" is hit. If in doubt, or if the code logic gets too complex for this to handle cleanly, I'd suggest restructuring until either an "else if" chain or this technique works. The main limitation to keep in mind is that if you nest do..while structures, a break only exits one level of nesting, as the sketch below shows.
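Here is a hypothetical sketch of that pitfall -- the inner break only reaches the inner while(0), so a flag is needed to propagate the early exit outward:

#include <stdbool.h>

int MyRoutine(int input)
{ int  ret = 0;
  bool ok  = true;

  do { // outer error handling block
    do { // inner block
      if (input < 0) { ok = false; break; }   // exits only the INNER do..while
      ret = input * 2;                        // placeholder computation
    } while (0);

    if (!ok) { break; }   // needed to propagate the early exit outward

    ret += 1;             // placeholder for further processing
  } while (0);
  return(ret);
}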

I recognize that this area falls a little bit into a matter of taste and context. My taste is for code that is easy to review and unlikely to have bugs. If that is at odds with subjective notions of elegance, so be it. In part my preference is to outlaw the routine use of techniques that require manual analysis to determine whether a potentially unsafe structure is being used the "right" way. Every such judgement call is a chance to get it wrong, and a distraction of human reviewer attention away from more important things. And I dislike arguments of the form that a "good" and experienced programmer won't make a mistake. It is just too easy to miss a subtle bug, especially when you're modifying code in a hurry.

If you've run into another way to handle this problem let me know.

Update 11/2022:  Olayiwola Ayinde has a similar take, pointing out that multiple exits make it too easy to forget to de-allocate resources that might have been allocated at the start of a procedure.
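As a hypothetical example of that hazard, an early return added in the middle of a routine can silently skip the cleanup at the bottom:

#include <stdlib.h>

int LeakyRoutine(int input)
{ char *buf = malloc(64u);
  if (buf == NULL)  { return(0); }

  if (input < 0)    { return(0); }   // OOPS: leaks buf

  buf[0] = (char)input;              // placeholder computation
  free(buf);
  return(1);
}

With a single return at the end (or the do..while..break structure above), the free() is much harder to skip by accident.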
