Better Embedded System SW: 2017

Monday, November 27, 2017

Embedded Software Course Notes On-Line

I'm just wrapping up my first semester teaching a new course on embedded system software. It covers code quality, safety, and security. Below is table of lecture handouts.

NOTE: there is an update here:
https://users.ece.cmu.edu/~koopman/lectures/index.html#642
which includes newer course notes and quite a few YouTube videos of these lectures.
You should use that URL instead of this blog post, but I've left this post as-is for Fall 2017.

18-642 Embedded System Software Engineering
Prof. Philip Koopman, Carnegie Mellon University, Fall 2017

	Slides	Topics
1	Course Introduction	Software is eating the world; embedded applications and markets; bad code is a problem; coding is 0% of software; truths and management misconceptions
2	Software Development Processes	Waterfall; swiss cheese model; lessons learned in software; V model; design vs. code; agile methods; agile for embedded
3	Global Variables	Global vs. static variables; avoiding and removing globals
4	Spaghetti Code	McCabe Cyclomatic Complexity (MCC); SCC; Spaghetti Factor (SF)
5	Unit Testing	Black box testing; white box testing; unit testing strategies; MCDC coverage; unit testing frameworks (cunit)
6	Modal Code/Statecharts	Statechart elements; statechart example; statechart implementation
7	Peer Reviews	Effective code quality practices, peer review efficiency and effectiveness; Fagan inspections; rules for peer review; review report; perspective-based reviews; review checklist; case study; economics of peer review
8	Code Style/Humans	Making code easy to read; good code hygiene; avoiding premature optimization; coding style
9	Code Style/Language	Pitfalls and problems with C; language use guidelines and analysis tools; using language wisely (strong typing); Mars Climate Orbiter; deviations & legacy code
10	Testing Quality	Smoke testing, exploratory testing; methodical test coverage; types of testing; testing philosophy; coverage; testing resources
11	Requirements	Ariane 5 flight 501; rules for good requirements; problematic requirements; extra-functional requirements; requirements approaches; ambiguity
12	System-Level Test	First bug story; effective test plans; testing won't find all bugs; F-22 Raptor date line bug; bug farms; risks of bad software
13	SW Architecture	High Level Design (HLD); boxes and arrows; sequence diagrams (SD); statechart to SD relationship; 2011 Health Plan chart
14	Integration Testing	Integration test approaches; tracing integration tests to SDs; network message testing; using SDs to generate unit tests
15	Traceability	Traceability across the V; examples; best practices
16	SQA isn't testing	SQA elements; audits; SQA as coaching staff; cost of defect fixes over project cycle
17	Lifecycle CM	A400M crash; version control; configuration management; long lifecycles
18	Maintenance	Bug fix cycle; bug prioritization; maintenance as a large cost driver; technical debt
19	Process Key Metrics	Tester to developer ratio; code productivity; peer review effectiveness
33	Date Time Management	Keeping time; time terminology; clock synchronization; time zones; DST; local time; sunrise/sunset; mobility and time; date line; GMT/UTC; leap years; leap seconds; time rollovers; Zune leap year bug; internationalization.
21	Floating Point Pitfalls	Floating point formats; special values; NaN and robots; roundoff errors; Patriot Missile mishap
23	Stack Overflow	Stack overflow mechanics; memory corruption; stack sentinels; static analysis; memory protection; avoid recursion
25	Race Conditions	Therac 25; race condition example; disabling interrupts; mutex; blocking time; priority inversion; priority inheritance; Mars Pathfinder
27	Data Integrity	Sources of faults; soft errors; Hamming distance; parity; mirroring; SECDED; checksum; CRC
20	Safety+Security Overview	Challenges of embedded code; it only takes one line of bad code; problems with large scale production; your products live or die by their software; considering the worst case; designing for safety; security matters; industrial controls as targets; designing for security; testing isn't enough Fiat Chrysler jeep hack; Ford Mytouch update; Toyota UA code quality; Heartbleed; Nest thermostats; Honda UA recall; Samsung keyboard bug; hospital infusion pumps; LIFX smart lightbulbs; German steel mill hack; Ukraine power hack; SCADA attack data; Shodan; traffic light control vulnerability; hydroelectric plant vulnerability; zero-day shopping list
22	Dependability	Dependability; availability; Windows 2000 server crash; reliability; serial and parallel reliability; example reliability calculation; other aspects of dependability
24	Critical Systems	Safety critical vs. mission critical; worst case and safety; HVAC malfunction hazard; Safety Integrity Levels (SIL); Bhopal; IEC 61508; fleet exposure
26	Safety Plan	Safety plan elements; functional safety approaches; hazards & risks; safety goals & safety requirements; FMEA; FTA; safety case (GSN)
28	Safety Requirements	Identifying safety-related requirements; safety envelope; Doer/Checker pattern
29	Single Points of Failure	Fault containment regions (FCR); Toyota UA single point failure; multi-channel pattern; monitor pattern; safety gate pattern; correlated & accumulated faults
30	SIL Isolation	Isolating different SILs, mixed-SIL interference sources; mitigating cross-SIL interference; isolation and security; CarShark hack
31	Redundancy Management	Bellingham WA gasoline pipeline mishap; redundancy for availability; redundancy for fault detection; Ariane 5 Flight 501; fail operational; triplex modular redundancy (TMR) 2-of-3 pattern; dual 2-of-2 pattern; high-SIL Doer/Checker pattern; diagnostic effectiveness and proof tests
32	Safety Architecture Patterns	Supplemental lecture with more detail on patterns: low SIL; self-diagnosis; partitioning; fail operational; voting; fail silent; dual 2-of-2; Ariane 5 Flight 501; fail silent patterns (low, high, mixed SIL); high availability mixed SIL pattern
34	Security Plan	Security plan elements; Target Attack; security requirements; threats; vulnerabilities; mitigation; validation
35	Cryptography	Confusion & diffusion; Caesar cipher; frequency analysis; Enigma; Lorenz & Colossus; DES; AES; public key cryptography; secure hashing; digital signatures; certificates; PKI; encrypting vs. signing for firmware update
36	Security Threats	Stuxnet; attack motivation; attacker threat levels; DirectTV piracy; operational environment; porous firewalls; Davis Besse incident; BlueSniper rifle; integrity; authentication; secrecy; privacy; LG Smart TV privacy; DoS/DDos; feature activation; St. Jude pacemaker recall
37	Security Vulnerabilities	Exploit vs. attack; Kettle spambot; weak passwords; master passwords; crypto key length; Mirai botnet attack; crypto mistakes; LIFX revisited; CarShark revisited; chip peels; hidden functionality; counterfeit systems; cloud connected devices; embedded-specific attacks
38	Security Mitigation Validation	Password strength; storing passwords & salt/pepper/key stretching; Adobe password hack; least privilege; Jeep firewall hack; secure update; secure boot; encryption vs. signing revisited; penetration testing; code analysis; other security approaches; rubber hose attack
39	Security Pitfalls	Konami code; security via obscurity; hotel lock USB hack; Kerckhoff's principle; hospital WPA setup hack; DECSS; Lodz tram attack; proper use of cryptography; zero day exploits; security snake oil; realities of in-system firewalls; aircraft infotainment and firewalls; zombie road sign hack

Note that in Spring 2018 these are likely to be updated, so if want to see the latest also check the main course page: https://www.ece.cmu.edu/~ece642/ For other lectures and copyright notes, please see my general lecture notes & video page: https://users.ece.cmu.edu/~koopman/lectures/index.html

Friday, November 17, 2017

Highly Autonomous Vehicle Validation

Here are the slides from my TechAD talk today.

Highly Autonomous Vehicle Validation from Philip Koopman

Highly Autonomous Vehicle Validation: it's more than just road testing!
- Why a billion miles of testing might not be enough to ensure self-driving car safety.
- Why it's important to distinguish testing for requirements validation vs. testing for implementation validation.
- Why machine learning is the hard part of mapping autonomy validation to ISO 26262

Monday, October 9, 2017

Top Five Embedded Software Management Misconceptions

Here are five common management-level misconceptions I run into when I do design reviews of embedded systems. How many of these have you seen recently?

(1) Getting to compiled code quickly indicates progress. (FALSE!)

Many projects are judged by "coding completed" to indicate progress. Once the code has been written, compiles, and kind of runs for a few minutes without crashing, management figures that they are 90% there. In reality, a variant of the 90/90 rule holds: the first 90% of the project is in coding, and the second 90% is in debugging.

Measuring teams on code completion pressures them to skip design and peer reviews, ending up with buggy code. Take the time to do it right up front, and you'll more than make up for those "delays" with fewer problems later in the development cycle. Rather than measure "code completed" do something more useful, like measure the fraction of modules with "peer review completed" (and defects found in peer review corrected). There are many reasonable ways to manage, but waterfall-ish projects that treat "code completed" as the most critical milestone is not one of them.

(2) Smart developers can write production-quality code on a long weekend (FALSE!)

Alternate form: marketing sets both requirements and end date without engineering getting a chance to spend enough time on a preliminary design to figure out if it can actually be done.

The true bit is anyone can slap together some code that doesn't work. Some folks can slap together code in a long weekend that almost works. But even the best of us can only push so many lines of code in a short amount of time without making mistakes, much less producing something anyone else can understand. Many of us remember putting together hundreds or thousands of lines on an all-nighter when we were students. That should not be mistaken for writing production embedded code.

Good embedded code tends to cost about an hour for every 1 or 2 lines of non-comment code all-in, including testing (on a really good day 3 lines/hr). Some teams come from the Lake Wobegone school, where all the programmers are above average. (Is that really true for your team? Really? Good for you! But you still have to pay attention to the other four items on this list.) And sure, you can game this metric if you try. Nonetheless, it is remarkable how often I see a number well above about 2 SLOC/hour of deeply embedded code corresponding to a project that is in trouble.

Regardless of the precise productivity number, if you want your system to really work, you need to treat software development as a core competency. You need an appropriately methodical and rigorous engineering process. Slapping together code quickly gives the illusion of progress, but it doesn't produce reliable products for full-scale production.

(3) A “mostly working,” undisciplined prototype can be deployed. (FALSE!)

Quick and dirty prototypes provide value by giving stakeholders an idea of what to expect and allowing iterations to converge on the right product. They are invaluable for solidifying nebulous requirements. However, such a prototype should not be mistaken for an actual product! If you've hacked together a prototype, in my experience it's always more expensive to clean up the mess than it is to take a step back and start a project from scratch or a stable production code base.

What the prototype gives you is a solid sense of requirements and some insight into pitfalls in design.

A well executed incremental deployment strategy can be a compromise to iteratively add functionality if you don't know all your requirements up front. But an well-run Agile project is not what I'm talking about when I say "undisciplined prototype." A cool proof of concept can be very valuable. It should not be mistaken for production code.

(4) Testing improves software quality (FALSE!)

If there are code quality problems (possibly caused by trying to bring an undisciplined prototype to market), the usual hammer that is brought to bear is more testing. Nobody ever solved code quality problems by testing. All that testing does is make buggy code a little less buggy. If you've got spaghetti code that is full of bugs, testing can't possibly fix that. And testing will generally miss most subtle timing bugs and non-obvious edge cases.

If you're seeing lots of bugs in system test, your best bet is to use testing to find bug farms. The 90/10 rule applies: many times 90% of the bugs are in bug farms -- the worst 10% of the modules. That's only an approximate ratio, but regardless of the exact number, if you're seeing a lot of system test failures then there is a good chance some modules are especially bug-prone. Generally the problem is not simply programming errors, but rather poor design of these bug-prone modules that makes bugs inevitable. When you identify a bug farm, throw the offending module away, redesign it clean, and write the code from scratch. It's tempting to think that each bug is the last one, but after you've found more than a handful of bugs in a module, who are you kidding? Especially if it's spaghetti code, bug farms will always be one bug away from being done, and you'll never get out of system test cleanly.

(5) Peer review is too expensive (FALSE!)

Many, many projects skip peer review to get to completed code (see item #1 above). They feel that they just don't have time to do peer reviews. However, good peer reviews are going to find 50-75% of your bugs before you ever get to testing, and do so for about 10% of your development budget. How can you not afford peer reviews? (Answer: you don't have time to do peer reviews because you're too busy writing bugs!)

Have you run into another management misconception on a par with these? Let me know what you think!

Friday, September 22, 2017

Challenges and Solutions in Autonomous Vehicle Validation

Here are the slides from my AV17 Presentation on self-driving car safety:

Challenges and Solutions in Autonomous Vehicle Validation from Philip Koopman

Monday, August 28, 2017

The Spaghetti Factor -- A Software Complexity Metric Proposal

I've had to review code that has spaghetti-level complexity in control flow (too high cyclomatic complexity). And I've had to review code that has spaghetti-level complexity its data flow (too many global variables mixed together into a single computation). And I've had to review procedures that just go on for page after page with no end in sight. But the stuff that will really make your brain hurt is code that has all of these problems.

There are many complexity metrics out there. But I haven't seen a one that directly tries to help balance three key points of complexity into a single intuitive number: code complexity, data complexity, and module size. So here is a proposal that could help drive improvement in a lot of the terrible embedded control code I've seen:

The Spaghetti Factor metric (SF):

SF = SCC + (Globals*5) + (SLOC/20)

SCC = Strict Cyclomatic Complexity
Globals = # of read/write global variables
SLOC = # source lines of non-comment code (e.g., C statements)

Scoring:
5-10 - This is the sweet spot for most code except simple helper functions
15 - Don't go above this for most modules
20 - Look closely; review to see if refactoring makes sense
30 - Refactor the design
50 - Untestable; throw the module away and fix the design
75 - Unmaintainable; throw the module away; throw the design away; start over
100 - Nightmare; probably you need to throw the whole subsystem away and re-architect it

Notation:

SCC is Strict Cyclomatic Complexity (sometimes called CC2). This is a variant of McCabe Cyclomatic complexity (MCC). In general terms, MCC is based on the number of branches in the program. SCC additionally considers complexity based on the number of conditions within each branch. SCC is an approximation of how many test cases it takes to exercise all the paths through code including all the different ways there are to trigger each branch. In other words, it is an estimate of how much work it is to do unit testing. Think of it as an approximation to the effort required for MC/DC testing. But in practice it is also a measure of how hard it is to understand the code. The idea is to keep SCC low enough that it is feasible to understand and test paths through the code.

Globals is the number of read/write global variables accessed by the module. This does not include "const" values, nor file static variables. In an ideal world you have zero or near-zero global variables. If you have inherent global state, you should encapsulated that in a state object with appropriate access functions to enforce well-disciplined writes. Referencing an unstructured pile of dozens or hundreds of global variables can make software difficult to test, and can make subsystem testing almost impossible. Partly that is due to the test scaffolding required, but partly that is simply due to the effort of chasing down all the globals and trying to figure out what they do both inbound and outbound. Moreover, too many globals can make it nearly impossible to chase down bugs or understand the effects of changing one part of the code on the rest of the code. An important goal of this part of the metric is to discourage use of many disjoint global variables to implicitly pass data around from routine to routine instead of passing parameters along with function calls.

SLOC is the number of non-comment "Source Lines of Code." For C programs, this is the number of programming statements. Typical guidelines are a maximum 100-225 maximum lines of code for a module, with most modules being smaller than that.

As an example calculation, if you have 100 lines of code with an SCC of 9 and 1 global reference, your score will be SF = 9 + (1*5) + (100/20) = 19. A score of 19 is on the upper edge of being OK. If you have a distribution of complexity across modules, you'd want most of them to be a bit lower in complexity than this example calculation.

Discussion:

The guideline values are taken primarily from MCC, which typically has a guideline of 10 for most modules, 15 as a usual bound, and 30 as limit. To account for globals and length, based on my experience, I've changed that to 15 for most modules, 20 as a soft limit and 30 as a hard limit. You might wish to adjust the threshold and multipliers based on your system and experience. In particular it is easy to make a case that these limits aren't strict enough for life-critical software, and a case can be made for being a little more relaxed in throw-away GUI management code. But I think this is a good starting point for most every-day embedded software that is written by a human (as opposed to auto-generated code).

The biggest exception is usually what to do about switch statements. If you exempt them you can end up with multiple switches in one module, or multiple switch/if/switch/if layered nesting. (Neither is a pretty sight.) I think it is justifiable to exempt modules that have ONLY a switch and conditional logic to do sanity checking on the switch value. But, because 30 is a pretty generous limit, you're only going to see this rarely. Generally the only legitimate reason to have a switch bigger than that is for something like processing a message type for a communication protocol. So I think you should not blanket exempt switch statements, but rather include them in an overall case-by-case sign-off by engineering management as to which few exceptions are justifiable.

Some might make the observation that this metric discourages extensive error checking. That's a different topic, and certainly the intent is NOT to discourage error checking. But the simple answer is that error checking has to be tested and understood, so you can't simply ignore that part of the complexity. One way to handle that situation is to put error checking into a subroutine or wrapper function to get that complexity out of the way, then have that wrapper call the actual function that does the work. Another way is to break your overall code down into smaller pieces so that each piece is simple enough for you to understand and test both the functionality and the error checking.

Finally, any metric can be gamed, and that is surely true of simple metrics like this one. A good metric score doesn't necessarily mean your code is fantastic. Additionally, this metric does not consider everything that's important, such as the total number of globals across your code base. On the other hand, if you score poorly on this metric, most likely your code is in need of improvement.

What I recommend is that you use this metric as a way to identify code that is needlessly complex. It is the rare piece of code indeed that unavoidably needs to have a high score on this complexity metric. And if all your code has a good score, that means it should be that much easier to do peer review and unit testing to ensure that other aspects of the code are in good shape.

References:

Blog post on cyclomatic complexity

Jack Ganssle on how to compute MCC
Description of MCC variants, including SCC (CC2)

Blog post on global variables
Blog post on modularity

A NIST paper on applying metrics is here: http://www.mccabe.com/pdf/mccabe-nist235r.pdf including an interesting discussion of the pitfalls of handling switch statements within a complexity framework.

Monday, July 24, 2017

Don't use macros for MIN and MAX

It is common to see small helper functions implemented as macros, especially in older C code. Everyone seems to do it. But you should avoid macros, and instead use inline functions.

The motivation for using macros was originally that you needed to use a small function in many places but were worried about the overhead of doing a subroutine call. So instead, you used a macro, which expands into source code in the preprocessor phase. That was a reasonable tradeoff 40 years ago. Not such a great idea now, because macros cause problems for no good reason.

For example, you might look on the Web and find these common macros

#define MAX(a,b) ((a) > (b) ? a : b)

#define MIN(a,b) ((a) < (b) ? a : b)

And you might find that it seems to work for a while. You might get bitten by the missing "()" guards around the second copy of a and b in the above -- which version you get depends on which cut & paste code site you visit.

But then you'll find that there are still weird situations where you get unexpected behavior. For example, what does this do?

c = MAX(a++, b);

If a is greater than b executing the code will increment a twice, but if a is less than or equal to b it will only increment a once. And if you start mixing types or putting complicated expressions into the macro things can get weird and buggy in a hurry.

Another related problem is that the macro will expand, increasing the cyclomatic complexity of your code. That's because a macro is equivalent to you having put the conditional branch into the source code. (Remember, macro expansion is done by the preprocessor, the so compiler itself acts as if you'd typed the conditional assignment expression every place you use the macro.) This complexity rating is justified, because there is no actual procedure that can be unit tested independently.

As it turns out, macros are evil. See the C++ FAQ: https://isocpp.org/wiki/faq/misc-technical-issues#macros-with-if which lists 4 different types of evil behavior. There are fancy hacks to try to get any particular macros such as MIN and MAX to be better behaved, but no matter how hard you try you're really just making a deal with the devil.

What's the fix?

The fix is: don't use macros. Instead use inline procedure calls.

You should already have access to built-in functions for floating point such as fmin() and fmax(). If it's there, use the stuff from your compiler vendor instead of writing it yourself!

If your compiler doesn't have integer min and max, or you are worried about breaking existing macro code, convert the macros into inline functions with minimal changes to your code base:

inline int32_t MAX(int32_t a, int32_t b) { return((a) > (b) ? a : b); }

inline int32_t MIN(int32_t a, int32_t b) { return((a) < (b) ? a : b); }

If you have other types to deal with you might need different variants depending on the types, but often a piece of code uses predominantly one data type for its calculations, so in practice this is usually not a big deal. And don't forget, if your build environment has a built in min or max you can just set up the macro to call that directly.

What about performance?

The motivation for using macros back in the bad old days was efficiency. A subroutine call involved a lot of overhead. But the inline keyword tells the compiler to expand the code in-place while retaining all the advantages of a subroutine call. Compilers are pretty good at optimization these days. So there is no overhead at run-time. I've also seen advice to put the inline function in a header file so it will be visible to any procedure needing it, and the macro was already there anyway.

Strictly speaking, "inline" is a suggestion to the compiler. However, if you have a decent compiler it will follow the suggestion unless the inline function is so big the call overhead just doesn't matter. Some compilers have a warning flag that will let you know when the inline didn't happen. For example, use -Winline for gcc. If your compiler ignores "inline" for something as straightforward as MIN or MAX, get a different compiler.

What about multiple types?

A perceived advantage of the macro approach is that you can play fast and loose with types. But playing fast and loose with types is a BAD IDEA because you'll get bugs.

If you really hate having to match the function name to the data types then what you need is to switch to a language that can handle this by automatically picking the right function based on the operator types. In other words, switch from a to a language that is 45 years old (C) to one that is only about 35 years old (C++). There's a paper from 1995 that explains this in the context of min and max implemented with templates: http://www.aristeia.com/Papers/C++ReportColumns/jan95.pdf

As it turns out the rabbit hole goes a lot deeper than you might think for a generic solution.

But you don't have to go down the rabbit hole. For most code the best answer is simply to use inline functions and pick the function name that matches your data types. You shouldn't lose any performance at all, and you'll be likely to save a lot of time chasing obscure bugs.

Monday, May 22, 2017

#define vs. const

Is your code full of "#define" statements? If so, you should consider switching to the const keyword.

Old school C:

#define MYVAL 7

Better approach:

const uint32_t myVal = 7;

Here are some reasons you should use const instead of #define:

#define has global scope, so you're creating (read-only) global values every time you use #define. Global scope is evil, so don't do that. (Read-only global scope for constant values is a bit less evil than global variables per se, especially if you can't use the namespace features of C++. But gratuitous global scope is always a bad idea.) A const alternative can obey scoping rules, including being purely local if defined inside a procedure, or more commonly file static with the "static" keyword.
Const lets you do more aggressive type checking (depending upon your compiler and static analysis tools, especially if you use a typedef more specific than built-in C data types). While C is a bit weak as a language in this area compared to other languages, a classical example is a const lets you identify a number as being in feet or meters, while the #define approach is just as if you'd typed the number 7 in with no units. The #define approach can bite you if you use the wrong value in the wrong place. Type checking is an effective way to find bugs, and using #define gives up an opportunity to let static analysis tools help you with that.
Const lets you use the value as if it were a variable when you need to (e.g., passing an address to the variable) without having to change how the variable is defined.
#define in general is so bug-prone that you should minimize its use just to avoid having to spend time asking "is this one OK?" in a peer review. Most #define uses tend to be const variables in old-school code, so getting rid of them can dramatically reduce the peer review burden of sifting through hundreds of #define statements to look for problems.

Here are some common myths about this tradeoff. (Note that on some systems these statements might be true, especially if you have and old and lame compiler. But they don't necessarily have to be true and they often are false, especially on newer chips with newer compilers.)

"Const wastes memory." False if you have a compiler that is smart enough to do the right thing. Sure, if you want to pass a pointer to the const it will actually have to live in memory somewhere, but you can't even pass a pointer to a #define at all. One of the points of "const" is to give the compiler a hint that lets it optimize memory footprint.
"Const won't work for X." Generally false if you have a newer compiler, and especially if you are using a mostly-C subset of the capability of a C++ compiler, as is increasingly common. And honestly, most of the time #define is just being used as a plain old integer const to get rid of magic numbers. const will work fine. (If you have magic numbers instead of #define, then you have bigger problems than this even.) Use const for the no-brainer cases. Something is probably wrong if everything about your code is so special you need #define everywhere.
"Const hassles me about type conversions." That's a feature to prevent you from being sloppy! So strictly speaking the compiler doing this is not a myth. The myth is that this is a bad thing.

There are plenty of discussions on this topic. You'll also see that some folks advocate using enums for some situations, which we'll get to another time. For now, if you change as many #defines as you can to consts then that is likely to improve your code quality, and perhaps flush out a few bugs you didn't realize you had.

Be careful when reading discussion group postings on this topic. There is a lot of dis-information out there about performance and other potential tradeoff factors, usually based on statements about 20 year old versions of the C language or experiences with compilers that have poor optimization capability. In general, you should always use const by default unless your particular compiler/system/usage presents a compelling case not to.

See also the Barr Group C coding standard rule 1.8.b which says to use const, and has a number of other very useful rules.

Monday, May 8, 2017

Optimize for V&V, not for writing code

Geralt / CC0 PD/noattrib.

Writing code should be made more difficult so that Verification &Validation can be made easier.

I first heard this notion years ago at a workshop in which several folks from industry who build high assurance software (think flight controls) stood up and said that V&V is what matters. You might expect that from flight control folks, but their reasoning applies to pretty much every embedded project. That's because it is a matter of economics.

Multiple speakers at that workshop said that aviation software can require 4 or 5 hours of V&V for every 1 hour of creating software. It makes no economic sense to make life easy for the 1 hour side of the ratio at the expense of making life painful for the 5 hour side of the ratio.

Good, but non-life-critical, embedded software requires about 2 hours of V&V for every 1 hour of code creation. So the economic argument still holds, with a still-compelling multiplier of 2:1. I don't care if you're Vee, Agile, hybrid model or whatever. You're spending time on V&V, including at least some activities such as peer review, unit test, created automated tests, performing testing, chasing down bugs, and so on. For embedded products that aren't flaky, probably you spend more time on V&V than you do on creating the code. If you're doing TDD you're taking an approach that has the idea of starting with a testing viewpoint built in already, by starting from testing and working outward from there. But that's not the only way to benefit from this observation.

The good news is that making code writing "difficult" does not involve gratuitous pain. Rather, it involves being smart and a bit disciplined so that the code you produce is easier for others to perform V&V on. A bit of up front thought and organization can save a lot on downstream effort. Some examples include:

Writing concise but helpful code comments so that reviewers can understand what you meant.
Writing code to be obvious rather than clever, again to help reviewers.
Follow a style guide to make your code consistent, and thus easier to understand.
Writing code that compiles clean for static analysis, avoiding time wasted finding defects in test that a tool could have found, and avoiding a person having to puzzle out which warnings matter, and which don't.
Spending some time to make your unit interfaces easier to test, even if it requires a bit more work designing and coding the unit.
Spending time making it easy to trace between your design and the code. For example, if you have a statechart, make sure the statechart uses names that map directly to enum names rather than using arbitrary state variables such as "magic number" integers between 1 and 7. This makes it easier to ensure that the code and design match. (For that matter, just using statecharts to provide a guide to what the code does also helps.)
Spending time up front documenting module interaction so that integration testers don't have to puzzle out how things are supposed to work together. Sequence diagrams can help a lot.
Making the requirements both testable and easy to trace. Make every requirement idea a stand-alone sentence or paragraph and give it a number so it's easy to trace to a specific test primarily designed to test that particular requirement. Avoid having requirements in huge paragraphs of free-form text that mix lots of different concepts together.

Sure, these sound like a good idea, but many developers skip or skimp on them because they don't think they can afford the time. They don't have time to make their code clean because they're too busy writing bugs to meet a deadline. Then they, and everyone else, pay for this during the test cycle. (I'm not saying the programmers are necessarily the main culprits here, especially if they didn't get a vote on their deadline. But that doesn't change the outcome.)

I'm here to say you can't afford not to follow these basic code quality practices. That's because every hour you're saving by cutting corners up front is probably costing you double (or more) downstream by making V&V more painful than it should be. It's always hard to invest in downstream benefits when the pressure is on, but doing so is costing you dearly when you skimp on code quality.

Do you have any tricks to make code easier to understand that I missed?

Monday, April 10, 2017

Challenges & solutions for Embedded Software Security, Safety & Quality (Full Tutorial Video)

This is a full-length video that talks about embedded software security, safety and quality: why it matters. What to do about it.

Embedded Software Quality Safety and Security [ECR]

The purpose of this video is to help you understand why safety and security are such a big deal for embedded systems, tell some war stories, and explain the general ways available to reduce risk when you're creating embedded and IoT products.

Topics covered include:

Case studies of safety and security problems
How to design for safety
How to design for security
Top 10 embedded software warning signs
How to create high quality embedded software

(27 Slides / 45 minutes)

Slides Only:

Embedded Software Security Safety & Quality from Philip Koopman

Monday, March 27, 2017

Safety Architectural Patterns (Preview)

Here's a summary video on Safety Architectural Patterns:

Safety Architecture Patterns Preview [ECR]

Other pointers on this topic (my blog posts unless otherwise noted):

Monitor actuator pattern

For more about Edge Case Research and how to subscribe to our video training channel, please see this Blog posting.

Monday, March 20, 2017

Critical System Isolation (Preview)

Here's a summary video on Critical System Isolation:

Critical System Isolation Preview [ECR]

Other pointers on this topic (my blog posts unless otherwise noted):

Self-monitoring doesn't mitigate single points of failure

For more about Edge Case Research and how to subscribe to our video training channel, please see this Blog posting.

Monday, March 13, 2017

Redundancy Management for System Safety (Preview)

Here's a summary video on Redundancy Management:

Redundancy Management for Critical Systems Preview [ECR]

Other pointers on this topic (my blog posts unless otherwise noted):

No single points of failure

Self monitoring can't mitigate a single point of failure

Monitor Actuator pattern

For more about Edge Case Research and how to subscribe to our video training channel, please see this Blog posting.

Monday, February 27, 2017

Critical System Design (Preview)

Here's a summary video on Critical System Design techniques.

Critical Systems Preview [ECR]

Other pointers on this topic (my blog posts unless otherwise noted):

Random hardware faults in general

Other pointers:

Extensive list of techniques that can be used depending upon the SIL and safety standard that paplies to your system (IEC 61508-7)

For more about Edge Case Research and how to subscribe to our video training channel, please see this Blog posting.

Monday, February 20, 2017

Embedded System Dependability (Preview)

Here's a summary video on Embedded System Dependability.

Dependability Tutorial Preview [ECR]

Other pointers on this topic (my blog posts unless otherwise noted):

Other pointers

Avizienis, Laprie, Randell & Landwehr, "Basic Concepts and Taxonomy of Dependable and Secure Computing," Jan 2004, (authoritative terminology reference).
Dependable Systems & Networks conference (DSN)
IFIP WG 10.4 dependability workshop presentation slides

For more about Edge Case Research and how to subscribe to our video training channel, please see this Blog posting.

Monday, February 13, 2017

Safety Requirements for Embedded Systems (Preview)

Here's a summary video on Embedded System Safety Requirements.

Safety Requirements Preview [ECR]

Other pointers on this topic (my blog posts unless otherwise noted):

For more about Edge Case Research and how to subscribe to our video training channel, please see this Blog posting.

Monday, February 6, 2017

Embedded Software Safety Plan (Preview)

Here's a summary video on creating an embedded Software Safety Plan. (See additional pointers below.)

Safety Plan Preview [ECR]

Other pointers on this topic (my blog posts unless otherwise noted):

Other pointers:

Medical device regulatory environment (Micrium blog)
Software safety standards and guidebooks system-safety.org
International safety standards via law.resource.org (e.g., look for IEC 61508 in their industrial safety standards) -- very useful for teaching access to actual international standards, although sometimes older versions

For more about Edge Case Research and how to subscribe to our video training channel, please see this Blog posting.

Monday, January 30, 2017

Autonomous Vehicle Safety: An Interdisciplinary Challenge

Autonomous Vehicle Safety: An Interdisciplinary Challenge

By Phil Koopman & Mike Wagner

Abstract:
Ensuring the safety of fully autonomous vehicles requires a multi-disciplinary approach across all the levels of functional hierarchy, from hardware fault tolerance, to resilient machine learning, to cooperating with humans driving conventional vehicles, to validating systems for operation in highly unstructured environments, to appropriate regulatory approaches. Significant open technical challenges include validating inductive learning in the face of novel environmental inputs and achieving the very high levels of dependability required for full-scale fleet deployment. However, the biggest challenge may be in creating an end-to-end design and deployment process that integrates the safety concerns of a myriad of technical specialties into a unified approach.

Read the preprint version here for free (link / .pdf)

Official IEEE version (subscription required):
http://ieeexplore.ieee.org/document/7823109/
DOI: 10.1109/MITS.2016.2583491

IEEE Intelligent Transportation Systems Magazine (Volume: 9, Issue: 1, Spring 2017, pp. 90-96)

Correction:
"This would require a safety level of about 1 billion operating hours per catastrophic event. (FAA 1988)" should be
"This would require a safety level of about 1 billion operating hours per catastrophic event due to the failure of a particular function. (FAA 1988)" (Note that in this context a "function" is something quite high level such as the ability to provide sufficient thrust from the set of jet engines mounted on the airframe.)

Monday, January 23, 2017

Embedded System Safety Overview (Preview)

Here's a summary overview video on Embedded System Safety. (See additional pointers below.)

https://youtu.be/Ul0tN_EUnqY

Other pointers on this topic (my blog posts unless otherwise noted):

On-line resources:

UK MOD, An Introduction to System Safety Management in the MOD, January 2011.

RISKS Digest (archive of discussions on computing risks)
Safety critical mailing list (international safety experts discuss almost everything; includes on-line archives)

John Knight's book: Fundamentals of Dependable Computing for Software Engineers (2012) is an excellent current book on software dependability and safety.

Nancy Leveson has some great publications in the area of software safety, and is credited for developing this as an academic field. Anyone doing software safety should read at least these:

Software Safety: why, what, and how (a shorter, earlier version of the material in the Safeware book); pay to download from ACM, or sometimes you can find a free copy on-line if you don't have a subscription (1986)
The Therac-25 Accidents (1993)
High-Pressure Steam Engines and Computer Software (1992)
Safeware (book on amazon; 1995)
Engineering a safer world (free on-line book) (2011)

For more about Edge Case Research and how to subscribe to our video training channel, please see this Blog posting.

Thursday, January 12, 2017

Guest on Embedded.fm Podcast

Elecia & Chris invited me to chat with them on this week's Embedded.fm podcast and it was a lot of fun.

You can check out my episode here:

http://embedded.fm/episodes/183

Also, I highly recommend listening to Jack Ganssle's excellent episode 53: "Being a grownup engineer"

http://embedded.fm/episodes/53

Scroll through the episode list. I'm episode 183 so you can tell they've been at this quite a while. There's a lot of great stuff to listen to.

Note added Tue. 1/17: books are back in stock in Amazon.

Meanwhile, if you are ordering from the US, the best deal on the book is via paypal here: http://koopman.us/

Monday, January 9, 2017

Language Use (Coding Style for Compilers) Overview Video

Here's a summary video on Language Use (Coding Style for Compilers) which is half of the topic of coding style.

Other pointers on this topic (my blog posts unless otherwise noted):