Monday, December 2, 2013

Why You Need A Software-Specific Test Plan

Summary: Pretty much every embedded system goes through some sort of system test. But, you need to do software-specific testing in addition to that.  For example, what if you ship a system with the watchdog timer accidentally turned off?



In essentially every embedded system there is some sort of product testing. Typically there is a list of product-level requirements (what the product does), and a set of tests designed to make sure the product works correctly. For many products there is also a set of tests dealing with fault conditions (e.g., making sure that an overloaded power supply will correctly shed load). And many companies think this is enough ... but I've found that such tests often fall short.

The problem is that there are features built into your software that are difficult or nearly impossible to exercise in traditional product-level testing. Take the watchdog timer for example. More than one developer has told me about a product (or at least one version of a product) that shipped with the watchdog timer accidentally turned off. How could this happen? Easy: a field problem is reported; a developer turns off the watchdog to do single-step debugging; the bug is found and fixed; the developer forgets to turn the watchdog back on; product test has no way to intentionally crash the software to see whether the watchdog is working; and the new software version ships with the watchdog timer still turned off.

Continuing with the watchdog example, how do you solve this? One way is to include user-accessible functions that exercise the watchdog timer by intentionally crashing the software. Sounds a bit dangerous, especially if you are worried about security. More likely you'll need to have some separate, special way of testing functions that you don't want visible to the end user. And you'll need a plan for executing those tests.

And ... well, here we are, needing a Software Test Plan in addition to a Product Test Plan. Maybe the software tests are done by the same testers who do product test, but that's not the point. The point is you are likely to need some strategy for testing things that are there not because the end product user manual lists them as functions, but rather because the software requirements say they are needed to provide reliability, security, or other properties that aren't typically thought of as product functions. ("Recovers from software crashes quickly" is typically not something you boast about in the user manual.) For similar reasons, the normal product testers might not even think to test such things, because they are product experts and not software experts.

So to get this right the software folks are going to have to work with the product testers to create a software-specific test plan that tests what the software requirements need to have tested, even if they have little directly to do with normal product functions. You can put it in product test or not, but I'd suggest making it a separate test plan, because some tests probably need to be done by testers who have particular skill and knowledge in software internals beyond ordinary product testers. Some products have a "diagnostic mode" that, for example, sends test messages on a network interface. Putting the software tests here makes a lot of sense.

But for products that don't have such a diagnostic mode, you might have to do some ad hoc testing before you build the final system by, for example, manually putting infinite loops into each task to make sure the watchdog picks them up. (Probably I'd use conditional compilation to do that -- but have a final product test make sure the conditional compilation flags are off for the final product!)
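
Here's a minimal sketch of that conditional compilation idea. The task and flag names are made up, and the details of how your watchdog gets kicked will differ; the point is that the test hook compiles away to nothing in the shipping build.

// Sketch only: a compile-time switch that lets a software test intentionally
// hang one task so the watchdog can be verified. WATCHDOG_TEST_TASK2 and
// Task2() are placeholder names -- adapt them to your own code base.
#define WATCHDOG_TEST_TASK2  0   // MUST be 0 in the shipping build

void Task2(void)
{
#if WATCHDOG_TEST_TASK2
  for(;;) { }    // hang on purpose; the watchdog should reset the system
#endif
  // ... normal Task2 work goes here ...
}

A final software test can then confirm both that the reset actually happens with the flag turned on and that no such flags are left on in the build you ship.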

Here are some examples of areas you might want to put in your software test plan:

  • Watchdog timer is turned on and stays turned on; product reboots as desired when it trips
  • Watchdog timer detects timing faults with each and every task, with appropriate recovery (need a way to kill or delay individual tasks to test this)
  • Tasks and interrupts are meeting deadlines (watchdog might not be sensitive enough to detect minor deadline misses, but deadline misses usually are a symptom of a deeper problem)
  • CPU load is as expected (even if it is not 100%, if you predicted an incorrect number it means you have a problem with your scheduling estimates)
  • Maximum stack depth is as expected
  • Correct versions of all code have been included in the build
  • Code included in the build compiles "clean" (no warnings)
  • Run-time error logs are clean at the end of normal product testing
  • Fault injection has been done for safety-critical systems to check for single points of failure (of course it can't be exhaustive, but if you find a problem you know something is wrong)
  • Exception handlers have all been exercised to make sure they work properly. (For example, if your code hits the "this can never happen" default in a switch statement, does the system do something reasonable, even if that means a system reset?)
Note that some of these are, strictly speaking, not really "tests." For example, making sure the code compiles free of static analysis warnings isn't done by running the code. But, it is properly part of a software test plan if you think of the plan as ensuring that the software you're shipping out meets quality and functionality expectations beyond those that are explicit product functions.

And, while we're at it, if any of the above areas aren't in your software requirements, they should be. Typically you're going to miss tests if there is nothing in the requirements saying that your product should have these capabilities.

If you have any areas like the above that I missed, please leave a comment.  I welcome your feedback!

Saturday, November 9, 2013

CRC and Checksum Tutorial Webinar

I've completed my FAA-sponsored look at CRC and Checksum performance for aviation systems. While some of the material is aircraft-specific, it is all relevant to safety critical systems and contains quite a bit of information about CRC vs. checksum performance.

I'm pleased to be able to share a two-hour Webinar recording describing all the work, as well as the slides and draft report. There is a guide to the material below that you may find useful if you are looking for a specific topic.

The general flow of the presentation is as follows, by slide number:
  • 6 - CRC and checksum background and terminology
  • 10 - Why picking the right CRC isn't as easy as you might think
  • 20 - Summary of project technical approach
  • 21 - Project results overview
  • 24 - Checksum results (Checksum, Fletcher Checksum, ATN-32)
  • 30 - CRC-32 compared to checksums
  • 31 - It's all about the Hamming Distance
  • 33 - Table of checksum compared to CRC-32
  • 35 - Good CRC choices (8 & 16 bits)
  • 36 - Good CRC choices (24 & 32 bits)
  • 37 - Aviation industry results (might be enlightening for everyone)
  • 43 - Multiple CRC & memory-specific CRC results
  • 46 - System level effects and cross-layer interactions (e.g., unprotected length field, bit encoding)
  • 48 - ARINC-825 / CAN issues
  • 52 - 8b10b encoding Gbit Ethernet issue
  • 53 - Mapping to criticality levels (how do you know your CRC/checksum is good enough?)
  • 64 - Determining Bit Error Ratio (BER)
  • 71 - Error detection mechanism failure and scrubbing
  • 74 - A simple recipe for how to determine how big and capable a CRC/checksum you need
  • 76 - CRC/Checksum seven deadly sins (bad ideas)
  • 79 - Review of the study project subtasks (most readers can skip this part)

I owe significant thanks to my Co-Investigators Kevin Driscoll and Brendan Hall at Honeywell labs, without whom this work could not have been done. And also a warm and sincere thanks to our FAA contacts, and especially to Chuck Kilgore for his good spirits, kind words, and unfailing support.

NOTES:

"The provided links will give you the option of downloading the video file or view it through your web browser.   Some people have reported problems viewing the video with their web browser due to the 2 hour length.  Others have been able to view the video in a browser without any problems.   If you encounter problems, we recommend that you download the file for offline viewing if possible to avoid any streaming issues."

"This work was supported by the Federal Aviation Administration, Aircraft Certification Service, and Assistant Administration for NextGen, William J. Hughes Technical Center, Aviation Research Division, Atlantic City International Airport, New Jersey 08405. The findings and conclusions in this presentation are those of the author(s) and do not necessarily represent the views of the funding agency.  This presentation does not constitute FAA policy."

The Webinar was originally scheduled for Oct 1, 2013, but was delayed due to the government shutdown. It was actually held on Oct 29, 2013.

Monday, October 21, 2013

Secrecy vs. Integrity and Why Encryption Might Be the Wrong Choice


Summary: Encryption doesn't solve all security problems.  In many cases you need authentication and integrity, not secrecy, and encryption can be the wrong tool for the job.  In those cases you need a Message Authentication Code, not encryption.



It's pretty typical to see embedded system designers use encryption to solve security problems. And it's also common for that to be the wrong answer to the real problem.

To understand why, consider a simplistic security need. (This example is naive in many ways, but serves to illustrate a point.) Let's say you want to set a light bulb intensity to one of 256 levels, and you want to make sure that only an authorized person can set that level.  To do this with no security, you'd send a message on an embedded network to that light bulb:

Message =  BulbLevel               (where BulbLevel is an 8 bit unsigned integer)

OK, so now you want to encrypt things.  You compute  X = Encrypt(BulbLevel) with a shared secret key that only you and the light bulb know and send that in the message:

Encrypted Message = X              (still 8 bits)

Now there is no way for anyone to know what level you've sent -- you've accomplished secrecy.  (Ignore all those attacks that just came into your head, such as recording past messages and playing them back .. or peeking to see what the light bulb did when it received the message ... this is just an illustrative example.) But is secrecy what you really wanted?   Remember our goal in this example wasn't to keep it a secret what the level was, but rather to prevent someone unauthorized from setting the light bulb level.

What if an adversary just sent random garbage:

Encrypted Message = RANDOM_8_bits         (still 8 bits)

That would set the light bulb output to some value. Maybe not a desired value, but the attacker would still be able to change the light level to something other than what you commanded without knowing the secret key (with probability 255/256 that it isn't just a repeat of the value already there), which is exactly what you're trying to prevent.

The issue is that encryption is the wrong tool for the job.  What you really want is some combination of authentication (I know the right person sent it) and integrity (I know the contents have not been altered). Encryption isn't the best tool for this -- what you want is a Message Authentication Code (MAC). One way to get this is to compute an appropriately chosen secure hash function and append its results to the message:

Authenticated Message = BulbLevel  |  SecureHash         (8 bits concatenated with a hash value)

Both you and the light bulb still have a shared secret key with this approach. The light bulb receiving the message computes its own hash of the BulbLevel value and compares it to the received SecureHash value in the message. If they match, the message is authentic and it takes action. If they don't match, then it is a forgery and the message is ignored. Note that the BulbLevel isn't a secret -- it's transmitted "in the clear." That's because the point of this isn't secrecy; it's authentication (the sender knows the secret key to the cryptographic hash function) and integrity (the hash matches the value expected from the BulbLevel value, so the message hasn't been tampered with). If the hash value is sufficiently large, the chance of someone guessing the right hash value for a maliciously sent bulb level can be made as small as you like -- low enough to tolerate over the lifetime of the light bulb -- giving an arbitrarily good probabilistic level of security.
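
To make the idea concrete, here is a minimal receiver-side sketch. ComputeMac() is a hypothetical wrapper around whatever keyed hash (for example, an HMAC) your crypto library actually provides, and the 8-byte MAC length is just an illustration -- size the real MAC based on your threat model.

#include <stdint.h>
#include <stddef.h>
#include <stdbool.h>

#define MAC_LEN 8   // illustration only; pick the real length for your threat model

// Hypothetical: wraps the keyed hash (e.g., HMAC) from your crypto library.
void ComputeMac(const uint8_t *key, size_t key_len,
                const uint8_t *msg, size_t msg_len,
                uint8_t mac_out[MAC_LEN]);

// Receiver side: act on the bulb level only if the MAC checks out.
bool BulbMessageIsAuthentic(const uint8_t *key, size_t key_len,
                            uint8_t bulb_level,
                            const uint8_t received_mac[MAC_LEN])
{ uint8_t expected[MAC_LEN];
  uint8_t diff = 0;
  ComputeMac(key, key_len, &bulb_level, 1, expected);
  for (size_t i = 0; i < MAC_LEN; i++)      // constant-time compare
  { diff |= (uint8_t)(expected[i] ^ received_mac[i]);
  }
  return (diff == 0);   // true: authentic, act on bulb_level; false: ignore
}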

There's another important benefit to using a MAC. Encryption for the purpose of keeping data secret tends to be export controlled. Message Authentication Codes tend not to be export controlled. So ditching encryption in favor of a MAC usually helps with export issues. (Read the rules and talk to your lawyer -- this is just a sweeping generalization.)

The overall message is: if you want to ensure authenticity and integrity and secrecy isn't a big deal, using encryption is barking up the wrong tree.

The fine print: OK, now for those who have begun composing comments about how naive the above schemes are ... yes, I know .. it was only an example. You don't actually do it that way for either approach. For example, you need a time stamp or something to prevent playback attacks. And that tends to help encryption do better because of the reduced chance of accidentally coming up with a plausible decrypted timestamp value (if the receiver is checking for plausible timestamps). And certainly encryption can be made to work if you are careful. But, when I've looked into this for real systems what I've found is that a MAC is often a better tradeoff for a number of reasons and tends to provide better authentication and integrity for a given computational cost and bandwidth cost in practical scenarios. And, I've found designs in the real world based on encryption that weren't going to work as well as the designers thought because they didn't get the details of authentication right. Also, just to make sure it's said ... a CRC is not cryptographically secure, so don't use it as a secure hash function.

Even after the fine print, the message remains: use a MAC if it does the job. Don't jump to a default strategy of encryption if secrecy isn't what you really need. And if you do decide to use encryption, make sure it is really providing authentication and integrity checking in addition to secrecy.


Monday, September 16, 2013

Getting Rid of Global Variables

Summary: Global variables are evil. Here is an example of how to get rid of many of them.



Global variables are well known to be evil -- and you can read all about why that is in my free sample book chapter by that name. This posting gives a running example of changes that fix a common type of global variable.

Let's start with a pretty typical situation in a C program. You have a "globals.c" file that defines a mess of globals, including:
   int g_ErrCount;
which might be used to tally the number of run-time errors seen by the system. I've used a "g_" naming convention to emphasize that it is a global, which means that every .c file in the program can read and write this variable with wild abandon.

Let's say you also have the following places this variable is referenced, including globals.c just mentioned:
globals.c:    int g_ErrCount;      // define the variable
globals.h:    extern int g_ErrCount; // other files include this
init.c:       g_ErrCount = 0; // init when program starts
moduleX.c:   g_ErrCount++;   // tally another error
moduleY.c:    XVar = g_ErrCount; // get current number of errors
moduleZ.c:    g_ErrCount = 0;  // clear number of reported errors

There are all sorts of risks with this approach... but let's concentrate on fixing them instead of diving into the Globals Are Evil discussion.

The first thing we're going to do is collect all the error counter functions into a single module, ErrCount.c, which would contain error counting, error reporting, and so on. This gets rid of the need to define g_ErrCount in globals.c, giving the below. We've also changed to using ErrCount.h for the extern definition:
globals.c:       // not needed any more for this variable
ErrCount.c:    int g_ErrCount;      // define the variable
ErrCount.h:    extern int g_ErrCount; // other files include this
init.c:       g_ErrCount = 0; // init when program starts
moduleX.c:   g_ErrCount++;   // tally another error
moduleY.c:    XVar = g_ErrCount; // get current number of errors
moduleZ.c:    g_ErrCount = 0;  // clear number of reported errors

Now let's get rid of the initialization. Having a central init.c is asking for problems if you forget to call an initialization function. Also, having a separate init.c forces variables to be global. So let's initialize the variable where it is defined:
globals.c:       // not needed any more for this variable
ErrCount.c:    int g_ErrCount = 0;    // define and init variable
ErrCount.h:    extern int g_ErrCount; // other files include this
init.c:       // no longer needed
moduleX.c:   g_ErrCount++;   // tally another error
moduleY.c:    XVar = g_ErrCount; // get current number of errors
moduleZ.c:    g_ErrCount = 0;  // clear number of reported errors

Instead of having the variable be global, let's hide it as a static variable inside ErrCount.c. Using the "static" keyword in defining a variable outside a function makes it invisible to other .c files. This step results in the program being broken, because other .c files can't get at the static variable. (We've also renamed the variable without the "g_" prefix because it's not global any more.)
ErrCount.c:    static int ErrCount = 0; // only visible in this file
ErrCount.h:    // static variables are invisible outside .c file
moduleX.c:   g_ErrCount++;   // tally another error
moduleY.c:    XVar = g_ErrCount; // get current number of errors
moduleZ.c:    g_ErrCount = 0;  // clear number of reported errors

To fix the problem of other .c files no longer being able to see the variable, we're going to add some access functions to ErrCount.c that provide the ability to touch the value without making the variable global:
ErrCount.c:    static int ErrCount = 0; // only visible in this file
           inline void ErrCount_Incr() { ErrCount++; }
           inline int ErrCount_Get() { return(ErrCount); }
           inline void ErrCount_Reset() { ErrCount = 0; }
ErrCount.h:    inline void ErrCount_Incr();  // increment the count
           inline int ErrCount_Get();   // get current count value
           inline void ErrCount_Reset(); // reset count
           // Note that there is NO access to ErrCount directly
moduleX.c:      ErrCount_Incr();   // tally another error
moduleY.c:     XVar = ErrCount_Get(); // get current number of errors
moduleZ.c:     ErrCount_Reset();  // clear number of reported errors

And that's it -- we're there.  ErrCount is no longer a global variable. It is visible only inside ErrCount.c, and any accesses to the variable are performed via access functions that increment, read, and reset the value. Note that the keyword "inline" should, with a good compiler, make this code just as fast and efficient as the global variable version of the code -- except without actually having a global variable. In fact, what we've been doing is a C-based approach for making ErrCount into an object (the variable) with access methods to increment, read, and reset the object. Not quite as clean as you might see in C++, but it gets the job done with C syntax.

Some folks might just say this is sleight of hand. If it generates the same code, why bother?  Here are some reasons that at least some developers find it useful to take this approach:
  • Software authors can only perform intended functions specific to an error counter: increment, read, and reset. Setting to an arbitrary value isn't allowed. If you don't want the value changed other than via incrementing, you can just delete the reset function. This prevents some types of bugs from ever happening.
  • If you need to change the data type or representation of the counter, that change all happens inside ErrCount.c with no effect on the rest of the code. For example, if you find a bug with error counts overflowing, it is a lot easier to fix that in one place than every place that increments the counter! 
  • If you are debugging with a breakpoint debugger it is easier to know when the variable has been modified, because you can get rid of the "inline" keywords and put a breakpoint in the access functions. Otherwise, you need watchpoints, which aren't always available.
  • If different tasks in a multitasking system need to access the variable, then it is a lot easier to get the concurrency management right inside a few access functions than to remember to get it right everywhere the variable is read or written (get it right once, use those functions over and over). Don't forget to make the variable volatile and disable interrupts when accessing it if concurrency is an issue (a sketch follows this list).
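
As an illustration of that last bullet, here is a hedged sketch of what the access functions might look like once concurrency matters. DISABLE_INTERRUPTS() and ENABLE_INTERRUPTS() are placeholders for whatever critical section mechanism your compiler and hardware actually provide (they are not standard C):

ErrCount.c:    static volatile int ErrCount = 0; // volatile: shared with other tasks/ISRs

           void ErrCount_Incr(void)
           { DISABLE_INTERRUPTS();   // placeholder critical section macros
             ErrCount++;             // read-modify-write is now atomic
             ENABLE_INTERRUPTS();
           }

           int ErrCount_Get(void)
           { int value;
             DISABLE_INTERRUPTS();
             value = ErrCount;       // consistent snapshot even on small CPUs
             ENABLE_INTERRUPTS();
             return(value);
           }
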
I'm sure different readers have different ways of approaching this problem, and some globals are harder to get rid of than others. But I've seen a lot of code that is structured just like the "before" code. (I'm sure I must have written things that way myself in my misspent youth!) This approach cleans up a large fraction of globals with minimal pain and often no speed penalty.




Monday, August 19, 2013

Peer Review Spreadsheet

Summary: I've found peer reviews are only effective if they have tangible paperwork. Here's a minimalist approach that has worked for me in a classroom situation.



Good peer reviews can be very effective at finding bugs ... but bad peer reviews can be nearly useless. I teach a course that involves a semester-long software project and found it really difficult to get students to do proper peer reviews.  Basically, they were going through the motions but didn't produce much (i.e., didn't find many bugs). The materials I provided them emphasized checklists for what to look for, but not a lot of formality in reporting results, because I wanted to minimize paperwork.  That didn't work. Teams were only finding about 10% of their bugs in reviews, which is just not enough to be worth the effort.

So one year I changed things and made them use a variant of the below spreadsheet to report results. The results were dramatic. The first semester I used the spreadsheet, defects found via peer review went from about 10% to about 50% across the class, and have stayed there ever since. (Also, the effort profile changed dramatically for the better, but that's a topic for another posting.) In my experience, finding 50% or so of bugs in peer review is about right. So, while I'm not claiming ultimate academic rigor for this study, it seems based on this experience that adopting a spreadsheet like this can be effective at improving the quality of peer reviews.

Here is the spreadsheet (click to download the .XLS file):




The first few lines are to record the project name, date, artifact (e.g., file name of the code being reviewed), and names of the reviewers present.  # Bugs is filled out later. The artifact author is intentionally omitted because the records are about finding bugs, not blaming the author. The issue rows are a place to record issues found in the review in free text, usually only one or two sentences apiece.

The status fields are left blank during the review. However, within 24 hours after the review has taken place the author of the item being reviewed needs to update it to indicate "Fixed", "Not Fixed," or "Not a Bug." The idea here is that if it is easy to fix, the burden to record the bug and fix it is minimal -- it stays in the spreadsheet. But if a bug is too hard to fix immediately, it is "Not Fixed" and must be entered as a formal bug report into the bug tracking system (Bugzilla or otherwise). Some items turn out not to be bugs, and it is OK to record them as such (e.g., a feature request or a misunderstanding by the reviewers). When the program author has updated the status, the # Bugs line is updated to reflect the number of bugs actually found, and that number is rolled up to a project tracking spreadsheet.

This last piece about rolling up the # Bugs to a higher level of visibility is crucial. In my course I have the Teaching Assistants track the number of bugs found weekly for every project team and ask hard questions if the numbers are consistently zero. Really, that's all it took. Apparently if the teams know someone is watching, they'll look a bit harder to find bugs, and once they do, the process boot-straps into fairly effective reviews with minimal help from the course staff. It's worked several years in a row for me -- almost like a light switch had been flipped for the students in my class. Results have been pretty consistent since we started using the spreadsheet, at about 50% of bugs found in peer reviews across dozens of teams. It should be noted that we peer review not only code, but also tests, statecharts, sequence diagrams, and other artifacts for our projects, and the payoff in finding bugs early has been unmistakable. Of course I need to make a "Your Mileage May Vary" disclaimer here, but it's worked for me.

I'd be interested in hearing stories about simple ways to make reviews more effective from industry teams as well. Ideally each team gets solid training on a review process along with help on soft skills for review leaders. But realistically a lot of times a bunch of engineers are just tossed into a room and told to make it happen. Knowing tricks that help is not a bad thing.


Monday, July 22, 2013

Making Main Loop Scheduling Timing More Deterministic

Summary: Wasting time in a main loop scheduler can make testing system-level timing a whole lot easier.



It's common enough to see a main loop scheduler in an embedded system along the lines of the following:

for(;;)
{ Task1();
  Task2();
  Task3();
}

I've heard this referred to as a "static scheduler," "cyclic executive," "main loop scheduler," or "static non-preemptive scheduler" among other terms. Regardless of what you call it, the idea is simple: run all the tasks that need to be run, then go back and do it again until the system is shut down or reset. There might be interrupt service routines (ISRs) also running in the system.

The main appeal of this approach is that it is simple. You don't need a real time operating system and, even more importantly, it would appear to be difficult to get wrong. But, there's a little more to it than that...

The first potential problem is those pesky ISRs. They might have timing problems, cause concurrency problems, and so on. Those issues are beyond what I want to talk about today except for making the point that they can affect the execution speed of one iteration through the main loop in ways that may not be obvious. You should do timing analysis for the ISRs (chapter 14 of my book has gory details).  But for today's discussion we're going to assume that you have the ISRs taken care of.

The next problem is timing analysis of the main loop. The worst case response time for running a task is one trip through the loop. But how long that trip is might vary depending on the calculations each task performs and how much time ISRs steal. It can be difficult to figure out the absolute worst case execution time (you should try, but it might not be easy). But the really bad news is, even if you know the theoretical worst case timing you're unlikely to actually see it during testing.

Consider the tester trying to make sure the system will function with worst case timing. How do you get the above static scheduler to take the worst case path through the code a bunch of times in a row to see what breaks? It's a difficult task, and probably most testers don't have a way to pull that off. So what is happening is you are shipping product that has never been tested for worst case main loop execution time. Will it work?  Who knows.  Do you want to take that chance with 10,000 or 100,000 units in the field? Eventually one of them will see worst case conditions and you haven't actually tested what will happen.

Fortunately there is an easy way to mitigate this risk. Add a time-waster at the end of the main loop. The time waster should convert the above main loop, which runs as fast as it can, to a main loop that runs exactly once per a defined period (for example, once every 100 msec):

for(;;)
{ StartTimer(100);  // start a 100 msec countdown
  Task1();
  Task2();
  Task3();
  WaitForTimer(0);  // wait for the 100 msec countdown to reach 0
}

This is just a sketch of the code -- how you build it will depend upon your system. The idea is that you waste time in the WaitForTimer routine until you've spent 100 msec in the main loop, then you run the loop again. Thus, the main loop runs exactly once every 100 msec.  If the tasks run faster than 100 msec as determined by a hardware timer, you waste time at the end, waiting for the 100 msec period to be up before starting the next main loop iteration. If the tasks take exactly 100 msec then you just start the main loop again immediately. If the tasks run longer than 100 msec, then you should log an error or perform some other action so you know something went wrong.
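
Here is one way StartTimer() and WaitForTimer() might be fleshed out, assuming a hardware timer interrupt that fires once per millisecond. The names are made up for this sketch, and on a small CPU you would also need to protect multi-byte accesses to the countdown variable (see the discussion of shared variables elsewhere on this blog).

#include <stdint.h>

static volatile uint16_t CountdownMsec = 0;   // decremented by the timer ISR
static uint16_t OverrunCount = 0;             // how many times the loop ran long

void TimerTickISR(void)          // hooked to a 1 msec hardware timer interrupt
{ if (CountdownMsec > 0) { CountdownMsec--; }
}

void StartTimer(uint16_t period_msec)
{ CountdownMsec = period_msec;   // start the countdown for this loop iteration
}

void WaitForTimer(uint16_t threshold)
{ if (CountdownMsec <= threshold)   // countdown already expired: the loop ran long
  { OverrunCount++;                 // log it however your system records errors
  }
  while (CountdownMsec > threshold)
  { // waste time until the countdown reaches the threshold
  }
}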

The key benefit to doing this is to ensure that in testing the average timing behavior is identical to the worst case timing behavior. That way, if something works when the system is fast, but breaks when it actually takes 100 msec to complete the main loop, you'll see it right away in testing. A second benefit is that since you are actively managing the main loop timing, you have a way to know the timing ran a little long on some loops even if it isn't bad enough to cause a watchdog reset.






Monday, June 10, 2013

Seven Deadly Sins of CRCs and Checksums

We're wrapping up the final report for an FAA-sponsored study of CRC and Checksum performance for aviation applications, although the results in general apply to all uses of those error detection codes.

As part of our results we came up with an informal list of "Seven Deadly Sins" (bad ideas):
  1. Picking a CRC based on a popularity contest instead of analysis
    • This includes using “standard” polynomials such as IEEE 802.3
  2. Saying that a good checksum is as good as a bad CRC
    • Many “standard” polynomials have poor HD at long lengths
  3. Evaluating with randomly corrupted data instead of BER fault model
    • Any useful error code looks good on random error patterns vs. BER random bit flips
  4. Blindly using polynomial factorization to choose a CRC
    • It works for long dataword special cases, but not beyond that
    • Divisibility by (x+1) doubles undetected fraction on even # bit errors
  5. Failing to protect message length field
    • Results in pointing to data as FCS, giving HD=1
  6. Failing to pick an accurate fault model and apply it
    • “We added a checksum, so ‘all’ errors will be caught” (untrue!)
    • Assuming a particular standard BER without checking the actual system
  7. Ignoring interaction with bit encoding
    • E.g., bit stuffing compromises HD to give HD=2
    • E.g., 8b10b encoding – seems to be OK, but depends on specific CRC polynomial
(I haven't tried to map it onto the more traditional sin list... if someone comes up with a clever mapping I'll post it!)

Thanks to Kevin Driscoll and Brendan Hall at Honeywell for their work as co-investigators. You can read more about the research on my CRC and Checksum Blog.  That blog has more detailed postings, slide sets, and will have the final research report when it is made publicly available.


Saturday, May 25, 2013

Adding Prioritization to a Single-Level Interrupt Priority System


Summary of technique: Add a software structure that executes only the highest priority pending interrupt within the ISR polling loop. Then start again at the top of the polling loop instead of polling all possible ISRs. This gives you a prioritized non-preemptive interrupt service routine scheduler.

- - - - - - - - - - - - - - - - - - - -

With some microcontrollers, all of your interrupts come in at the same priority level (for example, via an external interrupt request pin). The usual thing to do in that case is create a polling loop to check all the sources of interrupts and see which one needs to be serviced by looking at peripheral status registers.  For example:
if(HWTimerTick)  { ... ISR to service hardware timer tick ... }
if(ADCReady)  { ... ISR to service A to D converter ... }
if(SerialPortDataInReady ) { ... ISR to read a serial port byte... }
if(SerialPortDataOutReady) { ... ISR to write a serial port byte ... }
...
(Of course this isn't real code ... I'm just sketching a flow that you've seen before if you've written this type of ISR that polls all the devices that can cause interrupts to see which one actually needs to be serviced.)

If only one of these devices is active, then this approach should work pretty well. And if you do system-level testing probably things will work fine -- at least most of the time.

But the way you can get into trouble is if one of the interrupts has a short deadline for being serviced. Let's say you have the above code and are seeing serial input bytes being dropped once in a while.  What could be happening?

One cause of dropping bytes might be that the HW Timer Tick and/or the ADC Ready interrupts are active at the same time that the serial port data input interrupt is ready. You need to execute them before you can get data from the serial port. If the sum of their two execution times is longer than the time between serial byte arrivals, you're going to take too long to get to the serial port input ISR and will drop bytes.

You might buy a faster processor (which might be unnecessary as we'll see), but before doing that you might reorganize the code to put the serial input first in the list of ISRs so you can get to it faster when an interrupt comes in:
if(SerialPortDataInReady ) { ... read a serial port byte... }
if(HWTimerTick)  { ... service hardware timer tick ... }
if(ADCReady)  { ... service A to D converter ... }
if(SerialPortDataOutReady) { ... write a serial port byte ... }

And that will *almost* work. Things might get a little better, but it won't cure the problem. (Or, MUCH worse, it will cure the problem in testing only to have the problem reappear in the field after you've shipped a lot of systems!)  Now when you get an interrupt you'll service the serial port input ISR first. But, then you'll go off and do the other ISRs. If those other ISRs take enough time, you will be stuck in those other ISRs too long and will miss the next byte -- you won't get back to the top of the list of ISRs in time.

You might try re-enabling interrupts inside any long ISRs to let the serial port get processed sooner. But resist the temptation -- that probably won't work, and will likely result in stack overflows due to recursive interrupt processing. (Simple rule: NEVER re-enable interrupts from inside an ISR.)

What we really need here is prioritization. And it's pretty easy to get even though we don't have hardware interrupt prioritization. All you have to do is (1) put the checks for each ISR in priority order, and (2) only execute the first one in the list each time you process interrupts. This can be done as follows:

if(SerialPortDataInReady ) { ... read a serial port byte... }
else if(HWTimerTick)  { ... service hardware timer tick ... }
else if(ADCReady)  { ... service A to D converter ... }
else if(SerialPortDataOutReady) { ... write a serial port byte ... }

Now only the first active interrupt will be serviced and the rest ignored. When you drop out of this structure and exit, any pending interrupt will re-trigger the checks from the beginning, again executing the highest priority interrupt that is still active (i.e., the first active one in the list). This will continue until all pending interrupts have been processed. You can use a "while" loop around the code above, or in many systems it may make sense just to exit interrupt processing and let the hardware interrupts re-trigger to re-run the polling code as a new interrupt.
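
Here's a sketch of the "while" loop variant, reusing the status flags from the example above. The flag names stand in for however your hardware exposes pending interrupt status, and the Service...() functions are placeholders for the ISR bodies sketched earlier.

// Keep dispatching until nothing is pending; each pass services only the
// highest priority active source (the first match in the list).
while( SerialPortDataInReady || HWTimerTick || ADCReady || SerialPortDataOutReady )
{ if(SerialPortDataInReady )        { ServiceSerialIn();  }
  else if(HWTimerTick)              { ServiceTimerTick(); }
  else if(ADCReady)                 { ServiceADC();       }
  else if(SerialPortDataOutReady)   { ServiceSerialOut(); }
}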


This approach means that the worst case delay between processing serial input bytes is no longer all the ISRs running (if all interrupts are active). Rather, the worst case is that the single longest ISR has just started running when a serial byte arrives; it completes, and then the serial port input ISR runs next. This happens because the list runs at most one ISR per pass rather than all of them. If that one ISR runs too long to meet deadlines, then it's probably too "fat" and should be simplified or its job moved out of ISRs and into the main loop.

There is no free lunch. The lowest priority ISR (the one at the end of the list) might starve. Making sure you meet all your ISR deadlines is trickier with this structure. Without the "elseif" approach the worst case timing is easy to compute -- it is the run time of all ISRs. But it might be too slow to live with. With this structure you have a nonpreemptive prioritized scheduling system for ISRs, and need to use suitable math and a suitable scheduling approach. Generally you'd want to use rate monotonic analysis (RMA) suitably adapted for the ISRs being non-preemptive. The analysis may be a little more complex, but this approach might help you salvage a situation in which you're missing deadlines and have already committed to a certain speed of microcontroller.


(Note on terminology: technically the whole thing is one big ISR that calls a different function depending upon what's active. But I'm calling each such function an ISR because that is really what it does ... you're using a software dispatcher to pick which ISR to run instead of hardware prioritization logic.)

Thursday, April 25, 2013

Why Short Interrupt Service Routines Matter


Most embedded systems I see use interrupts to handle high priority events, which are typically triggered by some peripheral device. So far so good. But it is also common for these systems to have significant timing problems even though their CPUs are not 100% loaded.

Let's take an example of three interrupts and see how this type of thing can happen.  Let's call their service routines IntH, IntM, and IntL (for high/medium/low priority), and assume this is a single-level interrupt priority system. By that I mean that these Interrupt Service Routines (ISRs) can't be interrupted by any of the others once they start executing.

Say that you write your software and you measure the idle task as taking 80% of the CPU, so the system is only about 20% loaded.  The most important ISR has highest priority, etc.  And maybe this time it works fine.  But eventually you'll run into a system which has timing problems.  The CPU is nowhere near 100% loaded; how could you have timing problems? To find out why, we need to dig deeper.

The first step is to measure the worst case (longest) execution time and worst case (fastest) period for each ISR.  Let's say it turns out this way:

IntH: Execution time = 10 msec       Period = 1000 msec
IntM: Execution time =  0.01 msec    Period =    1 msec
IntL: Execution time =  2 msec       Period =  100 msec

Let's take a look at the numbers. This task set is loaded at:  (10/1000) + (0.01/1) + (2/100) = 4%.
BUT it will miss deadlines! How can that be?

IntM and IntL are both going to miss their deadlines (if we assume deadline = period) periodically.  IntM will miss its deadline up to 10 times every time IntH runs, because the CPU is tied up for 10 msec with IntH, but IntM needs to run every 1 msec. So once per second IntM will miss its deadlines because it is starved by IntH.

OK, so maybe you saw that one coming.  But there is a more insidious problem here. IntM can also miss its deadline because of IntL.  Once IntL executes, it ties up the CPU for 2 msec, causing IntM to miss its 1 msec period. Even though IntL has a lower priority, once it runs it can't be interrupted, so it hogs the CPU and causes a deadline miss.

There are plenty of bandaids that can be tossed at this system (and I have the feeling I've seen them all in design reviews). The obvious hack of re-enabling interrupts partway through an ISR is dangerous and should not be used under any circumstance.  It leads to timing-dependent stack overflows, race conditions and so on. And more importantly, re-enabling interrupts in an ISR is, in my experience, a sign that the designers didn't understand the root cause of the timing problems.

But there is a principled way to solve these problems involving two general rules:
 - If possible, sort ISR and task priority by period; shortest period with highest priority.  This minimizes effective CPU use when you do scheduling. To understand why this is important you'll need to read up on Rate Monotonic Scheduling and related techniques.
 - Keep ISR worst case execution time as small as possible -- only a few handfuls of instructions.  If you need to get more done, dump data from the ISR into a buffer and kick off a non-ISR task to do the processing (a buffer handoff sketch follows this list). This prevents one ISR from making another miss its deadline and largely deflects the problem of ISRs not necessarily being assigned the priority you'd like in your particular hardware.
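
Here is a minimal sketch of that buffer handoff: a single-producer/single-consumer ring buffer where the ISR just grabs the byte and the main loop does the heavy lifting. The names and size are placeholders, and the single-byte indices are what keep the updates atomic on small CPUs.

#include <stdint.h>

#define RX_BUF_SIZE 64                        // placeholder size; must be <= 256
static volatile uint8_t RxBuf[RX_BUF_SIZE];
static volatile uint8_t RxHead = 0;           // written only by the ISR
static volatile uint8_t RxTail = 0;           // written only by the main loop

uint8_t ReadSerialDataRegister(void);         // hypothetical hardware access
void    ProcessByte(uint8_t b);               // hypothetical; the real work

void SerialRxISR(void)                        // short ISR: grab the byte and exit
{ uint8_t next = (uint8_t)((RxHead + 1u) % RX_BUF_SIZE);
  if (next != RxTail)                         // drop the byte if the buffer is full
  { RxBuf[RxHead] = ReadSerialDataRegister();
    RxHead = next;
  }
}

void MainLoopPollSerial(void)                 // called from the main loop
{ while (RxTail != RxHead)
  { uint8_t b = RxBuf[RxTail];
    RxTail = (uint8_t)((RxTail + 1u) % RX_BUF_SIZE);
    ProcessByte(b);
  }
}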

The key insight is that "important" and "priority" are not the same things. Priority is about making real time scheduling math work, and boils down to assigning highest priority to short-period and short-deadline tasks. Getting that to work in turn requires all ISRs (even low priority ones) to be short. The importance of an ISR from the point of view of functionality ("this function is more important to the customer") is largely irrelevant -- the point of real time scheduling is to make sure everything meets its deadline every time.  Sometimes "important" and "short deadline" correspond, but not always. It is the deadline that should be paid attention to when assigning priorities if you want to meet real-time deadlines.  (Or, put another way, "important" means real-time and unimportant means non-real-time.)

The discussion above also applies to systems with multiple levels of interrupt priorities. Within each level of priority (assuming one level can interrupt ISRs in another level), once a pig ISR starts, none of the other interrupts at that priority level can interrupt it.

Make all your ISRs short, and do the analysis to make sure the worst case clumping of ISR executions doesn't overload your CPU.

Monday, March 25, 2013

Rules for Using Interrupts

Here's a brief guide to rules for good interrupt design.

  • Keep your Interrupt Service Routine (ISR) short. Ideally half a page of C code max.  If you must use assembly code, keep it to one page max. Long ISRs cause timing problems, often in surprising ways.
  • Keep ISR execution time very short. 100-200 clock cycles tops, although there is room for discussion on the exact number. If you have a lot of work to do, shovel the data into a holding buffer and let the main loop or a non-ISR task do the rest.
  • Know the worst case ISR execution time so you can do real-time scheduling. Avoid loops, because these make worst case trickier, and an indefinite loop might hang once in a while due to something you didn't think of.
  • Actually do the real time scheduling, which is a bit tricky because ISRs are non-preemptive within the same ISR priority level. (My book chapter on this works out the math in gory detail.)
  • Don't waste time in an ISR (for example, don't put in a wait loop for some hardware response).
  • Save the registers you modify if your hardware doesn't already do that for you. (Seems obvious, but if you have a lot of registers it might take a lot of testing to catch the one place where a register is used in the main code and the ISR clobbers it.)
  • Acknowledge the interrupt source at the beginning of the ISR (right after you save registers). It makes code reviews easier if it is always in the same place.
  • Don't re-enable interrupts within an ISR.  That's just asking for subtle race condition and stack overflow problems.
There are also some system-level issues having to do with playing well with ISRs:
  • Make sure to disable interrupts when accessing a variable shared with an ISR. Do so for the shortest possible time. Do this even if you "know" it is safe (compiler optimizer behavior is difficult to predict, and code generation may change with a new compiler version). Ideally, protect those variables with access methods so you only have to get this right in one place in the code.
  • Declare any shared ISR/non-ISR variables as volatile. 
  • When you do timing analysis, don't forget that ISRs consume time too.
  • When you do stack depth analysis, don't forget worst-case stack-up of interrupts all occurring at the same time (especially if your processor supports multiple levels of interrupts).
  • Make sure that all interrupt vectors are initialized, even if you don't plan on using them (a sketch of a catch-all handler follows this list).
  • Only use Non-Maskable Interrupts for a catastrophic system event such as system reset.
  • Be sure to initialize all your interrupt-related data structures and hardware that can generate interrupts before you enable interrupts during the boot-up cycle.
  • Once you turn on the watchdog timer, don't ever mask its interrupt.
  • If you find yourself doing something weird within an ISR, go back and fix the root cause of the problem. Weird ISRs spell trouble.
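
As a sketch of the vector initialization rule above, here is the kind of catch-all handler you can point every unused vector at. The names are invented, and how the vector table itself gets populated is compiler- and part-specific.

#include <stdint.h>

static volatile uint16_t UnexpectedIrqCount = 0;  // visible in error logs / debugger

// Point every unused interrupt vector here instead of leaving it uninitialized.
void UnexpectedInterruptHandler(void)
{ UnexpectedIrqCount++;   // make the stray interrupt visible
  // Depending on the system, either just return or force a controlled reset here.
}
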
If you've been bitten by an interrupt in a way that isn't covered above, let me know!

Monday, February 25, 2013

Using Profiling To Improve System-Level Test Coverage


Sometimes I run into confusion about how to deal with "coverage" during system level testing. Usually this happens in a context where system level testing is the only testing that's being done, and coverage is thought about in terms of lines of code executed.  Let me explain those concepts, and then give some ideas about what to do when you find yourself in that situation.

"Coverage" is the fraction of the system you've exercised during testing. For unit tests of small chunks of code, it is typically what fraction of the code was exercised during testing (e.g., 95% means in a 100 line program 95 lines were executed at least once and 5 weren't executed at all).  You'd think it would be easy to get 100%, but in fact getting the last few lines of code in test is really difficult even if you are just testing a single subroutine in isolation (but that's another topic). Let's call this basic block coverage, where a basic block is a segment of code with no branches in or out, so if you execute one line of a basic block's code you have to execute all of them as a set. So in a perfect world you get 100% basic block coverage, meaning every chunk of code in the system has been executed at least once during testing. (That's not likely to be "perfect" testing, but if you have less than 100% basic block coverage you know you have holes in your test plan.)

By system level test I mean that the testing involves more or less a completely running full system. It is common to see this as the only level of testing, especially when complex I/O and timing constraints make it painful to exercise the software any other way. Often system level tests are based purely on the functional specification and not on the way the software is constructed. For example, it is the rare set of system level tests that checks to make sure the watchdog timer operates properly.  Or whether the watchdog is even turned on. But I digress...

The problem comes when you ask the question of what basic block coverage is achieved by a certain set of system-level tests. The question is usually posed less precisely, in the form "how good is our testing?"  The answer is usually that it's pretty low basic block coverage. That's because if it is difficult to reach into the corner cases when testing a single subroutine, it's almost impossible to reach all the dusty corners of the code in a system level test.  Testers think they're good at getting coverage with system level test (I know I used to think that myself), but I have a lot of doubt that coverage from system-level testing is high. I know -- your system is different, and it's just my opinion that your system level test coverage is poor. But consider it a challenge. If you care about how good your testing is for an embedded product that has to work (even the obscure corner cases), then you need data to really know how well you're doing.

If you don't have a fancy test coverage tool available to you, a reasonable way to go about getting coverage data is to use some sort of profiler to see if you're actually hitting all the basic blocks. Normally a profiler helps you know what code is executed the most for performance optimization. But in this case what you want to do is use the profiler while you're running system tests to see what code is *not* executed at all.  That's where your corner cases and imperfect coverage are.  I'm not going to go into profiler technology in detail, but you are likely to have one of two types of profiler. Maybe your profiler inserts instructions into every basic block and counts up executions.  If so, you're doing great and if the count is zero for any basic block you know it hasn't been tested (that would be how a code coverage tool is likely to work). But more likely your profiler uses a timer tick to sample where the program counter is periodically.  That makes it easy to see what pieces of code are executed a lot (which is the point of profiling for optimization), but almost impossible to know whether a particular basic block was executed zero times or one time during a multi-hour test suite.

If your profiler only samples, you'll be left with a set of basic blocks that haven't been seen to execute. If you want to go another step further you may need to put your own counters (or just flags) into those basic blocks by hand and then run your tests to see if they ever get executed. But at some point that may be hard too. Running the test suite a few times may help catch the moderately rare pieces of code that are executed. So may increasing your profile sample rate if you can do that. But there is no silver bullet here -- except getting a code coverage tool if one is available for your system.  (And even then the tool will likely affect system timing, so it's not a perfect solution.)
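
If you do end up hand-instrumenting, the flag idea might look something like the sketch below. The macro, array, and function names are invented for illustration.

#include <stdint.h>

#define NUM_COVER_POINTS 32                    // one slot per instrumented block
volatile uint8_t CoverFlag[NUM_COVER_POINTS];  // zero at startup; inspect after testing

#define COVER(id)  (CoverFlag[(id)] = 1u)      // drop one into each suspect basic block

void EnterSafeState(void);                     // hypothetical example function

// Example: instrumenting a rarely taken branch.
void CheckSensors(int sensor_fault, int backup_sensor_fault)
{ if (sensor_fault && backup_sensor_fault)
  { COVER(17);          // if CoverFlag[17] is still 0 after the test suite,
                        // this branch was never exercised
    EnterSafeState();
  }
}

After the test run you can dump the flag array over a debug port, or just look at it with a debugger, to see which instrumented blocks never executed.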

At any rate, after profiling and some hand analysis you'll be left with some level of idea of what's getting tested and what's not.  Try not to be shocked if it's a low fraction of the code (and be very pleased with yourself if it is above 90%).  

If you want to improve the basic block coverage number you can use coverage information to figure out what the untested code does, and add some tests to help. These tests are likely to be more motivated by how the software is built rather than by the product-level function specification.  But that's why you might want both a software test plan in addition to a product test plan.  Product tests never cover all the software corner cases.

Even with expanded testing, at some point it's going to be really really hard to exercise every last line of code -- or even every last subroutine. For those pieces you can test portions of the software to make it easier to get into the corner cases. Peer reviews are another alternative that is more cost effective than testing if you have limited resources and limited time to do risk reduction before shipping. (But I'm talking about formal peer reviews, not just asking your next cube neighbor to look over your code at lunch time.) Or there is the tried-and-true strategy of putting a breakpoint in before a hard-to-get-to basic block with a debugger, and then changing variables to force the software down the path you want it to go. (Whether such a test is actually useful depends upon the skill of the person doing it.)

The profiler-based coverage improvement strategy I've discussed is really about how to get yourself out of a hole if all you have is system tests and you're getting pressure to ship the system. It's better than just shipping blind and finding out later that you had poor test coverage. But the best way to handle coverage is to get very high basic block coverage via unit test, then shake out higher level problems with integration test and system test.

If you have experience using these ideas I'd love to hear about them -- let me know.


Friday, January 25, 2013

Exception Handling Fishbone Diagram

Exception handling is the soft underbelly of many software systems. A common observation is that there are a lot more ways for things to go wrong than there are for them to go right, so designing and testing for exceptional conditions is difficult to do well. Anecdotally, it is also where you find many bugs that, while infrequently encountered, can cause dramatic system failures. While not every software system has to be bullet-proof, embedded systems often have to be quite robust. And it's a lot easier to make a system robust if you have a checklist of things to consider when designing and testing.

Fortunately, just such a checklist already exists. Roy Maxion and Bob Olszewski at Carnegie Mellon created a structured list of exceptional conditions to consider when designing a robust system in the form of a fishbone diagram (click on the diagram to see the full detail in a new window).




(Source: Maxion & Olszewski, Improving Software Robustness with Dependability Cases, FTCS, June 1998.)

The way to read this diagram is that an exception failure could be caused by any of the general causes listed in the boxes at the end of the fish-bone segments, and the arrows into each fishbone are more specific examples of those types of problems.

If you don't have the picture handy, a way to remember the main branches is:
C - Computational problem
H - Hardware problem
I  - I/O and file problem
L - Library function problem
D - Data input problem
R - Return value problem
E - External user/client problem  (in embedded systems this may include control network exceptions)
N - Null pointer or memory problems

There isn't a silver bullet for exception handling -- getting it right takes attention to detail and careful work. But, this fishbone diagram does help developers avoid missing exception vulnerabilities. You can read more about the idea and the human subject experiments showing its effectiveness in the free on-line copy of their conference paper: Improving Software Robustness with Dependability Cases.

You can read more detail in the journal paper (non-free unless you have a subscription):
Eliminating exception handling errors with dependability cases: a comparative, empirical study, IEEE Transactions on Software Engineering, Sept. 2000.  http://dx.doi.org/10.1109/32.877848


