Monday, June 21, 2010

Don't require perfection

A common problem with requirements is that they mandate perfection that is unattainable. For example, it is common for embedded software requirements to state that the software shall never crash, shall be perfectly safe, and shall be defect-free. (In truth, more often these things aren't even written down, but those are the answers you get when you ask what the requirements are for dependability, safety, and software defect rates for high quality embedded systems.)

Perfection doesn't ever happen. "Never" is longer than you have available to test for software dependability. And it is the rare everyday embedded system that has taken a rigorous approach to ensuring safety.

It is also true that in many areas it is too risky from a liability point of view to write down a concrete requirement for less than perfection. And you may be guessing as to a target of less than perfection even if you do specify one. (Is it OK for your system to crash once every 1000 years, or will 900 years do? Did you guess when you answered that, or do you have a concrete basis for making that tradeoff?)

If you are able and permitted to specify a concrete, non-perfect set of requirements for your product, you should. But if you can't, consider instead defining a set of acceptance criteria that will at least let you perform actual measurements to validate that your system is good enough. These can be either process requirements or actual tests. Some examples include:

  • System shall not crash during one full week of stress tests.
  • All sources of crashes during testing shall be tracked down to root cause, and eliminated if appropriate.
  • System shall perform an emergency shutdown if a defined safety requirement is violated at run time. (This assumes you are able to monitor these requirements effectively at run time; a sketch of this idea appears at the end of this post.)
  • All system errors shall be logged for analysis in failed units returned for factory service.
None of these will get you to perfection. But they, and other possible criteria like them, will give you a concrete way of knowing if you have worked hard enough in relevant areas before you release your software. You can find out more by reading Chapter 6 of my book, which discusses creating measurable requirements.
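To make the run-time monitoring bullet above more concrete, here is a minimal sketch in C of a periodic monitor that forces an emergency shutdown when a safety limit is violated. The limit value, sensor read, logging, and shutdown functions are hypothetical placeholders, not part of any particular product.

    /* Hypothetical run-time safety monitor sketch. */

    #define MAX_SAFE_TEMP_C 85              /* assumed example safety limit */

    extern int  read_temperature_c(void);   /* placeholder sensor read      */
    extern void log_safety_event(int temp); /* placeholder error logger     */
    extern void emergency_shutdown(void);   /* placeholder safe-state hook  */

    /* Call periodically from the main loop or a dedicated monitor task. */
    void safety_monitor_step(void)
    {
        int temp = read_temperature_c();

        if (temp > MAX_SAFE_TEMP_C) {
            log_safety_event(temp);   /* record the violation for later analysis */
            emergency_shutdown();     /* go to a defined safe state              */
        }
    }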
---

Thursday, June 17, 2010

When is color worse than B&W?

Generally, color displays are better than black and white ones (or monochrome displays, depending on your display technology). In addition to making products look more sophisticated, color lets us communicate more information for a given display size. There's nothing like red to tell you there is a problem and green to tell you things are OK.

Unless you're red/green colorblind.

Roughly 10% of males have some form of red/green colorblindness, with the exact rate varying by the population you are considering. Most colorblind people don't just see gray; rather, it is very common for them to have trouble distinguishing particular hues and intensities of red from the corresponding greens.

If you are designing a product that uses red and green to display important information, then make sure there is a secondary way to obtain that information that works even if you can't tell the colors apart. Some example strategies include:
  • Positional information. Traffic lights are OK because the red light is always on top, so you know what color it is by its position.
  • Use color only as auxiliary information. If the display is the red text "FAIL" vs. green text "OK" then colorblind folks will do just fine.
  • Blinking rates. If you have a bicolor LED, then consider flashing for red and solid for green (which may be a good idea anyway, since flashing lights attract attention). Or use a distinctly different blinking rate for each color; a sketch of this approach follows the list.
  • Significantly different luminosity or brightness. A very dark red vs. a bright green may work out OK, but you should do some testing or dig deeper to be sure you got it right.
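As a minimal sketch of the blinking-rate approach, assuming a hypothetical LED driver and a 10 ms periodic tick (all names here are illustrative, not a real API):

    #include <stdbool.h>
    #include <stdint.h>

    extern void led_set(bool on);        /* placeholder LED driver call */
    extern bool system_has_fault(void);  /* placeholder status query    */

    /* Called every 10 ms: solid LED means OK, 2 Hz flash means fault,
     * so the state is distinguishable even if the colors are not. */
    void status_led_tick_10ms(void)
    {
        static uint16_t tick = 0;
        tick++;

        if (system_has_fault()) {
            led_set((tick % 50U) < 25U);  /* 250 ms on, 250 ms off */
        } else {
            led_set(true);                /* solid on = OK */
        }
    }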
Fortunately for me, I'm not colorblind. (This also means I'm not a personal expert on what tricks might work.) But enough people are that this is the sort of thing you don't want to miss when you are making an embedded system. Chapter 15 of my book discusses user interface design and user demographics in more detail.
---

Monday, June 14, 2010

White Box Testing

White box software tests are designed in light of the particular software design and implementation being tested. For example, if you have an if {} else {} code construct, a white box test would intentionally try to execute both the if clause and the else clause of that statement (probably using separate tests). This may sound trivial, but designing tests to execute rare cases and fault handling code can be a real challenge.
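To make that concrete, here is a hedged sketch (the function and tests are hypothetical, not taken from any real code base): one test forces the if branch and a second test forces the else branch, so both are exercised.

    #include <assert.h>

    /* Hypothetical code under test. */
    static int clamp_to_limit(int value, int limit)
    {
        if (value > limit) {
            return limit;   /* "if" branch: value exceeds the limit  */
        } else {
            return value;   /* "else" branch: value is within limits */
        }
    }

    /* Two white box tests, one per branch. */
    static void test_clamp_if_branch(void)   { assert(clamp_to_limit(150, 100) == 100); }
    static void test_clamp_else_branch(void) { assert(clamp_to_limit( 50, 100) ==  50); }

    int main(void)
    {
        test_clamp_if_branch();
        test_clamp_else_branch();
        return 0;
    }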

The fraction of code that is tested is known as the test coverage. In general, higher coverage is good. For example, white box test coverage might be measured as the fraction of lines of code executed by tests, with 95% being a pretty good result and 98% to 99% often being the best people do without heroic effort. (95% coverage means 5% of the code is never executed -- not even once -- during testing. It's hard to believe that counts as a pretty good result, but as I said, executing code that handles rare cases can be a challenge.)

Although lines of code executed is the classic coverage metric, there are other possible coverage metrics that might be useful depending on your situation:
  • Testing that code can correctly handle exceptions it might encounter (for example, does it handle malloc failing? A sketch of forcing that case appears below.)
  • Testing that all entries of a lookup table are exercised (what if only one table entry is out to lunch?)
  • Testing that all states and arcs of a statechart have been exercised
  • Testing that algorithms have been checked for numerical stability in tricky areas where they might have problems
The point of all the above is that the tester knows exactly how the code is trying to perform its functions, and makes sure that nothing was missed in testing, especially corner cases.
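As one example of the exception-handling bullet above, a common approach is to route allocations through a wrapper that a test can force to fail. This is a hedged sketch with hypothetical names, not any specific test framework's API:

    #include <assert.h>
    #include <stdbool.h>
    #include <stdlib.h>

    /* Allocation wrapper that tests can force to fail. */
    static bool force_alloc_failure = false;

    static void *my_malloc(size_t size)
    {
        return force_alloc_failure ? NULL : malloc(size);
    }

    /* Hypothetical code under test: must report failure, not crash. */
    static int *buffer = NULL;

    static bool buffer_init(size_t count)
    {
        buffer = my_malloc(count * sizeof(*buffer));
        return (buffer != NULL);
    }

    static void test_buffer_init_handles_alloc_failure(void)
    {
        force_alloc_failure = true;
        assert(buffer_init(128) == false);   /* graceful failure, no crash */
        force_alloc_failure = false;
    }

    int main(void) { test_buffer_init_handles_alloc_failure(); return 0; }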

It's important to remember that even if you have 100% test coverage, it doesn't mean you have tested the system completely (usually "complete" testing is impossible -- there are too many possibilities). What good coverage does mean is that you haven't left out anything obvious. And that's good enough to make understanding the coverage of your testing worthwhile. Chapter 23 of my book discusses the concepts and practices of embedded software testing in more detail.
---

Thursday, June 10, 2010

Is your software dependable enough?

Most embedded system software has to be reasonably dependable. For example, customers are likely to be unhappy if their software crashes once per minute. But how dependable is good enough can be a slippery subject. For example, is it OK if your software crashes once every 10 minutes? Every 10 hours? Every 10 days? Every 10 years? Is that number written down anywhere? Or is it just a guess as to what might be acceptable?

We suggest that every product have written dependability requirements. This probably has two parts: Mean Time Between Failures (MTBF) for hardware, and mean time between crashes for software. (You can add a lot more if you like, but if you are missing either of these you have a big hole in your requirements.)

Once you have set your requirements, how do you know you meet them? For hardware you can use well established reliability calculation approaches that ultimately rest upon an assumption of random independent failures. But for software there is no reasonable failure rate to make predictions with. So that leaves you with testing to determine software dependability.

Testing to determine whether your software crashes less often than once per minute is pretty easy. But when your dependability target is many years between software crashes, then testing longer than that is likely to be a problem. So, for most systems we recommend defining not only a target operational dependability, but also a concrete acceptance test for dependability that is easily measurable.

For example, set a requirement that the system has to survive 1 week of intense stress testing without a crash before it ships. This certainly doesn't guarantee you'll get 10 years between crashes in the field, but at least it is a concrete, measurable requirement that everyone can discuss and agree upon during the requirements process. It's far better to have a concrete, defined dependability acceptance test than to just leave dependability out of the requirements and hope things turn out OK. Chapter 26 of my book discusses embedded system dependability in more detail.
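One way to make that kind of acceptance test measurable is to record every reset, along with how long the system ran beforehand, during the stress test. Here is a minimal sketch; the reset-cause register, retained-RAM uptime value, and log functions are hypothetical and hardware-specific:

    #include <stdint.h>

    /* Hypothetical hardware and storage hooks -- names are placeholders.  */
    extern uint32_t read_reset_cause(void);          /* e.g., power-on vs. watchdog reset      */
    extern uint32_t read_last_uptime_seconds(void);  /* uptime before reset, from retained RAM */
    extern void     log_append(uint32_t cause, uint32_t uptime);  /* nonvolatile crash log     */

    /* Call once at startup. After a week of stress testing, the log shows
     * every unexpected reset and how long the system ran before it, so a
     * "no crashes during one week of stress tests" requirement can be
     * checked with data instead of anecdotes. */
    void record_boot_event(void)
    {
        log_append(read_reset_cause(), read_last_uptime_seconds());
    }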
---

Monday, June 7, 2010

CAN Tutorial

If you are looking for a Controller Area Network (CAN) tutorial, you may find the slides I use in teaching one of my courses useful.

Have a look at this Acrobat file: http://www.ece.cmu.edu/~ece649/lectures/14_can.pdf
which covers:
  • CAN overview
  • Bit dominance and binary countdown
  • Bit stuffing (including the bit stuffing error vulnerability in CAN); a sketch of the basic stuffing rule appears at the end of this post
  • Message headers
  • Message header filtering
  • Network length restrictions
  • Devicenet overview
While there are a number of web pages and articles on CAN, sometimes it helps to have lecture slides to browse through.
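If it helps to see the bit stuffing idea in code form: CAN inserts a bit of the opposite value after five consecutive bits of the same value, and the stuff bit itself starts the next run. Here is an illustrative sketch operating on an array that holds one bit per element; it is a teaching aid, not a real CAN controller implementation:

    #include <stddef.h>
    #include <stdint.h>

    /* Copy 'in' to 'out', inserting a complementary stuff bit after every
     * run of five identical bits. Returns the number of output bits. */
    size_t stuff_bits(const uint8_t *in, size_t in_len, uint8_t *out, size_t out_max)
    {
        size_t  out_len   = 0;
        uint8_t run_value = 2;   /* impossible bit value forces a fresh run */
        int     run_count = 0;

        for (size_t i = 0; i < in_len && out_len < out_max; i++) {
            out[out_len++] = in[i];

            if (in[i] == run_value) {
                run_count++;
            } else {
                run_value = in[i];
                run_count = 1;
            }

            if (run_count == 5 && out_len < out_max) {
                out[out_len++] = (uint8_t)(1U - run_value);  /* insert stuff bit    */
                run_value = (uint8_t)(1U - run_value);       /* it starts a new run */
                run_count = 1;
            }
        }
        return out_len;
    }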

---

Thursday, June 3, 2010

Top two mistakes with watchdog timers

Watchdog timers provide a useful fallback mechanism for tasks that hang or otherwise violate timing expectations. In brief, application software must occasionally kick (or "pet") the watchdog to demonstrate things are still working properly. If the watchdog hasn't seen a pet operation in too long, it times out, resetting the system. The idea is that if the system hangs, the watchdog will reset the system to restore proper operation.

The #1 mistake with watchdog timers is not using one. It won't work if you don't turn it on and use it.

The #2 mistake is using an interrupt hooked up to a counter/timer to service the watchdog. For example, if your watchdog trips after 250 msec, you might have a hardware timer/counter generate an interrupt every 200 msec that runs a task to pet the watchdog. This is, in some ways, WORSE than leaving the watchdog turned off entirely. The reason is that it fools people into thinking the watchdog timer is providing benefit, when in fact it's really not doing much for you at all.

The point of the watchdog timer is to detect that the main application has hung. If you have an interrupt that pets the watchdog, the main application could be hung and the watchdog will get petted anyway. You should always pet the watchdog from within the main application loop, not from a timer-triggered interrupt service routine. (As with any rule, you can bend this one, but if it is possible to pet the watchdog when your application has hung, then you aren't using the watchdog properly.) Chapter 29 of my book discusses how to use watchdog timers in more detail.
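As a minimal sketch of petting the watchdog from the main loop rather than from an interrupt (the kick function and task health checks are hypothetical and hardware-specific):

    #include <stdbool.h>

    extern void watchdog_kick(void);        /* placeholder: hardware watchdog kick   */
    extern bool sensor_task_ran_ok(void);   /* placeholder health flags set by tasks */
    extern bool control_task_ran_ok(void);

    int main(void)
    {
        for (;;) {
            /* ... run or schedule the application tasks here ... */

            /* Pet the watchdog only after confirming the monitored tasks
             * actually made progress. If anything hangs, the kick is skipped
             * and the watchdog eventually resets the system. */
            if (sensor_task_ran_ok() && control_task_ran_ok()) {
                watchdog_kick();
            }
        }
    }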
---
