Monday, December 7, 2015

Multi-Rate Main Loop Tasking


Recently I was looking for an example of a prioritized, cooperative multitasking main loop and realized it is surprisingly difficult to find one that is (1) readily available and (2) comprehensible.

Sure, once you understand the concept maybe you can dig through an industrial-strength implementation with all sorts of complexity. But good luck getting up the learning curve!  So, here is a simple (I hope!) description of a multi-rate main loop scheduler.

First of all, we're talking about non-preemptive multitasking. This is variously called main loop scheduling, a main loop tasker, a prioritized cooperative multitasker, a cyclic executive, a non-preemptive scheduler, and no doubt a bunch of other terms. (Not all of these terms are precisely descriptive, but as search terms they'll get you in the ballpark.) The general idea is to be able to schedule multiple tasks that operate at different periods without having to use an RTOS and without having to use the generally bad idea of stuffing everything into timer-based interrupts.

Single Rate Main Loop

The starting point is a single-rate main loop. This is a classical main loop schedule in which all tasks are run to completion over and over again:

void main(void)
{ ... initialization ...

  while(1)
  { Task0();
    Task1();
    Task2();
    Task3();
  }
}


The good news about this is that you don't need an RTOS, and it's really hard to get things wrong. (In other words reviews and testing are easy to do well.)

The bad news is that all tasks have to run at the same period. That means that if one task needs to run at a much shorter period than it takes to get through the whole loop, you'll miss its deadlines.

I've seen way too many hacks that use conditional execution based on CPU load to sometimes skip tasks. But ad hoc approaches have the problem that you can't really know if you'll miss deadlines. Instead, you should use a methodical approach that can be analyzed mathematically to make sure you'll meet deadlines. The approach people generally take is some variation of a multi-rate main loop.

Multi-Rate Main Loop

The idea behind a multi-rate main loop is that you can run each task at a different periodic rate. Each task (which is just a subroutine) still runs to completion, so this is not a full-up preemptive multitasking system. But it is relatively simple to build, and flexible enough for many embedded systems.

Here is some example code of the main loop itself, with some explanation to follow.

void main(void)
{ ... initialization ...

  while(1)
  { if(       flag0 )  { flag0 = 0; Task0(); }
    else if ( flag1 )  { flag1 = 0; Task1(); }
    else if ( flag2 )  { flag2 = 0; Task2(); } 
    else if ( flag3 )  { flag3 = 0; Task3(); }
  }
}

The way this code works is as follows.  All the tasks that need to be run have an associated flag set to 1.  So if Task1 and Task2 are the only tasks that need to run, flag0 is 0, flag1 is 1, flag2 is 1, and flag3 is 0. The code crawls through an "else if" cascade looking for the first non-zero flag.  When it finds a non-zero flag it executes that task, and only that task.

Note that each flag is set back to zero just before its task runs, so a task executes exactly one time each time its flag is activated. If all flags are zero then no task is executed and the while(1) loop simply spins away until a flag finally becomes non-zero. More about how flags get set to 1 in a moment.

After executing at most one task, the loop goes back to the beginning. Because at most one task is executed per iteration of the main while(1) loop, the tasks are prioritized. Task0 has the highest priority, and Task3 the lowest priority.

Yes, this prioritization means that if your CPU is overloaded Task0 may execute many times and Task3 may never get a turn. That's why it's important to get scheduling right (this will be a topic in a later blog posting).

Multi-Rate Timers

The main loop wasn't so bad, except we swept under the rug the messy business of getting the flags set properly.  Trying to do that in the main loop generally leads to problems, because a long task will cause many milliseconds to go by between timer checks, and it is too easy to have a bug that misses setting a flag some of the time. Thus, in general you tend to see flag maintenance in the timer interrupt service routine.

Conceptually the code looks like this, and for our example lives in a timer interrupt service routine (ISR) that is called every 1 msec.  A variable called "timer" keeps track of system time and is incremented once every msec.

 // in a timer ISR that is called once every msec
 timer++;
 if ((timer %   5) == 0) { flag0 = 1; } // enable 5 msec task0
 if ((timer %  10) == 0) { flag1 = 1; } // enable 10 msec task1
 if ((timer %  20) == 0) { flag2 = 1; } // enable 20 msec task2
 if ((timer % 100) == 0) { flag3 = 1; } // enable 100 msec task3
 if (timer >= 100) { timer = 0; }       // avoid timer overflow bug


Every 5 msec the timer will be zero modulo 5, every 10 msec the timer will be zero modulo 10, and so on.  So this gives us tasks with periods of 5, 10, 20, and 100 msec.   Division is slow, and in many low-end microcontrollers should be avoided in an ISR. So it is common to see a set of counters (one per flag), where each counter is set to the period of a particular task and counts down toward zero once per msec. When a counter reaches zero the associated flag is set to 1 and the counter is reset to the task's period (see the sketch below). This takes a little more RAM, but avoids division. How it's implemented depends upon your system tradeoffs.

The last line of this code avoids weird things happening when the timer overflows. The reset to zero should run at the least common multiple of all periods, which in this case happens to be equal to the longest period.
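Here is a minimal sketch of that countdown-counter variant (declarations and names are illustrative; as with the flags, how the counters are made visible to the ISR is up to your system's conventions):

 // countdown-counter variant of the same ISR (flags declared elsewhere, e.g. volatile uint8_t)
 static uint8_t count0 = 5, count1 = 10, count2 = 20, count3 = 100;

 // in the timer ISR, once every msec:
 if (--count0 == 0) { count0 = 5;   flag0 = 1; } // enable 5 msec task0
 if (--count1 == 0) { count1 = 10;  flag1 = 1; } // enable 10 msec task1
 if (--count2 == 0) { count2 = 20;  flag2 = 1; } // enable 20 msec task2
 if (--count3 == 0) { count3 = 100; flag3 = 1; } // enable 100 msec task3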

Concurrency Issues

As with any scheduling system there are potential concurrency issues. One subtle one is that the timer ISR can run part-way down the else..if structure in the main loop. This could cause a low-priority task to run before a high priority task if they both have their flags set to 1 on the same timer tick. It turns out that this doesn't make the worst case latency much worse. You could try to synchronize things, but that adds complexity. Another way to handle this is to copy the current time into a temporary variable and do the checks for when to run each task in the main loop instead of the timer ISR.

It's also important to note that there is a potential concurrency problem in writing flags in the main loop, since both the ISR and the main loop can write the flag variables. In practice the concurrency bug will only bite when you're already missing deadlines, but good coding style dictates disabling interrupts while the main loop updates the flag values. That isn't shown in the main loop code above in an attempt to keep things simple for the purpose of explanation.
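Here's a minimal sketch of what that looks like, assuming your toolchain provides some way to briefly mask interrupts (shown here as hypothetical DISABLE_INTERRUPTS()/ENABLE_INTERRUPTS() macros; keep the critical sections short):

  while(1)
  { if(       flag0 )  { DISABLE_INTERRUPTS(); flag0 = 0; ENABLE_INTERRUPTS(); Task0(); }
    else if ( flag1 )  { DISABLE_INTERRUPTS(); flag1 = 0; ENABLE_INTERRUPTS(); Task1(); }
    else if ( flag2 )  { DISABLE_INTERRUPTS(); flag2 = 0; ENABLE_INTERRUPTS(); Task2(); }
    else if ( flag3 )  { DISABLE_INTERRUPTS(); flag3 = 0; ENABLE_INTERRUPTS(); Task3(); }
  }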

The Big Picture

OK, that's pretty much it.  We have a main loop that runs each task when its ready-to-run flag is set, and a timer ISR that sets a ready-to-run flag for each task at the desired period. The result is a system that has the following properties:
  • Each task runs once during its assigned period
  • The tasks are prioritized, so for example task 2 only runs when task 0 and task 1 do not need to run
The big benefit is that, so long as you pay attention to schedulability math, you can run both fast and slow tasks without needing a fancy RTOS and without missing deadlines.

In terms of practical application this is quite similar to what I often see in commercial systems. Sometimes developers use arrays of counters, arrays of flags, and sometimes even arrays of pointers to functions if they have a whole lot of functions, allowing the code to be a generic loop rather than spelling out each flag name and each task name. This might be necessary, but I recommend keeping things simple and avoiding arrays and pointers if it is practical for your system.
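For reference, here's a minimal sketch of that table-driven style (task names, array sizes, and variable names are made up for illustration; as noted, the explicit version above is preferable when you can keep it that simple):

#define NUM_TASKS 4
volatile uint8_t flag[NUM_TASKS];                          // set by the timer ISR, as before
const uint16_t period_msec[NUM_TASKS] = {5, 10, 20, 100};  // used by the ISR's countdown counters
void (* const task[NUM_TASKS])(void) = {Task0, Task1, Task2, Task3};

void main(void)
{ ... initialization ...

  while(1)
  { // run at most one ready task per pass; lower index = higher priority
    for (uint8_t i = 0; i < NUM_TASKS; i++)
    { if (flag[i]) { flag[i] = 0; task[i](); break; }
    }
  }
}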

Coming soon ... real time analysis of a multi-rate main loop

Monday, November 9, 2015

How Long Should You Set Your Watchdog Timer


Summary: You can often use a watchdog effectively even if you don't know worst case execution time. The more you know about system timing, the tighter you can set the watchdog.



Sometimes there is a situation in which a design team doesn't have resources to figure out a tight bound on worst case execution time for a system. Or perhaps they are worried about false alarm watchdog trips due to infrequent situations in which the software is running properly, but takes an unusually long time to complete a main loop. Sometimes they set the watchdog to the maximum possible time (perhaps several seconds) to avoid false alarm trips. And sometimes they just set it to the maximum possible value because they don't know what else to do.

In other words, some designers turn off the watchdog or set it to the maximum possible setting because they don't have time to do a detailed analysis.  But, almost always, you can do a lot better than that.

To get maximum watchdog effectiveness you want the watchdog timeout to put a tight bound on worst case execution time. However, if your system safety strategy permits it, there are simpler ways to compute a watchdog period by analyzing the application instead of the software itself.  Below I'll work through some ways to set the watchdog, going from a more complicated approach to simpler ones.

Consider software that heats water in an appliance. Just to make the problem concrete and use easy numbers, let's say that the control loop for heating executes every 100 msec, and takes between 40 and 75 msec to execute (worst case fast and slow execution times).   Let's also say that it uses a single-task main loop scheduler without an RTOS so we don't have to worry about task start time jitter.  How could we set the watchdog for this system? Ideally we'd like the tightest possible timing, but there may be some slack because water takes a while to heat up, and takes a while to boil dry. How long should we set the watchdog timer for?

Classical Watchdog Setup

Classically, you'd want to compute the worst case execution time range of the software (40-75 msec in this case). Let's assume the watchdog kick happens as the last instruction of each execution. Since the software only runs once every 100 msec, the shortest time between kicks is when one cycle runs 75 msec, waits 25 msec, and then the next cycle runs faster, completing the computation and kicking the watchdog in only 40 msec.  25+40 = 65 msec shortest time between kicks. In contrast, the longest time between kicks is when a short cycle of 40 msec is followed by 60 msec of waiting, then a long cycle of 75 msec.  60+75 = 135 msec longest time between kicks. It helps a lot to sketch this out on a timeline.


If you're setting a conventional watchdog timer, you'd want to set it at 135 msec (or the closest setting greater than that). If you have a windowed watchdog, you'd want to set the minimum setting at 65 msec, or the closest setting lower than that. 
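As a minimal sketch of the main loop case (the watchdog and period-synchronization calls shown are hypothetical; real parts typically offer a fixed menu of timeout settings, so pick the closest one above the computed bound):

void main(void)
{ ... other initialization ...
  watchdog_init(140);          // hypothetical call; closest available setting >= 135 msec

  while(1)
  { WaitForStartOfPeriod();    // hypothetical; synchronizes to the 100 msec period
    ControlLoopTask();         // takes 40 to 75 msec in this example
    watchdog_kick();           // hypothetical call; kick as the last step of each execution
  }
}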

Note that if you're running an RTOS, the scheduling might guarantee that the task runs once in every 100 msec period, but not when it starts within that period. In that case the worst case shortest time between kicks is the task running back-to-back at its shortest length = 40 msec. The longest time will be when a short task runs at the beginning of a period, and the next task completes right at the end of the following period, giving 60 msec (idle time at end of period) + 100 msec (one more period) = 160 msec between watchdog kicks. Thus, a windowed watchdog for this system would have to permit a kick interval of 40 to 160 msec.

Watchdog Approximate Setup

Sometimes designers want a shortcut. It is usually a mistake to set the watchdog at exactly the period because of timing jitter in where the watchdog actually gets kicked. Instead, a handy rule of thumb for non-critical applications (for which you don't want to do the detailed analysis) is to set the watchdog timer interval to twice the software execution period. For this example, you'd set the watchdog timer to twice the period of 100 msec = 200 msec.  There are a couple of assumptions you need to make: (1) the software always finishes before the end of its period, and (2) the effectiveness of the watchdog at this longer timeout will be good enough to ensure adequate safety for your system. (If you are building a safety critical system you need to dig deeper on this point.) 
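In code form the rule of thumb is just this (names are illustrative):

#define TASK_PERIOD_MSEC   100u
#define WDT_TIMEOUT_MSEC   (2u * TASK_PERIOD_MSEC)   // rule-of-thumb setting: 200 msec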

This simpler approach sacrifices some watchdog effectiveness for detecting faults that perturb system timing, compared to the tighter 135 msec (or 160 msec with an RTOS) bound computed above. But it will still catch a system that is definitely hung, without needing detailed timing analysis.

For a windowed watchdog the rule of thumb is a little more difficult.  That is because, in principle, your task might run the full length of one period and complete instantly on the next period, giving effectively a zero-length minimum watchdog timer kick interval. If you can establish a lower bound on the minimum possible run time of your task, you can set that as an approximate lower bound on watchdog timer kicks. If you don't know the minimum time, you probably can't use the lower bound of a windowed watchdog timer.

Application-Based Watchdog Setup

The previous approaches required knowing the run time of the software. But, what do you do if you know nothing about the run time, or have low confidence in your ability to predict the worst case bounds even after extensive analysis and testing?

An alternate way to approach this problem is to skip the analysis of software timing and instead think about the application. Ask yourself, what is the longest period for which the application can withstand a hung CPU?  For example, for a counter-top appliance heating water, how long can the heater be left full-on due to hung software without causing a problem such as a fire, smoke, or equipment damage?  Probably it's quite a bit longer than 100 msec.  But it might not be 5 or 10 seconds. And probably you don't want to risk tripping a thermal fuse or setting off a household smoke alarm by turning off the watchdog entirely.

As a practical matter, if the system's controls go unstable after a certain amount of time, you'd better make your watchdog timer period shorter than that length of time!

For the above example, the control period is 100 msec. But, let's say the system can withstand 500 msec of no control inputs without becoming unrecoverable, so long as the control system starts working again within that time and it doesn't happen often. In this case, the watchdog timer must be set shorter than 500 msec. But there is one other thing to consider. The system probably takes a while to reboot and start running the control loop after the watchdog timer trips. So we need to account for that restart time. For this example, let's say the time between a watchdog reset and the time the software starts controlling the system again ranges from 70 to 120 msec based on test measurements.

Based on knowing the system reset time and the stability grace period, we can get an approximate watchdog setting as follows.  We have 500 msec of no-control grace period, minus 120 msec of worst case restart time.  500-120 = 380 msec. Thus, for this system the maximum permissible watchdog timer value is 380 msec to avoid losing control system stability. Using this approach, the watchdog maximum period should be set at the longest available setting that is shorter than 380 msec. Without knowing more about software computation time, there is not much we can say about the minimum period for a windowed watchdog.
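In code form, the bound from this example works out to something like this (all names and numbers are illustrative):

#define CONTROL_GRACE_MSEC    500u   // longest tolerable gap in control action
#define WORST_RESTART_MSEC    120u   // measured worst-case reset-to-control-running time
#define WDT_MAX_TIMEOUT_MSEC  (CONTROL_GRACE_MSEC - WORST_RESTART_MSEC)   // 380 msec upper bound

// then pick the longest available watchdog setting that does not exceed WDT_MAX_TIMEOUT_MSEC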

Note that for this approach we still need to know something about the worst case execution of the software in case you hit a long execution path after the watchdog reset. However, it is often the case that you know the longest time that is likely to be seen (e.g., via measuring a number of runs) even if you don't know the details of the real time scheduling approach used.  And often you might be willing to take the chance that you won't hit an unlikely, even worse running time right after a system reset. For a non-safety critical application this might be appropriate, and is certainly better than just turning the watchdog off entirely.

Finally, it is often useful to combine the period rule of thumb with the control stability rule of thumb (if you know the task execution period). You want the watchdog set shorter than the time required to ensure control stability, but longer than the time it actually takes to execute the software so that it will actually be kicked often enough. For the above example this means setting the watchdog somewhere between two periods and the control stability time limit, giving a range for the maximum watchdog limit of 200-380 msec. This can be set without detailed software execution time analysis beyond knowing the task period and the range of likely system restart times.

Summary
If you know the maximum worst case execution time and the minimum execution time, you should set your watchdog as tightly as you can. If you don't know those values, but at least are confident you'll complete execution within your task period, then you should set the watchdog to twice the period. You should also bound the maximum watchdog timeout setting by taking into account how long the system can operate without a reset before it loses control stability.

Monday, September 28, 2015

Open Source IoT Code Is Not The Entire Answer


Summary: Whether or not to open source embedded software is the wrong question. The right question is how we can ensure independent checks and balances on software safety and security. Independent certification agencies have been doing this for decades. So why not use them?

In the wake of the recent Volkswagen diesel software revelations, there has been a call from some that automotive software and even all Internet of Things software should be open source. The idea is that if the software is released publicly, then someone will notice if there is a security problem, a safety problem, or skulduggery of some sort. While open source can make sense, this is neither an economically realistic nor necessary step to apply across-the-board.

The Pro list for open source is pretty straightforward: if you publish the code, someone will come and read it and find all the problems.

The Con list is, however, more reflective of how things really work. You have to assume that someone with enough technical skill will actually spend the time to look, and will actually find the problem. That doesn't always happen. The relatively simple Heartbleed bug was there for all to see in OpenSSL, and it stayed there for a couple years despite being a widely used, crucial piece of open source Internet infrastructure software. Presumably a lot more people care about OpenSSL than your toaster oven's software.

Some of the opponents of open sourcing IoT software invoke the security bogeyman. They say that if you publish the source you'll be vulnerable to attacks. Well sure, it might make it easier to find a way to attack, but it doesn't make you "vulnerable." If your code was already full of vulnerabilities, publishing source code just might make it a little easier for someone to find them.  Did you notice that the automotive security exploits published recently did not rely on source code?  I can believe that exploits could, at least sometimes, be published more quickly for open source code, but I don't see this as a compelling argument for keeping code secret and un-reviewed.

A more fundamental point is that software is often the biggest competitive advantage in making products that would otherwise be commodities. Asking companies to reveal their most important trade secrets (their software), so that a hypothetical person with the time and skills might just happen to find a problem sounds like a hard sell to me.  Especially since there is the well established alternative of having an external, independent certification agency look things over in private.

Safety critical systems have had standards and independent review systems in place for decades. Aviation uses DO-178c and other standards, and has a set of independent reviewers called Designated Engineering Representatives (DERs) that provide design reviews during the development cycle. Rail systems follow EN-50126/8/9 and typically involve oversight from acquisition consultants. The chemical process industry generally follows IEC-61508, and has long used independent certification organizations to check their work (typically I see reviews have been done by Exida or TUV). The consumer appliance industry has long had Underwriters Laboratories (UL) certification, and is moving to a more comprehensive software safety standard approach based on IEC 60730, including external independent certification. There are also more recent domain-specific security standards that can be applied. (It is worth noting that ensuring safety and security requires a lot more than just source code, but that's a topic for another day.)

Cars have long had the option to use the MISRA software safety guidelines, and more recently the ISO 26262 safety standard. Historically, some companies have had external agencies certify automotive components to those standards. But, at least some car companies have not taken advantage of this external audit opportunity, and thus there has been no independent check and balance on their software until their problems show up in the news. Software safety and security audits are not required to sell cars in the US. (There is some vehicle-level testing according to FMVSS requirements, but it's about vehicle behaviors, not the actual source code.)

For Internet of Things it will be interesting to see how things play out. As I understand it the EU is already requiring IEC 60730 compliance, which means external safety checks for safety critical IoT applications. We could see that mandate spread to more IoT products sold in Europe if there are high-profile problems. And perhaps we'll see a push on automotive software regulation too.

So, there is a well established alternative to open source in the form of external certifying organizations issuing compliance certificates based on international safety and security standards. Rather than get distracted by an open source debate, what we should be doing is asking "what's the most effective way to ensure adequate software safety and dependability in a way that doesn't put companies out of business?" Sometimes that might be open source, especially for underlying infrastructure. But other times, probably most times, independent review by a trusted certification party will be up to the task. The question is really what it will take to make companies produce verifiably adequate software.

Having checks and balances works. We should use them.

(For the record, I made some of my source code public domain before "open source" was even a buzzword, and have released other source code under an older version of GPL (Ballista robustness testing) and Creative Commons BY 4.0 (CRC Hamming Distance length calculation). Some code I copyright and release. And some I keep as a trade secret. My interest here is in the public being able to use safe and secure embedded software. We should focus on that, and not let things get sidetracked into another iteration of the open source vs. proprietary software debate.)

Monday, September 7, 2015

Essential Embedded Software Skills

I spend a lot of time trying to grapple with what makes embedded systems different from desktop computer systems in terms of skills and development processes.  Often the answer to this question on discussion groups ends up being something like "everything has to be super-optimized," or "you need to meet real-time deadlines." But those are technical measures that seem to me to be more symptoms of particular embedded system projects than the root cause of the differences.  And, such answers tend to be a bit one-dimensional.

After some thought, perhaps the distinctive attributes of embedded systems can be summarized in the following way:

Interaction with the physical world:
Embedded systems generally have a primary goal of interacting with the physical world using sensors  and actuators. This in turn encompasses various topics depending on the application, including:
  - Real time responsiveness (scheduling, concurrency management, timekeeping)
  - Analog & digital interfacing
  - Control approaches
  - Signal processing
  - Coordination via networked and Cloud services
  - Reliability, safety, system robustness

Special-purpose computing platform:
Most embedded systems don't use a general purpose computing platform (a desktop computer, laptop, tablet, smart phone, etc.).  Rather, they use a customized hardware platform that is permanently embedded into the product. (Even those that do use somewhat standardized hardware often have specialized I/O devices attached.)  This in turn encompasses various topics depending on the application, including:
  - Software optimization (squeezing to fit into a cost-constrained platform)
  - Close-to-hardware programming (interrupts, device interfacing)
  - Hardware specialization (application-specific hardware, DSP platforms)
  - Specialized network protocols
  - Special-purpose human interaction devices
  - Hardware-dependent testing approaches
  - Customized operating system (or custom non-OS task manager)
  - Power management

Domain-centric development:
Outside the consumer electronics area, in my experience it is rare to meet a deeply embedded system developer with a primary college degree in computer engineering or computer science.  Generally they have a degree more relevant to their product domain. Yet, nonetheless, here they are writing significant amounts of code for a living. Those trained in software development are also missing somewhat different pieces. Regardless of background, developers usually need to understand the following areas:
  - General software process and technical practice literacy (for domain experts) / Domain expertise (for software experts)
  - Life-cycle support for long-lived, hard-to-update products
  - Distributed and federated system architecture design
  - Domain-optimized development (e.g., model-based design for control systems)
  - Domain-specific aspects of security

Looking at this list, it becomes clear that skills such as knowing how to write super-optimized code are merely pieces of a larger puzzle. In general, you need to be at least literate in all the topics above to be a well-rounded embedded system developer.  Sure, not everyone and not every project needs deep expertise in everything. But if you're planning on a career in embedded systems you'll likely hit just about everything on the list -- I know that I certainly have. (And, if you're a hiring manager, now you have a shopping list for skills for your senior developers.)

Monday, August 17, 2015

Three Suggestions from Les Chambers

Les Chambers is a knowledgeable software engineer with plenty of experience in critical embedded systems.  He had the following three comments about what kinds of "paper" (documentation) are essential for larger-scale embedded systems, which are all spot on.  (His points are in bullets below with permission and light editing for this format.)
  • Without an architectural design document it is impossible to plan and manage any software project with more than two or three people coding. You've got to know numbers of things to estimate resource requirements. How many functions, how many screens, how many objects. I've seen regularly a problem with big teams coming together and floundering around with nothing productive to do because the architect is still pulling his ideas together and has no vehicle for communicating them to the team.

Les is correct. Any time you have a project with more than a couple people you really need an architectural document of some sort.  Ideally a single sheet of paper, usually with boxes and arrows, that shows you what the pieces are and how they fit together. Once you have 5 people on the team, this is absolutely mandatory, and in my experience as Les says you will have just chaos until that picture is nailed down.

A related problem I've seen is when the architecture document shows all the hardware boxes and communication links, but software is nowhere to be found in the picture. You need to either put software on that same diagram or have a separate picture for the software structure that is compatible with the hardware architecture. Chapter 10 of my book has some general rules to help in constructing these types of diagrams. However, I'll be the first to admit that creating a good architecture is as much art as science. If you want to really delve into this, the best systems architecture book I've found is Rechtin's book on System Architecting (The first edition is by far the best for an initial read, if you can find it. The 3rd edition by Maier & Rechtin has a lot more material, but is a bit more complex if you are just going for the essentials.)
  • Another issue you've hinted at but not explicitly stated is the importance of detailed rationales for design approaches. The symptom here is, six months down the track, someone questions a nonintuitive design approach, spends a week working through the design rationale and then decides, "... oh yes it was right in the first place." Worse: someone, not in possession of the facts, changes an approach that was made for rational reasons and injects bugs.

Yes, I've seen this one too. This can especially be a problem if a design decision is made for an extra-functional purpose such as safety. For example, consider an aircraft in which two cable bundles are run down different sides of an aircraft.  Someone later might conclude that it is cheaper and easier to run them next to each other. Functionally there is (at least at first glance) no difference. But the point of separating the wires was so that if physical damage occurs to one part of the aircraft only one of the two cable bundles will be affected. (Could this happen? Read about the United Airlines Flight 232 crash where three hydraulic lines were damaged where they ran too close together.)

In general, it is a good idea to capture not just requirements but also design decisions with rationale so that the basis for important decisions is not lost. This is especially important for long-lived systems that are likely to be maintained and updated over periods of decades, which is a common enough situation in the embedded systems world.
  • Another piece of paper I think should be added is the configuration management documentation. Exactly what versions of what software are running on what versions of what hardware where. I once had to tackle this problem on a project with in excess of 200 computers deployed all over a [geographically distributed embedded system] network. The symptoms were: people in the development shop spending a week working on reproducing a bug found on site and being unable to do so because they are working on the wrong version of the code. Large deployment teams turning out to do site installation and having to abandon because they were armed with the wrong version of the software – incompatible with other control computers. The obvious solution was a database application, which took me three months to build, and turned out to be very useful - more useful than a stack of procedures and paper records.
And again, I've seen this one as well.  It is common enough for the configuration management for older systems to be a filing cabinet full of hard copy printouts, sometimes with the software for every single field installation having been customized by a field engineer. Usually the paper copy is out of date with reality.  If you find a bug, how can you fix it if you don't really even know what software is out in the field?

Configuration management is important not just for your build process, but also to keep track of what's out in the field. The most basic requirement is that your device needs to be able to tell you what version of software is installed (for example, with a start-up message). Beyond that, you really want a database that you can run queries on to find out what's out there. Such databases often get stale, so it's also very helpful to make a configuration audit part of every time you touch the equipment for maintenance to keep the database updated.

Les writes a thought-provoking (and nicely styled) long-form blog on system engineering :  http://www.systemsengineeringblog.com/
The stories are interesting and well told, with a few twists and turns along the way.  For example, the article on Fagan inspections provides insight into how culture changes if you are serious about creating high quality software. The title gives you an idea of the style:  "Extreme Review: A Tale of Nakedness, Alsations and Fagan Inspection." Highly recommended reading.

Monday, July 20, 2015

Avoiding EEPROM and Flash Memory Wearout


Summary: If you're periodically updating a particular EEPROM value every few minutes (or every few seconds) you could be in danger of EEPROM wearout. Avoiding this requires reducing the per-cell write frequency. For some EEPROM technology anything more frequent than about once per hour could be a problem. (Flash memory has similar issues.)

Time Flies When You're Recording Data:

EEPROM is commonly used to store configuration parameters and operating history information in embedded processors. For example, you might have a rolling "flight recorder" function to record the most recent operating data in case there is a system failure or power loss. I've seen specifications for this sort of thing require recording data every few seconds.

The problem is that  EEPROM only works for a limited number of write cycles.  After perhaps 100,000 to 1,000,000 (depending on the particular chip you are using), some of your deployed systems will start exhibiting EEPROM wearout and you'll get a field failure. (Look at your data sheet to find the number. If you are deploying a large number of units "worst case" is probably more important to you than "typical.")  A million writes sounds like a lot, but they go by pretty quickly.  Let's work an example, assuming that a voltage reading is being recorded to the same byte in EEPROM every 15 seconds.

1,000,000 writes at one write per 15 seconds is 4 writes per minute:
  1,000,000 / (4 writes/minute * 60 minutes/hour * 24 hours/day) = 173.6 days.
In other words, your EEPROM will use up its million-cycle wearout budget in less than 6 months.

Working the same math for a range of update periods gives the time to wearout (in years) based on the period used to update any particular EEPROM cell. The crossover values for a 10 year product life are one update every 5 minutes 15 seconds for an EEPROM with a million cycle life, and one update every 52 minutes 36 seconds for a 100K cycle EEPROM. This means any hope of updates every few seconds just isn't going to work out if you expect your product to last years instead of months. Things scale linearly, although in real products secondary factors such as operating temperature and access mode can also play a role.
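If you want to run the numbers for your own part, the arithmetic is easy to wrap up (a sketch; plug in your data sheet's worst-case endurance figure):

// approximate years of EEPROM life for a given per-cell update period
double years_to_wearout(double update_period_sec, double write_cycle_limit)
{
    double seconds_of_life = update_period_sec * write_cycle_limit;
    return seconds_of_life / (365.25 * 24.0 * 60.0 * 60.0);   // seconds per year
}

// years_to_wearout(15.0,   1000000.0)  => ~0.48 years (about 174 days)
// years_to_wearout(315.0,  1000000.0)  => ~10 years (the 5 min 15 sec crossover)
// years_to_wearout(3156.0,  100000.0)  => ~10 years (the 52 min 36 sec crossover)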



Reduce Frequency
The least painful way to resolve this problem is to simply record the data less often. In some cases that might be OK to meet your system requirements.

Or you might be able to record only when things change more than a small amount, with a minimum delay between successive data points. However, with event-based recording be mindful of value jitter or scenarios in which a burst of events can wear out EEPROM.

(It would be nice if you could track how often EEPROM has been written. But that requires a counter that's kept in EEPROM ... so that idea just pushes the problem into the counter wearing out.)

Low Power Interrupt
In some processors there is a low power interrupt that can be used to record one last data value in EEPROM as the system shuts down due to loss of power. In general you keep the value of interest in RAM, and push it out to EEPROM only when you lose power.  Or, perhaps, you record it to EEPROM once in a while and push another copy out to EEPROM as part of shut-down to make sure you record the most up-to-date value available.

It's important to make sure that there is a hold-up capacitor that will keep the system above the EEPROM programming voltage requirement for long enough.  This can work if you only need to record a value or two rather than a large block of data. But it is easy to get this wrong, so be careful!

Rotating Buffer
The classical solution for EEPROM wearout is to use a rotating buffer (sometimes called a circular FIFO) of the last N recorded values. You also need a counter stored in EEPROM so that after a power cycle you can figure out which entry in the buffer holds the most recent copy. This reduces EEPROM wearout proportionally to the number of copies of the data in the buffer. For example, if you rotate through 10 different locations that take turns recording a single monitored value, each location gets modified 1/10th as often, so EEPROM wearout is improved by a factor of 10.

You also need to keep a separate counter or timestamp for each of the 10 copies so you can sort out which one is the most recent after a power loss.  In other words, you need two rotating buffers: one for the value, and one to keep track of the counter. (If you keep only one counter location in EEPROM, that counter wears out since it has to be incremented on every update.)

The disadvantage of this approach is that it requires 10 times as many bytes of EEPROM storage to get 10 times the life, plus 10 copies of the counter value.  You can be a bit clever by packing the counter in with the data. And if you are recording a large record in EEPROM then an additional few bytes for the counter copies aren't as big a deal as the replicated data memory. But any way you slice it, this is going to use a lot of EEPROM.
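Here is a minimal sketch of the idea for one recorded value (the EEPROM driver calls, slot layout, and sizes are made up for illustration; real code also has to handle first-time initialization, sequence number wraparound, and write ordering if power can be lost mid-update):

#define NUM_SLOTS  10u        // 10x wearout improvement, at the cost of 10x the EEPROM space
#define SLOT_BYTES 4u         // each slot: uint16_t sequence number + uint16_t value

static uint8_t  latest_slot;  // slot holding the most recent record
static uint16_t latest_seq;   // its sequence number

void record_value(uint16_t value)       // called each time a new value is logged
{
    latest_slot = (latest_slot + 1u) % NUM_SLOTS;          // rotate to the next slot
    latest_seq++;                                           // newer record = higher sequence
    ee_write16(latest_slot * SLOT_BYTES,      latest_seq);  // hypothetical EEPROM write call
    ee_write16(latest_slot * SLOT_BYTES + 2u, value);
}

void find_latest_after_power_up(void)   // scan all slots; highest sequence number wins
{
    latest_seq = 0u;  latest_slot = 0u;
    for (uint8_t i = 0u; i < NUM_SLOTS; i++)
    {   uint16_t seq = ee_read16(i * SLOT_BYTES);           // hypothetical EEPROM read call
        if (seq >= latest_seq) { latest_seq = seq; latest_slot = i; }
    }
}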

Atmel has an application note that goes through the gory details:
AVR-101: High Endurance EEPROM Storage:  http://www.atmel.com/images/doc2526.pdf

Special Case For Remembering A Counter Value
Sometimes you want to keep a count rather than record arbitrary values. For example, you might want to count the number of times a piece of equipment has cycled, or the number of operating minutes for some device.  The worst part of counters is that the bottom bit of the counter changes on every single count, wearing out the bottom count byte in EEPROM.

But, there are special tricks you can play. An application note from Microchip has some clever ideas, such as using a gray code so that only one byte out of a multi-byte counter has to be updated on each count. They also recommend using error correcting codes to compensate for wear-out. (I don't know how effective ECC will be at wear-out, because it will depend upon whether bit failures are independent within the counter data bytes -- so be careful of using that idea). See this application note:   http://ww1.microchip.com/downloads/en/AppNotes/01449A.pdf

Note: For those who want to know more, Microchip has a tutorial on the details of wearout with some nice diagrams of how EEPROM cells are designed:
ftp://ftp.microchip.com/tools/memory/total50/tutorial.html

Don't Re-Write Unchanging Values
Another way to reduce wearout is to read the current value in a memory location before updating. If the value is the same, skip the update, and eliminate the wearout cycle associated with an update that has no effect on the data value. Make sure you account for the worst case (how often can you expect values to be the same?). But even if the worst case is bad, this technique will give you a little extra margin of safety if you get lucky once in a while and can skip writes.
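A minimal sketch of the read-before-write idea (the byte-level driver calls are hypothetical; some vendor EEPROM libraries already provide an equivalent "update" call that does exactly this):

// write a byte only if the stored value actually differs
void ee_update_byte(uint16_t addr, uint8_t value)
{
    if (ee_read_byte(addr) != value)   // hypothetical read call
    {   ee_write_byte(addr, value);    // hypothetical write call; wearout only when data changes
    }
}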

If you've run into any other clever ideas for EEPROM wearout mitigation please let me know.

Learning More
Nash Reilly has a nice series of tutorial postings on how Flash/EEPROM technology works. (I found out about these via Jack Ganssle's newsletter.)
http://cushychicken.github.io/nand-pt1-transistors/
http://cushychicken.github.io/nand-pt2-floating/
http://cushychicken.github.io/nand-pt3-arrays/
http://cushychicken.github.io/nand-pt4-pages-blocks/
http://cushychicken.github.io/nand-pt5-how-nand-breaks/
http://cushychicken.github.io/nand-pt6-dealing-with-flaws/ 
http://cushychicken.github.io/inconvenient-truths/

Oct 2019: Tesla is said to have a flash wearout problem for its SSDs.  https://insideevs.com/news/376037/tesla-mcu-emmc-memory-issue/


Sunday, May 3, 2015

Counter Rollover Bites Boeing 787

Counter rollover is a classic mistake in computer software.  And, it just bit the Boeing 787.

The Problem:

The Boeing 787 aircraft's electrical power control units shut down if powered without interruption for 248 days (a bit over 8 months). In the likely case that all the control units were turned on at about the same time, that means they all shut down at the same time -- potentially in the middle of a flight. Fortunately, the power is usually not left on for 8 continuous months, so apparently this has not actually happened in flight.  But the problem was seen in a long-duration simulation and could happen in a real aircraft. (There are backup power supplies, but do you really want to be relying on them over the middle of an ocean?  I thought not.) The fix is turning off the power and turning it back on every 120 days.

That's right -- the FAA is telling the airlines they have to do a maintenance reboot of their planes every 120 days.

(Sources: NY Times ; FAA)


Analysis:

Just for fun, let's do the math and figure out what's going on.
248 days * 24 hours/day * 60 minutes/hour * 60 seconds/minute = 21,427,200 seconds
Hmmm ... what if those systems keep time as a 32-bit signed integer in hundredths of a second? The maximum positive value for such a counter would give:
0x7FFFFFFF = 2,147,483,647 hundredths of a second; divide by 100 to get 21,474,836 seconds; divide by (24*60*60) to get 248.55 days.
Bingo!

If they had used a 32-bit unsigned it would still overflow after twice as long = 497.1 days.
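To make the arithmetic easy to reuse, here's a quick sketch:

// days until a free-running counter rolls over
double days_to_rollover(double max_count, double ticks_per_second)
{
    return max_count / ticks_per_second / (24.0 * 60.0 * 60.0);
}

// days_to_rollover(2147483647.0, 100.0)  => ~248.55 days (signed 32-bit, 10 msec ticks)
// days_to_rollover(4294967295.0, 100.0)  => ~497.1 days  (unsigned 32-bit, 10 msec ticks)
// days_to_rollover(4294967295.0, 1000.0) => ~49.7 days   (unsigned 32-bit, 1 msec ticks)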


Other Examples:

This is not the first time a counter rollover has caused a problem.  Some examples are:

  • IBM: Interface adapters hang after 497 days of uptime [IBM]
  • Windows 95: hang after 49.7 days without reboot, counting in milliseconds [Microsoft]  
  • Hong Kong rail service outage [Blog]
There are also plenty of date roll-over bugs:
  • Y2K: on 1 January 2000 (overflow of 2-digit year from 99 to 00)   [Wikipedia]
  • GPS: 1024 week rollover on 22 August 1999 [USCG]
  • Year 2038: Unix time will roll over on 19 January 2038 [Wikipedia]

There are also somewhat related capacity overflow issues such as 512K day for IPv4 routers.

If you want to dig further, there is a "zoo" of related problems on Wikipedia:  "Time formatting and storage bugs"


Friday, May 1, 2015

How To Report An Unintended Acceleration Problem

Every once in a while I get e-mail from someone concerned about unintended acceleration that has happened to them or someone they know.  Commonly they go to the car dealer and get told (directly or indirectly) that it must have been the driver's fault. I'm sure that must be a frustrating experience.

Fortunately, you can do more than just get blown off by the dealer (if you feel like that is what happened to you).  Visit the US Dept. of Transportation's complaint system and file a complaint with their Office of Defects Investigation (ODI).

What this does is put information into the database that DoT uses to look for unsafe trends in vehicles. ODI conducts defect investigations and administers safety recalls, and this database is a primary source of information for them.  Putting in an entry does not mean that anyone will necessarily get back to you about your particular complaint, but eventually if enough drivers have similar problems with a particular vehicle type, ODI is supposed to investigate. You should be sure to use several different words and phrases to thoroughly describe your situation since often this database is searched via key words. (That means that if they are looking for a particular trend they might only look at records that contain a specific word or phrase, not all records for that vehicle.)  You should include specifics, and in particular things that you can recall that would suggest it is not simply driver error. But, realize that the description you type in will be publicly available, so think about what you write. 

To be sure, this should not be the only thing you do.  If you believe you have a problem with your vehicle, you should talk to the dealer and perhaps escalate things from there. (If it happened to me I would at a minimum demand that a written problem report be filed with the manufacturer's central defects office and demand a written response from the manufacturer's customer relations office, to leave a record.)  But, if you skip the DoT database then one of the important feedback mechanisms independent of the car companies that triggers recalls won't have the data it needs to work. If you had an incident that did not result in a police report or insurance claim, reporting is especially important, since there is no other way for DoT or the manufacturer to even know it happened.

Even if you haven't suffered unintended acceleration, you might be interested to look at complaints others have filed for your vehicle type, which are publicly available. And of course you can report any defect you like, not just acceleration issues. 

I recently came across the web site:  http://www.safercar.gov/
This has general car safety information and also has a way to file a vehicle safety complaint, including a specific page for filing a safety complaint (https://www-odi.nhtsa.dot.gov/ivoq/).  One would hope the data ends up in the same place, but I don't have information either way on that.

 (For those who are interested in how the keyword search might be done, you can see a NHTSA Document for an example from the Toyota UA investigations.)

Monday, April 13, 2015

FAA CRC and Checksum Report


I've been working for several years on an FAA document covering good practices for CRC and Checksum use.  At long last this joint effort with Honeywell researchers has been issued by the FAA as an official report.  Those of you who have seen my previous tutorial slides will recognize much of the material. But this is the official FAA-released version.

Selection of Cyclic Redundancy Code and Checksum Algorithms to Ensure Critical Data Integrity
DOT/FAA/TC-14/49
Philip Koopman, Kevin Driscoll, Brendan Hall

http://users.ece.cmu.edu/~koopman/pubs/faa15_tc-14-49.pdf

(Don't miss the webinar tutorial on this other Blog post.)

Abstract:

This report explores the characteristics of checksums and cyclic redundancy codes (CRCs) in an aviation context. It includes a literature review, a discussion of error detection performance metrics, a comparison of various checksum and CRC approaches, and a proposed methodology for mapping CRC and checksum design parameters to aviation integrity requirements. Specific examples studied are Institute of Electrical and Electronics Engineers (IEEE) 802.3 CRC-32; Aeronautical Radio, Incorporated (ARINC)-629 error detection; ARINC-825 Controller Area Network (CAN) error detection; Fletcher checksum; and the Aeronautical Telecommunication Network (ATN)-32 checksum. Also considered are multiple error codes used together, specific effects relevant to communication networks, memory storage, and transferring data from nonvolatile to volatile memory.

Key findings include:
  • (1) significant differences exist in effectiveness between error-code approaches, with CRCs being generally superior to checksums in a wide variety of contexts; 
  • (2) common practices and published standards may provide suboptimal (or sometimes even incorrect) information, requiring diligence in selecting practices to adopt in new standards and new systems; 
  • (3) error detection effectiveness depends on many factors, with the Hamming distance of the error code being of primary importance in many practical situations; 
  • (4) no one-size-fits-all error-coding approach exists, although this report does propose a procedure that can be followed to make a methodical decision as to which coding approach to adopt; and 
  • (5) a number of secondary considerations must be taken into account that can substantially influence the achieved error-detection effectiveness of a particular error-coding approach.
You can see my other CRC and checksum posts via the CRC/Checksum label on this blog.

Be sure to see the webinar version of this report.

Official FAA site for the report is here: http://www.tc.faa.gov/its/worldpac/techrpt/tc14-49.pdf



Monday, February 2, 2015

Tester To Developer Ratio Should Be 1:1 For A Typical Embedded Project

Believe it or not, most high quality embedded software I've seen has been created by organizations that spend twice as much on validation as they do on software creation.  That's what it takes to get good embedded software. And it is quite common for organizations that don't have this ratio to be experiencing significant problems with their projects. But to get there, the head count ratio is often about 1:1, since developers should be spending a significant fraction of their time doing "testing" (in broad terms).

First, let's talk about head count. Good embedded software organizations tend to have about equal number of testers and developers (i.e., 50% tester head count, 50% developer head count). Also, typically, the testers are in a relatively independent part of the organization so as to reduce pressure to sign off on software that isn't really ready to ship. Ratios can vary significantly depending on the circumstance.  5:1 tester to developer ratio is something I'd expect in safety-critical flight controls for an aerospace application.  1:5 tester to developer ratio might be something I'd expect on ephemeral web application development. But for a typical embedded software project that is expected to be solid code with well defined functionality, I typically expect to see 1:1 tester to developer staffing ratios.

Beyond the staffing ratio is how the team spends their time, which is not 1:1.  Typically I expect to see the developers each spend about two thirds of their time on development, and one-third of their time on verification/validation (V&V). I tend to think of V&V as within the "testing" bin in terms of effort in that it is about checking the work product rather than creating it. Most of the V&V effort from developers is spent doing peer reviews and unit testing.

For the testing group, effort is split between software testing, system testing, and process quality assurance (SQA).  Confusingly, testing is often called "quality assurance," but by SQA what I mean is time spent creating, training the team on, and auditing software processes themselves (i.e., did you actually follow the process you were supposed to -- not actual testing). A common rule of thumb is that about 5%-6% of total project effort should be SQA, split about evenly between process creation/training and process auditing. So that means in a team of 20 total developers+testers, you'd expect to see one full time equivalent SQA person. And for a team size of 10, you'd expect to see one person half-time on SQA.

Taking all this into account, below is a rough shot at how a project should break down in terms of head count and effort distribution.

Head count for a 20 person project:
  • 10 developers
  • 9 testers
  • 1 SQA (both training and auditing)
Note that this does not call out management, nor does it deal with after-release problem resolution, support, and so on. And it would be no surprise if more testers were loaded at the end of the project than the beginning, so these are average staffing numbers across the length of the project from requirements through product release.

Typical effort distribution (by hours of effort over the course of the project):
  • 33%: Developer team: Development (left-hand side of "V" including requirements through implementation)
  • 17%: Developer team: V&V (peer reviews, unit test)
  • 44%: V&V team: integration testing and system testing
  • 3%: V&V team: Process auditing
  • 3%: V&V team: Process training, intervention, improvement
If you're an agile team you might slice up responsibilities differently and that may be OK. But in the end the message is that you're probably only spending about a third of your time actually doing development compared to verification and validation if you want to create very high quality software and ensure you have a rigorous process while doing so.  Agile teams I've seen that violated this rule of thumb often did so at the cost of not controlling their process rigor and software quality. (I didn't say the quality was necessarily bad, but rather what I'm saying is they are flying blind and don't know what their software and process quality is until after they ship and have problems -- or don't.  I've seen it turn out both ways, so you're rolling the dice if you sacrifice V&V and especially SQA to achieve higher lines of code per hour.)

There is of course a lot more to running a project than the above. For example, there might need to be a cross-functional build team to ensure that product builds have the right configuration, are cleanly built, and so on. And running a regression test infrastructure could be assigned to either testers or developers (or both) depending on the type of testing that was included. But the above should set rough expectations.  If you differ by a few percent, probably no big deal. But if you have 1 tester for every 5 developers and your effort ratio is also 1:5, and you are expecting to ship a rock-solid industrial-control or similar embedded system, then think again.

(By way of comparison, Microsoft says they have a 1:1 tester to developer head count. See: How We Test Software at Microsoft, Page, Johnston & Rollison.)








