Wednesday, July 23, 2014

Don’t Overflow the Stack

Somewhere in your embedded system is the stack (or several of them for some multitasking systems). If you blow up the stack, your software will crash. Or worse, especially if you don’t have memory protection. For a critical system you need to make sure the stack has some elbow room and make sure that you know when you have a stack size problem.

Also, you shouldn’t ever use recursion in a safety critical embedded system. (This shouldn’t even need to be said, but apparently it does.)

Consequences: The consequences of not understanding maximum stack depth can be seemingly random memory corruption as the stack overwrites RAM variables. Whether or not this actually causes a program crash depends upon a number of factors, including chance, and in the worst case it can cause unsafe program behavior without a crash.

Accepted Practices:
  • Compute worst case stack depth using a static analysis tool.
  • Include a stack sentinel or a related technique in the supervisor task and perform a graceful shutdown or reset prior to an actual overflow.
  • Avoid all recursion to ensure that worst case stack depth is bounded.
Discussion:
The “stack” is an area in memory used for storing certain types of data. For example, in the C programming language this is where non-static local variables go. The stack gets bigger every time a subroutine is called, usually holding the subroutine return address, parameter values that have been passed, and local variables for each currently active subroutine. As nested subroutines are called, the stack keeps getting bigger as more information is added to perform each deeper and deeper call. When each subroutine completes, its stack space is released for later reuse, un-nesting the layers of stack usage and shrinking the stack size. Thus the maximum size of the stack is determined by how many subroutines are called on top of each other (the depth of the subroutine call graph), as well as the storage space needed by each of those subroutines, plus additional area needed by any interrupt service routines that may be active at the same time.
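As a concrete illustration, here is a minimal C sketch (illustrative only, with made-up local buffer sizes) of how nested calls stack their frames on top of one another:

#include <stdint.h>

/* Hypothetical sketch: each nested call adds a frame (return address, passed
 * parameters, and locals) to the stack, so worst case depth follows the
 * deepest chain of calls plus any interrupt service routines that may be
 * active at the same time. Sizes in comments are rough illustrations only. */
static void leaf(int32_t x)
{
    volatile int32_t buf[8];          /* roughly 32 bytes of locals */
    buf[0] = x;
}

static void middle(int32_t x)
{
    volatile int32_t buf[16];         /* roughly 64 bytes of locals */
    buf[0] = x;
    leaf(x);                          /* leaf's frame stacks on top of this one */
}

void outer(int32_t x)
{
    volatile int32_t buf[32];         /* roughly 128 bytes of locals */
    buf[0] = x;
    middle(x);
}
/* Worst case stack use of outer() is roughly the sum of all three frames
 * plus call/return overhead. */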

Some processors have a separate hardware stack for subroutine return information and a different stack for parameters and variables. And many operating systems have multiple stacks in support of multiple tasks. But for the most part similar ideas apply to all embedded controllers, and we’ll just discuss the single-stack case to keep things simple.


It is common for stacks to grow “downward” from high memory locations to low locations, with global variables, Real Time Operating System (RTOS) task information, or other values such as the C heap being allocated from the low memory locations to high memory locations. There is an unused memory space in between the stack and the globals. In other words, the stack shares limited RAM space with other variables. Because RAM space is limited, there is the possibility of the stack growing so large that it overlaps variable storage memory addresses so that the stack corrupts other memory (and/or loads and stores to global memory corrupt the stack). To avoid this, it is accepted practice to determine the worst case maximum stack depth and ensure such overlap is impossible.

Maximum stack depth can be determined by a number of means. One way is to periodically sample the stack pointer while the program is running and find the maximum observed stack size. This approach is a starting point, but one should not expect that sampling will happen to catch the absolute maximum stack depth. Catching the worst case can be quite difficult because interrupt service routines also use the stack and run at times that are generally unpredictable compared to program execution flow. (Moreover, if you use a timer interrupt to sample the stack pointer value, you’ll never see the stack pointer when the timer interrupt is masked, leaving a blind spot in this technique.) So this is not how you should determine worst case stack depth.
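For illustration, here is a minimal sketch of the sampling idea in C. It assumes a descending stack and uses the address of a local variable as a rough stand-in for the stack pointer (a real implementation would typically read the stack pointer register directly); the name timer_tick_isr is just a placeholder for however your system hooks a periodic interrupt:

#include <stdint.h>

/* Hypothetical sketch of the sampling approach, assuming a descending stack.
 * Because this runs from a periodic interrupt, it only sees depths that
 * happen to exist at sample time and can miss the true worst case
 * (including any time this interrupt is masked). */
static volatile uintptr_t lowest_sp_seen = UINTPTR_MAX;

void timer_tick_isr(void)   /* assumed to be installed as a periodic timer interrupt */
{
    volatile uint8_t marker;
    const uintptr_t sp_estimate = (uintptr_t)&marker;   /* proxy for the stack pointer */

    if (sp_estimate < lowest_sp_seen)
    {
        lowest_sp_seen = sp_estimate;   /* deepest point observed so far */
    }
}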

A much better approach is to use static analysis tools specifically designed to find the worst case set of subroutine calls in terms of stack depth. This technique is effective unless the code has structures that confound the analysis (e.g., if a program uses recursion, this technique generally doesn’t work). When the technique does work, it has the virtue of giving an absolute bound without having to rely upon whether testing happened to have encountered the worst case. You should use this technique if at all possible. If static analysis isn’t possible because of how you have written your code, you should change your code so you actually can do static analysis for maximum stack depth. In particular, this means you should never use recursion in a critical system, both because it makes static analysis of stack depth impossible and because recursive routines are prone to stack overflows even when bug free, if they happen to be fed an exceptional input value that causes too many levels of recursion. (In general, using recursion in any small-microcontroller embedded system is a bad idea for these reasons even if the system is not safety critical.)
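As a small illustration of why recursion is easy to avoid, here is a hedged sketch showing a recursive routine whose stack depth depends on its input, alongside an iterative rewrite whose worst case stack depth is fixed and therefore analyzable:

#include <stdint.h>

/* Hypothetical sketch of removing recursion. The recursive version's stack
 * depth grows with n, so an unexpectedly large input can overflow the stack
 * even though the code is logically correct; the iterative version uses one
 * frame regardless of n, so its worst case stack depth is statically bounded. */
uint32_t factorial_recursive(uint32_t n)    /* what NOT to do in a critical system */
{
    return (n <= 1u) ? 1u : (n * factorial_recursive(n - 1u));
}

uint32_t factorial_iterative(uint32_t n)    /* bounded stack depth */
{
    uint32_t result = 1u;
    for (uint32_t i = 2u; i <= n; i++)
    {
        result *= i;
    }
    return result;
}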



Beyond static analysis of stack depth (or if static analysis isn’t possible and the system isn’t safety critical), an additional accepted practice is to use a “stack sentinel” to find the “high water mark” of the stack (I’ve also heard this called a “stack watermark” and a “stack guard”). With this technique, each memory location in the stack area of memory is initialized to a predetermined known value or pattern of known values. As the stack grows in size, it overwrites these known sentinel values with other values, leaving behind new patterns in memory, like footprints in fresh snow. Thus, it is easy to tell how big the stack has ever been since the system was started by looking for how far into the unused memory space the “footprints” of stack usage extend beyond the current stack size. This gives both the maximum stack size during a particular program run and the overall system maximum stack size, assuming that the worst case path has been executed. (Actually identifying the worst case path need not be done; it is sufficient to have run it at some time during program execution without knowing exactly when it happened. It will leave its mark on the stack memory contents when it does happen.)
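Here is a minimal sketch of stack painting in C. The names STACK_SIZE_WORDS, stack_area, and STACK_FILL are assumptions for illustration; a real system usually takes the stack region from the linker or RTOS, and does the painting in startup code (or paints only the not-yet-used portion of an active stack):

#include <stdint.h>
#include <stddef.h>

/* Hypothetical sketch of painting a stack region with a sentinel pattern and
 * later reading back the high water mark. Assumes the stack grows downward
 * toward stack_area[0]. */
#define STACK_SIZE_WORDS  256u
#define STACK_FILL        0xDEADBEEFu

static uint32_t stack_area[STACK_SIZE_WORDS];

void stack_paint(void)                 /* call once, before this region is used */
{
    for (size_t i = 0u; i < STACK_SIZE_WORDS; i++)
    {
        stack_area[i] = STACK_FILL;
    }
}

/* Number of words that still hold the fill pattern, i.e. stack space that has
 * never been "trampled" since stack_paint() ran. */
size_t stack_words_untouched(void)
{
    size_t count = 0u;
    while ((count < STACK_SIZE_WORDS) && (stack_area[count] == STACK_FILL))
    {
        count++;
    }
    return count;
}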


To convert this idea into a run-time protection technique, extra memory space between the stack and other data is set up as a sacrificial memory area, whose purpose is to detect unexpected stack growth before the stack can corrupt the globals or other RAM data. The program is run, and then memory contents are examined periodically to see how many stack area memory words are left unchanged. As part of the design validation process, designers compare the observed worst case stack depth against the static analysis computed worst case stack depth to ensure that they understand their system, have tested the worst case paths, and have left adequate margin in stack memory to prevent stack overflows if they missed some special situation that is even worse than the predicted worst case.

This sentinel technique should also be used in testing and in production systems to periodically check the sentinels in the sacrificial memory area. If you have computed a worst case stack depth at design time and you detect that the computed depth has been exceeded at run time, this is an indication that you have a very serious problem.
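A hedged sketch of such a periodic check, building on the painting sketch above, might look like the following. SACRIFICIAL_WORDS and safe_shutdown_or_reset() are assumed, system-specific names:

#include <stddef.h>

/* Hypothetical sketch of a periodic sentinel check, reusing
 * stack_words_untouched() from the earlier sketch. The idea is to act while
 * the stack is still inside the sacrificial zone, before it reaches globals
 * or other RAM data. */
#define SACRIFICIAL_WORDS  32u

extern size_t stack_words_untouched(void);
extern void safe_shutdown_or_reset(void);     /* system-specific graceful response */

void supervisor_stack_check(void)             /* run from a supervisor task or timer tick */
{
    if (stack_words_untouched() < SACRIFICIAL_WORDS)
    {
        /* Stack has grown into the sacrificial zone: the design limit was exceeded. */
        safe_shutdown_or_reset();
    }
}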

If the stack overflows past the sacrificial memory space into global variables, the system might crash. Or it might just corrupt global variables or RTOS system state and have who-knows-what effect. (In a safety critical system “who-knows-what” equals “unsafe.”)  Naively assuming that a stack overflow will always produce a system crash detectable by the watchdog is a dangerous practice. Sometimes the watchdog will catch a stack overflow. But there is no guarantee this will always happen. Consider that a stack-smashing attack is the security version of an intentional stack overflow, but is specifically designed to take over a system, not merely crash it. So a crash after a stack overflow is by no means a sure thing.

Avoiding stack overflow problems is a matter of considering program execution paths to avoid a deep sequence of calls, and accounting for interrupts adding even more to the stack. And, again, even though we shouldn’t have to say this: never use recursion.

While there isn’t a set number for how much margin to leave in terms of extra memory past the computed maximum stack size, there are two considerations. First, you want to leave enough room for an ample sacrificial area so that a problem with stack depth is unlikely to have enough time to go all the way through the sacrificial area and touch the globals before it is detected by a periodic timer tick checking sentinels. (Note: we didn’t say check current stack depth; we said periodically check sentinels to see if the stack has gotten too big between checks.)  Second, you want to leave some extra margin to account for the possibility that you just never encountered the worst-case stack depth in analysis or testing. I’ve heard designers say that keeping worst case stack depth to no more than 50% of available stack memory is a good idea. Above 90% use of stack memory (10% sacrificial memory area set aside) is probably a bad idea. But the actual number will depend on the details of your system.
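As a purely illustrative sizing example (the numbers are assumptions, not recommendations for any particular system), a compile-time check of the 50% rule of thumb might look like this:

/* Hypothetical sizing sketch. WORST_CASE_BYTES stands in for the number
 * reported by a static stack analysis tool; the other values are illustrative
 * choices that keep the computed worst case at roughly half of the allocated
 * stack, with a sacrificial zone left for the sentinel check. */
#define WORST_CASE_BYTES    1100u   /* assumed tool-reported worst case depth */
#define STACK_ALLOC_BYTES   2560u   /* worst case is then about 43% of the stack */
#define SACRIFICIAL_BYTES    256u   /* sentinel zone checked on each timer tick */

/* Compile-time guard (C11): fail the build if the 50% rule of thumb is violated. */
_Static_assert((WORST_CASE_BYTES * 2u) <= STACK_ALLOC_BYTES,
               "worst case stack depth exceeds 50% of the allocated stack");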

Selected Sources:
Stack sentinels and avoidance of recursion are an entrenched part of embedded systems folklore and accepted practices. Douglass mentions watermarking in Section 9.8.6 (Douglass 2002).
MISRA Software Guidelines discourage recursion, saying that it can cause unpredictable behavior (MISRA Guidelines, pg. 20).

MISRA C required Rule 70 explicitly bans recursion (MISRA C, 1998).

NASA recommends using stack guards (essentially the same as the technique that I call “stack sentinels”) to check for stack overflow or corruption (NASA 2004, p. 93).

Stack overflow errors are well known to corrupt memory and cause arbitrarily bad behavior of a running program. Regehr et al. provide an overview of research in that area relevant to embedded microcontrollers and ways to mitigate those problems (Regehr 2006). They note that “the potential for stack overflow in embedded systems is hard to detect by testing” (id., p. 776), with the point being that it is reasonable to expect that a system which has been tested but not thoroughly analyzed will have run-time stack overflows that corrupt memory.

References:
  • Douglass, B. P., Real-Time Design Patterns: robust scalable architecture for real-time systems, Pearson Education, first printing, September 2002, copyright by Pearson in 2003.
  • MISRA, (MISRA C), Guideline for the use of the C Language in Vehicle Based Software, April 1998.
  • MISRA, Development Guidelines for Vehicle Based Software, November 1994 (PDF version 1.1, January 2001).
  • NASA-GB-8719.13, NASA Software Safety Guidebook, NASA Technical Standard, March 31, 2004.
  • Regehr et al., Eliminating stack overflow by abstract interpretation, ACM Trans. Embedded Computing Systems, Nov. 2006, pp. 751-778.
