Implementing Plausible Crash Recovery
Yesterday we announced Plausible Crash Recovery, a working crash recovery system built on top of PLCrashReporter. Upon a crash, the recovery implementation steps backwards from the crashing function, restoring non-volatile register state and returning nil to the original caller.
Despite the fact that it was released on April 1st and indeed, was a prank, it does actually work. That doesn’t mean that using it is a good idea, though, and today I figured I’d explain how it works, why you don’t want to use it as-is, and where the underlying technology might actually be applicable.
If you haven’t already checked out the source and played with it yourself, give it a go. You can try plugging in some different crashing bugs of your own, and see how it behaves.
As a fair warning, I’m going to sacrifice some precision in my explanations below for the sake of overall clarity; there are a lot of details and edge cases that must be accounted for when implementing a crash reporter (or in this case, a crash recovery system), and if you’re interested in a digging in further, feel free to stop by and chat with us on the freenode #plcrashreporter IRC channel.
April Fools
I know a number of folks assumed — like many of the other April Fools absurdities — that Plausible Crash Recovery didn’t actually work. Despite the fact that we bolted on a goofy UI, the code works as advertised, as befits a proper hack. The “restoration” UI, despite being totally unnecessary, even shows the actual steps taken to restore thread state; the only exception being the “Reticulating Splines” step – that part I made up.
The last time I was privy to a fun April Fools Day prank was back in the 90s, when some co-workers implemented a local man-in-the-middle attack on common stock ticker sites, proxying and adjusting the returned data for our .com to show a precipitous fall (or rise, I don’t recall which). What was fun about the prank wasn’t the actual effect it had on people — as I recall, nobody was seriously fooled, which was probably a good thing for all involved.
Rather, what made my coworker’s prank fun (for me, anyway) was that it was a good hack. It was the kind of wacky technical implementation that you can do when you decide that it’s OK to break the rules and see what neat ideas come of it. It reminded me of the ethos that drove the fabled MacHack conferences, the source of gems such as Quinn “The Eskimo!”‘s 2002 “Best Hack” winner, Firestarter, which demonstrated that just plugging in a firewire cable was enough to allow DMA writes to the target computer’s video buffer (in this case, displaying flames at the bottom of the screen).
So when it struck me that PLCrashReporter actually had the tooling necessary to implement a bad clone of a bad Visual Basic feature, actually implementing Crash Recovery seemed like a good April 1st hack — in the classic meaning of a hack.
Of course, like most hacks, the fact that it mostly works doesn’t mean you should actually use it.
Rolling Back Time
I often think of PLCrashReporter itself as a “time machine debugger” — it ideally provides a view into the past that can be used to reconstruct the state of the process and debug a long-past failure. Crash Recovery takes this time machine metaphor much further — using PLCrashReporter’s async-safe stack unwinding to step backwards from the crashing function, restoring non-volatile register state and returning nil to the original caller.
To understand how this works, we first need to understand the state that represents a thread of execution, and what parts of it must be rewound to return to the original caller — as well as what can’t be rewound, but really should be.
For a any given process, if you were to pause it at a moment in time — say, when a crash occurs — the crashed function’s execution state would be fully encapsulated in:
- Global State (including the heap, file descriptors, memory mappings, etc …)
- Thread Stack
- CPU Register State
If we want to restart execution in the crashed function’s caller, we need to work backwards from the current process state to restore as much of the caller’s previous state as we can.
Global State
Global state includes (but certainly isn’t limited to) the heap, file descriptors, shared data structures, and even the process’ current working directory. Any part of global state that is changed during execution of the crashing function represents modified state that must be rolled back if we wish to perfectly restore the thread to its pre-crashed state.
Unfortunately, restoring global state is a non-starter — there’s no way for us to know what was changed. For example, if the crashed thread has has corrupted the heap (or it was already corrupt), we can’t restore the heap to a non-corrupted state, and the application will likely just crash again. However, there are plenty of crashes that don’t involve corruption or mutation of global state, in which case we don’t need to restore any global state to allow execution to continue in the caller.
In other words, despite this limitation, we can actually recover from a large number of common crashes despite not having the facility to roll back global state. Of course, the less mutable shared state you use, the more recoverable your crashes are — funny as that might sound in the context of an April Fool’s hack, that’s actually the principal behind “fail fast” semantics often supported in functional programming languages. A failed thread can simply be discarded if there’s no chance it will leave behind partially modfied shared state, and the process state will remain fully consistent.
Thread Stack
The thread’s stack maintains state for each called subroutine via a series of stack frames. At the time of the crash, the current stack frame is represented via the following state:
- The stack pointer points to the current top of the stack. Any new stack allocations will likewise adjust the stack pointer.
- The frame pointer usually – depending on the architecture, calling conventions, and emitted code — points to a fixed stack allocation that is at a fixed offset from the caller’s original stack pointer, and is used to store the caller’s return address and original frame pointer.
- The return address is the address to which the called function should return upon completion. Depending on the architecture and calling conventions, this may be stored in a register (as it is on ARM), or may be stored on the stack via the frame pointer (x86).
To restore the caller’s original stack, as well as to determine the code address at which we should restart execution to simulate a return
, we need to derive the caller’s stack pointer, frame pointer, and *return address *from the current thread’s stack state. If we’re able to successfully determine those original values, then we’ll have successfully restored the stack, as well as the execution address.
Of course, if the crashed function smashes any of this data, or the caller’s stack frame, or some other critical data on the stack, we can’t actually recover reliably; should we try, we’ll likely just trigger a secondary crash.
While that’s the basic premise, the actual process of performing the stack unwinding is a bit tricky. On some architectures (including 32-bit iOS and 32-bit Mac OS X), the frame pointer is almost always stored in a fixed register, and can be easily fetched from the crashed thread’s register state. The caller’s original stack state can simply be directly fetched or computed from the frame pointer register.
On other architectures, however, things aren’t so simple. On Mac OS X x86-64, for example, there is no requirement that the frame pointer be saved in a machine register. Instead, additional unwind data is provided by the compiler; this data defines how the caller’s state may be restored from the current thread state: values may be computed from existing registers, existing stack values, as fixed offsets, or through a variety of other mechanisms. This relates to how we restore register state, and we’ll cover how this works in more detail below.
Register State
The crashed thread’s register state represents the processor’s execution state at the time of the crash. During execution, the crashed function may have overwritten some of the caller’s register values; since the crashed function will never have the opportunity to restore those overwritten values, restoring the caller’s state will require that we somehow determine:
- Which registers are expected by the caller to have been preserved (ie, caller-preserved registers).
- Which of those registers have been modified and require restoration.
- How to actually restore the original values for those registers.
To answer the first question, we simply need to look at the platform’s defined calling conventions. For Apple’s platform, these are defined in the iOS ABI Function Call Guide and Mac OS X ABI Function Call Guide. The calling conventions define callee-preserved and caller-preserved registers:
- Callee-preserved registers (or, non-volatile registers) must be preserved by the called function, if it overwrites the caller’s original register value(s). These are the registers that we must restore, if they’ve been overwritten.
- Caller-preserved registers (or, volatile registers) must be preserved by the calling function, if it requires later access to those values. These registers may be freely overwritten, and do not need to be restored prior to returning to the caller.
This answers the first question, but we’re left with a connondrum — when execution stops in the middle of a crashing function, how do we know which non-volatile registers have been modified, and how do we know how to restore their original values?
Unfortunately, on Apple’s 32-bit platforms (ARM and i386), the answer is that we don’t. This information is not available, and we simply have to restore the stack state we can and hope that’s enough. Surprisingly, this actually works fairly often. It is, of course, a terrible idea, and one of many good reasons why “crash recovery” ought to be considered a hack, and not an actually useable product.
On Apple’s 64-bit platforms (x86-64 and ARM64), however, this information is provided via the same *unwind data *that allow us to pop the thread’s stack frame; we can interpret the unwind data at crash time to perform non-volatile register restoration.
Leveraging Unwind Data
Background: Exception Unwinding
We’ve already established that on 32-bit Apple platforms, our ability to unwind the stack is limited due to the lack of unwinding data. The reason for this actually has to do with how exceptions are handled on the platform. On 32-bit Apple systems, when a try/catch/finally block is declared, the current thread’s state is actually saved via setjmp() (or equivalent functionality), and pushed onto a per-thread stack of exception handlers; when it comes time to find an exception handler, the stack is popped until a matching handler is reached, and the equivalent of longjmp() is used to re-load that thread state, resuming execution.
This approach has two downsides; first of all, there’s no way for a debugger or crash reporter to use the exception unwinding information to unwind arbitrary intermediate frames. The only time exception unwinding information is available is when a catch
block is executed, and in that case, it’s only possible to restore the specifically saved thread state. Secondly, there is the issue of runtime cost. At each catch
or finally
statement, the thread state must be saved and pushed onto a stack, even if it’s never used.
The alternative approach, and what is used on Apple’s 64-bit platforms, is the use of so-called zero-cost exceptions. Rather than recording thread state at runtime, the compiler builds a lookup table that covers *all code*in an executable. This table defines how to accurately unwind a single frame from any valid instruction address, as well as providing language/runtime-specific definitions of where try/catch/finally blocks are defined, and how to handle them.
As a result, it’s not necessary to do any work at runtime if an exception is not thrown; hence the name “zero-cost exceptions”. If an exception is thrown, the language/exception runtime must consult the lookup table to correctly unwind the stack.
As it turns out, this is exactly the same information that debuggers, crash reporters, and evil crash recovery hacks need to perform their own stack unwinding.
Interpreting the Unwind Data
To correctly unwind a frame in our crash recovery system, we need to actually interpret the unwind data, and extract the rules necessary to calculate, load, or otherwise restore the caller’s original register and stack state.
Conceptually, it helps to think of the unwind data as being stored as two-column table; each row in the table represents an instruction address within the binary (the first column), and (in the second column) are the unwind instructions necessary to restore the caller’s state. To perform the unwind operation, we first need to find the row that represents the instruction at which the crash occured, and then apply any restoration rules defined at that row.
In reality, such a direct encoding of the unwind table would be prohibitively enormous. To solve that, complex encoding schemes are used to minimize duplication and maximize data re-use; on Apple platforms, these are DWARF and Apple’s own Compact Unwind.
DWARF
DWARF is a (mostly) platform architecture-neutral standard for defining debugging information, including unwind data. To add support for a new architecture in PLCrashReporter’s DWARF implementation, it’s generally sufficient to simply add a mapping between DWARF register numbers defined by the platform vendor, and the actual registers they represent; interpreting the format operates entirely in terms of these abstract register numbers.
The encoding is capable of representing almost any possible set of unwind rules; the lookup tables and restoration rules are implemented as an versatile interpreted bytecode, including a turing complete set of DWARF expression opcodes. Amusingly, this aspect of DWARF has been used to implement in-process arbitrary code execution without native code. If you’re curious, you can peruse PLCrashReporter’s DWARF expression interpreter source here.
While enormously useful (and necessary!), the versatility of DWARF comes at the cost of the encoding’s conciseness; this is what Apple set out to address with their non-standard Compact Unwind encoding.
Compact Unwind
Apple’s Compact Unwind encoding is architecture-specific, non-portable, and is unique to Apple. Whereas DWARF can represent almost any set of rules necessary to perform unwinding, Apple took a different approach with the compact unwind encoding — it’s only capable of representing a limited set of unwinding rules, but these rules cover all (or just about all) of the code constructs actually emitted by the compiler. In exchange for these limitations, the compact unwind encoding is, well, compact; it’s much smaller than the corresponding DWARF representation, which means appreciably smaller binaries.
The Compact Unwind encoding can’t represent the full range of unwinding rules that may be required, and as such, it’s used in concert with DWARF. At link time, any DWARF rules that can be represented using the compact unwind encoding will be converted by ld
, and the DWARF data will be discarded.
Since the original DWARF data is discarded, this means that correct crash reporting (and, in the case of Crash Recovery, frame unwinding) requires both DWARF and Compact Unwind support. You can find you can find PLCrashReporter’s Compact Unwind implementation (for x86-64 and ARM64) here.
Applying the Unwind Changes
As part of our work to support crash reporting on 64-bit platforms, we already had implemented full DWARF and Compact Unwind support in PLCrashReporter, including the APIs necessary to represent register modifications across stack frames; we implemented this with the eventual goal of including non-volatile register state for *all *frames in all threads in the crash report.
We had to do very little to implement the Crash Recovery system — it was a simple matter of calling directly into our unwinding APIs from our signal handler, and applying the computed register results to the ucontext_t
containing the signal thread state. If you return directly from a signal handler, any changes made to ucontext_t
within the signal handler will be applied to the target thread — by modifying the ucontext_t
, we’re able to update the stack pointer, frame pointer, as well as any non-volatile registers. In addition, by setting the instruction pointer, we actually cause the thread to resume in the crashed function’s caller upon return from the signal handler.
Since it’s just a little bit of glue on top of PLCrashReporter’s existing async-safe APIs, the Crash Recovery code only took about a day to write; if you’d like to take a look, the signal handler additions can be found here.
Returning Nil to the Caller
Having implemented unwinding, the last thing we needed to do was set the return value to nil. I have to admit we cheated a bit here; we ignored floating point and structure return types.
To handle pointer return types (including Objective-C objects), we simply set the return address register to 0x0
. This handles most return values, but in the case where structures are returned on the stack, or a special handling is required for floating point, you’ll see unexpected results.
Conclusion
While the Crash Recovery implementation is an interesting technical exploration of what’s possible, it would be a terrible idea to actually use it as a blanket “fix” for crashes, even if it worked absolutely perfectly. The nature of crash is such that the current process state is, by default, undefined; if it was defined, it wouldn’t have crashed. Blindly attempting to proceed can do worse than crash; data corruption and deadlocks are entirely likely.
That doesn’t mean that this avenue of exploration is bereft of value, however. For example, if we extended the PLCrashReporter APIs to directly support the idea of “patch and continue”, we could support some pretty common operations that currently require custom per-architecture+platform implementations in runtime VMs, such as trapping “optimistic” error handling cases – managed code could use this mechanism to exclude NULL or divide-by-zero checks in generated machine code, instead trapping the signals, verifying that the failure occurs within managed code, and converting the signal into a language-level stack-unwinding NullPointerException or DivideByZero exception.
A more aggressive avenue of exploration is the idea of emergency “hot patches” deployable directly from a crash reporting service. If your shipped application is unexpectedly crashing across your entire user-base with a call to CFRelease(NULL)
, and you know it’s safe to work around the issue, a crash reporting service could support feeding a PLCrashReporter-based hotpatch to your application, working around the issue until you could actually ship a release.
After all, there’s no upside to having customers being frustrated and continuing to submit crash reports for a known issue.
It’s not clear that we’ll see any of these ideas — or any of the others floating around our heads — in an actual shipping product, but it’s mighty fun to hack something out and see what they might look like.