Partnership with BitStadium + HockeyApp
Since our first release of PLCrashReporter in 2008, it has come to be relied upon by analytics companies, developer tools providers, and internal corporate crash reporting services. We believe that PLCrashReporter is unrivaled as a reliable, stable, well-tested, and carefully constructed crash reporting tool.
Almost since the beginning, HockeyApp’s developers have contributed to PLCrashReporter — submitting patches, developing both open-source and commercial services around it, and, ultimately, funding the development of additional open-source features.
We are very excited to announce a long-term joint partnership between our two companies. This partnership will allow us to focus development efforts on further improving and expanding the reach of PLCrashReporter, as well as developing new features, services, and improvements for HockeyApp’s server and client products. The first result of this partnership will be the release of the next generation open source PLCrashReporter. Exclusive early access for HockeyApp customers and details are coming soon!
HockeyApp and Plausible Labs share a combined vision regarding the future of PLCrashReporter. We genuinely believe that complex tools such as PLCrashReporter should be open-source, in the same way that Apple provides kernel, compiler, and library sources, so as to allow for peer review and validation of the approaches we have taken in our technical implementation. Integrators — whether they be application developers or platform providers — should be certain of the robustness of the software, and that there is no use of private API or poor implementation that could harm their business or their customers’ interests.
To support this vision, we are launching plcrashreporter.org, a dedicated open-source project, administered by Plausible Labs. It is our goal to ensure that PLCrashReporter remains a trustworthy, free, and open-source solution to crash reporting on iOS, Mac OS X, and — in coming months — future platforms. We will also be founding the PLCrashReporter Consortium (modeled on SQLite’s), with the goal of sharing resources to fund the ongoing open-source development of PLCrashReporter.
HockeyApp is joining the PLCrashReporter Consortium as its founding member, and we look forward to other companies that rely on PLCrashReporter joining in supporting the project’s ongoing open-source development.
Exploring iOS Crash Reports
Introduction
As developers, when one of our applications crashes, we would like to gather enough information about the crash such that we can reason about its cause and (ideally) fix it. Crash reports generated and provided by tools and services such as iTunes Connect, PLCrashReporter, HockeyApp or others can look a bit daunting at first. We will demystify the important aspects of such reports, so that after reading this article one should be able to use crash report data more effectively.
We are going to be talking mainly about crashes of iOS applications, but some of the concepts can be carried over to crashes on other platforms.
Also, there are some reasons for crashes that are of special interest for iOS developers, such as when a watchdog timeout happens, or when the application is killed due to memory constraints. We are not going to cover those here, because others have already done that.
A First Look
Let us dive right in and consider a full, real-world crash report from a production app. It contains a lot of information, so we are going to break it up into more easily digestible fragments and discuss those individually.
The Header
At the start of every crash report, you’ll find the basic header:
Incident Identifier: A8111234-3CD8-4FF0-BD99-CFF7FDACB212
CrashReporter Key: A54D28AF-3010-5839-BBA6-FE72C8AFCC2E
Hardware Model: iPod3,1
Process: OurApp [476]
Path: /Users/USER/OurApp.app/OurApp
Identifier: coop.plausible.OurApp
Version: 48
Code Type: ARM
Parent Process: launchd [1]
Date/Time: 2013-03-30T04:42:07Z
OS Version: iPhone OS 5.1.1 (9B206)
Report Version: 104
Most of these fields are self-explanatory, but a few deserve note:
- Incident Identifier: Client-assigned unique identifier for the report.
- CrashReporter Key: This is a client-assigned, anonymized, per-device identifier, similar to the UDID. This helps in determining how widespread an issue is.
- Hardware Model: This is the hardware on which a crash occurred, as available from the “hw.machine” sysctl. This can be useful for reproducing some bugs that are specific to a given phone model, but those cases are rare.
- Code Type: This is the target processor type. On an iOS device, this will always be ‘ARM’, even if the code is ARMv7 or ARMv7s.
- OS Version: The OS version on which the crash occurred, including the build number. This can be used to identify regressions that are specific to a given OS release. Note that while different models of iOS devices are assigned unique build numbers (e.g., 9B206), crashes are only very rarely specific to a given OS build.
- Report Version: This opaque value is used by Apple to version the actual format of the report. As the report format is changed, Apple may update this version number. In PLCrashReporter, we generate and store reports in our own structured protobuf-based format, and generate Apple-compatible reports on-demand.
The Stack Trace
On iOS, an application is a single process that typically contains multiple active threads, including the main UI thread, the dispatch manager thread, and other worker threads for things like I/O. Each thread executes code, and a thread’s stack trace shows the chain of function calls that led it to its current place. A good crash report contains a stack trace for every thread at the time of the crash, and also tells us in which thread the crash occurred. We read a trace from the bottom up: the bottommost frame is the outermost call, and the topmost entry is the innermost function, i.e. the code that was executing at the time. The following shows an example crash on the main thread:
Thread 0 Crashed:
0 libsystem_kernel.dylib 0x3466e32c ___pthread_kill + 8
1 libsystem_c.dylib 0x3526829f abort + 95
2 OurApp 0x0015dfc3 uncaught_exception_handler + 27
3 CoreFoundation 0x3601b957 __handleUncaughtException + 75
4 libobjc.A.dylib 0x30f91345 __objc_terminate + 129
5 libc++abi.dylib 0x34d333c5 __ZL19safe_handler_callerPFvvE + 77
6 libc++abi.dylib 0x34d33451 __ZdlPv + 1
7 libc++abi.dylib 0x34d34825 ___cxa_current_exception_type + 1
8 libobjc.A.dylib 0x30f912a9 _objc_exception_rethrow + 13
9 CoreFoundation 0x35f7150d _CFRunLoopRunSpecific + 405
10 CoreFoundation 0x35f7136d _CFRunLoopRunInMode + 105
11 GraphicsServices 0x31876439 _GSEventRunModal + 137
12 UIKit 0x3192ecd5 _UIApplicationMain + 1081
13 OurApp 0x000938e7 main (main.m:16)
Thread 1:
...
If we eyeball that stack trace, we can easily see that the cause of the crash is an unhandled exception. (If you find yourself getting a lot of these, you might want to set up an Exception Breakpoint in Xcode, which takes you to the function that throws the exception when testing your application during development.)
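The uncaught_exception_handler frame in the trace above is an exception handler installed via Foundation’s NSSetUncaughtExceptionHandler() API. As a minimal sketch (the handler below is purely illustrative and is not the handler used in this report), installing such a handler looks roughly like this:

#import <Foundation/Foundation.h>

// Illustrative handler only; crash reporters install something similar
// (and far more careful) via NSSetUncaughtExceptionHandler().
static void my_uncaught_exception_handler(NSException *exception) {
    NSLog(@"Uncaught exception: %@, reason: %@, backtrace: %@",
          exception.name, exception.reason, exception.callStackSymbols);
    // Once this returns, the runtime terminates the process, ultimately
    // reaching abort() and raising the SIGABRT seen in the trace above.
}

// Typically called early, e.g. from main() or application launch.
static void install_uncaught_exception_handler(void) {
    NSSetUncaughtExceptionHandler(&my_uncaught_exception_handler);
}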
It is probably safe to say that a crash report’s stack traces are what most programmers look at first, as they are relatively easy to comprehend. Often, though not always, a stack trace is all that’s required to understand the underlying cause of a crash.
Debug Symbols
It is important to note that at the time an application is built for deployment, the debug symbols (which associate constructs of the programming language with the machine code that the compiler generated from them) will be stripped so as to produce a smaller build product. For iOS applications, it is therefore important to keep a copy of the .dSYM bundle that is generated alongside the binary, as it cannot easily be recovered even if the same set of source files is compiled again. The .dSYM bundle is generated by the dsymutil program, which crafts it from the executable and its intermediate object files (.o files, which still contain the DWARF debug information).
Using a binary’s .dSYM bundle, crash reporting services can later map the symbol addresses from the binary to more comprehensible, human-readable symbol names, and even file and line numbers.
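Offline symbolication against the .dSYM is what crash reporting services do after the fact. As a rough in-process analogue (a sketch for illustration only, not how report symbolication actually works), dladdr() can resolve an address to its containing image and the nearest exported symbol:

#include <dlfcn.h>
#include <stdio.h>

/* Sketch: resolve an in-process address to its image and nearest exported
 * symbol via dladdr(). Stripped binaries typically yield only the image
 * name and base address, which is exactly why the .dSYM matters. */
static void describe_address(const void *addr) {
    Dl_info info;
    if (dladdr(addr, &info) == 0)
        return;
    printf("image: %s (base %p)\n", info.dli_fname, info.dli_fbase);
    if (info.dli_sname != NULL && info.dli_saddr != NULL) {
        printf("symbol: %s + %ld\n", info.dli_sname,
               (long)((const char *)addr - (const char *)info.dli_saddr));
    }
}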
Let’s illustrate the difference that symbolication makes using two short examples from stack traces.
// Before symbolication
8 OurApp 0x000029d4 0x1000 + 6612
The first column gives us the index of the stack frame in the stack trace. The second column indicates the name of the binary the function belongs to. The third column gives the address of the instruction for that frame in the process’s address space. The last column divides this into a base address for the library’s binary image (see the Binary Images section below) and an offset.
We can see that this is not all that intuitive. Let’s look at the same example after symbolication:
// After symbolication
8 OurApp 0x000029d4 -[OurAppDelegate applicationDidFinishLaunching:] (OurAppDelegate.m:128)
The first three columns are the same. The last column, however, contains the actual function name, file name, and line number, which we humans can look up much more easily.
The Exception Section
The exception section of a crash report provides us with the exception type, the exception codes, and the index of the thread where the crash occurred:
Exception Type: SIGABRT
Exception Codes: #0 at 0x3466e32c
Crashed Thread: 0
When we talk about ‘exceptions’ in this context, we do not refer to Objective-C exceptions (although those may be the reason for a crash), but to Mach Exceptions. The example also shows a UNIX signal, SIGABRT, which many UNIX programmers will find familiar. There is a whole ecosystem of APIs built around Mach Exceptions and UNIX signals (e.g. to attach custom signal/exception handlers to given types of UNIX signals/Mach Exceptions), which we’re not going to cover here. If you’re interested, see the Further Reading section below.
The kernel will send such exceptions and signals under a variety of circumstances. For the sake of brevity, we limit our discussion to the most common ones that either lead to process termination or that are otherwise of interest for crash analysis.
Signals
The following is a list of commonly encountered, process-terminating signals and a brief description:
- SIGILL: Attempted to execute an illegal (malformed, unknown, or privileged) instruction. This may occur if your code jumps to an invalid but executable memory address.
- SIGTRAP: Mostly used for debugger watchpoints and other debugger features.
- SIGABRT: Tells the process to abort. It can only be initiated by the process itself using the abort() C stdlib function. Unless you’re using abort() yourself, this is probably most commonly encountered if an assert() or NSAssert() fails.
- SIGFPE: A floating point or arithmetic exception occurred, such as an attempted division by zero.
- SIGBUS: A bus error occurred, e.g. when trying to load an unaligned pointer.
- SIGSEGV: Sent when the kernel determines that the process is trying to access invalid memory, e.g. when an invalid pointer is dereferenced.
A signal has either a default signal handler or a custom one (if the program set it up using sigaction). As the second argument to the signal handler, a siginfo_t structure is passed that contains further information about the error that occurred. Of special interest is the si_addr field, which indicates the address at which the fault occurred. The following is a quote of a comment from the kernel’s bsd/sys/signal.h file:
When the signal is SIGILL or SIGFPE, si_addr contains the address of the faulting
instruction. When the signal is SIGSEGV or SIGBUS, si_addr contains the address of
the faulting memory reference. Although for x86 there are cases of SIGSEGV for
which si_addr cannot be determined and is NULL.
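To make the sigaction() and siginfo_t discussion above concrete, here is a minimal sketch of installing a SIGSEGV handler that reports si_addr. It is purely illustrative: fprintf() is not async-signal-safe, and a real crash reporter such as PLCrashReporter has to do considerably more work, and do it far more carefully.

#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

// Illustrative only: a production handler must restrict itself to
// async-signal-safe operations. The fault address arrives in si_addr,
// as described by the bsd/sys/signal.h comment quoted above.
static void segv_handler(int signo, siginfo_t *info, void *context) {
    fprintf(stderr, "Caught signal %d, fault address: %p\n",
            signo, info->si_addr);
    _exit(EXIT_FAILURE);
}

static void install_segv_handler(void) {
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = segv_handler;
    sa.sa_flags = SA_SIGINFO;   // ask for the siginfo_t argument
    sigemptyset(&sa.sa_mask);
    sigaction(SIGSEGV, &sa, NULL);
}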
Exceptions
On Darwin, UNIX signals are built on top of Mach Exceptions, and the kernel performs some mapping between the two. For a more comprehensive list of exception types, see osfmk/mach/exception_types.h. Again, we list only the most important exception types:
- EXC_BAD_ACCESS: Memory could not be accessed. The memory address where an access attempt was made is provided by the kernel.
- EXC_BAD_INSTRUCTION: Instruction failed. Illegal or undefined instruction or operand.
- EXC_ARITHMETIC: For arithmetic errors. The exact nature of the problem is also made available.
It is also possible for an exception to have an associated exception code that contains further information about the problem. For instance, EXC_BAD_ACCESS could point to a KERN_PROTECTION_FAILURE, which would indicate that the address being accessed is valid, but does not permit the required form of access (see osfmk/mach/kern_return.h). EXC_ARITHMETIC exceptions will also include the precise nature of the problem as part of the exception code.
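As a small illustration of how these constants fit together, here is a sketch that maps an exception type (and, for EXC_BAD_ACCESS, its associated code) to a readable description, using only constants from the public Mach headers; the function itself is hypothetical.

#include <mach/mach.h>
#include <mach/exception_types.h>
#include <mach/kern_return.h>

/* Sketch: describe a Mach exception type and, for EXC_BAD_ACCESS,
 * interpret its kern_return_t exception code. */
static const char *describe_exception(exception_type_t type,
                                      mach_exception_data_type_t code) {
    switch (type) {
    case EXC_BAD_ACCESS:
        if (code == KERN_PROTECTION_FAILURE)
            return "EXC_BAD_ACCESS (address valid, access not permitted)";
        if (code == KERN_INVALID_ADDRESS)
            return "EXC_BAD_ACCESS (address not mapped)";
        return "EXC_BAD_ACCESS";
    case EXC_BAD_INSTRUCTION:
        return "EXC_BAD_INSTRUCTION (illegal or undefined instruction/operand)";
    case EXC_ARITHMETIC:
        return "EXC_ARITHMETIC (exception code holds the exact error)";
    default:
        return "other exception type";
    }
}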
Example
The example exception section shown above is an excerpt from this crash report. We can see that the reason for the crash is a SIGABRT, which makes us think that this crash might have been caused by a failing assertion. If we inspect the exception code, we can see that the kernel included the address of the instruction in question (0x3466e32c) and the crashed thread’s index. Sure enough, if we search for that address in the report, we’ll find it in both the program counter register (see below) and the crashing thread’s stack trace:
0 libsystem_kernel.dylib 0x3466e32c ___pthread_kill + 8
In this example, we can see that there’s even more to discover in the ‘Application Specific Information’ section, which tells us that an NSInternalInconsistencyException (a Foundation exception) occurred, was not caught, and led to a call to abort(), which is ultimately why we saw the SIGABRT signal.
Binary Images
At the end of a crash report, we find a list of the loaded binary images, which in essence tells us which libraries were loaded by the application, and what their address space is within the process. Each entry in this list also shows the UUID for the respective binary, which is generated and set by the linker as part of the build process. It is stored in the Mach-O binary and identified by the LC_UUID load command. The UUID is the same for the binary and the .dSYM bundle generated for it, which ensures that there’s no mismatch during symbolication.
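To make that UUID matching concrete, the UUID of an image loaded in the current process can be read back at runtime by walking its Mach-O load commands until LC_UUID is found. The sketch below is illustrative and assumes a 32-bit image (matching the armv7 binaries discussed here); 64-bit images use struct mach_header_64 instead.

#include <mach-o/dyld.h>
#include <mach-o/loader.h>
#include <stdint.h>
#include <stdio.h>

/* Sketch: print the LC_UUID of the image at the given dyld image index. */
static void print_image_uuid(uint32_t image_index) {
    const struct mach_header *hdr = _dyld_get_image_header(image_index);
    const struct load_command *lc =
        (const struct load_command *)((const uint8_t *)hdr + sizeof(*hdr));

    for (uint32_t i = 0; i < hdr->ncmds; i++) {
        if (lc->cmd == LC_UUID) {
            const struct uuid_command *uc = (const struct uuid_command *)lc;
            for (int j = 0; j < 16; j++)
                printf("%02X", uc->uuid[j]);
            printf("  %s\n", _dyld_get_image_name(image_index));
            return;
        }
        lc = (const struct load_command *)((const uint8_t *)lc + lc->cmdsize);
    }
}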
For example, we might find the following in the list of binary images:
0x35f62000 - 0x36079fff CoreFoundation armv7 /System/Library/Frameworks/CoreFoundation.framework/CoreFoundation
The first two hexadecimal numbers indicate the beginning and end of the address space that the CoreFoundation image is loaded into. If we consider a line from a stack trace such as the following, we can see that the function that was traced falls into this library’s address space:
9 CoreFoundation 0x35fee2ad ___CFRunLoopRun + 1269
When is this information useful? Imagine that for some reason we desire to analyze the assembly of one of the libraries that our application is using, let’s say CoreFoundation. CoreFoundation is not statically linked (i.e. it’s not part of the application’s own binary), but is dynamically loaded at runtime. When such loading occurs, the library’s binary image ends up at some arbitrary location in the process’ address space.
Let’s now assume that we know the value of the program counter (PC, i.e. the address of the instruction to be executed next) of the process, and that the PC refers to an instruction somewhere in CoreFoundation’s address space within the process. If, on our development machine, we disassemble a local, on-disk copy of the CoreFoundation binary that the application previously loaded, we would not be able to find the process-relative PC in the local copy, given that CoreFoundation was mapped to some arbitrary address. If, however, we know the address at which the CoreFoundation binary image was loaded during the process’s lifetime, we can easily map the process-relative PC to the corresponding address in the on-disk binary.
As an example, if CoreFoundation’s binary image was loaded into our process at the base address 0x35f62000, and the PC is 0x35fee2ad, then we can compute the corresponding address of the instruction in the on-disk CoreFoundation binary as:
0x35fee2ad - 0x35f62000 == 0x8c2ad
In our locally disassembled CoreFoundation binary, we can now inspect the instruction at address 0x8c2ad.
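This arithmetic is easy to do by hand, but as a tiny sketch, the same rebasing could be expressed in code; the image base would come from the Binary Images section and the PC from the register state (the function name here is hypothetical):

#include <stdint.h>

/* Sketch: map a process-relative address back to the corresponding
 * address in the on-disk binary, given the image's load address. */
static uintptr_t file_relative_address(uintptr_t process_address,
                                       uintptr_t image_load_address) {
    return process_address - image_load_address;
}

/* Example: file_relative_address(0x35fee2ad, 0x35f62000) == 0x8c2ad */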
Register State
Further down in the crash report we find the ARM thread state of the crashed thread, which is essentially a list of the CPU registers and their respective values at the time of the crash. The section may look like so:
Thread 0 crashed with ARM Thread State:
r0: 0x00000000 r1: 0x00000000 r2: 0x00000001 r3: 0x00000000
r4: 0x00000006 r5: 0x3f09cd98 r6: 0x00000002 r7: 0x2fe80a70
r8: 0x00000001 r9: 0x00000000 r10: 0x0000000c r11: 0x00000001
ip: 0x00000148 sp: 0x2fe80a64 lr: 0x3526f20f pc: 0x3466e32c
cpsr: 0x00000010
A crashing thread’s register state is not always required to read a crash report, but there are certainly instances where this information can be very useful. For instance, if the crashing instruction accesses memory through a register whose value is 0x0 (or 0x0 plus a small offset), the failure cause is extremely likely to be a NULL dereference. That’s because the entire page from 0 to 4095 is mapped with read/write/execute permissions disabled, i.e. no access is allowed.
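A quick triage heuristic follows from this: if the faulting address (si_addr, or the offending register’s value plus its offset) falls within that first page, the crash is almost certainly a NULL dereference. A minimal sketch, with a hypothetical helper name:

#include <stdbool.h>
#include <stdint.h>
#include <unistd.h>

/* Sketch: the first page (0 to PAGE_SIZE - 1, i.e. 0-4095 here) has all
 * access permissions disabled, so fault addresses in that range almost
 * always indicate a NULL pointer plus a small field offset. */
static bool looks_like_null_dereference(uintptr_t fault_address) {
    return fault_address < (uintptr_t)getpagesize();
}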
In the following section, we’ll give another more elaborate example.
An Example
Let us now consider a contrived example that illustrates why it can be handy to have the register state of a crashing application.
Assume that the crash report tells us the thread in which the crash happened, and that we’ve identified the line in our program that causes the crash. Imagine that the line reads as follows:
new_data->ptr2 = [myObject executeSomeMethod:old_data->ptr2];
If we consider the UNIX signal that got sent (SIGSEGV), we can guess that the crash happened because the application tried to dereference a memory address that for some reason is invalid. In this example, two pointers are evidently being dereferenced (new_data and old_data). The question then becomes: which one is responsible for the crash?
Assembly to the Rescue
If we still have a copy of the crashing binary, we can disassemble it and look at the exact instruction that was being executed when the application crashed (its address is reported as the crashed thread’s program counter, while the faulting memory address is available via si_addr).
Assume that the application at the time of crashing was executing the following ARM instruction:
str r0, [r1, #4]
There are two registers being used here: r0 and r1. Imagine that r1 holds the address of a C struct with the following declaration:
typedef struct { void *ptr1; void *ptr2; } data_t;
Then, given that pointers on 32-bit ARM are 4 bytes wide, we know that r1 plus four refers to the struct member ptr2.
In the form above, the str instruction takes the value stored in register r0 and attempts to store it at the address pointed to by r1, plus four. We could read the assembly like so:
*(r1 + 4) = r0;
That is, we can think of str as the equivalent of a C assignment here.
We now know two things:
- The application received a SIGSEGV (an invalid memory access; we already knew this, see above).
- The crash happened while trying to store a value to some address plus an offset of 4.
At this point we should be able to suspect that the pointer value of new_data might not be what we were expecting it to be.
Looking at the ARM thread state (the registers and their respective values), we can confirm this theory:
r1: 0x00000000
In other words, we can now be certain that we are trying to dereference an address (which was computed based on an offset and a NULL pointer) which in all likelihood points to invalid memory. This was ultimately why the application crashed. If this was a real example, the next step would be to look at our code and determine what path the code could have taken such that we ended up on the crashing line with new_data == NULL, and to fix that.
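To tie the example together, here is a contrived repro sketch (the names mirror the hypothetical code above, and the struct is the one declared earlier): with new_data == NULL, the store compiles on 32-bit ARM to roughly the str r0, [r1, #4] we inspected, and it faults at address 0x4 inside the unmapped zero page.

typedef struct {
    void *ptr1;
    void *ptr2;
} data_t;

// Contrived sketch: crashes with SIGSEGV/EXC_BAD_ACCESS at fault
// address 0x4 when new_data is NULL, matching the register state
// (r1 == 0x00000000) shown above.
static void crash_example(data_t *new_data, data_t *old_data) {
    new_data->ptr2 = old_data->ptr2;
}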
Additional Notes
To drive the point home, dereferencing old_data can’t be the cause of the crash, given that the access to it only involves reading the value, not writing it (i.e. we wouldn’t be seeing a str instruction). That said, one should not be confused when looking at a more complete disassembly listing, where one might see an instruction such as
ldr r0, [r2, #4]
This is how the argument is passed to the subroutine (assuming r2 holds the address of a struct of type data_t). That is, the value of the second struct member is loaded into register r0 prior to jumping into the subroutine, which then performs its work.
When the subroutine is about to return (assuming the Apple ARM ABI), the return value of the function is placed in r0 (if it fits there). In the above case, that would be the value returned by the executeSomeMethod: call, which by the time the crashing instruction executes is already stored in r0.
Conclusion
A fair number of crashes are easy to comprehend. For many crashes, however, comprehensive data is required for crash analysis. In those cases, a reliable crash reporting facility is desirable. Since you can’t know in advance when a crash will occur, and because you might not be able to narrow the cause down without a report, it’s advisable to set up a crash reporting solution for your app early during development. Using iTunes Connect does get you crash reports, but you can’t use it before your app is on the App Store, which means it’s not suitable for beta testing.
Using a service such as HockeyApp has the following advantages over iTunes Connect:
- You can get crash reports even during beta testing, before the app is on the App Store.
- You can access crash reports easily through a convenient web interface.
- You can get notified via mail as soon as a crash happens.
- Crash reports have already been symbolicated for you (assuming proper setup).
- Users often opt out of Apple’s data gathering, which means you wouldn’t get any crash reports from them. This is because Apple asks the user for permission to “improve its products and services by automatically sending daily diagnostics and usage data” once, when a device is first used; this is a global setting that doesn’t easily convey the effect of declining the request.
It is therefore a good idea to find a crash reporting solution that works for you as soon as you deploy your application. (In the interest of full disclosure: HockeyApp has been a great sponsor of our open-source work on PLCrashReporter).
We hope that we have succeeded in shedding some light on the more advanced information provided by crash reports. In case you need some expert guidance regarding (our) crash reporting solutions, please note that we’re available for hire.
Further Reading
- A WWDC 2010 video on understanding crash reports (login required).
- Apple tech note 2151 on understanding crash reports.
- On Mach Exceptions and Signals, we recommend Amit Singh’s Mac OS X Internals book and Landon’s guest article on Mike’s Friday Q&A blog. As Landon highlights, please note that Mach exception handlers are a partially private API on iOS.
Swapping PCI Option ROMs
Prologue: Old Hardware Hacking
In my spare time in the Plausible Lab, I like to play with old Mac and video game hardware — it’s fun, appeals to my strong sense of nostalgia, and if I screw up, I won’t feel quite so terrible as I would if I happened to destroy a piece of expensive modern hardware. Fortunately, I’ve yet to actually destroy anything, and have even been able to fix a few things, like a failed Mac IIci power circuit.
In many cases, you can also find schematics for the hardware in question, if not actual vendor documentation. Official development manuals are available out there if you look around a bit, and crazy folks such as BoMarc Services sell reverse-engineered schematics for everything from the Super Nintendo to the Macintosh Quadra 840av.
This leads me to my small Friday evening project in getting a PC version of the Radeon 7000 64MB working on a PCI Power Macintosh, for which no 64MB cards were released. This requires desoldering, replacing, and reflashing the flash chip on the card, as recently documented by a friend of mine, Rob Braun.
PCI Option ROMs
Have you ever wondered how PCI cards are able to perform basic operations — such as accessing a disk, displaying graphics, or booting off the network — before their drivers have been loaded? Or why cards are platform specific, despite the fact that everyone is using standardized PCI interfaces?
The answer actually lies on the card itself, inside a bit of addressable flash memory called the “Option ROM”, which contains executable code that’s located and run at boot time. This code is responsible for interfacing with the underlying BIOS implementation, and providing the services required for disk, network, graphics, or similar access. Given where and when the code is running, there’s no real constraint — it could also display custom UI, or vend services other than what you might expect, such as a ‘kernel’ that runs entirely from a PCI option ROM.
There are also security implications; malware that acquires sufficient privileges could re-image your PCI cards’ option ROMs to contain persistent code that re-infects the machine no matter how many times you re-image your system. One of the goals of the Secure Boot initiative is to require signing of firmware, preventing this sort of attack (… until another bug is found in the firmware responsible for validating the boot process).
The PCI option ROM is also why PCI cards are platform-specific — the ROMs contain either native code or an architecture-neutral bytecode, and that code is targeted at the firmware API specification of the host platform (e.g., BIOS, UEFI, or Open Firmware).
On legacy Power Macintosh systems, a combination of Open Firmware bytecode (called FCode) and native PPC code is included in the PCI Option ROM to provide both boot-time and run-time drivers for cards; this bundling of OS drivers is what allowed Mac OS graphics cards to run without any additional drivers being installed. On these older Macs, the FCode provides the basic services required at boot time, and the OS then loads and uses the Mac OS drivers from the PCI option ROM at runtime.
This approach ties the hardware fairly closely to a specific operating system, and while this sort of driver bundling was the norm on Mac OS for years, modern systems tend to rely on installable drivers after boot time.
Converting a Radeon 7000 64MB
This leads us to the original goal, which was to get the PC version of the Radeon 7000 64MB card working on a PCI Macintosh. There was never a 64MB version of the card released, but there was a 32MB version of the card provided for Mac OS. Assuming that the drivers written for the 32MB card are compatible with the 64MB card, it should simply be a matter of reflashing the PCI Option ROM with the Mac OS drivers — something we can do using the vendor’s own flashing software.
Unfortunately, there’s just one hitch. The Mac ROM is 128KB, and the PC version of the card shipped with only a 64KB flash ROM. This was presumably done to save money; pennies add up, and the PC ROM only needs to provide basic BIOS services, which can fit easily within 64KB of flash. The Mac ROM needs to provide two different drivers (one in FCode, one for Mac OS), and these won’t fit on the PC card’s smaller flash chip.
Thus, to re-image the card with Mac-compatible firmware, we need to desolder or clip the 64KB Flash chip from the PC card, solder in a new compatible 128KB flash chip, and then reflash the ROM using the vendor’s tools. Here’s the flash chip I replaced:
The flash chip is circled in red
Replacing the chip turns out to be pretty easy. The original flash chip conforms to an industry standard pinout and package size, for which many compatible replacements are available. The chips are communicated with over the de facto standard SPI (Serial Peripheral Interface) bus, and use a common wire protocol. The protocol itself uses 24-bit addressing; in theory, there’s no issue with swapping out a smaller 64KB chip for a larger 128KB chip and having the card address the additional space. I was able to find a compatible replacement for the original Atmel chip from Digi-Key.
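For the curious, that common wire protocol boils down to single-byte commands followed, for reads and writes, by a 24-bit address. The sketch below is illustrative: opcode 0x03 is the conventional READ command for SPI NOR flash of this class, but the specific chip’s datasheet is the authoritative reference.

#include <stdint.h>

/* Sketch: build the 4-byte header of a conventional SPI NOR flash READ
 * command: opcode 0x03 followed by a 24-bit big-endian address. Since the
 * address field covers 24 bits (16MB), both a 64KB and a 128KB part fit
 * comfortably within it, which is why the larger chip can be dropped in
 * without changing the protocol. */
static void build_spi_read_command(uint32_t address, uint8_t out[4]) {
    out[0] = 0x03;                    /* READ opcode (typical) */
    out[1] = (address >> 16) & 0xFF;  /* address bits 23..16   */
    out[2] = (address >> 8) & 0xFF;   /* address bits 15..8    */
    out[3] = address & 0xFF;          /* address bits 7..0     */
}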
The first step is to remove the previous flash chip. If preserving the chip is important to you, you might use hot air rework or similar to lift the chip without damaging it. Given that you can easily copy a PCI option ROM’s contents from a running system, there’s no real need to preserve the actual chip; I took the more destructive approach of clipping the leads with a pair of small flush cutters, and then removing the remaining solder and pins with solder wick.
With the chip removed, the next step is to solder in the replacement. Soldering surface-mount components can be a bit daunting at first, but I’ve found it’s actually quite a bit easier (and less time consuming) than working with through-hole. Lately I’ve been getting the hang of hot air rework using solder paste and the Hakko FM-206 we have in the lab, but for replacing this chip, I wanted to minimize the risk of heat damage to surrounding components.
My approach for resoldering the chip was to use a standard soldering iron with a “hoof” soldering tip. These tips have a broad, flat or concave face that can be easily dragged over a set of pins. After tacking down opposing corners of the chip, you can simply drag the tip across the pins, letting surface tension wick the solder around them. I’d recommend watching the EEVBlog’s video tutorial for more details on hand-soldering surface-mount components, including drag soldering.
After dropping the card into my Power Mac 9600, I was able to reflash the new ROM using ATI’s tools. One reboot later, and I could sit in awe of System 7.6 running at 1920×1200 over DVI. Now I just need to get a copy of Marathon running…