Thoughts on mobile app crash detection and recovery

As I work on updates to my app, I’m happy to see it’s working quite well. However, inevitably there will be some boundary case that causes an exception, and as I add features (and more importantly, persistent settings) an additional risk comes into play: failed restarts after a crash.

Unlike a desktop app, where one could add a “safe mode” icon for such instances, there are few options in the mobile space. These include:

  • Letting the user delete the application data manually via the Settings menu
  • Automatically resetting to defaults after a crash
  • Attempting to automatically work around the broken area
  • Asking the user on restart if they want to reset to defaults
  • Falling back to the last known-good configuration

It’s one thing to be a power user, but another to be forced to become one, so the first option isn’t really practical. The second option guarantees recovery but may degrade the user experience (especially if there are lots of settings involved), so it should be a last resort and a user choice. The third option is nice but can become quite complex: it works best if settings are checkpointed frequently, but such frequent writes to flash memory are not usually a great idea. In addition, the more complex the fail-safe, the more likely it is to trigger an exception of its own.

Asking the user first is an important part of any solution, but it requires architecting a fail-safe startup sequence (one that is preference-invariant, meaning it runs the same way no matter what the saved settings contain). One can then offer the choice of a “fresh start” or the option of using a backed-up configuration, ensuring that the user has at least a chance of restoring to a previously working state with their settings intact.
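One way to keep startup independent of the stored preferences is to treat the saved data as untrusted input. The following is only a rough sketch in Python, used here as a platform-neutral stand-in for whatever the mobile platform provides. The DEFAULTS table, the setting names, and the safe_settings function are hypothetical and exist only for illustration.

```python
import json

# Hypothetical settings and their defaults, for illustration only.
DEFAULTS = {"theme": "light", "font_size": 14}


def safe_settings(raw_text):
    """Parse stored settings without ever raising.

    Returns (settings, clean). 'clean' is False when anything had to be
    discarded, which is a reasonable cue to show the recovery prompt.
    """
    try:
        stored = json.loads(raw_text)
    except (TypeError, ValueError):
        # Missing, truncated, or otherwise unparseable data.
        return dict(DEFAULTS), False
    if not isinstance(stored, dict):
        return dict(DEFAULTS), False

    settings = dict(DEFAULTS)
    clean = True
    for key, default in DEFAULTS.items():
        if key in stored:
            if type(stored[key]) is type(default):
                settings[key] = stored[key]
            else:
                # Wrong type: keep the default and flag the problem.
                clean = False
    return settings, clean
```

Because this function always returns usable values, the code that draws the first screen and asks the recovery question never depends on what was saved.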

A useful pattern for a single-instance app would be the following, using a persistent IsRunning flag:

  • App launches
  • Load settings
  • Check if the IsRunning flag is set: if it is, then the last run exited abnormally (none of the normal exit points, which would have cleared the flag, were hit):
    • Clear the IsRunning flag
    • Offer to restore from backup / defaults or attempt to continue as-is
  • Initialize states and display initial UI presentation
  • Save the current settings with a backup name
  • Set the IsRunning flag
  • Save the settings normally
  • On normal app exit / backgrounding, clear the IsRunning flag and save settings normally
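The steps above can be sketched roughly as follows. This is a minimal illustration in Python rather than any particular mobile framework: a JSON file stands in for the platform’s settings store, and the names (DEFAULTS, launch, exit_normally, the ask_user callback and its three return values) are assumptions made for the example.

```python
import json
from pathlib import Path

# Hypothetical settings; "is_running" plays the role of the IsRunning flag.
DEFAULTS = {"is_running": False, "theme": "light"}


def load(path: Path) -> dict:
    """Return saved settings, or defaults if the file is missing or unreadable."""
    try:
        return {**DEFAULTS, **json.loads(path.read_text())}
    except (OSError, TypeError, ValueError):
        return dict(DEFAULTS)


def save(path: Path, settings: dict) -> None:
    path.write_text(json.dumps(settings))


def launch(settings_path: Path, backup_path: Path, ask_user) -> dict:
    settings = load(settings_path)
    if settings["is_running"]:
        # The flag was never cleared, so the last run exited abnormally.
        settings["is_running"] = False
        choice = ask_user()  # assumed to return "backup", "defaults", or "continue"
        if choice == "backup" and backup_path.exists():
            settings = load(backup_path)
        elif choice == "defaults":
            settings = dict(DEFAULTS)
    # ... initialize state and display the initial UI here ...
    settings["is_running"] = False
    save(backup_path, settings)    # save under the backup name
    settings["is_running"] = True  # set the flag
    save(settings_path, settings)  # save normally
    return settings


def exit_normally(settings_path: Path, settings: dict) -> None:
    """Call on normal exit or backgrounding: clear the flag and save."""
    settings["is_running"] = False
    save(settings_path, settings)
```

The backup is written with the flag cleared, so restoring from it never looks like a crash on the following launch.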

There are other things you can do, including instrumenting for crash data collection and offering the user the option to forward that data (anonymized!) to you for analysis.
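As a rough illustration of what an anonymized crash record could contain, the sketch below keeps only the exception type and the code locations, and drops the exception message and full file paths, since either may carry user data. It is written in Python for brevity. The build_crash_report function and the report fields are hypothetical, and this covers only one narrow reading of “anonymized”.

```python
import json
import os
import traceback


def build_crash_report(exc: BaseException, app_version: str) -> str:
    """Build a JSON crash record that omits the message and full paths."""
    frames = traceback.extract_tb(exc.__traceback__)
    report = {
        "app_version": app_version,
        "exception_type": type(exc).__name__,
        # The exception message is left out on purpose: it may contain user data.
        "frames": [
            {
                "file": os.path.basename(frame.filename),  # no directory names
                "line": frame.lineno,
                "function": frame.name,
            }
            for frame in frames
        ],
    }
    return json.dumps(report)
```

The resulting string can be stored locally and shown to the user on the next launch, so they can see exactly what would be sent before agreeing to forward it.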

Software quality is as much about how well you avoid defects in the first place as it is about how gracefully you recover from them! Mistakes happen, especially in apps with complex GUIs. Errors should be as rare as possible, but when errors occur it’s the unrecoverable ones that lead to lost users. Solid error handling and recovery go a long way towards satisfied users who stick around for the release that irons out the bug they encountered and worked around.

This is a critical difference between hardware and software quality, especially for smaller shops. If you ship a physically defective device, the typical recourse is a return for repair or replacement – a costly proposition in terms of logistics as well as reworked hardware / waste. The quality patterns for software and hardware have commonalities but are quite different in practice: hardware is by its very nature far more costly and less forgiving of quality failures. This is why popular hardware quality initiatives tend to translate poorly to the software space.
