Context capture, restart points and UX

A different perspective on error handling

Rolf W. Rasmussen

This is not a talk about control flow mechanisms

Too much time has been spent arguing about error return codes vs exceptions already

Nor is it a talk about log file formats and distributed logging protocols

Structured log formats and GELF are pretty neat, but that is beside the point

Nor is it a talk about colors, fonts, layout, spacing and animation

UX ≠ UI + animation

Why do we want to handle errors?

Let's look at a motivating example

What the world needs is a program for calculating remainders

Task: Write a program that reads two integers a and b and prints the result of a modulo b.

Easy! I know C++!

    #include <iostream>

    int main() {
        int a, b;
        std::cin >> a;
        std::cin >> b;
        int remainder = a % b;
        std::cout << remainder << std::endl;
    }

What, if anything, is wrong with this code?

See, it works

User testing: The author

User testing: The astronomer

User testing: The mathematician

“Black holes are where God divided by zero”

— Steven Wright, comedian

No, this is not an Einstein quote, despite what you read on the internet.

Platform testing: Amiga

The test engineer everyone fears

Issue triage

How would you deal with these reported issues?

Author: Fails when using Chicago Manual of Style, which spells out single digits. [FR/UX/CC]
Astronomer: Sun to Saturn distance in meters gives wrong AU remainder. [B/UX/FR]
Mathematician: I need a domain and codomain of ℤ, not significand × 2^exponent. [UX/CC]
Mathematician: Sign should be equal to divisor, not dividend. See Mathematica. [Doc/FR]
Dinosaur: Sometimes crashes and reboots our salary processing machine. [B/RP]
Tester: Inconsistent error messages under low memory conditions. [Doc]
Junior dev: Unable to reproduce reported crash. Can we add logging? [CC]
Trekkie: The core should only be dumped in the event of a warp core breach. [Joke]
UI designer: Technical error messages are confusing and scary. Hide them. [UX]
UX designer: User would benefit from seeing a preview of results while typing. [UX/RP]

Isolation through restart points

When a failure occurs, abort the directly affected activity. Let other isolated activities continue running.

OS process scheduler as restart point

Most modern OSes provide process isolation.

Example: browsers use one process per tab

Service managers as restart points

Service managers like Linux systemd, macOS launchd, and the Windows Service Control Manager provide the ability to restart service processes if they fail.

Application loops as restart points

If an application performs many independent tasks, then it often makes sense to allow unaffected tasks to run to completion even if one of the tasks fails.

  • Each incoming request in a server could be considered a separate task.
  • Each file copy operation in a backup program could be considered a separate task.
  • Each HTTP GET request in a webcrawler could be considered a separate task.
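
A minimal sketch of such a loop, assuming the tasks really are independent and share no state that a failure could corrupt. The names Task, TaskResult and run_all are made up for illustration.

```cpp
#include <exception>
#include <functional>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical task type: a name plus the work to perform.
struct Task {
    std::string name;
    std::function<void()> run;
};

// Outcome of one task, kept so failures can be reported afterwards.
struct TaskResult {
    std::string name;
    bool ok;
    std::string error;  // empty when ok is true
};

// The loop is the restart point: a failing task is aborted and recorded,
// and the loop moves on to the next, unaffected task.
std::vector<TaskResult> run_all(const std::vector<Task>& tasks) {
    std::vector<TaskResult> results;
    for (const Task& task : tasks) {
        try {
            task.run();
            results.push_back({task.name, true, ""});
        } catch (const std::exception& e) {
            results.push_back({task.name, false, e.what()});
        }
    }
    return results;
}
```

The loop does not retry the failed task. It only keeps the failure from taking the other tasks down with it.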

Multiple layers of restart points

Imagine a modern HTTP server running as a service

  • If the server process fails, then the service manager restarts it
  • If an accepted network connection fails, the accept loop will continue to accept new connections
  • If an HTTP request handler fails, an error will be sent back, and the HTTP protocol parser will continue to process requests on the connection.

The biggest sin in error handling is corrupting state or result

If you can't isolate the error, then don't continue

Memory and data corruption side effects are very hard to debug when the visible failure occurs long after the initial error.

Restart ≠ Unsolicited retry

For low-level code, the most useful behavior is to fail fast on error.

Only do retries at the outermost context. In interactive applications, the user should be kept in the loop.

Implementing retry logic at multiple layers causes unwanted delay amplification and catastrophic cascades.

Don't retry operations that encounter permanent errors
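
One way to sketch this, assuming low-level code can classify its failures as transient or permanent. TransientError, PermanentError and run_with_retry are hypothetical names, and max_attempts is assumed to be at least 1.

```cpp
#include <functional>
#include <stdexcept>

// Hypothetical error types: low-level code classifies the failure and
// fails fast. It never retries on its own.
struct TransientError : std::runtime_error {
    using std::runtime_error::runtime_error;
};
struct PermanentError : std::runtime_error {
    using std::runtime_error::runtime_error;
};

// Outermost context: the only place that retries.
// Returns the number of attempts that were needed.
int run_with_retry(const std::function<void()>& operation, int max_attempts) {
    for (int attempt = 1;; ++attempt) {
        try {
            operation();
            return attempt;
        } catch (const TransientError&) {
            if (attempt >= max_attempts) {
                throw;  // give up and report the last transient error
            }
            // An interactive application would ask the user here
            // instead of retrying silently.
        }
        // PermanentError (and anything else) is not caught: it propagates
        // after the first attempt.
    }
}
```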


“Insanity Is Doing the Same Thing Over and Over Again and Expecting Different Results”

— Anonymous Al-Anon attendee

Still not an Einstein quote, despite what the internet says.

Context Capture

As you abort an activity and pass the error condition up to outer contexts, don't throw away the information only known by the inner contexts.

Examples of things to capture

  • The URL of the failed HTTP request
  • The path of the file that could not be found
  • The arguments and the mathematical operation that failed due to division by zero
  • The line in the configuration file whose value did not work
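
For the remainder program, the division by zero case could be captured roughly like this. ModuloError and checked_modulo are made-up names, and the sketch only covers a zero divisor, not other failure modes such as INT_MIN % -1.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical exception type that keeps what only the inner context knows:
// the operation that failed and both of its arguments.
class ModuloError : public std::domain_error {
public:
    ModuloError(int dividend, int divisor)
        : std::domain_error("cannot compute " + std::to_string(dividend) +
                            " % " + std::to_string(divisor) +
                            ": divisor is zero"),
          dividend(dividend),
          divisor(divisor) {}

    const int dividend;
    const int divisor;
};

// Fails fast with the captured context instead of invoking undefined behavior.
int checked_modulo(int a, int b) {
    if (b == 0) {
        throw ModuloError(a, b);
    }
    return a % b;
}
```

The arguments are available both as text for a message and as fields that an outer layer can inspect.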

Context capture layers

Each layer of context can contribute to the description of the error condition, with the outer layers adding higher-level information.

  • What API call failed and why it failed.
  • What calling the API was trying to achieve.
  • What high-level operation or task was aborted as a result.
  • How, with what input, and why the high level operation was started in the first place.
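
In C++, one possible way to let each layer add its part is nested exceptions (std::throw_with_nested and std::rethrow_if_nested). The sketch below is only an illustration: the three layers, their names and their messages are invented, and the innermost one always fails to keep it short.

```cpp
#include <exception>
#include <stdexcept>
#include <string>

// Innermost layer (hypothetical): which call failed and why.
void read_config_file(const std::string& path) {
    throw std::runtime_error("open(\"" + path + "\") failed: no such file");
}

// Middle layer: what calling the API was trying to achieve.
void load_settings(const std::string& path) {
    try {
        read_config_file(path);
    } catch (...) {
        std::throw_with_nested(
            std::runtime_error("could not load settings from " + path));
    }
}

// Outer layer: which high-level operation was aborted as a result.
void start_backup(const std::string& config_path) {
    try {
        load_settings(config_path);
    } catch (...) {
        std::throw_with_nested(std::runtime_error("backup aborted"));
    }
}

// Flattens the chain into one description, outermost context first.
std::string describe(const std::exception& e) {
    std::string text = e.what();
    try {
        std::rethrow_if_nested(e);
    } catch (const std::exception& inner) {
        text += ": " + describe(inner);
    } catch (...) {
        text += ": unknown error";
    }
    return text;
}
```

Catching the exception from start_backup("backup.conf") and passing it to describe yields one line that carries all three layers, with nothing thrown away on the way up.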

The second biggest sin is discarding valuable information

The lossy propagation problem

Context information is frequently lost when propagating the error condition from the lower levels of a system to the higher levels of the system.

  • Throwing away information
  • Not augmenting the error information with context that only intermediate layers know

I Have No Mouth, and I Must Scream

What do we do when we're forced to implement an interface that does not allow us to report back all the error information that we have?

Side channels!

Side channels

Examples of standard side channels:

errno                 POSIX API
SetLastError()        Windows Win32 API
Log files             Common practice
Process dump files    Common OS mechanism

Thread Local Storage is often used to create ad-hoc side-channels.
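
A rough sketch of such an ad-hoc side channel, assuming we are forced to implement an interface that can only return true or false. Validator, NonEmptyValidator and last_error_detail are hypothetical names.

```cpp
#include <string>

// Hypothetical interface we are forced to implement:
// it can only report success or failure.
struct Validator {
    virtual ~Validator() = default;
    virtual bool validate(const std::string& input) = 0;
};

// Ad-hoc side channel: per-thread detail about the most recent failure,
// in the spirit of errno.
thread_local std::string last_error_detail;

struct NonEmptyValidator : Validator {
    bool validate(const std::string& input) override {
        if (input.empty()) {
            last_error_detail =
                "input was empty; expected at least one character";
            return false;
        }
        last_error_detail.clear();
        return true;
    }
};
```

Like errno, the detail is only meaningful right after a failed call on the same thread, which is the usual weakness of this kind of channel.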

Correlation IDs

A unique identifier given to an ongoing activity that is provided both in main reporting channels and side channels to allow correlation of information passed through each channel.

Pass correlation IDs across network boundaries.
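
A small sketch of the idea, with several stand-ins: log_lines plays the role of a log file, the "req-N" counter is only unique within one process (a real system would more likely use a UUID or an ID received from the caller), and the handler always fails to keep the example short. All names are invented for illustration.

```cpp
#include <atomic>
#include <string>
#include <vector>

// Stand-in for a log file: the side channel read by developers.
std::vector<std::string> log_lines;

// Hypothetical ID generator, unique within this process only.
std::string next_correlation_id() {
    static std::atomic<unsigned> counter{0};
    return "req-" + std::to_string(++counter);
}

struct Response {
    int status;
    std::string body;
};

// Always fails, to show the same ID going out through both channels.
Response handle_request(const std::string& url) {
    const std::string id = next_correlation_id();
    // Side channel: full technical detail, tagged with the ID.
    log_lines.push_back("[" + id + "] GET " + url +
                        " failed: upstream timeout");
    // Main channel: short message for the user, carrying the same ID.
    return {503, "Service unavailable (reference " + id + ")"};
}
```

A user who reports the reference from the response gives the developer exactly the string to search for in the log.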

Error codes

Identifiers you can paste into Google to find other people complaining about the same problem.

UX and DX

UX: The User Experience of users who just want to get on with their work with as little fuss as possible when an error occurs.
DX: The Developer Experience of developers who are trying to diagnose and possibly eliminate errors that have been reported.

Don't confuse the two. Don't communicate with the user through log files.

UX: What users would like to know

  • Is it worth trying again later?
  • What is the current state of my work and the operation I tried to do?
  • What aspect of what I tried to do triggered the problem?
  • Is there anything I can do to make the problem go away?
  • If I need a developer to look at the problem, what do I tell them to look for?

Cop outs

  • Try again later.
  • Contact your system administrator.

Translation: The UI layer has no idea what happened.

UX is not putting lipstick on a pig

Good UX often requires improving context capture.

Actionable conditions

  • If something is missing, guide the user to provide the information
  • If some input or setting is wrong, guide the user to fix it
  • If something needs to be installed or upgraded, help the user do so
  • If the user lacks permission, help the user request permission from the right person
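
As a sketch, the four conditions above could be modelled as categories that the UI layer maps to guidance and an offered action. Problem, Guidance and guidance_for are hypothetical names, and the wording of the messages is only an example.

```cpp
#include <string>

// Hypothetical categories matching the four actionable conditions.
enum class Problem { MissingInput, InvalidSetting, NeedsUpgrade, NoPermission };

struct Guidance {
    std::string message;  // what to tell the user
    std::string action;   // label of the action offered in the UI
};

// subject is the captured context: which input, setting, component or
// resource the problem is about.
Guidance guidance_for(Problem problem, const std::string& subject) {
    switch (problem) {
    case Problem::MissingInput:
        return {"Please provide " + subject + ".", "Enter value"};
    case Problem::InvalidSetting:
        return {"The setting " + subject + " has a value that cannot be used.",
                "Open settings"};
    case Problem::NeedsUpgrade:
        return {subject + " needs to be upgraded.", "Upgrade now"};
    case Problem::NoPermission:
        return {"You do not have permission to access " + subject + ".",
                "Request access"};
    }
    // Not reached for the four categories above.
    return {"Something went wrong with " + subject + ".", "Show details"};
}
```

This only works if the lower layers captured the subject in the first place.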

Expected API errors are not application errors

Uninitialized, incomplete, or unfinished are common states that users don't consider errors.

“A clever person solves a problem. A wise person avoids it.”

— Anonymous

“I never said half the crap people said I did.”

— Albert Einstein

Graceful degradation

Make the best of what you've got.

Don't close down the restaurant when you run out of parsley


Bonus slides...

API contracts

Context Capture behavior effectively becomes part of the API as soon as outer layers start depending on it.