Context capture, restart points and UX

A different perspective on error handling

Rolf W. Rasmussen

rolfwr.net

This is not a talk about control flow mechanisms

Too much time has been spent arguing about error return codes vs exceptions already

Nor is it a talk about log file formats and distributed logging protocols

Structured log formats and GELF are pretty neat, but that is beside the point

Nor is it a talk about colors, fonts, layout, spacing and animation

UX ≠ UI + animation

Why do we want to handle errors?

Let's look at a motivating example

What the world needs is a program for calculating remainders

Task: Write a program that reads two integers a and b and prints the result of a modulo b.

Easy! I know C++!


    #include <iostream>

    int main() {
        int a, b;
        std::cin >> a;
        std::cin >> b;
        int remainder = a % b;
        std::cout << remainder << std::endl;
    }

What, if anything, is wrong with this code?

See, it works

User testing: The author

User testing: The astronomer

User testing: The mathematician

“Black holes are where God divided by zero”

— Steven Wright, comedian

No, this is not an Einstein quote, despite what you read on the internet.

Platform testing: Amiga

The test engineer everyone fears

Issue triage

How would you deal with these reported issues?

Author: Fails when using Chicago Manual of Style which spells out single digits.	FR/UX/CC
Astronomer: Sun to Saturn distance in meters gives wrong AU remainder	B/UX/FR
Mathematician: I need a domain and codomain of ℤ, not `significand` × 2^exponent	UX/CC
Mathematician: Sign should be equal to divisor, not dividend. See Mathematica.	Doc/FR
Dinosaur: Sometimes crashes and reboots our salary processing machine	B/RP
Tester: Inconsistent error messages under low memory conditions	Doc
Junior dev: Unable to reproduce reported crash. Can we add logging?	CC
Trekkie: The core should only be dumped in the event of a warp core breach	Joke
UI designer: Technical error messages are confusing and scary. Hide them.	UX
UX designer: Users would benefit from seeing a preview of results while they type	UX/RP

Isolation through restart points

When a failure occurs, abort the directly affected activity. Let other isolated activities continue running.

OS process scheduler as restart point

Most modern OSes provide process isolation.

Browsers use one process per tab

Service managers as restart points

Service managers like Linux systemd, macOS launchd, and the Windows Service Control Manager provide the ability to restart service processes if they fail.

Application loops as restart points

If an application performs many independent tasks, then it often makes sense to allow unaffected tasks to run to completion even if one of the tasks fails.

Each incoming request in a server could be considered a separate task.
Each file copy operation in a backup program could be considered a separate task.
Each HTTP GET request in a webcrawler could be considered a separate task.
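
A minimal sketch of an application loop acting as a restart point. The `task` type and the `run_all` function are hypothetical names introduced for illustration and are not part of the talk's example program.

```cpp
#include <exception>
#include <functional>
#include <iostream>
#include <string>
#include <vector>

// Hypothetical task type: a name for reporting plus the work to perform.
struct task {
    std::string name;
    std::function<void()> run;
};

// Runs every task. A failing task is reported and abandoned, and the loop
// moves on to the next one. Returns the number of tasks that failed.
int run_all(const std::vector<task>& tasks) {
    int failures = 0;
    for (const task& t : tasks) {
        try {
            t.run();
        } catch (const std::exception& err) {
            // The loop is the restart point: only this task is aborted.
            std::cerr << "Task '" << t.name << "' failed: " << err.what()
                << std::endl;
            ++failures;
        }
    }

    return failures;
}
```

Each file copy or request handler would be wrapped as one `task`. The failure count lets the caller decide how to report the overall outcome.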

Restarting integer acquisition


int acquire_int() {
    while (true) {
        try {
            return request_int();
        } catch (const parse_error& err) {
            print_error(err.what(), err.state);
        }

        std::cout << "Try entering an integer value again." <<
            std::endl;
    }
}

Modified main function


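int main() {
    int a = acquire_int();
    int b = acquire_int();
    if (b == 0) {
        std::cout << "Remainder undefined when dividing by "
            "zero." << std::endl;
    } else if (b == -1) {
        // Every integer is divisible by -1. Handled separately because
        // INT_MIN % -1 overflows, which is undefined behavior.
        std::cout << 0 << std::endl;
    } else {
        int remainder = a % b;
        std::cout << remainder << std::endl;
    }
}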

Multiple layers of restart points

Imagine a modern HTTP server running as a service

If the server process fails, then the service manager restarts it
If an accepted network connection fails, the accept loop will continue to accept new connections
If an HTTP request handler fails, an error will be sent back, and the HTTP protocol parser will continue to process requests on the connection.

The biggest sin in error handling is corrupting state or result

If you can't isolate the error, then don't continue

Memory and data corruption side effects are very hard to debug when the visible failure occurs long after the initial error.

Restart ≠ Unsolicited retry

For low-level code, the most useful behavior is to fail fast on error.

Only do retries at the outermost context. In interactive applications, the user should be kept in the loop.

Implementing retry logic at multiple layers causes unwanted delay amplification and catastrophic cascades.

Don't retry operations that encounter permanent errors

Duh.
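
One way to keep retries in the outermost context and away from permanent errors is sketched below. The `transient_error` type and the `retry_transient` helper are assumptions made for this example. In an interactive application the loop would ask the user before each new attempt instead of retrying silently.

```cpp
#include <functional>
#include <stdexcept>

// Hypothetical marker type for failures that may succeed on a later attempt
// (for example a timeout). Every other exception is treated as permanent.
struct transient_error : std::runtime_error {
    using std::runtime_error::runtime_error;
};

// Intended for the outermost context only. Runs `operation` at least once
// and at most `max_attempts` times, retrying only after a transient_error.
// Permanent errors, and the final transient error, propagate to the caller.
int retry_transient(const std::function<int()>& operation, int max_attempts) {
    for (int attempt = 1; ; ++attempt) {
        try {
            return operation();
        } catch (const transient_error&) {
            if (attempt >= max_attempts) {
                throw; // Out of attempts: report the failure instead.
            }
        }
    }
}
```

Lower layers simply throw and fail fast. Only the caller at the top decides whether another attempt is worthwhile.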

“Insanity Is Doing the Same Thing Over and Over Again and Expecting Different Results”

— Anonymous Al-Anon attendee
(probably)

Still not an Einstein quote, despite what the internet says.

Context Capture

As you abort an activity and pass the error condition up to outer contexts, don't throw away the information only known by the inner contexts.

Examples of things to capture

The URL of the failed HTTP request
The path of the file that could not be found
The arguments and the mathematical operation that failed due to division by zero
The line in the configuration file whose value did not work

Context capture layers

Each layer of context can contribute to the description of the error condition, with the outer layers adding higher-level information.

What API call failed and why it failed.
What calling the API was trying to achieve.
What high-level operation or task was aborted as a result.
How, with what input, and why the high level operation was started in the first place.
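
The layering above can be sketched with the standard `std::throw_with_nested` and `std::rethrow_if_nested` facilities. The functions `open_config`, `load_settings`, and `describe` are hypothetical, and `open_config` always fails here to simulate the innermost error.

```cpp
#include <exception>
#include <stdexcept>
#include <string>

// Innermost layer (hypothetical): knows which call failed and why.
void open_config(const std::string& path) {
    throw std::runtime_error("open(\"" + path + "\") failed: no such file");
}

// Intermediate layer (hypothetical): knows what the call was meant to achieve.
void load_settings(const std::string& path) {
    try {
        open_config(path);
    } catch (...) {
        // Adds this layer's context while keeping the inner error attached.
        std::throw_with_nested(std::runtime_error("Could not load settings"));
    }
}

// Flattens a chain of nested exceptions into one message, outermost first.
std::string describe(const std::exception& err) {
    std::string text = err.what();
    try {
        std::rethrow_if_nested(err);
    } catch (const std::exception& inner) {
        text += ": " + describe(inner);
    } catch (...) {
        text += ": unknown error";
    }

    return text;
}
```

Catching the exception thrown by `load_settings("app.conf")` and passing it to `describe` yields a single message that contains both the high-level purpose and the low-level cause.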

The second biggest sin is discarding valuable information

The lossy propagation problem

Context information is frequently lost when propagating the error condition from the lower levels of a system to the higher levels of the system.

Throwing away information
Not augmenting the error information with context that only intermediate layers know

Capturing integer parsing errors


int parse_int(parser_state& state) {
    skip_whitespace(state);
    int sign = parse_optional_sign(state);
    auto digit = parse_digit(state);
    if (!digit) {
        throw parse_error("Expected integer digit.", state);
    }

    constexpr int int_min = std::numeric_limits<int>::min();
    constexpr int int_max = std::numeric_limits<int>::max();
    int value = 0;
    do {
        // Check the range before accumulating: signed overflow is undefined
        // behavior, so it cannot be detected reliably after the fact.
        int d = digit.value();
        bool overflow = sign > 0 ? value > (int_max - d) / 10
                                 : value < (int_min + d) / 10;
        if (overflow) {
            std::ostringstream oss;
            oss << "Only integers between " << int_min <<
                " and " << int_max << " are supported.";
            throw parse_error(oss.str().c_str(), state);
        }

        value = value * 10 + sign * d;
        digit = parse_digit(state);
    } while (digit);

    // Trailing whitespace is allowed, but not whitespace between digits.
    skip_whitespace(state);
    if (state.pos != state.line.size()) {
        throw parse_error("Unexpected character.", state);
    }

    return value;
}

Captured context data


struct parser_state {
    std::string line;
    size_t pos;
};

struct parse_error : public std::runtime_error {
    parser_state state;
    parse_error(const char* what,
        const parser_state& error_state)
        : std::runtime_error(what), state(error_state)
    {
    }
};

I Have No Mouth, and I Must Scream

What do we do when we're forced to implement an interface that does not allow us to report back all the error information that we have?

Side channels!

Side channels

Examples of standard side channels:

errno	POSIX API
SetLastError()	Windows Win32 API
Log files	Common practice
Process dump files	Common OS mechanism

Thread Local Storage is often used to create ad-hoc side-channels.

Correlation IDs

A unique identifier given to an ongoing activity that is provided both in main reporting channels and side channels to allow correlation of information passed through each channel.

Pass correlation IDs across network boundaries.

Error codes

Identifiers you can paste into Google to find other people complaining about the same problem.

UX and DX

UX	The User Experience of users who just want to get on with their work with as little fuss as possible when an error occurs.
DX	The Developer Experience of developers that are trying to diagnose and possibly eliminate errors that have been reported.

Don't confuse the two. Don't communicate with the user through log files.

UX: What users would like to know

Is it worth trying again later?
What is the current state of my work and the operation I tried to do?
What aspect of what I tried to do triggered the problem?
Is there anything I can do to make the problem go away?
If I need a developer to look at the problem, what do I tell them to look for?

Cop outs

Try again later.
Contact your system administrator.

Translation: The UI layer has no idea what happened.

UX is not putting lipstick on a pig

Good UX often requires improving context capture.

Actionable conditions

If something is missing, guide the user to provide the information
If some input or setting is wrong, guide the user to fix it
If something needs to be installed or upgraded, help the user do so
If the user lacks permission, help the user request permission from the right person
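
The conditions above only become actionable if the captured context says what kind of problem occurred and what it concerns. A rough sketch, assuming a hypothetical `problem` description filled in by the lower layers:

```cpp
#include <string>

// Hypothetical classification of conditions the user can act on.
enum class problem_kind {
    missing_input,
    invalid_setting,
    needs_upgrade,
    permission_denied,
};

// Hypothetical error description. `subject` is the captured context, such as
// the name of a field or setting. `contact` is who can resolve it, if known.
struct problem {
    problem_kind kind;
    std::string subject;
    std::string contact;
};

// Turns a captured problem into guidance the user can act on.
std::string suggest_action(const problem& p) {
    switch (p.kind) {
    case problem_kind::missing_input:
        return "Please provide " + p.subject + ".";
    case problem_kind::invalid_setting:
        return "Check the value of " + p.subject + " and correct it.";
    case problem_kind::needs_upgrade:
        return "Install or upgrade " + p.subject + " to continue.";
    case problem_kind::permission_denied:
        return "Ask " + p.contact + " for access to " + p.subject + ".";
    }

    return "No suggestion available for " + p.subject + ".";
}
```

Without the `subject` and `contact` captured further down, the UI layer could only fall back on the cop-outs listed earlier.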

Explaining problems to the user


void print_error(const std::string& message,
    const parser_state& state)
{
    std::cerr << message << std::endl;
    std::cerr << "    " << state.line << std::endl;
    std::cerr << std::string(4 + state.pos, ' ') << "^" <<
        std::endl;
}

User testing

Expected API errors are not application errors

Uninitialized, incomplete, or unfinished are common states that users don't consider errors.

“A clever person solves a problem. A wise person avoids it.”

— Anonymous

“I never said half the crap people said I did.”

— Albert Einstein

Graceful degradation

Make the best of what you've got.

Don't close down the restaurant when you run out of parsley

Questions?

Bonus slides...

Parsing helpers


void skip_whitespace(parser_state& state) {
    while (state.pos < state.line.size() && state.line[state.pos] == ' ') {
        ++state.pos;
    }
}

int parse_optional_sign(parser_state& state) {
    if (state.pos < state.line.size() && state.line[state.pos] == '-') {
        ++state.pos;
        return -1;
    }

    return 1;
}

std::optional<int> parse_digit(parser_state& state) {
    if (state.pos < state.line.size()) {
        char c = state.line[state.pos];
        if (c >= '0' && c <= '9') {
            ++state.pos;
            return c - '0';
        }
    }

    return std::nullopt;
}

The remaining code


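#include <iostream>
#include <ostream>
#include <limits>
#include <string>
#include <optional>
#include <sstream>
#include <stdexcept>


int request_int() {
    parser_state state {};
    if (!std::getline(std::cin, state.line)) {
        // End of input is a permanent error. Throwing something other than
        // parse_error keeps acquire_int() from retrying forever.
        throw std::runtime_error("Unexpected end of input.");
    }
    return parse_int(state);
}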

API contracts

Context Capture behavior effectively becomes part of the API as soon as outer layers start depending on it.