Undo Ltd is a company based in Cambridge, UK, that develops reversible debugging technology. Stephen worked in the Undo engineering team, and this page describes his work from his perspective.

In my time at Undo I had the honour of contributing to innovative technology: UndoDB is a debugger that supports running a program backwards as well as forwards, by returning to a previous point in the program’s execution. Furthermore, it provides this feature with much lower overhead than technologies such as GDB’s built-in reverse debugging.

As I worked with UndoDB I realised the value this provides: debugging is almost always a backwards process. You start at the crash (i.e. the end of the execution) and follow the code back until you find the bug, which will have occurred some time earlier.

UndoDB is complex technology, essentially wrapping the entire kernel ABI, and this presents many interesting and novel technical challenges. This page summarises some of my key achievements at Undo.

Documentation

When I started at Undo (early 2015) it was still a very young company and so we had relatively limited external and internal documentation. I set about addressing this on my own initiative.

Internal documentation

Having used reStructuredText and Sphinx in the past with much success, I put together a system for writing internal documentation and started extracting information from knowledgeable members of the team and writing it down. This was useful in many ways:

  • The act of writing documentation about UndoDB’s internals helped to clarify for me how it worked.
  • If somebody asked me a question I could often point them to a document.
  • I could use the documentation as a quick reference.

Two primary output formats were supported: HTML and PDF. Cron jobs would re-generate these regularly and upload them to a central machine.
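The build step itself is straightforward to sketch. The following is a hypothetical illustration of such a driver (the actual source paths, cron configuration and upload step are assumptions, not shown on this page):

```python
import subprocess

def sphinx_cmd(builder, srcdir, outdir):
    """Construct a sphinx-build command line for the given builder."""
    return ["sphinx-build", "-b", builder, srcdir, outdir]

def build_docs(srcdir="docs", outdir="_build"):
    # HTML output, suitable for serving directly from a central machine.
    subprocess.run(sphinx_cmd("html", srcdir, outdir + "/html"), check=True)
    # LaTeX output; the PDF is then produced by running make in the
    # generated LaTeX directory (Sphinx provides an "all-pdf" target).
    subprocess.run(sphinx_cmd("latex", srcdir, outdir + "/latex"), check=True)
    subprocess.run(["make", "-C", outdir + "/latex", "all-pdf"], check=True)
```

A cron job would simply call something like `build_docs()` on a schedule and copy the results to the central machine.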

While writing documents I advocated for the system, and soon the entire team was using it and contributing content. I also got agreement that documentation would be considered a key part of development (i.e. alongside coding, testing and code review), and that new product features would first have written specs.

External documentation

You can view Undo’s external documentation here.

Following on from the success of the internal documentation, I put together a very similar system for external documentation. I collaborated with another developer to write the initial content, which was audited by the team before being included in releases.

Fixing PDF output

It turns out Sphinx’s LaTeX output (used to generate the PDF) has some issues, most notably that verbatim and \code{} blocks wouldn’t wrap. I fixed this by writing a Python script that inserts the necessary newlines (this really seemed to be the only solution); the script had to lex and parse the LaTeX to avoid inserting a newline in the middle of a LaTeX command.
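The core of such a script might look like the following. This is a simplified, hypothetical reconstruction (the real script handled far more of LaTeX’s syntax): it tokenises a line so that a command and its brace group are treated as one unbreakable unit, then packs tokens into lines of at most a given width.

```python
import re

# A token is either a LaTeX command (optionally with one brace group),
# an escaped character, or a single character. Commands must never be
# split across lines.
TOKEN = re.compile(r"\\[A-Za-z]+(?:\{[^{}]*\})?|\\.|.")

def wrap_verbatim_line(line, width=80):
    """Insert newlines so no output line exceeds `width` columns,
    without ever breaking inside a LaTeX command token."""
    chunks, current = [], ""
    for token in TOKEN.findall(line):
        if current and len(current) + len(token) > width:
            chunks.append(current)
            current = ""
        current += token
    if current:
        chunks.append(current)
    return "\n".join(chunks)
```

Plain characters wrap at the width limit, while a token such as `\textbf{...}` is carried whole onto the next line.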

Extensions

Many of the documents related to a specific Phabricator task, so I decided to write an extension that would turn task IDs into a link to the task. It turns out this is really easy to do and I subsequently created a second extension for links from the internal documentation into the external documentation.

Packaged GDB

Undo can work with many debugger frontends and the most commonly used is GDB. Substantial effort has been put into getting a reliable experience from the integration of Undo and GDB, and during my time at Undo I worked extensively on this (in the process finding many GDB bugs and communicating with upstream GDB developers).

A key problem affecting this integration is that there are many versions of GDB, with small, subtle differences between them. During my time at Undo I put together a system for building a particular GDB version that we knew would work well (with some patches to fix observed issues) and packaging it as part of our release.

This involved overcoming tedious details about how to get GDB and its dependencies to cross-compile correctly, creating a multi-target GDB (which means e.g. being able to debug a program running on an ARM board from an x86 desktop), and patching GDB to fix any bugs. I created supporting infrastructure in Python which allowed building the packaged GDB in various ways, and even building multiple versions of GDB (useful when upgrading the packaged GDB to a new upstream version).

GDB bisect

After fixing a few bugs around integration of Undo and GDB, I realised that the difficult part of most bugs was working out where the problem had initially appeared in GDB and (in some cases) where it was fixed.

To make this simpler, I decided to automate the process with a new in-house tool called ‘GDB bisect’. GDB bisect can be run on any integration test (i.e. just ./gdb_bisect.py <test>) and it runs UndoDB in that test against multiple versions of GDB (e.g. GDB 7.1, 7.2, etc.). Once it has identified the first affected GDB version, it bisects further over the GDB commits between releases and ultimately reports the exact commit.
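The underlying search is an ordinary bisection, applied first to releases and then to the commits between them. A minimal sketch of the core idea (my own illustration, not the real tool; `is_bad` stands in for running the integration test against one cached GDB build):

```python
def first_bad(candidates, is_bad):
    """Return the first candidate for which is_bad() is True, or None.

    Assumes candidates are ordered (e.g. GDB releases, or the commits
    between two releases) and that once a candidate is bad, all later
    candidates are bad too.
    """
    lo, hi = 0, len(candidates)
    while lo < hi:
        mid = (lo + hi) // 2
        if is_bad(candidates[mid]):
            hi = mid       # the first bad candidate is mid or earlier
        else:
            lo = mid + 1   # the first bad candidate is after mid
    return candidates[lo] if lo < len(candidates) else None
```

Run once over releases to find the first bad release, then again over the commits between that release and its predecessor to find the exact commit.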

It took me a weekend to create GDB bisect and it saved weeks’ worth of time. More importantly, it gave us a more reliable and precise way of determining the cause of a test failure; you run it in the background and some time later it points you in the right direction. The tool cached builds of GDB on a central machine, so bisects would normally take under 10 minutes (assuming the test itself only takes a few seconds to run).

Asynchronous I/O

Linux supports asynchronous I/O, which allows userspace processes to issue multiple simultaneous I/O requests to the kernel which will complete in the background. The pattern of system calls is roughly (see the man pages for a more detailed description!):

  • io_setup() - Create an AIO context, which is basically a group of operations.
  • io_submit() - Submit one or more operations (e.g. a read) against a given AIO context.
  • io_getevents() - Get the status of any completed/failed/cancelled operations.
  • io_destroy() - Destroy the AIO context.

These system calls must be recorded and replayed by UndoDB, and this is actually very complex to perform correctly and efficiently. We had some initial code that supported the basic use cases and my task was to extend this for more advanced use cases.

In taking on this task, as with all my other software engineering work, I was careful to:

  • Discuss implementation approaches with the team and try to find the simplest approach.
  • Re-structure the existing code to improve clarity and prepare it for further changes (this often reveals and fixes latent bugs).
  • Add new integration and unit tests to verify the new behaviour (Test-Driven Development) and to catch regressions later.
  • Document (internally) details of the implementation (useful as an informal specification).
  • Document (externally) how to use new functionality and any caveats.
  • Arrange review of the code for double-checking and to ensure the code complies with relevant standards.

Beyond this I significantly improved the surrounding infrastructure and processes, ensuring I delivered a high-quality result all round.