Last year on the Jazz project, I helped design and implement a simple REST protocol for long-running operations, or long-ops. I’ve explained the idea enough times in random conversations that I thought it would make sense to write it down.

I’ll first write about the concrete problem we solved and then talk about the more abstract class of problems that the solution supports.

Example: Jazz Lifecycle Project Creation

Rational sells three particular team products that deal with requirements management, development, and test management, respectively. These products must work individually but also together if more than one is present in a customer environment. Each product has a notion of “project”. In the case where a customer has more than one product installed in their environment, we wanted to be able to let a customer press a button and create a “lifecycle project” that is basically a lightweight aggregation of the concrete projects (e.g. the requirements project, the development project, and the test project).

So we created a rather simple web application called “Lifecycle Project Administration” that logically and physically sits outside the products and gives a customer the ability to press a button and create a lifecycle project, create the underlying projects, and link everything together.

This presented a couple of problems, but I want to focus on the UI problem that pushed us towards the RESTy long-op protocol. Creating a project area can take between 30 seconds and a minute, depending on the complexity of the initialization routine. Since the lifecycle project creation operation aggregated several project creation operations plus some other stuff, it could take several minutes. A crude way to implement this UI would be to just show a “Creating lifecycle project area, please wait” message and perhaps a fakey progress monitor for several minutes until all of the tasks complete. In a desktop UI operating on local resources, you would use a rather fine-grained progress monitor that provides feedback on the set of tasks that need to run, the currently running tasks, and the current percent complete of the total task.

We brainstormed ways to build something like a progress monitor that could show fine-grained progress while running the set of remote operations required to create a lifecycle project and its subtasks. The solution was the RESTy long-op protocol. First I’ll talk about how one would typically do “normal, simple RESTful creation”.

Simple RESTy Creation

A common creation pattern in RESTful web services is to POST to a collection. It goes something like this:

Request

POST /people HTTP/1.1
Host: example.com

{
    "name": "Bill Higgins",
    "userId": "billh"
}

Response

HTTP/1.1 201 Created
Location: http://example.com/people/billh

The 201 status code of course indicates that the operation resulted in the creation of a resource and the Location header provides the URI for the new resource.

From a UI point of view, this works fine for a creation operation that takes a few seconds, but not so well for a creation operation that takes several minutes, like the lifecycle project administration case. So let’s look at the RESTy long-op protocol.

The RESTy Long-op Protocol

In this example, I’ll use a simplified form of lifecycle project creation:

Creation Request

POST /lifecycle-projects HTTP/1.1
Host: example.com

{
    "name": "Bill's Lifecycle Project",
    "template": "com.ibm.team.alm.req-dev-test"
}

Just to explain the request body, the name is simply the display name and the template is the ID of a template that defines the set of concrete projects that should be created and how they should be linked together.

Here’s what the response looks like:

Response

HTTP/1.1 202 Accepted
Location: http://example.com/jobs/5933

Rather than responding with a URL for a resource that was created, the server responds with a 202 'Accepted' status and the location of a “Job” resource that reports on the status of the long-running task of creating (or updating) the resource.
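Here is a minimal sketch of how a client might tell the two kinds of creation response apart, assuming it only looks at the status code and the Location header. The names `CreateOutcome` and `interpret_create_response` are made up for illustration and are not part of the Jazz code.

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class CreateOutcome:
    """Result of a creation POST: a finished resource or a job to poll."""
    kind: str  # "resource" or "job"
    url: str


def interpret_create_response(status: int, headers: Mapping[str, str]) -> CreateOutcome:
    """Classify a creation response by its status code (hypothetical helper)."""
    # HTTP header names are case-insensitive, so normalize before the lookup.
    normalized = {name.lower(): value for name, value in headers.items()}
    location = normalized.get("location")
    if location is None:
        raise ValueError("creation response has no Location header")
    if status == 201:
        # Simple creation: Location is the new resource itself.
        return CreateOutcome("resource", location)
    if status == 202:
        # Long-op: Location is the job resource that the client should poll.
        return CreateOutcome("job", location)
    raise ValueError(f"unexpected status for creation: {status}")
```

A caller would branch on `kind`: fetch the resource right away for "resource", or start polling for "job".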

Now the client polls the location of the “job”; the job is a hierarchical resource representing the state and resolution of the top-level job and the sub-jobs (called “steps” below). It also includes a top-level property called resource that will eventually point to the URI of the resource that you are trying to create or update (in this case the lifecycle project).

Job Polling Request

GET /jobs/5933 HTTP/1.1
Host: example.com

Job Polling Response

HTTP/1.1 200 OK

{
    "title": "Creating lifecycle project 'Bill's Lifecycle Project'",
    "state": "IN_PROGRESS",
    "resolution": null,
    "resource": null,
    "steps": [
        {
            "title": "Creating requirements project",
            "state": "COMPLETE",
            "resolution": "SUCCESS"
        },
        {
            "title": "Creating development project",
            "state": "IN_PROGRESS",
            "resolution": null
        },
        {
            "title": "Creating project linkages",
            "state": "NOT_STARTED",
            "resolution": null
        },
        {
            "title": "Creating lifecycle project",
            "state": "NOT_STARTED",
            "resolution": null
        }
    ]
}
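To give a feel for how a progress widget might consume this document, here is a rough sketch that reduces a job to a title, a percentage, and the currently running step titles. The function name `summarize_job` is hypothetical, and counting completed steps is only one possible way to estimate progress; it assumes every step is weighted equally.

```python
from typing import Any, Dict, List


def summarize_job(job: Dict[str, Any]) -> Dict[str, Any]:
    """Reduce a job document to the few values a progress widget needs."""
    steps: List[Dict[str, Any]] = job.get("steps") or []
    completed = [step for step in steps if step["state"] == "COMPLETE"]
    running = [step["title"] for step in steps if step["state"] == "IN_PROGRESS"]
    if steps:
        # Step-count based estimate: each step counts the same.
        percent = round(100 * len(completed) / len(steps))
    else:
        # A job without steps is either done or not.
        percent = 100 if job.get("state") == "COMPLETE" else 0
    return {
        "title": job["title"],
        "percent": percent,
        "running": running,
        "finished": job.get("resolution") is not None,
    }
```

For the response above, this would report one of four steps complete and “Creating development project” as the running step.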

At some point the top-level job has a non-null resolution and a non-null resource. The resource property holds the complete URI of the original thing you tried to create or update (in this case the lifecycle project), and the client can now GET it.

GET /lifecycle-projects/bills-lifecycle-project HTTP/1.1
Host: example.com

(I’ll omit the structure of the lifecycle project, as it’s not relevant to this discussion.)
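The whole polling flow can be sketched in a few lines. This is an illustration under stated assumptions, not our actual client: `wait_for_job` is a made-up name, `fetch_job` stands for whatever function performs the GET on the job URL and parses the JSON, and the interval and timeout defaults are arbitrary.

```python
import time
from typing import Any, Callable, Dict


def wait_for_job(
    fetch_job: Callable[[], Dict[str, Any]],
    interval_seconds: float = 2.0,
    timeout_seconds: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Dict[str, Any]:
    """Poll a job until its top-level resolution is non-null, then return it."""
    deadline = clock() + timeout_seconds
    while True:
        job = fetch_job()
        if job.get("resolution") is not None:
            # The job is resolved; the caller decides what to do with it.
            return job
        if clock() >= deadline:
            raise TimeoutError("job did not resolve before the timeout")
        sleep(interval_seconds)
```

Once it returns, the caller would check the resolution and, if the resource property is set, GET that URI. The examples here only show the "SUCCESS" resolution, so how a client should treat other values is left open.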

Demo

Here’s a demo I recorded last year of an early version of Lifecycle Project Administration that shows this protocol in action:

Uses

This protocol supports a set of related patterns:

  • Long-running operations
  • Asynchronous operations
  • Composite tasks

You can use this protocol to support one or a combination of these patterns. E.g. you could have a single task (i.e. not a composite) that takes a long time and therefore you still want to use an asynchronous user experience.

Critique

Here are a few good things about this protocol:

  • Facilitates better feedback to people who invoke long-running, perhaps composite operations, through your UI.
  • Decouples the monitoring of a long-running composite operation from its execution and implementation; for all you know the composite task could be running in parallel across a server farm or it could be running on a single node.
  • Supports a flexible user experience; you could implement a number of different progress monitor UIs based on the information above.

Here are a few not-so-nice things about this protocol:

  • Not based on a standard.
  • Requires some expectation that the original create/update request might result in a long-running operation, and the only way you have to know that it’s a job resource (vs. the actual created or updated resource) is by the 202 Accepted response code (which could be ambiguous) and/or by content sniffing.
  • Doesn’t help much with recovering from complete or partial failure, retrying, cancellation, etc., though I’m sure you can see ways of achieving these things with a few additions to the protocol. We just didn’t need/want the additional complexity.

Implementation Notes

I would like to write a bit about some of the implementation patterns, but I think this entry is long enough, so I’ll just jot down some important points quickly.

  • Your primary client for polling the jobs should be a simple headless client library type thing that allows higher level code to register to be notified of updates. In most cases you’ll have more than one observer (e.g. the progress widget itself that redraws with any step update and the page that updates when the ultimate resource becomes available).
  • Your backend should persist the job entries as it creates and updates them. This allows you to decouple where the tasks in the composite execute from where the front-end can fetch the current status. This also allows you to run analytics over your job data over time to understand better what’s happening.
  • The persistent form of the job should store additional data (e.g. the durations for each task to complete) for additional analytics and perhaps better feedback to the user (e.g. time estimate for the overall job and steps based on historical data).
  • Of course you’ll want to cache all over the place on the job resources since you poll them and in most cases the status won’t have changed.
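As an illustration of the first point, here is a rough sketch of a headless poller that lets several observers register for updates. The class name `JobPoller` and its methods are hypothetical, and `fetch_job` again stands for whatever performs the GET and parses the JSON. It only notifies observers when the job document has actually changed, since most polls return the same status.

```python
from typing import Any, Callable, Dict, List, Optional

Job = Dict[str, Any]
Observer = Callable[[Job], None]


class JobPoller:
    """Headless poller: fetches a job and notifies observers on change."""

    def __init__(self, fetch_job: Callable[[], Job]) -> None:
        self._fetch_job = fetch_job
        self._observers: List[Observer] = []
        self._last: Optional[Job] = None

    def subscribe(self, observer: Observer) -> None:
        """Register a callback that receives each changed job document."""
        self._observers.append(observer)

    def poll_once(self) -> bool:
        """Fetch the job once; return True when the job has a resolution."""
        job = self._fetch_job()
        if job != self._last:
            # Only bother the observers when something is different.
            self._last = job
            for observer in list(self._observers):
                observer(job)
        return job.get("resolution") is not None
```

A progress widget and the page itself could each call `subscribe`, while a timer calls `poll_once` until it returns True.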

In Closing

I don’t think this protocol is perfect, and I’m sure I’m not the first one to come up with such a protocol, but we’ve found it useful and you might too. I’d be interested if anyone has suggestions for improvement and/or pointers to similar protocols. I remember I first learned about some of these basic patterns from a JavaRanch article my IBM colleague Kyle Brown wrote way back in 2004. 🙂

Updates

Pretty much as soon as I published this, several folks on Twitter cited similar protocols:

Thanks very much William, Sam, and Dims.