RESTy long-ops // Bill Higgins' Blog

Last year on the Jazz project, I helped design and implement a simple REST protocol for long-running operations, or long-ops. I’ve explained the idea enough times in random conversations that I thought it would make sense to write it down.

I’ll first write about the concrete problem we solved and then talk about the more abstract class of problems that the solution supports.

Example: Jazz Lifecycle Project Creation

Rational sells three particular team products that deal with requirements management, development, and test management, respectively. These products must work individually but also together if more than one is present in a customer environment. Each product has a notion of “project”. In the case where a customer has more than one product installed in their environment, we wanted to be able to let a customer press a button and create a “lifecycle project” that is basically a lightweight aggregation of the concrete projects (e.g. the requirements project, the development project, and the test project).

So we created a rather simple web application called “Lifecycle Project Administration” that logically and physically sits outside the products and gives a customer the ability to press a button and create a lifecycle project, create the underlying projects, and link everything together.

This presented a couple of problems, but I want to focus on the UI problem that pushed us towards the RESTy long-op protocol. Creating a project area can take between 30 seconds and a minute, depending on the complexity of the initialization routine. Since the lifecycle project creation operation aggregated several project creation operations plus some other stuff, it could take several minutes. A crude way to implement this UI would be to just show a “Creating lifecycle project area, please wait” message and perhaps a fakey progress monitor for several minutes until all of the tasks complete. In a desktop UI operating on local resources, you would use a rather fine-grained progress monitor that provides feedback on the set of tasks that need to run, the currently running tasks, and the current percent complete of the total task.

We brainstormed ways to build something like a progress monitor that could show fine-grained progress while running the set of remote operations required to create a lifecycle project and its subtasks. The solution was the RESTy long-op protocol. First I’ll talk about how one would typically do “normal, simple RESTful creation”.

Simple RESTy Creation

A common creation pattern in RESTful web services is to POST to a collection. It goes something like this:

Request

POST /people HTTP/1.1
Host: example.com

{
    "name": "Bill Higgins",
    "userId": "billh"
}

Response

HTTP/1.1 201 Created
Location: http://example.com/people/billh

The 201 status code of course indicates that the operation resulted in the creation of a resource and the Location header provides the URI for the new resource.
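To make the client side of this exchange concrete, here is a minimal Python sketch of how a client might pull the new resource's URL out of a 201 response. The function name and its arguments are hypothetical (they are not part of any Jazz API); it just takes the status code and a dict of headers, so it works with whatever HTTP library you use.

```python
from urllib.parse import urljoin


def created_resource_url(request_url, status, headers):
    """Return the URL of the newly created resource for a 201 response.

    Hypothetical helper for illustration: `headers` is a plain dict of
    response headers, `request_url` is the URL that was POSTed to.
    """
    if status != 201:
        raise ValueError(f"expected 201 Created, got {status}")
    # Header names are case-insensitive, so normalize before the lookup.
    normalized = {name.lower(): value for name, value in headers.items()}
    location = normalized.get("location")
    if location is None:
        raise ValueError("201 response did not include a Location header")
    # A Location value may be relative; resolve it against the request URL.
    return urljoin(request_url, location)
```

With the response above, `created_resource_url("http://example.com/people", 201, {"Location": "http://example.com/people/billh"})` gives back the `/people/billh` URL, which the client can then GET.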

From a UI point of view, this works fine for a creation operation that takes a few seconds, but not so well for a creation operation that takes several minutes, like the lifecycle project administration case. So let’s look at the RESTy long-op protocol.

The RESTy Long-op Protocol

In this example, I’ll use a simplified form of lifecycle project creation:

Creation Request

POST /lifecycle-projects HTTP/1.1
Host: example.com

{
    "name": "Bill's Lifecycle Project",
    "template": "com.ibm.team.alm.req-dev-test"
}

Just to explain the request body, the name is simply the display name and the template is the ID of a template that defines the set of concrete projects that should be created and how they should be linked together.

Here’s what the response looks like:

Response

HTTP/1.1 202 Accepted
Location: http://example.com/jobs/5933

Rather than responding with a URL for a resource that was created, the server responds with a 202 'Accepted' status and the location of a “Job” resource that reports on the status of the long-running task of creating (or updating) the resource.
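A client therefore has to branch on the status code to know what the Location header points at. A rough Python sketch of that decision follows, assuming the server uses only 201 and 202 for successful creation; the `CreationOutcome` type and the function name are made up for this example.

```python
from typing import NamedTuple


class CreationOutcome(NamedTuple):
    kind: str  # "resource" if created immediately, "job" if still running
    url: str   # value of the Location header


def interpret_creation_response(status, headers):
    """Decide whether Location names the new resource or a job to poll.

    Hypothetical helper: `headers` is a plain dict of response headers.
    """
    normalized = {name.lower(): value for name, value in headers.items()}
    location = normalized.get("location")
    if location is None:
        raise ValueError("response did not include a Location header")
    if status == 201:
        # Simple creation: Location is the created resource itself.
        return CreationOutcome("resource", location)
    if status == 202:
        # Long-op: Location is a job resource that reports progress.
        return CreationOutcome("job", location)
    raise ValueError(f"unexpected status {status} for a creation request")
```

For the response above, this returns an outcome with kind `"job"` and the `/jobs/5933` URL.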

Now the client polls the location of the “job”; the job is a hierarchical resource representing the state and resolution of the top-level job and the sub-jobs (called “steps” below). It also includes a top-level property called resource that will eventually point to the URI of the resource that you are trying to create or update (in this case the lifecycle project).

Job Polling Request

GET /jobs/5933 HTTP/1.1
Host: example.com

Job Polling Response

HTTP/1.1 200 OK

{
    "title": "Creating lifecycle project 'Bill's Lifecycle Project'",
    "state": "IN_PROGRESS",
    "resolution": null,
    "resource": null,
    "steps": [
        {
            "title": "Creating requirements project",
            "state": "COMPLETE",
            "resolution": "SUCCESS"
        },
        {
            "title": "Creating development project",
            "state": "IN_PROGRESS",
            "resolution": null
        },
        {
            "title": "Creating project linkages",
            "state": "NOT_STARTED",
            "resolution": null
        },
        {
            "title": "Creating lifecycle project",
            "state": "NOT_STARTED",
            "resolution": null
        }
    ]
}

At some point the top-level job has a non-null resolution and a non-null resource. The resource property holds the complete URI of the original thing you tried to create/update (in this case the lifecycle project), so the client can now GET it.

GET /lifecycle-projects/bills-lifecycle-project HTTP/1.1
Host: example.com

(I’ll omit the structure of the lifecycle project, as it’s not relevant to this discussion.)
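Putting the polling steps together, a client loop might look roughly like the Python sketch below. It is only an illustration: the function names, the two-second poll interval, and the ten-minute timeout are all assumptions, and the `fetch_json` argument is injected so the loop can be exercised without a real server.

```python
import json
import time
import urllib.request


def fetch_json_over_http(url):
    """Fetch a URL and decode its body as JSON (standard library only)."""
    with urllib.request.urlopen(url) as response:
        return json.load(response)


def wait_for_job(job_url, fetch_json=fetch_json_over_http,
                 poll_interval=2.0, timeout=600.0,
                 sleep=time.sleep, clock=time.monotonic):
    """Poll a job resource until its top-level resolution is non-null.

    Returns the final job document; the caller then inspects
    job["resolution"] and GETs job["resource"]. The interval and timeout
    defaults are arbitrary values chosen for this sketch.
    """
    deadline = clock() + timeout
    while True:
        job = fetch_json(job_url)
        if job["resolution"] is not None:
            return job
        if clock() >= deadline:
            raise TimeoutError(f"job at {job_url} did not finish in time")
        sleep(poll_interval)
```

A caller would run `job = wait_for_job("http://example.com/jobs/5933")` and, if `job["resolution"]` is `"SUCCESS"`, follow `job["resource"]` to the lifecycle project.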

Demo

Here’s a demo I recorded of an early version of Lifecycle Project Administration last year, that shows this protocol in action:

Uses

This protocol supports a set of related patterns:

Long-running operations
Asynchronous operations
Composite tasks

You can use this protocol to support one or a combination of these patterns. E.g. you could have a single task (i.e. not a composite) that takes a long time and therefore you still want to use an asynchronous user experience.

Critique

Here are a few good things about this protocol:

Facilitates better feedback to people who invoke long-running, perhaps composite operations, through your UI.
Decouples the monitoring of a long-running composite operation from its execution and implementation; for all you know the composite task could be running in parallel across a server farm or it could be running on a single node.
Supports a flexible user experience; you could implement a number of different progress monitor UIs based on the information above.
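As one example of the kind of progress monitor you could build on the job representation, here is a small Python sketch that renders a job as a plain-text checklist. The function name and the marker characters are invented for this example; it relies only on the title, state, and steps properties shown in the polling response earlier.

```python
def render_progress(job):
    """Render a job document as a plain-text checklist of its steps."""
    # Marker characters are an arbitrary choice for this sketch.
    markers = {"COMPLETE": "[x]", "IN_PROGRESS": "[>]", "NOT_STARTED": "[ ]"}
    steps = job.get("steps") or []  # a non-composite job may have no steps
    lines = [job["title"]]
    for step in steps:
        marker = markers.get(step["state"], "[?]")
        lines.append(f"  {marker} {step['title']}")
    if steps:
        done = sum(1 for step in steps if step["state"] == "COMPLETE")
        lines.append(f"  {done} of {len(steps)} steps complete")
    return "\n".join(lines)
```

A richer UI could draw the same information as a progress bar or a tree; the point is that the rendering is entirely up to the client.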

Here are a few not-so-nice things about this protocol:

Not based on a standard.
Requires some expectation that the original create/update request might result in a long-running operation, and the only way you have to know that it’s a job resource (vs. the actual created or updated resource) is by the 202 Accepted response code (which could be ambiguous) and/or by content sniffing.
Doesn’t help much with recovering from complete or partial failure, retrying, cancelation, etc. though I’m sure you can see ways of achieving these things with a few additions to the protocol. We just didn’t need/want the additional complexity.

Implementation Notes

I would like to write a bit about some of the implementation patterns, but I think this entry is long enough, so I’ll just jot down some important points quickly.

Your primary client for polling the jobs should be a simple headless client library type thing that allows higher level code to register to be notified of updates. In most cases you’ll have more than one observer (e.g. the progress widget itself that redraws with any step update and the page that updates when the ultimate resource becomes available).
Your backend should persist the job entries as it creates and updates them. This allows you to decouple where the tasks in the composite execute from where the front-end can fetch the current status. This also allows you to run analytics over your job data over time to understand better what’s happening.
The persistent form of the job should store additional data (e.g. the durations for each task to complete) for additional analytics and perhaps better feedback to the user (e.g. time estimate for the overall job and steps based on historical data).
Of course you’ll want to cache all over the place on the job resources since you poll them and in most cases the status won’t have changed.
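The first point, a headless poller with multiple observers, could be sketched in Python along these lines. This is not our actual client code; the class and method names are hypothetical, and the fetch function is passed in so that the class stays independent of any particular HTTP library.

```python
class JobMonitor:
    """Headless poller that notifies registered observers of job updates."""

    def __init__(self, job_url, fetch_json):
        self._job_url = job_url
        self._fetch_json = fetch_json  # callable: url -> decoded job dict
        self._observers = []
        self._last_job = None

    def subscribe(self, callback):
        """Register a callable that receives each changed job document."""
        self._observers.append(callback)

    def poll_once(self):
        """Fetch the job once; return True when it has a resolution."""
        job = self._fetch_json(self._job_url)
        # Only notify when something changed, since most polls see no change.
        if job != self._last_job:
            self._last_job = job
            for callback in list(self._observers):
                callback(job)
        return job["resolution"] is not None
```

Here the progress widget and the page would each call `subscribe`, and some timer would call `poll_once` until it returns `True`.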

In Closing

I don’t think this protocol is perfect, and I’m sure I’m not the first one to come up with such a protocol, but we’ve found it useful and you might too. I’d be interested if anyone has suggestions for improvement and/or pointers to similar protocols. I remember I first learned about some of these basic patterns from a JavaRanch article my IBM colleague Kyle Brown wrote way back in 2004. 🙂

Updates

Pretty much as soon as I published this, several folks on Twitter cited similar protocols:

William Vambenepe referenced this Tim Bray blog entry from 2009 on “Slow REST”
Sam Johnston referenced the O’Reilly RESTful Web Services Cookbook book’s chapter on “How to Use POST for Asynchronous Tasks”. Here’s a peek.
Dims Srinivas referenced Leonard Richardson and Sam Ruby’s RESTful Web Services book’s chapter on REST and ROA Best Practices. Here’s a peek.

Thanks very much William, Sam, and Dims.

Donald Smith says:

27 April, 2011 at 11:36 pm

This is a nice approach. As to your critique, you mention that the original create request requires some expectation by the client that it might result in a long-running operation. Another approach could be, instead of returning a job resource, to return the created resource with the status data embedded into the representation. A client that understands the media type would therefore understand that the resource has multiple possible states and could update the UI the same way.

This is essentially the same thing, except that you are not returning a different resource than the one that was requested. Plus, the media type definition provides the knowledge to the client of the varying states of your resource.

Bill Higgins says:

27 April, 2011 at 11:39 pm

Hi Donald, thanks much for taking the time to write the suggestion.

Your suggestion definitely addresses the “what the heck do I get back?” problem, but it makes me a bit uncomfortable that it couples information about the sausage-making of how the resource gets created with the resource itself.

Still, worth thinking more about, so thanks again.

Donald Smith says:

27 April, 2011 at 11:52 pm

No problem. I think the general theme is that there is no standard way of doing this and that is actually a good thing. The job approach is actually quite nice and one could argue that placing links in the job resource to point to the newly created resource would be sufficient to satisfy the REST constraints.

I appreciate solid REST posts that talk about the finer details of implementation. Since I began building an API this year I find myself trying to see everything from the client perspective. It is like asking a roofer how to lay shingles so they won’t leak. Think like water.

Sam Johnston says:

We came across this problem during the development of the Open Cloud Computing Interface (OCCI) and felt that it was useful to have task resources so as tasks in progress could be manipulated (esp DELETEd) and so that there was a history available.

Simon Johnston says:

28 April, 2011 at 11:31 am

We had actually built this into the Jazz Foundation in a few places also. It may not be a “standard” but it certainly seems to be best practice at a protocol level. If there were any advantage in standardization I would see value in defining some key properties of the job resource, some common meaning for your “state”, “resolution” and “resource” for example.

Bill Higgins says:

28 April, 2011 at 12:49 pm

Thanks Sam and Simon.

Simon: FYI, one of the motivations for this blog entry was to get it in front of the emerging OSLC Automation workgroup.

Simon Kaegi says:

2 May, 2011 at 4:45 pm

Very nice job and clear.
The one thing I wasn’t altogether comfortable with is the use of the status JSON content type. It looks to me that as a client in order to act on the task you would have to understand the JSON. Instead I wonder if you could support a basic level of processing just using HTTP Status codes.

Something like…
1) Use of a 202 (Accepted) and a Location header to redirect to the task location.
2) Continue to use a 202 (Accepted) at the task location.

— once the task completes —
A) If the task succeeded and there is no result of the task other than to suggest the task is done use a 200 (Ok)
B) If the task succeeded and there is a result use a 303 (See Other) and a Location header to redirect the client to the result.
C) If the task failed use an appropriate 4xx or 5xx response to describe why the task failed.

—
At each stage above I don’t see anything wrong with “also” sending the JSON status. In that way if you do have a smarter client that can provide more context to a user or server process it has some info to show. With that said, I think it’s at least interesting to see how far we can get when creating a protocol like this without having to add interpretation of a new content type.

Hendy Irawan says:

14 July, 2011 at 11:05 am

For those who prefer XMPP-y long ops, with true multiple asynchronous updates (no need for polling) with any payload, check out:

http://xws4j.sourceforge.net/

It’s based on XEP-0244 IO Data standard.

As for client support in web browsers, with WebSockets anything is possible 🙂

Heikki Toivonen says:

30 September, 2011 at 12:01 pm

I did almost the same in a previous company. The create response was the standard 201 Created with Location header. The client then polled on the location where 202 status meant the resource was still being built and finally 200 status when it was done.
