Note to readers: For a while now, I’ve been looking for guidance on designing useful messages and message-based systems, but without much luck. To help others and also because I learn by writing, I’m going to use my blog to document some of the messaging lessons I’ve learned over the past couple of years. I hope this blog entry and future ones like it don’t seem overly pedantic; my only goal is to help clarify my own thoughts and perhaps help others looking for similar information on a topic with which I’ve personally struggled.
In this blog entry, I talk about the fundamentals of caching resource representations in HTTP-based distributed systems using the language of basic concepts while avoiding HTTP terminology which might sidetrack novice readers. This entry does assume some knowledge of HTTP (e.g. requests, responses, URIs), so if you find these concepts sidetracking you, I’d suggest you read the first couple of chapters of a book like HTTP: The Definitive Guide to familiarize yourself.
If you’re already familiar with HTTP caching (e.g. most likely anyone reading this via Planet Intertwingly), you may wish to skip this entry altogether, unless you’re curious about my take on the topic or are interested in looking for mistakes or misrepresentations. If you do find a problem, please add a comment and I’ll attempt to correct and/or clarify.
Intro
One of the benefits of developing distributed applications using the REST architectural style with the HTTP protocol is their first-class support for caching documents (or ‘entity-bodies’ in HTTP terminology). If you’re simply serving files using a world-class web server like Apache HTTP Server, you get some degree of caching for free. But in dynamic web applications, you’re often generating dynamic documents (e.g. an XML document containing data from a row in a relational database) rather than simply serving files (where the resource and the representation are equivalent).
Unless you’re using an application framework that automatically generates caching information for HTTP responses based on the framework’s meta-data model, you’ll likely have to roll your own caching logic. This presents both a challenge and an opportunity. The challenge is that you must learn about the various HTTP caching options so that you can intelligently apply them to your particular data model; the opportunity is that you can often take advantage of your data model’s semantics to perform smarter caching logic than out-of-the-box file system caching.
In this entry I describe the basic rationale for caching and then discuss the basic caching options possible with the HTTP protocol. Note that I describe these caching options at a very high level, without getting into many implementation details, and at this level the ‘HTTP caching options’ are more like general caching patterns, but nevertheless I describe them in the context and using the language of HTTP, since it’s both a ubiquitously deployed protocol and the protocol with which I’m most familiar.
Why Cache?
Caching may be one of the most boring topics in software, but if you’re working with distributed systems (like the web), smart cache design is absolutely vital to both system scalability and responsiveness, among other things. In brief, a cache is simply a local copy of data that resides elsewhere. A computing component (whether hardware or software) uses a data cache to avoid performing an expensive operation like fetching data over a network or executing a computationally expensive algorithm. The trade-off is that your copy of the data may become out of sync with the original data source, or stale, in caching terminology. Whether or not staleness matters depends on the nature of the data and the needs of your application.
For example, if your web site displays the average daily temperature for Philadelphia over the past hundred years, you probably display a simple stored data element (e.g. “59 degrees F”) rather than performing this very expensive computation in realtime. Because it would take a long period of unusual weather to noticeably affect the result, it doesn’t really matter if your cached copy doesn’t consider very recent temperatures. At the other extreme, an automated teller machine (ATM) definitely should not use a cached copy of your checking account balance when determining whether you have enough money to make a withdrawal, since this might allow a malicious customer to make simultaneous withdrawals of his entire balance from multiple ATMs.
Generally speaking, the cacheability of a particular piece of data varies along two axes:
- the volatility of the data
- the potential negative impact of using stale data
HTTP Caching Options
Caching is a first-class concern of the REST architectural style and the HTTP protocol. Indeed, one of the main goals of HTTP/1.1 was to enhance the basic caching capabilities provided by HTTP/1.0 (see chapter 7 of Krishnamurthy and Rexford’s Web Protocols and Practice for an excellent discussion of the design goals of HTTP/1.1). At the risk of oversimplifying, for a given RESTful HTTP URI, you have three basic caching options:
- don’t use caching
- use validation-based caching
- use expiration-based caching
These options demonstrate the trade-offs between the need to avoid stale data and the performance benefits of using cached data. The ‘no caching’ option means that a client will always fetch the most recent data available from an origin server. This is useful in cases where the data is extremely volatile and using stale data may have dire consequences. For example, anytime you view a list of current auctions on eBay (e.g. for 19th Century Unused US Stamps), you’ll notice many anti-caching directives included in the HTTP response to ensure that you always see the most recent state of the various auctions. The downside of no caching is that every request is guaranteed to incur some cost in terms of client-perceived latency, server resources (e.g. CPU, memory), and network bandwidth.
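To make this concrete, here’s a minimal sketch of what the ‘no caching’ option can look like on the server side, using Python’s standard http.server module. The resource and its JSON payload are hypothetical stand-ins; the Cache-Control and Pragma directives are the standard anti-caching ones:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

class NoCacheHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = b'{"highest_bid": "12.50"}'  # hypothetical volatile data
        self.send_response(200)
        # Tell every cache (private or shared) not to store or reuse this response.
        self.send_header("Cache-Control", "no-store, no-cache, must-revalidate")
        self.send_header("Pragma", "no-cache")  # for old HTTP/1.0 caches
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("localhost", 8000), NoCacheHandler).serve_forever()
```

Every GET against this handler travels all the way to the origin server and streams the full payload back, which is exactly the behavior (and the cost) described above.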
Validation-based caching allows an HTTP response to include a logical ‘state identifier’ (such as an HTTP ETag or Last-Modified timestamp) which a client can then resend on subsequent requests for the same URI, potentially resulting in a short ‘not modified’ message from the server. Validation-based caching provides a useful trade-off between the need for fresh data and the goal to reduce consumption of network bandwidth and, to a lesser extent, server resources and client-perceived latency.
For example, imagine a web page that changes frequently but not on a regular schedule. This web page could use validation-based caching so that each time a client attempts to view the page, the request goes all the way back to the origin server but may result in either a full response (if the client either has an old version of the page or no cached version of the page) or a terse ‘not modified’ response (if the client has the most recent version of the page). All other things being equal, in the ‘not modified’ case the response will be smaller (since the server sends no document), the server will do less work (since it doesn’t have to stream the page bytes from disk or memory), and the client may observe a faster load time since the message is smaller and the user agent (e.g. the browser) may even have a cached rendering of the page. These are certainly superior non-functional characteristics to the ‘no caching’ case and we don’t have to worry about seeing stale data (assuming the client does the right thing). However, the server still did some work to determine that the client had the most recent resource, the client still experienced some latency waiting for the ‘not modified’ message, and we still used some network bandwidth to send the request and receive the (albeit short) response.
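Here’s a rough sketch of validation-based caching with an ETag, again using Python’s standard library. The document bytes and the MD5-based state identifier are assumptions for illustration; any cheap-to-compute identifier that changes whenever the document changes would do:

```python
from hashlib import md5
from http.server import BaseHTTPRequestHandler, HTTPServer

PAGE = b"<html><body>Hello, caching!</body></html>"  # hypothetical document

class ValidatingHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Derive a 'state identifier' from the current document bytes.
        etag = '"%s"' % md5(PAGE).hexdigest()
        # If the client sent the same identifier back, its cached copy is current.
        if self.headers.get("If-None-Match") == etag:
            self.send_response(304)  # Not Modified: no entity-body follows
            self.send_header("ETag", etag)
            self.end_headers()
            return
        # Otherwise send the full document along with its current identifier.
        self.send_response(200)
        self.send_header("ETag", etag)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(PAGE)))
        self.end_headers()
        self.wfile.write(PAGE)

if __name__ == "__main__":
    HTTPServer(("localhost", 8000), ValidatingHandler).serve_forever()
```

The same pattern works with a Last-Modified timestamp and the If-Modified-Since request header; the ETag variant is shown here because it doesn’t require the document to have a meaningful modification time.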
Expiration-based caching allows an origin server to associate an expiration timestamp with a particular document so that clients can simply assume that their cached copy is safe to use if it has not passed its expiration date. In other words, an origin server asserts that the document is ‘good’ or ‘good enough’ for a certain period of time. This sort of caching has fantastic performance characteristics but requires the designer to ensure either that:
- the data won’t become stale before the expiration period ends, or
- the impact of a client using stale data is negligible
An example of a resource that is well-suited for expiration-based caching is an image of a book cover on Amazon.com (e.g. the image of the cover of Steve Krug’s Don’t Make Me Think). While it’s possible that the book cover could change, it’s extremely unlikely, and since image files are relatively large, it would be wise for Amazon to set an expiration date so that clients load the image from their cache without even asking Amazon whether or not they have the most recent version. If somehow the cover of the book does change between when you cache your copy and when your cached copy expires, it’s not a big deal unless you base your purchasing decisions on book cover aesthetics.
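A minimal server-side sketch of this option follows; the image bytes and the one-year freshness lifetime are assumptions for illustration, not what Amazon actually does:

```python
import time
from email.utils import formatdate
from http.server import BaseHTTPRequestHandler, HTTPServer

COVER_IMAGE = b"...jpeg bytes..."  # stand-in for a real image payload

class ExpiringHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        one_year = 365 * 24 * 60 * 60  # freshness lifetime in seconds
        self.send_response(200)
        # HTTP/1.1 freshness lifetime; 'public' means shared caches
        # (e.g. proxies) may store the response too.
        self.send_header("Cache-Control", "public, max-age=%d" % one_year)
        # HTTP/1.0 equivalent, included for older caches.
        self.send_header("Expires", formatdate(time.time() + one_year, usegmt=True))
        self.send_header("Content-Type", "image/jpeg")
        self.send_header("Content-Length", str(len(COVER_IMAGE)))
        self.end_headers()
        self.wfile.write(COVER_IMAGE)

if __name__ == "__main__":
    HTTPServer(("localhost", 8000), ExpiringHandler).serve_forever()
```

Until the lifetime elapses, a well-behaved client never contacts this server again for the image; it serves its cached copy locally at essentially zero cost.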
Another performance benefit of expiration-based caching is that even in the case where a client doesn’t have a valid cached copy of a document, it’s possible that a network intermediary (e.g. a proxy server) does. In this case a client requests a particular URI and before the request reaches the origin server, an intermediary determines that it has a still-valid cached copy of the document and returns its copy immediately rather than forwarding the request to the next intermediary or the origin server. It should be clear from these examples that expiration-based caching results in significantly less user-perceived latency and consumes significantly less network bandwidth and server resources. The trick is that you have to guarantee either no staleness or feel confident that the risks involved in a client processing stale data are justified by the performance benefits. Note that it’s generally not possible to take advantage of intermediary caching over an HTTPS connection.
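For intuition, here’s a stripped-down version of the freshness test that any cache (a browser or an intermediary) performs before reusing a stored response. Real caches also account for the Age and Date response headers and other Cache-Control directives, which this sketch ignores:

```python
import re
import time

def is_fresh(cached_at, cache_control):
    """Return True if a cached response may be reused without revalidation.

    cached_at: Unix timestamp at which the response was stored.
    cache_control: the response's Cache-Control value, e.g. "public, max-age=3600".
    """
    match = re.search(r"max-age=(\d+)", cache_control or "")
    if match is None:
        return False  # no freshness lifetime declared; revalidate or refetch
    age = time.time() - cached_at
    return age < int(match.group(1))

# e.g. a copy stored 10 minutes ago with a one-hour lifetime is still fresh:
print(is_fresh(time.time() - 600, "public, max-age=3600"))  # True
```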
Summary
In this entry I’ve explained the basic rationale for why we cache things in distributed systems and given an overview of the three basic caching options in REST/HTTP-based systems. This information represents a bare-bones set of fundamental caching concepts, but you must understand these concepts thoroughly before being able to make informed caching design choices vis-à-vis your data model.
In future entries, I’ll build upon these foundational concepts to discuss caching design strategies for various scenarios.