This is a reference guide for caching in distributed systems. Well-designed caching at scale enables faster response times for API endpoints and improves the experience for end users and customers.
What are Cache Layers?
Caches improve performance and reduce latency by trading space for time. The simplest example is recomputing a value in a recursive function such as the Fibonacci sequence: instead of calculating the previous results over again, they are stored in a hash table or dictionary. Caching a function's results this way is called memoization. For a website where the content does not change frequently, the server can cache the content in memory and serve it again, while the client (the web browser) can also cache the content and avoid making a request back to the server at all.
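As a minimal sketch in Python, a dictionary can act as the cache for the Fibonacci example above (the names here are illustrative):

# memoized Fibonacci: previously computed values are stored in a dictionary
_memo = {}

def fib(n):
    if n in _memo:
        return _memo[n]                      # cache hit: skip the recursion
    result = n if n < 2 else fib(n - 1) + fib(n - 2)
    _memo[n] = result                        # cache miss: store for next time
    return result

print(fib(50))                               # fast, thanks to the stored intermediate values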
Layered Mental Model
The cache layers model gives us options for where to cache. There are three layers: the in-memory/in-process cache, the sidecar/local node cache, and the distributed cache.
The in-memory cache is the first layer, and for many non-distributed programs it is the only layer. More intensive programs will have two layers: in-memory and sidecar/local cache. These programs are not distributed, but they keep a persistent cache across multiple runs of the program; it can be SQLite, Redis, or a file with a custom format. Distributed applications add a third layer, the distributed cache.
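The lookup order across the layers can be sketched as follows. This is illustrative only and assumes the redis-py client, a node-local Redis sidecar on localhost, and a shared cache at a hypothetical host cache.internal; on a miss at one layer, the value is fetched from the next layer and backfilled.

import redis

local_memory = {}                                                              # layer 1: in-process dict
sidecar = redis.Redis(host="localhost", port=6379, decode_responses=True)      # layer 2: node-local sidecar (assumed)
shared = redis.Redis(host="cache.internal", port=6379, decode_responses=True)  # layer 3: distributed cache (assumed)

def get(key: str, compute) -> str:
    if key in local_memory:                  # layer 1 hit
        return local_memory[key]
    value = sidecar.get(key)                 # layer 2
    if value is None:
        value = shared.get(key)              # layer 3
        if value is None:
            value = compute()                # full miss: do the expensive work
            shared.set(key, value, ex=3600)  # keep it in the shared cache for an hour
        sidecar.set(key, value, ex=300)      # backfill the sidecar with a shorter TTL
    local_memory[key] = value                # backfill the in-process layer
    return value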
Latency Budgets
When a user is accessing a part of the system, they expect a fast response. When they click a button, they expect something to happen within milliseconds or seconds, depending on the application and the action. Latency budgets can be set for core actions in the system to ensure that those expectations are met.
Let’s understand through an example. A user is filtering, sorting, and searching through a table of data. There are 1 million items in the table, and only 100 items are displayed on the page. The latency budget for sorting is < 0.5 seconds, since the data is unchanged and only the sort order changes based on a selected column. The latency budget for filtering is slightly higher, < 1 second, and can scale with the number of items based on the user’s perception, while the latency budget for searching could be < 5 seconds. Consider another dimension: the newness of the item. If users frequently search for new items, the latency budget for those new items could be as low as < 0.5 seconds, while for older items users may accept < 10 seconds, a lot more room. Expectation and perception depend on multiple factors, including a user’s past usage of similar systems, their usage of the current system, and the system’s user interface.
When setting a latency budget, you can start with an observed baseline, add a percentage to set the budget, and subtract a percentage to set a goal.
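For example, with hypothetical numbers: if the observed median latency for sorting is 400 ms, adding 25% gives a budget of 500 ms, while subtracting 25% sets an internal goal of 300 ms that leaves headroom before the budget is at risk.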
Caching With Layers for a Web Service
In a real-world production web service, caching can be used in many ways for a variety of reasons. More complex web applications have not only a backend API server but also background workers. For example, there can be very expensive database queries that are run frequently, or logic and calculations that do not change often.
L1 In-Process Cache
An incoming web API request can require multiple expensive database queries in order to return a response. The memory within a web server process can be used as a cache. For example, in Python, the functools.lru_cache decorator (or the @cached decorator from the cachetools library) creates an in-memory mapping between the parameters of the function it wraps and the resulting value. This can be used at different points along the code path; if another API endpoint also uses those functions, the cache is reused. We can expand on layer one caching by caching more parts of the request, or by caching the whole request/response. When /items/123 is requested repeatedly, the request and response can be cached in memory.
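A minimal sketch of a layer 1 cache using the standard library's functools.lru_cache; the function name and the simulated database work are hypothetical:

import time
from functools import lru_cache

@lru_cache(maxsize=1024)            # per-process cache keyed by the function arguments
def get_item_summary(item_id: int) -> dict:
    time.sleep(0.5)                 # stand-in for several expensive database queries
    return {"id": item_id, "name": f"item-{item_id}"}

get_item_summary(123)               # slow: computes the result and stores it
get_item_summary(123)               # fast: served from the in-process cache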
The limitation of layer 1 in-process/in-memory caching on its own is that it only works for one process. If the web server runs 4 processes, the typical setup does not use sticky routing, which means a user request may go to any of the 4 processes and suffer a cache miss. As more requests come in, the likelihood that all 4 process caches have the value increases. One way to work around this limitation is to pre-warm the cache, though this is very dependent on user request patterns and may require a lot of memory.
The other limitation of the in-process memory cache is that it increases the memory required for each process. If the web server initially requires 1GB, it could double or more in order to have enough in-memory cache capacity. A TTL (expiration time) on cache keys is required when using this type of caching, and the in-memory cache competes with the rest of the program for memory.
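One way to bound both memory and staleness, assuming the third-party cachetools library is available, is a size-limited cache with a TTL; the sizes and the function here are illustrative:

from cachetools import TTLCache, cached

# cap the cache at 10,000 entries and expire each entry after 5 minutes,
# so the in-process cache cannot grow without limit or serve stale data forever
item_cache = TTLCache(maxsize=10_000, ttl=300)

@cached(item_cache)
def get_item(item_id: int) -> dict:
    ...                              # expensive lookup goes here
    return {"id": item_id}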
L2 Sidecar / Local Node Cache
The second layer of caching requires running Redis (or Valkey, or another cache) as a sidecar or as a shared service.
Continuing with the example, the web server runs 4 processes and we run only 1 web server. However, we are scaling up to 3 web servers for a total of 12 processes. With a sidecar, each web server has its own cache shared by its 4 processes. Another option is to run the cache as a daemon service on each compute node. When scaling vertically, additional web servers will make use of the cache already existing on the compute node. When scaling horizontally, new nodes will start with a fresh cache, so initial requests on the new node will be slower.
L3 Distributed Cache (Redis, Memcached)
Using a separate cache server such as Redis or Memcached is the third layer of caching. The advantage is that the cache can persist between deployments of services and compute nodes; it decouples the cache from the rest of the services.
In the example, we have multiple nodes running multiple web servers with many processes. With this third layer, the first request to a web server can cache the request and response, and it can be reused whether we scale up to 1,000 web server processes or down to 1 server.
The third layer of caching can be applied to storing serialized objects from database queries, storing HTTP responses, and more. Pre-warming the cache on deployments becomes more valuable, especially if past analytics data shows common access patterns from particular customers.
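A sketch of storing a serialized database result in the shared cache, assuming the redis-py client and a hypothetical shared host cache.internal; the key and query are placeholders:

import json
import redis

# shared cache reachable by every web server process (hostname is illustrative)
cache = redis.Redis(host="cache.internal", port=6379, decode_responses=True)

def cached_query(key: str, run_query, ttl: int = 600):
    hit = cache.get(key)
    if hit is not None:
        return json.loads(hit)                # deserialize the cached rows
    rows = run_query()                        # e.g. an expensive database query
    cache.set(key, json.dumps(rows), ex=ttl)  # store the serialized rows with a TTL
    return rows

The same pattern works for whole HTTP responses: the cache key becomes the request path and query string, and the stored value is the serialized response body.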