How to design a scalable API rate limiter

In the previous article, we looked at the different algorithms that can be used for rate limiting. Here we'll go through the really juicy part: the architecture of a system that can provide a reliable service. There is always room for improvement :)

Basic high-level architecture

The idea common to all the algorithms is simple: we keep a counter of how many requests come in from the same IP address or the same user, and compare that counter against a limit. If the counter exceeds the limit, the request is disallowed.

This counter will be read and updated at a very high rate, so storing it in a disk-based database is not a good idea. An in-memory store such as Redis or Tarantool can be used instead.

Working

The client makes a request to the service.
It lands on the rate limiting middleware, which we saw in the previous post.
The middleware communicates with the cache.
It checks whether the limit for that particular request has been reached. If it has, the request is rejected; otherwise the counter in the cache is incremented according to the rule and the request is forwarded to the app servers.
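
The flow above can be sketched in a few lines of Python. This is only a minimal sketch, assuming a fixed-window counter and using a plain dict in place of the cache; the names `FixedWindowLimiter`, `allow` and `clock` are made up for illustration and a real setup would keep the counters in Redis or a similar store.

```python
import time


class FixedWindowLimiter:
    """Counts requests per client in fixed time windows.

    A dict stands in for the in-memory cache (an assumption for this sketch).
    Old windows are never evicted here; a real cache would expire them.
    """

    def __init__(self, limit, window_seconds, clock=time.time):
        self.limit = limit
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so the sketch can be tested
        self.counters = {}  # (client_id, window_index) -> request count

    def allow(self, client_id):
        # Every request in the same window shares one counter key.
        window_index = int(self.clock() // self.window_seconds)
        key = (client_id, window_index)
        count = self.counters.get(key, 0)
        if count >= self.limit:
            return False  # limit reached: reject the request
        self.counters[key] = count + 1  # under the limit: count it and forward
        return True
```

With `limit=2` and a 60 second window, the third call to `allow` for the same client inside one window returns `False`, while a different client is still let through.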

How it works?

Ok, so this will work for all the basic cases. But it leaves a few open questions: what happens to the requests that are throttled, and where are the rate limiting rules stored (for example, that one type of request is allowed 10 times a minute while another is allowed only 5 times a minute)?

What happens when the rate limit is reached?

There are two ways of handling it, depending on the use case:

Send a 429 (Too Many Requests) response to the client
Queue the requests to be processed later

How can a client be informed that it's rate limited?

The simple answer is HTTP response headers. We can send headers such as the following:

X-Ratelimit-Remaining -> The number of allowed requests remaining in the current window
X-Ratelimit-Limit -> The number of requests allowed per window
X-Ratelimit-Retry-After -> How long the client should wait before sending a request that can succeed
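
As a rough illustration, the headers could be built like this. It is a sketch under a few assumptions that are not from the article: the helper name `rate_limit_headers` and its parameters are hypothetical, the wait time is expressed in whole seconds, and the retry header is only sent once nothing is left in the window.

```python
import math


def rate_limit_headers(limit, used, window_reset_at, now):
    """Build the three headers for one client in the current window.

    window_reset_at and now are timestamps in seconds (an assumption).
    """
    remaining = max(0, limit - used)
    headers = {
        "X-Ratelimit-Limit": str(limit),
        "X-Ratelimit-Remaining": str(remaining),
    }
    if remaining == 0:
        # Round up so the client never retries just before the window resets.
        wait = max(0, math.ceil(window_reset_at - now))
        headers["X-Ratelimit-Retry-After"] = str(wait)
    return headers
```

These would be attached to the 429 response, and the first two can just as well be sent on successful responses so that clients can slow down before they hit the limit.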

How does the design work?

The client makes a request to the service, and the request reaches the middleware first.
The middleware keeps a cache of the rules to apply to the different types of requests it receives.
The middleware checks the counter in the cache against the rule that applies to the request and decides whether it should be forwarded or rate limited.
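
Here is a small sketch of how the cached rules and the decision could look, reusing the earlier example of one request type allowed 10 times a minute and another allowed 5 times a minute. The request type names (`search`, `login`), the `Rule` class, the default rule and the `decide` function are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    limit: int           # requests allowed per window
    window_seconds: int  # length of the window


# Hypothetical rule table, as the middleware might hold it in its rules cache.
RULES = {
    "search": Rule(limit=10, window_seconds=60),
    "login": Rule(limit=5, window_seconds=60),
}

# Assumption: request types without a rule of their own fall back to this one.
DEFAULT_RULE = Rule(limit=100, window_seconds=60)


def decide(request_type, current_count, rules=RULES):
    """Compare the counter read from the cache against the matching rule."""
    rule = rules.get(request_type, DEFAULT_RULE)
    return "forward" if current_count < rule.limit else "rate_limited"
```

Keeping the rules separate from the counters means a limit can be changed without touching the counting logic.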

So now we have a rate limiter that works in any single-server environment. The real juice is in building it for a distributed environment.

Scaling in distributed environment

There are two main challenges that we might come across:

Race condition

In a highly concurrent system, a race condition is possible when reading the counter from the cache: two requests can both read the same value, both see that it is under the limit, and both write back the same incremented value, so one request goes uncounted. One solution is to use locks, but this can slow the system down. Sorted sets in Redis, updated inside a MULTI transaction, can help achieve atomicity and so avoid the race condition. There are other ways of achieving this too, such as different kinds of locks.
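
To make the idea concrete, here is a sketch of the lock-based variant inside a single process, using Python's `threading.Lock`. It is not the Redis approach; it only shows why the check and the increment have to happen as one atomic step. The class and function names are made up for illustration.

```python
import threading


class AtomicCounterLimiter:
    """Check-and-increment guarded by a lock (single-process sketch)."""

    def __init__(self, limit):
        self.limit = limit
        self.count = 0
        self._lock = threading.Lock()

    def allow(self):
        # Read, compare and increment happen under one lock, so no other
        # thread can slip in between the check and the update.
        with self._lock:
            if self.count >= self.limit:
                return False
            self.count += 1
            return True


def run_concurrently(limiter, attempts):
    """Fire `attempts` requests from separate threads and collect the results."""
    results = []
    results_lock = threading.Lock()

    def worker():
        allowed = limiter.allow()
        with results_lock:
            results.append(allowed)

    threads = [threading.Thread(target=worker) for _ in range(attempts)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results
```

With a limit of 10 and 50 concurrent attempts, exactly 10 are allowed. Without the lock, two threads could both pass the check before either one increments the counter.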

Synchronization of the middleware if there are many

A single middleware instance will be enough up to a certain load, but it won't be enough in the long run as more load comes in, so the middleware needs to be scaled horizontally. Requests can then be load balanced across the middleware instances. Sticky sessions can also be used, but they don't solve the problem of keeping the same counter data in every instance. Instead, the cache can be placed outside the middleware and shared by all the middleware instances. This solves the synchronization problem.
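
The difference between per-instance counters and a shared cache can be sketched like this. `SharedCounterStore` is a stand-in for the external cache and `Middleware` for one middleware instance; both names are hypothetical, and the sketch is single-threaded, so it ignores the race condition discussed above.

```python
class SharedCounterStore:
    """Stands in for an external cache shared by all middleware instances."""

    def __init__(self):
        self.counts = {}

    def increment(self, key):
        self.counts[key] = self.counts.get(key, 0) + 1
        return self.counts[key]


class Middleware:
    """One middleware instance; it keeps no counter state of its own."""

    def __init__(self, store, limit):
        self.store = store
        self.limit = limit

    def allow(self, client_id):
        # Increment first, then compare, so the store holds the only copy
        # of the count. Rejected requests are counted too in this sketch.
        return self.store.increment(client_id) <= self.limit


# Two instances behind a load balancer, both pointing at the same store.
store = SharedCounterStore()
instance_a = Middleware(store, limit=3)
instance_b = Middleware(store, limit=3)
```

If the same client alternates between `instance_a` and `instance_b`, the fourth request is rejected, because both instances see the same counter. If each instance had its own store, the client would get 3 requests per instance.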

So this was an overview of, and a little bit of a deep dive into, designing the system. It can be improved and tweaked further to optimize it and squeeze the best out of it.

This is rate limiting at the HTTP layer only; such a system can also be built at L3 (the IP layer) and so on.

One of the nicest open source rate limiters is Envoy proxy. Check it out if needed.

Have a great day :)

Reach out to me on Twitter or LinkedIn.

How to design a scalable API rate limiter - Part II

Dive into the architecture of the system

