How to design a scalable API rate limiter?

High level design and some algorithms which can be considered

Introduction

Hi everyone, I'm starting a series on designing software systems. This post covers the system design of an API rate limiter. There is always room to expand and make better architectural decisions; this is just my take, so please feel free to let me know how it can be improved too. Let's get started...

What is an API rate limiter?

There is always a need to control the rate of traffic sent by clients to a server, and the rate limiter is the component that takes care of that. For example, if a specific API, say "post a picture", is allowed only 5 times a minute, the rate limiter should control that specific traffic and block the exceeding calls.

Why an API rate limiter?

  • Control the load on the servers, thus giving more uptime.
  • Prevent DoS (denial of service) attacks.
  • Some APIs might be priced based on usage, so capping the rate of calls keeps the price predictable for the client and reduces cost for the provider, as the number of servers can be provisioned at a predictable rate.

Let's spearhead into the problem.

What are the requirements?

  • Allow only the API requests under the threshold.
  • Should be scalable.
  • Highly available.
  • Highly fault tolerant.
  • Should be very fast, as it must not slow down the HTTP response time.

Where should the rate limiter be placed?

We can implement the rate limiter either on the client side or the server side. Having it on the client side is not appropriate, as it can easily be bypassed and used maliciously. Having it within the app servers sounds like a good idea, but coupling is the most toxic thing in software design and should be avoided as much as possible. So we can place the rate limiter as a middleware between the client and the server.

What's the behavior of this middleware?

It receives the request from the client, and if the threshold for that API is not yet reached, the request is forwarded to the app server; otherwise it returns a 429 (Too Many Requests) response back to the client. This is essentially the component we call an API gateway, with just the rate limiter enabled.
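To make that behavior concrete, here is a minimal Python sketch of the middleware logic. The `Response` type, `forward_to_backend`, and the `limiter` object are illustrative stand-ins (any of the algorithm sketches later in this post could back `limiter.allow()`); a real gateway would also identify the caller, e.g. by IP address or API key.

```python
from dataclasses import dataclass

@dataclass
class Response:
    status: int
    body: str

def forward_to_backend(request) -> Response:
    # Stand-in for the call to the actual app servers.
    return Response(status=200, body="OK")

def handle_request(request, limiter) -> Response:
    # Forward the request while it's under the threshold,
    # otherwise reply with 429 (Too Many Requests).
    if limiter.allow():
        return forward_to_backend(request)
    return Response(status=429, body="Too Many Requests")
```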

So what should be my rate limiter algorithm?

There are different rate limiting algorithms, each with its own pros and cons. Five popular ones are:

  1. Token bucket
  2. Leaking bucket
  3. Fixed window counter
  4. Sliding window log
  5. Sliding window counter

We'll go through each of the algorithms, and you can decide which one fits your use case.

Token bucket algorithm

  • It consists of a container, which we call a bucket here, with a maximum size n. It is refilled periodically with tokens at a refill rate r, for example 5 tokens per second. Even while refilling, the number of tokens in the bucket can't exceed n.
  • When a request comes in, it consumes a token if one is available, and the request is allowed through to the backend servers. If no token is available, the request is throttled and dropped.

The algorithm takes 2 parameters

  1. Bucket size
  2. Refill rate

The number of buckets needed really depends on the use case. For example,

  • If throttled based on IP, then each IP address can have a bucket.
  • If globally throttled for a system, then 1 bucket is enough.

The algorithm is easy to implement, and it also allows bursts of traffic for short periods, depending on the number of tokens left at the moment. According to the AWS documentation, Amazon API Gateway uses this same algorithm for throttling.
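As a rough sketch, here is a minimal, single-process Python version of the algorithm (the class and method names are mine; a production limiter would keep the token counts in a shared store such as Redis and update them atomically):

```python
import time

class TokenBucket:
    """Bucket of capacity n, refilled at refill_rate tokens per second."""

    def __init__(self, capacity: int, refill_rate: float):
        self.capacity = capacity        # maximum tokens the bucket can hold (n)
        self.refill_rate = refill_rate  # tokens added per second (r)
        self.tokens = float(capacity)   # the bucket starts full
        self.last_refill = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill lazily based on elapsed time, never exceeding the capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last_refill) * self.refill_rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1  # consume a token and let the request through
            return True
        return False          # no tokens left: throttle and drop the request

# "post a picture" allowed 5 times a minute: bucket of 5, refilled at 5/60 per second
bucket = TokenBucket(capacity=5, refill_rate=5 / 60)
print([bucket.allow() for _ in range(6)])  # first 5 True, the 6th throttled
```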

Leaking bucket algorithm

  • It's similar to the token bucket algorithm but works in a somewhat reversed manner.
  • Consider the bucket to be a queue with a fixed maximum size, empty at the beginning.
  • When a request comes in, it is pushed into the queue if the queue has space; otherwise the request is discarded.
  • The requests in the queue are processed periodically, and the respective entry is removed from the queue, making space for more requests.

The algorithm takes 2 parameters

  1. Bucket / queue size.
  2. Outflow rate - the rate at which the requests are processed and sent to the backend.

It always provides a stable processing rate, since the outflow rate is defined by us and doesn't depend on bursts of requests. The flip side is that a burst of requests has to wait in the queue, depending on the outflow rate, which may cause the most recent requests to be throttled.
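A minimal Python sketch of the idea (illustrative names; the background thread stands in for the fixed-rate worker that drains the queue to the backend):

```python
import time
import threading
from collections import deque

class LeakingBucket:
    """Queue of fixed max size, drained at outflow_rate requests per second."""

    def __init__(self, capacity: int, outflow_rate: float):
        self.capacity = capacity
        self.queue = deque()  # the "bucket", empty at the beginning
        worker = threading.Thread(target=self._drain, args=(outflow_rate,), daemon=True)
        worker.start()

    def allow(self, request) -> bool:
        if len(self.queue) < self.capacity:
            self.queue.append(request)  # wait in line to be processed
            return True
        return False                    # queue is full: discard the request

    def _drain(self, rate: float) -> None:
        while True:
            time.sleep(1 / rate)        # fixed outflow rate, regardless of bursts
            if self.queue:
                request = self.queue.popleft()
                # ... forward `request` to the backend here ...
```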

Fixed window counter algorithm

  • Here the timeline is divided into fixed-size time windows, each with a counter assigned to it.
  • Whenever a request comes in, the defined threshold and the current counter are checked.
  • If the current counter is less than the threshold, the counter is incremented by 1; if not, the request is dropped.

Here there is an obvious problem: when a burst of traffic comes in at the edge of a time window, a burst of load can reach the backend. Example: consider the time windows 0:00:00 - 0:05:00 and 0:05:00 - 0:10:00. Suppose the rule is to allow 10 requests per time window, and there is a spike of 20 requests around the end of the first window and the beginning of the second window; all 20 can be allowed, putting an unusual load on the backend.

But if the backend can tolerate this, then it is okay to use this algorithm for rate limiting.
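A minimal in-memory Python sketch (window ids are derived from the wall clock; old counters are kept around here for simplicity, but a real implementation would expire them):

```python
import time
from collections import defaultdict

class FixedWindowCounter:
    """At most `threshold` requests per fixed window of `window_seconds`."""

    def __init__(self, threshold: int, window_seconds: int):
        self.threshold = threshold
        self.window_seconds = window_seconds
        self.counters = defaultdict(int)  # window id -> requests seen in it

    def allow(self) -> bool:
        window = int(time.time()) // self.window_seconds  # current window id
        if self.counters[window] < self.threshold:
            self.counters[window] += 1
            return True
        return False  # threshold for this window already reached: drop
```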

Sliding window log algorithm

This algorithm fixes the problem that the fixed window counter brings.

  • Here, it keeps track of the request timestamps in a log.
  • Whenever a new request comes in, all the outdated timestamps are removed from the log.
  • The new request then marks a new timestamp in the log.
  • If the log size is within the threshold, the request is allowed. If not, the request is dropped.

Here the accuracy of rate limiting is great, but it consumes more memory than the other algorithms, as every timestamp is stored.
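A minimal Python sketch of the log-based variant described above (note that, as described, every request marks the log, even the dropped ones):

```python
import time
from collections import deque

class SlidingWindowLog:
    """At most `threshold` requests in any rolling `window_seconds` span."""

    def __init__(self, threshold: int, window_seconds: float):
        self.threshold = threshold
        self.window_seconds = window_seconds
        self.log = deque()  # timestamps of recent requests

    def allow(self) -> bool:
        now = time.monotonic()
        # Remove timestamps that have fallen out of the rolling window.
        while self.log and self.log[0] <= now - self.window_seconds:
            self.log.popleft()
        self.log.append(now)  # the new request marks its timestamp
        return len(self.log) <= self.threshold  # allowed only within threshold
```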

Sliding window counter algorithm

This algorithm is a combination of the sliding window log and the fixed window counter algorithms. It is best explained with an example:

  • Consider that a maximum of 7 requests per minute is allowed, and there were 4 requests in the previous minute and 3 so far in the current minute.
  • A new request comes in at around 30% of the current minute, say around the 18th second, so the rolling one-minute window still overlaps 70% of the previous minute.
  • Apply the formula: requests_in_current_window + (requests_in_prev_window * overlap percentage with the previous minute).
  • It gives 3 + 4 * 0.7 = 5.8, and since the threshold is 7 requests, this request can be allowed.

This handles spikes in traffic pretty well, and it is a little more lenient than the other algorithms, since it assumes the previous window's requests were evenly distributed.
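The same example, as a minimal Python sketch that keeps only two counters (current and previous window) instead of a full log:

```python
import time

class SlidingWindowCounter:
    """Approximates a rolling window from two fixed-window counters."""

    def __init__(self, threshold: int, window_seconds: int = 60):
        self.threshold = threshold
        self.window_seconds = window_seconds
        self.current_window = int(time.time()) // window_seconds
        self.current_count = 0
        self.previous_count = 0

    def allow(self) -> bool:
        now = time.time()
        window = int(now) // self.window_seconds
        if window != self.current_window:
            # Roll forward: the old "current" becomes "previous" only if the
            # two windows are adjacent; anything older counts as zero.
            self.previous_count = (self.current_count
                                   if window == self.current_window + 1 else 0)
            self.current_count = 0
            self.current_window = window
        # Fraction of the previous window the rolling window still overlaps,
        # e.g. 0.7 at the 18th second of the current minute.
        overlap = 1 - (now % self.window_seconds) / self.window_seconds
        estimated = self.current_count + self.previous_count * overlap
        if estimated < self.threshold:  # e.g. 3 + 4 * 0.7 = 5.8 < 7 -> allow
            self.current_count += 1
            return True
        return False
```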

Thus, using one of these algorithms should help with API rate limiting. The high-level design and some detail on the different components of the system will be covered in the next part. Stay tuned! Feel free to let me know if any trade-offs can be improved too.

Continued here

Reach out to me on Twitter or LinkedIn.
