The token bucket models a bucket refilled with tokens at a steady rate up to a maximum capacity. Each request removes one token; if the bucket is empty, the request is rejected or delayed. Its key property is that it allows bursts: a client that has been idle accumulates tokens and can briefly exceed the steady rate, up to the bucket size. This flexibility — smooth average rate plus tolerated bursts — makes token bucket the most popular API rate-limiting algorithm.
2 / 5
How does the leaky bucket algorithm differ from token bucket?
The leaky bucket models requests entering a bucket that drains (is processed) at a constant rate, like water leaking from a hole. Incoming bursts fill the bucket, but the output is perfectly smooth; if the bucket overflows, requests are dropped. The contrast with token bucket: leaky bucket enforces a steady outflow (good for protecting a downstream system that needs even load), while token bucket permits bursts (good for user-facing APIs where occasional spikes are fine). The choice depends on whether you want to smooth traffic or merely cap the average.
3 / 5
What is the weakness of the fixed window counter, and how does the sliding window fix it?
A fixed window counter resets the count at fixed intervals (e.g. each minute). Its flaw is the boundary burst: a client can send the full limit at 11:59:59 and the full limit again at 12:00:00 — double the intended rate across that boundary. The sliding window approach (log or weighted) considers a rolling time range rather than discrete buckets, so the limit is enforced continuously and the boundary burst is eliminated. Sliding window log is precise but memory-heavy; sliding window counter approximates it cheaply by weighting the previous window.
4 / 5
In a distributed system, why is rate limiting harder than on a single server, and what is a common solution?
With multiple application servers behind a load balancer, a local per-server counter is wrong: if the limit is 100/min and you have 5 servers each enforcing 100, the real limit becomes 500. The standard fix is a shared, centralized counter — typically Redis — that all nodes increment atomically (e.g. via INCR with expiry, or a Lua script for sliding windows). This gives a single source of truth at the cost of a network round-trip per request. Alternatives include rate limiting at the API gateway/load balancer layer, or approximate local limits that sync periodically.
5 / 5
What HTTP status code and header should a well-behaved API return when a client is rate limited?
The correct response is HTTP 429 Too Many Requests, which explicitly signals the client has exceeded its rate limit. A well-behaved API also includes a Retry-After header (seconds or a date) so the client knows when to retry, plus often X-RateLimit-Limit, X-RateLimit-Remaining, and X-RateLimit-Reset headers so clients can self-throttle proactively. Returning 429 with guidance lets well-built clients back off gracefully (ideally with exponential backoff and jitter) rather than hammering the API and making the overload worse.