Problem Statement
Design a rate limiter that controls the number of requests a client can send to an API within a given time window. The rate limiter should be distributed (works across multiple API servers), configurable per endpoint, and add minimal latency to request processing.
Requirements
Functional Requirements
- Limit requests per client (by API key, IP address, or user ID)
- Support different rate limits per API endpoint
- Return appropriate HTTP 429 (Too Many Requests) with retry-after header
- Support multiple rate limiting algorithms
- Allow rate limit configuration changes without restart
Non-Functional Requirements
- Latency: Add less than 5ms overhead per request
- Availability: Rate limiter failure should not block legitimate traffic (fail-open)
- Accuracy: Minimal over-counting or under-counting of requests
- Scale: Handle 1M+ requests per second across the fleet
Rate Limiting Algorithms
1. Token Bucket
Tokens are added to a bucket at a fixed rate. Each request consumes one token. If the bucket is empty, the request is rejected.
Pros: Smooth rate limiting, allows short bursts Cons: Two parameters to tune (bucket size, refill rate)
2. Fixed Window Counter
Count requests in fixed time windows (e.g., per minute). Reset count at the start of each window.
Pros: Simple to implement, memory efficient Cons: Burst at window boundaries — a client can send 2x the limit by timing requests at the end of one window and start of the next
3. Sliding Window Log
Store the timestamp of each request. Count requests within the sliding window by removing expired timestamps.
Pros: Accurate, no boundary issues Cons: Memory intensive — stores every request timestamp
4. Sliding Window Counter
Hybrid approach: combine the current window count with a weighted portion of the previous window count.
Pros: Memory efficient, reasonably accurate Cons: Approximate — not perfectly accurate but within acceptable bounds
Architecture
Components
- Rate Limiter Middleware: Intercepts every API request before it reaches the application
- Rules Engine: Stores rate limit configurations per endpoint and client tier
- Counter Store: Redis cluster storing request counts
- Monitoring: Tracks rate limit hits, false positives, and latency overhead
Request Flow
- Request arrives at API gateway
- Rate limiter middleware extracts client identifier (API key)
- Looks up rate limit rules for the endpoint + client tier
- Queries Redis for current request count
- If under limit: increment counter, forward request
- If over limit: return 429 with
Retry-Afterheader
Database Choice
Redis is the ideal choice for rate limiting:
- Speed: Sub-millisecond operations for increment and get
- Atomic operations:
INCRwithEXPIREprovides atomic counter increment with TTL - Distributed: Redis Cluster supports the distributed counter pattern
- Lua scripting: Complex rate limiting logic can run atomically on the server side
Redis Implementation (Token Bucket)
-- Lua script for atomic token bucket rate limiting
local key = KEYS[1]
local max_tokens = tonumber(ARGV[1])
local refill_rate = tonumber(ARGV[2])
local now = tonumber(ARGV[3])
local data = redis.call('HMGET', key, 'tokens', 'last_refill')
local tokens = tonumber(data[1]) or max_tokens
local last_refill = tonumber(data[2]) or now
-- Calculate tokens to add based on elapsed time
local elapsed = now - last_refill
local new_tokens = math.min(max_tokens, tokens + (elapsed * refill_rate))
if new_tokens >= 1 then
redis.call('HMSET', key, 'tokens', new_tokens - 1, 'last_refill', now)
redis.call('EXPIRE', key, math.ceil(max_tokens / refill_rate) * 2)
return 1 -- Request allowed
else
return 0 -- Request denied
end
Scaling Strategies
Local + Global Rate Limiting
Use a two-tier approach:
- Local: Each API server maintains an in-memory counter (fast, no network hop)
- Global: Periodically sync with Redis for cross-server accuracy
- Tradeoff: Slightly less accurate but significantly lower latency
Redis Cluster Partitioning
Partition rate limit keys across Redis nodes by client ID hash. This distributes the load and prevents any single Redis node from becoming a bottleneck.
Failure Handling
Redis Unavailable
Fail open: Allow all requests through. It's better to temporarily allow extra traffic than to block all users. Log the failure for alerting.
Clock Skew
Use Redis server time (TIME command) instead of application server time to avoid issues with clock skew across distributed servers.
Race Conditions
The Lua script approach eliminates race conditions by executing the check-and-increment atomically on the Redis server.
Tradeoffs
- Accuracy vs. Latency: Local counters are faster but less accurate; centralized Redis is accurate but adds network latency
- Memory vs. Precision: Sliding window log is perfectly accurate but uses O(n) memory per client; fixed window uses O(1) but has boundary issues
- Fail-open vs. Fail-closed: Fail-open risks abuse during outages; fail-closed risks blocking legitimate traffic
- Per-server vs. Global limits: Per-server limits are simpler but require knowing the fleet size; global limits handle auto-scaling correctly