Designing a Rate Limiter — System Design

Problem Statement

Design a rate limiter that controls the number of requests a client can send to an API within a given time window. The rate limiter should be distributed (works across multiple API servers), configurable per endpoint, and add minimal latency to request processing.

Requirements

Functional Requirements

Limit requests per client (by API key, IP address, or user ID)
Support different rate limits per API endpoint
Return appropriate HTTP 429 (Too Many Requests) with retry-after header
Support multiple rate limiting algorithms
Allow rate limit configuration changes without restart

Non-Functional Requirements

Latency: Add less than 5ms overhead per request
Availability: Rate limiter failure should not block legitimate traffic (fail-open)
Accuracy: Minimal over-counting or under-counting of requests
Scale: Handle 1M+ requests per second across the fleet

Rate Limiting Algorithms

1. Token Bucket

Tokens are added to a bucket at a fixed rate. Each request consumes one token. If the bucket is empty, the request is rejected.

Pros: Smooth rate limiting, allows short bursts Cons: Two parameters to tune (bucket size, refill rate)

2. Fixed Window Counter

Count requests in fixed time windows (e.g., per minute). Reset count at the start of each window.

Pros: Simple to implement, memory efficient Cons: Burst at window boundaries — a client can send 2x the limit by timing requests at the end of one window and start of the next

3. Sliding Window Log

Store the timestamp of each request. Count requests within the sliding window by removing expired timestamps.

Pros: Accurate, no boundary issues Cons: Memory intensive — stores every request timestamp

4. Sliding Window Counter

Hybrid approach: combine the current window count with a weighted portion of the previous window count.

Pros: Memory efficient, reasonably accurate Cons: Approximate — not perfectly accurate but within acceptable bounds

Architecture

Components

Rate Limiter Middleware: Intercepts every API request before it reaches the application
Rules Engine: Stores rate limit configurations per endpoint and client tier
Counter Store: Redis cluster storing request counts
Monitoring: Tracks rate limit hits, false positives, and latency overhead

Request Flow

Request arrives at API gateway
Rate limiter middleware extracts client identifier (API key)
Looks up rate limit rules for the endpoint + client tier
Queries Redis for current request count
If under limit: increment counter, forward request
If over limit: return 429 with Retry-After header

Database Choice

Redis is the ideal choice for rate limiting:

Speed: Sub-millisecond operations for increment and get
Atomic operations: INCR with EXPIRE provides atomic counter increment with TTL
Distributed: Redis Cluster supports the distributed counter pattern
Lua scripting: Complex rate limiting logic can run atomically on the server side

Redis Implementation (Token Bucket)

-- Lua script for atomic token bucket rate limiting
local key = KEYS[1]
local max_tokens = tonumber(ARGV[1])
local refill_rate = tonumber(ARGV[2])
local now = tonumber(ARGV[3])

local data = redis.call('HMGET', key, 'tokens', 'last_refill')
local tokens = tonumber(data[1]) or max_tokens
local last_refill = tonumber(data[2]) or now

-- Calculate tokens to add based on elapsed time
local elapsed = now - last_refill
local new_tokens = math.min(max_tokens, tokens + (elapsed * refill_rate))

if new_tokens >= 1 then
    redis.call('HMSET', key, 'tokens', new_tokens - 1, 'last_refill', now)
    redis.call('EXPIRE', key, math.ceil(max_tokens / refill_rate) * 2)
    return 1  -- Request allowed
else
    return 0  -- Request denied
end

Scaling Strategies

Local + Global Rate Limiting

Use a two-tier approach:

Local: Each API server maintains an in-memory counter (fast, no network hop)
Global: Periodically sync with Redis for cross-server accuracy
Tradeoff: Slightly less accurate but significantly lower latency

Redis Cluster Partitioning

Partition rate limit keys across Redis nodes by client ID hash. This distributes the load and prevents any single Redis node from becoming a bottleneck.

Failure Handling

Redis Unavailable

Fail open: Allow all requests through. It's better to temporarily allow extra traffic than to block all users. Log the failure for alerting.

Clock Skew

Use Redis server time (TIME command) instead of application server time to avoid issues with clock skew across distributed servers.

Race Conditions

The Lua script approach eliminates race conditions by executing the check-and-increment atomically on the Redis server.

Tradeoffs

Accuracy vs. Latency: Local counters are faster but less accurate; centralized Redis is accurate but adds network latency
Memory vs. Precision: Sliding window log is perfectly accurate but uses O(n) memory per client; fixed window uses O(1) but has boundary issues
Fail-open vs. Fail-closed: Fail-open risks abuse during outages; fail-closed risks blocking legitimate traffic
Per-server vs. Global limits: Per-server limits are simpler but require knowing the fleet size; global limits handle auto-scaling correctly