Backend Systems · Featured

Amazon Work Allocation System

Built a scalable work allocation engine for Amazon's Product Safety & Compliance team, distributing compliance review tasks across global teams with intelligent routing and priority management.

June 10, 2023 · 3 min read
Java · AWS · DynamoDB · SQS · Lambda · React · REST APIs

Overview

The Work Allocation System is the backbone of Amazon's Product Safety & Compliance operations. When products are flagged for potential safety concerns — through customer reports, automated scans, or regulatory alerts — review tasks need to be routed to the right compliance specialists based on product category, jurisdiction, reviewer expertise, and current workload. This system handles millions of allocation decisions per month across multiple global regions.

Problem

Prior to this system, work allocation was largely manual. Team leads reviewed incoming tasks each morning and assigned them to reviewers by hand. This led to uneven workload distribution, missed SLA targets for time-sensitive safety reviews, and poor utilization of specialist knowledge. A product recalled in the EU might sit unreviewed because the European regulatory specialist was overloaded while US-focused reviewers had capacity.

Architecture

The system uses an event-driven architecture on AWS:

  • Ingestion Layer: SQS queues receive incoming review tasks from upstream detection systems
  • Routing Engine: A rules-based engine with priority scoring evaluates tasks against reviewer profiles, current workload, and SLA requirements
  • Assignment Service: Atomic task assignment with optimistic locking in DynamoDB to prevent double-assignment
  • Dashboard: React-based UI for team leads to monitor allocation metrics and override assignments
  • Feedback Loop: Reviewer completion data feeds back into the routing engine to improve future allocations
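To make the optimistic-locking idea in the Assignment Service concrete, here is a minimal sketch. It is not the production code: it uses an in-memory map instead of DynamoDB so it stays self-contained, and every name in it (OptimisticAssignment, TaskRecord, tryAssign) is a hypothetical chosen for illustration. In DynamoDB, the same version check would typically be expressed as a conditional write.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

/** In-memory stand-in for a table that supports version-checked (conditional) writes. */
final class OptimisticAssignment {

    /** Immutable snapshot of a task row; version increases on every successful write. */
    static final class TaskRecord {
        final String taskId;
        final String assignee; // null while the task is unassigned
        final long version;

        TaskRecord(String taskId, String assignee, long version) {
            this.taskId = taskId;
            this.assignee = assignee;
            this.version = version;
        }
    }

    private final Map<String, TaskRecord> table = new ConcurrentHashMap<>();

    void putNewTask(String taskId) {
        table.put(taskId, new TaskRecord(taskId, null, 0L));
    }

    Optional<TaskRecord> get(String taskId) {
        return Optional.ofNullable(table.get(taskId));
    }

    /**
     * Assigns the task only if it is still unassigned and its stored version
     * equals expectedVersion. Returns false when another writer got there first.
     */
    boolean tryAssign(String taskId, String reviewerId, long expectedVersion) {
        TaskRecord current = table.get(taskId);
        if (current == null || current.assignee != null || current.version != expectedVersion) {
            return false;
        }
        TaskRecord updated = new TaskRecord(taskId, reviewerId, expectedVersion + 1);
        // replace(key, old, new) is atomic; TaskRecord does not override equals,
        // so the swap succeeds only if the row is still the exact snapshot we read.
        return table.replace(taskId, current, updated);
    }

    public static void main(String[] args) {
        OptimisticAssignment store = new OptimisticAssignment();
        store.putNewTask("task-1");
        long seen = store.get("task-1").get().version;
        System.out.println(store.tryAssign("task-1", "reviewer-a", seen)); // first writer wins
        System.out.println(store.tryAssign("task-1", "reviewer-b", seen)); // stale version is rejected
    }
}
```

A caller that loses the race would normally re-read the task and either give up (it is already assigned) or retry with the new version.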

Tradeoffs

  • Rules-based routing over ML-based: We chose explicit, auditable rules over ML models because regulatory compliance requires explainability — stakeholders need to know exactly why a task was assigned to a specific reviewer
  • DynamoDB over RDS: The access patterns (single-item lookups by task ID, range queries by reviewer) map directly onto DynamoDB's key-based lookups and queries, and the scale requirements made a serverless data store more cost-effective
  • Eventually consistent reads: We accepted eventual consistency for dashboard metrics to avoid the cost of strongly consistent reads at scale, and added a disclaimer in the UI noting a 5-second delay
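The explainability argument for rules-based routing can be illustrated with a small sketch in which every decision carries the list of reasons that produced it. This is an assumption-laden illustration, not the actual routing engine: the rule set (jurisdiction match, category match, workload cap), the MAX_OPEN_TASKS value, and all class names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical rules-based router that records why each reviewer was accepted or rejected. */
final class ExplainableRouter {

    static final int MAX_OPEN_TASKS = 10; // illustrative workload cap, not from the source

    static final class Reviewer {
        final String id;
        final String jurisdiction;
        final String category;
        final int openTasks;

        Reviewer(String id, String jurisdiction, String category, int openTasks) {
            this.id = id;
            this.jurisdiction = jurisdiction;
            this.category = category;
            this.openTasks = openTasks;
        }
    }

    static final class Task {
        final String id;
        final String jurisdiction;
        final String category;

        Task(String id, String jurisdiction, String category) {
            this.id = id;
            this.jurisdiction = jurisdiction;
            this.category = category;
        }
    }

    static final class Decision {
        final String reviewerId; // null when no reviewer is eligible
        final List<String> reasons;

        Decision(String reviewerId, List<String> reasons) {
            this.reviewerId = reviewerId;
            this.reasons = reasons;
        }
    }

    /** Applies each rule in order and picks the eligible reviewer with the fewest open tasks. */
    static Decision route(Task task, List<Reviewer> reviewers) {
        List<String> reasons = new ArrayList<>();
        Reviewer best = null;
        for (Reviewer r : reviewers) {
            if (!r.jurisdiction.equals(task.jurisdiction)) {
                reasons.add(r.id + ": rejected, covers " + r.jurisdiction + " not " + task.jurisdiction);
                continue;
            }
            if (!r.category.equals(task.category)) {
                reasons.add(r.id + ": rejected, no expertise in " + task.category);
                continue;
            }
            if (r.openTasks >= MAX_OPEN_TASKS) {
                reasons.add(r.id + ": rejected, at workload cap (" + r.openTasks + ")");
                continue;
            }
            reasons.add(r.id + ": eligible with " + r.openTasks + " open tasks");
            if (best == null || r.openTasks < best.openTasks) {
                best = r;
            }
        }
        reasons.add(best == null
                ? "no eligible reviewer"
                : "selected " + best.id + ": lowest workload among eligible reviewers");
        return new Decision(best == null ? null : best.id, reasons);
    }
}
```

Because the reasons list is built by the same code path that makes the choice, the audit trail cannot drift from the actual decision logic, which is the property an opaque ML model makes harder to guarantee.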

Implementation

Core components built in Java with Spring Boot:

  • Priority scoring algorithm: Multi-factor scoring considering task urgency, product risk level, reviewer expertise match, and current queue depth
  • Work stealing: Idle reviewers can pull tasks from overloaded peers' queues with automatic rebalancing
  • SLA tracking: Real-time SLA countdown with escalation triggers at 75% and 90% of deadline
  • Marketplace onboarding: Self-service onboarding flow for new marketplace regions with configurable routing rules
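The SLA escalation triggers can be sketched as a pure function of three timestamps. The 75% and 90% thresholds come from the description above; the level names (WARNING, CRITICAL, BREACHED), the extra BREACHED state, and the class itself are assumptions made for illustration.

```java
import java.time.Duration;
import java.time.Instant;

/** Hypothetical SLA tracker: maps elapsed time to an escalation level. */
final class SlaTracker {

    enum Escalation { NONE, WARNING, CRITICAL, BREACHED }

    /**
     * Returns WARNING once 75% of the SLA window has elapsed, CRITICAL at 90%,
     * and BREACHED at or past the deadline. Integer math avoids rounding issues
     * exactly at the thresholds.
     */
    static Escalation levelFor(Instant created, Instant deadline, Instant now) {
        long total = Duration.between(created, deadline).toMillis();
        if (total <= 0) {
            throw new IllegalArgumentException("deadline must be after creation time");
        }
        long elapsed = Math.max(0L, Duration.between(created, now).toMillis());
        if (elapsed >= total) {
            return Escalation.BREACHED;
        }
        if (elapsed * 100 >= total * 90) {
            return Escalation.CRITICAL;
        }
        if (elapsed * 100 >= total * 75) {
            return Escalation.WARNING;
        }
        return Escalation.NONE;
    }

    public static void main(String[] args) {
        Instant created = Instant.parse("2023-06-10T00:00:00Z");
        Instant deadline = created.plus(Duration.ofHours(4));
        // Three hours into a four-hour window is exactly 75% elapsed.
        System.out.println(levelFor(created, deadline, created.plus(Duration.ofHours(3))));
    }
}
```

Passing the current time in as a parameter, rather than reading the clock inside the method, keeps the function deterministic and easy to test.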

Challenges

  • Handling timezone-aware allocation across 14 global sites with different working hours and holidays
  • Preventing task starvation for low-priority items while ensuring high-priority safety reviews meet strict SLAs
  • Designing the system to gracefully degrade during traffic spikes from mass product recalls
  • Integration testing across 8 upstream services with different release cycles and API versions
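The write-up does not say how starvation was prevented, so the following is only one common approach, priority aging, shown as a sketch under that assumption. A task's effective priority grows the longer it waits, so low-priority work eventually outranks newer high-priority work. The class name, the linear aging formula, and the minute-based clock are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical queue where waiting time raises a task's effective priority. */
final class AgingQueue {

    static final class QueuedTask {
        final String id;
        final int basePriority;
        final long enqueuedAtMinute;

        QueuedTask(String id, int basePriority, long enqueuedAtMinute) {
            this.id = id;
            this.basePriority = basePriority;
            this.enqueuedAtMinute = enqueuedAtMinute;
        }
    }

    private final double agingPerMinute;
    private final List<QueuedTask> tasks = new ArrayList<>();

    AgingQueue(double agingPerMinute) {
        this.agingPerMinute = agingPerMinute;
    }

    void add(String id, int basePriority, long nowMinute) {
        tasks.add(new QueuedTask(id, basePriority, nowMinute));
    }

    /** Base priority plus a linear bonus for every minute spent waiting. */
    double effectivePriority(QueuedTask task, long nowMinute) {
        return task.basePriority + agingPerMinute * (nowMinute - task.enqueuedAtMinute);
    }

    /**
     * Removes and returns the id of the task with the highest effective priority,
     * or null if the queue is empty. A linear scan is used because effective
     * priorities change with time, so a static heap ordering would go stale.
     */
    String pollNext(long nowMinute) {
        QueuedTask best = null;
        for (QueuedTask t : tasks) {
            if (best == null || effectivePriority(t, nowMinute) > effectivePriority(best, nowMinute)) {
                best = t;
            }
        }
        if (best == null) {
            return null;
        }
        tasks.remove(best);
        return best.id;
    }
}
```

The aging rate is the tuning knob: too low and low-priority tasks still starve, too high and urgent safety reviews get delayed behind old routine work.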

Future Improvements

  • ML-based reviewer matching using historical performance data
  • Predictive workload forecasting for capacity planning
  • Real-time WebSocket updates for the dashboard instead of polling
  • Cross-team work sharing during peak periods