Overview
The WorkRequest Service is a core infrastructure component within Oracle Cloud Infrastructure (OCI) Compute that tracks the lifecycle of long-running asynchronous operations. Provisioning a VM, scaling a bare metal instance, or modifying a networking configuration can take minutes to complete. The WorkRequest Service gives customers and internal services a unified API to monitor progress, handle failures, and maintain audit trails.
Problem
OCI Compute operations are inherently asynchronous: provisioning a bare metal server involves BIOS configuration, network setup, storage attachment, and OS installation. Without a centralized tracking mechanism, customers had no reliable way to know whether an operation was still in progress, had failed, or had completed. Internal services also lacked a standardized way to report progress, which led to inconsistent error handling and a poor customer experience.
Architecture
The service follows a producer-consumer architecture built on top of OCI's internal messaging infrastructure:
- API Layer: RESTful endpoints that allow customers to query work request status, list operations, and receive progress updates
- State Machine: Each work request transitions through well-defined states (ACCEPTED → IN_PROGRESS → SUCCEEDED/FAILED/CANCELED)
- Event Ingestion: Internal services publish state transitions via an event bus
- Persistence Layer: Work request state is durably stored with full history for audit compliance
- Polling & Webhooks: Customers can poll for status or register for webhook notifications
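The state machine described above can be sketched as a small transition table. This is a minimal illustration, not the service's actual code: the status names come from the list above, while the `Status` enum, the `ALLOWED` table, and the `transition` helper are hypothetical names. The table assumes the flow is exactly the one shown (for example, no direct ACCEPTED to CANCELED edge); the real service may permit more transitions.

```python
from enum import Enum


class Status(Enum):
    ACCEPTED = "ACCEPTED"
    IN_PROGRESS = "IN_PROGRESS"
    SUCCEEDED = "SUCCEEDED"
    FAILED = "FAILED"
    CANCELED = "CANCELED"


# Assumed transition table, derived only from the flow
# ACCEPTED -> IN_PROGRESS -> SUCCEEDED/FAILED/CANCELED.
ALLOWED = {
    Status.ACCEPTED: {Status.IN_PROGRESS},
    Status.IN_PROGRESS: {Status.SUCCEEDED, Status.FAILED, Status.CANCELED},
    Status.SUCCEEDED: set(),
    Status.FAILED: set(),
    Status.CANCELED: set(),
}


def is_terminal(status: Status) -> bool:
    """A status is terminal when no further transition is allowed."""
    return not ALLOWED[status]


def transition(current: Status, target: Status) -> Status:
    """Return the new status, or raise if the move is not in the table."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.value} -> {target.value}")
    return target
```

Keeping the legal moves in one table makes it straightforward to reject out-of-order events, such as an IN_PROGRESS update that arrives after a work request has already reached SUCCEEDED.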
Tradeoffs
- Eventual consistency over strong consistency: Work request status may lag behind the actual operation state by a few seconds. This lag is acceptable for status tracking, but it had to be documented carefully in the API
- Pull-based over push-based: We chose polling as the primary interface with optional webhooks, reducing infrastructure complexity while meeting most customer needs
- Flat state model over hierarchical: Operations with sub-tasks flatten progress into a single percentage rather than exposing a tree of sub-operations, simplifying the API at the cost of granularity
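To illustrate the flat state model, the sketch below collapses sub-task progress into a single percentage using a weighted average. The weighting scheme, the `SubTask` type, and the `flatten_progress` function are assumptions made for illustration; the source does not say how the service combines sub-task progress.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class SubTask:
    name: str
    weight: float            # relative share of the total work (hypothetical)
    percent_complete: float  # 0.0 to 100.0


def flatten_progress(sub_tasks: Sequence[SubTask]) -> float:
    """Collapse sub-task progress into one weighted percentage."""
    total_weight = sum(t.weight for t in sub_tasks)
    if total_weight <= 0:
        return 0.0  # nothing to measure yet
    weighted = sum(t.weight * t.percent_complete for t in sub_tasks)
    return round(weighted / total_weight, 1)


# Example: a bare metal provisioning flow with made-up weights.
steps = [
    SubTask("bios_configuration", weight=1, percent_complete=100),
    SubTask("network_setup", weight=1, percent_complete=100),
    SubTask("storage_attachment", weight=1, percent_complete=50),
    SubTask("os_installation", weight=2, percent_complete=0),
]
overall = flatten_progress(steps)  # (100 + 100 + 50 + 0) / 5 = 50.0
```

The caller sees only `overall`, which is the granularity cost named in the tradeoff: a value of 50.0 does not reveal which sub-task is lagging.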
Implementation
The service was implemented in Java with a Go-based sidecar for health monitoring. Key implementation details:
- Idempotent state transitions: Duplicate events, such as those produced by retry storms, are deduplicated so that applying the same transition more than once has no additional effect
- TTL-based cleanup: Completed work requests are retained for 30 days before archival, balancing storage costs with audit requirements
- Rate limiting: Per-customer rate limits prevent API abuse during outage scenarios when customers may poll aggressively
- Terraform integration: Custom Terraform providers wait on work request completion, enabling infrastructure-as-code workflows
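The idempotent transition handling above can be sketched as follows. This is an in-memory illustration under stated assumptions: each event is assumed to carry a unique `event_id`, and the `WorkRequest` type and `apply_event` function are hypothetical names. A production implementation would need to persist the set of seen IDs together with the state change.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple


@dataclass
class WorkRequest:
    status: str = "ACCEPTED"
    seen_event_ids: Set[str] = field(default_factory=set)
    history: List[Tuple[str, str]] = field(default_factory=list)


def apply_event(wr: WorkRequest, event_id: str, new_status: str) -> bool:
    """Apply a state-transition event once; return False for a duplicate."""
    if event_id in wr.seen_event_ids:
        # Duplicate delivery: leave status and history untouched.
        return False
    wr.seen_event_ids.add(event_id)
    wr.status = new_status
    wr.history.append((event_id, new_status))
    return True


# A retried publish delivers the same event twice; only the first one counts.
wr = WorkRequest()
first = apply_event(wr, "evt-1", "IN_PROGRESS")
retry = apply_event(wr, "evt-1", "IN_PROGRESS")
```

After both calls, `wr.history` holds a single entry, so the audit trail is not inflated by retries.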
Challenges
- Handling partial failures where the underlying operation succeeds but the work request update fails, requiring reconciliation jobs
- Designing backward-compatible API changes as new operation types were onboarded with different progress semantics
- Managing consistency for work requests whose operations span multiple availability domains
- Optimizing performance under high fan-out, where a single customer operation triggers hundreds of sub-operations
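As a rough sketch of the reconciliation jobs mentioned in the first challenge, the function below repairs work requests whose recorded status never reached a terminal state even though the underlying operation finished. The `reconcile` function, the `lookup_actual` callback, and the in-memory dictionary are hypothetical stand-ins for whatever the service uses to query real resource state and persist work requests.

```python
from typing import Callable, Dict, List, Optional, Tuple

TERMINAL = {"SUCCEEDED", "FAILED", "CANCELED"}


def reconcile(
    recorded: Dict[str, str],
    lookup_actual: Callable[[str], Optional[str]],
) -> List[Tuple[str, str, str]]:
    """Repair work requests whose recorded status lags the real outcome.

    recorded maps work request ID to its stored status. lookup_actual
    returns the operation's true outcome, or None if it is still running.
    Returns (work_request_id, old_status, new_status) for each repair.
    """
    repairs = []
    for wr_id, status in list(recorded.items()):
        if status in TERMINAL:
            continue  # already settled; nothing to reconcile
        actual = lookup_actual(wr_id)
        if actual in TERMINAL:
            recorded[wr_id] = actual
            repairs.append((wr_id, status, actual))
    return repairs


# wr-1 finished but its final update was lost; wr-3 is still running.
recorded = {"wr-1": "IN_PROGRESS", "wr-2": "SUCCEEDED", "wr-3": "IN_PROGRESS"}
actual_outcomes = {"wr-1": "SUCCEEDED", "wr-3": None}
repairs = reconcile(recorded, actual_outcomes.get)
```

Only non-terminal records are checked, which keeps the job's cost proportional to the number of in-flight work requests rather than the full retained history.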
Future Improvements
- Server-Sent Events (SSE) for real-time progress streaming
- GraphQL API for flexible querying of work request metadata
- Machine learning-based ETA predictions using historical operation duration data
- Integration with OCI's notification service for richer alerting workflows