OCI Compute WorkRequest Service

Overview

The WorkRequest Service is a core infrastructure component within Oracle Cloud Infrastructure (OCI) Compute that tracks the lifecycle of long-running asynchronous operations. When a customer provisions a VM, scales a bare metal instance, or modifies a networking configuration, the operation can take minutes to complete. The WorkRequest Service provides a unified API through which customers and internal services monitor progress, handle failures, and maintain audit trails.

Problem

OCI Compute operations are inherently asynchronous: provisioning a bare metal server involves BIOS configuration, network setup, storage attachment, and OS installation. Without a centralized tracking mechanism, customers had no reliable way to know whether an operation was still in progress, had failed, or had completed. Internal services also lacked a standardized way to report progress, which led to inconsistent error handling and a poor customer experience.

Architecture

The service follows a producer-consumer architecture built on top of OCI's internal messaging infrastructure:

API Layer: RESTful endpoints that allow customers to query work request status, list operations, and receive progress updates
State Machine: Each work request transitions through well-defined states (ACCEPTED → IN_PROGRESS → SUCCEEDED/FAILED/CANCELED)
Event Ingestion: Internal services publish state transitions via an event bus
Persistence Layer: Work request state is durably stored with full history for audit compliance
Polling & Webhooks: Customers can poll for status or register for webhook notifications
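
The state machine above can be sketched in a few lines. This is a minimal illustration, not the service's actual code: the class names and the history list are hypothetical, and the transition table assumes that only the arrows written above are legal (a real service might, for instance, also allow ACCEPTED to move directly to CANCELED).

```python
from enum import Enum


class WorkRequestState(Enum):
    ACCEPTED = "ACCEPTED"
    IN_PROGRESS = "IN_PROGRESS"
    SUCCEEDED = "SUCCEEDED"
    FAILED = "FAILED"
    CANCELED = "CANCELED"


# Assumption: only the transitions drawn in the Architecture list are allowed.
ALLOWED_TRANSITIONS = {
    WorkRequestState.ACCEPTED: {WorkRequestState.IN_PROGRESS},
    WorkRequestState.IN_PROGRESS: {
        WorkRequestState.SUCCEEDED,
        WorkRequestState.FAILED,
        WorkRequestState.CANCELED,
    },
    # Terminal states have no outgoing transitions.
    WorkRequestState.SUCCEEDED: set(),
    WorkRequestState.FAILED: set(),
    WorkRequestState.CANCELED: set(),
}


class InvalidTransition(Exception):
    """Raised when a requested state change is not in the transition table."""


class WorkRequest:
    """Hypothetical in-memory work request that records its state history."""

    def __init__(self, work_request_id):
        self.id = work_request_id
        self.state = WorkRequestState.ACCEPTED
        self.history = [WorkRequestState.ACCEPTED]

    def transition(self, new_state):
        if new_state not in ALLOWED_TRANSITIONS[self.state]:
            raise InvalidTransition(f"{self.state.name} -> {new_state.name}")
        self.state = new_state
        self.history.append(new_state)
```

Keeping the table as data rather than as branching logic makes it straightforward to review which transitions a new operation type is allowed to emit.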

Tradeoffs

Eventual consistency over strong consistency: Work request status may lag behind the actual operation state by a few seconds. This is acceptable for the use case, but the lag had to be documented carefully in the API documentation
Pull-based over push-based: We chose polling as the primary interface with optional webhooks, reducing infrastructure complexity while meeting most customer needs
Flat state model over hierarchical: Operations with sub-tasks flatten progress into a single percentage rather than exposing a tree of sub-operations, simplifying the API at the cost of granularity
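
To make the flat state model concrete, the sketch below collapses several sub-tasks into one percentage using a weighted average. The SubTask type, the weights, and the weighted-average rule are all assumptions made for illustration; the source only states that progress is flattened into a single percentage, not how.

```python
from dataclasses import dataclass


@dataclass
class SubTask:
    name: str
    weight: float            # hypothetical relative share of the total work
    percent_complete: float  # 0 to 100


def flatten_progress(sub_tasks):
    """Collapse sub-task progress into a single percentage (weighted average)."""
    total_weight = sum(t.weight for t in sub_tasks)
    if total_weight <= 0:
        return 0.0
    weighted = sum(
        # Clamp each value so one bad report cannot push the total out of range.
        t.weight * min(max(t.percent_complete, 0.0), 100.0)
        for t in sub_tasks
    )
    return round(weighted / total_weight, 1)


# Example using the bare metal provisioning steps named in the Problem section.
steps = [
    SubTask("bios_configuration", 1, 100),
    SubTask("network_setup", 1, 100),
    SubTask("storage_attachment", 1, 50),
    SubTask("os_installation", 2, 0),
]
overall = flatten_progress(steps)  # 50.0
```

The caller sees only the overall number; which step is currently running is exactly the granularity that this tradeoff gives up.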

Implementation

The service was implemented in Java with a Go-based sidecar for health monitoring. Key implementation details:

Idempotent state transitions: Duplicate events, such as those produced by retry storms, are deduplicated so that applying the same transition more than once has no further effect
TTL-based cleanup: Completed work requests are retained for 30 days before archival, balancing storage costs with audit requirements
Rate limiting: Per-customer rate limits prevent API abuse during outage scenarios when customers may poll aggressively
Terraform integration: Custom Terraform providers wait on work request completion, enabling infrastructure-as-code workflows
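
As an illustration of the idempotent state transitions item, the sketch below deduplicates events by an event identifier before applying them. The store class, the event_id field, and the rule that events arriving after a terminal state are ignored are assumptions for this example; a production service would keep this bookkeeping in durable storage rather than in memory.

```python
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "CANCELED"}


class WorkRequestStore:
    """Hypothetical in-memory store that applies each event at most once."""

    def __init__(self):
        self._states = {}         # work_request_id -> current state
        self._seen_events = set() # event ids that were already processed

    def state(self, work_request_id):
        # Assumption: a work request that has no events yet is ACCEPTED.
        return self._states.get(work_request_id, "ACCEPTED")

    def apply(self, event_id, work_request_id, new_state):
        """Apply an event; return True only if it changed the stored state."""
        if event_id in self._seen_events:
            return False  # duplicate delivery, for example from a retry
        self._seen_events.add(event_id)
        if self.state(work_request_id) in TERMINAL_STATES:
            return False  # assumption: late events after a terminal state are dropped
        self._states[work_request_id] = new_state
        return True


store = WorkRequestStore()
first = store.apply("evt-1", "wr-1", "IN_PROGRESS")   # applied
replay = store.apply("evt-1", "wr-1", "IN_PROGRESS")  # ignored as a duplicate
```

Returning a boolean lets the caller distinguish a real transition from a replay, which is useful when only real transitions should be written to the audit history.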

Challenges

Handling partial failures where the underlying operation succeeds but the work request update fails, requiring reconciliation jobs
Designing backward-compatible API changes as new operation types were onboarded with different progress semantics
Managing consistency across availability domains and regions for work requests whose operations span more than one of them
Performance optimization under high fan-out scenarios where a single customer operation triggers hundreds of sub-operations
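
The reconciliation jobs mentioned in the first challenge can be pictured as a periodic comparison between recorded work request states and the states reported by the services that own the operations. The sketch below is a simplified assumption of how such a job might look: the function name, the dictionary inputs, and the rule of repairing only non-terminal records are hypothetical.

```python
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "CANCELED"}


def reconcile(recorded_states, actual_states):
    """Repair work requests whose operation finished but whose record did not.

    recorded_states: work_request_id -> state stored by the work request service
    actual_states:   work_request_id -> state reported by the owning service
    Returns the ids that were repaired. recorded_states is updated in place.
    """
    repaired = []
    for work_request_id, recorded in recorded_states.items():
        if recorded in TERMINAL_STATES:
            continue  # already final, nothing to repair
        actual = actual_states.get(work_request_id)
        if actual in TERMINAL_STATES:
            # The operation reached a final state but the update was lost.
            recorded_states[work_request_id] = actual
            repaired.append(work_request_id)
    return repaired


recorded = {"wr-1": "IN_PROGRESS", "wr-2": "IN_PROGRESS", "wr-3": "SUCCEEDED"}
actual = {"wr-1": "SUCCEEDED", "wr-2": "IN_PROGRESS", "wr-3": "SUCCEEDED"}
fixed = reconcile(recorded, actual)  # ["wr-1"]
```

Only records that are still open are touched, so the job is safe to run repeatedly.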

Future Improvements

Server-Sent Events (SSE) for real-time progress streaming
GraphQL API for flexible querying of work request metadata
Machine learning-based ETA predictions using historical operation duration data
Integration with OCI's notification service for richer alerting workflows