Cloud Infrastructure · Featured

OCI Compute WorkRequest Service

Designed and built an asynchronous work tracking service for Oracle Cloud Infrastructure Compute, enabling reliable long-running operation management across distributed compute resources.

January 15, 2024 · 3 min read
Java, Go, Terraform, OCI, REST APIs, Distributed Systems

Overview

The WorkRequest Service is a core infrastructure component within Oracle Cloud Infrastructure (OCI) Compute that tracks the lifecycle of long-running asynchronous operations. When a customer provisions a VM, scales a bare metal instance, or modifies networking configurations, the resulting operation can take minutes to complete. The WorkRequest Service provides a unified API for customers and internal services to monitor progress, handle failures, and maintain audit trails.

Problem

OCI Compute operations are inherently asynchronous: provisioning a bare metal server involves BIOS configuration, network setup, storage attachment, and OS installation. Without a centralized tracking mechanism, customers had no reliable way to know whether an operation was still in progress, had failed, or had completed. Internal services also lacked a standardized way to report progress, which led to inconsistent error handling and a poor customer experience.

Architecture

The service follows a producer-consumer architecture built on top of OCI's internal messaging infrastructure:

  • API Layer: RESTful endpoints that allow customers to query work request status, list operations, and receive progress updates
  • State Machine: Each work request transitions through well-defined states (ACCEPTED → IN_PROGRESS → SUCCEEDED/FAILED/CANCELED)
  • Event Ingestion: Internal services publish state transitions via an event bus
  • Persistence Layer: Work request state is durably stored with full history for audit compliance
  • Polling & Webhooks: Customers can poll for status or register for webhook notifications
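
The state machine described above can be sketched as an enum with an explicit transition table. This is a minimal illustration, not the service's actual code: the state names come from the list above, while the `WorkRequest` class, its methods, and the choice to allow ACCEPTED to move directly to FAILED or CANCELED are assumptions made for the example.

```java
import java.util.EnumSet;
import java.util.Set;

/** Lifecycle states named in the State Machine bullet. */
enum WorkRequestState {
    ACCEPTED, IN_PROGRESS, SUCCEEDED, FAILED, CANCELED;

    /** Terminal states accept no further transitions. */
    boolean isTerminal() {
        return this == SUCCEEDED || this == FAILED || this == CANCELED;
    }

    /** States reachable from this one (assumed transition table). */
    Set<WorkRequestState> allowedNext() {
        switch (this) {
            case ACCEPTED:
                // Assumption: a request can fail or be canceled before work starts.
                return EnumSet.of(IN_PROGRESS, FAILED, CANCELED);
            case IN_PROGRESS:
                return EnumSet.of(SUCCEEDED, FAILED, CANCELED);
            default:
                // Terminal states have no outgoing transitions.
                return EnumSet.noneOf(WorkRequestState.class);
        }
    }

    boolean canTransitionTo(WorkRequestState next) {
        return allowedNext().contains(next);
    }
}

/** Hypothetical in-memory work request that enforces the transition table. */
final class WorkRequest {
    private final String id;
    private WorkRequestState state = WorkRequestState.ACCEPTED;

    WorkRequest(String id) {
        this.id = id;
    }

    String id() {
        return id;
    }

    WorkRequestState state() {
        return state;
    }

    /** Moves to the next state, or throws if the transition is not allowed. */
    void transitionTo(WorkRequestState next) {
        if (!state.canTransitionTo(next)) {
            throw new IllegalStateException(
                    state + " -> " + next + " is not allowed for " + id);
        }
        state = next;
    }
}
```

Keeping the allowed transitions in one place means an out-of-order event (for example, IN_PROGRESS arriving after SUCCEEDED) is rejected by the model rather than by each caller.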

Tradeoffs

  • Eventual consistency over strong consistency: Work request status may lag behind actual operation state by a few seconds, which is acceptable for the use case but required careful API documentation
  • Pull-based over push-based: We chose polling as the primary interface with optional webhooks, reducing infrastructure complexity while meeting most customer needs
  • Flat state model over hierarchical: Operations with sub-tasks flatten progress into a single percentage rather than exposing a tree of sub-operations, simplifying the API at the cost of granularity
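
To make the flat state model concrete, here is one way sub-task progress could be collapsed into a single percentage. It is a sketch under stated assumptions: the `SubTask` and `FlatProgress` names and the per-task weights are hypothetical, and the write-up does not say how the real service weights its sub-operations.

```java
import java.util.List;

/** One sub-task of a larger operation; the weight is a hypothetical relative cost. */
final class SubTask {
    final String name;
    final double weight;
    final double fractionDone;

    SubTask(String name, double weight, double fractionDone) {
        if (weight <= 0) {
            throw new IllegalArgumentException("weight must be positive");
        }
        if (fractionDone < 0 || fractionDone > 1) {
            throw new IllegalArgumentException("fractionDone must be in [0, 1]");
        }
        this.name = name;
        this.weight = weight;
        this.fractionDone = fractionDone;
    }
}

final class FlatProgress {
    private FlatProgress() {}

    /** Weighted mean of sub-task completion, reported as a 0-100 percentage. */
    static double percentComplete(List<SubTask> subTasks) {
        if (subTasks.isEmpty()) {
            return 0.0; // nothing reported yet
        }
        double totalWeight = 0;
        double done = 0;
        for (SubTask task : subTasks) {
            totalWeight += task.weight;
            done += task.weight * task.fractionDone;
        }
        return 100.0 * done / totalWeight;
    }
}
```

With four equally weighted steps (BIOS configuration and network setup finished, storage attachment half done, OS installation not started), `percentComplete` returns 62.5. The caller sees one number and cannot tell which step is slow, which is the granularity cost noted above.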

Implementation

The service was implemented in Java with a Go-based sidecar for health monitoring. Key implementation details:

  • Idempotent state transitions: Duplicate events from retry storms are handled gracefully using event deduplication
  • TTL-based cleanup: Completed work requests are retained for 30 days before archival, balancing storage costs with audit requirements
  • Rate limiting: Per-customer rate limits prevent API abuse during outage scenarios when customers may poll aggressively
  • Terraform integration: Custom Terraform providers wait on work request completion, enabling infrastructure-as-code workflows
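
The idempotent transitions mentioned above can be illustrated with deduplication keyed on an event ID. This is a simplified sketch, not the production design: `TransitionEvent`, `IdempotentApplier`, and the producer-assigned `eventId` field are hypothetical, and the in-memory set stands in for whatever durable, expiring store a real deployment would need.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** A state-transition event; eventId is a hypothetical unique key set by the producer. */
final class TransitionEvent {
    final String eventId;
    final String workRequestId;
    final String newState;

    TransitionEvent(String eventId, String workRequestId, String newState) {
        this.eventId = eventId;
        this.workRequestId = workRequestId;
        this.newState = newState;
    }
}

/** Applies each event at most once, so retried deliveries are harmless. */
final class IdempotentApplier {
    // Assumption: in-memory tracking for illustration only.
    private final Set<String> seenEventIds = new HashSet<>();
    private final Map<String, String> stateByRequest = new HashMap<>();

    /** Returns true if the event was applied, false if it was a duplicate. */
    synchronized boolean apply(TransitionEvent event) {
        if (!seenEventIds.add(event.eventId)) {
            return false; // already processed; ignore the retry
        }
        stateByRequest.put(event.workRequestId, event.newState);
        return true;
    }

    /** Returns the last applied state, or null if the request is unknown. */
    synchronized String stateOf(String workRequestId) {
        return stateByRequest.get(workRequestId);
    }
}
```

Because a redelivered event is dropped before it touches state, a late retry of an IN_PROGRESS event cannot overwrite a SUCCEEDED status that was recorded in the meantime.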

Challenges

  • Handling partial failures where the underlying operation succeeds but the work request update fails, requiring reconciliation jobs
  • Designing backward-compatible API changes as new operation types were onboarded with different progress semantics
  • Managing cross-region consistency for work requests that span multiple availability domains
  • Performance optimization under high fan-out scenarios where a single customer operation triggers hundreds of sub-operations
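
The reconciliation jobs mentioned in the first challenge can be pictured as a comparison between recorded work request state and the authoritative outcome of the underlying operation. The sketch below is illustrative only: `WorkRequestReconciler`, the two map inputs, and the rule of correcting only non-terminal records are assumptions, not a description of the actual job.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

/** Finds work requests whose recorded state missed a terminal update. */
final class WorkRequestReconciler {
    private static final Set<String> TERMINAL = Set.of("SUCCEEDED", "FAILED", "CANCELED");

    private WorkRequestReconciler() {}

    /**
     * Compares recorded states with authoritative outcomes (both keyed by
     * work request ID) and returns the corrections to apply, sorted by ID.
     */
    static Map<String, String> findCorrections(Map<String, String> recorded,
                                               Map<String, String> authoritative) {
        Map<String, String> corrections = new TreeMap<>();
        for (Map.Entry<String, String> entry : recorded.entrySet()) {
            String actual = authoritative.get(entry.getKey());
            boolean stuck = !TERMINAL.contains(entry.getValue());
            // Only repair records that never reached a terminal state
            // although the underlying operation did.
            if (stuck && actual != null && TERMINAL.contains(actual)) {
                corrections.put(entry.getKey(), actual);
            }
        }
        return corrections;
    }
}
```

Restricting corrections to records stuck in a non-terminal state keeps the job from rewriting history that customers may already have observed.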

Future Improvements

  • Server-Sent Events (SSE) for real-time progress streaming
  • GraphQL API for flexible querying of work request metadata
  • Machine learning-based ETA predictions using historical operation duration data
  • Integration with OCI's notification service for richer alerting workflows