How to coordinate distributed services reliably? (Part 2)

In Part-1 of this article, we discussed how to co-ordinate services in a distributed system reliably and the differences between Orchestration vs Choreography. In part-2, we shall discuss how you can hand-craft your own Orchestrator without using any Orchestration tools.

Orchestration example - Happy path

Suppose you want to build your own Orchestrator. Let’s see how anAsynchronous Orchestrator might handle the happy path of an Order workflow — where each step completes successfully.

The Orchestrator coordinates multiple services by recording intent for each step. For every action, it writes two things in the same transaction:

A record in the Saga table to track the workflow’s progress.
A record in the Outbox table with state PENDING to signal work for a service (worker).

Workers process the events in the Outbox and perform the following actions.

Update the state of the saga with the action that was just completed
Update the outbox record status to DONE after the action is performed.

The use of the Outbox table makes this Orchestrator inherently asynchronous.

Here`s a step-by-step overview of how the orchestrator and outbox workers interact. The saga and outbox datastores don`t have to be relational databases; they`re shown this way here for easier visualisation. Some might refer to this pattern as Orchestrated Choreography

Click to view 5 images

Orchestration example - Failure state

How would the saga and outbox tables look when there is a shipping failure?

Click to view 1 image

Orchestrated Choreography

With the Outbox pattern as demonstrated above, even though we are calling it Orchestration, the line between Orchestration and Choreography becomes blurry. The Orchestrator still drives the workflow by recording intent and publishing events, but from the perspective of the workers consuming and processing those events, it looks just like Choreography. Some refer to this as Orchestrated Choreography.

With Orchestrated Choreography, workers are still reacting to events (“reserve inventory,” “process payment,” etc.), just as they would in Choreography. The difference is who decides what comes next:

Orchestrator → central decision maker
Choreography → services collectively drive the flow.

Practical Considerations

1.Can the saga and outbox datastores be something other than database tables?

2.My services (workers) can only deal with Kafka events. I don't want to make my workers read from an outbox database table. What can I do?

3.How do retries and idempotency fit into the saga?

4.Should the orchestrator and workers share the same datastore?

5.Do sagas require strict ordering of events from the outbox?

6.How do I evolve the outbox message schema without breaking workers?

7.How do I scale the orchestrator without duplicate processing?

8.How do I make saga failures visible to operators or support teams?

Bonus : Orchestrator Saga Pseudocode

For the above Order workflow, here`s how the pseudo code might look like.


# Orchestrator
orchestrator process_order(order_id, amount):

  saga = fetch_saga(order_id)
  if saga is null:
    saga = create_saga(order_id, state = started, version = 0)

  # step 1: payment intent
  if saga.state == started:
    payment_outbox = create_outbox_entry(
      order_id = order_id,
      action = charge_payment,
      payload = {amount: amount, idempotency_key: order_id}
    )
    persist_atomic(saga_update = {state: payment_intent}, outbox_entry = payment_outbox)
    return

  # step 2: inventory intent
  if saga.state == payment_success:
    inventory_outbox = create_outbox_entry(
      order_id = order_id,
      action = reserve_inventory,
      payload = {idempotency_key: order_id}
    )
    persist_atomic(saga_update = {state: inventory_intent}, outbox_entry = inventory_outbox)
    return

  # step 3: shipping intent
  if saga.state == inventory_reserved:
    shipping_outbox = create_outbox_entry(
      order_id = order_id,
      action = schedule_shipping,
      payload = {idempotency_key: order_id}
    )
    persist_atomic(saga_update = {state: shipping_intent}, outbox_entry = shipping_outbox)
    return

  # final step: mark completed if shipping succeeded
  if saga.state == shipping_scheduled:
    persist_saga(saga_update = {state: completed})
    notify_customer(order_id, success = true)
    return


# global compensation function
compensate_order(order_id):

  saga = fetch_saga(order_id)

  # only trigger refund/compensation if a downstream step failed
  if saga.state in [inventory_failed, shipping_failed]:
    payment_id = saga.metadata.payment_id
    refund_outbox = create_outbox_entry(
      order_id = order_id,
      action = refund_payment,
      payload = {payment_id: payment_id, idempotency_key: "refund-" + order_id}
    )
    persist_atomic(saga_update = {state: refund_intent}, outbox_entry = refund_outbox)


# Payment worker
worker paymentworker:

  loop forever:
    batch = fetch_and_lock_outbox_rows(action="charge_payment", limit=100, stale_after=5m)

    if batch is empty:
      sleep(1s)
      continue

    for each entry in batch:
      process_payment_entry(entry)


process_payment_entry(entry):

  try:
    # call external payment service (idempotent)
    result = payment_service.charge(entry.payload)

    if result.success:
      -- transaction begin
         update saga
         set state = payment_success,
             metadata.payment_id = result.payment_id,
             version = version + 1
         where order_id = entry.order_id

         update outbox
         set status = done,
             attempts = attempts + 1,
             locked_at = null,
             next_try = null,
             last_error = null
         where outbox_id = entry.outbox_id
      -- transaction commit
      return

    else:
      raise external_transient_error   # treat as retryable failure

  catch external_transient_error:
    entry.attempts += 1

    if entry.attempts >= entry.max_attempts:
      -- transaction begin
         update saga
         set state = manual_intervention_required,
             version = version + 1
         where order_id = entry.order_id

         update outbox
         set status = failed,
             attempts = entry.attempts,
             locked_at = null
         where outbox_id = entry.outbox_id
      -- transaction commit
      alert_ops(entry.order_id)

    else:
      backoff = compute_backoff(entry.attempts)
      -- transaction begin
         update outbox
         set status = pending,
             next_try = now() + backoff,
             attempts = entry.attempts,
             locked_at = null
         where outbox_id = entry.outbox_id
      -- transaction commit

  catch external_permanent_error:
      -- transaction begin
         update saga
         set state = manual_intervention_required,
             version = version + 1
         where order_id = entry.order_id

         update outbox
         set status = failed,
             attempts = entry.attempts + 1,
             locked_at = null
         where outbox_id = entry.outbox_id
      -- transaction commit
      alert_ops(entry.order_id)


# Used by workers to lock and process entries in the outbox table
function fetch_and_lock_outbox_rows(action, limit=100, stale_after_minutes=5):

  now = current_timestamp()

  -- start transaction
  rows = select *
         from outbox
         where action = action
           and (
               status = 'pending'
               or (status = 'in_progress' and locked_at < now - interval 'stale_after_minutes minutes')
           )
         order by created_at asc
         limit limit
         for update skip locked

  if rows is empty:
    -- nothing to process, commit and return empty list
    commit
    return []

  -- mark rows as locked/in_progress
  for each row in rows:
    update outbox
    set status = 'in_progress',
        locked_at = now
    where outbox_id = row.outbox_id

  commit
  return rows

On this page

About the author