Building Multi-Agent Workflows on Lambda Durable Functions

May 20, 2026

Hey, it’s Lefteris 👋 I’m the voice behind the weekly newsletter “The Cloud Engineers.”

Multi-agent systems are everywhere now. Every team wants autonomous agents collaborating on complex tasks: research, planning, execution, validation. The problem? Orchestrating multiple agents that need to wait on each other, handle failures gracefully, and maintain state across long-running conversations is painful. We’ve been duct-taping this together with Step Functions, SQS queues, and DynamoDB state tables for too long.

Lambda Durable Functions change the game here. In this article we will walk through why durable functions are a natural fit for multi-agent orchestration, design a document processing pipeline with four agents, and show how the architecture holds together without a single line of state management code.

Why Durable Functions for Multi-Agent

Standard Lambda functions run start-to-finish in a single invocation. If something fails midway, you retry everything. For a multi-agent workflow where Agent A researches, Agent B plans, Agent C executes, and Agent D validates, that’s unacceptable. You can’t re-run a 4-minute research phase because the validation agent hit a transient error.

Durable functions automatically checkpoint progress, suspend execution for up to one year during long-running tasks, and recover from failures. No custom state management. No DynamoDB tables tracking “which step are we on.” The runtime handles it.

This is exactly what multi-agent orchestration needs: reliable progress tracking across agents that may take seconds or minutes to respond, with automatic recovery when things go wrong.

The Example: Document Processing Pipeline

Let’s build a document processing pipeline on Lambda durable functions. One durable function orchestrates four agents:

Classifier Agent — Reads an incoming document, determines its type (invoice, contract, support ticket), and extracts metadata.
Enrichment Agent — Takes the classification, pulls additional context from internal systems, and augments the document with business context.
Decision Agent — Evaluates the enriched document against business rules and decides the routing: auto-approve, escalate, or reject.
Action Agent — Executes the decision: files the document, notifies stakeholders, or triggers downstream workflows.
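To make the hand-offs between the four agents concrete, here is a minimal sketch of what each one takes in and returns. Everything in it is an assumption for illustration: the dataclass names, the fields, and the keyword matching that stands in for the LLM call. Real agents would call Bedrock and internal APIs instead.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Classification:
    doc_type: str  # "invoice", "contract", or "support_ticket"
    metadata: Dict[str, str] = field(default_factory=dict)


@dataclass
class Enriched:
    classification: Classification
    business_context: Dict[str, str]


@dataclass
class Decision:
    route: str  # "auto_approve", "escalate", or "reject"


def classifier_agent(text: str) -> Classification:
    # Stand-in for an LLM call: a keyword match keeps the sketch runnable.
    lowered = text.lower()
    for doc_type in ("invoice", "contract"):
        if doc_type in lowered:
            return Classification(doc_type, {"length": str(len(text))})
    return Classification("support_ticket", {"length": str(len(text))})


def enrichment_agent(c: Classification) -> Enriched:
    # Stand-in for lookups against internal systems.
    owner = {"invoice": "finance", "contract": "legal"}.get(c.doc_type, "support")
    return Enriched(c, {"owner_team": owner})


def decision_agent(e: Enriched) -> Decision:
    # Hypothetical business rule: contracts always go to a human.
    if e.classification.doc_type == "contract":
        return Decision("escalate")
    return Decision("auto_approve")


def action_agent(d: Decision) -> str:
    # Maps each route to the action taken on the document.
    actions = {"auto_approve": "filed", "escalate": "sent_to_review", "reject": "rejected"}
    return actions[d.route]


# Each agent's output is the next agent's input.
outcome = action_agent(decision_agent(enrichment_agent(classifier_agent("Invoice #42"))))
```

The point of the typed hand-offs is that each agent only depends on the previous agent's output, which is what lets an orchestrator checkpoint between them.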

Architecture

┌─────────────────────────────────────────────────────────────────────────────┐
│                     Lambda Durable Function (Orchestrator)                  │
│                                                                             │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐    ┌─────────────┐   │
│  │ Checkpoint 1│    │ Checkpoint 2│    │ Checkpoint 3│    │ Checkpoint 4│   │
│  └──────┬──────┘    └──────┬──────┘    └──────┬──────┘    └──────┬──────┘   │
│         │                  │                  │                  │          │
│         ▼                  ▼                  ▼                  ▼          │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐    ┌─────────────┐   │
│  │  Step 1:    │    │  Step 2:    │    │  Step 3:    │    │  Step 4:    │   │
│  │  Classify   │───▶│  Enrich     │───▶│  Decide     │───▶│  Act        │   │
│  └──────┬──────┘    └──────┬──────┘    └──────┬──────┘    └──────┬──────┘   │
│         │                  │                  │                  │          │
└─────────┼──────────────────┼──────────────────┼──────────────────┼──────────┘
          │                  │                  │                  │
          ▼                  ▼                  ▼                  ▼
┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐
│  Classifier     │  │  Enrichment     │  │  Decision       │  │  Action         │
│  Agent          │  │  Agent          │  │  Agent          │  │  Agent          │
│  (Lambda)       │  │  (Lambda)       │  │  (Lambda)       │  │  (Lambda)       │
│                 │  │                 │  │                 │  │                 │
│  - Reads doc    │  │  - Pulls context│  │  - Evaluates    │  │  - Files doc    │
│  - Classifies   │  │  - Calls APIs   │  │    rules        │  │  - Notifies     │
│  - Extracts     │  │  - Augments     │  │  - Routes       │  │  - Triggers     │
│    metadata     │  │    metadata     │  │                 │  │    downstream   │
└────────┬────────┘  └────────┬────────┘  └────────┬────────┘  └────────┬────────┘
         │                    │                    │                    │
         ▼                    ▼                    ▼                    ▼
┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐
│  Amazon Bedrock │  │  Internal APIs  │  │  Human Review   │  │  S3 / SNS /     │
│  (LLM)          │  │  (DynamoDB,     │  │  (Callback -    │  │  EventBridge    │
│                 │  │   other svcs)   │  │   Wait/Resume)  │  │                 │
└─────────────────┘  └─────────────────┘  └─────────────────┘  └─────────────────┘

The Flow

The durable function receives a document event. It invokes the Classifier Agent as a durable step and progress is checkpointed. If Lambda recycles the execution environment after classification completes, the function resumes from that checkpoint, not from scratch.
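The resume-from-checkpoint behavior can be pictured with a toy simulation. This is not the durable functions SDK: the `checkpoints` dict, the `step` helper, and the simulated recycle are all hypothetical stand-ins for what the runtime does on your behalf.

```python
from typing import Any, Callable, Dict

checkpoints: Dict[str, Any] = {}      # toy stand-in for the runtime's checkpoint log
calls = {"classify": 0, "enrich": 0}  # counts how often each agent really runs
recycle_once = {"pending": True}      # simulate losing the environment one time


def step(name: str, fn: Callable[[], Any]) -> Any:
    """Run fn the first time a step name is seen; afterwards return the saved result."""
    if name not in checkpoints:
        checkpoints[name] = fn()
    return checkpoints[name]


def classify(doc: str) -> Dict[str, str]:
    calls["classify"] += 1
    return {"type": "invoice", "doc": doc}


def enrich(classified: Dict[str, str]) -> Dict[str, str]:
    calls["enrich"] += 1
    return {**classified, "owner_team": "finance"}


def orchestrator(doc: str) -> Dict[str, str]:
    classified = step("classify", lambda: classify(doc))
    if recycle_once["pending"]:
        # Pretend Lambda recycled the environment right after classification.
        recycle_once["pending"] = False
        raise RuntimeError("execution environment recycled")
    return step("enrich", lambda: enrich(classified))


try:
    orchestrator("invoice.pdf")
except RuntimeError:
    pass

# Second invocation: the handler runs again, but classify is served from the checkpoint.
result = orchestrator("invoice.pdf")
```

After both invocations, `calls["classify"]` is still 1: the classifier ran once even though the orchestrator ran twice.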

The Classifier’s output feeds into the Enrichment Agent. Another durable step, another checkpoint. The Enrichment Agent might call external APIs that take 30 seconds. The durable function suspends, so you pay nothing for the orchestrator while it waits, and resumes when enrichment completes.

Here’s where it gets interesting. The Decision Agent might determine that a human needs to review this document. The durable function then waits on a callback: it suspends execution entirely, for hours or days if needed, until a callback arrives with the human’s decision. No polling. No idle compute. The function simply resumes where it left off.
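Here is a rough sketch of the suspend-and-resume shape of a human review, again as a toy simulation and not the real SDK. The `Suspended` exception, the `callbacks` dict, and `wait_for_callback` are hypothetical names for illustration. The invocation ends when the callback is missing and completes once the reviewer's answer is stored.

```python
from typing import Any, Callable, Dict, Tuple


class Suspended(Exception):
    """Raised by the toy runtime to end the invocation while a callback is pending."""


checkpoints: Dict[str, Any] = {}  # saved step results
callbacks: Dict[str, Any] = {}    # results delivered from outside (the human reviewer)
calls = {"decide": 0}


def step(name: str, fn: Callable[[], Any]) -> Any:
    if name not in checkpoints:
        checkpoints[name] = fn()
    return checkpoints[name]


def wait_for_callback(callback_id: str) -> Any:
    # No polling loop: if the answer is not there, the invocation simply ends.
    if callback_id not in callbacks:
        raise Suspended(callback_id)
    return callbacks[callback_id]


def decide(doc_id: str) -> str:
    calls["decide"] += 1
    return "escalate"  # the Decision Agent asks for a human


def orchestrator(doc_id: str) -> str:
    route = step("decide", lambda: decide(doc_id))
    if route == "escalate":
        route = wait_for_callback(f"review-{doc_id}")
    return route


def invoke(doc_id: str) -> Tuple[str, str]:
    try:
        return ("done", orchestrator(doc_id))
    except Suspended as pending:
        return ("suspended", str(pending))


first = invoke("doc-1")                     # suspends, nothing is running afterwards
callbacks["review-doc-1"] = "auto_approve"  # the reviewer responds, maybe hours later
second = invoke("doc-1")                    # resumes and picks up the human's decision
```

Between the two invocations nothing is running, which is where the "no idle compute" claim comes from. The Decision Agent itself is not re-run on resume.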

Finally, the Action Agent executes. If it fails, say because a downstream system is temporarily unavailable, the durable function retries that specific step without re-running classification, enrichment, or decision. Four agents, one orchestration function, zero state management infrastructure.
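Step-level retry can be sketched the same way. This toy `step` helper (hypothetical, not the SDK) retries only the failing step, while earlier steps stay checkpointed. A real runtime would also back off between attempts and may retry across invocations instead of in-process.

```python
from typing import Any, Callable, Dict

checkpoints: Dict[str, Any] = {}
calls = {"classify": 0, "act": 0}


def step(name: str, fn: Callable[[], Any], max_attempts: int = 3) -> Any:
    """Return the saved result if present; otherwise try fn up to max_attempts times."""
    if name in checkpoints:
        return checkpoints[name]
    for attempt in range(1, max_attempts + 1):
        try:
            checkpoints[name] = fn()
            return checkpoints[name]
        except ConnectionError:
            if attempt == max_attempts:
                raise
            # A real runtime would wait (with backoff) before the next attempt.


def classify(doc: str) -> str:
    calls["classify"] += 1
    return "invoice"


def act(doc_type: str) -> str:
    calls["act"] += 1
    if calls["act"] < 3:
        # Simulate a downstream system that is unavailable for two attempts.
        raise ConnectionError("downstream unavailable")
    return f"filed:{doc_type}"


def orchestrator(doc: str) -> str:
    doc_type = step("classify", lambda: classify(doc))
    return step("act", lambda: act(doc_type))


result = orchestrator("invoice.pdf")
```

The Action Agent ran three times, the Classifier once.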

Why This Beats the Alternative

Before durable functions, this same workflow required: a Step Functions state machine, DynamoDB for intermediate state, SQS queues between agents, dead-letter queues for failures, and CloudWatch alarms for stuck executions. That’s five pieces of infrastructure to manage for what is conceptually a single workflow.

With durable functions, it’s one Lambda function. Same programming model you already know. Same event handler. Same integrations. The durability is built into the execution model itself.

The Trade-Off

Durable functions use a checkpoint-replay model. Every time execution resumes, the handler runs again from the top, and steps that already completed return their checkpointed results instead of executing again. This means your orchestration logic must be deterministic: no random values and no reading the current time for branching decisions outside of durable steps. This is a constraint worth understanding upfront.
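A small sketch of why the determinism rule matters, using the same kind of toy checkpoint helper (hypothetical, not the SDK). The random value is produced inside a step, so every replay sees the value recorded the first time and takes the same branch.

```python
import random
from typing import Any, Callable, Dict

checkpoints: Dict[str, Any] = {}


def step(name: str, fn: Callable[[], Any]) -> Any:
    """Run fn once per step name; replays get the recorded result."""
    if name not in checkpoints:
        checkpoints[name] = fn()
    return checkpoints[name]


def orchestrator() -> str:
    # Safe: the non-deterministic value is captured inside a step.
    sample = step("pick_sample", random.random)
    # Unsafe alternative (do not do this): sample = random.random()
    # A replay would draw a new number and could take the other branch.
    return "audit" if sample < 0.5 else "skip"


first = orchestrator()
replays = [orchestrator() for _ in range(5)]  # every replay matches the first run
```

The same pattern applies to timestamps and generated IDs: produce them inside a step, then branch on the recorded value.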

For multi-agent workflows specifically, this is rarely a problem. Your orchestration logic is typically: call agent, get result, pass to next agent. That’s inherently deterministic.

When to Reach for This

Multi-agent workflows that involve waiting (on humans, on slow external systems, on other agents) are the sweet spot. If your agents all respond in under a second and never fail, you probably don’t need durability. But that’s not the real world.

In the real world, agents call LLMs that time out, external APIs that rate-limit, and humans who go to lunch. Durable functions handle all of that without you writing a single line of state management logic.
