Building Multi-Agent Workflows on Lambda Durable Functions
Hey, it's Lefteris. I'm the voice behind the weekly newsletter "The Cloud Engineers."
Multi-agent systems are everywhere now. Every team wants autonomous agents collaborating on complex tasks: research, planning, execution, validation. The problem? Orchestrating multiple agents that need to wait on each other, handle failures gracefully, and maintain state across long-running conversations is painful. We've been duct-taping this together with Step Functions, SQS queues, and DynamoDB state tables for too long.
Lambda Durable Functions change the game here. In this article we will walk through why durable functions are a natural fit for multi-agent orchestration, design a document processing pipeline with four agents, and show how the architecture holds together without a single line of state management code.
Why Durable Functions for Multi-Agent
Standard Lambda functions run start-to-finish in a single invocation. If something fails midway, you retry everything. For a multi-agent workflow where Agent A researches, Agent B plans, Agent C executes, and Agent D validates, that's unacceptable. You can't re-run a 4-minute research phase because the validation agent hit a transient error.
Durable functions automatically checkpoint progress, suspend execution for up to one year during long-running tasks, and recover from failures. No custom state management. No DynamoDB tables tracking "which step are we on." The runtime handles it.
This is exactly what multi-agent orchestration needs: reliable progress tracking across agents that may take seconds or minutes to respond, with automatic recovery when things go wrong.
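To make the checkpointing idea concrete, here is a minimal sketch in plain Python. It does not use the AWS SDK: `ToyDurableContext`, `step`, and the in-memory `store` are hypothetical stand-ins for what the durable runtime does on your behalf, and the four lambdas stand in for Agents A to D.

```python
from typing import Any, Callable, Dict, List


class ToyDurableContext:
    """Hypothetical stand-in for a durable runtime. Not the AWS SDK."""

    def __init__(self, checkpoints: Dict[str, Any]) -> None:
        self.checkpoints = checkpoints   # imagine the service persisting this
        self.executed: List[str] = []    # steps that really ran in this invocation

    def step(self, name: str, fn: Callable[[], Any]) -> Any:
        if name in self.checkpoints:     # finished earlier: reuse the stored result
            return self.checkpoints[name]
        result = fn()                    # an exception here leaves no checkpoint
        self.checkpoints[name] = result
        self.executed.append(name)
        return result


def orchestrate(ctx: ToyDurableContext, validator_is_flaky: bool) -> str:
    research = ctx.step("research", lambda: "findings")
    plan = ctx.step("plan", lambda: f"plan based on {research}")
    outcome = ctx.step("execute", lambda: f"executed {plan}")

    def validate() -> str:
        if validator_is_flaky:
            raise TimeoutError("transient validation error")
        return f"validated: {outcome}"

    return ctx.step("validate", validate)


store: Dict[str, Any] = {}

first = ToyDurableContext(store)
try:
    orchestrate(first, validator_is_flaky=True)
except TimeoutError:
    pass                                 # the invocation dies at Agent D

second = ToyDurableContext(store)        # fresh invocation, same checkpoints
final = orchestrate(second, validator_is_flaky=False)

print(first.executed)    # ['research', 'plan', 'execute']
print(second.executed)   # ['validate']
print(final)             # validated: executed plan based on findings
```

The second invocation only runs the validation step. The research, planning, and execution results come straight from the stored checkpoints.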
The Example: Document Processing Pipeline
Let's see an example of a document processing pipeline and how we can build it with Lambda durable functions. We have four agents and one durable function orchestrating them. The agents are:
Classifier Agent: Reads an incoming document, determines its type (invoice, contract, support ticket), and extracts metadata.
Enrichment Agent: Takes the classification, pulls additional context from internal systems, and augments the document with business context.
Decision Agent: Evaluates the enriched document against business rules and decides the routing: auto-approve, escalate, or reject.
Action Agent: Executes the decision: files the document, notifies stakeholders, or triggers downstream workflows.
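One way to picture the hand-offs between these four agents is as a chain of typed payloads. The sketch below is illustrative only: the dataclass names, the keyword-based classifier, and the "known vendor" rule are all assumptions made up for this example, standing in for the LLM calls and internal lookups a real pipeline would make.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Classification:
    doc_type: str                      # "invoice", "contract", or "support_ticket"
    metadata: Dict[str, str] = field(default_factory=dict)


@dataclass
class EnrichedDocument:
    classification: Classification
    business_context: Dict[str, str] = field(default_factory=dict)


@dataclass
class Decision:
    route: str                         # "auto_approve", "escalate", or "reject"
    reason: str


def classify(text: str) -> Classification:
    # Stand-in for an LLM call: a keyword match keeps the sketch runnable.
    lowered = text.lower()
    if "invoice" in lowered:
        return Classification("invoice", {"source": "keyword"})
    if "agreement" in lowered:
        return Classification("contract", {"source": "keyword"})
    return Classification("support_ticket", {"source": "fallback"})


def enrich(c: Classification) -> EnrichedDocument:
    # Stand-in for lookups against internal systems.
    known_vendor = "yes" if c.doc_type == "invoice" else "n/a"
    return EnrichedDocument(c, {"known_vendor": known_vendor})


def decide(doc: EnrichedDocument) -> Decision:
    # Hypothetical rule: invoices from known vendors pass, the rest escalate.
    is_invoice = doc.classification.doc_type == "invoice"
    if is_invoice and doc.business_context["known_vendor"] == "yes":
        return Decision("auto_approve", "invoice from a known vendor")
    return Decision("escalate", "needs human review")


def act(decision: Decision) -> str:
    # Stand-in for filing, notifying, or triggering downstream workflows.
    return f"{decision.route}: {decision.reason}"


result = act(decide(enrich(classify("Invoice #123 from ACME"))))
print(result)  # auto_approve: invoice from a known vendor
```

Each function consumes exactly what the previous one produced, which is what lets an orchestrator treat every agent call as an isolated, checkpointable step.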
Architecture
+------------------------------------------------------------------------------+
|                    Lambda Durable Function (Orchestrator)                    |
|                                                                              |
| +--------------+    +--------------+    +--------------+    +--------------+ |
| | Checkpoint 1 |    | Checkpoint 2 |    | Checkpoint 3 |    | Checkpoint 4 | |
| +--------------+    +--------------+    +--------------+    +--------------+ |
|        |                   |                   |                   |         |
|        v                   v                   v                   v         |
| +--------------+    +--------------+    +--------------+    +--------------+ |
| | Step 1:      |    | Step 2:      |    | Step 3:      |    | Step 4:      | |
| | Classify     |--->| Enrich       |--->| Decide       |--->| Act          | |
| +--------------+    +--------------+    +--------------+    +--------------+ |
|        |                   |                   |                   |         |
+--------+-------------------+-------------------+-------------------+---------+
         |                   |                   |                   |
         v                   v                   v                   v
+-----------------+ +-----------------+ +-----------------+ +-----------------+
| Classifier      | | Enrichment      | | Decision        | | Action          |
| Agent           | | Agent           | | Agent           | | Agent           |
| (Lambda)        | | (Lambda)        | | (Lambda)        | | (Lambda)        |
|                 | |                 | |                 | |                 |
| - Reads doc     | | - Pulls context | | - Evaluates     | | - Files doc     |
| - Classifies    | | - Calls APIs    | |   rules         | | - Notifies      |
| - Extracts      | | - Augments      | | - Routes        | | - Triggers      |
|   metadata      | |   metadata      | |                 | |   downstream    |
+-----------------+ +-----------------+ +-----------------+ +-----------------+
         |                   |                   |                   |
         v                   v                   v                   v
+-----------------+ +-----------------+ +-----------------+ +-----------------+
| Amazon Bedrock  | | Internal APIs   | | Human Review    | | S3 / SNS /      |
| (LLM)           | | (DynamoDB,      | | (Callback -     | | EventBridge     |
|                 | |  other svcs)    | |  Wait/Resume)   | |                 |
+-----------------+ +-----------------+ +-----------------+ +-----------------+
The Flow
The durable function receives a document event. It invokes the Classifier Agent as a durable step and progress is checkpointed. If Lambda recycles the execution environment after classification completes, the function resumes from that checkpoint, not from scratch.
The Classifier's output feeds into the Enrichment Agent. Another durable step, another checkpoint. The Enrichment Agent might call external APIs that take 30 seconds. The durable function suspends, accrues no compute cost while waiting, and resumes when enrichment completes.
Here's where it gets interesting. The Decision Agent might determine that a human needs to review this document. The durable function then uses a wait: it suspends execution entirely, for hours or days if needed, until a callback arrives with the human's decision. No polling. No idle compute. The function simply resumes where it left off.
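The suspend-until-callback behavior can be sketched in a few lines of plain Python. Again, this is a toy model and not the AWS SDK: `ToyContext`, `wait_for_callback`, the `Suspended` exception, and the `"human-review"` callback id are hypothetical names chosen for illustration, and the two dictionaries play the role of state the service would persist.

```python
from typing import Any, Callable, Dict, Optional


class Suspended(Exception):
    """Toy signal: the workflow is parked until a callback result exists."""


class ToyContext:
    def __init__(self, checkpoints: Dict[str, Any], callbacks: Dict[str, Any]) -> None:
        self.checkpoints = checkpoints
        self.callbacks = callbacks

    def step(self, name: str, fn: Callable[[], Any]) -> Any:
        if name not in self.checkpoints:
            self.checkpoints[name] = fn()
        return self.checkpoints[name]

    def wait_for_callback(self, callback_id: str) -> Any:
        if callback_id not in self.callbacks:
            raise Suspended(callback_id)   # stop here; nothing keeps running
        return self.callbacks[callback_id]


decision_calls = 0


def decision_agent() -> str:
    global decision_calls
    decision_calls += 1
    return "escalate"                      # pretend the agent asked for a human


def orchestrate(ctx: ToyContext) -> str:
    route = ctx.step("decide", decision_agent)
    if route == "escalate":
        route = ctx.wait_for_callback("human-review")
    return ctx.step("act", lambda: f"action taken: {route}")


def invoke(checkpoints: Dict[str, Any], callbacks: Dict[str, Any]) -> Optional[str]:
    try:
        return orchestrate(ToyContext(checkpoints, callbacks))
    except Suspended:
        return None                        # suspended, waiting on the reviewer


checkpoints: Dict[str, Any] = {}
callbacks: Dict[str, Any] = {}

first = invoke(checkpoints, callbacks)     # parks at the human review
callbacks["human-review"] = "approve"      # hours or days later
second = invoke(checkpoints, callbacks)    # resumes and finishes

print(first, second, decision_calls)       # None action taken: approve 1
```

The Decision Agent runs once, even though the orchestration code executes twice. Between the two invocations nothing is running at all.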
Finally, the Action Agent executes. If it fails (maybe a downstream system is temporarily unavailable), the durable function retries that specific step without re-running classification, enrichment, or decision. Four agents, one orchestration function, zero state management infrastructure.
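Step-level retry can be illustrated with a small helper. This is a sketch under stated assumptions: `retrying_step`, its `max_attempts` and `base_delay` parameters, and the exponential backoff policy are invented for the example and say nothing about the retry options the real runtime exposes.

```python
import time
from typing import Any, Callable, Dict


def retrying_step(
    checkpoints: Dict[str, Any],
    name: str,
    fn: Callable[[], Any],
    max_attempts: int = 3,
    base_delay: float = 0.01,
) -> Any:
    """Run one step with retries; completed steps are never run again."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    if name in checkpoints:
        return checkpoints[name]
    for attempt in range(1, max_attempts + 1):
        try:
            checkpoints[name] = fn()
            break
        except ConnectionError:
            if attempt == max_attempts:
                raise                                  # give up, surface the error
            time.sleep(base_delay * 2 ** (attempt - 1))  # exponential backoff
    return checkpoints[name]


calls = {"classify": 0, "act": 0}


def classifier_agent() -> str:
    calls["classify"] += 1
    return "invoice"


def action_agent() -> str:
    calls["act"] += 1
    if calls["act"] < 3:                               # fail the first two tries
        raise ConnectionError("downstream system unavailable")
    return "filed"


checkpoints: Dict[str, Any] = {}
doc_type = retrying_step(checkpoints, "classify", classifier_agent)
status = retrying_step(checkpoints, "act", action_agent)

print(calls)    # {'classify': 1, 'act': 3}
print(status)   # filed
```

Only the failing step is attempted three times. The classifier runs once, because its result was already checkpointed before the Action Agent started failing.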
Why This Beats the Alternative
Before durable functions, this same workflow required: a Step Functions state machine, DynamoDB for intermediate state, SQS queues between agents, dead-letter queues for failures, and CloudWatch alarms for stuck executions. That's five pieces of infrastructure to manage for what is conceptually a single workflow.
With durable functions, it's one Lambda function. Same programming model you already know. Same event handler. Same integrations. The durability is built into the execution model itself.
The Trade-Off
Durable functions use a checkpoint-replay model. Every time execution resumes, the handler runs again from the beginning, and steps that already completed return their checkpointed results instead of executing again. This means your orchestration logic must be deterministic: no random values and no reading the current time for branching decisions outside of durable steps. This is a constraint worth understanding upfront.
For multi-agent workflows specifically, this is rarely a problem. Your orchestration logic is typically: call agent, get result, pass to next agent. That's inherently deterministic.
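When the orchestration does need a non-deterministic value, the usual pattern in checkpoint-replay systems is to produce it inside a step so it is recorded once and replayed afterwards. The sketch below shows that idea with a hypothetical `step` helper and an in-memory dictionary; it is a toy model, not the AWS SDK.

```python
import random
from typing import Any, Callable, Dict


def step(checkpoints: Dict[str, Any], name: str, fn: Callable[[], Any]) -> Any:
    """Toy durable step: run once, then always return the recorded result."""
    if name not in checkpoints:
        checkpoints[name] = fn()
    return checkpoints[name]


def orchestrate(checkpoints: Dict[str, Any]) -> str:
    # The random draw happens inside a step, so every replay sees the same value
    # and therefore takes the same branch.
    sample = step(checkpoints, "sample", random.random)
    branch = "audit" if sample < 0.5 else "fast_path"
    return step(checkpoints, f"route:{branch}", lambda: branch)


checkpoints: Dict[str, Any] = {}
first = orchestrate(checkpoints)
replays = [orchestrate(checkpoints) for _ in range(5)]

print(all(r == first for r in replays))  # True
```

Had `random.random()` been called directly in `orchestrate`, a replay could take a different branch than the original run and look for checkpoints that were never written.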
When to Reach for This
Multi-agent workflows that involve waiting (on humans, on slow external systems, on other agents) are the sweet spot. If your agents all respond in under a second and never fail, you probably don't need durability. But that's not the real world.
In the real world, agents depend on LLMs that time out, external APIs that rate-limit, and humans who go to lunch. Durable functions handle all of that without you writing a single line of state management logic.

