The Cloud Engineers

How Writing in Public Accelerated My Cloud Career (And How to Start)

Lefteris Karageorgiou — Wed, 29 Jul 2026 09:30:20 GMT

Hey, it’s Lefteris 👋 I’m the voice behind the weekly newsletter “The Cloud Engineers.”

Here’s a career truth that took me too long to learn: being good at your job is necessary, but it isn’t enough to advance quickly. The engineers who move fastest aren’t always the most technically brilliant; they’re the ones whose work is visible. And the single highest-leverage way to make your work visible is to write about it in public.

I know, because it changed my trajectory. Writing this newsletter, publishing what I learn, and sharing it openly opened doors I didn’t know existed, such as speaking opportunities, connections with people I admired, and a reputation that arrives in the room before I do. This article is the practical case for doing the same, and a no-excuses playbook to start.

Why Writing in Public Works (The Mechanics)

This isn’t motivational fluff. There are concrete, compounding mechanisms at play:

It compounds while you sleep. A good article you wrote two years ago is still working for you today — being read, shared, and building your reputation. Almost nothing else in a career compounds like published work.
It forces clarity. You don’t truly understand something until you can explain it simply. Writing exposes the gaps in your knowledge and closes them. You learn the topic twice — once to do it, once to explain it.
It builds a surface area for luck. Opportunities can’t find you if you’re invisible. Every article is another door people can knock on. Jobs, collaborations, and invitations find people who are findable.
It’s proof of skill that outlasts any résumé. “I understand event-driven architectures” is a claim. A clear article explaining a real trade-off is evidence. Evidence wins.

The Objections (And Why They’re Wrong)

Almost everyone talks themselves out of starting. Let’s kill the four excuses head-on:

“I’m not an expert.” You don’t need to be. You need to be one step ahead of your reader. The person who learned something last month explains it better to a beginner than the expert who forgot what confusion feels like. Write from where you are.

“It’s all been written already.” It hasn’t been written by you, with your examples, in your voice. Your specific angle, like the mistake you made or the trade-off you hit in production, is what makes it worth reading.

“I don’t have time.” You don’t need much. A useful post can be 400 words about one thing you figured out this week. Consistency at small scale beats a heroic effort you never repeat.

“What if I’m wrong?” You will be, occasionally. That’s fine, because the internet corrects you, you learn, and you get better. The cost of a public mistake is far lower than the cost of staying invisible for years.

How to Actually Start

Here’s the playbook:

Pick a lane, loosely. Write about the thing you’re already doing at work. For me it’s cloud and serverless. You don’t need a grand content strategy, just a topic you touch every day so you never run out of material.
Start with what you just learned. The best first article is “here’s a problem I hit this week and how I solved it.” It’s authentic, it’s useful, and you already have the material in your head.
Choose the lowest-friction platform. Don’t build a custom blog first; that’s procrastination in disguise. Start where writing is free and distribution is built in: a platform like LinkedIn, Medium, or Dev.to. Remove every barrier between you and the publish button.
Commit to a cadence you can actually keep. I write every Wednesday. The specific rhythm matters less than the promise you keep to yourself. Weekly is great; even monthly beats sporadic. The cadence is what turns one article into a body of work.
Write like you talk. Drop the corporate voice. Explain it the way you’d explain it to a colleague at lunch. Clarity and personality beat polish.
Ship before it’s perfect. Your first posts will make you cringe later, and that’s a sign you’ve grown, not a reason to have waited. Hit publish. Iterate in public.

What to Expect (A Realistic Timeline)

Let me set honest expectations, because unrealistic ones are why people quit:

Weeks 1–4: Almost no one reads it. This is normal. You’re building the habit, not the audience yet.
Months 2–3: You get better and faster at writing. The occasional comment or share appears. The compounding hasn’t kicked in yet, so keep going.
Months 4–12: A back catalog accumulates. Search and shares start bringing readers you never reached directly. People begin to recognize your name.
Beyond a year: This is where the doors open — speaking invitations, inbound opportunities, conversations with people you respect. Not because you got lucky, but because you became findable and stayed consistent.

The engineers who win at this aren’t more talented. They just didn’t quit in month two.

Conclusion

Writing in public is the rare career move with almost no downside and enormous, compounding upside. It sharpens your thinking, builds your reputation while you sleep, and turns your everyday work into a body of evidence that advances your career on your behalf.

You don’t need to be an expert. You don’t need a perfect platform. You don’t need much time. You need one topic, a cadence you can keep, and the willingness to hit publish before you feel ready.

Start this week. Write 400 words about one thing you figured out. Your future career will thank you. I promise mine did.

DynamoDB Single-Table Design: The Pattern Everyone Fears (Explained Simply)

Lefteris Karageorgiou — Wed, 22 Jul 2026 09:31:15 GMT

The first time I saw a DynamoDB single-table design, I closed the tab.

One table holding users, orders, products, and reviews. All mixed together, with cryptic keys like USER#123 and ORDER#456 sitting in the same partition? It looked insane.

It looked insane because I was thinking in relational terms. Once I flipped the mental model (design for access patterns, not entities), it clicked. Here’s the version of this pattern I wish someone had shown me on day one.

Why We Fear It: The Relational Hangover

SQL trains us to normalize. One table per entity, JOINs at query time, let the database figure out the rest. It’s a beautiful model, and it’s the exact instinct that makes single-table design feel wrong.

DynamoDB has no JOINs. Every relationship you’d resolve with a JOIN in SQL becomes either multiple round-trips or a modeling decision you make up front. The fear isn’t irrational. It’s a paradigm mismatch. Name it, and it stops being scary.

The One Rule That Changes Everything: Access Patterns First

Here’s the shift: you list your queries before you design your table.

In SQL you model the data and figure out queries later. In DynamoDB you do the opposite. Take a simple e-commerce app and write down what it needs:

Get a user by ID
Get all orders for a user
Get a single order with its line items
Get all products in a category

If you can’t list your access patterns, you’re not ready to model. That discipline is exactly what scares people, but it’s also the superpower. You design for what your app actually does, nothing more.

PK, SK, and the Item Collection

Two concepts do most of the work:

Partition Key (PK): determines where an item lives.
Sort Key (SK): orders items within a partition.

The key insight: items sharing the same PK form an “item collection” and are stored together. One query can retrieve the whole collection. That’s how you replace a JOIN: you co-locate related data on purpose.

A Worked Example

The single-table layout for the e-commerce app puts a user’s profile and orders under PK = USER#123, and an order and its line items under PK = ORDER#555.

Now watch the access patterns fall out:

Get user 123:
PK = USER#123 AND SK = PROFILE
Get all orders for user 123:
PK = USER#123 AND begins_with(SK, "ORDER#")
One query, no JOIN, no second round-trip.
Get order 555 with its items:
PK = ORDER#555
The order and every line item come back together.

The cryptic keys aren’t chaos. They’re a query language you designed on purpose.
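
To make the access patterns above concrete, here is a minimal in-memory sketch in Python. It is not DynamoDB itself: the `query` helper only imitates a Query call (exact match on PK, optional condition on SK, results ordered by SK). The items, the non-key attributes, and sort-key values such as METADATA and ITEM#1 are illustrative assumptions, not the layout from the original article.

```python
from typing import Optional

# Hypothetical items for the e-commerce example. Only PK and SK come from the
# article; every other attribute and the METADATA / ITEM#n sort keys are assumptions.
TABLE = [
    {"PK": "USER#123", "SK": "PROFILE", "name": "Ada"},
    {"PK": "USER#123", "SK": "ORDER#555", "total": 42},
    {"PK": "USER#123", "SK": "ORDER#556", "total": 17},
    {"PK": "ORDER#555", "SK": "METADATA", "status": "SHIPPED"},
    {"PK": "ORDER#555", "SK": "ITEM#1", "sku": "book"},
    {"PK": "ORDER#555", "SK": "ITEM#2", "sku": "pen"},
    {"PK": "USER#999", "SK": "PROFILE", "name": "Bob"},
]


def query(pk: str, sk: Optional[str] = None, sk_prefix: Optional[str] = None) -> list:
    """Imitate a Query: exact PK match, optional SK equality or begins_with."""
    items = [item for item in TABLE if item["PK"] == pk]
    if sk is not None:
        items = [item for item in items if item["SK"] == sk]
    if sk_prefix is not None:
        items = [item for item in items if item["SK"].startswith(sk_prefix)]
    # Items in one collection come back ordered by sort key.
    return sorted(items, key=lambda item: item["SK"])


user = query("USER#123", sk="PROFILE")            # get user 123
orders = query("USER#123", sk_prefix="ORDER#")    # all orders for user 123
order_with_items = query("ORDER#555")             # order 555 plus its line items
```

Each access pattern maps to exactly one call against one partition key, which is the property the keys were designed for.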

GSI Overloading: The Advanced Bit, Made Simple

Some queries go against the grain of your PK/SK, like “get all orders with status SHIPPED.” Your main keys can’t answer that.

The solution is a Global Secondary Index built on generic attributes (GSI1PK, GSI1SK) that you reuse across entities. Point GSI1PK at the status for order items, and the same index can serve several “sideways” queries. Don’t worry about mastering this on day one; it’s the last 20% you grow into once the basics feel natural.
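
A rough sketch of the overloading idea, again in plain Python rather than against DynamoDB. The attribute names GSI1PK and GSI1SK come from the text; the key formats (STATUS#..., CATEGORY#...), the dates, and the items are assumptions made up for illustration. Items without the index attributes are simply left out of the index, which is how a sparse secondary index behaves.

```python
from collections import defaultdict

# Hypothetical items: orders put their status in GSI1PK, products put their
# category there. The same two generic attributes serve both entity types.
ITEMS = [
    {"PK": "USER#123", "SK": "ORDER#555", "GSI1PK": "STATUS#SHIPPED", "GSI1SK": "2026-07-01"},
    {"PK": "USER#123", "SK": "ORDER#556", "GSI1PK": "STATUS#PENDING", "GSI1SK": "2026-07-03"},
    {"PK": "USER#456", "SK": "ORDER#557", "GSI1PK": "STATUS#SHIPPED", "GSI1SK": "2026-07-02"},
    {"PK": "PRODUCT#9", "SK": "METADATA", "GSI1PK": "CATEGORY#BOOKS", "GSI1SK": "PRODUCT#9"},
    {"PK": "USER#123", "SK": "PROFILE"},  # no GSI1 attributes, so not in the index
]


def build_gsi1(items):
    """Group items by GSI1PK and order each group by GSI1SK."""
    index = defaultdict(list)
    for item in items:
        if "GSI1PK" in item and "GSI1SK" in item:
            index[item["GSI1PK"]].append(item)
    for collection in index.values():
        collection.sort(key=lambda item: item["GSI1SK"])
    return index


GSI1 = build_gsi1(ITEMS)
shipped_orders = GSI1["STATUS#SHIPPED"]   # "get all orders with status SHIPPED"
books = GSI1["CATEGORY#BOOKS"]            # "get all products in a category"
```

One index answers two unrelated questions because the generic attributes mean different things for different entity types.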

When NOT to Use Single-Table Design

I draw boundaries, because the pattern isn’t free:

Evolving access patterns. For early-stage products whose queries change weekly, multiple tables are easier to refactor.
Ad-hoc analytics. DynamoDB is for known patterns. Offload analytics to S3 + Athena.
Small apps. Sometimes the operational simplicity of separate tables beats query efficiency.
Team unfamiliarity. The real cost is the learning curve, not the technology.

The Cost & Performance Payoff

This is where it earns its place in a scalable, cost-efficient stack:

Fewer tables to provision, monitor, and pay for.
One query instead of N round-trips means lower latency and fewer read capacity units consumed.
Item collections keep related data on one partition, giving you predictable single-digit-millisecond reads at scale.

Where to Start

Don’t boil the ocean. Pick one feature in your app. List its queries. Model just that as a single table. Once you’ve watched a JOIN disappear into a single Query call, the fear is gone for good, and you’ll never look at those USER#123 keys the same way again.

I Asked 20 Hiring Managers What They Look For in Cloud Interviews - Here's What They Told Me

Lefteris Karageorgiou — Wed, 15 Jul 2026 09:30:29 GMT

If you’ve ever walked out of a cloud engineering interview thinking “I answered everything correctly… so why didn’t I get the offer?” then this article is for you.

Over the past few months, I spoke with 20 hiring managers across startups, scale-ups, and large enterprises who regularly interview for Cloud Engineer, DevOps, SRE, and Solutions Architect roles. I asked them a simple question: “When two candidates have the same technical skills, what makes you say yes to one and no to the other?”

The answers were remarkably consistent. And almost none of them were about knowing more AWS services.

Here’s what they’re actually looking for.

The #1 theme: candidates who tell stories, not spec sheets

Nineteen of the twenty managers used some version of the same phrase: “I want to understand how they think.”

The weakest candidates answer questions like they’re reciting documentation. Ask them how they’d design a highly available system and you get a list: “Multi-AZ, Auto Scaling, load balancer, RDS with a read replica.” All correct. All forgettable.

The strongest candidates answer with a story that has a shape:

Context: what was the situation and the constraint?
Decision: what did you choose, and critically, what did you choose not to do?
Impact: what changed as a result, measured in numbers?

Managers don’t just want to know that you can use CloudFront. They want to know that you understood why it was the right call over the three other options you considered, and what it did for the business.

“Anyone can list services. I’m hiring the person who can tell me why they picked one over another and what it cost them when they got it wrong.” — Engineering Manager, fintech scale-up

Why storytelling wins (even for deeply technical questions)

There’s a misconception that storytelling is only for behavioural rounds. It isn’t. The best technical answers are also stories, because storytelling is really just structured reasoning made visible.

When you narrate your thinking as a story, you demonstrate four things at once that a bullet-point answer never can:

Judgment: you weighed trade-offs, not just memorized the “right” answer.
Ownership: you were close enough to the outcome to know what actually happened.
Communication: you can explain a complex system to a stakeholder who isn’t in the weeds.
Self-awareness: you know where it went wrong and what you learned.

Those four qualities are exactly what separates a mid-level engineer from a senior one. And they’re impossible to fake with a list of services.

Impact is the word that closes the interview

The second theme was even blunter. When I asked what makes an answer land, managers kept coming back to one word: impact.

Engineers love to talk about what they built. Hiring managers want to know what changed because you built it.

Compare these two answers to “Tell me about a system you optimized.”

❌ Without impact:

“I migrated our batch jobs from EC2 to Lambda and set up EventBridge to trigger them on a schedule.”

✅ With impact:

“Our nightly batch jobs were running on a fleet of always-on EC2 instances that cost us about $4,000/month but were only active two hours a day. I moved them to Lambda triggered by EventBridge on a schedule. That cut the compute bill for that workload by roughly 80%, around $38k/year, and eliminated the on-call pages we used to get when an instance failed overnight.”

Same technical work. Completely different signal. The second answer tells the manager: this person understands that engineering exists to serve the business.

A simple rule: every technical story you tell should end with a number, a saved hour, or a problem that stopped happening.

A worked example: how to turn a flat answer into a winning one

Let’s take a classic cloud interview question and walk through the transformation.

The question: “Tell me about a time you improved the reliability of a system.”

The flat answer (what most candidates say)

“We were having downtime issues, so I added a load balancer and put the service across multiple Availability Zones. After that it was more reliable.”

It’s not wrong. But it has no context, no trade-off, and no measurable outcome. The manager learns almost nothing about how you think.

The storytelling + impact answer (using Context → Decision → Impact)

Context: “We ran a customer-facing checkout API on a single EC2 instance in one Availability Zone. It was fine until we had two outages in one quarter, once from an AZ disruption and once from a bad deploy, and each one took checkout down for about 40 minutes. For an e-commerce product, that’s direct lost revenue, and it was eroding trust with the business team.”
Decision: “I proposed moving to an Auto Scaling group across three AZs behind an Application Load Balancer, with health checks that pulled unhealthy instances out automatically. I deliberately didn’t go straight to containers or a full EKS setup, even though it was tempting: the team had no Kubernetes experience, and I didn’t want to trade one reliability risk for a bigger operational one. The ALB-plus-ASG approach solved 90% of the problem with 10% of the complexity.”
Impact: “After the change, we went from two outages in a single quarter to zero customer-facing checkout outages over the next nine months. Deploys became safe because we could roll instances one at a time. And because Auto Scaling replaced failed instances automatically, our overnight on-call pages for that service dropped to essentially zero, which the on-call team definitely noticed.”

Notice what that answer does:

It quantifies the pain before the fix (two outages, 40 minutes each, lost revenue).
It shows a deliberate trade-off (ALB + ASG instead of Kubernetes), proof of judgment.
It closes with measurable impact across three dimensions: reliability, deploy safety, and team quality of life.

That’s the difference between “I know the services” and “I know how to use the services to move the needle.”

How to prepare before your next interview

You don’t need to memorize more services. You need to package what you already know into stories. Here’s a practical drill:

List your last 5–6 real projects. Anything you touched counts: a migration, a cost cut, an incident, a pipeline.
For each, write three lines: the Context (the constraint), the Decision (and the road not taken), and the Impact (with a number).
Attach a metric to every story. Dollars saved, latency reduced, deploy frequency increased, incidents eliminated. If you don’t know the exact number, estimate it honestly (“roughly”, “about”); managers value the instinct to measure.
Practice the trade-off out loud. For each story, be ready to answer: “What else did you consider, and why didn’t you pick it?” This is where senior candidates separate themselves.

Do this for six stories and you’ll have a toolkit that covers almost any technical or behavioural question they throw at you.

The takeaway

Twenty hiring managers, one message: technical skill gets you into the room, but storytelling and impact get you the offer.

The candidates who win aren’t the ones who know the most services. They’re the ones who can take a real problem, walk you through how they thought about it, name the trade-offs they made, and show with numbers what changed because of it.

Next time you prep, don’t ask “Do I know enough AWS?” Ask “Can I tell the story of what I built, and prove it mattered?”

That’s the answer hiring managers are actually waiting for.

Lambda Managed Instances: When Serverless Meets Steady-State Traffic

Lefteris Karageorgiou — Wed, 08 Jul 2026 09:30:54 GMT

For years, there’s been one pushback against Lambda that never fully went away: “It gets expensive at scale.”

And honestly? For steady, high-volume workloads, that criticism held up. Standard Lambda gives you one request per execution environment. So a function that spends most of its time waiting on a database or downstream API is burning paid execution time doing nothing, and at thousands of requests per second, per-invocation billing adds up fast.

That’s exactly the gap AWS Lambda Managed Instances (announced at re:Invent 2025) is built to close. Let’s break down what it actually changes and, more importantly, when you should reach for it.

What it actually is

You keep the Lambda programming model. Same handler, same event source mappings, same IAM roles, same CloudWatch. But instead of running on Lambda’s shared fleet, your function runs on EC2 instances in your own account, and AWS still manages them for you: OS patching, load balancing, auto-scaling, instance lifecycle. You never touch an ASG.

Three things change the game:

Multi-concurrency. One execution environment can now handle many concurrent requests instead of one. For IO-heavy workloads, AWS lets you run up to 64 concurrent requests per vCPU. That’s a completely different mental model: concurrency now means more work per environment, not just more environments.
EC2 pricing. You pay standard EC2 instance charges plus a 15% management fee, not per-request duration. Your Compute Savings Plans and Reserved Instances apply to the EC2 portion (up to 72% off on-demand). For a steady baseline, this can be dramatically cheaper.
No cold starts. Requests route to pre-provisioned environments. You also get access to specialized hardware like Graviton4 and high-bandwidth networking.
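
A back-of-the-envelope sketch of the pricing comparison described above. Every number here is an assumption for illustration: the per-GB-second and per-request prices are the commonly published x86 on-demand figures for standard Lambda, while the instance count and hourly instance price are invented and would have to come from your own load tests and the EC2 price list. The managed-instances side models only what the text states (EC2 charges plus a 15% management fee); any other fees or discounts are not modeled.

```python
HOURS_PER_MONTH = 730


def standard_lambda_monthly(requests, avg_duration_s, memory_gb,
                            price_per_gb_second=0.0000166667,
                            price_per_million_requests=0.20):
    """Duration-based compute charge plus the per-request charge."""
    compute = requests * avg_duration_s * memory_gb * price_per_gb_second
    request_fee = requests / 1_000_000 * price_per_million_requests
    return compute + request_fee


def managed_instances_monthly(instance_count, instance_hourly_price,
                              management_fee=0.15):
    """EC2 instance charges plus the management fee, running all month."""
    ec2 = instance_count * instance_hourly_price * HOURS_PER_MONTH
    return ec2 * (1 + management_fee)


# Hypothetical steady workload: 2,000 requests/second, 200 ms each, 1 GB memory.
requests_per_month = 2_000 * 3600 * HOURS_PER_MONTH
standard = standard_lambda_monthly(requests_per_month, avg_duration_s=0.2, memory_gb=1.0)

# Hypothetical fleet: 4 instances at $0.30/hour (made-up sizing and price).
managed = managed_instances_monthly(instance_count=4, instance_hourly_price=0.30)

print(f"standard Lambda:   ${standard:,.0f}/month")
print(f"managed instances: ${managed:,.0f}/month")
```

The comparison flips quickly for low or bursty traffic, because the managed side is paid whether or not requests arrive.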

The catch: this is not a free upgrade

Here’s where teams will get burned. It still says “Lambda,” but it behaves like a small service process.

Thread safety is now your problem. Global state, connection pools, mutable singletons, writes to /tmp — anything that quietly relied on “one request at a time” needs an audit before you flip the switch. Concurrency-unsafe code doesn’t just underperform; it breaks.
No scale-to-zero. Managed Instances scale to the minimum environments you configure, even at 3am with zero traffic. You’re deliberately paying for a capacity floor.
Scaling is asynchronous. It scales on CPU and concurrency saturation, sized to absorb roughly a 50% spike before adding capacity (new instances in tens of seconds). If your traffic goes from near-zero to a massive spike in seconds, standard Lambda still has the better shape.
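
The thread-safety point is easiest to see with shared state. The sketch below is a generic illustration using Python threads, not a statement about how any particular Lambda runtime delivers concurrent requests (that varies by runtime and should be checked in the AWS documentation). The `RequestStats` class and the event shape are hypothetical.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class RequestStats:
    """State shared by every request in one environment, guarded by a lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self.count = 0
        self.by_tenant = {}

    def record(self, tenant):
        # Without the lock, the read-modify-write steps below could interleave
        # across threads and silently lose updates.
        with self._lock:
            self.count += 1
            self.by_tenant[tenant] = self.by_tenant.get(tenant, 0) + 1


STATS = RequestStats()  # module-level state: created once, shared by all requests


def handler(event, context=None):
    # Per-request data stays in local variables, never in globals.
    tenant = event["tenant"]
    STATS.record(tenant)
    return {"statusCode": 200, "body": tenant}


# Simulate 1,000 requests arriving concurrently in one environment.
events = [{"tenant": f"t{i % 4}"} for i in range(1000)]
with ThreadPoolExecutor(max_workers=16) as pool:
    results = list(pool.map(handler, events))
```

The audit the text calls for amounts to finding every module-level object like `STATS` and deciding whether it is read-only, locked, or moved into per-request scope.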

The decision framework

Reach for Lambda Managed Instances when most of these are true:

Traffic is steady or predictable: the service does real work most of the day
A minimum warm footprint is acceptable (you don’t need scale-to-zero)
Your code is thread-safe under concurrent load
The workload is IO-heavy, so multiple requests per environment boost throughput
You want EC2 purchase options or specific hardware (Graviton4, high networking)

The textbook fit: an API that loads a model, vector index, or ruleset into memory at init, then serves lots of read-heavy requests. On standard Lambda you’d push that state to an external store and pay the latency tax on every call.

Stick with standard Lambda for bursty, spiky, event-driven functions where scale-to-zero matters.

Stick with Fargate when you need full container semantics — sidecars, background daemons, EFS mounts, long-running processes, or task-definition control.

Getting started

The new primitive is the capacity provider (VPC, scaling mode, instance requirements):

aws lambda create-capacity-provider \
  --capacity-provider-name app-api-managed \
  --vpc-config SubnetIds=subnet-123,subnet-456,SecurityGroupIds=sg-789 \
  --instance-requirements Architectures=arm64 \
  --capacity-provider-scaling-config ScalingMode=Auto

Attach a function to it, publish an active version, and you’re serving traffic on EC2-backed capacity with the same code.

The bottom line

AWS didn’t make Lambda more magical here. It made it more honest about the workloads people were already forcing into it. Standard Lambda is still king for spiky event-driven functions. Managed Instances is the new answer for predictable, high-throughput, Lambda-shaped services, as long as your code is ready for concurrency.

Steady traffic? Do the math. It might be time to give your functions a permanent home.

My Insights After 10 Years of Serverless

Lefteris Karageorgiou — Wed, 01 Jul 2026 09:30:49 GMT

Ten years ago, I discovered Serverless and AWS Lambda, and it changed the way I build software forever.

What started as curiosity quickly turned into conviction. Over the past decade, I’ve designed and shipped production systems entirely on Serverless. I even built a startup from scratch on a fully Serverless stack and took it to production in just 10 months.

In my four years at AWS, I’ve worked alongside hundreds of customers on their Serverless journeys, from first Lambda functions to complex event-driven architectures. Along the way, I distilled everything I’ve learned into my book, Mastering Event-Driven Microservices in AWS.

Last month, Lee Gilmore, an AWS Hero, reached out and featured me in his Serverless Advocate newsletter, where he posed three thought-provoking questions about the state of Serverless. I’m sharing my answers here, because they touch on things I think every Serverless engineer should be thinking about.

Q1: What is one common mistake you see teams making when building their solutions, and how can they avoid it?

A common mistake is that companies think serverless is all about AWS Lambda, and as a result, they become overly concerned about cold starts.

In reality, serverless is much broader than Lambda. Many use cases can be solved without Lambda at all, for example by using direct integrations with Amazon API Gateway or orchestrating workflows with AWS Step Functions.

The key is to step back and evaluate whether Lambda is actually needed. If it’s only being used to move data from one service to another, it’s often unnecessary.

That said, when you do need Lambda, you should know how to optimise it properly.

Three of the most effective techniques are:

Optimise memory: Use the AWS Lambda Power Tuning tool to find the optimal memory configuration. Since memory allocation also scales CPU, the right balance can significantly reduce both execution time and cost.
Minimise deployment size: Smaller packages lead to faster cold starts, so remove unused dependencies and keep artefacts lean.
Use SnapStart: Especially for Java workloads, SnapStart can dramatically reduce cold start latency by initialising functions ahead of time.
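
The memory trade-off can be sketched as simple arithmetic: cost per invocation is allocated memory times billed duration times the per-GB-second price. The durations below are invented measurements of the kind a tuning run produces, and the price is the commonly published x86 on-demand rate, so treat both as assumptions. This is not the Power Tuning tool itself, only the comparison it automates.

```python
PRICE_PER_GB_SECOND = 0.0000166667  # assumed x86 on-demand price

# Hypothetical results: memory setting (MB) -> average duration (ms).
MEASUREMENTS = {128: 2400.0, 256: 1150.0, 512: 560.0, 1024: 290.0, 2048: 280.0}


def cost_per_invocation(memory_mb, duration_ms):
    """Compute charge for one invocation, ignoring the per-request fee."""
    return (memory_mb / 1024) * (duration_ms / 1000) * PRICE_PER_GB_SECOND


def cheapest(measurements):
    """Memory setting with the lowest compute cost per invocation."""
    return min(measurements, key=lambda mb: cost_per_invocation(mb, measurements[mb]))


def fastest(measurements):
    """Memory setting with the lowest average duration."""
    return min(measurements, key=measurements.get)


for mb, ms in MEASUREMENTS.items():
    print(f"{mb:>5} MB  {ms:>7.0f} ms  ${cost_per_invocation(mb, ms) * 1_000_000:.2f} per 1M invocations")
print("cheapest:", cheapest(MEASUREMENTS), "MB, fastest:", fastest(MEASUREMENTS), "MB")
```

With these made-up numbers the cheapest and the fastest settings differ, which is why the balance has to be measured rather than guessed.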

By using Lambda intentionally and optimising it when needed, you can avoid unnecessary complexity and get the best out of serverless.

Q2: Which tool, package, or AWS service are you most excited about right now, and why?

Right now, I’m most excited about AWS Lambda durable functions. This is something the ecosystem has needed for a long time, bringing orchestration closer to the application layer. Previously, you could achieve similar outcomes with AWS Step Functions, but local development, testing, and debugging were often cumbersome.

Although this may seem similar to Step Functions, the trade-offs are important:

Use Lambda durable functions when:

Your team prefers standard programming languages and familiar development tools
Your application logic primarily lives inside Lambda functions
You’re building Lambda-centric systems with tight coupling between workflow and business logic

Use Step Functions when:

You need a visual workflow representation for cross-team visibility
You’re orchestrating multiple AWS services and want native integrations without writing custom SDK code
You want zero-maintenance infrastructure (no patching or runtime concerns)

Durable functions make it much easier to build complex, long-running workflows directly in code, opening the door to more advanced use cases like multi-step processes and agent-style orchestration, without sacrificing developer experience.

Q3: What is your favourite trick or tip that the readers may find interesting?

A common anti-pattern I see is treating AWS Lambda as “one endpoint = one function.” This often leads to architectures with hundreds of tiny Lambda functions, which quickly become difficult to manage, deploy, and reason about.

Instead, treat your Lambdas as microservices within a bounded context. It’s perfectly fine, and often preferable, to group related functionality together. For example, within a “users” domain, you can have both ‘createUser’ and ‘deleteUser’ handled by the same Lambda.

When deciding how to group your functions, consider these factors:

Bounded contexts
Team organisation
Scoped IAM permissions
Common code dependencies
Common downstream dependencies
Initialisation time (cold start impact)
Memory configuration

A powerful way to implement this approach is the Lambda Web Adapter pattern. Instead of creating one Lambda per HTTP endpoint, you run a traditional web framework inside a single Lambda and handle routing internally. This allows you to use familiar frameworks like Express.js, Flask, Django, Spring Boot, or ASP.NET.

The result is a more maintainable system that aligns with real domain boundaries, without losing the benefits of serverless.
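
A minimal sketch of the grouping idea: one handler for the “users” domain that routes internally. This is a hand-rolled router, not the Lambda Web Adapter, and it assumes the API Gateway HTTP API payload format 2.0 (`rawPath` and `requestContext.http.method`). The in-memory `USERS` store and the `make_event` helper are hypothetical stand-ins for a real data store and real requests.

```python
import json

USERS = {}  # in-memory stand-in for a real data store (assumption)


def create_user(body):
    USERS[body["id"]] = body
    return 201, body


def delete_user(user_id):
    USERS.pop(user_id, None)
    return 204, None


def handler(event, context=None):
    """Single Lambda entry point for the whole users domain."""
    method = event["requestContext"]["http"]["method"]
    path = event["rawPath"]
    if method == "POST" and path == "/users":
        status, body = create_user(json.loads(event.get("body") or "{}"))
    elif method == "DELETE" and path.startswith("/users/"):
        status, body = delete_user(path.rsplit("/", 1)[-1])
    else:
        status, body = 404, {"message": "not found"}
    return {"statusCode": status, "body": json.dumps(body) if body is not None else ""}


def make_event(method, path, body=None):
    """Build a pared-down event in the assumed payload shape, for local testing."""
    return {
        "rawPath": path,
        "requestContext": {"http": {"method": method}},
        "body": json.dumps(body) if body is not None else None,
    }


created = handler(make_event("POST", "/users", {"id": "42", "name": "Ada"}))
deleted = handler(make_event("DELETE", "/users/42"))
missing = handler(make_event("GET", "/orders"))
```

Both operations share one deployment unit, one IAM role, and one set of dependencies, which is the trade the grouping factors above are meant to weigh.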

Conclusion

Serverless has matured enormously over the past decade, but the fundamentals remain the same: build only what matters, let the cloud handle the rest, and always optimize for simplicity. Whether it’s choosing the right tool for the job, embracing new capabilities like Lambda durable functions, or structuring your Lambdas around real domain boundaries, the goal is to ship faster with less operational burden. If you want to go deeper on event-driven architectures, my book Mastering Event-Driven Microservices in AWS covers these patterns and many more in detail.

API Gateway: Why Your Serverless API Costs More Than an ALB at Scale

Lefteris Karageorgiou — Wed, 24 Jun 2026 09:30:40 GMT

You went serverless. API Gateway in front of Lambda, clean architecture, no servers to manage. It felt great at 10,000 requests a day. Then your product grew, and now you’re staring at a $2,000/month API Gateway bill wondering what happened.

Here’s the uncomfortable truth: API Gateway’s pricing model has a scaling cliff that nobody talks about during the honeymoon phase.

The Math That Changes Everything

API Gateway REST APIs charge $3.50 per million requests (after the first 333 million/month, it drops slightly). Sounds cheap in isolation. Let’s do the math for a moderately successful API:

50 million requests/month
API Gateway cost: ~$175/month

Now the same traffic through an Application Load Balancer:

ALB fixed cost: ~$22/month (base + LCUs)
Per-request component: negligible at this scale

That’s already an 8x difference, and it only gets worse as you scale. At 500 million requests/month, you’re looking at roughly $1,630 for API Gateway (the first 333 million requests at $3.50 per million, the rest at the lower tier) vs. roughly $50–80 for an ALB. The gap becomes a canyon.
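
The arithmetic can be sketched in a few lines. The $3.50 rate and the 333 million threshold come from the text; the $2.80 rate for the next 667 million requests is an assumption based on the published us-east-1 price list and should be checked for your region. Volumes above 1 billion requests fall into further tiers that this sketch does not model. The ALB figure is taken from the text, not computed.

```python
# (requests covered by the tier, USD per million requests)
# Second-tier price is an assumption; verify against the current price list.
REST_TIERS = [(333_000_000, 3.50), (667_000_000, 2.80)]


def rest_api_monthly(requests):
    """Tiered request charge for an API Gateway REST API, up to 1B requests."""
    if requests > sum(size for size, _ in REST_TIERS):
        raise ValueError("volumes above 1 billion requests are not modeled here")
    cost, remaining = 0.0, requests
    for size, price_per_million in REST_TIERS:
        used = min(remaining, size)
        cost += used / 1_000_000 * price_per_million
        remaining -= used
    return cost


ALB_MONTHLY_AT_50M = 22.0  # figure quoted in the text, not computed here

at_50m = rest_api_monthly(50_000_000)
at_500m = rest_api_monthly(500_000_000)

print(f"50M requests:  ${at_50m:,.2f} (about {at_50m / ALB_MONTHLY_AT_50M:.0f}x the ALB figure)")
print(f"500M requests: ${at_500m:,.2f}")
```

Swapping in your own monthly request count is the five-minute check the article recommends.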

“But API Gateway Gives Me More Features”

This is the argument that keeps teams locked in. And it’s partially true. REST APIs give you usage plans, API keys, request validation, and caching built in.

But ask yourself honestly: how many of those features are you actually using?

Most production APIs I’ve seen use API Gateway as a glorified proxy. The Lambda function does all the validation, auth happens in a middleware layer or authorizer, and nobody configured the built-in caching. If that sounds like your setup, you’re paying a premium for a feature set you’re not consuming.

HTTP APIs: The Middle Ground Nobody Considers

In 2019, AWS launched HTTP APIs, a stripped-down API Gateway variant at $1.00 per million requests. That’s roughly a 70% discount over REST APIs for the same basic function: route a request to Lambda.

HTTP APIs support JWT authorizers, CORS configuration, path parameters, and Lambda proxy integration. For most CRUD APIs, that’s the complete feature set you need.

If you’re running REST APIs today and not using usage plans, API key management, or request/response transformation at the gateway level, you’re overpaying by 3.5x for no reason.

When ALB + Lambda Actually Wins

ALB can invoke Lambda directly. No API Gateway in the path at all. You lose the API management features entirely, but you gain:

Predictable pricing that barely moves with traffic volume
Health checks and target group routing baked in
gRPC support if you need it (for instance and IP targets; Lambda targets don’t support it)
No 29-second timeout ceiling

The trade-off: you manage SSL certificates, you don’t get built-in throttling, and monitoring requires more CloudWatch configuration. But for high-throughput internal APIs or backend-to-backend communication, ALB is dramatically cheaper.

The Decision Framework

Here’s how I think about it:

Stay on API Gateway REST APIs if you genuinely use usage plans, per-client throttling, API key quotas, or request transformation templates. These are legitimate features with no ALB equivalent.

Switch to HTTP APIs if your Gateway is a passthrough proxy with a JWT or Lambda authorizer. Same developer experience, roughly 70% cost reduction.

Switch to ALB if you’re processing more than 100M requests/month, your APIs are internal or don’t need API management features, or you’re hitting the 29-second timeout limit.

The Takeaway

API Gateway is not expensive. API Gateway at scale without using its premium features is expensive. The mistake isn’t choosing API Gateway on day one — it’s never re-evaluating as your traffic grows.

Check your current monthly request count. Run the math against HTTP APIs and ALB. The five minutes of arithmetic might save you more than your last week of performance optimization work.

Serverless with Claude Code: Build POCs Fast Without Losing Control

Lefteris Karageorgiou — Wed, 17 Jun 2026 09:31:09 GMT

Proof of concepts shouldn’t take weeks. At AWS, we build many POCs for customers: quick, focused prototypes that validate an idea, demonstrate feasibility to stakeholders, and inform the path forward. Whether it’s a new integration pattern, an AI-powered workflow, or a migration proof point, the goal is always the same: show, don’t tell.

Claude Code has become a genuine accelerator for this kind of work. Paired with AWS serverless services — Lambda, API Gateway, DynamoDB, Step Functions — it helps me go from idea to working prototype in hours rather than days. The boilerplate disappears. The IAM headaches shrink. The iteration cycles get tighter.

But here’s what I’ve learned the hard way: speed without structure creates risk.

The Problem with Moving Too Fast

I’ve seen it in my own POCs and in customer engagements. Claude Code scaffolds an entire API in minutes, but then you realise the permissions are too broad, the error handling is inconsistent, or the generated code drifts from your intended architecture. For a throwaway prototype, maybe that’s fine. But the moment a POC starts to look promising — the moment a stakeholder says “let’s run with this” — those shortcuts become technical debt.

The issue isn’t Claude Code itself. It’s that most of us treat it as a raw accelerator without building the harness around it: the specs, constraints, tests, and guardrails that keep agent behaviour reliable and predictable.

Harness Engineering: The Missing Piece

This is exactly why I’m excited about an upcoming workshop that tackles this head-on: Hands-On: Harness Engineering with Claude, hosted by Packt Publishing on Thursday, August 6, 2026 (9:00–11:30 AM EDT).

The workshop teaches harness engineering — the practical discipline of building the surrounding system that makes Claude Code’s behaviour more reliable, testable, constrained, and trustworthy. Instead of relying on prompting alone, you learn how to combine:

Specs and instructions — turning requirements into executable guidance that steers Claude Code’s output
Permissions and hooks — constraining what the agent can and cannot do
Tests and verification — validating outputs against acceptance criteria automatically
Logging and observability — understanding what the agent actually did and why

This is the difference between “Claude Code wrote something that looks right” and “Claude Code produced a verified, reviewable output within defined boundaries.”

Why This Matters for Serverless POCs

When I build serverless POCs with Claude Code, the harness is what lets me move fast and stay confident. A well-written spec means the generated Lambda functions match the architecture I intended. Permission hooks prevent the agent from creating overly permissive IAM policies. Tests validate that the API actually handles edge cases before I demo it to a customer.

The result: I keep the speed advantage — POCs in hours, costs in pennies — without the anxiety of shipping something I haven’t properly reviewed.
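To make the IAM guardrail concrete, here's a rough sketch of the kind of check a hook or CI step could run against a generated policy document. It is not Claude Code's hook API. The function name and the "too broad" rule (a bare "*" or a service-wide "service:*" action, or a "*" resource) are my own assumptions for illustration.

```python
def _as_list(value):
    """IAM allows a single string or a list; normalise to a list."""
    return [value] if isinstance(value, str) else list(value)


def find_overly_broad_statements(policy: dict) -> list:
    """Return Allow statements with wildcard actions or a wildcard resource."""
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # a single statement may be a bare object
        statements = [statements]

    findings = []
    for stmt in statements:
        if stmt.get("Effect") != "Allow":
            continue
        actions = _as_list(stmt.get("Action", []))
        resources = _as_list(stmt.get("Resource", []))
        broad_action = any(a == "*" or a.endswith(":*") for a in actions)
        broad_resource = "*" in resources
        if broad_action or broad_resource:
            findings.append(stmt)
    return findings


# Hypothetical generated policy: one scoped statement, one overly broad one.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "dynamodb:GetItem",
            "Resource": "arn:aws:dynamodb:eu-west-1:123456789012:table/orders",
        },
        {"Effect": "Allow", "Action": ["s3:*"], "Resource": "*"},
    ],
}

for stmt in find_overly_broad_statements(policy):
    print("Too broad:", stmt["Action"], "on", stmt["Resource"])
```

A check like this only catches the obvious cases, but it turns "review the permissions" into something that fails loudly before the demo.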

Who Should Attend

If you’re using Claude Code (or any AI coding agent) for real work — whether that’s serverless POCs, infrastructure automation, or application development — this workshop fills a critical gap. It’s not about prompting tricks. It’s about building an engineering framework that optimises for trust, predictability, and production readiness.

Event details:

📅 Thursday, August 6 | 9:00–11:30 AM EDT
💻 Online — join from anywhere
🎟️ Register on Eventbrite

Speed is table stakes now. Reliability is the differentiator. Learn how to build both into your Claude Code workflows.

The 4-Step Roadmap to Break Into Cloud (That Actually Works)

Lefteris Karageorgiou — Wed, 10 Jun 2026 09:31:18 GMT

Hey, it’s Lefteris 👋 I’m the voice behind the weekly newsletter “The Cloud Engineers.”

Most people trying to break into cloud are stuck in an endless loop.

They watch tutorials. They collect certifications. They spin up a Lambda function, follow along with a YouTube video, tear it down, and call it “experience.”

Then they apply to 50, 100, maybe 200 jobs, and hear nothing back.

Here’s the uncomfortable truth: the market doesn’t reward what you know. It rewards what you can prove.

Right now, thousands of professionals who want to switch to cloud careers are competing for the same roles with the same certifications, the same generic resumes, and zero evidence that they can build anything real.

The ones who break through do something different. They follow a system.

In this article, we’ll dive into that system, which consists of four steps.

Step 1: Build Real-World Cloud Projects

Not toy examples. Not tutorial clones.

Build production-style projects using industry best practices, the kind you can confidently walk through in an interview without breaking eye contact.

That means proper architecture. Infrastructure as Code. CI/CD pipelines. Monitoring. Cost awareness. Security considerations. The full picture.

When an interviewer asks, “Tell me about something you’ve built,” you should have a project so solid that your biggest challenge is deciding which part to talk about first.

Step 2: Turn Those Projects Into a Recruiter-Ready Portfolio

Building is only half the battle. If nobody can find your work, it doesn’t exist.

So showcase your projects properly. Upload clean, well-documented code to GitHub with strong READMEs. Write blog posts explaining your architecture decisions and trade-offs.

I’ve seen candidates with mediocre projects outperform stronger builders simply because their work was visible and well presented.

Also, optimize your LinkedIn profile. Treat it like a landing page, not a digital CV. Use a clear headline that states what you do and who you help. Post consistently about what you’re building and learning. Recruiters search LinkedIn daily, so make sure they find you.

Step 3: Get Visible and Secure Interviews

You can’t get hired from the shadows. Network with intent.

Engage on LinkedIn, join cloud communities, and attend local meetups.

But don’t just lurk. Comment on posts with genuine insights, share your learnings publicly, and connect with people doing the work you want to do.

I’ve seen more opportunities come from a single thoughtful comment than from hundreds of cold applications.

Showcase your projects publicly. Talk about what you’re building, what broke, and what you learned.

This isn’t bragging, it’s signaling.

You’re telling the market: “I’m here, I’m building, and I’m serious.”

The interviews will come. Not from luck, but from visibility.

Step 4: Prepare for and Master Interviews

Getting the interview isn’t the finish line. It’s the starting line.

Practice explaining your architecture decisions, trade-offs, and problem-solving approach until it becomes second nature.

Know why you chose DynamoDB over RDS. Know what you’d change if requirements shifted. Know how you’d scale under pressure.

Walk in confident, not guessing.

The candidates who land offers aren’t always the most technically brilliant. They’re the ones who communicate clearly, own their decisions, and show they think like engineers who ship to production.

The Problem With Doing This Alone

You already know what you need to do. The steps above aren’t a secret.

But knowing and executing are two different things.

Most people get stuck between Step 1 and Step 2 — building projects that aren’t strong enough or never making them visible.

Others network randomly, apply generically, and wonder why nothing lands.

What separates people who break in from people who stay stuck isn’t talent.

It’s having a structured path, accountability, and someone who’s done it before showing you exactly where to focus.

That’s Why I Built the Cloud Career Bootcamp

Over the past six years, I’ve personally guided more than 200 professionals through their transition into cloud careers using this system.

The results have been consistently strong.

People who felt their backgrounds were holding them back successfully landed cloud roles at companies like AWS, J.P. Morgan, Airbnb, Uber, Pfizer, and more.

Career changers who thought it was too late to pivot moved into cloud positions.

Instead of facing constant rejection, they started receiving multiple offers from companies eager to hire them.

Now, I’m putting together a program that walks you through this entire roadmap, step by step, with hands-on guidance, real feedback, and a community of people on the same path.

If you’re serious about breaking into cloud and you’re done spinning your wheels, join the waitlist:

👉 Cloud Career Bootcamp — Join the Free Waitlist Now

Spots will be limited.

Get on the list so you’ll be the first to know when doors open.

S3 Cost Traps: What Nobody Tells You About Lifecycle Policies

Lefteris Karageorgiou — Wed, 03 Jun 2026 09:31:37 GMT

Hey, it’s Lefteris 👋 I’m the voice behind the weekly newsletter “The Cloud Engineers.”

You’re running a production workload on AWS. You set up S3, maybe enabled versioning because “best practice,” and moved on. Six months later, your S3 bill is three times what you expected, and you have no idea why.

You’re not alone. S3 pricing is deceptively simple on the surface, but underneath it hides several cost traps that lifecycle policies are supposed to solve. The problem? Most teams either skip lifecycle rules entirely or configure them in ways that barely help.

Let’s talk about what’s silently eating your budget.

The Versioning Tax

Versioning is great for data protection. Every overwrite or delete keeps the old object around, which means you can recover from accidental changes. What nobody emphasizes: every old version is a billable object.

If you’re writing logs, reports, or processed data that gets overwritten frequently, you could be storing 10x or 50x the “visible” data. The S3 console shows you the current objects. The old versions hide underneath, quietly accumulating storage charges.

The fix: Always pair versioning with a lifecycle rule that expires non-current versions. Ask yourself: “Do I really need 90 days of old versions, or would 7 days cover any realistic recovery scenario?”
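A back-of-the-envelope sketch shows how much the retention window matters. It assumes the simplest case: every overwrite rewrites the full dataset, and non-current versions are expired after a fixed number of days. The function name and the example figures are illustrative, not measurements.

```python
def billable_storage_gb(current_gb: float, overwrites_per_day: float, retention_days: int) -> float:
    """Steady-state storage when every overwrite keeps a full-size old version.

    Assumes each overwrite rewrites the whole dataset and that non-current
    versions are expired after retention_days.
    """
    noncurrent_gb = current_gb * overwrites_per_day * retention_days
    return current_gb + noncurrent_gb


# 100 GB of "visible" data, overwritten once a day.
for days in (90, 7):
    total = billable_storage_gb(100, 1, days)
    print(f"{days}-day retention: {total:.0f} GB billed ({total / 100:.0f}x the visible data)")
```

With these inputs, 90 days of retention bills 9,100 GB for 100 GB of visible data, while 7 days bills 800 GB.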

The Multipart Upload Graveyard

When a large upload fails halfway, the uploaded parts don’t disappear. They sit in your bucket as incomplete multipart uploads, invisible in the console’s normal view, but fully billable.

If you’re running data pipelines, ETL jobs, or any process that writes large objects and occasionally fails, these orphaned parts add up. Some teams discover gigabytes of phantom storage they never knew existed.

The fix: Add a lifecycle rule to abort incomplete multipart uploads after a short window. Seven days is generous for most workloads, and some teams set it to 1 day.
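Both cleanup fixes so far (expiring non-current versions and aborting incomplete uploads) fit in one lifecycle configuration. Below is a sketch of what that could look like in the shape boto3's put_bucket_lifecycle_configuration expects. The rule IDs, the day counts, and the bucket name you pass in are placeholders. Note that this call replaces the bucket's entire existing lifecycle configuration, so merge with any rules you already have.

```python
import json


def build_cleanup_rules(noncurrent_days: int = 7, abort_after_days: int = 7) -> dict:
    """Build a lifecycle configuration covering the whole bucket."""
    return {
        "Rules": [
            {
                "ID": "expire-noncurrent-versions",  # placeholder rule name
                "Status": "Enabled",
                "Filter": {"Prefix": ""},  # empty prefix = every object
                "NoncurrentVersionExpiration": {"NoncurrentDays": noncurrent_days},
                # Also clean up delete markers that have no versions left behind them.
                "Expiration": {"ExpiredObjectDeleteMarker": True},
            },
            {
                "ID": "abort-incomplete-multipart-uploads",  # placeholder rule name
                "Status": "Enabled",
                "Filter": {"Prefix": ""},
                "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": abort_after_days},
            },
        ]
    }


def apply_rules(bucket_name: str, config: dict) -> None:
    """Apply the configuration. Requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder above runs without it

    boto3.client("s3").put_bucket_lifecycle_configuration(
        Bucket=bucket_name, LifecycleConfiguration=config
    )


print(json.dumps(build_cleanup_rules(), indent=2))
```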

The “I’ll Transition Everything to Glacier” Mistake

A common first move: create a lifecycle rule that transitions all objects to S3 Glacier after 30 days. Sounds sensible: cold storage is cheap.

Here’s what catches people:

Minimum storage duration charges. Glacier Flexible Retrieval and Glacier Instant Retrieval have a 90-day minimum (Deep Archive has 180 days). Delete an object on day 45? You still pay for 90 days.
Retrieval costs. If your application or downstream process ever needs those objects back, retrieval fees and restore times can surprise you.
Small object overhead. Glacier Flexible Retrieval and Deep Archive add 32KB of index and metadata per object, billed at the Glacier rate, plus 8KB billed at the Standard rate. If you’re storing millions of tiny files, the overhead alone can exceed what you’d pay in Standard.

The right question: Before transitioning, map your actual access patterns. If objects are never accessed after 30 days, Glacier Deep Archive might make sense. If they’re occasionally accessed, Infrequent Access (IA) with its simpler retrieval model is a safer bet.
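The small-object trap is easy to check with arithmetic. This sketch compares the monthly storage cost of keeping objects in Standard versus Glacier Flexible Retrieval, counting the per-object overhead (32KB at the Glacier rate plus 8KB at the Standard rate). The two per-GB prices are assumptions for illustration, and retrieval and transition request fees are ignored.

```python
KB_PER_GB = 1024 * 1024

# Assumed USD per GB-month prices, for illustration only.
STANDARD_PRICE = 0.023
GLACIER_PRICE = 0.0036

GLACIER_OVERHEAD_KB = 32   # per object, billed at the Glacier rate
STANDARD_OVERHEAD_KB = 8   # per object, billed at the Standard rate


def monthly_cost_standard(object_kb: float, count: int) -> float:
    """Storage cost of `count` objects of `object_kb` each in Standard."""
    return object_kb * count / KB_PER_GB * STANDARD_PRICE


def monthly_cost_glacier(object_kb: float, count: int) -> float:
    """Storage cost in Glacier Flexible Retrieval, including per-object overhead."""
    glacier_gb = (object_kb + GLACIER_OVERHEAD_KB) * count / KB_PER_GB
    standard_gb = STANDARD_OVERHEAD_KB * count / KB_PER_GB
    return glacier_gb * GLACIER_PRICE + standard_gb * STANDARD_PRICE


def billed_days(days_stored: int, minimum_days: int = 90) -> int:
    """Days you pay for once the minimum storage duration applies."""
    return max(days_stored, minimum_days)


for size_kb in (4, 1024):
    std = monthly_cost_standard(size_kb, 10_000_000)
    gla = monthly_cost_glacier(size_kb, 10_000_000)
    print(f"10M objects of {size_kb} KB: Standard ${std:.2f}, Glacier ${gla:.2f}")
```

Under these assumptions, ten million 4KB objects cost more in Glacier than in Standard, while ten million 1MB objects cost far less. The break-even sits somewhere in the low tens of kilobytes per object.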

Lifecycle Rules That Don’t Actually Apply

This one is subtle. You create a lifecycle rule, confirm it’s active, and assume it’s working. But lifecycle rules scope by prefix and tags. If your bucket structure changed after you wrote the rule — new prefixes, different naming conventions — your rule might be covering 10% of the bucket while the rest grows unchecked.

The fix: Audit lifecycle rules quarterly. Use S3 Storage Lens to see the actual breakdown of storage classes, current vs. non-current versions, and incomplete multipart uploads across your buckets. If the numbers don’t match your expectations, your rules have gaps.
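Storage Lens gives you the real picture, but a quick sanity check on rule scope can be this simple. The sketch takes a list of object keys (from an S3 Inventory report, say) and the prefixes your rules filter on, and reports what fraction of keys any rule matches. It counts objects rather than bytes and ignores tag and size filters, and the key names are invented.

```python
def lifecycle_coverage(object_keys: list, rule_prefixes: list) -> float:
    """Fraction of keys that start with at least one lifecycle rule prefix."""
    if not object_keys:
        return 1.0  # nothing to cover
    covered = sum(
        1
        for key in object_keys
        if any(key.startswith(prefix) for prefix in rule_prefixes)
    )
    return covered / len(object_keys)


# Hypothetical bucket: the rule was written for "logs/", then naming changed.
keys = [
    "logs/2024/a.gz",
    "logs/2024/b.gz",
    "app-logs/2026/c.gz",
    "app-logs/2026/d.gz",
    "exports/e.csv",
]
print(f"Covered: {lifecycle_coverage(keys, ['logs/']):.0%}")
```

Here the rule still reads as active, yet it matches only two of the five keys, because "app-logs/" does not start with "logs/".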

A Mental Model for Getting This Right

Think of lifecycle policies as a three-layer system:

Expiration layer: What can be deleted, and when? Non-current versions, expired delete markers, incomplete uploads.
Transition layer: What should move to cheaper storage, and based on what access pattern evidence?
Audit layer: How do you verify the rules are actually working as intended?

Most teams only think about layer two and skip layers one and three entirely. That’s where the silent costs hide.

The Takeaway

S3 is not “set and forget.” The defaults are designed for durability, not cost efficiency. If you’re not actively managing object versions, failed uploads, and storage class transitions with lifecycle rules, and validating those rules still match reality, you’re overpaying.

Start with a single bucket. Check its Storage Lens dashboard. You’ll probably find at least one surprise waiting for you.

5 Fundamental Architectures You Should Know for Cloud Interviews

Lefteris Karageorgiou — Wed, 27 May 2026 09:31:12 GMT

Hey, it’s Lefteris 👋 I’m the voice behind the weekly newsletter “The Cloud Engineers.”

If you want to break into a cloud role (Cloud Engineer, Cloud Developer, Software Engineer, DevOps, SRE, Solutions Architect, etc.) and you have an interview coming up, understanding cloud architecture is non-negotiable if you want to stand out.

Most candidates focus on certifications or memorising AWS services, but interviews are designed to test something deeper: how you think about building scalable, reliable, and resilient systems.

In this article, we’ll look at 5 fundamental architectures you should know. Master them, understand the trade-offs behind them, and you’ll stand out for the right reasons.

If you want hands-on practice with these 5 architectures, grab my free PDF, 5 AWS Projects To Get You Hired, featuring 5 real-world AWS projects with architecture diagrams and step-by-step implementation guides.

Download the FREE PDF

1. 3-Tier Architecture

This is the foundation. Frontend, backend, database: three distinct layers, each with a clear responsibility. Simple in concept, but interviewers expect you to go far beyond drawing three boxes on a whiteboard.

Key AWS Services: CloudFront + S3 (frontend), ALB + EC2/ECS with Auto Scaling (backend), RDS Multi-AZ or Aurora (database).

What interviewers expect you to know:

How to scale each tier independently: stateless backends behind a load balancer, read replicas for read-heavy database workloads.
Failure scenarios: what happens when an AZ goes down? How does Multi-AZ RDS failover work?
Caching strategies: ElastiCache between backend and database to reduce latency and DB load.
The difference between horizontal and vertical scaling, and why horizontal wins at scale.

When discussing this architecture, show that you understand the why behind each layer’s separation, not just the what.

2. Microservices

Breaking a monolith into smaller, independent services sounds clean on paper. In practice, it introduces a whole new category of challenges.

Key AWS Services: ECS or EKS, Lambda.

What interviewers expect you to know:

Trade-offs vs. monoliths: when the operational overhead isn’t worth it (small teams, early-stage products).
Data ownership: each service owns its data store. No shared databases.
How you handle failures across service boundaries: circuit breakers, retries with exponential backoff, timeouts.
Observability: how do you trace a request that flows through 5 services?
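If an interviewer asks you to sketch "retries with exponential backoff", something like the following is enough to show you understand the idea. It's a minimal version with a capped delay and full jitter. The function name and defaults are my own choices, and a production version would retry only on errors it knows to be transient.

```python
import random
import time


def call_with_retries(func, max_attempts=4, base_delay=0.1, max_delay=2.0, sleep=time.sleep):
    """Call func(), retrying on any exception with exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Double the ceiling each attempt, cap it, then wait a random
            # amount below it so many clients do not retry in lockstep.
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, ceiling))


# Example: a dependency that fails twice, then recovers.
state = {"calls": 0}


def flaky_service():
    state["calls"] += 1
    if state["calls"] < 3:
        raise ConnectionError("transient failure")
    return "ok"


print(call_with_retries(flaky_service), "after", state["calls"], "calls")
```

The sleep parameter is injectable so the behaviour can be tested without actually waiting.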

The key insight interviewers look for: you chose microservices because the problem demanded it, not because it was trendy.

3. Serverless

No servers to patch, no infrastructure to manage. You can build entire applications without provisioning a single instance.

Key AWS Services: Lambda, API Gateway.

What interviewers expect you to know:

Cold starts: what causes them and how to mitigate them (for example, with provisioned concurrency).
Cost model: pay-per-invocation is cheap at low scale but can surprise you with high-throughput workloads. Know the break-even point vs. containers.
When NOT to use serverless: long-running processes, workloads needing persistent connections, or latency-critical paths where cold starts are unacceptable.
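The break-even point is worth being able to estimate on a whiteboard. Here's a rough sketch: Lambda cost scales with requests, duration, and memory, while a container is treated as a flat monthly bill. The Lambda prices are assumed list prices, the $30 container figure is invented for the example, and the free tier is ignored.

```python
# Assumed prices in USD, for illustration only.
LAMBDA_PER_MILLION_REQUESTS = 0.20
LAMBDA_PER_GB_SECOND = 0.0000166667


def lambda_monthly_cost(requests: int, avg_duration_s: float, memory_gb: float) -> float:
    """Request charge plus compute charge (GB-seconds), ignoring the free tier."""
    request_cost = requests / 1_000_000 * LAMBDA_PER_MILLION_REQUESTS
    compute_cost = requests * avg_duration_s * memory_gb * LAMBDA_PER_GB_SECOND
    return request_cost + compute_cost


def break_even_requests(container_monthly_cost: float, avg_duration_s: float, memory_gb: float) -> float:
    """Requests per month at which Lambda matches a fixed container bill."""
    cost_per_request = lambda_monthly_cost(1, avg_duration_s, memory_gb)
    return container_monthly_cost / cost_per_request


# Hypothetical workload: 100 ms average duration, 512 MB memory, $30/month container.
requests = break_even_requests(30.0, avg_duration_s=0.1, memory_gb=0.5)
print(f"Break-even at roughly {requests / 1_000_000:.0f}M requests/month")
```

With these inputs the break-even lands at about 29 million requests a month. Below that, pay-per-invocation is cheaper. Above it, the always-on container starts to win on cost alone.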

The interview-winning move is demonstrating that you can evaluate when serverless is the right tool and when it introduces more problems than it solves.

4. Event-Driven Architecture

This is where modern cloud design becomes powerful. Instead of services calling each other directly, they produce and consume events asynchronously. The result: loose coupling, high scalability, and systems that are naturally resilient to spikes in traffic.

Key AWS Services: EventBridge, SNS, SQS.

What interviewers expect you to know:

The difference between queuing (SQS: each message goes to one consumer) and pub/sub (SNS: fan-out to many consumers).
Eventual consistency: data won’t be immediately up to date across services. You need to explain how your system handles this.
Idempotency: events can be delivered more than once. Your consumers must handle duplicates gracefully.
Dead-letter queues: what happens when an event fails processing repeatedly? How do you monitor and replay?
A real scenario: an order is placed → event fires → inventory, notifications, and analytics services react independently without knowing about each other.
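Idempotency is the point from this list that's easiest to show in code. The sketch below deduplicates by event ID before running any side effects. The class name and event shape are invented, and the in-memory set stands in for a durable store (in a real system, something like a conditional write to a database table, so the check survives restarts and concurrent consumers).

```python
class IdempotentConsumer:
    """Wraps a handler so that a redelivered event is processed only once."""

    def __init__(self, handler):
        self._handler = handler
        self._seen_ids = set()  # stand-in for a durable deduplication store

    def handle(self, event: dict) -> bool:
        """Process the event; return False if it was a duplicate delivery."""
        event_id = event["id"]
        if event_id in self._seen_ids:
            return False  # already processed: skip the side effects
        self._handler(event)
        self._seen_ids.add(event_id)  # record only after the handler succeeds
        return True


# Example: the same "order placed" event is delivered twice.
reserved = []
consumer = IdempotentConsumer(lambda event: reserved.append(event["order"]))

order_event = {"id": "evt-1", "order": "order-42"}
consumer.handle(order_event)
consumer.handle(order_event)  # duplicate delivery

print(reserved)
```

Inventory is reserved once, even though the event arrived twice.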

The challenge is showing you can design systems where components don’t need to know about each other but still behave correctly as a whole.

5. Containers

Containers give you consistency across environments and fine-grained control over your runtime. They sit between the full control of EC2 and the abstraction of serverless.

Key AWS Services: ECS, EKS, ECR.

What interviewers expect you to know:

When containers beat serverless: long-running processes, specific runtime needs, workloads needing persistent connections, or apps being migrated from on-premises.
When serverless beats containers: short-lived, event-triggered functions with variable traffic.
Health checks, rolling deployments, and blue/green strategies for zero-downtime updates.
You don’t need to be a Kubernetes expert. But you should understand pods, services, and why teams choose (or avoid) K8s.

Conclusion

These five architectures cover the vast majority of what you’ll face in cloud interviews. You don’t need to memorise every AWS service. You need to understand patterns, trade-offs, and when to apply each one.

The candidates who stand out are the ones who can explain why they chose an architecture, not just draw it on a board. Show your reasoning. Discuss trade-offs. Acknowledge limitations.

Cloud interviews are rarely about knowing the “correct” AWS service. They’re about proving you can make good engineering decisions under constraints.

Download the FREE PDF

Building Multi-Agent Workflows on Lambda Durable Functions

Lefteris Karageorgiou — Wed, 20 May 2026 09:30:48 GMT

Hey, it’s Lefteris 👋 I’m the voice behind the weekly newsletter “The Cloud Engineers.”

Multi-agent systems are everywhere now. Every team wants autonomous agents collaborating on complex tasks: research, planning, execution, validation. The problem? Orchestrating multiple agents that need to wait on each other, handle failures gracefully, and maintain state across long-running conversations is painful. We’ve been duct-taping this together with Step Functions, SQS queues, and DynamoDB state tables for too long.

Lambda Durable Functions change the game here. In this article we will walk through why durable functions are a natural fit for multi-agent orchestration, design a document processing pipeline with four agents, and show how the architecture holds together without a single line of state management code.

Why Durable Functions for Multi-Agent

Standard Lambda functions run start-to-finish in a single invocation. If something fails midway, you retry everything. For a multi-agent workflow where Agent A researches, Agent B plans, Agent C executes, and Agent D validates, that’s unacceptable. You can’t re-run a 4-minute research phase because the validation agent hit a transient error.

Durable functions automatically checkpoint progress, suspend execution for up to one year during long-running tasks, and recover from failures. No custom state management. No DynamoDB tables tracking “which step are we on.” The runtime handles it.

This is exactly what multi-agent orchestration needs: reliable progress tracking across agents that may take seconds or minutes to respond, with automatic recovery when things go wrong.

The Example: Document Processing Pipeline

Let’s see an example of a document processing pipeline and how we can build it with Lambda durable functions. We have four agents and one durable function orchestrating them. The agents are:

Classifier Agent — Reads an incoming document, determines its type (invoice, contract, support ticket), and extracts metadata.
Enrichment Agent — Takes the classification, pulls additional context from internal systems, and augments the document with business context.
Decision Agent — Evaluates the enriched document against business rules and decides the routing: auto-approve, escalate, or reject.
Action Agent — Executes the decision: files the document, notifies stakeholders, or triggers downstream workflows.

Architecture

┌─────────────────────────────────────────────────────────────────────────────┐
│                     Lambda Durable Function (Orchestrator)                  │
│                                                                             │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐    ┌─────────────┐   │
│  │ Checkpoint 1│    │ Checkpoint 2│    │ Checkpoint 3│    │ Checkpoint 4│   │
│  └──────┬──────┘    └──────┬──────┘    └──────┬──────┘    └──────┬──────┘   │
│         │                  │                  │                  │          │
│         ▼                  ▼                  ▼                  ▼          │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐    ┌─────────────┐   │
│  │  Step 1:    │    │  Step 2:    │    │  Step 3:    │    │  Step 4:    │   │
│  │  Classify   │───▶│  Enrich     │───▶│  Decide     │───▶│  Act        │   │
│  └──────┬──────┘    └──────┬──────┘    └──────┬──────┘    └──────┬──────┘   │
│         │                  │                  │                  │          │
└─────────┼──────────────────┼──────────────────┼──────────────────┼──────────┘
          │                  │                  │                  │
          ▼                  ▼                  ▼                  ▼
┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐
│  Classifier     │  │  Enrichment     │  │  Decision       │  │  Action         │
│  Agent          │  │  Agent          │  │  Agent          │  │  Agent          │
│  (Lambda)       │  │  (Lambda)       │  │  (Lambda)       │  │  (Lambda)       │
│                 │  │                 │  │                 │  │                 │
│  - Reads doc    │  │  - Pulls context│  │  - Evaluates    │  │  - Files doc    │
│  - Classifies   │  │  - Calls APIs   │  │    rules        │  │  - Notifies     │
│  - Extracts     │  │  - Augments     │  │  - Routes       │  │  - Triggers     │
│    metadata     │  │    metadata     │  │                 │  │    downstream   │
└────────┬────────┘  └────────┬────────┘  └────────┬────────┘  └────────┬────────┘
         │                    │                    │                    │
         ▼                    ▼                    ▼                    ▼
┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐
│  Amazon Bedrock │  │  Internal APIs  │  │  Human Review   │  │  S3 / SNS /     │
│  (LLM)          │  │  (DynamoDB,     │  │  (Callback -    │  │  EventBridge    │
│                 │  │   other svcs)   │  │   Wait/Resume)  │  │                 │
└─────────────────┘  └─────────────────┘  └─────────────────┘  └─────────────────┘

The Flow

The durable function receives a document event. It invokes the Classifier Agent as a durable step and progress is checkpointed. If Lambda recycles the execution environment after classification completes, the function resumes from that checkpoint, not from scratch.

The Classifier’s output feeds into the Enrichment Agent. Another durable step, another checkpoint. The Enrichment Agent might call external APIs that take 30 seconds. The durable function suspends, costs nothing in compute while waiting, and resumes when enrichment completes.

Here’s where it gets interesting. The Decision Agent might determine that a human needs to review this document. The durable function uses a wait: it suspends execution entirely, for hours or days if needed, until a callback arrives with the human’s decision. No polling. No idle compute. The function simply resumes where it left off.

Finally, the Action Agent executes. If it fails (maybe a downstream system is temporarily unavailable), the durable function retries that specific step without re-running classification, enrichment, or decision. Four agents, one orchestration function, zero state management infrastructure.

Why This Beats the Alternative

Before durable functions, this same workflow required: a Step Functions state machine, DynamoDB for intermediate state, SQS queues between agents, dead-letter queues for failures, and CloudWatch alarms for stuck executions. That’s five moving parts to manage for what is conceptually a single workflow.

With durable functions, it’s one Lambda function. Same programming model you already know. Same event handler. Same integrations. The durability is built into the execution model itself.

The Trade-Off

Durable functions use a checkpoint-replay model. Every time execution resumes, your handler code runs again from the top, and completed steps return their checkpointed results instead of re-executing. This means your orchestration logic must be deterministic: no random values, and no reading the current time for branching decisions outside of durable steps. This is a constraint worth understanding upfront.

For multi-agent workflows specifically, this is rarely a problem. Your orchestration logic is typically: call agent, get result, pass to next agent. That’s inherently deterministic.
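To build intuition for checkpoint-replay, here's a toy in-memory model of the idea. To be clear, this is not the Lambda durable functions SDK: ReplayContext, step, and orchestrate are names I made up, and a real runtime persists checkpoints durably rather than in a dictionary. What it shows is the mechanic: on replay the orchestrator runs from the top, completed steps return their stored results, and only the unfinished step actually executes.

```python
class ReplayContext:
    """Toy checkpoint store: remembers the result of each named step."""

    def __init__(self):
        self.checkpoints = {}

    def step(self, name, func, *args):
        if name in self.checkpoints:   # replay: reuse the stored result
            return self.checkpoints[name]
        result = func(*args)           # first run: execute the step...
        self.checkpoints[name] = result  # ...and checkpoint only on success
        return result


def orchestrate(ctx, document, agents):
    """Deterministic orchestration: call agent, get result, pass to next agent."""
    classified = ctx.step("classify", agents["classify"], document)
    enriched = ctx.step("enrich", agents["enrich"], classified)
    decision = ctx.step("decide", agents["decide"], enriched)
    return ctx.step("act", agents["act"], decision)
```

If the "act" step raises, calling orchestrate again with the same context replays the first three steps from their checkpoints and re-executes only "act". That's also why the orchestration code has to be deterministic: it must take the same path through the same step names on every replay.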

When to Reach for This

Multi-agent workflows that involve waiting on humans, on slow external systems, on other agents, are the sweet spot. If your agents all respond in under a second and never fail, you probably don’t need durability. But that’s not the real world.

In the real world, agents call LLMs that time out, external APIs that rate-limit, and humans who go to lunch. Durable functions handle all of that without you writing a single line of state management logic.

How to Get the Most Out of Tech Events

Lefteris Karageorgiou — Wed, 13 May 2026 09:31:02 GMT

Hey, it’s Lefteris 👋 I’m the voice behind the weekly newsletter “The Cloud Engineers.”

I’ve attended a couple of tech events over the past two weeks, and they reminded me just how much value is sitting right there, if you know how to look for it. Most people show up, sit through sessions, and leave. But the real ROI of a tech event goes well beyond the agenda.

Here’s what I’ve learned about making the most of them.

1. Do Your Research Beforehand

Walking into an event blind is a missed opportunity. Before you even step through the door, check the agenda. Which sessions are actually worth your time? Which workshops align with what you’re working on right now? Who’s speaking, and more importantly, who’s attending?

Identify two or three people you’d genuinely like to connect with. Not just big names, but people whose work you follow, whose problems overlap with yours, or whose perspective you’d find valuable. Having that list in your head changes how you move through the event. You stop wandering and start being intentional.

2. Don’t Just Attend — Connect

This is where most people leave value on the table. They attend the sessions, they clap at the end, and they head to the next room. But the conversations that happen in the hallways, at the coffee station, or right after a talk? Those are often worth more than the talk itself.

Talk to speakers. Ask them a follow-up question. Share your perspective on something they said. Disagree, even — respectfully. The best conversations I’ve had at events started with “I see it slightly differently, here’s why.”

Networking isn’t about collecting business cards or LinkedIn connections. It’s about meaningful exchange. One real conversation with the right person can open a door that no session ever could.

3. Participate Actively

Side events, workshops, roundtables, and hackathons are where the real engagement happens. The more you put in, the more you get out. It sounds obvious, but most people default to passive attendance.

When you participate actively, you become memorable. People remember the person who asked the sharp question, who contributed to the workshop discussion, who showed up to the evening side event when everyone else went back to the hotel. That presence compounds over time.

4. Capture and Apply Insights

Take notes. Not on everything, but on what genuinely stands out. A new mental model. A tool you hadn’t heard of. A framing of a problem that clicked. A name someone mentioned three times.

The real value isn’t in the notes themselves. It’s in what you do with them afterward. Block time in the week after the event to review what you captured and identify one or two things you can actually apply. That’s the difference between an event that felt good and one that moved the needle.

The Right Measure of Success

Here’s a reframe that changed how I approach events: you don’t need to walk away with ten new contacts, a notebook full of insights, and a job offer to call it a success.

If you leave with one valuable new connection, one useful takeaway, or simply a great experience that reminded you why you’re in this field — that’s a win. Set that bar, and you’ll almost always clear it.

The mistake is treating events as passive consumption. The sessions are the structure, but the value is in how you engage with everything around them. Come prepared, stay curious, and be willing to start a conversation.

That’s where the real return is.

Spec-Driven Development on AWS

Lefteris Karageorgiou — Wed, 06 May 2026 09:31:15 GMT

Hey, it’s Lefteris 👋 I’m the voice behind the weekly newsletter “The Cloud Engineers.”

We’ve all been there. A service gets deployed, an API route goes live, and three weeks later someone asks, “Wait, what does this endpoint actually do?” The answer lives in someone’s head, a Slack thread, or a Confluence page nobody has updated since the initial design meeting. That’s the problem Spec-Driven Development (SDD) solves, and on AWS, it changes how you build, test, and evolve systems regardless of whether you’re running Lambda, ECS, EKS, or EC2.

This hands-on Spec-Driven Development workshop is built for that exact problem. Instead of watching demos, you’ll build a real application while learning how to define clear specs and guide AI to produce reliable, production-ready outputs.

After a sold-out first cohort, Cohort 2 is now open.

👉 Register here: https://www.eventbrite.co.uk/e/hands-on-spec-driven-development-workshop-cohort-2-tickets-1985498625838?aff=lefteris

Use code SD40 for 40% off

Spec-Driven Workshop at 40% off

What Is Spec-Driven Development?

Spec-Driven Development is a practice where the contract, also known as the specification, is written before any implementation begins. The spec becomes the single source of truth. Everything else, the application logic, the infrastructure configuration, the event schemas, derives from it.

This isn’t documentation-first development in the old sense. It’s not about writing a Word document before you code. It’s about defining the shape of your system - inputs, outputs, events, errors - in a machine-readable format that your tooling, your tests, and your team can all reason about simultaneously.

Why It Matters Across AWS Architectures

The need for specs is a distributed systems problem, and AWS workloads are distributed by nature, whether you’re running microservices on ECS Fargate, a data pipeline on EMR, a real-time processing layer on Kinesis, or a containerized platform on EKS.

Every boundary between components is a contract. A service on ECS calling another service over HTTP, a Kafka consumer reading from MSK, an EKS pod publishing to an SQS queue: each of these is an implicit agreement about shape, behavior, and failure modes. Without a spec, that agreement is undocumented and fragile.

We’ve seen this break in predictable ways. A team running microservices on ECS changes the response schema of an internal API. The downstream service starts returning 500s. No contract test caught it. No spec was ever written. The failure surfaces in production, during peak traffic, after a deployment that passed all unit tests. A spec would have made that breaking change visible before it shipped.

The same pattern plays out in data engineering. A Glue job changes the shape of a Parquet file it writes to S3. The Athena queries downstream start failing. The schema was never formally defined; it was inferred from the data. An explicit schema contract, enforced at write time, would have caught the drift immediately.
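To make "enforced at write time" concrete, here is a minimal sketch. The dataset, the column names, and `EXPECTED_SCHEMA` are hypothetical, and a real Glue job would write Parquet where this sketch simply returns the batch. The point is only that the writer refuses records that drift from the declared contract, instead of leaving downstream queries to discover the change.

```python
# Hypothetical contract for one dataset: column name -> required Python type.
EXPECTED_SCHEMA = {"order_id": str, "amount_cents": int, "currency": str}


def schema_violations(record: dict, schema: dict = EXPECTED_SCHEMA) -> list:
    """Return human-readable contract violations for one record (empty if none)."""
    problems = []
    for column, expected_type in schema.items():
        if column not in record:
            problems.append(f"missing column: {column}")
        elif not isinstance(record[column], expected_type):
            actual = type(record[column]).__name__
            problems.append(f"{column}: expected {expected_type.__name__}, got {actual}")
    # Columns the contract does not know about are drift too.
    for column in sorted(record.keys() - schema.keys()):
        problems.append(f"unexpected column: {column}")
    return problems


def write_batch(records: list) -> list:
    """Reject the whole batch if any record drifts from the contract."""
    for index, record in enumerate(records):
        problems = schema_violations(record)
        if problems:
            raise ValueError(f"record {index} violates schema: {problems}")
    return records  # a real job would write Parquet to S3 at this point


good = {"order_id": "o-1", "amount_cents": 1250, "currency": "EUR"}
drifted = {"order_id": "o-2", "amount_cents": "12.50", "currency": "EUR"}
print(schema_violations(good))
print(schema_violations(drifted))
```

Failing the whole batch is a deliberate choice in this sketch: a partially written file is harder to reason about than a failed job.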

Where Kiro Changes the Game

This is where AWS’s agentic IDE, Kiro, becomes directly relevant to how we practice SDD.

Kiro inverts the model most AI coding tools use. Kiro starts with the spec. When you describe a feature or a system in natural language, Kiro doesn’t generate code. It generates a structured specification first. That spec lives in three artifacts: requirements.md, design.md, and tasks.md.

The requirements file expands your prompt into user stories with acceptance criteria written in EARS (Easy Approach to Requirements Syntax) notation, a structured format that captures preconditions, triggers, and expected system responses, including edge cases that would otherwise surface during implementation. The design file produces a technical design document covering architecture decisions and sequence diagrams. The tasks file breaks the design into discrete, sequenced implementation steps with dependency tracking. Only after you review and approve those artifacts does Kiro begin writing code.

This is spec-driven development operationalized inside your IDE. The spec isn’t a side artifact you produce reluctantly. It’s the primary artifact the entire workflow is built around. Code becomes the build output of the spec, not the other way around.

For teams building on AWS, this matters because it forces the contract conversation to happen before a single line of application or infrastructure code is written. Whether you’re defining an API contract between two ECS services, an event schema for an EventBridge rule, or the interface between a CDK construct and the team consuming it, all of it gets defined and reviewed at the spec level, not discovered during a production incident.

The Spec as the Center of Gravity

The shift SDD requires is treating the spec as the artifact that drives everything else, not as something you generate after the fact.

When we start a new API, the OpenAPI spec is written first. The service is built to satisfy it. The tests validate against it. When a consumer team needs to integrate, we hand them the spec, not a Slack message. The same principle applies to infrastructure. Before a new CDK construct is published for internal use, its interface contract is defined and reviewed. Breaking changes are explicit. They show up in the diff before they show up in a broken pipeline.

This approach forces clarity early. You can’t write a spec for something you haven’t thought through. The act of speccing a service or an event forces you to answer questions you’d otherwise defer: What are the required fields? What are the error states? What does a partial failure look like? What’s the versioning strategy?

The Discipline It Demands

Adopting SDD, with or without Kiro, requires a workflow shift.

Design reviews happen at the spec level. Before any service is built, the team reviews the OpenAPI, AsyncAPI, or infrastructure spec. This is where architectural decisions get made, not in code review.
Mocking becomes trivial. A spec-first API can be mocked immediately. Consumer teams don’t wait for implementation. Parallel development becomes the default.
Breaking changes become visible. When the spec is versioned and diffed, breaking changes are explicit. A field removal or type change shows up in the diff before it shows up in a production incident.
Onboarding accelerates. A new engineer joining the team can understand the system’s boundaries by reading the specs. The spec is the architecture, expressed precisely.
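The "breaking changes become visible" point can be sketched as a tiny spec diff. This is an illustration under simplifying assumptions, not how any particular OpenAPI tool works: each schema version is flattened to a hypothetical mapping of field name to type name, and only removals and type changes count as breaking.

```python
def breaking_changes(old_fields: dict, new_fields: dict) -> list:
    """Compare two versions of a response schema (field name -> type name)."""
    changes = []
    for name, old_type in sorted(old_fields.items()):
        if name not in new_fields:
            changes.append(f"removed field: {name}")
        elif new_fields[name] != old_type:
            changes.append(f"type change: {name} {old_type} -> {new_fields[name]}")
    return changes  # added fields are treated as non-breaking in this sketch


# Hypothetical v1 and v2 of the same response body.
v1 = {"id": "string", "total": "number", "status": "string"}
v2 = {"id": "string", "total": "string", "created_at": "string"}

for change in breaking_changes(v1, v2):
    print(change)
```

Run in CI against the previous version of the spec, a check like this fails the pull request that removes `status`, long before a consumer notices.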

The temptation is always to skip the spec and start building, especially under deadline pressure. But the cost of that shortcut compounds. Every undocumented contract is technical debt that accrues interest in the form of production incidents, integration failures, and onboarding friction. Kiro makes that discipline easier to maintain by making the spec the default starting point, not an afterthought.

Write the spec first. Build to it. Everything else follows.

AI can get you to the first feature fast, but most developers struggle when the system starts to grow.

After a sold-out first cohort, Cohort 2 is now open.

Register here: https://www.eventbrite.co.uk/e/hands-on-spec-driven-development-workshop-cohort-2-tickets-1985498625838?aff=lefteris

Use code SDD40 for 40% off

Spec-Driven Workshop at 40% off

5 Hands-On AWS Projects That Will Prepare You for the SAA-C03

Lefteris Karageorgiou — Wed, 29 Apr 2026 09:30:41 GMT

Most people study for the AWS Solutions Architect Associate by watching 40 hours of video and memorizing answers. Then they get to a real job and freeze.

I took a different approach. I built projects that forced me to understand the services deeply, not just what they do, but why you’d choose them, what breaks under load, and how the pieces connect. I passed the SAA-C03 and could talk confidently in interviews about real implementations. Here are the 5 projects I recommend.

Sponsored by Salesforce

Cross-Platform Consistency with Agentforce AXL

If you’ve ever tried to maintain brand and logic parity for an AI agent across Slack, web, and mobile, you know the pain. Enter the Agentforce Experience Layer (AXL). This new abstraction layer allows you to define agent logic and UI components once and have them render natively across any surface—including third-party platforms like Microsoft Teams.

Stream the AXL Orchestration Session

Watch the developer conference of the year. On demand on Salesforce+

Agentic AI is changing the game and Agentforce is leading the way. Watch TDX on demand to explore dozens of sessions covering the latest innovations across Agentforce, Data 360, the core platform, vibe coding, Slack, and more. All free on Salesforce+.

Watch TDX on Salesforce+ to:

Get roadmap insights from the leaders shaping what’s next
Access broadcast-only moments and exclusive interviews

It all starts with the main keynote, where you’ll experience the future of software and learn how to build it. Watch it now and catch every moment at your own pace.

👉 Watch Now

Let’s now go back to our article and see the 5 projects.

Project 1: Three-Tier Web Application

What you build: A custom VPC with public and private subnets, an Application Load Balancer, an Auto Scaling Group, and an RDS database.

Why it matters: This is the foundational architecture pattern behind almost every production web workload on AWS. The exam tests it constantly, and so does every technical interview.

The critical insight here is traffic flow. Your ALB lives in the public subnet and accepts inbound traffic on port 443. Your EC2 instances sit in private subnets and only accept traffic from the ALB’s security group, not from the internet. Your RDS instance sits in a separate private subnet and only accepts traffic from the EC2 security group. Nothing in the data tier is ever directly reachable.
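The traffic flow above boils down to a chain of security group references. The sketch below models that chain with hypothetical group names (`alb-sg`, `app-sg`, `db-sg`); it is a toy model for reasoning about reachability, not a call into any AWS API.

```python
from typing import Optional

# Hypothetical ingress rules: each group lists the sources allowed to reach it.
INGRESS = {
    "alb-sg": {"internet"},  # public tier: HTTPS from anywhere
    "app-sg": {"alb-sg"},    # only the load balancer may connect
    "db-sg": {"app-sg"},     # only the application tier may connect
}


def can_reach_directly(source: str, target: str, ingress: dict = INGRESS) -> bool:
    """True if `target` has an ingress rule that names `source`."""
    return source in ingress.get(target, set())


def hops_from_internet(target: str, ingress: dict = INGRESS) -> Optional[int]:
    """Number of tiers traffic must traverse to reach `target`, or None."""
    frontier, seen, hops = {"internet"}, {"internet"}, 0
    while frontier:
        hops += 1
        # Groups that accept traffic from anything reached in the previous step.
        reachable = {sg for sg, sources in ingress.items() if sources & frontier} - seen
        if target in reachable:
            return hops
        seen |= reachable
        frontier = reachable
    return None


print(can_reach_directly("internet", "db-sg"))  # False
print(hops_from_internet("db-sg"))              # 3
```

The database is three hops from the internet and never one, which is the property the security group chain exists to guarantee.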

When you build this yourself, you stop memorizing “databases should be private” and start understanding why: defense in depth, blast radius reduction, and compliance requirements that mandate network isolation. You also learn where Auto Scaling actually helps (horizontal scaling behind the ALB) and where it doesn’t, like a single-AZ RDS instance that becomes your bottleneck at 500 concurrent connections.

What the exam will ask: Multi-AZ RDS failover behavior, ALB vs. NLB selection criteria, and how security group rules differ from NACLs. Build this project and those questions answer themselves.

Project 2: Serverless Image Processing Pipeline

What you build: An S3 upload triggers Lambda, which resizes the image, stores metadata in DynamoDB, and sends an SNS notification.

Why it matters: Event-driven architecture is the dominant pattern in modern cloud systems. This project teaches you how AWS services communicate asynchronously, and what happens when they don’t.

The flow looks simple: S3 event → Lambda → DynamoDB write + SNS publish. But building it forces you to confront real decisions. What’s your Lambda timeout? If image processing takes 8 seconds and you set 5, every invocation times out. Because S3 invokes Lambda asynchronously, the event is retried twice and then dropped unless you configure a dead-letter queue or an on-failure destination, so the failure is easy to miss. What’s your DynamoDB write capacity? If you’re processing 200 images per second and your table is provisioned for 50 WCU, you’ll hit throttling. What happens to the SNS notification if the DynamoDB write fails? You’ll need to think about idempotency and partial failure handling.
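The write-capacity question is plain arithmetic, so it is worth sketching. The helper below assumes standard (non-transactional) writes, one metadata item per image, and a sustained rate expressed per second; one WCU covers one write per second for an item of up to 1 KB. The function names are mine, not part of any SDK.

```python
import math


def required_wcu(writes_per_second: float, item_size_kb: float = 1.0) -> int:
    """Provisioned WCUs needed for standard writes.

    One WCU covers one write per second for an item up to 1 KB;
    larger items consume one WCU per started KB.
    """
    return math.ceil(writes_per_second * math.ceil(item_size_kb))


def is_throttled(writes_per_second: float, provisioned_wcu: int,
                 item_size_kb: float = 1.0) -> bool:
    """Ignores burst and adaptive capacity, which soften short spikes."""
    return required_wcu(writes_per_second, item_size_kb) > provisioned_wcu


# 200 small metadata writes per second against a table provisioned for 50 WCU.
print(required_wcu(200))      # 200
print(is_throttled(200, 50))  # True
```

The same helper shows why item size matters: a 2.5 KB metadata item triples the requirement, because each write rounds up to 3 WCU.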

What the exam will ask: S3 event notification targets, Lambda concurrency limits, DynamoDB capacity modes, and SNS delivery guarantees. This project covers all of them.

Project 3: Disaster Recovery Solution

What you build: Cross-region S3 replication, automated RDS backups, and Route 53 failover routing.

Why it matters: DR is one of the most heavily tested domains on the SAA-C03, and it’s also one of the most misunderstood in practice. Most candidates can recite the four DR strategies. Few can explain the trade-offs that determine which one you’d actually choose.

Building this project forces you to internalize the RTO/RPO trade-off with real numbers. Cross-region S3 replication gives you a low RPO for object storage: most objects replicate within 15 minutes, and S3 Replication Time Control is designed to replicate 99.99% of objects within that window. RDS is different. A daily snapshot on its own leaves an RPO of up to 24 hours; automated backups with point-in-time recovery typically narrow that to about five minutes, and cross-region read replicas or Aurora Global Database narrow it further. Route 53 health checks with failover routing can redirect traffic in roughly a minute, depending on the health check interval, failure threshold, and record TTL, but only if your secondary environment is already running (warm standby) or fully active (multi-site active-active).

The exam distinguishes between backup/restore (hours of RTO, lowest cost), pilot light (minutes of RTO, minimal running infrastructure), warm standby (seconds to minutes, scaled-down but live), and multi-site active-active (near-zero RTO, highest cost). Build this project and you’ll understand why a financial services company chooses multi-site active-active at $40,000/month while a content platform chooses backup/restore at $200/month.

What the exam will ask: S3 replication configuration, RDS backup retention, Route 53 routing policies, and how to calculate RTO/RPO for each DR tier.

Project 4: Hybrid Storage Solution

What you build: AWS Storage Gateway connected to on-premises systems, S3 lifecycle policies moving data through storage tiers, and IAM policies controlling access.

Why it matters: Most cloud engineers underestimate how much enterprise workload still runs on-premises. Storage Gateway is the bridge, and understanding it makes you sound experienced in interviews because it signals you’ve thought about migration, not just greenfield architecture.

Storage Gateway has three modes: File Gateway (NFS/SMB access to S3), Volume Gateway (iSCSI block storage backed by S3), and Tape Gateway (virtual tape library for backup software). The exam tests which mode fits which scenario. Build a File Gateway and you’ll understand why a media company with 200TB of on-premises video assets uses it to extend their NAS to S3 without rewriting their editing workflows.

S3 lifecycle policies complete the picture. Moving objects from S3 Standard to S3 Standard-IA after 30 days saves roughly 46% on storage costs. Moving them on to S3 Glacier Instant Retrieval after 90 days saves another 68% relative to Standard-IA. For a workload storing 10TB of infrequently accessed data, that’s the difference between $230/month in S3 Standard and $40/month in Glacier Instant Retrieval. The exam will ask you to design the right lifecycle policy for a given access pattern. Build this project and you’ll answer from experience, not memorization.
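The percentages are easy to sanity-check. The per-GB prices below are assumptions back-derived from the figures in this section ($230 and $40 per month for 10TB, with Standard-IA 46% below Standard), so check current regional pricing before relying on them. Retrieval fees, request fees, and minimum storage durations are ignored.

```python
# Assumed per-GB monthly storage prices, back-derived from the article's figures.
PRICE_PER_GB = {
    "STANDARD": 0.023,
    "STANDARD_IA": 0.0125,
    "GLACIER_IR": 0.004,
}


def monthly_cost(size_gb: float, storage_class: str) -> float:
    """Storage-only cost in dollars for one month."""
    return round(size_gb * PRICE_PER_GB[storage_class], 2)


def savings_pct(from_class: str, to_class: str) -> int:
    """Percentage saved per GB by moving from one class to another."""
    old, new = PRICE_PER_GB[from_class], PRICE_PER_GB[to_class]
    return round((1 - new / old) * 100)


ten_tb_in_gb = 10_000  # decimal terabytes, matching the round figures above
print(monthly_cost(ten_tb_in_gb, "STANDARD"))    # 230.0
print(monthly_cost(ten_tb_in_gb, "GLACIER_IR"))  # 40.0
print(savings_pct("STANDARD", "STANDARD_IA"))    # 46
print(savings_pct("STANDARD_IA", "GLACIER_IR"))  # 68
```

Note that the second saving is measured against Standard-IA, not against Standard; end to end, the move from Standard to Glacier Instant Retrieval is closer to 83% under these prices.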

What the exam will ask: Storage Gateway modes, S3 storage class trade-offs, lifecycle policy configuration, and IAM policy structure for cross-account S3 access.

Project 5: Containerized Microservices Application

What you build: Two services deployed on ECS with Fargate, task definitions with resource limits, ALB path-based routing to each service, and CloudWatch Container Insights for logging and metrics.

Why it matters: Containers are now a core SAA-C03 domain, and ECS with Fargate is the AWS-native answer to “I want containers without managing servers.” Building this project teaches you the ECS mental model (clusters, services, task definitions, and tasks) and how those pieces map to the infrastructure underneath.

The ALB routing piece is where most candidates get confused. Path-based routing lets you send /api/orders/* to your orders service and /api/inventory/* to your inventory service, both running as separate ECS services behind a single ALB. Each service has its own target group, its own task definition with CPU and memory limits, and its own auto-scaling policy based on CPU utilization or request count. When you build this, you understand why a task definition with 256 CPU units and 512MB memory will throttle under load before it scales, and how to set the right CloudWatch alarm threshold to trigger scaling before users notice.
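Path-based routing is easier to reason about once you see it as an ordered list of pattern rules with a default. The sketch below approximates it with Python's `fnmatchcase`; the target group names are hypothetical, and `fnmatch` also gives special meaning to square brackets, which ALB path patterns do not, so treat this as a mental model and not an exact emulation.

```python
from fnmatch import fnmatchcase

# Hypothetical listener rules, evaluated in priority order.
RULES = [
    ("/api/orders/*", "orders-target-group"),
    ("/api/inventory/*", "inventory-target-group"),
]
DEFAULT_TARGET = "default-target-group"


def route(path: str) -> str:
    """Return the target group of the first rule whose pattern matches."""
    for pattern, target_group in RULES:
        if fnmatchcase(path, pattern):  # case-sensitive, like ALB path conditions
            return target_group
    return DEFAULT_TARGET


print(route("/api/orders/42"))         # orders-target-group
print(route("/api/inventory/sku/9"))   # inventory-target-group
print(route("/health"))                # default-target-group
```

One detail this makes visible: `/api/orders` without the trailing slash does not match `/api/orders/*` and falls through to the default action, a common source of confusing 404s.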

CloudWatch Container Insights gives you container-level CPU, memory, network, and disk metrics without any instrumentation. You’ll see exactly which task is consuming resources and correlate it with application logs in the same console. That operational visibility is what separates a working prototype from a production-ready deployment.

What the exam will ask: ECS vs. EKS selection criteria, Fargate vs. EC2 launch type trade-offs, ALB target group configuration, and CloudWatch metrics for container workloads.

Conclusion

Watching videos teaches you what AWS services do. Building projects teaches you what they cost, where they fail, and why you’d choose one over another. Every question on the SAA-C03 is ultimately asking: given these constraints, what’s the right architecture decision?

These five projects give you the intuition to answer that question, not just on the exam, but in the room when a customer asks why their database is the bottleneck, why their DR plan won’t meet their RTO, or why their serverless pipeline is costing more than expected.

Build the projects. Pass the exam. Show up to the interview ready to talk about real systems.

Building Production-Grade API Security

Lefteris Karageorgiou — Wed, 22 Apr 2026 09:31:00 GMT

We built a customer-facing API processing thousands of transactions per minute. The challenge wasn’t just handling the volume; it was protecting against attacks while maintaining performance. Every blocked malicious request saves money and prevents potential breaches, but every millisecond of latency added by security layers impacts user experience.

Sponsored by Salesforce

Salesforce goes Headless 360

The biggest news out of SF last week wasn’t a new UI, it was the decoupling of it. Salesforce is officially pivoting to an API-first “Headless 360” architecture. For those of us tired of the walled garden constraints, this is a massive shift. By separating core CRM logic and data from the standard UI, we can now treat Salesforce as a backend engine for any custom frontend or external application stack.

Click the link here and let me know what you think!

Explore the biggest announcements, launches, and innovations from TDX 2026.

The Requirements

Let’s now go back to our article. We needed security that addressed four critical concerns:

Application-layer protection against SQL injection, XSS, and other OWASP Top 10 attacks that target business logic rather than infrastructure.

DDoS mitigation at both network and application layers without manual intervention. When attacks happen at 3 AM, automated response is non-negotiable.

Rate limiting and throttling that works across a global user base. Simple IP-based limiting breaks with legitimate users behind corporate NATs or mobile carriers.

Performance constraints of sub-200ms API response times even with all security layers active. Security cannot become the bottleneck.

We also needed complete observability into every attack attempt with real-time alerting when patterns emerge. The solution had to scale automatically; security couldn’t become the constraint during traffic spikes.

The AWS Services

API Gateway serves as the entry point, handling authentication, request validation, and throttling before requests reach backend functions. It’s the first line of defense that rejects malformed requests immediately.

AWS WAF sits in front of API Gateway, inspecting every HTTP request against managed rule sets and custom rules. It blocks attacks before they consume API Gateway capacity or cost Lambda invocations.

AWS Shield Standard comes automatically with API Gateway, providing DDoS protection against common network and transport layer attacks. Shield Advanced is available for scenarios requiring dedicated DDoS response team support.

The Flow

When a request hits our regional API, the WAF web ACL associated with the stage evaluates it first. If the request matches a block rule (suspicious patterns, known bad IPs, rate limit exceeded), it’s rejected with a 403 before API Gateway processes it.

For legitimate requests that pass WAF inspection, API Gateway applies its own throttling limits, validates the request structure, checks API keys or JWT tokens, and only then invokes backend functions. If anything fails validation, the request is rejected without consuming compute capacity.

This layered approach means attacks get stopped at the cheapest point possible. WAF blocks cost nothing beyond the WAF inspection fee. API Gateway rejections cost nothing beyond the API call. Only legitimate requests that pass all checks consume Lambda invocations.

Deep Dive: WAF Rule Strategy

We started with AWS Managed Rules; they catch 90% of common attacks immediately. But we hit false positives on legitimate API calls that included JSON payloads with special characters.

Our custom rules operate with three priorities:

IP reputation list blocks known malicious IPs. We integrated threat intelligence feeds that update hourly. This stops repeat offenders before they even attempt an attack.

Rate-based rules block IPs making more than 2,000 requests in five minutes. This catches credential stuffing attempts targeting our login endpoints. We observed attackers cycling through stolen credentials at exactly this rate, fast enough to test thousands of accounts but slow enough to avoid obvious detection.

Geo-blocking rules target countries with no legitimate users but high attack traffic. This was controversial internally, but the data was clear: 95% of attacks came from regions with zero paying customers.

The critical decision: we set WAF to count mode initially, not block. We observed traffic patterns for two weeks, identified false positives, tuned rules, then switched to block mode. This prevented accidentally blocking legitimate users during the learning phase.
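The rate-based rule and the count-then-block rollout can be modeled together. The class below is a simplified sliding-window model and not how AWS WAF is implemented internally; the `mode` flag stands in for the rule action, so you can replay traffic in count mode and see what would have been blocked.

```python
from collections import defaultdict, deque


class RateBasedRule:
    """Flags a client once it exceeds `limit` requests in the trailing window."""

    def __init__(self, limit: int = 2000, window_seconds: int = 300, mode: str = "count"):
        self.limit = limit
        self.window_seconds = window_seconds
        self.mode = mode  # "count" only records matches, "block" rejects them
        self._hits = defaultdict(deque)  # ip -> timestamps of recent requests

    def evaluate(self, ip: str, now: float) -> str:
        hits = self._hits[ip]
        # Drop timestamps that have aged out of the window.
        while hits and hits[0] <= now - self.window_seconds:
            hits.popleft()
        hits.append(now)
        if len(hits) <= self.limit:
            return "ALLOW"
        return "BLOCK" if self.mode == "block" else "COUNT"


# Small numbers for readability: 3 requests per 10 seconds, observing only.
rule = RateBasedRule(limit=3, window_seconds=10, mode="count")
for second in range(5):
    print(second, rule.evaluate("203.0.113.7", now=second))
```

With `mode="count"`, the fourth and fifth requests are reported as `COUNT` and still pass; switching the same rule to `mode="block"` is the moment the two-week observation period ends.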

Deep Dive: API Gateway Security Features

API Gateway provides multiple security layers beyond basic routing.

Request validation happens before Lambda invocation, rejecting requests with malformed JSON, missing required parameters, or invalid data types. This prevents malicious payloads from reaching backend code. We defined JSON schemas for every endpoint. They’re verbose to maintain, but they stop entire classes of attacks.

Authentication integrates with multiple mechanisms. We use API keys for simple use cases, Lambda authorizers for custom logic, and Cognito user pools for OAuth 2.0 flows. Each request is validated before consuming any compute resources.

Resource policies add another control layer. We restricted our internal APIs to specific VPCs, preventing public internet access entirely while maintaining the benefits of API Gateway’s managed service.

Deep Dive: Throttling Strategy

API Gateway’s throttling operates at multiple levels, and understanding each is critical.

Account-level limits serve as a safety net, but real control comes from usage plans. Each usage plan has both rate limits (steady-state) and burst capacity. Burst capacity matters during traffic spikes when clients temporarily exceed their rate limit using burst tokens. We set burst to 2x the rate limit, which handles legitimate spikes without triggering false positives.
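Rate and burst are the two parameters of a token bucket, which is the algorithm API Gateway documents for its throttling. The sketch below is a minimal single-client model with illustrative numbers (rate 100 per second, burst 200, matching the 2x ratio above), not API Gateway's actual implementation.

```python
class TokenBucket:
    """Refills at `rate` tokens per second, holding at most `burst` tokens."""

    def __init__(self, rate: float, burst: float):
        self.rate = rate
        self.burst = burst
        self.tokens = burst   # start with a full bucket
        self.updated_at = 0.0

    def allow(self, now: float) -> bool:
        # Refill for the elapsed time, capped at the burst size.
        elapsed = now - self.updated_at
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.updated_at = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # API Gateway would answer 429 Too Many Requests here


bucket = TokenBucket(rate=100, burst=200)
spike = [bucket.allow(now=0.0) for _ in range(250)]
print(sum(spike))             # 200 requests pass, 50 are throttled
print(bucket.allow(now=0.5))  # True: half a second refills 50 tokens
```

The burst value decides how large a spike passes untouched; the rate decides how quickly the client recovers afterwards.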

Custom throttling in WAF for specific endpoints adds another layer. Our login endpoints get rate-limited to 10 requests per minute per IP, far more restrictive than general API limits. This stops brute force attacks without impacting normal API usage.

Method-level throttling provides even finer control. Read operations (GET) can have higher limits than write operations (POST, PUT, DELETE), reflecting their different resource consumption and risk profiles. We observed that 80% of attacks targeted write endpoints, so we throttled them more aggressively.

Deep Dive: Shield Protection

Shield Standard provides automatic protection against common DDoS attacks at the network and transport layers. It detects and mitigates SYN floods, UDP reflection attacks, and other volumetric attacks without any configuration.

The protection operates inline with minimal latency impact. Shield’s detection algorithms analyze traffic patterns in real-time, distinguishing legitimate traffic spikes from attack patterns. When attacks are detected, mitigation happens automatically within seconds.

For our API, Shield Advanced wasn’t initially necessary. The decision point: Shield Advanced makes sense when API downtime costs exceed $3,000/month or when regulatory requirements mandate dedicated security response. We added it after a competitor suffered a multi-day DDoS attack that made headlines.

Monitoring and Observability

WAF logs every blocked request to CloudWatch Logs. We stream these to S3 for long-term analysis. Analyzing these logs reveals attack patterns, identifies false positives, and guides rule tuning. We set up CloudWatch alarms for blocked request spikes: when blocks exceed 1,000 per minute, we get paged.

API Gateway metrics expose throttling events, 4xx/5xx error rates, and latency distributions. We correlate WAF blocks with API Gateway metrics to see the full security picture: which attacks reached API Gateway and which were blocked at WAF.

X-Ray tracing adds end-to-end visibility, showing exactly where requests spend time and where failures occur. This proved essential when debugging whether performance issues stemmed from security layers, backend code, or downstream services.

Cost Breakdown

WAF costs run approximately $5/month base fee plus $1 per million requests plus $1 per rule per month. With 10 custom rules and 100 million requests monthly, that’s $115/month.

API Gateway costs are $3.50 per million requests for REST APIs. At 100 million requests monthly, that’s $350/month. Caching can reduce backend invocations significantly.

Shield Standard is included at no additional cost. Shield Advanced costs $3,000/month plus data transfer fees, only justified for high-value APIs where downtime is extremely costly.

Lambda invocations saved by blocking malicious requests represent the hidden cost savings. We block approximately 5 million malicious requests monthly. At $0.20 per million requests, those would have cost about $1 in Lambda request charges alone, before duration charges and potential data breach costs.

Total monthly cost for our security stack: approximately $465 for WAF and API Gateway, blocking attacks that would cost far more in compute resources and potential breaches.
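The totals above can be reproduced in a few lines, which also makes it easy to plug in your own request volume. The unit prices are the ones quoted in this article; the $0.20 per million Lambda requests is an assumption implied by the "$1 for 5 million requests" figure. Verify all of them against current AWS pricing.

```python
def waf_monthly_cost(requests_millions: float, rules: int) -> float:
    """$5 base fee + $1 per rule + $1 per million requests (article figures)."""
    return 5 + rules * 1 + requests_millions * 1


def api_gateway_monthly_cost(requests_millions: float) -> float:
    """$3.50 per million REST API requests (article figure)."""
    return requests_millions * 3.50


def lambda_request_cost(requests_millions: float) -> float:
    """Assumed $0.20 per million requests; excludes duration charges."""
    return round(requests_millions * 0.20, 2)


waf = waf_monthly_cost(requests_millions=100, rules=10)
apigw = api_gateway_monthly_cost(requests_millions=100)
print(waf)                     # 115
print(apigw)                   # 350.0
print(waf + apigw)             # 465.0
print(lambda_request_cost(5))  # 1.0
```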

Key Takeaways

This architecture stops millions of malicious requests monthly, requests that would cost money and potentially compromise systems. The layered approach proves essential: WAF catches obvious attacks, API Gateway enforces business logic throttling and authentication, and Shield protects against network-layer DDoS. No single layer would be sufficient.

Tracing Serverless Applications with AWS X-Ray

Lefteris Karageorgiou — Tue, 14 Apr 2026 09:29:34 GMT

In the world of serverless architectures, understanding how requests flow through your distributed system can feel like navigating a maze blindfolded. AWS X-Ray emerges as your guiding light, providing the visibility needed to trace requests as they traverse through Lambda functions, API Gateway endpoints, SQS queues, and other AWS services.

Sponsored by Salesforce

The shift toward Agentic AI isn’t just another buzzword—it’s redefining how we design, architect, and build modern systems.

That’s exactly why I’ll be tuning into TDX on Salesforce+ to explore how Agentforce is shaping the next generation of the developer roadmap.

If you’re serious about staying ahead of the curve, I highly recommend grabbing a free spot and joining the sessions.

Plus, when you register, you’ll be automatically entered for a chance to win one of 20 AI exam vouchers (valued at $200)—available to legal residents of the U.S., Canada, New Zealand, and the U.K.

Stream the developer conference of the year. Live on Salesforce+.

Agentic AI is changing the game and Agentforce is leading the way. Stream TDX to join dozens of sessions and virtual hands-on trainings that explore the latest innovations across Agentforce, Data 360, the core platform, vibe coding, Slack, and more. All free on Salesforce+.

Tune in to TDX on Salesforce+ to:

Build hands-on skills with virtual trainings and live demos
Get roadmap insights from the leaders shaping what’s next
Access broadcast-only moments and exclusive interviews

It all kicks off with the main keynote, where you’ll experience the future of software and learn how to build it. Add it to your calendar so you don’t miss a moment.

👉 Register for free

Why Tracing Matters in Serverless

Serverless applications are inherently distributed. A single user request might trigger an API Gateway endpoint, invoke multiple Lambda functions, write to DynamoDB, publish messages to SNS, and queue tasks in SQS. When something goes wrong, or even when you’re optimizing performance, pinpointing the bottleneck becomes challenging without proper tracing.

X-Ray solves this by creating a complete map of your request journey, showing you exactly where time is spent, where errors occur, and how services interact with each other. It’s the difference between guessing and knowing.

Understanding the Service Map

The service map is X-Ray’s visual representation of your application architecture. It automatically discovers and displays all the services your application uses, showing the relationships between them. Each node represents a service, and the connections show how requests flow between them.

What makes this powerful is the real-time health indicators. You can immediately spot services with high error rates or elevated latency. The color coding provides instant visual feedback: green for successful calls, yellow for client errors (4xx), red for server faults (5xx), and purple for throttling. This bird’s-eye view helps you understand your system’s behavior at a glance.

Traces and Segments: The Building Blocks

Every request that flows through your application creates a trace. Within each trace, individual services create segments that represent the work they perform. For Lambda functions, a segment captures the entire function execution. For downstream calls to DynamoDB or other services, subsegments provide granular detail.

The trace timeline shows you exactly how long each segment took, helping you identify slow operations. You can drill down into specific traces to see the complete request path, including all the metadata, annotations, and errors associated with each segment.

Practical Tips for Effective Tracing

Choose your sampling strategy wisely. X-Ray uses sampling to balance cost and visibility. The default sampling rule captures the first request each second and 5% of additional requests. For production environments, this is usually sufficient. However, for critical workflows or during troubleshooting, consider creating custom sampling rules to capture 100% of specific API paths or error conditions.
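The default rule is simple enough to estimate by hand. The helper below computes the expected number of traces per second for a reservoir-plus-fixed-rate rule; the function and parameter names are mine, and it is a back-of-the-envelope model, not the X-Ray SDK's sampler.

```python
def expected_sampled(requests_per_second: int, reservoir: int = 1,
                     fixed_rate: float = 0.05) -> float:
    """Expected traces recorded per second under a reservoir + fixed-rate rule."""
    from_reservoir = min(requests_per_second, reservoir)
    remaining = requests_per_second - from_reservoir
    return from_reservoir + remaining * fixed_rate


# Default rule at 1,000 requests per second: 1 + 5% of the other 999.
print(round(expected_sampled(1000), 2))  # 50.95

# A hypothetical custom rule for a checkout path: trace everything.
print(expected_sampled(40, reservoir=0, fixed_rate=1.0))  # 40.0
```

At low traffic the reservoir dominates, so a quiet endpoint is still traced at least once per second; at high traffic the 5% rate dominates, which is what keeps cost proportional.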

Use annotations for filtering and grouping. Annotations are indexed key-value pairs that you can use to filter traces in the X-Ray console. Add annotations for user IDs, transaction types, or feature flags. This makes it easy to analyze specific user journeys or compare performance across different code paths.

Leverage metadata for context. Unlike annotations, metadata isn’t indexed but provides rich contextual information when viewing individual traces. Include relevant business context, configuration values, or debugging information that helps you understand what was happening during the request.

Monitor cold starts separately. Lambda cold starts can significantly impact performance. Use X-Ray annotations to tag cold start invocations, allowing you to analyze their frequency and impact separately from warm starts. This helps you make informed decisions about provisioned concurrency.

Set up alarms on key metrics. X-Ray integrates with CloudWatch, allowing you to create alarms based on error rates, latency percentiles, or fault rates. Don’t wait to discover problems; let CloudWatch alarms notify you when thresholds are breached.

Trace asynchronous workflows. For event-driven architectures using SNS, SQS, or EventBridge, ensure you’re propagating trace context through message attributes. This maintains the trace continuity across asynchronous boundaries, giving you end-to-end visibility even in complex event chains.
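Propagating trace context by hand comes down to carrying the X-Ray tracing header string across the boundary. The sketch below parses the `Root=...;Parent=...;Sampled=...` format and attaches it to a message; the attribute name `trace_header` and the helper functions are hypothetical, chosen for illustration, and many AWS integrations carry this header for you once active tracing is enabled.

```python
def parse_trace_header(header: str) -> dict:
    """Split 'Root=...;Parent=...;Sampled=1' into a dict."""
    fields = {}
    for part in header.split(";"):
        key, _, value = part.strip().partition("=")
        if key and value:
            fields[key] = value
    return fields


def child_trace_header(incoming: str, new_parent_id: str) -> str:
    """Keep the trace ID and sampling decision, but point Parent at the
    segment that publishes the message so the consumer attaches under it."""
    fields = parse_trace_header(incoming)
    fields["Parent"] = new_parent_id
    ordered = ("Root", "Parent", "Sampled")
    return ";".join(f"{key}={fields[key]}" for key in ordered if key in fields)


def build_message(body: str, trace_header: str) -> dict:
    # "trace_header" is a hypothetical attribute name chosen for this sketch.
    return {
        "MessageBody": body,
        "MessageAttributes": {
            "trace_header": {"DataType": "String", "StringValue": trace_header}
        },
    }


incoming = "Root=1-5759e988-bd862e3fe1be46a994272793;Parent=53995c3f42cd8ad8;Sampled=1"
outgoing = child_trace_header(incoming, new_parent_id="a1b2c3d4e5f60718")
print(build_message('{"order_id": "o-1"}', outgoing))
```

The consumer reverses the process: it reads the attribute, parses it, and starts its own segment with the same `Root`, which is what stitches both sides into one trace.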

Use trace groups for organization. As your application grows, create trace groups to organize and filter traces by environment, application component, or team ownership. This keeps your X-Ray console manageable and helps teams focus on their specific services.

The Bottom Line

AWS X-Ray transforms serverless observability from reactive debugging to proactive optimization. By providing clear visibility into request flows, performance characteristics, and error patterns, it empowers you to build more reliable and efficient serverless applications.

The key is to integrate X-Ray early in your development process, not as an afterthought when problems arise. With proper instrumentation and thoughtful use of annotations and sampling, X-Ray becomes an invaluable tool for understanding and improving your serverless architecture.

API Gateway Canary Deployments: A Strategic Approach to Safe Releases

Lefteris Karageorgiou — Wed, 08 Apr 2026 09:30:51 GMT

Canary deployments are one of the most powerful risk mitigation strategies available to cloud engineers working with AWS API Gateway and Lambda. They allow you to test new versions of your APIs in production with real traffic while minimizing the blast radius of potential issues. Understanding how they work and when to use them can be the difference between a smooth rollout and a production incident.

What Are Canary Deployments?

A canary deployment is a progressive rollout strategy where you route a small percentage of production traffic to a new version of your application while the majority continues using the stable version. The term comes from the historical practice of using canaries in coal mines as early warning systems: if something goes wrong with the new version, only a small subset of users is affected.

In API Gateway, canary deployments work at the stage level. When you create a canary, you’re essentially splitting traffic between two versions of your API: the base deployment and the canary deployment. Each can point to different Lambda function versions or aliases, allowing you to test new code with real production traffic.

How Canary Deployments Work with Lambda

The integration between API Gateway canary deployments and Lambda is straightforward but powerful. When you deploy a canary in API Gateway, you specify a percentage of traffic to route to the canary version, typically starting with 5-10%. The remaining traffic continues flowing to your stable base deployment.

API Gateway uses a weighted random distribution to split traffic. This means if you set a 10% canary weight, approximately 10% of requests will hit your canary Lambda function version, while 90% go to the base version. The distribution happens at the request level, not the user level, so individual users might experience both versions during a canary deployment.
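To make the request-level split concrete, here is a small simulation of weighted random routing. It is a model of the behavior described above, not API Gateway’s actual implementation; the function names and the fixed seed are assumptions for the example.

```python
import random


def route_request(canary_weight: float, rng: random.Random) -> str:
    """Pick a deployment for one request using a weighted coin flip."""
    return "canary" if rng.random() < canary_weight else "base"


def simulate(requests: int, canary_weight: float, seed: int = 42) -> dict:
    """Route a batch of requests and count how many land on each deployment."""
    rng = random.Random(seed)
    counts = {"base": 0, "canary": 0}
    for _ in range(requests):
        counts[route_request(canary_weight, rng)] += 1
    return counts


counts = simulate(10_000, 0.10)
print(counts)  # roughly 10% of requests land on the canary
```

Because the decision is made per request, a single user sending several requests can land on both versions, which is worth keeping in mind if the new version changes response shapes.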

The key is that both deployments exist simultaneously in the same stage. Your base deployment might point to Lambda alias “production” running version 5, while your canary points to alias “canary” running version 6. This allows you to validate new functionality with real traffic patterns before committing to a full rollout.
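One common way to wire this up is to reference a stage variable in the Lambda integration and let the canary override it. The sketch below only builds the arguments for boto3’s API Gateway create_deployment call; the stage variable name lambdaAlias, the API ID, and the helper name are assumptions for illustration.

```python
def canary_deployment_request(
    rest_api_id: str, stage_name: str, percent_traffic: float, canary_alias: str
) -> dict:
    """Build keyword arguments for a canary deployment on an existing stage."""
    if not 0.0 <= percent_traffic <= 100.0:
        raise ValueError("percent_traffic must be between 0 and 100")
    return {
        "restApiId": rest_api_id,
        "stageName": stage_name,
        "canarySettings": {
            "percentTraffic": percent_traffic,
            # Hypothetical stage variable that the integration URI references,
            # so canary requests resolve to a different Lambda alias.
            "stageVariableOverrides": {"lambdaAlias": canary_alias},
            "useStageCache": False,
        },
    }


request = canary_deployment_request("abc123", "prod", 10.0, "canary")
print(request["canarySettings"])

# With AWS credentials configured, the request could be sent like this:
#   import boto3
#   boto3.client("apigateway").create_deployment(**request)
```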

Practical Tips for Success

Start Small and Increase Gradually: Begin with 5-10% traffic to your canary. Monitor metrics closely for 15-30 minutes before increasing. A common progression is 10% → 25% → 50% → 100%, with monitoring periods between each increase.
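The progression above can be sketched as a loop that only advances while a health check passes. The is_healthy callback is a stand-in for whatever monitoring period and metric checks you use; in a real pipeline it would wait out the monitoring window before answering.

```python
from typing import Callable, Iterable


def run_rollout(
    steps: Iterable[int], is_healthy: Callable[[int], bool]
) -> tuple[str, int]:
    """Walk through traffic percentages and stop at the first unhealthy step.

    Returns the outcome and the last percentage that passed its health check.
    """
    last_good = 0
    for percent in steps:
        if not is_healthy(percent):
            return ("rolled_back", last_good)
        last_good = percent
    return ("promoted", last_good)


# Example: the canary starts failing its checks once it reaches 50% of traffic.
print(run_rollout([10, 25, 50, 100], lambda percent: percent < 50))
```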

Use Lambda Aliases, Not Versions Directly: Always point your API Gateway integrations to Lambda aliases rather than specific versions. This gives you flexibility to update what version an alias points to without redeploying your API. Use one alias for production traffic and another for canary testing.

Monitor the Right Metrics: Focus on error rates, latency percentiles (especially p99), and business-specific metrics. Don’t just watch averages: a 1% error rate on your canary means 1 in 100 requests routed to it is failing, and those failures land on real customers. Set up CloudWatch alarms that compare canary metrics against baseline metrics.
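A minimal sketch of comparing canary metrics against the baseline might look like the following. The thresholds (half a percentage point of extra errors, 20% slower p99) are arbitrary assumptions for the example, and the inputs are plain lists rather than CloudWatch data.

```python
def p99(latencies_ms: list[float]) -> float:
    """Nearest-rank 99th percentile of a non-empty list of latencies."""
    ordered = sorted(latencies_ms)
    rank = (99 * len(ordered) + 99) // 100  # integer ceiling of 0.99 * n
    return ordered[rank - 1]


def error_rate(status_codes: list[int]) -> float:
    """Fraction of responses with a 5xx status code."""
    return sum(1 for code in status_codes if code >= 500) / len(status_codes)


def canary_regressed(
    base_codes: list[int],
    canary_codes: list[int],
    base_latencies: list[float],
    canary_latencies: list[float],
    max_error_delta: float = 0.005,
    max_latency_ratio: float = 1.2,
) -> bool:
    """Flag the canary if errors or p99 latency are clearly worse than base."""
    error_delta = error_rate(canary_codes) - error_rate(base_codes)
    latency_ratio = p99(canary_latencies) / p99(base_latencies)
    return error_delta > max_error_delta or latency_ratio > max_latency_ratio


latencies = [float(ms) for ms in range(1, 101)]
print(canary_regressed([200] * 1000, [200] * 99 + [500], latencies, latencies))
```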

Implement Proper Logging: Ensure your Lambda functions log which version they are. Include version information in structured logs so you can quickly filter and compare behavior between canary and base deployments during troubleshooting.
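As a sketch of version-aware structured logging, the handler below tags every log line with the function version taken from the Lambda context object, falling back to the AWS_LAMBDA_FUNCTION_VERSION environment variable. The log field names are assumptions for the example.

```python
import json
import os


def log_line(message: str, function_version: str, **fields) -> str:
    """Render one structured log record as a JSON string."""
    record = {"message": message, "function_version": function_version, **fields}
    return json.dumps(record, sort_keys=True)


def handler(event, context):
    """Log the running version with each request so logs can be filtered by it."""
    version = getattr(context, "function_version", None) or os.environ.get(
        "AWS_LAMBDA_FUNCTION_VERSION", "unknown"
    )
    print(log_line("request received", version, path=event.get("path")))
    return {"statusCode": 200, "body": "ok"}
```

Filtering on the function_version field then separates canary traffic from base traffic during troubleshooting.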

Have a Rollback Plan: Know how to quickly promote your canary to 100% if it’s successful, or delete it entirely if problems arise. Practice these operations in non-production environments. The API Gateway console and CLI both support these operations, but automation through CI/CD pipelines is ideal.
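For automation, promotion and rollback can both be expressed as patch operations on the stage. The sketch below builds the patch lists in the shape that boto3’s update_stage expects; the specific paths follow the pattern shown in the API Gateway documentation as I understand it, so treat them as an assumption and verify them against the current docs before relying on them.

```python
def promote_canary_patch() -> list[dict]:
    """Patch operations that make the canary deployment the stage's base."""
    return [
        {"op": "replace", "path": "/canarySettings/percentTraffic", "value": "0.0"},
        {
            "op": "copy",
            "from": "/canarySettings/stageVariableOverrides",
            "path": "/variables",
        },
        {"op": "copy", "from": "/canarySettings/deploymentId", "path": "/deploymentId"},
    ]


def rollback_canary_patch() -> list[dict]:
    """Patch operation that removes the canary and leaves the base untouched."""
    return [{"op": "remove", "path": "/canarySettings"}]


# With AWS credentials configured, either list could be applied like this
# (the API ID and stage name are placeholders):
#   import boto3
#   boto3.client("apigateway").update_stage(
#       restApiId="abc123", stageName="prod",
#       patchOperations=rollback_canary_patch(),
#   )
print(rollback_canary_patch())
```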

Consider Stage Variables: Use API Gateway stage variables to pass configuration to your Lambda functions. This allows the same Lambda code to behave differently based on whether it’s handling base or canary traffic, useful for feature flags or environment-specific settings.
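With a Lambda proxy integration, stage variables arrive in the event under the stageVariables key, so a handler can branch on them. The variable name enableNewPricing and the pricing logic are hypothetical, chosen only to show the pattern.

```python
import json


def handler(event, context):
    """Switch behavior based on a stage variable set differently for the canary."""
    # stageVariables can be absent or null when no variables are defined.
    stage_vars = event.get("stageVariables") or {}
    use_new_pricing = stage_vars.get("enableNewPricing") == "true"
    price = 9.0 if use_new_pricing else 10.0
    return {
        "statusCode": 200,
        "body": json.dumps({"price": price, "new_pricing": use_new_pricing}),
    }


print(handler({"stageVariables": {"enableNewPricing": "true"}}, None)["body"])
```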

Test with Realistic Load: Canary deployments are most valuable when they experience real production traffic patterns. Avoid deploying canaries during low-traffic periods if possible, because you want enough requests to generate statistically significant results.

Don’t Rush the Process: The entire point of canary deployments is risk reduction. If you promote a canary to 100% within minutes, you’re not giving yourself time to detect issues. Plan for canary periods of at least several hours for critical APIs.

When to Use Canaries

Canary deployments shine when rolling out significant changes: new business logic, performance optimizations, dependency updates, or architectural changes. They’re less critical for minor bug fixes or configuration changes, though the safety they provide is always valuable.

The overhead of managing canaries is minimal compared to the protection they offer. For production APIs serving real customers, canary deployments should be your default deployment strategy, not an exception.

Multi-Region Serverless Architectures: Building Global Resilience with AWS

Lefteris Karageorgiou — Wed, 01 Apr 2026 09:30:52 GMT

Building applications that serve users across the globe requires more than just deploying to a single region. Multi-region architectures provide low latency, high availability, and disaster recovery capabilities that are essential for modern cloud applications. In this article, we’ll explore how to architect a truly global serverless application using AWS services.

Architecture Overview

Our multi-region architecture consists of four key components working together:

Route 53: Provides intelligent DNS routing to direct users to the nearest regional endpoint
API Gateway: Regional REST or HTTP APIs that serve as entry points in each region
Lambda: Compute layer that processes requests with minimal latency
DynamoDB Global Tables: Multi-region, fully replicated database with automatic conflict resolution

This architecture provides active-active deployment across multiple AWS regions, ensuring that if one region experiences issues, traffic automatically fails over to healthy regions.

The Architecture Flow

At the heart of a multi-region serverless architecture lies a carefully orchestrated chain of AWS services working together to deliver a seamless global experience.

Route 53: The Global Traffic Director

Amazon Route 53 serves as the entry point for all user requests. Using health checks and routing policies like latency-based or geoproximity routing, Route 53 intelligently directs users to the nearest healthy regional endpoint. This ensures that a user in Tokyo connects to the Asia Pacific region while a user in Frankfurt connects to Europe, minimizing latency and improving user experience.
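The routing decision can be modeled as picking the lowest-latency region among those that pass health checks. This is a simplified model for intuition, not how Route 53 is implemented; the latency figures and function name are made up for the example.

```python
def pick_region(latency_ms: dict[str, float], healthy: set[str]) -> str:
    """Return the healthy region with the lowest measured latency."""
    candidates = {
        region: latency for region, latency in latency_ms.items() if region in healthy
    }
    if not candidates:
        raise RuntimeError("no healthy regions available")
    return min(candidates, key=candidates.get)


# Hypothetical latencies as seen by a user in Tokyo.
tokyo_user = {"ap-northeast-1": 20.0, "us-east-1": 150.0, "eu-central-1": 230.0}

print(pick_region(tokyo_user, {"ap-northeast-1", "us-east-1", "eu-central-1"}))
# If the nearest region fails its health check, the next best one is chosen.
print(pick_region(tokyo_user, {"us-east-1", "eu-central-1"}))
```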

API Gateway: Regional Entry Points

Each region hosts its own API Gateway endpoint, acting as the front door for that region’s serverless infrastructure. API Gateway handles request validation, throttling, and authentication before passing requests to the compute layer. With features like request/response transformation and, on REST APIs, AWS WAF integration, it provides a robust security perimeter for your application.

Lambda: Distributed Compute

Behind each regional API Gateway sit AWS Lambda functions that execute your business logic. These functions are deployed identically across all regions, ensuring consistent behavior regardless of where the request originates. Lambda’s automatic scaling handles traffic spikes in each region independently, while its pay-per-use model keeps costs optimized even with a multi-region footprint.

DynamoDB Global Tables: The Data Layer

The foundation of this architecture is DynamoDB Global Tables, which provides fully managed, multi-region, multi-active database replication. When a Lambda function writes data in one region, DynamoDB automatically replicates that data to all other configured regions, typically within seconds. This means users can read and write data from their nearest region, and the replicas converge on the same state once replication catches up.

Key Architectural Considerations

Conflict Resolution: DynamoDB Global Tables uses a last-writer-wins reconciliation strategy. Your application design should account for this, potentially using timestamps or version numbers to handle concurrent updates across regions.
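To see what last-writer-wins means in practice, the sketch below reconciles two concurrent writes to the same item from different regions. The Write type and timestamps are assumptions for illustration; the point is that the earlier write is silently discarded, which is why the application design has to account for it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Write:
    region: str
    value: str
    timestamp: float  # seconds; assumed comparable across regions


def last_writer_wins(a: Write, b: Write) -> Write:
    """Keep whichever write carries the later timestamp."""
    return a if a.timestamp >= b.timestamp else b


# Two regions update the same item within the replication window.
frankfurt = Write("eu-central-1", "address A", 100.0)
virginia = Write("us-east-1", "address B", 100.4)

winner = last_writer_wins(frankfurt, virginia)
print(winner.region, winner.value)  # the Frankfurt update is lost
```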

Replication Lag: While DynamoDB replication is fast, there’s still a brief window where data might not be consistent across all regions. Design your application to handle eventual consistency gracefully.

Regional Failover: Route 53 health checks continuously monitor your regional endpoints. If a region becomes unhealthy, traffic automatically routes to the next best region, providing transparent failover without user intervention.

Cost Implications: Running infrastructure in multiple regions increases costs through data transfer charges and duplicated resources. Balance the number of regions against your actual user distribution and availability requirements.

The Benefits

This architecture delivers several compelling advantages. Users experience lower latency by connecting to nearby infrastructure. Your application remains available even if an entire AWS region experiences an outage. Data is protected through geographic redundancy. And perhaps most importantly, you can scale globally without redesigning your architecture.

When to Use This Pattern

Multi-region architectures make sense for applications with a global user base, strict availability requirements, or regulatory needs for data residency. However, they add complexity and cost, so evaluate whether your application truly needs global distribution or if a single-region deployment with good disaster recovery would suffice.

Conclusion

Building a multi-region serverless architecture with Route 53, API Gateway, Lambda, and DynamoDB Global Tables provides a robust foundation for globally distributed applications. This architecture delivers low-latency responses to users worldwide while maintaining high availability and disaster recovery capabilities.

The key to success is embracing eventual consistency, implementing comprehensive monitoring, and regularly testing your failover mechanisms. While the initial setup requires careful planning, the operational benefits of automatic scaling, reduced latency, and improved reliability make it worthwhile for applications serving a global user base.

A Beginner’s Guide to Testing Serverless Applications

Lefteris Karageorgiou — Wed, 25 Mar 2026 10:30:58 GMT

Testing serverless applications presents unique challenges that differ significantly from traditional application testing. The ephemeral nature of Lambda functions, the distributed architecture of serverless systems, and the tight integration with managed AWS services require a thoughtful approach to testing strategy. This article explores comprehensive testing methodologies for serverless applications with practical tools and frameworks.

The Serverless Testing Challenge

Serverless architectures introduce complexity that makes testing more nuanced than traditional applications. Lambda functions don’t run in isolation; they interact with API Gateway, DynamoDB, S3, SQS, SNS, EventBridge, and numerous other AWS services. The pay-per-invocation model means you can’t simply spin up a test environment and leave it running. Additionally, the distributed nature of serverless systems means that failures can cascade across multiple services in ways that are difficult to predict and reproduce.

The testing pyramid for serverless applications looks different from traditional applications. While you still want a solid foundation of unit tests, the integration layer becomes significantly more important because so much of your application’s behavior depends on how services interact with each other.

Unit Testing Lambda Functions

Unit testing Lambda functions focuses on testing your business logic in isolation from AWS services. The key principle is to separate your core logic from the Lambda handler and AWS SDK calls. This separation allows you to test your business logic without needing to mock AWS services or make actual API calls.

Testing Frameworks: Use Jest or Mocha for Node.js Lambda functions, pytest for Python, JUnit for Java, or Go’s built-in testing package for Go-based functions. These frameworks provide the foundation for writing and running your unit tests with features like test discovery, assertions, and test reporting.

Your Lambda handler should be thin, primarily responsible for parsing events, calling your business logic, and formatting responses. The actual business logic should live in separate modules that can be tested independently. This architectural pattern makes your code more testable and maintainable.
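A minimal sketch of this split, assuming a hypothetical order-total use case: the handler only parses the event and formats the response, while calculate_order_total holds the logic and can be tested without any Lambda or AWS machinery.

```python
import json


def calculate_order_total(items: list[dict]) -> float:
    """Pure business logic: no event parsing, no AWS calls."""
    if not items:
        raise ValueError("order must contain at least one item")
    total = 0.0
    for item in items:
        if item["quantity"] <= 0:
            raise ValueError("quantity must be positive")
        total += item["price"] * item["quantity"]
    return round(total, 2)


def handler(event, context):
    """Thin handler: parse the event, call the logic, format the response."""
    try:
        body = json.loads(event.get("body") or "{}")
        total = calculate_order_total(body.get("items", []))
    except (ValueError, KeyError, TypeError, AttributeError) as exc:
        return {"statusCode": 400, "body": json.dumps({"error": str(exc)})}
    return {"statusCode": 200, "body": json.dumps({"total": total})}
```

Tests for calculate_order_total can then cover empty orders, bad quantities, and missing fields directly, and a handful of handler tests confirm that malformed input maps to a 400 response.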

When unit testing, focus on testing edge cases, error conditions, and business rule validation. Test how your code handles malformed input, missing required fields, and unexpected data types. These tests should run quickly and require no external dependencies, making them ideal for rapid feedback during development.

Mocking Tools: Mock external dependencies at the boundaries of your application using tools like aws-sdk-mock for JavaScript, or moto for Python. Rather than mocking the entire AWS SDK, create abstraction layers or interfaces that represent the operations your code needs to perform. This approach makes your tests more resilient to changes in the AWS SDK and keeps your business logic decoupled from infrastructure concerns.
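One way to build such an abstraction layer in Python is a small Protocol that names only the operations the business logic needs, with an in-memory fake standing in for the real data store in tests. OrderStore, InMemoryOrderStore, and can_cancel are hypothetical names for this sketch; a production implementation of the same interface would wrap the AWS SDK.

```python
from typing import Optional, Protocol


class OrderStore(Protocol):
    """The only data-access operation the business logic depends on."""

    def get_status(self, order_id: str) -> Optional[str]: ...


class InMemoryOrderStore:
    """Test double that satisfies OrderStore without touching AWS."""

    def __init__(self, orders: dict[str, str]):
        self._orders = dict(orders)

    def get_status(self, order_id: str) -> Optional[str]:
        return self._orders.get(order_id)


def can_cancel(store: OrderStore, order_id: str) -> bool:
    """Business rule: only orders that have not shipped can be cancelled."""
    return store.get_status(order_id) in {"pending", "processing"}


store = InMemoryOrderStore({"o-1": "pending", "o-2": "shipped"})
print(can_cancel(store, "o-1"), can_cancel(store, "o-2"))
```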

Integration Testing Strategies

Integration testing for serverless applications verifies that your Lambda functions work correctly with actual AWS services. This is where you test the contracts between your code and services like DynamoDB, S3, SQS, and API Gateway. Integration tests are more expensive and slower than unit tests, but they catch issues that unit tests cannot.

Integration Testing Tools: Leverage AWS SAM CLI for local testing and deployment, Serverless Framework’s invoke local command, or LocalStack for comprehensive AWS service emulation. For end-to-end testing, consider Postman or Newman for API testing, and aws-testing-library for programmatic integration tests.

There are several approaches to integration testing in serverless environments. One strategy involves deploying to a dedicated testing environment in AWS and running tests against real services using AWS CDK or Terraform for infrastructure provisioning. This approach provides the highest confidence that your application will work in production, but it’s also the slowest and most expensive option.

Another approach uses local emulation tools like LocalStack, SAM Local, or Serverless Offline that simulate AWS services on your development machine. These tools provide a faster feedback loop than deploying to AWS, but they may not perfectly replicate the behavior of actual AWS services. They’re best used for rapid iteration during development, with periodic validation against real AWS services.

Contract Testing: Consider implementing contract testing using Pact or Spring Cloud Contract for your Lambda functions. Contract tests verify that your functions correctly handle the event formats they receive from triggers like API Gateway, EventBridge, or SQS. They also verify that your functions produce output in the format expected by downstream services. This approach helps catch breaking changes early.
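To show the idea without a contract-testing framework, here is a hand-rolled check that an incoming event has the shape a function expects from an SQS trigger. It is a simplified stand-in, not Pact’s API, and it checks only a few fields of the SQS event format.

```python
REQUIRED_SQS_RECORD_FIELDS = {"messageId", "body", "eventSource"}


def contract_violations(event: dict) -> list[str]:
    """Return human-readable problems with an SQS-style event, if any."""
    records = event.get("Records")
    if not isinstance(records, list) or not records:
        return ["event has no Records list"]
    problems = []
    for index, record in enumerate(records):
        missing = REQUIRED_SQS_RECORD_FIELDS - record.keys()
        for field in sorted(missing):
            problems.append(f"Records[{index}] is missing {field}")
        if "eventSource" in record and record["eventSource"] != "aws:sqs":
            problems.append(f"Records[{index}] has an unexpected eventSource")
    return problems


valid_event = {
    "Records": [{"messageId": "m-1", "body": "{}", "eventSource": "aws:sqs"}]
}
print(contract_violations(valid_event))
print(contract_violations({"Records": [{"messageId": "m-2"}]}))
```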

Integration tests should cover the critical paths through your application: the workflows that represent your core business value. Test how your functions handle retries, how they behave under throttling conditions, and how they recover from transient failures. Tools like Chaos Toolkit or AWS Fault Injection Service (formerly Fault Injection Simulator) can inject these failures as chaos engineering experiments. These scenarios are difficult to test with unit tests alone.
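Recovery from transient failures can be exercised with a deliberately flaky dependency. The sketch below uses a fake service that fails a set number of times before succeeding; TransientError, FlakyService, and call_with_retries are names invented for the example, and a real retry helper would normally add backoff between attempts.

```python
class TransientError(Exception):
    """Stands in for a throttling or timeout error from a downstream service."""


class FlakyService:
    """Fails a fixed number of times, then succeeds."""

    def __init__(self, failures_before_success: int):
        self.remaining_failures = failures_before_success
        self.calls = 0

    def __call__(self) -> str:
        self.calls += 1
        if self.remaining_failures > 0:
            self.remaining_failures -= 1
            raise TransientError("temporarily unavailable")
        return "ok"


def call_with_retries(operation, max_attempts: int = 3):
    """Retry an operation on TransientError, re-raising after the last attempt."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise


service = FlakyService(failures_before_success=2)
print(call_with_retries(service), service.calls)
```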

Local Development Strategies

Effective local development for serverless applications requires tools and workflows that provide rapid feedback without requiring constant deployment to AWS. The goal is to enable developers to iterate quickly while maintaining confidence that their code will work when deployed.

Local Development Tools: Use AWS SAM CLI with sam local start-api and sam local invoke commands, Serverless Framework with the serverless-offline plugin, or LocalStack for comprehensive local AWS service emulation. Docker is essential for containerizing your development environment to match Lambda’s execution environment.

Local development environments should replicate the Lambda execution environment as closely as possible using Docker images based on AWS Lambda base images. This includes matching the runtime version, environment variables, memory configuration, and timeout settings. Discrepancies between local and deployed environments are a common source of bugs that only appear in production.

Consider using Docker Compose to orchestrate containerized development environments that mirror the Lambda runtime. This approach ensures consistency across your development team and reduces “works on my machine” issues. Containers can include all necessary dependencies, tools, and configurations, making onboarding new developers faster and more reliable.

Debugging Tools: Local development should support debugging with breakpoints and step-through execution using VS Code’s built-in debugger, AWS Toolkit for VS Code, IntelliJ IDEA with AWS Toolkit, or PyCharm’s remote debugging. The ability to pause execution, inspect variables, and step through code line by line is invaluable for understanding complex issues. This capability is especially important when working with asynchronous operations and event-driven workflows.

Conclusion

Testing serverless applications requires a comprehensive strategy that spans unit testing, integration testing, local development, and production monitoring. The distributed nature of serverless architectures and the tight integration with managed services make testing more complex than traditional applications, but the right tools and approach can provide confidence in your application’s reliability and correctness.

AWS Lambda Durable Functions: What They Are and When to Use Them

Lefteris Karageorgiou — Wed, 18 Mar 2026 10:30:37 GMT

AWS Lambda Durable Functions represent a significant evolution in serverless application development, introduced at AWS re:Invent 2025. This new capability lets developers build reliable, multi-step applications that run for extended periods without paying for idle compute time while they wait for external events or human decisions.

In this article, we'll cover what they are, when you should use them, and how they compare with AWS Step Functions to help you choose the right orchestration tool for your serverless applications.

What Are Lambda Durable Functions?

Lambda Durable Functions are regular Lambda functions enhanced with stateful execution capabilities. They allow you to write complex, long-running workflows directly in your application code using familiar programming languages, rather than defining workflows in a separate orchestration language.

The key innovation lies in automatic checkpointing. As your function executes, AWS Lambda automatically saves progress at strategic points. If your workflow needs to wait for an external event, such as a webhook callback, a manual approval, or a scheduled delay, the function pauses without consuming compute resources. When the awaited event occurs, execution resumes from the last checkpoint with full context preserved.
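The checkpoint-and-resume idea can be illustrated with a toy model. To be clear, this is not the AWS Durable Functions SDK: ToyDurableContext and its step method are invented for this sketch, and the checkpoints live in a plain dictionary rather than a managed store. It only shows why completed steps are not re-run when execution resumes.

```python
class ToyDurableContext:
    """Toy model of checkpointing: remembers the result of each named step."""

    def __init__(self, checkpoints=None):
        self.checkpoints = checkpoints if checkpoints is not None else {}
        self.executed = []  # steps that actually ran in this invocation

    def step(self, name, fn):
        if name in self.checkpoints:
            # Already completed in an earlier invocation: reuse the saved result.
            return self.checkpoints[name]
        result = fn()
        self.checkpoints[name] = result
        self.executed.append(name)
        return result


def order_workflow(ctx, order_id):
    reserved = ctx.step("reserve", lambda: f"reserved:{order_id}")
    charged = ctx.step("charge", lambda: f"charged:{order_id}")
    return [reserved, charged]


# Resume from a saved checkpoint: only the unfinished step runs.
resumed = ToyDurableContext({"reserve": "reserved:42"})
print(order_workflow(resumed, "42"), resumed.executed)
```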

This approach embeds orchestration logic directly into your Lambda code, eliminating the need to manage state manually or architect separate state management systems. The result is a serverless-first approach that reduces operational overhead while maintaining the scalability and pay-per-use economics that make serverless attractive.

When to Use Durable Functions

Lambda Durable Functions excel in specific scenarios where application logic naturally involves multiple steps with waiting periods:

AI and Machine Learning Workflows - Orchestrating multi-stage AI pipelines where each step might involve model inference, data transformation, or human review. The ability to pause between stages without incurring costs makes this particularly economical for batch processing scenarios.

Order Processing and E-commerce - Managing order fulfillment workflows that span inventory checks, payment processing, shipping coordination, and customer notifications. These workflows often involve waiting for external system responses or scheduled delivery windows.

Approval and Human-in-the-Loop Processes - Building loan applications, expense approvals, or content moderation systems where workflows must pause for human decisions before proceeding. Because a durable execution can stay open for up to a year, even the most extended approval cycles fit.

Event-Driven Application Logic - Coordinating complex business processes that respond to multiple events over time, such as customer onboarding journeys, subscription lifecycle management, or multi-step data processing pipelines.

Application-Centric Orchestration - When your workflow logic is tightly coupled with your application code and benefits from being expressed in the same programming language as the rest of your business logic.

How Durable Functions Compare with Step Functions

Both Lambda Durable Functions and AWS Step Functions solve the problem of orchestrating multi-step workflows, but they approach it from fundamentally different philosophical and architectural standpoints.

Architectural Philosophy

Step Functions is built for infrastructure orchestration. It excels at coordinating disparate AWS services, providing a visual workflow designer, and offering a declarative approach using Amazon States Language (ASL). The workflow definition lives separately from your application code, making it ideal for workflows that need to be understood and modified by operations teams or non-technical stakeholders.

Durable Functions, by contrast, are optimized for application orchestration. The workflow logic lives within your Lambda function code, written in your preferred programming language. This code-first approach makes it natural for developers who want to keep orchestration logic alongside business logic, use familiar debugging tools, and leverage existing programming constructs like loops, conditionals, and error handling.

Developer Experience

With Step Functions, you define workflows in ASL—a JSON-based state machine definition language. While powerful and expressive, this requires learning a new syntax and switching contexts between your application code and workflow definitions. Local testing typically involves additional tooling or mocking frameworks.
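For a sense of what an ASL definition looks like, the sketch below builds a three-state machine as a Python dictionary and prints it as JSON. The state names, the wait duration, and the placeholder Lambda ARN are assumptions for illustration.

```python
import json


def approval_state_machine(lambda_arn: str, wait_seconds: int) -> dict:
    """Build a minimal Amazon States Language definition: task, wait, task."""
    return {
        "Comment": "Submit a request, wait, then finalize",
        "StartAt": "SubmitRequest",
        "States": {
            "SubmitRequest": {
                "Type": "Task",
                "Resource": lambda_arn,
                "Next": "WaitForReview",
            },
            "WaitForReview": {
                "Type": "Wait",
                "Seconds": wait_seconds,
                "Next": "Finalize",
            },
            "Finalize": {"Type": "Task", "Resource": lambda_arn, "End": True},
        },
    }


# Placeholder ARN; substitute a real function ARN before deploying.
definition = approval_state_machine(
    "arn:aws:lambda:us-east-1:123456789012:function:example", 3600
)
print(json.dumps(definition, indent=2))
```

The same flow in a durable function would be ordinary sequential code with a wait in the middle, which is the context switch the comparison above refers to.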

Durable Functions let you write workflows as regular code. If you know how to write a Lambda function, you already know how to write a durable function. The workflow logic uses standard programming constructs, making it more intuitive for developers and easier to test locally using familiar development tools.

Visibility and Monitoring

Step Functions provides superior visual representation of workflows. The built-in graphical interface shows execution flow, making it easy for stakeholders to understand complex processes at a glance. This visualization is particularly valuable for compliance, auditing, and operational monitoring.

Durable Functions trade some of this visual clarity for code simplicity. While you can still monitor executions through CloudWatch and Lambda Insights, the workflow structure isn’t as immediately apparent to non-technical observers.

Use Case Alignment

Choose Step Functions when you need to:

Orchestrate multiple AWS services (Lambda, ECS, Glue, etc.)
Provide workflow visibility to non-technical stakeholders
Build workflows that benefit from visual design and monitoring
Implement complex branching, parallel execution, or error handling patterns that benefit from declarative definition
Maintain clear separation between orchestration and business logic

Choose Durable Functions when you need to:

Keep workflow logic tightly integrated with application code
Leverage existing programming language features and libraries
Simplify development by avoiding context switching between code and ASL
Build workflows where the orchestration is primarily application logic rather than service coordination
Optimize for developer velocity and code maintainability within Lambda

Cost Considerations

Neither service bills you for time a workflow spends idle while waiting, but the pricing models differ. Step Functions Standard workflows charge per state transition, while Durable Functions follow Lambda’s standard pricing model with additional charges for state persistence. For workflows with many state transitions, Durable Functions may offer cost advantages. For workflows coordinating many AWS services, Step Functions’ integration capabilities may provide better value.

Conclusion

Lambda Durable Functions and Step Functions aren’t competitors—they’re complementary tools in your serverless toolkit. Durable Functions shine when you want to write workflow logic as code within your Lambda functions, keeping orchestration close to your application logic. Step Functions excel when you need to coordinate multiple AWS services with clear visual representation and operational oversight.

The choice ultimately depends on your team’s preferences, the nature of your workflows, and whether you value code-centric development or service-centric orchestration. Many organizations will find themselves using both, selecting the right tool for each specific use case.