

Vision · Feb 27, 2026

Maybe Runbooks Were Never the Answer

Runbooks have always been a bit of a necessary pain.

Everyone agrees they’re important, but no one enjoys creating them.

They usually get written after something breaks.

Then they sit in a doc somewhere… slowly drifting away from reality until, at some point, they're simply outdated.

AI helps with writing them: you can just explain what you want and get a very well-structured runbook.

So now they're clean, probably “up to date,” but the core issue never changed:

We’re still trying to capture a living, evolving production system in a static set of steps.

And to be fair, runbooks were never just documentation.

They’ve also acted as safety mechanisms: defining what actions are considered safe during stress, reducing variance in response, and giving teams confidence that recovery paths are understood.


The Real Question Isn’t “Better Runbooks”

With AI agents becoming part of how we build and operate systems, something subtle is shifting.

Incidents are no longer just something engineers walk through manually using predefined instructions.

We’re starting to rely on systems that can:

  • look at what’s happening now

  • connect signals across layers

  • reason about possible causes

  • suggest or take actions

Not to replace engineers, but to help them understand faster.

To remove the need to mentally reconstruct the system during an outage.

And once you have that…

You start wondering:

“Why are we still writing step-by-step runbooks at all?”

Runbooks assume the future is predictable. They're built on a single assumption:
“If we document enough scenarios, we’ll be ready.”

But incidents emerge from interactions between components that were never meant to fail together.

That’s why even well-written runbooks often fail in the moments we need them most.

They’re static answers to dynamic problems.

What Changes in an Agent-Driven World

As AI agents become more embedded in production workflows, the role of documentation starts to shift.

Instead of writing instructions for humans to follow step by step, we can start structuring knowledge that agents can use — within clearly defined safety boundaries — to:

  • interpret situations

  • explore options

  • adapt as new signals appear


Not:

“If X happens, do Y.”


But:

  • how systems relate

  • what “healthy” looks like

  • where risks usually emerge

  • what tradeoffs exist during recovery

  • which actions are safe under which conditions
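To make the contrast concrete, here is a minimal sketch of what such structured knowledge might look like as data rather than prose. Every name here (services, metrics, actions) is illustrative, not a real system or tool:

```python
from dataclasses import dataclass, field

@dataclass
class HealthSignal:
    """What 'healthy' looks like for one metric."""
    metric: str
    max_value: float  # beyond this threshold, the service is degraded

@dataclass
class SafeAction:
    """An action and the condition under which it is considered safe."""
    name: str
    safe_when: str  # condition readable by humans and agents alike

@dataclass
class Service:
    name: str
    depends_on: list = field(default_factory=list)    # how systems relate
    health: list = field(default_factory=list)        # what "healthy" looks like
    known_risks: list = field(default_factory=list)   # where risks usually emerge
    safe_actions: list = field(default_factory=list)  # safe actions, per condition

# Purely illustrative entry — not a real production system:
checkout = Service(
    name="checkout-api",
    depends_on=["payments-db", "session-cache"],
    health=[HealthSignal(metric="p99_latency_ms", max_value=300)],
    known_risks=["cache cold-start after deploy"],
    safe_actions=[SafeAction(name="restart_pod",
                             safe_when="error rate elevated and payments-db healthy")],
)
```

Unlike a step-by-step document, an agent can query this structure dynamically: follow `depends_on` to connect signals across layers, compare observations against `health`, and check `safe_actions` before proposing anything.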

This isn’t about giving agents open-ended autonomy.

It’s about moving from procedural instructions to structured operational constraints:

  • what actions are allowed

  • what risks are acceptable

  • when escalation is required

That’s not a traditional runbook.

That’s operational understanding encoded as guardrails.
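A guardrail like that can be sketched in a few lines. This is an assumed toy policy, with made-up action names and risk levels, only to show the shape of "constraints instead of procedures":

```python
# Toy guardrail policy: allowed actions, acceptable risk, escalation.
# All names and thresholds here are invented for illustration.

ALLOWED_ACTIONS = {"restart_pod", "scale_out", "flush_cache"}

RISK_ORDER = ["low", "medium", "high"]
ESCALATE_ABOVE = "medium"  # anything riskier than this requires a human

def evaluate(action: str, observed_risk: str) -> str:
    """Decide whether a proposed action is allowed, escalated, or rejected."""
    if action not in ALLOWED_ACTIONS:
        return "reject"       # not on the allowlist at all
    if RISK_ORDER.index(observed_risk) > RISK_ORDER.index(ESCALATE_ABOVE):
        return "escalate"     # allowed action, but conditions are too risky
    return "allow"

print(evaluate("restart_pod", "low"))   # allow
print(evaluate("drop_table", "low"))    # reject: never on the allowlist
print(evaluate("flush_cache", "high"))  # escalate: allowed, but too risky now
```

Note what this is not: it never says *when* to restart a pod. It only bounds what an agent (or a stressed engineer) may do, and when a human must be pulled in.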

Once knowledge is structured in a way that agents can use dynamically, the traditional runbook becomes less central.

The primary consumer of operational knowledge is no longer a human reading during an incident.

It’s a system helping make sense of reality in real time.

The mindset shift is subtle but important:

We stop thinking about writing runbooks, even AI-generated ones.

And start thinking about building knowledge that can be used.

The Future Isn’t Better Runbooks

It’s better operational memory.

Better context, and better ways for humans and agents to collaborate during uncertainty.

The goal isn’t to automate troubleshooting entirely.

It’s to avoid forcing humans to rely on rigid, outdated instructions when navigating complex failures.

So the next step forward isn’t improving runbooks.

It’s evolving them.

From static instructions to adaptive understanding.