Jan 06, 2026
17 min read

The Ghost in the Commit: Governing AI Agents That Build Our Infrastructure

AI coding agents reason more explicitly than human developers—then throw away all that reasoning after every session. Context graphs offer a framework for capturing decision traces and building organizational memory around AI-assisted development.

I’ve been using Claude Code extensively since last June. This week, it helped me refactor the A2A (Agent-to-Agent) protocol endpoints on this blog and debug a subtle ISR caching issue with Vercel. The code works. The tests pass. The commits are merged.

But why did it choose a factory pattern for the endpoint handlers? Why did it structure the cache exclusion that way when other approaches would have worked? What trade-offs did it consider between x-vercel-no-cache headers versus path-based exclusions?

I watched the reasoning happen. Claude Code walked through the options, considered our existing patterns, weighed the trade-offs. It was thorough. It was visible—right there in my terminal.

Then I closed the session. And the reasoning was gone.

We’ve spent decades worrying about capturing human decision-making in software development. Now we’re building critical infrastructure with AI agents whose reasoning is inherently ephemeral. Every architectural choice, every library selection, every security trade-off lives for the duration of a context window, then vanishes.

This is a problem worth taking seriously. And a recent framework from Animesh Koratana on context graphs offers a compelling lens for understanding what we’re losing—and what we might build instead.

The Two Clocks Problem, Applied to AI Development

Koratana introduces a concept he calls the “two clocks problem.” Every system has a state clock—what’s true right now—and an event clock—what happened, in what order, with what reasoning.

We’ve built trillion-dollar infrastructure for the state clock. The event clock barely exists.

For traditional software development, this gap was manageable. Human developers carry context in their heads. They can explain decisions in code reviews. They build institutional knowledge through pairing and documentation.

AI agents have no such persistence.

State clock (what we capture):

  • Commit: “Refactor auth module to use JWT”
  • Files changed: 12
  • Tests: Passing
  • PR approved by: Human reviewer

Event clock (what we lose):

  • Why JWT over session tokens for this use case?
  • What security trade-offs were considered?
  • Why this JWT library over alternatives?
  • What architectural constraints informed the structure?
  • How does this decision interact with choices made in previous sessions?

The code is the artifact. The reasoning is a ghost.

And here’s the uncomfortable part: we’re not just losing documentation. We’re losing the ability to learn from AI decisions, to govern them systematically, to build organizational memory around AI-assisted development.

The Scale Problem

A human developer makes maybe dozens of significant architectural decisions per week. They can be asked to explain them. They remember context from last month.

AI agents make thousands of micro-decisions per session. Every function signature, every error handling pattern, every dependency choice, every structural decision. Each is a small bet about how the system should work.

These decisions compound. An early choice constrains later choices. A pattern established in one session becomes precedent for the next. The agent building your payment service on Tuesday has no memory of the agent that built your user service on Monday—but the decisions need to cohere.

Currently, the only forcing function for coherence is the human in the loop: reviewing PRs, catching inconsistencies, carrying context across sessions. This works at small scale. It doesn’t scale with the velocity AI agents enable.

We’re building cathedrals with amnesiac architects. Each one talented, each one working from the same blueprints, none of them remembering what the others decided or why.

The Claude Code Paradox

Here’s what’s strange about working with Claude Code: the reasoning is more visible than traditional development, but less persistent.

When a human developer makes an architectural decision, the reasoning is often invisible—it happens in their head. Maybe they explain it in a PR comment. Usually they don’t. The code just appears, and we infer intent.

Claude Code is different. When I ask it to implement a feature, I watch it think:

I'll structure this as a service layer rather than embedding the logic
in the route handler. Looking at your existing codebase, I see you've
established this pattern in user-service.ts and payment-service.ts.
This maintains consistency and will make testing easier since we can
mock the service layer independently.

The reasoning is explicit. The precedent-checking is visible. The trade-off analysis happens in plain text.

And then the session ends, and all of it evaporates.

The next time Claude Code works on this codebase—whether tomorrow or in an hour—it starts fresh. It might read the context I maintain in CLAUDE.md. It will re-analyze patterns from the code itself. But the reasoning from the previous session? Gone. The specific trade-offs considered? Gone. The precedents it established in its own mind? Gone.

We have a tool that reasons more explicitly than any human developer, and we’re throwing away all that reasoning after every session.

What We’re Losing in Every Session

Let me make this concrete. Here’s what happened in a real Claude Code session this week:

The task: Refactor A2A protocol endpoints and fix ISR caching issues with Vercel.

What Claude Code did:

  1. Analyzed the existing A2A endpoint implementations
  2. Identified code duplication across multiple endpoint files
  3. Designed a factory pattern to consolidate shared logic
  4. Implemented the refactor across six endpoint handlers
  5. Debugged why POST requests were being cached by Vercel’s ISR
  6. Tried multiple approaches: x-vercel-no-cache headers, path-based exclusions, cache directives
  7. Settled on explicit ISR exclude paths as the cleanest solution
  8. Added Playwright E2E tests to verify the fix

What got persisted: The code. The tests. Commit messages like “fix: Use explicit ISR exclude paths for A2A endpoints.”

What got lost:

  • Why a factory pattern over other refactoring approaches (the trade-off analysis)
  • Why ISR path exclusions won over header-based cache busting (the debugging journey)
  • The three other caching approaches that didn’t work and why
  • The fact that Claude Code noticed the validation schemas could be consolidated but chose to defer it (scope decision)
  • The precedent this sets for future API endpoint patterns

Next week, when I ask Claude Code to add new A2A capabilities, it will have no memory of this session. It will re-analyze the endpoint patterns. It might make different structural choices—not wrong, just potentially inconsistent. Because it doesn’t know what it decided before or why.

The risk: a codebase that works but doesn’t cohere, because the agent building it has no persistent memory.

CLAUDE.md Is a Primitive Context Graph (And Its Limits)

Claude Code does have one persistence mechanism: the CLAUDE.md file. It’s meant to capture project context, conventions, and guidelines that persist across sessions.

I use it. Here’s part of mine:

## Architecture Patterns

- Services go in /src/services, one file per domain
- Use Result<T, E> pattern for error handling, not exceptions
- All external API calls go through /src/clients with retry logic
- Middleware follows the pattern established in rate-limiter.ts

## Decisions Made

- JWT for auth (stateless scaling requirement)
- Redis for session/cache (already in infra)
- Chose rate-limiter-flexible over express-rate-limit (clustering support)

This helps. But notice what it is: manually curated state. I wrote this after the fact, extracting what I remembered from sessions. It’s not a decision trace—it’s a summary.

What’s missing:

  • The reasoning graph. Why do services go in /src/services? What alternatives were considered?
  • The trade-off record. “Chose X over Y” doesn’t capture why, or what we’d need to reconsider that decision.
  • The precedent chain. Which decision led to which? What depends on what?
  • The automated capture. I have to remember to update this file. I don’t always remember.

CLAUDE.md is a primitive context graph—manually maintained, human-summarized, structurally flat. It’s better than nothing. It’s not what we need.

What we need is infrastructure that captures decision traces automatically—every session, every significant choice, every trade-off—and makes them queryable for future sessions.
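To make that concrete, here's a minimal sketch of what a single captured decision could look like as data. The shape is hypothetical, not a spec: names like DecisionTrace and precedentSet are illustrative, and the fields simply mirror the kind of trace shown later in this post.

// Hypothetical shape for one captured decision. Each choice carries its
// alternatives, reasoning, and links to earlier decisions it builds on.
interface DecisionTrace {
  sessionId: string;            // which agent session produced this decision
  task: string;                 // the prompt or task the agent was given
  type: "library_selection" | "algorithm_selection" | "structural_pattern" | string;
  choice: string;               // what was chosen
  alternativesConsidered: { option: string; rejectedBecause: string }[];
  reasoning: string;            // why the choice won
  tradeOffs?: string[];         // what was given up to get it
  precedentSet: boolean;        // should future sessions follow this pattern?
  dependsOn: string[];          // ids of earlier decisions this one builds on
  filesTraversed: string[];     // the context the agent read before deciding
}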

Context Graphs as Agent Governance Infrastructure

This is where Koratana’s framework becomes directly applicable. A context graph captures “decision traces rather than just data.” It’s infrastructure for the event clock.

For AI agent development, this means:

Capturing agent trajectories. When an agent works through a problem, it traverses a decision space. It considers options, evaluates trade-offs, makes choices. That trajectory—the path through the problem space—is the decision trace. Currently, it exists only in the context window. It could be captured as data.

Building structural memory. Over many agent sessions, patterns emerge. Certain architectural choices cluster together. Certain libraries get chosen in certain contexts. Certain trade-offs get made repeatedly. A context graph accumulates these patterns into organizational memory—the implicit policies of how your codebase is built.

Enabling coherence. When a new agent session begins, it could query the context graph: “How have previous agents approached authentication in this codebase? What patterns were established? What trade-offs were made?” The agent inherits organizational memory instead of starting from zero.

Koratana calls this “schema as output, not input.” You don’t predefine how agents should make decisions. You capture their decision traces and let the patterns emerge. The accumulated traces become a model of how your system gets built—a world model for your development process.

What Instrumented Claude Code Sessions Could Capture

Imagine Claude Code sessions that automatically emit decision traces:

Architectural Decision Trace

session: api-rate-limiting-2024-01-06
agent: claude-code
task: "Add rate limiting to API endpoints"

decisions:
  - type: library_selection
    choice: rate-limiter-flexible
    alternatives_considered:
      - express-rate-limit: "Simpler but lacks Redis cluster support"
      - custom: "Maximum control but maintenance burden"
    reasoning: "Our Redis is clustered; need distributed rate limiting"
    confidence: high

  - type: algorithm_selection
    choice: sliding_window
    alternatives_considered:
      - fixed_window: "Simpler but allows burst at window boundaries"
      - token_bucket: "Good but more complex state management"
    reasoning: "Sliding window prevents boundary bursts without token complexity"
    trade_off: "Slightly higher Redis ops per request"

  - type: structural_pattern
    choice: per_route_middleware_config
    reasoning: "Different endpoints have different rate limit needs"
    precedent_set: true
    scope: "All future middleware should follow this pattern"

context_traversed:
  - /src/middleware/* (existing patterns)
  - /src/config/redis.ts (connection setup)
  - /src/services/user-service.ts (service pattern reference)

deferred_decisions:
  - issue: "Redis connection pooling could be optimized"
    reasoning: "Out of scope for rate limiting task"
    recommendation: "Address in dedicated session"

Cross-Session Query (What Could Exist)

> What patterns has Claude Code established for middleware in this codebase?

Sessions analyzed: 12
Pattern: Per-route configuration with defaults
Established: 2024-01-06 (rate-limiting session)
Followed: 2024-01-08 (caching session), 2024-01-15 (auth refresh)
Diverged: 2024-01-12 (logging session—used global config)

> Why did the logging session diverge?

Reasoning trace from session 2024-01-12:
"Logging middleware applies uniformly across all routes.
Per-route config adds complexity without benefit here.
Intentional divergence from rate-limiting pattern."

This doesn’t exist today. But it could.

The First Step: Telemetry as Primitive Capture

The vision of full decision traces doesn’t require waiting for new infrastructure. We can start capturing something today.

Claude Code supports tracing integrations. You can pipe session data to observability platforms like LangSmith or Langfuse—and suddenly you have persistence where there was none.

What telemetry captures today:

Trace: session-abc123
├── Tool: Read(/src/middleware/rate-limiter.ts)
├── Tool: Glob(**/*.middleware.ts)
├── Tool: Read(/src/config/redis.ts)
├── Tool: Grep("express-rate-limit")
├── Tool: Write(/src/middleware/cache.ts)
├── Tool: Bash(npm test)
└── Tool: Bash(git commit -m "Add caching middleware")

This is valuable. You can see what the agent traversed—which files it read, what patterns it searched for, what it wrote. The trajectory through your codebase is captured.

But notice what’s missing: the reasoning connecting those traversals.

  • It read rate-limiter.ts—but what did it learn from that?
  • It searched for express-rate-limit—was it considering that library? Checking if it was already used?
  • It wrote cache.ts—but why did it structure it that way?

Telemetry gives you the tool calls. Context graphs would give you the decision traces. Right now, we have the trajectory without the reasoning.

Still, telemetry is the foundation. If you’re using Claude Code for production development, setting up tracing to LangSmith or a similar platform is the first step:

  1. You get session persistence. What happened in each session is queryable, not lost.
  2. You get trajectory data. Which files, which patterns, which tools—the path through your codebase.
  3. You can correlate with outcomes. This session produced this PR, which led to this incident. Now you can trace back.
  4. You’re building the substrate. When richer decision trace capture becomes possible, you’ll have the infrastructure ready.

The gap between telemetry and context graphs is the gap between “what tools were called” and “what reasoning produced those calls.” Telemetry is the event stream. Context graphs are the semantic layer on top.
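As a rough illustration of that layering, and only a sketch: assuming you've exported a session's tool calls as a JSON array (the event shape below is an assumption, not a real LangSmith export format), the raw events can be folded into a per-session trajectory record that a semantic layer could later enrich with reasoning.

// Hypothetical shapes: a raw telemetry event and the trajectory summary
// derived from it.
interface ToolCallEvent {
  tool: "Read" | "Write" | "Glob" | "Grep" | "Bash";
  input: string;        // file path, search pattern, or command
  timestamp: string;
}

interface SessionTrajectory {
  sessionId: string;
  filesRead: string[];
  filesWritten: string[];
  searches: string[];
  commands: string[];
  commitSha?: string;   // filled in later when correlating the session with its commit or PR
}

function summarize(sessionId: string, events: ToolCallEvent[]): SessionTrajectory {
  const trajectory: SessionTrajectory = { sessionId, filesRead: [], filesWritten: [], searches: [], commands: [] };
  for (const e of events) {
    if (e.tool === "Read") trajectory.filesRead.push(e.input);
    else if (e.tool === "Write") trajectory.filesWritten.push(e.input);
    else if (e.tool === "Glob" || e.tool === "Grep") trajectory.searches.push(e.input);
    else trajectory.commands.push(e.input);
  }
  return trajectory;
}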

But you can’t build the semantic layer without the event stream. Start capturing now.

Practical Setup

LangChain provides official documentation for tracing Claude Code to LangSmith. The setup uses Claude Code’s hook system to capture conversation transcripts and send them as traces.

The high-level steps:

  1. Create a hook script at ~/.claude/hooks/stop_hook.sh that processes Claude Code’s conversation transcripts and sends them to LangSmith (a rough sketch of what that processing could look like follows these steps).

  2. Configure the global hook in ~/.claude/settings.json:

{
  "hooks": {
    "Stop": [
      {
        "hooks": [
          {
            "type": "command",
            "command": "bash ~/.claude/hooks/stop_hook.sh"
          }
        ]
      }
    ]
  }
}

  3. Enable tracing per project in .claude/settings.local.json:

{
  "env": {
    "TRACE_TO_LANGSMITH": "true",
    "CC_LANGSMITH_API_KEY": "lsv2_pt_...",
    "CC_LANGSMITH_PROJECT": "my-project-traces"
  }
}
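For illustration, here is a sketch of the processing a stop hook like the one in step 1 could do, written in TypeScript rather than shell so it reads alongside the other sketches in this post. It rests on two assumptions worth verifying against current docs: that Claude Code passes the hook a JSON payload on stdin containing session_id and transcript_path, and that LangSmith accepts run creation via POST /runs with an x-api-key header.

// Hypothetical TypeScript equivalent of the stop hook's processing step.
import { readFileSync } from "node:fs";

async function main() {
  // Per-project opt-in from .claude/settings.local.json
  if (process.env.TRACE_TO_LANGSMITH !== "true") return;

  // Assumption: the hook payload arrives on stdin and points at the session transcript (JSONL).
  const payload = JSON.parse(readFileSync(0, "utf8"));
  const transcript = readFileSync(payload.transcript_path, "utf8")
    .split("\n")
    .filter(Boolean)
    .map((line) => JSON.parse(line));

  // Send the whole session as a single run. A real integration would build a
  // proper run tree; this just gets the transcript persisted somewhere queryable.
  await fetch("https://api.smith.langchain.com/runs", {
    method: "POST",
    headers: {
      "x-api-key": process.env.CC_LANGSMITH_API_KEY ?? "",
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      name: `claude-code-session-${payload.session_id}`,
      run_type: "chain",
      inputs: { transcript },
      start_time: new Date().toISOString(),
      session_name: process.env.CC_LANGSMITH_PROJECT,
    }),
  });
}

main().catch((err) => console.error("trace upload failed:", err));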

Once configured, every Claude Code session streams to LangSmith. You can:

  • Review session trajectories after the fact
  • Search across sessions for patterns (“show me every session that touched auth”)
  • Correlate sessions with git commits and PRs
  • View conversations grouped by thread in LangSmith’s Threads tab

It’s not the full vision of decision traces. But it’s the foundation—and it’s available today.

From Capture to Governance

Captured decision traces enable something new: systematic governance of AI development.

Consistency enforcement. When an agent in a new session is about to make a decision that contradicts established patterns, the context graph surfaces the conflict. “Previous sessions established X pattern. You’re about to do Y. Is this intentional divergence or should we maintain consistency?”

Trade-off visibility. Every architectural decision has trade-offs. Currently, they’re invisible—baked into code without explanation. Decision traces make trade-offs queryable. “Show me every decision in the payments domain where we traded security for performance.” Now you can audit.

Precedent-based reasoning. When an agent faces a decision similar to one made before, the context graph provides precedent. Not “here’s what to do” but “here’s what was done before, and why.” The agent can follow precedent or diverge intentionally—but the divergence is captured too.

Organizational learning. When a pattern leads to incidents, you can trace back. “Which agent decisions led to this architecture? What trade-offs were made? Where did the reasoning go wrong?” Now you can improve—not just the code, but the decision-making process itself.
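A query layer over accumulated traces doesn't have to be exotic. As a sketch, using a minimal projection of the hypothetical DecisionTrace shape from earlier, the trade-off audit above could start as a simple filter:

// Minimal projection of the DecisionTrace sketch from earlier in the post.
interface TraceRecord {
  choice: string;
  reasoning: string;
  tradeOffs?: string[];
  filesTraversed: string[];
}

// Hypothetical audit query: decisions in a domain whose trade-off notes mention all keywords.
function auditTradeOffs(traces: TraceRecord[], domain: string, keywords: string[]): TraceRecord[] {
  return traces.filter(
    (t) =>
      t.filesTraversed.some((f) => f.includes(domain)) &&
      (t.tradeOffs ?? []).some((note) =>
        keywords.every((k) => note.toLowerCase().includes(k))
      )
  );
}

// "Show me every decision in the payments domain where we traded security for performance."
// auditTradeOffs(allTraces, "payments", ["security", "performance"]);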

The Simulation Layer

Koratana makes a key claim: “Simulation is the test of understanding. If your context graph can’t answer ‘what if,’ it’s just a search index.”

For AI agent governance, simulation means:

Impact prediction. Before an agent commits a change, query the context graph: “Given the accumulated patterns of how changes to authentication affect downstream services, what’s the blast radius of this change?”

Coherence checking. Before merging AI-generated code: “Does this decision cohere with the architectural patterns established across the codebase? Where are the tensions?”

Counterfactual exploration. “What if we had chosen session tokens instead of JWT? How would that have propagated through subsequent decisions?” The context graph, with enough accumulated traces, becomes a simulator for your development process.

This is what experienced senior engineers have that juniors don’t—a world model of how decisions compound. Context graphs make that institutional. The agent can inherit the world model instead of building it from scratch.

Bridging AI Agents and Human Governance

The real power is in the bridge: AI agents that build, humans that govern, and context graphs that make governance tractable.

For engineering leaders: This is infrastructure that compounds. Every agent session adds to organizational memory. You’re not just shipping features—you’re building a model of how your system evolves. That model becomes an asset: queryable, auditable, learnable.

For architects: Decision traces give you visibility into how agents interpret your guidelines. Where do they follow patterns? Where do they diverge? The context graph shows you the actual architecture being built, not the intended one.

For security and compliance: When you need to explain “why is the system built this way?”—for audits, for incidents, for due diligence—the reasoning is captured. Not reconstructed from code archaeology, but recorded at decision time.

For the agents themselves: Future sessions inherit context. The agent building your service next month knows what the agent building it today decided, and why. Coherence becomes achievable.

The Path Forward

We’re at an inflection point. AI agents are becoming capable enough to make real architectural decisions. The question is whether we’ll govern those decisions or just accept the artifacts.

The path forward:

  1. Instrument agent sessions now. Set up Claude Code tracing to LangSmith or a similar observability platform. Capture trajectories even if you can’t capture reasoning yet.

  2. Maintain your CLAUDE.md deliberately. It’s primitive, but it’s what we have. Update it after significant sessions. Make it a habit.

  3. Accumulate before optimizing. Don’t try to predefine “correct” agent decisions. Capture traces, let patterns emerge, then refine based on outcomes.

  4. Build the query layer. Make accumulated traces queryable—by humans for governance, by agents for precedent. The value is in accessibility.

  5. Close the feedback loop. When decisions lead to incidents or technical debt, trace back to the sessions that produced them. Learn from agent reasoning, not just agent output.

  6. Integrate with human review. PR reviews become richer when reviewers can see session context. “I see Claude chose X because of Y—but have you considered Z?” becomes possible.
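As a sketch of that last step, assuming you already have a way to render a session's captured decisions into Markdown (for instance from the trajectory records above), attaching it to the PR can be as simple as a bot comment through GitHub's REST API:

// Hypothetical helper: post a session's decision summary onto its PR so
// reviewers see the reasoning context next to the diff.
async function postSessionSummary(
  repo: string,            // "owner/name"
  prNumber: number,
  summaryMarkdown: string, // e.g. rendered from captured decision traces
  token: string            // a GitHub token with repo scope
): Promise<void> {
  const res = await fetch(`https://api.github.com/repos/${repo}/issues/${prNumber}/comments`, {
    method: "POST",
    headers: {
      Authorization: `Bearer ${token}`,
      Accept: "application/vnd.github+json",
    },
    body: JSON.stringify({ body: summaryMarkdown }),
  });
  if (!res.ok) throw new Error(`GitHub API error: ${res.status}`);
}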

The Stakes

Here’s what I’ve realized after seven months of using Claude Code: the tool is more capable than my ability to capture its reasoning.

Every session, Claude Code makes dozens of decisions I’d struggle to make as quickly. It analyzes patterns I’d miss. It considers trade-offs I’d forget. It’s genuinely good at this.

And then I close my terminal, and all that work—the analysis, the reasoning, the precedent—disappears. The next session starts from scratch, with only my manually maintained CLAUDE.md to provide continuity.

We’re leaving enormous value on the table. Not just documentation—organizational intelligence. The accumulated reasoning of every AI session, queryable and learnable.

The organizations that figure out how to capture this will have something qualitatively different: AI development that compounds. Every session makes the next one smarter. Patterns cohere because agents can query precedent. Trade-offs are visible because they’re captured, not lost.

Organizations that don’t will have faster code generation and slower debugging. More output and less understanding. They’ll ship faster until they need to change direction, and then they’ll discover that no one—human or AI—remembers why the system was built this way.

Claude Code shows me its reasoning. It’s right there in the terminal. The question is whether we’ll build infrastructure to capture it—or keep letting it disappear.


This post was inspired by Animesh Koratana’s framework on context graphs. His original piece explores the theoretical foundations—the “two clocks problem,” agents as informed walkers, and context graphs as organizational world models. I’ve applied that lens to the specific challenge I’m experiencing daily: governing AI agents as they build our infrastructure.
