Why Trinity runs on Effect: typed errors, supervised agents

Lorenzo Wynberg | June 10, 2026 | 10 min read | Engineering

An autonomous pipeline is only as trustworthy as its worst failure.

Trinity plans, executes, reviews, and ships software without a human babysitting each step. That only works if the substrate underneath is bulletproof — because when an agent is three files into a refactor at 2 AM and the provider rate-limits the next call, something has to happen, and "swallow the exception and leave the story half-done" is not an acceptable something. The hard part of autonomous coding was never generating the diff. It's everything around the diff: what happens when a call fails, when you hit Stop, when a worker hangs, when two pieces of work race.

So we moved Trinity's highest-leverage surfaces onto Effect — the TypeScript library for typed errors, structured concurrency, and dependency injection as values. Agent execution. Error handling. Auth. The coordinator. Not a monolithic rewrite — Effect strongly in the app, not Effect everywhere. The footprint today: ~366 files import it, ~200 use Effect.gen, 70 tagged-error classes, and 43+ services wired through Context.

This post is about what that buys you, the developer whose code Trinity is shipping. Two wins lead: how Trinity runs agents, and how it handles failure.

Agent execution that unwinds cleanly

Every harness driver Trinity ships — Claude Code, Codex — has the same shape. Not a Promise. An Effect.

trinity/src/lib/harness/types.ts

run: (...) => Effect.Effect<AgentResult, HarnessError, R>;
stream: (...) => Effect.Effect<HarnessStreamHandle, HarnessError, R | Scope.Scope>;

That one type does three things a Promise can't. It puts every failure in a typed channel (HarnessError, the second slot). It declares its dependencies (R, the third slot) instead of reaching for globals. And because it's an Effect, it runs inside Effect's structured concurrency — which is the part that matters when you hit Stop.
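To make the three slots concrete, here is a minimal sketch of a driver-shaped function. AgentResult, HarnessError, and ModelClient are hypothetical stand-ins, not Trinity's real definitions; only the Effect<Success, Error, Requirements> shape comes from the signature above. It assumes Effect 3.x.

```typescript
import { Context, Data, Effect } from "effect";

// Hypothetical stand-ins for Trinity's real types.
interface AgentResult {
  readonly summary: string;
}
class HarnessError extends Data.TaggedError("HarnessError")<{
  readonly reason: string;
}> {}

// A dependency declared as a service instead of a global (hypothetical).
class ModelClient extends Context.Tag("ModelClient")<
  ModelClient,
  { readonly complete: (prompt: string) => Effect.Effect<string, HarnessError> }
>() {}

// Slot 1: success value. Slot 2: typed failure. Slot 3: required services.
const run = (
  prompt: string
): Effect.Effect<AgentResult, HarnessError, ModelClient> =>
  Effect.gen(function* () {
    const client = yield* ModelClient;
    const text = yield* client.complete(prompt);
    return { summary: text };
  });

// Providing the service removes it from the third slot, so the effect can run.
const runnable = run("refactor").pipe(
  Effect.provideService(ModelClient, {
    complete: (prompt) => Effect.succeed(`done: ${prompt}`),
  })
);

const result = Effect.runSync(runnable);
```

If the provideService call is dropped, the program no longer type-checks at runSync, because the requirement is still listed in the type.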

Here's the problem structured concurrency solves. A running agent isn't one thing. It's a subprocess, an event stream fanning out to your activity feed, a lease on a model slot, a git worktree on disk, and often a couple of forked side-tasks. In a Promise-and-event-emitter world, tearing all of that down correctly on cancellation is a manual chore you get wrong in the edge cases — the ones that leave a zombie node-pty process or a worktree that never got cleaned up.

Trinity ties the whole thing to a Scope.

[Diagram: AgentSession as a scoped resource. One Scope owns the PubSub (per-session event fan-out, replay-buffered), the supervised fibers (runAgent plus forked work, one fiber each), and the harness lease (acquired on open, released on close).]

An AgentSession is a scoped resource. The PubSub event hub, the supervised fibers, and the harness lease all acquire under one scope. Close the scope and every fiber is interrupted, the PubSub drains, and the lease releases — in reverse acquisition order, deterministically. When you press Stop, that's the call that fires. The agent's work unwinds cleanly: no orphaned processes, no half-released leases, no zombie worktrees.
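The reverse-order release can be sketched with Effect's Scope and acquireRelease. This is an illustration under assumptions, not Trinity's AgentSession: the resource helper and the three resource names are hypothetical, and the log array exists only to show the order.

```typescript
import { Effect, Exit, Scope } from "effect";

const log: string[] = [];

// Hypothetical resource: records when it is acquired and when it is released.
const resource = (name: string) =>
  Effect.acquireRelease(
    Effect.sync(() => {
      log.push(`acquire ${name}`);
      return name;
    }),
    () =>
      Effect.sync(() => {
        log.push(`release ${name}`);
      })
  );

const program = Effect.gen(function* () {
  const scope = yield* Scope.make();

  // Each resource registers its finalizer on the same scope.
  yield* resource("pubsub").pipe(Scope.extend(scope));
  yield* resource("fibers").pipe(Scope.extend(scope));
  yield* resource("lease").pipe(Scope.extend(scope));

  // The equivalent of pressing Stop: finalizers run last-acquired first.
  yield* Scope.close(scope, Exit.void);
});

Effect.runSync(program);
```

After the run, log holds the three acquisitions in order followed by the releases of lease, fibers, and pubsub.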

Note: Fiber.interrupt is the bridge that makes cancellation real. The worker pool wires it to an AbortController, so interrupting the Effect fiber actually unwinds the underlying async work: not just a flag that gets checked later, but cleanup that runs.
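One way to see that bridge in isolation, assuming Effect 3.x: Effect.promise hands the wrapped function an AbortSignal, and interrupting the fiber aborts it. The slowJob function below is hypothetical and is not Trinity's worker pool.

```typescript
import { Effect, Fiber } from "effect";

let aborted = false;

// Hypothetical long-running job that honors an AbortSignal.
const slowJob = (signal: AbortSignal) =>
  new Promise<string>((resolve) => {
    const timer = setTimeout(() => resolve("finished"), 60_000);
    signal.addEventListener("abort", () => {
      clearTimeout(timer); // real cleanup, not a flag checked later
      aborted = true;
    });
  });

const program = Effect.gen(function* () {
  // Effect.promise passes an AbortSignal that is tied to the fiber's lifetime.
  const fiber = yield* Effect.fork(
    Effect.promise((signal) => slowJob(signal))
  );
  yield* Effect.sleep("10 millis"); // give the job time to start
  yield* Fiber.interrupt(fiber); // aborts the signal, then returns
  return aborted;
});

const wasAborted = Effect.runPromise(program);
```

The abort listener runs as part of the interruption, so the timer is cleared before Fiber.interrupt returns.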

Timeouts as values, not thrown exceptions

The execution coordinator runs each worker as an Effect fiber. A stuck job used to be a race between a setTimeout and the work, with cleanup bolted on. Now the timeout is a value:

trinity/src/lib/execution/coordinator/worker-pool.ts

Effect.timeoutFail(workerLoop, {
  duration: timeout,
  onTimeout: () => new WorkerError({ jobId, phase: 'harness', cause }),
});

When the duration elapses, the harness fiber is interrupted and a tagged WorkerError comes back carrying the phase it died in — claim, harness, or complete. Ops know exactly which boundary hung. No exception thrown across an async boundary to be caught (or missed) three layers up.
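A self-contained version of the same idea, as a sketch: the WorkerError fields here are a guess at a minimal shape (the real class may carry more, such as cause), and Effect.never stands in for a harness that hangs.

```typescript
import { Data, Effect } from "effect";

type Phase = "claim" | "harness" | "complete";

// Hypothetical minimal shape for the tagged error.
class WorkerError extends Data.TaggedError("WorkerError")<{
  readonly jobId: string;
  readonly phase: Phase;
}> {}

// Stand-in for a worker loop that never returns.
const stuckHarness: Effect.Effect<never> = Effect.never;

const runJob = (jobId: string) =>
  stuckHarness.pipe(
    // When the duration elapses the work is interrupted and the error is returned.
    Effect.timeoutFail({
      duration: "20 millis",
      onTimeout: () => new WorkerError({ jobId, phase: "harness" }),
    }),
    // The failure is a typed value, so it can be handled by tag.
    Effect.catchTag("WorkerError", (e) =>
      Effect.succeed(`job ${e.jobId} timed out in phase ${e.phase}`)
    )
  );

const outcome = Effect.runPromise(runJob("job-1"));
```

The handler sees the phase as a typed field rather than parsing it out of a message.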

The coordinator's background loops got the same treatment. We deleted the setInterval / clearInterval bookkeeping and replaced it with Effect.repeat(scan, Schedule.fixed(...)) forked into a FiberSet held in a scope. One Scope.close() interrupts every loop at once. No dangling timers to leak, no "did we clear that interval?" bugs on shutdown.
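The loop pattern might look roughly like this. The scan effect and the durations are placeholders, and the counter exists only to show that the loop stops when the scope closes.

```typescript
import { Effect, FiberSet, Schedule } from "effect";

let scans = 0;

// Placeholder for a coordinator scan.
const scan = Effect.sync(() => {
  scans += 1;
});

const program = Effect.gen(function* () {
  yield* Effect.scoped(
    Effect.gen(function* () {
      // The FiberSet lives in the surrounding scope.
      const loops = yield* FiberSet.make();
      // Fork the repeating scan into the set; no interval handle to track.
      yield* FiberSet.run(loops, Effect.repeat(scan, Schedule.fixed("5 millis")));
      yield* Effect.sleep("30 millis");
    })
  ); // leaving Effect.scoped closes the scope and interrupts the loop

  const atClose = scans;
  yield* Effect.sleep("30 millis");
  return { atClose, afterWait: scans };
});

const counts = Effect.runPromise(program);
```

If the loop had leaked past the scope, afterWait would be larger than atClose.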

Concurrency that doesn't take the system down with it

When Trinity generates a plan, it forks a roadmap update as its own fiber, generates the stories concurrently, and joins after. A failure in the roadmap update doesn't crash story generation — they're independent fibers, not two halves of a Promise.all that fails together. No event-emitter glue holding them in sync. This is what parallel execution looks like when the runtime, not your hand-written coordination code, owns the structure: agents run in parallel safely, and one failing doesn't take the others down.
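A sketch of that fork-and-join shape, with hypothetical updateRoadmap and generateStory effects. The roadmap update fails on purpose to show that story generation still completes.

```typescript
import { Effect, Exit, Fiber } from "effect";

// Hypothetical side task that fails.
const updateRoadmap = Effect.fail("roadmap service unavailable" as const);

// Hypothetical story generator.
const generateStory = (n: number) => Effect.succeed(`story-${n}`);

const plan = Effect.gen(function* () {
  // The roadmap update runs on its own fiber.
  const roadmapFiber = yield* Effect.fork(updateRoadmap);

  // Stories are generated concurrently, independent of the roadmap fiber.
  const stories = yield* Effect.all([1, 2, 3].map(generateStory), {
    concurrency: "unbounded",
  });

  // Fiber.await returns an Exit, so the failure is observed rather than rethrown.
  const roadmapExit = yield* Fiber.await(roadmapFiber);

  return { stories, roadmapOk: Exit.isSuccess(roadmapExit) };
});

const planResult = Effect.runPromise(plan);
```

Using Fiber.join instead of Fiber.await would propagate the roadmap failure into the parent, which is the coupling this design avoids.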

Errors you can actually act on

Here's the failure mode every autonomous tool eventually hits. A provider returns a rate-limit error. The code does if (err.message.includes('rate_limit')). The provider changes the wording in a minor update. The check silently stops matching. Now a rate limit looks like an unknown crash, the retry logic never fires, and your story is stranded half-done with no signal as to why.

Strings are not a type system. So Trinity's harness errors are a typed algebraic data type — nine variants, each carrying a payload specific to its failure.

HarnessError: 9 typed variants, carried in Effect<AgentResult, HarnessError, R>

RateLimitError.retryAfterMs
ContextWindowError.limitTokens
AuthExpiredError.credential
ToolFailedError.toolName
CancelledError.reason
NetworkError.errno
ValidationError.field
ProcessError.exitCode
UnknownError.cause

Before: if (err.message.includes('rate_limit'))
String-matching. Silently misses the variant when the wording drifts.

After: switch (err._tag) { case 'RateLimitError': ... }
The compiler enforces every arm. A new variant won't compile until it's handled.

The win isn't aesthetic. Each variant carries exactly what recovery needs. RateLimitError.retryAfterMs is the provider's own reported wait time — so a rate limit becomes a known, retryable value that backs off for precisely as long as the provider asked, instead of a swallowed exception. ProcessError.exitCode tells you the subprocess died and how. ContextWindowError means compact the conversation, not wait — a distinction string-matching erases.

The UI pattern-matches on _tag to drive the activity-feed error chip and the right retry affordance for each case. And because it's a discriminated union, adding a tenth variant is a compile error everywhere that didn't handle it — the autonomous pipeline fails loudly and precisely, never silently.
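The compile-time guarantee comes from an exhaustiveness check, which can be sketched in plain TypeScript. Only three of the nine variants are shown, and the payload types and the Recovery type are assumptions made for illustration.

```typescript
// Three of the nine variants; payload types are assumed.
type HarnessError =
  | { readonly _tag: "RateLimitError"; readonly retryAfterMs: number }
  | { readonly _tag: "ContextWindowError"; readonly limitTokens: number }
  | { readonly _tag: "ProcessError"; readonly exitCode: number };

// Hypothetical recovery decisions.
type Recovery =
  | { readonly action: "wait"; readonly ms: number }
  | { readonly action: "compact"; readonly targetTokens: number }
  | { readonly action: "restart" };

function recoveryFor(err: HarnessError): Recovery {
  switch (err._tag) {
    case "RateLimitError":
      return { action: "wait", ms: err.retryAfterMs };
    case "ContextWindowError":
      return { action: "compact", targetTokens: err.limitTokens };
    case "ProcessError":
      return { action: "restart" };
    default: {
      // If a new variant is added without a case, `err` is no longer `never`
      // here and this assignment fails to compile.
      const unhandled: never = err;
      return unhandled;
    }
  }
}

const decision = recoveryFor({ _tag: "RateLimitError", retryAfterMs: 1500 });
```

The never assignment in the default arm is what turns a forgotten variant into a build failure instead of a silent fall-through.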

A rate limit is data, not a disaster

The pipeline reads the provider's own retryAfterMs, waits exactly that long, and resumes — no guessed backoff, no stranded story. The error is a value the system reasons about, not an exception it trips over.
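As a sketch of that wait-and-resume step, assuming Effect 3.x: callProvider is a hypothetical call that is rate-limited once and then succeeds, and the single retry is a simplification rather than Trinity's actual policy.

```typescript
import { Data, Duration, Effect } from "effect";

class RateLimitError extends Data.TaggedError("RateLimitError")<{
  readonly retryAfterMs: number;
}> {}

let attempts = 0;

// Hypothetical provider call: rate-limited on the first attempt only.
const callProvider = Effect.gen(function* () {
  attempts += 1;
  if (attempts === 1) {
    return yield* Effect.fail(new RateLimitError({ retryAfterMs: 20 }));
  }
  return "ok";
});

// Wait exactly as long as the provider asked, then try once more.
const withBackoff = callProvider.pipe(
  Effect.catchTag("RateLimitError", (e) =>
    Effect.sleep(Duration.millis(e.retryAfterMs)).pipe(
      Effect.andThen(callProvider)
    )
  )
);

const reply = Effect.runPromise(withBackoff);
```

A second rate limit would still surface as a typed RateLimitError, so a caller could decide whether to keep waiting.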

This pattern runs through the codebase. AuthError maps each auth failure to a precise HTTP status (400 / 401 / 403 / 500), with a test pinning the mapping so a message-string drift can't quietly regress it. PermissionError carries the GitHub permission level you needed versus the one you had — returned as a value, threaded through Layer composition, no new parameters, no throws. Even the local database migrations got it: SqlExecError preserves the exact failed statement and the original SQL error, so instead of a cryptic migration M failed, ops see precisely which file and which statement blew up.
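A pinned status mapping could be as small as the following. The four variant names are invented for illustration; only the status codes (400, 401, 403, 500) come from the text above.

```typescript
// Hypothetical AuthError tags; the real variant names are not shown in this post.
type AuthErrorTag =
  | "MalformedRequest"
  | "SessionExpired"
  | "InsufficientPermission"
  | "AuthBackendFailure";

type AuthStatus = 400 | 401 | 403 | 500;

// Record<AuthErrorTag, ...> forces an entry for every tag at compile time.
const statusFor: Record<AuthErrorTag, AuthStatus> = {
  MalformedRequest: 400,
  SessionExpired: 401,
  InsufficientPermission: 403,
  AuthBackendFailure: 500,
};

// The pinning test: fails if any mapping drifts from the expected status.
function checkPinnedStatuses(): string[] {
  const pinned: ReadonlyArray<readonly [AuthErrorTag, AuthStatus]> = [
    ["MalformedRequest", 400],
    ["SessionExpired", 401],
    ["InsufficientPermission", 403],
    ["AuthBackendFailure", 500],
  ];
  return pinned
    .filter(([tag, status]) => statusFor[tag] !== status)
    .map(([tag]) => tag);
}

const drifted = checkPinnedStatuses();
```

Because the key is the tag and not a message string, rewording an error message cannot change which status it maps to.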

Failure	String-matched (before)	Typed value (after)
Rate limit	err.message.includes('rate_limit')	RateLimitError.retryAfterMs — exact backoff
Auth expired	Generic 500, re-auth never prompted	AuthError → 401, status pinned by a test
Worker hang	setTimeout race, cleanup bolted on	WorkerError{phase} — value with the boundary
Migration fails	Error('migration M failed')	SqlExecError{migration, statement, cause}
New failure mode	Falls through, silently unhandled	Compile error until every arm handles it

Dependency injection that deletes boilerplate

The two wins above need plumbing, and Effect's plumbing is Context. The parts of Trinity's request context (identity, scope, the active project, the verified user) are Context.Tag services. Routes compose layers (requestScopeLayer, ensureProjectCtxLayer, userCtxLayer) instead of threading ExecCtx through every signature. A route that needs the project context writes yield* ProjectCtx and gets it. The per-route context boilerplate is gone.
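A rough sketch of that composition, assuming Effect 3.x. The layer names echo the ones above, but the service shapes and layer bodies are hypothetical; real layers would verify a user and load a project rather than return constants.

```typescript
import { Context, Effect, Layer } from "effect";

// Hypothetical service shapes.
class UserCtx extends Context.Tag("UserCtx")<
  UserCtx,
  { readonly userId: string }
>() {}

class ProjectCtx extends Context.Tag("ProjectCtx")<
  ProjectCtx,
  { readonly projectId: string; readonly ownerId: string }
>() {}

// Hypothetical layer bodies with constant values.
const userCtxLayer = Layer.succeed(UserCtx, { userId: "u_1" });

const ensureProjectCtxLayer = Layer.effect(
  ProjectCtx,
  Effect.gen(function* () {
    const user = yield* UserCtx; // this layer depends on the user layer
    return { projectId: "p_1", ownerId: user.userId };
  })
);

// Compose once per route; the handler signature never mentions the context.
const routeLayer = ensureProjectCtxLayer.pipe(Layer.provide(userCtxLayer));

const handler = Effect.gen(function* () {
  const project = yield* ProjectCtx;
  return `project ${project.projectId} owned by ${project.ownerId}`;
});

const response = Effect.runPromise(handler.pipe(Effect.provide(routeLayer)));
```

The handler's requirement on ProjectCtx shows up in its type, so a route that forgets to provide the layer fails to compile instead of failing at request time.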

The elegant part is how the two execution axes flow to the leaf. An AgentSession carries two optional concerns: a SessionObserver for streaming events out, and a PermissionResolver for approvals coming in. The leaf runAgent reads them via Effect.serviceOption — so they reach it through Context without a single signature change.

trinity/src/lib/agent/agent.ts

// The leaf is ambient-aware. Whether a session is bound or not, the
// signature is identical — the session's concerns arrive through Context.
export function runAgent(...): Effect.Effect<AgentResult, HarnessError, never>

That serviceOption read is also where Trinity's two execution modes collapse into one code path. No PermissionResolver bound? Allow-all — fully autonomous. A session binds one? Every action routes to interactive approval. Same harness code, two modes, zero conditional branching. The mode is a fact about the context, not a flag the code has to check.
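The two modes can be illustrated with a small sketch. PermissionResolver's shape, the allowAll default, and runAction are assumptions; only the use of Effect.serviceOption comes from the description above.

```typescript
import { Context, Effect, Option } from "effect";

// Hypothetical resolver shape.
interface PermissionResolverShape {
  readonly approve: (action: string) => Effect.Effect<boolean>;
}

class PermissionResolver extends Context.Tag("PermissionResolver")<
  PermissionResolver,
  PermissionResolverShape
>() {}

// Default used when no session has bound a resolver: allow everything.
const allowAll: PermissionResolverShape = {
  approve: () => Effect.succeed(true),
};

// The leaf has no requirements in its type; the resolver is read optionally.
const runAction = (action: string): Effect.Effect<string> =>
  Effect.gen(function* () {
    const bound = yield* Effect.serviceOption(PermissionResolver);
    const resolver = Option.getOrElse(bound, () => allowAll);
    const allowed = yield* resolver.approve(action);
    return allowed ? `ran ${action}` : `blocked ${action}`;
  });

// Autonomous mode: nothing bound.
const autonomous = Effect.runSync(runAction("git push"));

// Interactive mode: a session binds a resolver (here one that always declines).
const interactive = Effect.runSync(
  runAction("git push").pipe(
    Effect.provideService(PermissionResolver, {
      approve: () => Effect.succeed(false),
    })
  )
);
```

Both calls run the same runAction; the only difference is whether a resolver was provided in the context.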

Building software that ships itself means the substrate has to be bulletproof. Effect is how we make the autonomous loop trustworthy — every failure typed, every resource scoped, every dependency a value.

The next step: porting the data API to Effect

The app-side adoption is real and shipped. The next step is the website data API — and it's still the planned next step, not a done deal.

Today — in the app, shipped

Agent execution (harness drivers, runAgent)
Typed error channel (9-variant HarnessError ADT)
Supervised coordinator (FiberSet + Scope)
Auth + permissions (Context.Tag + Layers)

Next — the API port (issue #442)

Standalone Effect HttpApi server on Fly
Warm per-tenant DB client pool (ScopedCache)
Zod → effect/Schema contracts, snake_case preserved
Derived AtomHttpApi client, retires TanStack Query

Today that API is over 250 Next.js serverless route files with Zod contracts. The plan (issue #442 on our roadmap) is to extract it into a standalone, long-running Effect server on Fly, built on @effect/platform's HttpApi, with the frontend consuming a derived, Result-typed client through @effect-atom.

The forcing function is real, and it's about your data. Trinity provisions a Turso database per user and per team — your data, isolated, close to you. On serverless, every request opens a fresh database client. At scale that means connection storms and cold latency on every call. A persistent Effect process holds a warm per-tenant client pool — a ScopedCache that keeps your database connection hot across requests. Effect's runtime is happiest in a long-running process: build the layer graph once at boot, hold resources for the process lifetime. That's a faster, warmer data plane, and the connection-storm problem disappears.

The contracts migrate from Zod to effect/Schema, snake_case preserved end to end, shared by both hosts. So the same typed contract guards the wire from the app down to the data layer: a contract change can't silently break a screen, because the schema is the single source of truth on both ends. The migration is non-breaking and incremental, one domain at a time, starting with a threads pilot. Auth hardens on the way: the Fly server cryptographically verifies a JWT, using the sub claim as the identity anchor, instead of trusting a header.
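A shared contract in effect/Schema might look like the sketch below. The Thread fields are invented for illustration, and the import path assumes a recent Effect release where Schema is exported from the main package.

```typescript
import { Either, Schema } from "effect";

// Hypothetical thread contract; field names stay snake_case on both ends.
const Thread = Schema.Struct({
  thread_id: Schema.String,
  project_id: Schema.String,
  message_count: Schema.Number,
});

// The static type is derived from the schema, so the two cannot drift apart.
type Thread = typeof Thread.Type;

const decodeThread = Schema.decodeUnknownEither(Thread);

// A payload that matches the contract.
const good = decodeThread({
  thread_id: "t_1",
  project_id: "p_1",
  message_count: 3,
});

// A payload where a field changed type: rejected at the boundary.
const bad = decodeThread({
  thread_id: "t_1",
  project_id: "p_1",
  message_count: "3",
});

const goodOk = Either.isRight(good);
const badRejected = Either.isLeft(bad);
```

Server and client would both import the same Thread value, so a changed field is caught by the decoder or the compiler rather than by a broken screen.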

The takeaway

Trinity exists to make autonomous software delivery trustworthy. You can't get there with Promise-and-string-matching plumbing — the failure modes are exactly the ones that strand a story or leak a process, and those are the moments that decide whether you trust the loop.

Effect closes them. Typed errors so a rate limit is a value with the provider's own wait time, not a swallowed exception. Supervised fibers and scopes so Stop means stop — cleanly, every time. Structured concurrency so parallel agents fail independently. That's the substrate under planning, execution, and the knowledge that compounds across runs.

We did the app first because that's where the agents run and where a dropped failure costs you a story. The data API is next. Same principle, one layer down: make the thing that ships your software impossible to fail silently.