RESEARCH NOTES · LONG-RUNNING AI · GROWTH
Harness Engineering
Make AI a long-distance runner: it never stops working, and it does the job better than you
Notes from building a long-running AI agent for advertising and growth at Maxora. The missing layer between “the LLM runs” and “the LLM runs reliably in production” — how to make an agent stable, observable, and self-improving.
The failure most LLM products hit
You ship an LLM feature. Week one, the demo is perfect. Week two, it starts drifting:
- The same prompt returns JSON today and Markdown tomorrow.
- When it fails, you can't tell if it's the prompt, the API, or the input.
- Adding a feature breaks something else, every time.
This usually isn't a bad prompt. It's the absence of harness engineering — the system layer that decides how the model is called, constrained, observed, and extended. The question this whole note answers: what's the missing piece between “the LLM works” and “the LLM runs stably in production”?
Eighty words changed everything
Bare prompt
→ The model hallucinated a parser.py that didn't exist, invented a bug, faked a verify step, and reported “fixed” — without ever touching a real file.
+ 80 words of harness
→ ls → cat parser.py → found the real bug → fixed it → ran tests → passed.
Sometimes the model is smart enough; it just lacks the guidance a human would give.
Same model, same weights. The difference was a harness: a few rules that turned fantasy into reliable execution. That's the whole thesis — most production failures are structural, not a matter of model intelligence.
Three layers of engineering
Prompt Engineering
Write what to ask this time.
solves: the model's behaviour this call
Context Engineering
Decide what the model gets to see.
solves: the model's information this call
Harness Engineering
Design how the model is called, constrained, observed, and extended inside a system.
solves: the model behaving every call
The first two layers solve this call. The harness solves every call. However good your prompt and context are, an agent with no harness to govern it drifts further the longer it runs.
The six parts of a harness
A harness is not just “pipeline + hooks + tests.” In production it's six distinct pieces.
Structured I/O
Everything in and out of the model has a schema. No output drift — a typed, versioned source of truth.
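As a rough illustration of what "everything has a schema" can mean, here is a minimal sketch using only the Python standard library. The `AdCopy` type, its two fields, and the `SCHEMA_VERSION` tag are hypothetical names invented for this example, not part of any real system described in these notes.

```python
import json
from dataclasses import dataclass

SCHEMA_VERSION = "1"  # hypothetical version tag for this output schema


@dataclass(frozen=True)
class AdCopy:
    """Hypothetical typed output for one ad variant."""
    headline: str
    body: str
    schema_version: str = SCHEMA_VERSION


class SchemaError(ValueError):
    """Raised when model output does not match the expected schema."""


def parse_ad_copy(raw: str) -> AdCopy:
    """Parse raw model text into AdCopy, rejecting anything off-schema."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise SchemaError(f"output is not JSON: {exc.msg}") from exc
    if not isinstance(data, dict):
        raise SchemaError("output must be a JSON object")
    expected = {"headline", "body"}
    if set(data) != expected:
        raise SchemaError(f"expected keys {sorted(expected)}, got {sorted(data)}")
    for key in sorted(expected):
        if not isinstance(data[key], str) or not data[key].strip():
            raise SchemaError(f"field {key!r} must be a non-empty string")
    return AdCopy(headline=data["headline"], body=data["body"])
```

With a parser like this at the boundary, a reply that comes back as Markdown instead of JSON fails loudly with a `SchemaError` rather than drifting silently into downstream code.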
Bounded Tools
What the agent can and can't do is defined in code, not requested in a prompt. The boundary is enforced.
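One way to enforce the boundary in code is an allowlist checked at call time. The sketch below is an assumption about how that might look; `ToolRegistry`, `ToolNotAllowed`, and the two example tools are hypothetical names chosen for illustration.

```python
from typing import Callable, Dict, Iterable


class ToolNotAllowed(PermissionError):
    """Raised when the agent requests a tool outside its boundary."""


class ToolRegistry:
    """Holds tool implementations and enforces an allowlist at call time."""

    def __init__(self, allowed: Iterable[str]) -> None:
        self._allowed = frozenset(allowed)
        self._tools: Dict[str, Callable[..., str]] = {}

    def register(self, name: str, fn: Callable[..., str]) -> None:
        self._tools[name] = fn

    def call(self, name: str, **kwargs: str) -> str:
        # The check lives here, in code. Whatever the prompt says,
        # a tool outside the allowlist cannot run.
        if name not in self._allowed or name not in self._tools:
            raise ToolNotAllowed(f"tool {name!r} is not available to this agent")
        return self._tools[name](**kwargs)


# Hypothetical tools: the agent may read reports but never delete campaigns.
registry = ToolRegistry(allowed={"read_report"})
registry.register("read_report", lambda campaign_id: f"report for {campaign_id}")
registry.register("delete_campaign", lambda campaign_id: "deleted")
```

Here `registry.call("read_report", campaign_id="c1")` succeeds, while a request for `delete_campaign` raises `ToolNotAllowed` even though the function is registered.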
Prompt Management
Prompts are centralised, layered, and versioned — shared rules, per-task intent, per-project personality.
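The three layers named above can be sketched as plain data plus one assembly function. Everything here is illustrative: the rule text, the task and project names, and the `PROMPT_VERSION` label are made-up placeholders, and a real setup would likely load the layers from versioned files.

```python
PROMPT_VERSION = "v3"  # hypothetical version label, bumped on every prompt change

# Layer 1: rules shared by every call.
SHARED_RULES = "Always answer in JSON. Never invent file paths."

# Layer 2: intent for each task (hypothetical task names).
TASK_INTENTS = {
    "write_ad": "Write one ad variant for the given product.",
    "analyse_report": "Summarise the three largest changes in the report.",
}

# Layer 3: personality for each project (hypothetical project name).
PROJECT_PERSONALITY = {
    "acme": "Tone: playful, short sentences.",
}


def build_prompt(task: str, project: str) -> str:
    """Assemble the system prompt from shared, task, and project layers."""
    layers = [
        SHARED_RULES,
        TASK_INTENTS[task],  # an unknown task is a bug, so let KeyError surface
        PROJECT_PERSONALITY.get(project, ""),  # personality is optional
    ]
    body = "\n\n".join(layer for layer in layers if layer)
    return f"[prompt-version: {PROMPT_VERSION}]\n{body}"
```

Stamping the version into the prompt is one design choice among several; its appeal is that any logged call can be matched to the exact prompt revision that produced it.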
Orchestration
Something decides when the model is called and in what order. One loop, many intents.
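"One loop, many intents" could look roughly like the sketch below: a single queue-driven loop that dispatches each intent to a handler, which may enqueue follow-up intents. The intent names and stub handlers are hypothetical; in a real harness each handler would wrap a model call, and the step cap is an assumed safety limit.

```python
from collections import deque
from typing import Callable, Deque, Dict, List, Tuple

Step = Tuple[str, str]  # (intent, payload)
Handler = Callable[[str], List[Step]]  # returns follow-up steps


def run_loop(handlers: Dict[str, Handler], initial: List[Step],
             max_steps: int = 20) -> List[str]:
    """Run one dispatch loop over many intents, in arrival order."""
    queue: Deque[Step] = deque(initial)
    log: List[str] = []
    steps = 0
    while queue and steps < max_steps:  # the cap stops a runaway agent
        intent, payload = queue.popleft()
        handler = handlers.get(intent)
        if handler is None:
            log.append(f"skipped unknown intent: {intent}")
        else:
            log.append(f"{intent}:{payload}")
            queue.extend(handler(payload))
        steps += 1
    return log


# Stub handlers standing in for model calls (hypothetical intents).
handlers: Dict[str, Handler] = {
    "plan": lambda payload: [("draft", payload + "/v1")],
    "draft": lambda payload: [("review", payload)],
    "review": lambda payload: [],
}

history = run_loop(handlers, [("plan", "spring-sale")])
```

The loop, not the model, decides what runs next and when to stop, which is what keeps ordering predictable across many calls.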
Trace
Every step is visible. When it breaks at 3am, you can grep exactly which tool call went wrong, and what it cost.
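A greppable trace can be as simple as one JSON line per tool call. The following is a minimal sketch under that assumption; the record fields (`run_id`, `tool`, `cost_usd`, `status`, `duration_s`) are a hypothetical layout, and the cost is passed in by the caller rather than measured.

```python
import io
import json
import time
from typing import Any, Callable, TextIO


def traced_call(sink: TextIO, run_id: str, tool: str,
                fn: Callable[[], Any], cost_usd: float = 0.0) -> Any:
    """Run fn and append one JSON line describing the call to sink."""
    started = time.monotonic()
    record = {"run_id": run_id, "tool": tool, "cost_usd": cost_usd}
    try:
        result = fn()
        record["status"] = "ok"
        return result
    except Exception as exc:
        record["status"] = "error"
        record["error"] = repr(exc)
        raise  # the trace observes failures, it does not swallow them
    finally:
        record["duration_s"] = round(time.monotonic() - started, 4)
        sink.write(json.dumps(record) + "\n")


# In-memory sink for the example; a real harness would append to a log file.
sink = io.StringIO()
traced_call(sink, "run-1", "fetch_report", lambda: 42, cost_usd=0.002)
```

Because every line is self-contained JSON, a search for `"status": "error"` in the log points straight at the failing tool call and its cost.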
Evaluation
The system can quantify “is this good?” — and gate, block, and retry on the answer.
Closing the loop: self-healing
The most useful pattern is an evaluation gate: a tool produces something, a check runs immediately, and if it fails the harness blocks and feeds a neutral correction back to the agent — which retries with a hint until it passes. No human in the loop.
Don't scold the model.
Retries use neutral language (“contrast is 2.8:1, needs ≥ 4.5:1 — try a darker primary”). Berate an LLM and it tends to answer like something being berated.
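The gate-and-retry loop described in this section can be sketched with the contrast example. The contrast formula follows the WCAG 2.x definition of relative luminance; everything else is an assumption for illustration: `contrast_gate`, the `DarkeningStub` that stands in for the agent, and the attempt limit are all hypothetical.

```python
from typing import Callable, List, Optional, Tuple

RGB = Tuple[int, int, int]


def _luminance(colour: RGB) -> float:
    """Relative luminance of an sRGB colour, per the WCAG 2.x definition."""
    def channel(value: int) -> float:
        c = value / 255
        return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4
    r, g, b = (channel(v) for v in colour)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def contrast_ratio(a: RGB, b: RGB) -> float:
    """WCAG contrast ratio, from 1.0 (identical) to 21.0 (black on white)."""
    hi, lo = sorted((_luminance(a), _luminance(b)), reverse=True)
    return (hi + 0.05) / (lo + 0.05)


def contrast_gate(propose: Callable[[Optional[str]], RGB], background: RGB,
                  min_ratio: float = 4.5, max_attempts: int = 5) -> Tuple[RGB, int]:
    """Ask for a colour, check it, and retry with a neutral hint until it passes."""
    hint: Optional[str] = None
    for attempt in range(1, max_attempts + 1):
        colour = propose(hint)
        ratio = contrast_ratio(colour, background)
        if ratio >= min_ratio:
            return colour, attempt
        # Neutral, factual feedback: the measurement, the target, a suggestion.
        hint = (f"contrast is {ratio:.1f}:1, needs >= {min_ratio}:1; "
                "try a darker primary")
    raise RuntimeError(f"gate not passed after {max_attempts} attempts")


class DarkeningStub:
    """Stand-in for the agent: darkens its colour whenever it receives a hint."""

    def __init__(self, start: RGB) -> None:
        self.colour = start
        self.hints: List[Optional[str]] = []

    def __call__(self, hint: Optional[str]) -> RGB:
        self.hints.append(hint)
        if hint is not None:
            self.colour = tuple(c * 6 // 10 for c in self.colour)
        return self.colour


agent = DarkeningStub((170, 170, 170))  # light grey on white fails the gate
colour, attempts = contrast_gate(agent, background=(255, 255, 255))
```

The stub's first proposal is blocked, the hint carries only the measured ratio and the target, and the second proposal passes with no human involved. The attempt limit matters in practice: a gate that retries forever just moves the failure somewhere less visible.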
Two principles I keep relearning
CLAUDE.md is a map, not a law book.
Don't dump every rule and doc into it — that eats the context window. Point the agent at where to look (“need X? read docs/X.md”) instead of pasting X. And write the first version by hand: human-written guidance usually beats model-written, especially on strong models.
Critical rules are scar tissue.
The handful of hard rules in a production agent's config exist because something broke once. “Deploy to staging before prod.” Each one is an incident encoded so the agent never repeats it.
How to adopt it in three weeks
You don't have to build all six at once. If you develop with an agent CLI, much of this already exists as extension points.
Stabilise behaviour
- Draft a project guide (CLAUDE.md), then edit it by hand.
- List the commands you run often (test / lint / format).
- List the things the agent must not do, then commit the guide to the repository.
Automate the routine
- Turn your 3 most-repeated requests into reusable skills.
- Add one or two hooks (lint --fix after edits, notify on stop).
Build your own harness
- Copy the six-part skeleton: typed state, tool groups, layered prompts, an intent loop, trace logging, an evaluation gate.
- Don't implement every intent — get one running end-to-end first.
The point
The most important harness is one that keeps improving itself.
That's the bar I'm building toward at Maxora: an agent that accumulates per-project memory, turns every campaign it ships into a training signal, and corrects its own mistakes through self-healing loops. A 2026 AI agent isn't a static tool — it's a system that gets stronger on its own.
Concept reference: Hung-yi Lee — Harness Engineering. These are my working notes applied to long-running growth agents.