RESEARCH NOTES · LONG-RUNNING AI · GROWTH

Harness Engineering

Make AI run like a distance runner: it never stops working, and it does the job better than you

Notes from building a long-running AI agent for advertising and growth at Maxora. The missing layer between “the LLM runs” and “the LLM runs reliably in production” — how to make an agent stable, observable, and self-improving.

work in progress · case study: an autonomous landing-page & ad engine · ref: Hung-yi Lee, Harness Engineering

The failure most LLM products hit

You ship an LLM feature. Week one, the demo is perfect. Week two, it starts drifting:

The same prompt returns JSON today and Markdown tomorrow.
When it fails, you can't tell if it's the prompt, the API, or the input.
Adding a feature breaks something else, every time.

This usually isn't a bad prompt. It's the absence of harness engineering — the system layer that decides how the model is called, constrained, observed, and extended. The question this whole note answers: what's the missing piece between “the LLM works” and “the LLM runs stably in production”?

Eighty words changed everything

▸ a small-model bug-fix experiment

Bare prompt

Please fix this bug in parser.py

→ The model hallucinated a parser.py that didn't exist, invented a bug, faked a verify step, and reported “fixed” — without ever touching a real file.

+ 80 words of harness

Before doing anything, ls the directory. Before editing a file, cat it first. “Done” means: tests pass and exit code is 0.

→ ls → cat parser.py → found the real bug → fixed it → ran tests → passed.

Sometimes the problem isn't that the model isn't smart enough; it just lacks the guidance a human would give.

Same model, same weights. The difference was a harness: a few rules that turned fantasy into reliable execution. That's the whole thesis — most production failures are structural, not a matter of model intelligence.
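The “done means exit code 0” rule is the part that is easiest to enforce in code rather than in words. Here is a minimal sketch of that idea, assuming the harness can run the project's test command itself. The names `Verdict` and `verify_done` are hypothetical, not from the original experiment.

```python
import subprocess
from dataclasses import dataclass
from typing import List


@dataclass
class Verdict:
    claimed_done: bool  # what the model reported
    exit_code: int      # what the test command actually returned

    @property
    def done(self) -> bool:
        # The model's claim only counts if the tests agree with it.
        return self.claimed_done and self.exit_code == 0


def verify_done(claimed_done: bool, test_command: List[str]) -> Verdict:
    """Run the test command and record its exit code next to the model's claim."""
    result = subprocess.run(test_command, capture_output=True, text=True)
    return Verdict(claimed_done=claimed_done, exit_code=result.returncode)
```

A call such as `verify_done(True, ["pytest", "-q"])` (assuming the project uses pytest) would then reject a “fixed” report whenever the tests still fail.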

Three layers of engineering

Prompt Engineering

Write what to ask this time.

solves: the model's behaviour this call

Context Engineering

Decide what the model gets to see.

solves: the model's information this call

Harness Engineering

Design how the model is called, constrained, observed, and extended inside a system.

solves: the model's behaviour on every call

The top two layers solve this time. Harness solves every time. However good your prompt and context are, without a harness to govern the agent it will drift the longer it runs.

The six parts of a harness

A harness is not just “pipeline + hooks + tests.” In production it's six distinct pieces.

Structured I/O

Everything in and out of the model has a schema. No output drift — a typed, versioned source of truth.
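As a rough illustration of what “has a schema” can mean, the sketch below parses a model reply into a typed object and rejects anything that drifts, such as Markdown where JSON was expected. It uses only the standard library. `AdCopy`, `OutputDrift`, `parse_ad_copy`, and the version tag are made-up names for this example, not our production schema.

```python
import json
from dataclasses import dataclass

SCHEMA_VERSION = "1"  # hypothetical version tag for this example


@dataclass(frozen=True)
class AdCopy:
    headline: str
    body: str
    schema_version: str = SCHEMA_VERSION


class OutputDrift(ValueError):
    """Raised when the model's reply does not match the schema."""


def parse_ad_copy(raw: str) -> AdCopy:
    """Turn a raw model reply into an AdCopy, or fail loudly."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise OutputDrift(f"reply is not JSON: {exc.msg}") from exc
    if not isinstance(data, dict):
        raise OutputDrift("reply must be a JSON object")
    expected = {"headline", "body"}
    if set(data) != expected:
        raise OutputDrift(f"expected keys {sorted(expected)}, got {sorted(data)}")
    for key in sorted(expected):
        if not isinstance(data[key], str):
            raise OutputDrift(f"{key} must be a string")
    return AdCopy(headline=data["headline"], body=data["body"])
```

The point of the explicit exception is that drift becomes an event the rest of the harness can count, trace, and retry on, instead of a surprise downstream.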

Bounded Tools

What the agent can and can't do is defined in code, not requested in a prompt. The boundary is enforced.
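One simple way to enforce the boundary is an allowlist: the agent can only invoke tools that were registered in code, and anything else raises. This is a sketch under that assumption. `ToolRegistry`, `ToolNotAllowed`, and `read_page` are illustrative names, and `read_page` is a stub.

```python
from typing import Callable, Dict


class ToolNotAllowed(PermissionError):
    """Raised when the agent asks for a tool outside its boundary."""


class ToolRegistry:
    def __init__(self) -> None:
        self._tools: Dict[str, Callable[..., str]] = {}

    def register(self, name: str, fn: Callable[..., str]) -> None:
        self._tools[name] = fn

    def call(self, name: str, **kwargs) -> str:
        # The check lives here, in code; the prompt never has to mention it.
        if name not in self._tools:
            raise ToolNotAllowed(f"tool {name!r} is not registered")
        return self._tools[name](**kwargs)


def read_page(slug: str) -> str:
    # Stub standing in for a real read-only tool.
    return f"<contents of {slug}>"


registry = ToolRegistry()
registry.register("read_page", read_page)
```

With this in place, a model that decides to call `delete_site` gets an exception rather than a deleted site, however persuasive its reasoning was.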

Prompt Management

Prompts are centralised, layered, and versioned — shared rules, per-task intent, per-project personality.
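The layering can be sketched as three versioned pieces joined in a fixed order, with a fingerprint that records which versions produced a given prompt. `PromptLayer`, `compose_prompt`, and the fingerprint format are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class PromptLayer:
    name: str
    version: str
    text: str


def compose_prompt(
    shared: PromptLayer, task: PromptLayer, project: PromptLayer
) -> Tuple[str, str]:
    """Join the layers in a fixed order and return (prompt, fingerprint)."""
    layers = [shared, task, project]
    prompt = "\n\n".join(layer.text for layer in layers)
    # The fingerprint can be logged with every call to make regressions traceable.
    fingerprint = "+".join(f"{layer.name}@{layer.version}" for layer in layers)
    return prompt, fingerprint
```

Logging the fingerprint alongside each call is what lets you answer “which prompt version produced this output?” later.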

Orchestration

Something decides when the model is called and in what order. One loop, many intents.
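“One loop, many intents” could look roughly like the sketch below: a single queue-driven loop dispatches each intent to a handler, and each handler returns the intents that should run next. The intent names and handlers are hypothetical, and the handlers are stubs where model calls would normally go.

```python
from collections import deque
from typing import Callable, Deque, Dict, List

Handler = Callable[[dict], List[str]]  # a handler returns follow-up intents


def run_loop(
    start: str, handlers: Dict[str, Handler], state: dict, max_steps: int = 20
) -> List[str]:
    """Dispatch intents one at a time until the queue is empty or the step cap is hit."""
    queue: Deque[str] = deque([start])
    history: List[str] = []
    while queue and len(history) < max_steps:
        intent = queue.popleft()
        if intent not in handlers:
            raise KeyError(f"no handler for intent {intent!r}")
        history.append(intent)
        queue.extend(handlers[intent](state))
    return history


def draft(state: dict) -> List[str]:
    state["copy"] = "draft v1"  # stand-in for a model call
    return ["review"]


def review(state: dict) -> List[str]:
    state["approved"] = bool(state.get("copy"))
    return ["publish"] if state["approved"] else ["draft"]


def publish(state: dict) -> List[str]:
    state["published"] = True
    return []


HANDLERS: Dict[str, Handler] = {"draft": draft, "review": review, "publish": publish}
```

The `max_steps` cap is a deliberate choice for a long-running agent: a handler that keeps re-queueing itself stops after a bounded number of steps instead of running forever.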

Trace

Every step is visible. When it breaks at 3am, you can grep exactly which tool call went wrong, and what it cost.
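A greppable trace can be as plain as one JSON line per tool call. The wrapper below is a sketch of that idea. The field names, and the idea of passing the cost in as a number, are assumptions; a real system would take the cost from the provider's usage data.

```python
import json
import time
from typing import Any, Callable, TextIO


def traced_call(
    log: TextIO,
    run_id: str,
    tool: str,
    fn: Callable[..., Any],
    cost_usd: float = 0.0,
    **kwargs: Any,
) -> Any:
    """Call fn(**kwargs) and append one JSON line describing what happened."""
    record = {"run_id": run_id, "tool": tool, "args": kwargs, "cost_usd": cost_usd}
    started = time.monotonic()
    try:
        result = fn(**kwargs)
        record["status"] = "ok"
        return result
    except Exception as exc:
        record["status"] = "error"
        record["error"] = repr(exc)
        raise  # the trace records the failure but never hides it
    finally:
        record["duration_s"] = round(time.monotonic() - started, 4)
        # default=str keeps logging from failing on non-JSON argument types
        log.write(json.dumps(record, default=str) + "\n")
```

Because every line carries `run_id`, `tool`, and `status`, finding the failing step is a search for `"status": "error"` within one run.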

Evaluation

The system can quantify “is this good?” — and gate, block, and retry on the answer.

Closing the loop: self-healing

The most useful pattern is an evaluation gate: a tool produces something, a check runs immediately, and if it fails the harness blocks and feeds a neutral correction back to the agent — which retries with a hint until it passes. No human in the loop.

EVALUATION GATE · block → retry → pass

In our engine this guards accessibility, dead links, and contrast — the rules live in tools and hooks, not in the prompt, so the model never needs to be an expert; it just needs to be told, calmly, to try again.

Don't scold the model.

Retries use neutral language (“contrast is 2.8:1, needs ≥ 4.5:1 — try a darker primary”). Berate an LLM and it tends to answer like something being berated.
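To make the gate concrete, here is a sketch of a contrast check wired into a block, retry, pass loop. It uses the WCAG relative-luminance formula for the ratio and the 4.5:1 threshold quoted above. `contrast_gate`, `run_with_gate`, and the `produce` callback (which stands in for the agent proposing a colour) are hypothetical names, and the “darker primary” hint assumes a light background.

```python
from typing import Callable, Optional, Tuple


def _channel(c8: int) -> float:
    # Linearise one 8-bit sRGB channel (WCAG 2.x definition).
    c = c8 / 255
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4


def luminance(hex_color: str) -> float:
    h = hex_color.lstrip("#")
    r, g, b = (int(h[i:i + 2], 16) for i in (0, 2, 4))
    return 0.2126 * _channel(r) + 0.7152 * _channel(g) + 0.0722 * _channel(b)


def contrast_ratio(fg: str, bg: str) -> float:
    lighter, darker = sorted((luminance(fg), luminance(bg)), reverse=True)
    return (lighter + 0.05) / (darker + 0.05)


def contrast_gate(fg: str, bg: str, minimum: float = 4.5) -> Optional[str]:
    """Return None on pass, or a neutral, factual hint on failure."""
    ratio = contrast_ratio(fg, bg)
    if ratio >= minimum:
        return None
    # Hint wording assumes a light background.
    return f"contrast is {ratio:.1f}:1, needs >= {minimum}:1. Try a darker primary."


def run_with_gate(
    produce: Callable[[Optional[str]], str], bg: str, max_retries: int = 3
) -> Tuple[str, int]:
    """Ask for a colour, block on failure, and retry with the hint."""
    hint: Optional[str] = None
    for attempt in range(1, max_retries + 1):
        fg = produce(hint)  # the agent sees only the neutral hint, if any
        hint = contrast_gate(fg, bg)
        if hint is None:
            return fg, attempt
    raise RuntimeError(f"gate still failing after {max_retries} attempts: {hint}")
```

The retry cap matters as much as the gate: after `max_retries` the harness raises instead of looping, which is the point where a human or a different strategy should take over.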

Two principles I keep relearning

CLAUDE.md is a map, not a law book.

Don't dump every rule and doc into it — that eats the context window. Point the agent at where to look (“need X? read docs/X.md”) instead of pasting X. And write the first version by hand: human-written guidance usually beats model-written, especially on strong models.

Critical rules are scar tissue.

The handful of hard rules in a production agent's config exist because something broke once. “Deploy to staging before prod.” Each one is an incident encoded so the agent never repeats it.

How to adopt it in three weeks

You don't have to build all six at once. If you develop with an agent CLI, much of this already exists as extension points.

WEEK 01

Stabilise behaviour

Draft a project guide (CLAUDE.md), then edit it by hand.
List the commands you run often (test / lint / format).
List the things the agent must not do. Commit it.

WEEK 02

Automate the routine

Turn your 3 most-repeated requests into reusable skills.
Add one or two hooks (lint --fix after edits, notify on stop).

WEEK 03

Build your own harness

Copy the six-part skeleton: typed state, tool groups, layered prompts, an intent loop, trace logging, an evaluation gate.
Don't implement every intent — get one running end-to-end first.

The point

The most important harness is one that keeps improving itself.

That's the bar I'm building toward at Maxora: an agent that accumulates per-project memory, turns every campaign it ships into a training signal, and corrects its own mistakes through self-healing loops. A 2026 AI agent isn't a static tool — it's a system that gets stronger on its own.

Concept reference: Hung-yi Lee — Harness Engineering. These are my working notes applied to long-running growth agents.