ENGINEERING NOTES · LOCAL LLM · CODING AGENT
A Local Claude Code
One GPU · Small Model · Big Miracle
Notes from building Maxora — a privacy-first coding agent where the source code never leaves your laptop and only the “brain” runs on a self-hosted model. How to stand up a small model from scratch, beat a tiny context window, and make a CLI that genuinely feels Claude-Code-level: streaming, parallel tools, async input, self-verification.
Two sides: local hands, remote brain
The thesis is privacy without giving up capability. Your repository — the thing you actually care about keeping private — never leaves the machine. Only the model's reasoning runs elsewhere, on hardware you control. So the system splits cleanly in two:
The model never touches the filesystem. It only emits OpenAI-format tool_calls;
the CLI executes them locally, gates anything dangerous behind a y/n prompt, and
feeds the results back. That single rule — the LLM proposes, the local client
disposes — is what makes a remote brain safe to use on private code.
Ollama already speaks an OpenAI-compatible API, so why not point the CLI straight at it? Because a thin proxy is the right seam for the things that shouldn't live in either the model or the client: compressing screenshots down to a few hundred pixels before they cost vision tokens, handling concurrent requests, optional bearer auth, and a redundant truncation safety net. It stays a near-transparent relay — the CLI owns the system prompt.
Standing up the model, end to end
From bare GPU to an endpoint your CLI can call. The whole point of a small model is that this fits on one workstation card.
Pick & pull a coding model
- A quantized open-weight coding model (the Qwen-Coder family is a strong default) — small enough for a single 24 GB card.
- Install Ollama; ollama pull the tag. It serves an OpenAI-compatible API immediately.
Bake the context window
- The OpenAI endpoint ignores a per-request num_ctx, so set it once in a Modelfile and build a derived model.
- That's how you get a real 32k (or 128k) window instead of the small default.
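A minimal sketch of the bake step. The base tag `qwen2.5-coder:32b` and the derived name `coder-32k` are placeholders chosen for illustration, not necessarily what Maxora uses; `FROM` and `PARAMETER num_ctx` are ordinary Modelfile directives, and the helper only renders the text you would then hand to `ollama create`.

```python
def render_modelfile(base_tag: str, num_ctx: int) -> str:
    """Render a Modelfile that pins the context window for a derived model."""
    if num_ctx <= 0:
        raise ValueError("num_ctx must be positive")
    return f"FROM {base_tag}\nPARAMETER num_ctx {num_ctx}\n"


# Placeholder tag and window size, for illustration only.
modelfile_text = render_modelfile("qwen2.5-coder:32b", 32768)
print(modelfile_text)

# Save the text as "Modelfile" on the GPU host, then build the derived model:
#   ollama create coder-32k -f Modelfile
```

The CLI then requests the derived model name, so every call gets the larger window without passing any per-request option.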
Front it with a proxy
- A tiny threaded HTTP server intercepts /chat/completions to compress images and relay the stream.
- Run it under systemd with Restart=always so it's always there, never launched by hand.
Reach it privately
- Expose the GPU only on a private network; the laptop reaches it over VPN.
- The CLI discovers the model from /v1/models and pins it via an env override.
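Discovery can be sketched in a few lines, assuming the endpoint returns the usual OpenAI-style listing (`{"data": [{"id": ...}]}`). The environment variable name `AGENT_MODEL` is hypothetical; the notes only say that an env override exists.

```python
import json
import os
import urllib.request
from typing import Optional


def choose_model(models_payload: dict, override: Optional[str] = None) -> str:
    """Return the pinned model if it is served, else the first model listed."""
    ids = [m["id"] for m in models_payload.get("data", [])]
    if not ids:
        raise RuntimeError("endpoint lists no models")
    if override:
        if override not in ids:
            raise RuntimeError(f"pinned model {override!r} is not served")
        return override
    return ids[0]


def discover_model(base_url: str) -> str:
    """Fetch /v1/models and apply the (hypothetical) AGENT_MODEL override."""
    with urllib.request.urlopen(f"{base_url}/v1/models", timeout=10) as resp:
        payload = json.load(resp)
    return choose_model(payload, os.environ.get("AGENT_MODEL"))
```

Failing loudly when the pinned model is missing is a design choice of this sketch: a silent fallback to a different model would be hard to notice mid-session.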
On one card you are spending a fixed VRAM budget across three things that all want it: model precision, a vision projector, and the KV-cache (your context window). You cannot max all three. A 27B dense model with vision tops out around a 32k window; drop vision and a 35B quant will hold 128k. Choose by workload: vision-heavy front-end work vs. huge multi-file refactors. There is no free lunch here — pick the corner of the triangle your tasks actually live in.
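The KV-cache corner of that triangle is easy to estimate with the standard formula: two tensors (K and V) per layer, each `n_kv_heads × head_dim` values per token. The architecture numbers below are hypothetical, picked only to show how the cost scales; plug in the real values for your model.

```python
def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 n_ctx: int, bytes_per_elem: int = 2) -> float:
    """Size in GiB of the K and V caches for one sequence of n_ctx tokens."""
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem
    return total_bytes / 2**30


# Hypothetical model: 48 layers, 4 KV heads, head_dim 128, fp16 cache.
for ctx in (32_768, 131_072):
    print(ctx, kv_cache_gib(48, 4, 128, ctx))
# prints:
# 32768 3.0
# 131072 12.0
```

The cache grows linearly with the window, so quadrupling the context quadruples this slice of VRAM; that memory has to come out of model precision or the vision projector.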
The real problem: a tiny window
A frontier model gives you 200k+ tokens. A self-hosted small model gives you ~32k. That single constraint is what separates a toy from something that survives a real session. Context engineering — deciding what the model gets to see — is the bulk of the work. Six techniques do the heavy lifting:
Token-aware compaction
Before each request, estimate tokens (roughly one token per CJK character, about four characters per token for everything else) and, past a threshold, fold old turns into a model-written summary. Cut on user-turn boundaries so a tool_call/tool pair is never split.
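A rough sketch of the estimator and the boundary-safe cut, assuming OpenAI-format message dicts with the system prompt held separately. The function names are illustrative, and the CJK test covers only the main Unified Ideographs block, which is a simplification.

```python
def estimate_tokens(text: str) -> int:
    """Rough estimate: 1 token per CJK character, ~4 other characters per token."""
    cjk = sum(1 for ch in text if "\u4e00" <= ch <= "\u9fff")
    other = len(text) - cjk
    return cjk + (other + 3) // 4  # round the non-CJK part up


def split_for_compaction(messages: list, keep_tokens: int):
    """Split into (old, recent) so that `recent` always starts on a user turn.

    Walks backwards from the newest message and stops at the first user turn
    once at least `keep_tokens` have been kept. Because the cut only ever
    lands on a user turn, an assistant tool_call and its tool result stay on
    the same side. If no user turn exists, nothing is folded.
    """
    total = 0
    cut = 0
    for i in range(len(messages) - 1, -1, -1):
        total += estimate_tokens(str(messages[i].get("content") or ""))
        if messages[i]["role"] == "user":
            cut = i
            if total >= keep_tokens:
                break
    return messages[:cut], messages[cut:]
```

The `old` half is what gets summarised by the model; the `recent` half is sent verbatim after the summary.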
Ranged reads, not whole files
Cap read_file output, and give grep a context argument (N lines around each hit) so the model can locate and patch a large file without ever loading the whole thing.
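The two primitives might look like the sketch below. It works on text that has already been loaded, so the file handling is left out; the names, the default cap of 200 lines, and the `N: text` output format are assumptions for illustration.

```python
def read_range(text: str, start: int = 1, max_lines: int = 200) -> str:
    """Return up to max_lines numbered lines, beginning at 1-indexed `start`."""
    start = max(1, start)
    lines = text.splitlines()
    chunk = lines[start - 1 : start - 1 + max_lines]
    return "\n".join(f"{start + k}: {line}" for k, line in enumerate(chunk))


def grep_context(text: str, needle: str, context: int = 2, max_hits: int = 20) -> list:
    """Return one numbered snippet per hit, with `context` lines on each side."""
    lines = text.splitlines()
    snippets = []
    for i, line in enumerate(lines):
        if needle in line:
            lo = max(0, i - context)
            hi = min(len(lines), i + context + 1)
            snippets.append("\n".join(f"{n + 1}: {lines[n]}" for n in range(lo, hi)))
            if len(snippets) >= max_hits:
                break
    return snippets
```

Numbering every line lets the model ask for a follow-up range or aim a patch without ever seeing the rest of the file.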
Repo map
At startup, statically extract signatures (functions, classes) and inject a compact <repo_map> — the model gets the shape of the codebase for a few hundred tokens instead of reading every file.
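For Python sources, the standard `ast` module is enough to sketch the idea. This is an illustration, not Maxora's extractor: it handles one language, lists only positional parameters, and takes the files as a path-to-source dict rather than walking the disk.

```python
import ast


def signatures(source: str) -> list:
    """List top-level functions, classes, and the methods of each class."""
    def fmt(fn, indent=""):
        args = ", ".join(a.arg for a in fn.args.args)
        return f"{indent}def {fn.name}({args})"

    funcs = (ast.FunctionDef, ast.AsyncFunctionDef)
    out = []
    for node in ast.parse(source).body:
        if isinstance(node, funcs):
            out.append(fmt(node))
        elif isinstance(node, ast.ClassDef):
            out.append(f"class {node.name}")
            out.extend(fmt(item, "  ") for item in node.body if isinstance(item, funcs))
    return out


def repo_map(files: dict) -> str:
    """Build a compact <repo_map> block from {path: source}."""
    parts = ["<repo_map>"]
    for path in sorted(files):
        try:
            sigs = signatures(files[path])
        except SyntaxError:
            continue  # skip files that do not parse
        if sigs:
            parts.append(path)
            parts.extend(f"  {s}" for s in sigs)
    parts.append("</repo_map>")
    return "\n".join(parts)
```

Only names and parameter lists survive, which is what keeps the map to a few hundred tokens.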
Sub-agents
A task tool delegates broad, uncertain exploration to a fresh-context sub-agent. It burns its own window on grep + ranged reads and returns a concise summary, leaving the main window clean.
On-demand memory
Only a small project guide is injected in full each session. Larger docs (CLAUDE.md / AGENTS.md) are listed by name and read on demand — a big doc never bloats the window.
Keep only the latest image
Vision eats tokens. After a screenshot the newest image stays in context and older ones are stripped — so the agent can screenshot → fix → screenshot repeatedly without blowing the window.
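A sketch of the stripping pass, assuming OpenAI-style multimodal messages where `content` is a list of parts and images are `{"type": "image_url", ...}`. Leaving a short text placeholder where an image used to be is this sketch's own choice, not something the notes specify.

```python
def keep_latest_image(messages: list) -> list:
    """Return a copy of the history in which only the newest image part survives."""
    placeholder = {"type": "text", "text": "[older screenshot removed]"}
    newest_seen = False
    result = []
    for msg in reversed(messages):          # newest message first
        content = msg.get("content")
        if isinstance(content, list):
            kept = []
            for part in reversed(content):  # newest part first
                if part.get("type") == "image_url":
                    if newest_seen:
                        kept.append(placeholder)
                        continue
                    newest_seen = True
                kept.append(part)
            msg = {**msg, "content": list(reversed(kept))}
        result.append(msg)
    return list(reversed(result))
```

The input list is left untouched, so the full history can still be archived before the trimmed copy is sent.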
Compaction is the one thing you never ask permission for.
Every other irreversible action goes through a gate. Compaction can't — declining it would overflow the window and kill the turn. So it runs automatically, and the messages it summarises away are archived to disk first, in case you want them back.
The agent loop, and making it async
At its core every coding agent is one loop. The art is in what wraps it.
Two details make the loop feel professional rather than fragile. Parallelism: read-only tool calls from one turn run concurrently in a thread pool, with results re-ordered to match the original call order; only mutating calls are serialized. Resilience: connection errors and 5xx responses are retried with exponential backoff, while 4xx responses are never retried. And while the model cold-loads, a live 🧠 thinking… Ns spinner proves it isn't frozen.
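Both details fit in a short sketch. `execute`, `is_read_only`, and `send` are hypothetical callables supplied by the caller. Batching only consecutive read-only calls, so a read never jumps ahead of an earlier write, is an assumption of this sketch rather than something the notes spell out.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_tool_calls(calls, execute, is_read_only, max_workers=8):
    """Run read-only calls concurrently, mutating calls serially; keep call order."""
    results = [None] * len(calls)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        batch = []  # indices of consecutive read-only calls

        def flush():
            # pool.map yields results in input order, so zip re-aligns them.
            for i, res in zip(batch, pool.map(execute, [calls[i] for i in batch])):
                results[i] = res
            batch.clear()

        for i, call in enumerate(calls):
            if is_read_only(call):
                batch.append(i)
            else:
                flush()                      # finish pending reads first
                results[i] = execute(call)   # mutating call runs alone
        flush()
    return results


def with_retry(send, attempts=4, base_delay=0.5, sleep=time.sleep):
    """Retry connection errors and 5xx with exponential backoff; never retry 4xx.

    `send` returns (status, body) or raises ConnectionError.
    """
    for attempt in range(attempts):
        try:
            status, body = send()
        except ConnectionError:
            status, body = None, None
        if status is not None and status < 500:
            return status, body
        if attempt < attempts - 1:
            sleep(base_delay * 2 ** attempt)
    raise RuntimeError("model endpoint unavailable after retries")
```

A 4xx means the request itself is wrong, so repeating it only wastes time; a 5xx or a dropped connection during a cold load is usually worth another attempt.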
Synchronous REPL
→ You stare at a frozen prompt. Can't queue the next task, can't switch mode mid-run.
Async, background worker
→ Type while it works; tasks queue and run serially. It feels alive.
The trick: there is exactly one input channel.
The agent runs on a background worker thread while the main thread keeps a persistent prompt
alive. New tasks queue and process serially. The subtle part is the permission
gate: it can no longer read the keyboard itself, because the single live prompt already owns it.
So a y/n/a request routes through that one channel — the prompt answers the
gate. Conversation-mutating commands (/reset, /compact,
/model) are blocked while the worker is busy, and the synchronous loop stays as an
automatic fallback for non-interactive input. The lesson generalises: one input
channel, everything negotiates through it.
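One way to sketch the single channel: the main thread passes every line the prompt reads to `submit`, and the worker calls `ask_permission` when a gate needs an answer. The class and method names are hypothetical, and treating both `y` and `a` as approval is a simplification; whatever `a` additionally remembers is out of scope here.

```python
import queue
import threading


class InputChannel:
    """One prompt feeds two consumers: the task queue and a pending gate."""

    def __init__(self):
        self.tasks = queue.Queue()                 # drained serially by the worker
        self._gate_answer = queue.Queue(maxsize=1)
        self._gate_open = threading.Event()

    def submit(self, line: str) -> str:
        """Main thread: route one line of input. Returns where it went."""
        if self._gate_open.is_set():
            self._gate_open.clear()
            self._gate_answer.put(line.strip().lower())
            return "gate"
        self.tasks.put(line)
        return "task"

    def ask_permission(self, question: str, timeout=None) -> bool:
        """Worker thread: block until the prompt supplies an answer."""
        print(f"\n{question} [y/n/a]")
        self._gate_open.set()
        answer = self._gate_answer.get(timeout=timeout)  # raises queue.Empty on timeout
        return answer in ("y", "a")
```

The worker never reads the keyboard: while a gate is open, the next line typed at the prompt is its answer, and every other line is queued as a task.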
How it's structured
The system separates into a handful of concerns, each isolated so it can be reasoned about — and swapped — on its own. The shape, not the source:
| Concern | What it owns |
|---|---|
| Settings & state | One source of truth for configuration and shared runtime state. |
| Tool registry | Each tool's schema, implementation, and danger flag, bound together in one place. |
| Model client | Streaming, reasoning split, retry/backoff, model selection. |
| Context manager | Token estimation and compaction on safe boundaries. |
| Agent core | System-prompt assembly, parallel tool execution, the loop itself. |
| REPL & input | The command surface, the async worker, and a single input channel. |
| Permission layer | The y/n/a gate, diff preview, and deny-rule enforcement. |
| Delegation | Fresh-context sub-agents that isolate their own token usage. |
| Project awareness | A static repo map and an auto-read per-project guide. |
| Extensibility | Shell hooks and external tool servers (MCP). |
| Plumbing | Sessions, git ops, editor integration, syntax/test checks, terminal UI. |
Tools are one declaration. A single decorator binds a tool's schema, its implementation, and whether it's dangerous into one live registry — adding a tool is decorating one function, nothing else to keep in sync. Because the registry is live, capabilities added at runtime (a sub-agent, an external tool server) appear automatically. The working set:
Read-only tools run with no confirmation; the five ⚠ tools are dangerous=True and pass through the permission gate. multi_edit applies several patches to one file atomically; patch_file refuses a non-unique match; run_command can run in the background; screenshot renders local HTML headless and feeds the PNG back so the model can review its own UI.
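A minimal sketch of a decorator-based registry, together with the dispatch step where the local client decides whether a proposed call actually runs. The decorator signature, the `confirm` callback, and the two tool bodies are illustrative assumptions; only the tool names and the `dangerous` flag come from the notes.

```python
import json
import subprocess

TOOLS = {}  # live registry: name -> implementation, schema, danger flag


def tool(name, description, parameters, dangerous=False):
    """Register a function as a tool and build its OpenAI-format schema."""
    def register(fn):
        TOOLS[name] = {
            "fn": fn,
            "dangerous": dangerous,
            "schema": {"type": "function", "function": {
                "name": name, "description": description, "parameters": parameters}},
        }
        return fn
    return register


@tool("read_file", "Return the text of a file.",
      {"type": "object", "properties": {"path": {"type": "string"}}, "required": ["path"]})
def read_file(path):
    with open(path, encoding="utf-8") as fh:
        return fh.read()


@tool("run_command", "Run a shell command.",
      {"type": "object", "properties": {"command": {"type": "string"}}, "required": ["command"]},
      dangerous=True)
def run_command(command):
    done = subprocess.run(command, shell=True, capture_output=True, text=True)
    return done.stdout + done.stderr


def schemas():
    """Tool list to send with each request; reflects runtime additions."""
    return [entry["schema"] for entry in TOOLS.values()]


def dispatch(name, arguments_json, confirm):
    """Execute one model-proposed call locally, gating dangerous tools."""
    entry = TOOLS.get(name)
    if entry is None:
        return f"error: unknown tool {name}"
    args = json.loads(arguments_json or "{}")
    if entry["dangerous"] and not confirm(name, args):
        return "denied by user"
    return entry["fn"](**args)
```

Because `schemas()` reads the registry at call time, a tool registered mid-session appears in the next request with no other bookkeeping.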
What keeps a small model honest
A 27B model is capable but it over-claims — it will say “done” without calling a tool, or rewrite a whole file and stall at 80% of the window. The harness, not the weights, is what makes it reliable. Layered defences:
Read before edit
Patching or overwriting an existing file is rejected unless it was read this session. Creating a new file needs no read.
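The rule reduces to a set of paths read during the session. A sketch, with hypothetical names; the refusal is returned as a string so it can be fed back to the model as the tool result.

```python
import os


class ReadGuard:
    """Reject edits to existing files that were not read this session."""

    def __init__(self):
        self._read = set()

    def mark_read(self, path: str) -> None:
        """Call from the read tool after a successful read."""
        self._read.add(os.path.realpath(path))

    def check_write(self, path: str):
        """Return an error message for the model, or None if the write may proceed."""
        real = os.path.realpath(path)
        if os.path.exists(real) and real not in self._read:
            return f"refused: read {path} before editing it"
        return None  # new file, or an existing file already read
```

Resolving paths with `realpath` keeps `./a.py` and `a.py` from being treated as different files.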
Permission modes
default / acceptEdits / plan / yolo, switched live with shift+tab. plan blocks every mutation so the agent designs without acting.
Deny rules
Path globs (.env, *.pem, **/.git/**) are enforced at the tool layer — they block even under yolo. A security rule, not a prompt.
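A sketch of tool-layer enforcement with the standard `fnmatch` module. Each rule is tested against both the basename and the root-anchored path, so `.env` matches at any depth and `**/.git/**` also catches a top-level `.git`. That matching strategy, and the mapping from mode to decision, are assumptions; the notes only state that deny rules win even under yolo and that plan blocks every mutation.

```python
import fnmatch
import posixpath

DENY_RULES = [".env", "*.pem", "**/.git/**"]


def is_denied(path: str, rules=DENY_RULES) -> bool:
    """True if the path matches any deny glob."""
    norm = posixpath.normpath(path.replace("\\", "/")).lstrip("/")
    candidates = (posixpath.basename(norm), "/" + norm)
    return any(fnmatch.fnmatchcase(c, rule) for rule in rules for c in candidates)


def authorize_write(path: str, mode: str) -> str:
    """Return "deny", "ask", or "allow" for a mutating call on `path`."""
    if is_denied(path):
        return "deny"   # checked before the mode, so yolo cannot bypass it
    if mode == "plan":
        return "deny"   # plan mode blocks every mutation
    if mode == "default":
        return "ask"    # goes through the y/n/a gate
    return "allow"      # acceptEdits and yolo (assumed behaviour)
```

Checking the deny list before the mode is the whole point: the rule lives in code the model cannot talk its way around.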
Auto-verify
After edits, one extra turn makes the model grep/read-check that every change it claimed actually exists — catching over-claimed completion.
Auto-check
Changed files are syntax-checked (or run through your test/lint command); on failure the error is fed back and the agent fixes it, up to a cap.
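For Python files, the built-in `compile` is enough to sketch the check-and-repair loop. `ask_model_to_fix` is a hypothetical callback standing in for one more model turn, and the cap of three rounds is illustrative.

```python
def syntax_error(source: str, filename: str = "<edited>"):
    """Return a one-line error description, or None if the source compiles."""
    try:
        compile(source, filename, "exec")
    except SyntaxError as exc:
        return f"{filename}:{exc.lineno}: {exc.msg}"
    return None


def check_and_fix(source: str, ask_model_to_fix, max_rounds: int = 3):
    """Feed syntax errors back to the model until clean or the cap is hit.

    Returns (final_source, ok).
    """
    for _ in range(max_rounds):
        error = syntax_error(source)
        if error is None:
            return source, True
        source = ask_model_to_fix(source, error)
    return source, syntax_error(source) is None
```

The cap matters: without it, a model that keeps producing the same broken edit would burn the rest of the window on repair attempts.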
Large-file guard
Whole-rewriting a big existing file is refused (it thrashes the window); the model is steered to multi_edit/patch_file instead.
The “said it, didn't do it” nudge.
The classic small-model failure: the model writes “I'll add that file now” and then ends the turn without calling a tool. A dedicated nudge detects stated-intent-without-tool-call and prompts it to actually do the thing — capped so it can't loop forever. Every executed tool is also appended to an audit log, and every edit pushes its prior content to an undo stack.
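The detector can be as simple as a phrase match on a turn that ended without tool calls. The phrase list, the cap of two nudges, and the function name are guesses for illustration; a real harness would tune them against its own model's habits.

```python
import re

# Phrases that announce an action; an illustrative list, not exhaustive.
INTENT = re.compile(r"\b(I['’]ll|I will|let me|I['’]m going to)\b", re.IGNORECASE)


def needs_nudge(assistant_msg: dict, nudges_sent: int, max_nudges: int = 2) -> bool:
    """True if the turn states an intent but calls no tool, and the cap allows it."""
    if nudges_sent >= max_nudges:
        return False  # capped so the loop cannot run forever
    if assistant_msg.get("tool_calls"):
        return False  # the model actually acted
    return bool(INTENT.search(assistant_msg.get("content") or ""))
```

When it fires, the harness appends a short user-role reminder to call the tool and runs one more turn, incrementing `nudges_sent`.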
What makes it feel Claude-Code-level
“Same level” isn't the model — it's the dozen affordances around it that you stop noticing because they just work.
- Token-by-token streaming, with reasoning split out and a live thinking timer.
- Parallel read-only tools; resilient retry on transient failures.
- Async persistent input — queue tasks, switch mode mid-run.
- Real context compaction on safe boundaries, with archived backups.
- Diff preview before every edit; read-before-edit; an undo stack.
- Permission modes, deny rules, and a full audit log.
- Sub-agents, a project guide, MCP servers, and shell hooks.
- Self-verification and auto-syntax/test checks that close the loop without a human.
- Vision: drop a screenshot in, or let the agent render and review its own UI.
- Sessions you can --resume; a polished prompt-toolkit input with a slash-command menu.
The point
The model is the smaller half. The harness is the product.
A self-hosted small model will never out-reason a frontier model on raw IQ. It doesn't have to. Wrapped in a harness that manages its context, parallelises its tools, verifies its own work, and keeps your source on your own machine, it becomes something a frontier API can't be: private, owned, and free to run all day. That's the bar I built Maxora toward — Claude-Code-level ergonomics on hardware you control.
Companion to my Harness Engineering notes — this is the same philosophy applied to a local coding agent rather than a growth agent.