
ENGINEERING NOTES · LOCAL LLM · CODING AGENT

A Local Claude Code

Single-GPU machine · small model · big miracle

Notes from building Maxora — a privacy-first coding agent where the source code never leaves your laptop and only the “brain” runs on a self-hosted model. How to stand up a small model from scratch, beat a tiny context window, and make a CLI that genuinely feels Claude-Code-level: streaming, parallel tools, async input, self-verification.

in production · two-sided: local tools + remote inference · ~32k window · 14 tools · async REPL
01

Two sides: local hands, remote brain

The thesis is privacy without giving up capability. Your repository — the thing you actually care about keeping private — never leaves the machine. Only the model's reasoning runs elsewhere, on hardware you control. So the system splits cleanly in two:

        your laptop                                 a GPU you own
┌──────────────────────────┐    HTTP / VPN    ┌────────────────────────┐
│ CLI: agent loop          │  ─────────────▶  │ proxy                  │
│ all tool execution       │                  │ image compress         │
│ read/write/run + (y/n)   │  ◀───stream────  │ concurrency            │
└──────────────────────────┘                  └───────────┬────────────┘
    source never leaves                                   ▼
                                                Ollama → local model

The model never touches the filesystem. It only emits OpenAI-format tool_calls; the CLI executes them locally, gates anything dangerous behind a y/n prompt, and feeds the results back. That single rule — the LLM proposes, the local client disposes — is what makes a remote brain safe to use on private code.
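The rule above can be sketched as a small dispatcher. This is a minimal sketch, not Maxora's actual code: the `TOOLS` table, the two stub tools, and `execute_tool_call` are hypothetical names, and the only thing assumed about the wire format is the OpenAI tool-call shape (`id`, `function.name`, and `function.arguments` as a JSON string).

```python
import json
from typing import Callable

# Hypothetical tool table: name -> (implementation, dangerous flag).
# The stubs stand in for real file and shell access.
TOOLS: dict[str, tuple[Callable[..., str], bool]] = {
    "read_file": (lambda path: f"<contents of {path}>", False),
    "run_command": (lambda command: f"<ran {command}>", True),
}


def execute_tool_call(call: dict, confirm: Callable[[str], bool]) -> dict:
    """Run one OpenAI-format tool call locally and return a `tool` message."""
    name = call["function"]["name"]
    args = json.loads(call["function"]["arguments"] or "{}")
    if name not in TOOLS:
        content = f"error: unknown tool {name!r}"
    else:
        func, dangerous = TOOLS[name]
        # The model only proposed the call; the local client decides.
        if dangerous and not confirm(f"allow {name}({args})? (y/n) "):
            content = "error: denied by user"
        else:
            content = func(**args)
    return {"role": "tool", "tool_call_id": call["id"], "content": content}
```

Passing `confirm` in as a callable keeps the gate swappable: in an interactive session it could be `lambda prompt: input(prompt).strip().lower() == "y"`, and in tests it can be a stub.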

▸ why a proxy in front of the model

Ollama already speaks an OpenAI-compatible API, so why not point the CLI straight at it? Because a thin proxy is the right seam for the things that shouldn't live in either the model or the client: compressing screenshots down to a few hundred pixels before they cost vision tokens, handling concurrent requests, optional bearer auth, and a redundant truncation safety net. It stays a near-transparent relay — the CLI owns the system prompt.

02

Standing up the model, end to end

From bare GPU to an endpoint your CLI can call. The whole point of a small model is that this fits on one workstation card.

STEP 01

Pick & pull a coding model

  • A quantized open-weight coding model (the Qwen-Coder family is a strong default) — small enough for a single 24 GB card.
  • Install Ollama; ollama pull the tag. It serves an OpenAI-compatible API immediately.
STEP 02

Bake the context window

  • Ollama's OpenAI-compatible endpoint ignores a per-request num_ctx, so set it once in a Modelfile and build a derived model.
  • That's how you get a real 32k (or 128k) window instead of the small default.
STEP 03

Front it with a proxy

  • A tiny threaded HTTP server intercepts /chat/completions to compress images and relay the stream.
  • Run it under systemd with Restart=always so it's always there — never launched by hand.
STEP 04

Reach it privately

  • Expose the GPU only on a private network; the laptop reaches it over VPN.
  • The CLI discovers the model from /v1/models and pins it via an env override.
# 1 · pull a small coding model
ollama pull qwen-coder

# 2 · bake a bigger window into a derived model (OpenAI API ignores per-request num_ctx)
# Modelfile
FROM qwen-coder
PARAMETER num_ctx 32768

ollama create my-coder -f Modelfile

# 3 · the proxy is a near-transparent relay: compress images, stream through
# CLI ──▶ proxy ──▶ Ollama
systemctl enable --now my-proxy
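Step 04's model discovery can be sketched as a pure function over the `/v1/models` response. The response shape (`{"data": [{"id": ...}]}`) is the standard OpenAI list format; the variable name `AGENT_MODEL` and the "first listed model" fallback are assumptions for illustration, not Maxora's actual behaviour.

```python
import os
from typing import Mapping


def pick_model(models_payload: dict, env: Mapping[str, str] = os.environ) -> str:
    """Choose a model id: an env override wins, else the first one listed."""
    override = env.get("AGENT_MODEL")  # hypothetical variable name
    if override:
        return override
    ids = [entry["id"] for entry in models_payload.get("data", [])]
    if not ids:
        raise RuntimeError("endpoint lists no models")
    return ids[0]
```

Keeping the HTTP fetch outside this function means the selection logic can be tested without a running server.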
▸ the 24 GB trade-off you can't escape

On one card you are spending a fixed VRAM budget across three things that all want it: model precision, a vision projector, and the KV-cache (your context window). You cannot max all three. A 27B dense model with vision tops out around a 32k window; drop vision and a 35B quant will hold 128k. Choose by workload: vision-heavy front-end work vs. huge multi-file refactors. There is no free lunch here — pick the corner of the triangle your tasks actually live in.
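The KV-cache side of that budget can be estimated with the usual formula for a transformer with grouped-query attention: two tensors (K and V) per layer, each `n_kv_heads × head_dim` values per token. The layer and head counts below are illustrative placeholders, not the dimensions of any particular model.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   n_tokens: int, bytes_per_value: int = 2) -> int:
    """Size of the KV-cache: 2 (K and V) x layers x kv heads x head dim x tokens."""
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_value


GIB = 2 ** 30

# Hypothetical dimensions: 48 layers, 8 KV heads, head_dim 128, fp16 cache.
at_32k = kv_cache_bytes(48, 8, 128, 32 * 1024)
at_128k = kv_cache_bytes(48, 8, 128, 128 * 1024)
print(at_32k / GIB, at_128k / GIB)  # 6.0 and 24.0 GiB for these dimensions
```

Under these assumed dimensions a 32k window costs 6 GiB of cache and a 128k window would cost the whole 24 GB card on its own, which is why a long window only fits once something else in the triangle gives way.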

03

The real problem: a tiny window

A frontier model gives you 200k+ tokens. A self-hosted small model gives you ~32k. That single constraint is what separates a toy from something that survives a real session. Context engineering — deciding what the model gets to see — is the bulk of the work. Six techniques do the heavy lifting:

01

Token-aware compaction

Before each request, estimate tokens (CJK ≈ 1 char/token, everything else ≈ 4 chars/token) and, past a threshold, fold old turns into a model-written summary. Cut on user-turn boundaries so a tool_call/tool pair is never split.
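A minimal sketch of the estimate and the boundary-safe cut, assuming message `content` is a plain string and the system prompt is handled separately. The function names and the `keep_tokens` parameter are hypothetical; the summarisation call itself is left out.

```python
def _is_cjk(ch: str) -> bool:
    # CJK ideographs, Japanese kana, Hangul syllables.
    return ("\u4e00" <= ch <= "\u9fff" or "\u3040" <= ch <= "\u30ff"
            or "\uac00" <= ch <= "\ud7a3")


def estimate_tokens(text: str) -> int:
    """Rough heuristic: 1 token per CJK char, 1 token per 4 other chars."""
    cjk = sum(1 for ch in text if _is_cjk(ch))
    return cjk + (len(text) - cjk + 3) // 4


def split_for_compaction(messages: list, keep_tokens: int):
    """Return (old, recent). `recent` always starts on a user turn, so an
    assistant tool_call and its tool result never land on opposite sides."""
    total = 0
    cut = None
    for i in range(len(messages) - 1, -1, -1):
        total += estimate_tokens(messages[i].get("content") or "")
        if messages[i]["role"] == "user":
            if total <= keep_tokens or cut is None:
                cut = i  # always keep at least the latest user turn
            if total > keep_tokens:
                break
    if cut is None:
        return [], list(messages)
    return list(messages[:cut]), list(messages[cut:])
```

The `old` half is what would be summarised (and archived); `recent` is sent verbatim after the summary.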

02

Ranged reads, not whole files

Cap read_file output, and give grep a context argument (N lines around each hit) so the model can locate and patch a large file without ever loading the whole thing.
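One way the two tools could look, as a sketch: `read_range`, `grep_context`, and their parameter names are hypothetical, and the 200-line default cap is an arbitrary choice for illustration.

```python
import re


def read_range(path: str, start: int = 1, limit: int = 200) -> str:
    """Return at most `limit` numbered lines, beginning at 1-based line `start`."""
    out = []
    with open(path, encoding="utf-8", errors="replace") as fh:
        for lineno, line in enumerate(fh, 1):
            if lineno < start:
                continue
            if lineno >= start + limit:
                out.append(f"... truncated at line {lineno - 1}; pass start={lineno} to continue")
                break
            out.append(f"{lineno}: {line.rstrip()}")
    return "\n".join(out)


def grep_context(path: str, pattern: str, context: int = 2) -> str:
    """Return matching lines plus `context` lines either side, with line numbers."""
    rx = re.compile(pattern)
    with open(path, encoding="utf-8", errors="replace") as fh:
        lines = fh.read().splitlines()
    keep = set()
    for i, line in enumerate(lines):
        if rx.search(line):
            keep.update(range(max(0, i - context), min(len(lines), i + context + 1)))
    return "\n".join(f"{i + 1}: {lines[i]}" for i in sorted(keep))
```

The tool process still reads the file from disk; what stays small is the text handed back to the model. The line numbers in the output let the model follow a grep hit with a targeted `read_range` call.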

03

Repo map

At startup, statically extract signatures (functions, classes) and inject a compact <repo_map> — the model gets the shape of the codebase for a few hundred tokens instead of reading every file.
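For Python sources, static extraction can be sketched with the standard `ast` module. This handles Python files only and takes sources as an in-memory dict; the function names and the exact map layout are assumptions, and a real map for other languages would need a different parser.

```python
import ast


def _sig(node) -> str:
    args = ", ".join(a.arg for a in node.args.args)
    return f"def {node.name}({args})"


def file_signatures(source: str) -> list:
    """Top-level functions, classes, and their methods, without bodies."""
    sigs = []
    funcs = (ast.FunctionDef, ast.AsyncFunctionDef)
    for node in ast.parse(source).body:
        if isinstance(node, funcs):
            sigs.append(_sig(node))
        elif isinstance(node, ast.ClassDef):
            sigs.append(f"class {node.name}")
            sigs.extend(f"  {_sig(sub)}" for sub in node.body if isinstance(sub, funcs))
    return sigs


def build_repo_map(files: dict) -> str:
    """files maps path -> source text; unparsable files are skipped."""
    parts = ["<repo_map>"]
    for path in sorted(files):
        try:
            sigs = file_signatures(files[path])
        except SyntaxError:
            continue
        if sigs:
            parts.append(path)
            parts.extend(f"  {s}" for s in sigs)
    parts.append("</repo_map>")
    return "\n".join(parts)
```

Nothing is imported or executed here: `ast.parse` only builds a syntax tree, so the map is safe to compute on untrusted code at startup.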

04

Sub-agents

A task tool delegates broad, uncertain exploration to a fresh-context sub-agent. It burns its own window on grep + ranged reads and returns a concise summary, leaving the main window clean.

05

On-demand memory

Only a small project guide is injected in full each session. Larger docs (CLAUDE.md / AGENTS.md) are listed by name and read on demand — a big doc never bloats the window.

06

Keep only the latest image

Vision eats tokens. After a screenshot the newest image stays in context and older ones are stripped — so the agent can screenshot → fix → screenshot repeatedly without blowing the window.
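A sketch of the stripping pass over OpenAI-format messages, where an image is a content part of type `image_url`. The function name and the placeholder text are hypothetical.

```python
def strip_old_images(messages: list) -> list:
    """Return a copy of `messages` where only the newest image part survives;
    every older image is replaced by a short text placeholder."""
    seen_latest = False
    out = []
    for msg in reversed(messages):          # walk newest to oldest
        content = msg.get("content")
        if isinstance(content, list):
            parts = []
            for part in reversed(content):
                if part.get("type") == "image_url":
                    if seen_latest:
                        parts.append({"type": "text", "text": "[older image removed]"})
                        continue
                    seen_latest = True
                parts.append(part)
            parts.reverse()
            msg = {**msg, "content": parts}  # copy, leave the input untouched
        out.append(msg)
    out.reverse()
    return out
```

Leaving a placeholder rather than deleting the part keeps the history readable: the model can still see that a screenshot was taken at that point.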

Compaction is the one thing you never ask permission for.

Every other irreversible action goes through a gate. Compaction can't — declining it would overflow the window and kill the turn. So it runs automatically, and the messages it summarises away are archived to disk first, in case you want them back.

04

The agent loop, and making it async

At its core every coding agent is one loop. The art is in what wraps it.

per step:
    compact if near the limit        # keep us under the window
    stream the completion            # token-by-token, with retry/backoff on 5xx
    store the assistant message      # reasoning stripped from history
    if there are tool calls:
        run them                     # read-only ones in parallel, dangerous ones serially
        feed the results back        # as tool messages
        loop
    else:
        the turn ends

Two details make the loop feel professional rather than fragile. Parallelism: read-only tool calls from one turn run concurrently in a thread pool, with results re-ordered to match the order of the calls; only mutating calls are serialized. Resilience: connection errors and 5xx responses retry with exponential backoff; 4xx responses are never retried. And while the model cold-loads, a live 🧠 thinking… Ns spinner proves it isn't frozen.
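Both details fit in a few lines each. This is a sketch under stated assumptions: `run_one`, `is_read_only`, and `send` are hypothetical callables supplied by the caller, and running every read-only call before the mutating ones is a simplification chosen here, not necessarily how Maxora orders them.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_tool_calls(calls, run_one, is_read_only, max_workers=8):
    """Read-only calls run concurrently, mutating calls one by one;
    results come back in the same order as `calls`."""
    results = [None] * len(calls)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [(i, pool.submit(run_one, call))
                   for i, call in enumerate(calls) if is_read_only(call)]
        for i, future in futures:
            results[i] = future.result()   # slot i keeps the original position
    for i, call in enumerate(calls):
        if not is_read_only(call):
            results[i] = run_one(call)     # serial, in the order requested
    return results


def with_retry(send, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call send() -> (status, body). Retry connection errors and 5xx only."""
    for attempt in range(max_attempts):
        try:
            status, body = send()
        except ConnectionError:
            status, body = None, None
        if status is not None and status < 500:
            return status, body            # success or 4xx: returned as-is
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)   # 0.5s, 1s, 2s, ...
    raise RuntimeError(f"gave up after {max_attempts} attempts")
```

Injecting `sleep` keeps the backoff testable without real delays.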

▸ async input — the Claude-Code feel

Synchronous REPL

read input → block on the agent → read input

→ You stare at a frozen prompt. Can't queue the next task, can't switch mode mid-run.

Async, background worker

agent runs on a worker thread
main thread keeps a live prompt
→ queue the next task
→ shift+tab mid-run

→ Type while it works; tasks queue and run serially. It feels alive.

The trick: there is exactly one input channel.

The agent runs on a background worker thread while the main thread keeps a persistent prompt alive. New tasks queue and process serially. The subtle part is the permission gate: it can no longer read the keyboard itself, because the single live prompt already owns it. So a y/n/a request routes through that one channel — the prompt answers the gate. Conversation-mutating commands (/reset, /compact, /model) are blocked while the worker is busy, and the synchronous loop stays as an automatic fallback for non-interactive input. The lesson generalises: one input channel, everything negotiates through it.
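The single-channel idea can be sketched with two queues. `InputChannel` and its method names are hypothetical, and the real prompt loop (rendering, key bindings, command blocking) is omitted; the point is only that each typed line either answers a waiting gate or becomes a queued task.

```python
import queue
import threading


class InputChannel:
    """Every line typed at the one live prompt passes through feed()."""

    def __init__(self):
        self.tasks = queue.Queue()      # tasks waiting for the worker
        self._lock = threading.Lock()
        self._pending = None            # (question, reply queue) of a waiting gate

    def ask(self, question: str) -> str:
        """Called on the worker thread; blocks until the prompt answers."""
        reply = queue.Queue(maxsize=1)
        with self._lock:
            self._pending = (question, reply)
        return reply.get()

    def gate_waiting(self) -> bool:
        with self._lock:
            return self._pending is not None

    def feed(self, line: str) -> str:
        """Called on the main thread for each line the prompt reads."""
        with self._lock:
            pending, self._pending = self._pending, None
        if pending is not None:
            pending[1].put(line.strip().lower())   # the prompt answers the gate
            return "answered"
        self.tasks.put(line)                       # otherwise it is a new task
        return "queued"
```

The worker never touches the keyboard: its permission gate calls `ask()` and sleeps on the reply queue, while the main thread stays the only reader of input.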

05

How it's structured

The system separates into a handful of concerns, each isolated so it can be reasoned about — and swapped — on its own. The shape, not the source:

Concern             What it owns
Settings & state    One source of truth for configuration and shared runtime state.
Tool registry       Each tool's schema, implementation, and danger flag, bound together in one place.
Model client        Streaming, reasoning split, retry/backoff, model selection.
Context manager     Token estimation and compaction on safe boundaries.
Agent core          System-prompt assembly, parallel tool execution, the loop itself.
REPL & input        The command surface, the async worker, and a single input channel.
Permission layer    The y/n/a gate, diff preview, and deny-rule enforcement.
Delegation          Fresh-context sub-agents that isolate their own token usage.
Project awareness   A static repo map and an auto-read per-project guide.
Extensibility       Shell hooks and external tool servers (MCP).
Plumbing            Sessions, git ops, editor integration, syntax/test checks, terminal UI.

Tools are one declaration. A single decorator binds a tool's schema, its implementation, and whether it's dangerous into one live registry — adding a tool is decorating one function, nothing else to keep in sync. Because the registry is live, capabilities added at runtime (a sub-agent, an external tool server) appear automatically. The working set:

list_dir · read_file · glob · grep · web_fetch · web_search · todo_write · task · screenshot · write_file ⚠ · patch_file ⚠ · multi_edit ⚠ · run_command ⚠ · download_file ⚠

Read-only tools run with no confirmation; the five ⚠ tools are marked dangerous=True and pass through the permission gate. multi_edit applies several patches to one file atomically; patch_file refuses a non-unique match; run_command can run in the background; screenshot renders local HTML headless and feeds the PNG back so the model can review its own UI.

06

What keeps a small model honest

A 27B model is capable but it over-claims — it will say “done” without calling a tool, or rewrite a whole file and stall at 80% of the window. The harness, not the weights, is what makes it reliable. Layered defences:

01

Read before edit

Patching or overwriting an existing file is rejected unless it was read this session. Creating a new file needs no read.
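A sketch of that guard as a small session-scoped tracker. `ReadTracker` and its methods are hypothetical names; a real tool layer would call `mark_read` from read_file and `check_write` from each editing tool.

```python
import os


class ReadTracker:
    """Remembers which files were read in this session."""

    def __init__(self):
        self._read = set()

    def mark_read(self, path: str) -> None:
        self._read.add(os.path.realpath(path))

    def check_write(self, path: str):
        """Return an error string if the edit must be refused, else None."""
        real = os.path.realpath(path)
        if os.path.exists(real) and real not in self._read:
            return f"refused: read {path} before editing it"
        return None   # new file, or an existing file already read
```

Resolving with `realpath` means a symlink or a `./` prefix cannot be used to sidestep the check.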

02

Permission modes

default / acceptEdits / plan / yolo, switched live with shift+tab. plan blocks every mutation so the agent designs without acting.

03

Deny rules

Path globs (.env, *.pem, **/.git/**) are enforced at the tool layer — they block even under yolo. A security rule, not a prompt.
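A sketch of tool-layer enforcement for the three globs named above. The matching semantics are an assumption made for this example: patterns without a slash match the file name in any directory, `**/` matches zero or more directories, and a trailing `/**` matches everything below. Python's `fnmatch` does not treat `**` this way, so the sketch translates globs by hand.

```python
import posixpath
import re

DENY_RULES = [".env", "*.pem", "**/.git/**"]


def glob_to_regex(pattern: str):
    """Translate a small glob dialect (*, ?, **/, trailing /**) to a regex."""
    out, i = [], 0
    while i < len(pattern):
        if pattern.startswith("**/", i):
            out.append("(?:.*/)?")
            i += 3
        elif pattern.startswith("/**", i) and i + 3 == len(pattern):
            out.append("(?:/.*)?")
            i += 3
        elif pattern[i] == "*":
            out.append("[^/]*")
            i += 1
        elif pattern[i] == "?":
            out.append("[^/]")
            i += 1
        else:
            out.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(out))


def is_denied(path: str, rules=DENY_RULES) -> bool:
    """True if any deny rule matches; checked before a tool touches the path."""
    rel = posixpath.normpath(path.replace("\\", "/"))   # collapses ./ and ../
    name = posixpath.basename(rel)
    for rule in rules:
        target = rel if "/" in rule else name
        if glob_to_regex(rule).fullmatch(target):
            return True
    return False
```

The check lives in the tool layer and takes no permission mode as input, which is what lets it hold even under yolo.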

04

Auto-verify

After edits, one extra turn makes the model grep/read-check that every change it claimed actually exists — catching over-claimed completion.

05

Auto-check

Changed files are syntax-checked (or run through your test/lint command); on failure the error is fed back and the agent fixes it, up to a cap.
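For Python files the syntax check and the capped fix loop can be sketched with the built-in `compile`. `ask_model_to_fix` is a hypothetical callback standing in for one more agent turn, and `max_rounds=3` is an arbitrary cap chosen for the example.

```python
def syntax_error(source: str, filename: str = "<edit>"):
    """Return a one-line error description, or None if the source parses."""
    try:
        compile(source, filename, "exec")   # parses only; nothing is executed
    except SyntaxError as exc:
        return f"{filename}:{exc.lineno}: {exc.msg}"
    return None


def check_with_retries(source: str, ask_model_to_fix, max_rounds: int = 3):
    """Feed each syntax error back to the model, at most `max_rounds` times.
    Returns (final_source, ok)."""
    for _ in range(max_rounds):
        error = syntax_error(source)
        if error is None:
            return source, True
        source = ask_model_to_fix(source, error)
    return source, syntax_error(source) is None
```

The cap matters as much as the check: without it, a model that keeps producing the same broken edit would loop until the window is gone.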

06

Large-file guard

Whole-rewriting a big existing file is refused (it thrashes the window); the model is steered to multi_edit/patch_file instead.

The “said it, didn't do it” nudge.

The classic small-model failure: the model writes “I'll add that file now” and then ends the turn without calling a tool. A dedicated nudge detects stated-intent-without-tool-call and prompts it to actually do the thing — capped so it can't loop forever. Every executed tool is also appended to an audit log, and every edit pushes its prior content to an undo stack.
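The detection side of the nudge can be sketched as a cheap heuristic. The phrase list, the `should_nudge` name, and the cap of 2 are assumptions for illustration; the actual detector may use different signals.

```python
import re

# Hypothetical phrase list for "stated intent"; tune to the model's habits.
INTENT = re.compile(r"\b(?:i['’]ll|i will|i['’]m going to|let me)\b", re.IGNORECASE)


def should_nudge(message: dict, nudges_used: int, cap: int = 2) -> bool:
    """True when the assistant announced an action, called no tool,
    and the per-turn nudge budget is not yet spent."""
    if message.get("tool_calls"):
        return False
    if nudges_used >= cap:
        return False
    return bool(INTENT.search(message.get("content") or ""))
```

When it returns True, the loop would append a short user-role reminder to call the tool now and increment `nudges_used`, instead of ending the turn.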

07

What makes it feel Claude-Code-level

“Same level” isn't the model — it's the dozen affordances around it that you stop noticing because they just work.

08

The point

The model is the smaller half. The harness is the product.

A self-hosted small model will never out-reason a frontier model on raw IQ. It doesn't have to. Wrapped in a harness that manages its context, parallelises its tools, verifies its own work, and keeps your source on your own machine, it becomes something a frontier API can't be: private, owned, and free to run all day. That's the bar I built Maxora toward — Claude-Code-level ergonomics on hardware you control.

Companion to my Harness Engineering notes — this is the same philosophy applied to a local coding agent rather than a growth agent.