The unglamorous part of making agents useful

The funniest part of today is that the most useful AI work did not look like a demo.

It looked like a repo refusing to accept a pull request because upstream had moved. It looked like a broken link in a tutorial that only matters when someone starts from the first page. It looked like .DS_Store noise making a clean checkout look dirty. It looked like a Feishu reply going to the wrong place, which is exactly the kind of tiny thing that makes an assistant feel haunted instead of helpful.

This is the side of personal AI systems I keep coming back to: the magic part is cheap to describe; the trust part is mostly plumbing.

Today’s concrete progress was spread across a few proof-chain repos:

In hermes-agent, I kept pushing the Feishu threaded-reply fix forward in PR #17895, “fix(feishu): preserve threaded replies.” The PR had drifted into conflict after upstream changes, so the work was not “write new clever code.” It was: merge carefully, preserve the thread routing behavior, keep upstream’s newer sender/bot logging, and re-run the focused gateway tests. Result: scripts/run_tests.sh tests/gateway/test_feishu.py passed with 197 tests.
In digital-twin, the public blueprint got two small but real repairs: PR #7 refreshed proof-chain repository metrics, and PR #8 fixed a playground index link that pointed one directory too high. Not glamorous. Very searchable-by-future-me.
In claude-code-sourcemap, PR #1 added sitemap plus OpenGraph/Twitter discovery metadata for the VitePress tutorial site, and PR #2 fixed a learning guide link that pointed at a nonexistent anthropics/claude-code-sourcemap repo. The old URL returned 404; the fix points to the real public repo and says clearly that it is an unofficial research repo.
In knowledge-harness, I tightened query --output-name validation so path traversal, subdirectories, absolute paths, and non-Markdown names fail before a run directory is created. The change was small but security-shaped. Both checks passed afterwards: python3 -m unittest discover with 8 tests, and python3 -m compileall src.
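A minimal sketch of what that kind of output-name check can look like. This is not the knowledge-harness code: the function name validate_output_name and the exact rules are assumptions, chosen to cover the four failure cases listed above (traversal, subdirectories, absolute paths, non-Markdown names).

```python
from pathlib import PurePosixPath, PureWindowsPath


def validate_output_name(name: str) -> str:
    """Return name if it is a bare Markdown filename; raise ValueError otherwise.

    Hypothetical helper for illustration, not the knowledge-harness code.
    """
    if not name or name in {".", ".."}:
        raise ValueError("output name must be a real filename")
    # Reject both separator styles, so subdirectories, traversal ("../x.md"),
    # and absolute paths ("/tmp/x.md") all fail on every platform.
    if "/" in name or "\\" in name:
        raise ValueError("output name must not contain path separators")
    # Catches drive-qualified names such as "C:notes.md" that have no separator.
    if PureWindowsPath(name).drive:
        raise ValueError("output name must not be drive-qualified")
    # A name like ".md" has no suffix according to pathlib, so it is rejected too.
    if PurePosixPath(name).suffix.lower() != ".md":
        raise ValueError("output name must end in .md")
    return name
```

The ordering point is that a function like this runs before anything touches the filesystem, so a bad name never leaves a half-created run directory behind.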

The pattern here is pretty consistent: agents become useful when their surrounding systems keep receipts.

A personal AI stack is not just “chat with my files.” It needs a way to know what changed, where the evidence is, whether tests passed, and which pieces should not be touched because they contain someone else’s dirty work. That last part sounds boring until an autonomous agent mixes unrelated website edits into a tiny blog commit. Then “boring” becomes “please never do that again.”

The Feishu thread bug is a good example. If an assistant replies in the wrong place, the model might still be smart, but the product feels unreliable. The fix is not a bigger model. It is respecting platform semantics: try the root message, then the native thread, then a fallback reply target, and only then give up. That kind of detail is where an agent crosses from impressive screenshot to something I can actually leave running.
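The fallback order can be sketched as a small pure function. Everything here is hypothetical: the IncomingMessage fields, the target labels, and the choice of the message's own ID as the fallback are assumptions for illustration, not the hermes-agent code or the Feishu API.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class IncomingMessage:
    # Hypothetical fields; a real platform payload will be shaped differently.
    message_id: str
    root_id: Optional[str] = None
    thread_id: Optional[str] = None


def resolve_reply_target(msg: IncomingMessage) -> Optional[Tuple[str, str]]:
    """Pick where a reply should go, most specific target first."""
    if msg.root_id:
        return ("root", msg.root_id)          # reply under the root message
    if msg.thread_id:
        return ("thread", msg.thread_id)      # reply in the native thread
    if msg.message_id:
        return ("message", msg.message_id)    # fallback: reply to the message itself
    return None                               # nothing usable, so give up explicitly
```

Returning None instead of guessing a destination keeps the "give up" case visible to the caller, which can then log it rather than post somewhere surprising.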

I also upgraded this nightly workflow itself today. Instead of only drafting an X post from my agent log, it now writes the public builder note first, validates the personal site, commits only the blog file, and then drafts the X post with the blog URL. The order matters. If the public artifact does not exist yet, the social post is just vibes with a timestamp.
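A rough sketch of why the order matters, assuming each stage is a function that reports success or failure. The step names and the run_in_order helper are hypothetical stand-ins, not the actual nightly workflow.

```python
from typing import Callable, List, Tuple

Step = Tuple[str, Callable[[], bool]]


def run_in_order(steps: List[Step]) -> List[str]:
    """Run steps in sequence and stop at the first one that fails.

    Returns the names of the steps that completed, so a later step
    (such as drafting the social post) never runs on a missing artifact.
    """
    completed: List[str] = []
    for name, action in steps:
        if not action():
            break
        completed.append(name)
    return completed


if __name__ == "__main__":
    # Placeholder actions; a real workflow would call out to its own tools.
    steps: List[Step] = [
        ("write_builder_note", lambda: True),
        ("validate_site", lambda: True),
        ("commit_blog_file_only", lambda: True),
        ("draft_x_post_with_url", lambda: True),
    ]
    print(run_in_order(steps))
```

If validation fails, the list of completed steps ends before the commit and the X draft, which is the whole point of putting the public artifact first.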

What still feels broken: there are too many almost-clean repos. One unrelated local change, one untracked generated file, one branch behind remote, and the safest action becomes “step around this carefully.” That is annoying, but it is also a useful constraint. Real agent workflows need to be paranoid about boundaries.
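One way to make that paranoia mechanical is to read git status in its porcelain format and list the reasons a repo is not safe to touch. This is a sketch under assumptions: the function names are hypothetical, and the parser only looks at the three conditions named above (local changes, untracked files, branch behind its upstream).

```python
import subprocess
from typing import List


def read_status(repo_path: str) -> str:
    """Return `git status --porcelain=v1 --branch` output for a repo."""
    result = subprocess.run(
        ["git", "-C", repo_path, "status", "--porcelain=v1", "--branch"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def reasons_to_step_around(status_text: str) -> List[str]:
    """Turn porcelain status text into human-readable reasons to leave a repo alone."""
    behind = untracked = changed = False
    for line in status_text.splitlines():
        if line.startswith("## "):
            # Branch header, e.g. "## main...origin/main [behind 2]".
            bracket = line.find("[")
            behind = bracket != -1 and "behind" in line[bracket:]
        elif line.startswith("?? "):
            untracked = True
        elif line.strip():
            changed = True
    reasons: List[str] = []
    if behind:
        reasons.append("branch is behind its upstream")
    if changed:
        reasons.append("tracked files have local changes")
    if untracked:
        reasons.append("untracked files are present")
    return reasons
```

An empty list means the checkout is clean and current as far as these checks go; anything else is a reason for the agent to stop and report instead of committing.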

Tiny next step: make the proof-chain site surface these daily notes as an inspectable stream, not just a pile of posts. The story I want is simple: fewer claims, more receipts.