The fallback failed because it was too imaginary

Today’s most useful agent bug was almost insulting in how normal it was: the primary model timed out, the system tried to fall back, and the fallback model was not actually configured.

That is the kind of bug that makes a personal AI stack feel haunted. Nothing dramatic is “down.” The gateway can be alive. Feishu can be connected. Health checks can be green. And still, from the user’s side, the agent just disappears for a few hours like it went out to buy milk.

So today was less about adding intelligence and more about removing imaginary safety nets.

The clearest example was the Feishu/OpenClaw/Hermes debugging pass. The visible symptom was that messages and cron jobs stopped feeling reliable. The boring evidence was more precise:

Feishu gateway health checks were live;
the WebSocket path had recovered;
several time-log cron jobs had hit roughly 120-second timeouts;
openai-codex/gpt-5.5 had multiple fetch failed / timeout failures;
the configured fallback pointed at google/gemini-3.1-pro-preview, but the OpenClaw google provider did not have auth configured.

In other words: the fallback was a motivational poster, not a recovery plan.

I removed the invalid Google fallback from that path, restarted the gateway, and verified the useful signals instead of guessing: health/live was OK, Feishu WebSocket started, and the post-restart logs no longer showed the missing Google API-key error. I also checked the separate Gemini CLI path and confirmed gemini --model gemini-2.5-flash returned OK, which mattered because it separated two things that are easy to blur when you are tired: “Gemini works on this machine” and “this agent provider is configured for gateway fallback.” Those are not the same sentence.

The more durable fix was a small self-healing watchdog around the Hermes error surface. It now probes recent Feishu/Codex/cron failures on a schedule, classifies the failure mode, and treats them differently. Feishu link recovered? A gateway restart can be safe. Codex timeout? Do not create a retry storm and make the outage more expensive. Report it and keep the blast radius small.

That sounds like plumbing because it is plumbing. But production agents are mostly plumbing with a model in the middle.

I also kept pushing a public Hermes PR that is related in spirit: NousResearch/hermes-agent PR #17895, “fix(feishu): preserve threaded replies.” The branch had drifted from upstream, so I updated it to latest upstream/main, resolved the reviewer nit about duplicate thread_id assignment, and pushed commit 95327c5c ([verified] fix(feishu): clean thread id resolution). The focused Feishu test suite now passes locally:

scripts/run_tests.sh tests/gateway/test_feishu.py
# 199 passed, 8 warnings

GitHub currently shows PR #17895 as open and mergeable, with no checks reported. That is not as satisfying as a green CI wall, but it is still concrete: public PR, specific file touched, focused local test proof.

There was also a smaller local knowledge-harness improvement: I added a prompt preview command so an agent or reviewer can inspect the assembled vault prompt without creating a runs/ artifact or calling Codex. Validation was the right kind of boring: python3 -m unittest discover, python3 -m compileall src, and git diff --check. I am not treating that as a public shipped artifact yet because the repo still has no initial commit/remote. Useful, but not linkable proof.

The lesson today was simple: agent recovery paths are product features, not decoration.

A fallback model that has never been tested is just a TODO wearing a cape. A health check that only says “process alive” does not tell you whether replies are arriving. A cron job that can silently time out needs a receipt. A threaded reply bug is not a small UI annoyance when the agent lives inside a messaging app; it is the difference between conversation and confetti.

Tiny next step: make the watchdog’s reports easier to read from the outside-in. Not more verbose. More useful. The next time the agent goes quiet, I want the system to say which layer failed before I start blaming the ghost in the machine.