2026-05-09

The scorecard told me to stop delegating

A builder note on turning Agent Scorecard from a demo report into a small autonomy gate for real agent work.

The most useful output from today's agent work was not a green score.

It was a red one.

I spent a chunk of the day pushing agent-scorecard from "nice public evaluation repo" toward something I would actually use before giving agents more tokens, permissions, or unsupervised work. That means less leaderboard energy and more seatbelt energy.

The repo now has a pull request open: stevenchouai/agent-scorecard PR #1, "feat: add Agent Scorecard control-plane proof pack". It is not merged yet, but the branch has become a real proof pack instead of a single toy command.

What changed:

  • sanitized Hermes session import, so a real-ish session can become scorecard JSONL without leaking private details;
  • a privacy audit gate that looks for local paths, secret-shaped strings, and Feishu/OpenPlatform-style IDs before a trace becomes public proof (sketched just after this list);
  • batch portfolio summaries across multiple traces;
  • Markdown and JSON outputs for the same report;
  • threshold gates like --fail-under-average and --fail-under-min;
  • an autonomy_decision field that turns a pile of scores into a plain recommendation.
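
Here is that sketch: a minimal version of the privacy gate, with regex patterns and a JSONL record shape I made up for illustration. The PR's real pattern list and schema will differ; this is just the idea.

```python
import re

# Illustrative patterns only -- the real gate in the PR has its own list.
# Each one targets a leak class named above: local paths, secret-shaped
# strings, and Feishu/OpenPlatform-style IDs.
LEAK_PATTERNS = {
    "local_path": re.compile(r"/Users/|/home/|C:\\"),
    "secret_like": re.compile(r"(api[_-]?key|token|secret)\s*[:=]\s*\S+", re.IGNORECASE),
    "feishu_id": re.compile(r"\b(?:cli|ou|oc)_[A-Za-z0-9]+"),
}

def audit_record(record: dict) -> list[str]:
    """Return the leak classes found anywhere in one trace record."""
    text = str(record)
    return [name for name, pattern in LEAK_PATTERNS.items() if pattern.search(text)]

# Records in a made-up JSONL shape; the real schema is the PR's business.
clean = {"role": "assistant", "content": "Summarizing the session now."}
leaky = {"role": "tool", "content": "wrote /Users/me/notes.md, api_key=sk-123"}

print(audit_record(clean))  # []
print(audit_record(leaky))  # ['local_path', 'secret_like']
```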

The embarrassing part is that the sample portfolio average looks fine at first glance: 81.7/100 across three traces.

Then the control plane says: Stop delegation until fixed.
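
Mechanically, that verdict is just two comparisons. A minimal sketch of the gate, where only the two flag names and the "Stop delegation until fixed" wording come from the PR; the default thresholds and the function itself are my guesses:

```python
def autonomy_decision(scores: list[float],
                      fail_under_average: float = 70.0,
                      fail_under_min: float = 60.0) -> str:
    """Gate a portfolio the way --fail-under-average / --fail-under-min do."""
    average = sum(scores) / len(scores)
    if average < fail_under_average or min(scores) < fail_under_min:
        return "Stop delegation until fixed"
    return "OK to keep delegating"

# Two near-perfect traces drag the average up to ~81.7, clearing the
# average gate; the min gate still trips on the 45.
print(autonomy_decision([100.0, 100.0, 45.0]))  # Stop delegation until fixed
```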

Why? One trace scored 45/100. For the portfolio average to still land at 81.7, the other two traces had to be near-perfect, which is how a mean buries its worst case. The failing signal was simple: the assistant promised action, but no tool call followed. That is exactly the kind of agent failure that can look harmless in a chat transcript and still waste real time. The model sounds busy. The user relaxes. Nothing actually happened.

That is the footgun I want this project to catch.
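
Catching it does not require anything clever, either. A naive sketch of the signal, with a message schema I am assuming rather than quoting from the repo:

```python
# Phrases that signal the assistant is promising action (illustrative).
PROMISE_WORDS = ("i'll", "i will", "let me", "running", "updating")

def promised_but_idle(trace: list[dict]) -> bool:
    """Flag an assistant turn that promises action with no tool call after it.

    Assumes a message list where tool activity shows up as role "tool";
    the scorecard's real trace schema may differ.
    """
    for i, msg in enumerate(trace):
        if msg["role"] != "assistant":
            continue
        promises = any(w in msg["content"].lower() for w in PROMISE_WORDS)
        followed_up = any(m["role"] == "tool" for m in trace[i + 1:])
        if promises and not followed_up:
            return True
    return False

trace = [
    {"role": "user", "content": "Please update the README."},
    {"role": "assistant", "content": "I'll update the README now."},
    # ...and then nothing: no tool call ever happens.
]
print(promised_but_idle(trace))  # True
```

The repo's real heuristic is presumably stricter than a keyword list; the point is that the check is cheap relative to the failure it catches.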

The PR has eight commits so far, including:

  • a5829b0 for importing sanitized Hermes sessions;
  • ff1b512 for the trace privacy audit gate;
  • 0721f22 for the runnable CI proof gate;
  • 6d4186d for portfolio threshold gates;
  • c222c99 for the portfolio autonomy decision.

The validation moved with the feature instead of lagging behind it. The latest local run in the log showed 27 unit tests passing, plus CLI smoke checks for the Markdown summary, JSON summary, privacy audit, and pass/fail gates. GitHub Actions also reported the PR test check as passing.
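
Those smoke checks are a pattern worth stealing: run the CLI, demand a sane exit code, and make sure the JSON at least parses. A sketch of the shape, where the command name, subcommand, and --format flag are placeholders; only the two gate flags appear in the PR:

```python
import json
import subprocess

def run(cmd: list[str]) -> subprocess.CompletedProcess:
    """Run one smoke check, failing loudly on a nonzero exit code."""
    return subprocess.run(cmd, capture_output=True, text=True, check=True)

# Hypothetical invocations -- substitute the repo's actual CLI here.
run(["agent-scorecard", "report", "traces/", "--format", "markdown"])
summary = run(["agent-scorecard", "report", "traces/", "--format", "json"])
json.loads(summary.stdout)  # the JSON summary must at least parse

# Gate flags from the PR: a weak portfolio should flip the exit code,
# which is what lets CI fail the build instead of a human noticing.
gate = subprocess.run(
    ["agent-scorecard", "report", "traces/",
     "--fail-under-average", "70", "--fail-under-min", "60"]
)
assert gate.returncode in (0, 1)
```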

This is the part I care about strategically: an agent workflow should not only produce artifacts. It should leave behind enough evidence for the next decision.

Do I let this agent run longer next time? Do I give it a broader repo? Do I trust it with a public post, a PR, a customer-facing reply? Or do I keep it supervised because one bad trace says it still "performs" work instead of doing work?

A lot of agent demos skip that layer. They show the happy path, then ask for trust. I am trying to build the annoying middle layer that says, "Wait. Show me the traces first. Show me what failed. Show me whether the worst case is still acceptable."

There was other useful work today: the personal site got a quality workflow so future PRs can run lint, proof-link validation, and static builds; the /now page moved closer to a proof-ledger narrative; the GitHub profile got a more explicit open-source contribution ledger. Those matter for the public surface.

But the sharp lesson was in Agent Scorecard.

An average score can lie politely. The weakest trace is usually more honest.

Tiny next step: make the PR easier for a stranger to review. The commands run, the JSON exists, and the CI passes. Now the README needs to make the control-plane idea obvious in thirty seconds: this is not "agent grading" for vibes. It is a small brake pedal for deciding when autonomy should stop.