Stack Trace Sommelier

Finding

Hermes debugging loses time when tracebacks are skimmed for the loudest error line instead of being parsed systematically from failure boundary to root cause.

Current

A real Hermes installation can hit exceptions across tools, MCP calls, provider routing, cron jobs, gateway delivery, browser automation, file operations, and custom scripts. The weak point is usually not lack of logs; it is inconsistent traceback reading. Without a repeatable process, agents may jump to a guessed fix, patch the symptom, miss the first meaningful exception, or overlook whether the failure came from Hermes, a tool wrapper, a dependency, credentials, user input, or downstream service behavior.

Suggested

Add a traceback-reading checklist to the debugging skill. Exact change: patch the systematic-debugging skill with a “Stack trace pass” section requiring the agent to identify the first failing frame in owned code, the external boundary, the exception type, the exact failing input shape, and the smallest reproducible command before proposing a fix.
Create a public-safe error triage habit for cron and automation failures. Exact change: update the Optimizer Agent cron review prompt with: “When a job reports an exception, summarize only the error class, failing subsystem, likely boundary, and recommended verification step; never include raw logs, secrets, tokens, private chat content, or full stack traces in public-facing output.”
Add a verification step before applying traceback-driven fixes. Exact change: add a line to SOUL.md or the debugging runbook: “Before editing code or config from a traceback, state the root-cause hypothesis, the frame or subsystem that supports it, and the command, test, or tool call that will verify the fix.”
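
The mechanical part of the "Stack trace pass" can be sketched in Python with the standard traceback module. This is a minimal illustration, not part of Hermes: the names TracePass, read_frames, stack_trace_pass, and owned_markers are hypothetical, and "owned code" is assumed to be recognizable by a substring of the file path. The failing input shape and the smallest reproducible command cannot be read off the frames and still have to be stated by the agent.

```python
import traceback
from dataclasses import dataclass
from typing import Iterable, Optional, Sequence


@dataclass
class TracePass:
    """Result of one stack trace pass (hypothetical structure)."""
    exception_type: str
    owned_frame: Optional[str]     # deepest frame located in owned code
    boundary_frame: Optional[str]  # next frame after it, outside owned code


def _describe(frame: traceback.FrameSummary) -> str:
    return f"{frame.filename}:{frame.lineno} in {frame.name}"


def read_frames(
    exception_type: str,
    frames: Iterable[traceback.FrameSummary],
    owned_markers: Sequence[str],
) -> TracePass:
    """Locate the deepest owned frame and the frame just past it.

    Frames are expected oldest call first, as traceback.extract_tb returns them.
    Assumption: a frame is "owned" if its filename contains any marker.
    """
    frames = list(frames)
    owned_index = None
    for index, frame in enumerate(frames):
        if any(marker in frame.filename for marker in owned_markers):
            owned_index = index  # keep overwriting to end on the deepest match
    if owned_index is None:
        return TracePass(exception_type, None, None)
    boundary = None
    if owned_index + 1 < len(frames):
        boundary = _describe(frames[owned_index + 1])
    return TracePass(exception_type, _describe(frames[owned_index]), boundary)


def stack_trace_pass(exc: BaseException, owned_markers: Sequence[str]) -> TracePass:
    """Run the pass on a caught exception."""
    return read_frames(
        type(exc).__name__,
        traceback.extract_tb(exc.__traceback__),
        owned_markers,
    )
```

Inside an except block, calling stack_trace_pass(exc, ("/srv/hermes/",)) with whatever path fragment identifies owned code in a given installation (the path here is an invented example) yields the three facts the checklist asks for first. A boundary_frame of None means the exception was raised directly in owned code.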

Impact

This makes debugging faster because exceptions become structured evidence instead of noise. It reduces risky speculative edits, especially in Hermes installations where failures can cross model providers, tool wrappers, cron sessions, gateway delivery, and local scripts. It also improves public documentation quality: the operational lesson can be shared without exposing raw traces or sensitive runtime context.

Effort

Small: the change is a debugging checklist, a cron prompt hygiene line, and a verification habit. No new infrastructure is required, but agents must consistently pause long enough to read the traceback in order, from failure boundary to root cause.

Public page note

Safe public content includes the maturity principle, generic traceback-reading workflow, error triage habits, and the benefit of root-cause verification. Internal-only content includes raw stack traces, logs, private prompts, chat excerpts, credentials, tokens, environment values, customer data, local filesystem paths, and exact exception payloads from live systems.
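
One way to enforce the public/internal split in automation output is to build the public summary from an allowlist of fields rather than by redacting a raw trace. The sketch below is illustrative only: PublicErrorSummary and summarize_for_public are hypothetical names, and it assumes the subsystem, boundary, and verification step are written by the reviewer and are themselves free of sensitive values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PublicErrorSummary:
    """The four fields the cron review prompt allows in public-facing output."""
    error_class: str
    subsystem: str
    boundary: str
    verification_step: str

    def render(self) -> str:
        return (
            f"{self.error_class} in {self.subsystem}; "
            f"likely boundary: {self.boundary}; "
            f"verify with: {self.verification_step}"
        )


def summarize_for_public(
    exc: BaseException,
    subsystem: str,
    boundary: str,
    verification_step: str,
) -> PublicErrorSummary:
    # Only the class name is taken from the exception. The message and the
    # traceback are never read, because they can carry paths, tokens, or
    # payloads from the live system.
    return PublicErrorSummary(
        error_class=type(exc).__name__,
        subsystem=subsystem,
        boundary=boundary,
        verification_step=verification_step,
    )
```

Because the summary never touches the exception message, a secret embedded in that message cannot leak through render(). The trade-off is that the three descriptive fields rely on the reviewer's judgment and are not checked by the code.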