Five-Model Flight

Finding

Hermes model selection is weak when a single impressive answer is treated as proof of operational fit, instead of the candidate model being compared with alternatives on a small, representative screening set.

Current

A real Hermes installation may have access to multiple providers, fallback models, local models, specialist models, and default routing rules. The weak point is usually the selection habit: the first model that sounds fluent can become the default before it has been tested on the installation’s actual work patterns, such as tool use, Danish summaries, coding, research synthesis, cron prompts, memory discipline, and refusal behavior. That creates hidden risk because a model can look strong in casual chat while failing on the specific workloads Hermes must run reliably.

Suggested

Create a five-model screening runbook before changing default model routing. Exact change: add docs/runbooks/model-screening.md with a rule that every serious model change must test at least five distinct LLMs against the same small case set before recommending a default, fallback, or specialist role.
Define a representative Hermes test pack instead of using ad-hoc prompts. Exact change: add a “Model screening cases” section to the operator runbook or eval notes with five stable tasks: tool-routing decision, public-safe content rewrite, short Danish TTS summary, code/config review, and multi-source research synthesis.
Record the model decision as a role assignment, not a popularity ranking. Exact change: update the model selection dashboard copy or SOUL.md with: “For each screened model, record best role, failure mode, cost/latency impression, and whether it should be primary, fallback, specialist, or rejected.”
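The three changes above can be sketched as a small data structure plus one completeness check. This is a minimal illustration, not part of Hermes: the names ScreeningRecord, Role, MIN_MODELS, and screen_problems are hypothetical, and only the five case labels and the four outcomes (primary, fallback, specialist, rejected) come from the suggestions themselves.

```python
from dataclasses import dataclass
from enum import Enum

# The five stable tasks from the suggested test pack (labels only; the
# actual prompts would live in the runbook or eval notes).
SCREENING_CASES = (
    "tool-routing decision",
    "public-safe content rewrite",
    "short Danish TTS summary",
    "code/config review",
    "multi-source research synthesis",
)

# Runbook rule: screen at least five distinct models per serious change.
MIN_MODELS = 5


class Role(Enum):
    """Outcome of a screen, recorded per model (hypothetical enum)."""
    PRIMARY = "primary"
    FALLBACK = "fallback"
    SPECIALIST = "specialist"
    REJECTED = "rejected"


@dataclass(frozen=True)
class ScreeningRecord:
    """One decision record per screened model (hypothetical shape)."""
    model: str
    best_role: str        # e.g. "coding" or "public writing"
    failure_mode: str     # the main weakness observed in the screen
    cost_latency: str     # an impression, not a measured benchmark
    decision: Role
    cases_run: frozenset  # labels of the cases this model was run on


def screen_problems(records):
    """Return reasons the screen is incomplete; an empty list means it passes."""
    problems = []
    distinct = {r.model for r in records}
    if len(distinct) < MIN_MODELS:
        problems.append(
            f"only {len(distinct)} distinct models screened; need {MIN_MODELS}"
        )
    for r in records:
        missing = [c for c in SCREENING_CASES if c not in r.cases_run]
        if missing:
            problems.append(f"{r.model}: missing cases: {', '.join(missing)}")
    return problems
```

A routing change would be recommended only when screen_problems returns an empty list. Keeping the decision as a Role rather than a score reflects the intent that a model is assigned a job, not ranked by popularity.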

Impact

Five-Model Flight turns model choice into an operational evaluation habit rather than a vibe check. It reduces the chance of overcommitting to a fluent but brittle model and makes fallback routing more defensible when providers, pricing, or quality shift. The installation gains a repeatable way to decide which model should handle planning, coding, research, public writing, or low-cost routine tasks.

Effort

Small. The main work is a short runbook, a fixed five-case prompt set, and a lightweight decision record. No new infrastructure is required unless the installation wants to automate the screening later.

Public page note

Safe public content includes the maturity principle, generic screening cases, role-based model selection guidance, and the benefit of testing several LLMs before changing defaults. Internal-only content includes provider keys, private eval prompts containing sensitive context, raw model outputs from private tasks, exact cost dashboards, internal routing configuration, logs, credentials, and any customer-specific benchmark data.