Capability-tiered models and benchmarks¶

heal is built to run on whatever model you have — from frontier APIs down to an 8B model on a laptop. It treats model capability as a first-class axis, probed rather than assumed.

The output-mode ladder¶

Structured output can be transported three ways. heal resolves the best one per backend, with a universal floor:

flowchart LR
    A[tool calling<br/>reliable?] -->|yes| TOOL[tool output]
    A -->|no| B[native JSON<br/>schema?]
    B -->|yes| NATIVE[native output]
    B -->|no| PROMPTED[prompted JSON<br/>universal floor]

Crucially, verification lives in output validators, which work in every mode — so even a prompted-JSON-only model heals with the same live checks. Exploration tools are attached only on backends with reliable tool calling; everything else gets richer pre-curated evidence instead.

Probe, don't assume¶

heal doctor fires tiny calls at each configured endpoint to measure tool calling, native JSON, prompted JSON, and vision, then resolves the capability profile. This caught real backend quirks:

MiniMax mishandles forced tool_choice — the same triage task ran in 14s or 311s depending on a profile flag, and tool-mode validator loops failed outright while prompted mode passed in 16s. heal ships a built-in profile that resolves MiniMax to prompted output.
vLLM rejects strict tool schemas — heal strips them automatically.
Small models differ in kind of failure: transport (no tool endpoint), quality (prompted misclassification), or availability — which is exactly why per-model probing beats a global setting.

What works where¶

From the experiment matrix (experiments/minimax-probe/FINDINGS.md):

Backend	Locator heal	Notes
gpt-4.1-nano	✅ ~4s	cheapest tier works well
MiniMax-M2.5	✅ ~15s	prompted mode; `tool_choice` quirk auto-handled
qwen3-14b	✅ slow	no tool endpoints — prompted floor
llama-3.1-8b	⚠️	retry loop converges; triage quality limited

The load-bearing finding: ModelRetry verification works on every reachable model, including 8B-class ones — which is what makes the universal floor real rather than aspirational.