Capability-tiered models and benchmarks

heal is built to run on whatever model you have — from frontier APIs down to an 8B model on a laptop. It treats model capability as a first-class axis, probed rather than assumed.

The output-mode ladder

Structured output can be transported three ways. heal resolves the best one per backend, with a universal floor:

flowchart LR
    A[tool calling<br/>reliable?] -->|yes| TOOL[tool output]
    A -->|no| B[native JSON<br/>schema?]
    B -->|yes| NATIVE[native output]
    B -->|no| PROMPTED[prompted JSON<br/>universal floor]

Crucially, verification lives in output validators, which work in every mode — so even a prompted-JSON-only model heals with the same live checks. Exploration tools are attached only on backends with reliable tool calling; everything else gets richer pre-curated evidence instead.
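
The ladder above can be read as a small decision function. The following is a minimal sketch, not heal's actual code: `CapabilityProfile`, its field names, `resolve_output_mode`, and `attach_exploration_tools` are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class OutputMode(Enum):
    TOOL = "tool"          # structured output carried by a tool call
    NATIVE = "native"      # backend-enforced JSON schema
    PROMPTED = "prompted"  # JSON requested in the prompt: the universal floor


@dataclass(frozen=True)
class CapabilityProfile:
    # Hypothetical fields; a real profile would be filled in by probing.
    reliable_tool_calling: bool = False
    native_json_schema: bool = False


def resolve_output_mode(profile: CapabilityProfile) -> OutputMode:
    """Walk the ladder top-down and fall through to the prompted floor."""
    if profile.reliable_tool_calling:
        return OutputMode.TOOL
    if profile.native_json_schema:
        return OutputMode.NATIVE
    return OutputMode.PROMPTED


def attach_exploration_tools(profile: CapabilityProfile) -> bool:
    """Exploration tools are only useful where tool calling is reliable."""
    return profile.reliable_tool_calling
```

Because the last branch is unconditional, every backend resolves to some mode, which is what makes prompted JSON a floor rather than a special case.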

Probe, don't assume

heal doctor fires tiny calls at each configured endpoint to measure tool calling, native JSON, prompted JSON, and vision, then resolves the capability profile. This caught real backend quirks:

  • MiniMax mishandles forced tool_choice — the same triage task ran in 14s or 311s depending on a profile flag, and tool-mode validator loops failed outright while prompted mode passed in 16s. heal ships a built-in profile that resolves MiniMax to prompted output.
  • vLLM rejects strict tool schemas — heal strips them automatically.
  • Small models differ in the kind of failure they hit: transport (no tool endpoint), quality (prompted misclassification), or availability. That variation is exactly why per-model probing beats a global setting.
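
To illustrate the vLLM quirk, here is a rough sketch of what stripping strict tool schemas could look like. It assumes OpenAI-style tool definitions, where `strict` is a boolean on the `function` object; `strip_strict` is a hypothetical helper, and heal's own stripping logic may differ.

```python
import copy
from typing import Any


def strip_strict(tools: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Return a copy of the tool definitions with any `strict` flag removed.

    Assumes the OpenAI-style shape:
    {"type": "function", "function": {"name": ..., "strict": ..., "parameters": ...}}
    """
    cleaned = copy.deepcopy(tools)  # leave the caller's definitions untouched
    for tool in cleaned:
        function = tool.get("function")
        if isinstance(function, dict):
            function.pop("strict", None)  # no-op when the flag is absent
    return cleaned
```

Working on a copy keeps the original definitions intact, so the same tool list can still be sent unchanged to backends that accept strict schemas.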

What works where

From the experiment matrix (experiments/minimax-probe/FINDINGS.md):

| Backend | Locator heal | Notes |
| --- | --- | --- |
| gpt-4.1-nano | ✅ ~4s | cheapest tier works well |
| MiniMax-M2.5 | ✅ ~15s | prompted mode; tool_choice quirk auto-handled |
| qwen3-14b | ✅ slow | no tool endpoints; prompted floor |
| llama-3.1-8b | ⚠️ | retry loop converges; triage quality limited |

The load-bearing finding: ModelRetry verification works on every reachable model, including 8B-class ones — which is what makes the universal floor real rather than aspirational.
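
The retry-based verification can be sketched independently of any transport mode. This is a simplified illustration, not heal's implementation: the `ModelRetry` class below is a local stand-in for the framework's exception, and `run_with_validation`, `validate_locator`, and the `selector` field are hypothetical.

```python
import json
from typing import Any, Callable, Optional


class ModelRetry(Exception):
    """Local stand-in: raised by a validator to request another attempt."""


def validate_locator(raw: str) -> dict[str, Any]:
    """Parse prompted JSON and check it; raise ModelRetry with feedback on failure."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ModelRetry(f"Output was not valid JSON: {exc}") from exc
    if not isinstance(data, dict) or "selector" not in data:
        raise ModelRetry("Output must be a JSON object with a 'selector' field.")
    return data


def run_with_validation(
    call_model: Callable[[Optional[str]], str],
    validate: Callable[[str], dict[str, Any]],
    max_retries: int = 3,
) -> dict[str, Any]:
    """Call the model, validate its raw text, and feed errors back until it passes."""
    feedback: Optional[str] = None
    for _ in range(max_retries + 1):
        raw = call_model(feedback)  # feedback is None on the first attempt
        try:
            return validate(raw)
        except ModelRetry as exc:
            feedback = str(exc)  # becomes the corrective message for the next call
    raise RuntimeError(f"Validation failed after {max_retries} retries: {feedback}")
```

The loop only needs the model's raw text, so it behaves the same whether the backend resolved to tool, native, or prompted output. A real validator would also run the live check against the target rather than only inspecting the JSON shape.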