Threading and execution model

This is the load-bearing piece of plumbing. Three facts collide:

Robot Framework listener callbacks are synchronous.
pydantic-ai agents are async and should be reused across a run.
The Browser/Appium library instance is only safe on RF's main thread.

The actor model

heal runs one persistent asyncio loop on a dedicated thread. The listener submits a healing transaction to that loop and then, while blocked waiting for the outcome, services a request queue: any driver or RF call the engine needs is marshalled back to the main thread, while the LLM work runs on the healer loop.

sequenceDiagram
    participant Main as RF main thread
    participant Loop as healer loop
    Main->>Loop: submit transaction
    activate Loop
    Loop->>Loop: agents + evidence (parallel)
    Loop-->>Main: need DOM / screenshot / rerun?
    Main->>Main: execute on main thread
    Main-->>Loop: result
    Loop-->>Main: HealOutcome + RCA
    deactivate Loop
    Main->>Main: apply outcome (status, assign, log)

This single structure solves all three constraints at once: the browser is only ever touched on the main thread; agents get a real, reused event loop with parallel LLM calls; and there is no nested-event-loop fragility, since the synchronous listener only blocks on a queue and never starts a loop itself.
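The flow in the diagram can be sketched with the standard library alone. This is a minimal illustration, not heal's actual code: `HealerLoop`, `call_on_main`, `run_transaction`, `read_dom`, and `transaction` are hypothetical names, and the two "agents" are stand-ins built from `asyncio.sleep`.

```python
import asyncio
import queue
import threading


class HealerLoop:
    """Hypothetical wrapper: one persistent asyncio loop on its own thread."""

    def __init__(self):
        self.loop = asyncio.new_event_loop()
        # Requests for work that must run on the main thread.
        self.requests = queue.Queue()
        threading.Thread(
            target=self.loop.run_forever, name="healer-loop", daemon=True
        ).start()

    async def call_on_main(self, fn, *args):
        """Awaited on the healer loop; fn itself runs on the main thread."""
        reply = self.loop.create_future()
        self.requests.put((fn, args, reply))
        return await reply

    def run_transaction(self, coro):
        """Called on the main thread: submit, then service requests while blocked."""
        done = asyncio.run_coroutine_threadsafe(coro, self.loop)
        # A None sentinel wakes the main thread once the transaction finishes.
        done.add_done_callback(lambda _: self.requests.put(None))
        while (item := self.requests.get()) is not None:
            fn, args, reply = item
            try:
                result = fn(*args)  # driver / RF call, on the main thread
            except Exception as exc:
                self.loop.call_soon_threadsafe(reply.set_exception, exc)
            else:
                self.loop.call_soon_threadsafe(reply.set_result, result)
        return done.result()


def read_dom():
    """Stand-in for a driver call; reports which thread executed it."""
    return threading.current_thread().name


async def transaction(healer):
    # Two stand-in "LLM calls" run concurrently on the healer loop.
    answers = await asyncio.gather(
        asyncio.sleep(0.01, "agent-a"), asyncio.sleep(0.01, "agent-b")
    )
    dom_thread = await healer.call_on_main(read_dom)
    return answers, dom_thread, threading.current_thread().name


if __name__ == "__main__":
    healer = HealerLoop()
    print(healer.run_transaction(transaction(healer)))
```

The sentinel is only safe here because the transaction awaits every request it issues, so no request can still be queued when the transaction completes. A design that fires requests without awaiting them would need a different shutdown signal.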

Re-entrancy and abandonment

Re-entrancy guard: while a transaction is active, listener events triggered by heal's own keyword reruns are ignored (a single flag the engine owns) — so a rerun never spawns a nested transaction.
Abandonment: if a transaction exceeds its budget, the listener unblocks after a grace period, the keyword stays failed, and the run continues — a hung agent never hangs the suite.
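Both rules fit in a few lines on the listener side. The sketch below is illustrative only: `GuardedListener`, `on_keyword_failed`, `start_healer_loop`, and the string return values are hypothetical, it assumes the listener waits for the budget plus the grace period before giving up, and it leaves out the request-queue servicing to stay short.

```python
import asyncio
import threading
from concurrent.futures import TimeoutError as FutureTimeout


def start_healer_loop():
    """Start a persistent event loop on a daemon thread and return it."""
    loop = asyncio.new_event_loop()
    threading.Thread(target=loop.run_forever, name="healer-loop", daemon=True).start()
    return loop


class GuardedListener:
    """Hypothetical listener-side wrapper; not heal's real interface."""

    def __init__(self, loop, budget_s, grace_s):
        self.loop = loop
        self.budget_s = budget_s
        self.grace_s = grace_s
        self.active = False  # the single re-entrancy flag

    def on_keyword_failed(self, make_transaction):
        # Events raised by heal's own reruns arrive while a transaction is
        # active; ignore them so no nested transaction is spawned.
        if self.active:
            return "ignored"
        self.active = True
        try:
            future = asyncio.run_coroutine_threadsafe(make_transaction(), self.loop)
            try:
                # Assumption: unblock after budget + grace at the latest.
                return future.result(timeout=self.budget_s + self.grace_s)
            except FutureTimeout:
                future.cancel()  # request cancellation of the hung task
                return "abandoned"  # keyword stays failed, run continues
        finally:
            self.active = False
```

Passing a factory (`make_transaction`) rather than a coroutine object means an ignored event never creates a coroutine that would go unawaited.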

Proven, not assumed

A spike ran this model inside a real Robot Framework run: 4/4 tests passed — keyword rerun, return-value assignment, parallel loop work, and timeout abandonment all behaved as designed, with clean output.xml/log.html. No fallback design was needed.

Source: experiments/rf-threading/FINDINGS.md.