Understanding Run Results
How individual test outcomes roll up into run-level results, and what drives the overall pass/fail decision.
A run executes multiple tests. Each test produces its own outcome, and the run's overall result is computed from those outcomes by a fixed set of rules evaluated in priority order.
Test outcome vs execution status
These are two different things:
| Concept | What it tracks | Example |
|---|---|---|
| Execution status | Whether the session finished | completed, failed, aborted |
| Test outcome | What the test found | passed, failed, blocked |
An execution can be completed (the session ran to the end) while the test outcome is failed (the test found a defect). The test outcome is what matters for results.
Always look at test outcome
A completed execution does not mean the test passed. It means the agent finished running. The effective result is always the test outcome, not the execution status.
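As a minimal sketch of this distinction, the Python below keeps the two values in separate fields and reads only the test outcome when reporting a result. The `Execution` class, its field names, and `effective_result` are hypothetical names chosen for illustration, not part of the product's API.

```python
from dataclasses import dataclass


@dataclass
class Execution:
    """Hypothetical record; the field names are assumptions for illustration."""
    execution_status: str  # whether the session finished: "completed", "failed", "aborted"
    test_outcome: str      # what the test found: "passed", "failed", "blocked"


def effective_result(execution: Execution) -> str:
    # The effective result is the test outcome. Execution status is
    # deliberately not consulted here.
    return execution.test_outcome


# The session ran to the end, but the test found a defect.
finished_with_defect = Execution(execution_status="completed", test_outcome="failed")
print(effective_result(finished_with_defect))  # prints "failed"
```

Keeping the two fields separate makes it harder to mistake a finished session for a passing test.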
How run results aggregate
Once every child test has completed, the run's overall status is set by the first rule below that matches:
| Rule (evaluated in order) | Run status |
|---|---|
| Any test is Verified Failed | Verified Failed |
| Else any test is Blocked | Blocked |
| Else any test is To Verify | To Verify |
| Else all tests are Verified Passed | Verified Passed |
One failure fails the run
A run with 49 passes and 1 failure is still Verified Failed. One failing test is enough to mark the entire run as failed.
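The priority rules in the table above can be sketched as a short Python function. This is an illustrative sketch, not the product's implementation: the function name `aggregate_run_status`, the string spellings of the statuses, and the fallback for inputs the table does not cover are all assumptions.

```python
from typing import Iterable

# Status labels as they appear in the rules table; the exact strings are assumptions.
VERIFIED_FAILED = "Verified Failed"
BLOCKED = "Blocked"
TO_VERIFY = "To Verify"
VERIFIED_PASSED = "Verified Passed"

# Rules are checked in this order; the first status found decides the run.
PRIORITY = [VERIFIED_FAILED, BLOCKED, TO_VERIFY]


def aggregate_run_status(test_statuses: Iterable[str]) -> str:
    statuses = list(test_statuses)
    for status in PRIORITY:
        if status in statuses:
            return status
    if statuses and all(s == VERIFIED_PASSED for s in statuses):
        return VERIFIED_PASSED
    # Assumption: the table does not cover an empty run or unknown statuses,
    # so this sketch falls back to the most cautious non-failing status.
    return TO_VERIFY


# 49 passes and 1 failure: the single failure decides the run.
run = [VERIFIED_PASSED] * 49 + [VERIFIED_FAILED]
print(aggregate_run_status(run))  # prints "Verified Failed"
```

Because the rules are checked in order, a failure outranks a blocked test, and a blocked test outranks one that still needs verification.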
Quarantined and skipped tests
Tests that are quarantined (flagged as flaky) or intentionally skipped are treated as neutral in aggregation. They are excluded from the result calculation, so they can neither fail the run nor count toward a pass.
A run where all non-quarantined tests pass is Verified Passed, even if some quarantined tests were skipped.
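One way to picture the exclusion is as a filter applied before any aggregation happens. The sketch below assumes a hypothetical `ResultRecord` with `quarantined` and `skipped` flags; these names and the status strings are illustrative, not the product's data model.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ResultRecord:
    """Hypothetical per-test record; field names are assumptions."""
    name: str
    status: str
    quarantined: bool = False
    skipped: bool = False


def counted_results(results: List[ResultRecord]) -> List[ResultRecord]:
    # Quarantined and intentionally skipped tests are neutral:
    # drop them before the run result is calculated.
    return [r for r in results if not (r.quarantined or r.skipped)]


results = [
    ResultRecord("login", "Verified Passed"),
    ResultRecord("checkout", "Verified Passed"),
    ResultRecord("flaky_search", "Skipped", quarantined=True, skipped=True),
]

counted = counted_results(results)
print([r.name for r in counted])                          # prints "['login', 'checkout']"
print(all(r.status == "Verified Passed" for r in counted))  # prints "True"
```

The sketch does not decide what a run with only quarantined tests should report, since the text above does not specify that case.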
Mixed outcomes and Needs Review
If a run contains a mix of test results and non-test statuses (like closed or cannot_reproduce alongside verified_passed), the system flags Needs Review and sets the run to To Verify. This prevents ambiguous outcomes from being treated as clean passes.
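The mixed-outcome check might look like the following sketch. Only `closed`, `cannot_reproduce`, and `verified_passed` appear in the text above; the other identifiers in `TEST_RESULT_STATUSES` and the function name `needs_review` are assumptions made for illustration.

```python
from typing import Iterable

# "verified_passed" comes from the text; the other three spellings are assumed.
TEST_RESULT_STATUSES = {"verified_passed", "verified_failed", "blocked", "to_verify"}
# Non-test statuses named in the text; the real list may be longer.
NON_TEST_STATUSES = {"closed", "cannot_reproduce"}


def needs_review(statuses: Iterable[str]) -> bool:
    seen = set(statuses)
    has_test_result = bool(seen & TEST_RESULT_STATUSES)
    has_non_test_status = bool(seen & NON_TEST_STATUSES)
    # Needs Review is flagged only when both kinds appear in the same run.
    return has_test_result and has_non_test_status


print(needs_review(["verified_passed", "closed"]))           # prints "True"
print(needs_review(["verified_passed", "verified_passed"]))  # prints "False"
```

When `needs_review` returns `True`, the run would be set to To Verify rather than reported as a clean pass.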
What each outcome means for the run
| Test outcome | Run impact | Action needed |
|---|---|---|
| Passed | Positive signal | None |
| Failed | Run fails | Investigate failure, triage bug |
| Blocked | Run blocked | Check environment, dependencies |
| Skipped (quarantined) | Neutral — ignored | Fix flaky test separately |
| Inconclusive | Run needs review | Human decides: retry or investigate |
