Understanding Run Results

How individual test outcomes roll up into run-level results, and what drives the overall pass/fail decision.

A run executes multiple tests. Each test produces its own outcome, and the run's overall result is computed from those outcomes by a fixed set of rules applied in priority order.

Test outcome vs execution status

These are two different things:

| Concept | What it tracks | Example |
| --- | --- | --- |
| Execution status | Whether the session finished | completed, failed, aborted |
| Test outcome | What the test found | passed, failed, blocked |

An execution can be completed (the session ran to the end) while the test outcome is failed (the test found a defect). The test outcome is what matters for results.

Always look at test outcome

A completed execution does not mean the test passed. It means the agent finished running. The effective result is always the test outcome, not the execution status.
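
The distinction can be sketched in a few lines of Python. The `Execution` class, its field names, and the `is_pass` helper are hypothetical, chosen only for illustration; the status and outcome values are the ones listed in the table above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Execution:
    """Hypothetical record for one test execution (field names are assumptions)."""
    execution_status: str        # "completed", "failed", or "aborted"
    test_outcome: Optional[str]  # "passed", "failed", "blocked", or None if none was recorded


def is_pass(execution: Execution) -> bool:
    """Judge by test outcome only; execution status is deliberately ignored."""
    return execution.test_outcome == "passed"


# The session ran to the end, but the test found a defect.
finished_but_failed = Execution(execution_status="completed", test_outcome="failed")
print(is_pass(finished_but_failed))  # False
```

Note that `is_pass` never reads `execution_status`, which mirrors the rule that a completed execution says nothing about whether the test passed.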

How run results aggregate

Once all child tests complete, the run's overall status is set by the first rule below that matches:

| Rule (evaluated in order) | Run status |
| --- | --- |
| Any test is Verified Failed | Verified Failed |
| Else any test is Blocked | Blocked |
| Else any test is To Verify | To Verify |
| Else all tests are Verified Passed | Verified Passed |
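
The priority rules can be read as a first-match lookup. The following is a minimal sketch, not the product's actual implementation: the function name is hypothetical, and the snake_case identifiers `verified_failed`, `blocked`, and `to_verify` are assumed by analogy with `verified_passed`.

```python
# Statuses checked in priority order; the first one present decides the run.
# Identifier spellings are assumptions, not confirmed API values.
PRIORITY = ["verified_failed", "blocked", "to_verify"]


def aggregate_run_status(test_statuses):
    """Return the run status for a list of child test statuses."""
    for status in PRIORITY:
        if status in test_statuses:
            return status
    if test_statuses and all(s == "verified_passed" for s in test_statuses):
        return "verified_passed"
    # Empty runs and unknown statuses are not covered by the rules above.
    return None


# 49 passes and 1 failure: the single failure decides the run.
statuses = ["verified_passed"] * 49 + ["verified_failed"]
print(aggregate_run_status(statuses))  # verified_failed
```

Returning `None` for an empty or unrecognized input is a choice made for this sketch; the rules do not say how such runs are handled.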

One failure fails the run

A run with 49 passes and 1 failure is still Verified Failed. One failing test is enough to mark the entire run as failed.

Quarantined and skipped tests

Tests that are quarantined (flagged as flaky) or intentionally skipped are treated as neutral in aggregation. They cannot cause the run to pass or fail, because they are excluded from the result calculation.

A run where all non-quarantined tests pass is Verified Passed, even if some quarantined tests were skipped.
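
One way to picture the exclusion is as a filter applied before aggregation. This is a sketch under stated assumptions: the `ChildResult` class, its `quarantined` and `skipped` flags, and the `"skipped"` status string are all hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class ChildResult:
    """Hypothetical per-test record; field names are assumptions."""
    status: str
    quarantined: bool = False
    skipped: bool = False


def counted_statuses(results):
    """Drop quarantined and skipped tests before any aggregation happens."""
    return [r.status for r in results if not (r.quarantined or r.skipped)]


def all_counted_passed(results):
    """True when at least one test is counted and every counted test passed."""
    counted = counted_statuses(results)
    return bool(counted) and all(s == "verified_passed" for s in counted)


results = [
    ChildResult("verified_passed"),
    ChildResult("verified_passed"),
    ChildResult("skipped", quarantined=True, skipped=True),  # neutral, not counted
]
print(all_counted_passed(results))  # True
```

Requiring at least one counted test is an assumption of this sketch; the text does not say what happens when every test in a run is quarantined or skipped.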

Mixed outcomes and Needs Review

If a run contains a mix of test results and non-test statuses (like closed or cannot_reproduce alongside verified_passed), the system flags Needs Review and sets the run to To Verify. This prevents ambiguous outcomes from being treated as clean passes.
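
The mixed-outcome check might look like the sketch below. The membership of both sets is an assumption: the text names only `closed` and `cannot_reproduce` as examples of non-test statuses, and the other identifiers are assumed spellings. The function name and return shape are hypothetical as well.

```python
# Assumed groupings; the real system may recognize more statuses than these.
TEST_RESULT_STATUSES = {"verified_passed", "verified_failed", "blocked", "to_verify"}
NON_TEST_STATUSES = {"closed", "cannot_reproduce"}


def check_mixed(statuses):
    """Return (needs_review, run_status_override) for a list of child statuses."""
    has_test_result = any(s in TEST_RESULT_STATUSES for s in statuses)
    has_non_test = any(s in NON_TEST_STATUSES for s in statuses)
    if has_test_result and has_non_test:
        # Ambiguous mix: flag for review and hold the run at To Verify.
        return True, "to_verify"
    return False, None


print(check_mixed(["verified_passed", "closed"]))           # (True, 'to_verify')
print(check_mixed(["verified_passed", "verified_passed"]))  # (False, None)
```

A run that contains only test results falls through to the normal aggregation rules, which is why the second call returns no override.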

What each outcome means for the run

| Test outcome | Run impact | Action needed |
| --- | --- | --- |
| Passed | Positive signal | None |
| Failed | Run fails | Investigate failure, triage bug |
| Blocked | Run blocked | Check environment, dependencies |
| Skipped (quarantined) | Neutral (ignored) | Fix flaky test separately |
| Inconclusive | Run needs review | Human decides: retry or investigate |