description: Decision tree for triaging a VR diff — flakiness vs environment drift vs real regression. tldr: When a diff fires, work the triage tree: same image on re-run (flake → stability checklist), differs by machine (env drift → pin Docker image), differs per browser (browser-specific rendering → per-browser baselines), differs CI vs local (always capture inside Docker). Skip only with a linked issue; never skip to ship a suspected regression.
Triaging False Positives
When to Use
Use this guide when a diff fires. Before assuming it's intentional, work through the triage tree.
Decision
| Question | Finding | Resolution |
|---|---|---|
| Same diff on re-run? | No — diff varies run-to-run | Flakiness. Go to Stability Checklist. Most likely: animations, fonts, or network idle |
| Different on different machines? | Yes | Environment drift. Pin the Playwright Docker image version exactly (no :latest); use the same image locally and in CI |
| Different per browser? | Yes | Browser-specific rendering. Accept per-browser baselines (Playwright keys by project name) or mask the affected region |
| Different in CI vs local? | Yes | Almost always fonts, DPI, or GPU rendering. Only capture and compare inside the Playwright Docker image — never on the host |
When to .skip vs fix immediately
.skip is acceptable when:
- The diff is a known flake under active investigation (link the issue in the skip comment)
- A vendor library upgrade introduced rendering differences you'll address in a follow-up PR
.skip is not acceptable as a way to ship a PR you suspect contains a real regression. If skips accumulate beyond ~5% of the suite, the suite is dead.
Pattern
# Q1: Re-run the failing test twice
npx playwright test --grep "<failing test>"
npx playwright test --grep "<failing test>"
# If diff differs run-to-run → flakiness (not a regression)
Common Mistakes
- Wrong: re-running until it passes → Right: produces a "stable enough" baseline that masks real diffs
- Wrong: bumping
thresholdinstead of fixing the env → Right: silent erosion of test value - Wrong:
.skip-ing flakes indefinitely → Right: flake debt accumulates; team stops trusting the suite