If you haven’t seen it yet, Sam Saffron recently published Tests that sometimes fail. It’s well worth the read.
I think we’ve been doing an OK job of managing our test suite, though over the last 2-3 releases the CI server does seem less and less reliable, taking longer and longer to get to a passing state.
From the Mitigation Patterns section, we tend to take a Do Nothing approach: CI ignores the failure, we manually re-run the test (or the whole pipeline), and hope for the best. We also have tickets filed to investigate and fix flaky tests. Occasionally, rarely, we delete a test that’s not adding value. Overall we’re still investing in the test suite and I don’t think it’s gotten away from us yet, though I do think we could do better.
Running the test suite constantly is an interesting idea, especially when paired with quarantining and fixing flaky tests. I’m curious what the team thinks: would we want to try such an approach? Is it worth the overhead?
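To make the "run it constantly" idea concrete, here's a minimal sketch (plain Python, hypothetical names, not our actual tooling) of the core loop: hammer a single test many times and estimate its flake rate, which is roughly what a scheduled repeat-runner would do per test before deciding to quarantine it. The `flaky_test` stand-in uses a seeded RNG just so the example is reproducible.

```python
import random

def estimate_flake_rate(test_fn, runs=1000):
    """Run a test function repeatedly and return its observed failure rate.

    Hypothetical harness: test_fn returns True on pass, False on fail.
    A real runner would shell out to the test framework instead.
    """
    failures = sum(0 if test_fn() else 1 for _ in range(runs))
    return failures / runs

# Deterministic stand-in for a flaky test: fails ~30% of the time.
rng = random.Random(42)

def flaky_test():
    return rng.random() > 0.3

rate = estimate_flake_rate(flaky_test, runs=1000)
print(f"observed failure rate: {rate:.1%}")
```

A real version would run each quarantined test on a schedule and only let it back into the main suite once the observed rate stays at zero for some window, which is essentially the quarantine workflow Sam describes.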
Perhaps a simpler start is adopting the practice of recording the root cause of each flaky test in one place. I know I saw an analysis recently on a flaky contract test which was excellent and informative, though tucked away in a Jira comment. I’d propose we follow Sam’s advice and start collecting what we find as the cause of each flaky test we encounter here in this thread. If we start noticing patterns, we could compile what we find into a knowledge base in the wiki later.