Flaky automated tests

If you haven’t seen it yet, Sam Saffron recently published Tests that sometimes fail. It’s well worth the read.

I think we’ve been doing an OK job at managing our test suite, though in the last 2-3 releases it does seem as if the CI server is less and less reliable, taking longer and longer to get to a passing state.

From the Mitigation Patterns section, we tend to take the Do Nothing approach: CI ignores the failure, we manually re-run the test (or the whole pipeline), and hope for the best. We also have tickets filed to investigate and fix flaky tests, and sometimes, rarely, we delete a test that’s not adding value. Overall we’re still investing in the test suite and I don’t think it’s gotten away from us yet, though I do think we could do better.

Running the test suite constantly is an interesting idea, especially when paired with quarantining and fixing flaky tests. I’m curious what the team thinks - would we want to try such an approach? Is it worth the overhead?

Perhaps a simpler start is adopting the practice of recording the root cause of each flaky test in one place. I recently saw an excellent, informative analysis of a flaky contract test, but it was tucked away in a Jira comment. I’d propose we follow Sam’s advice and start collecting the cause of each flaky test we encounter here in this thread. If we start noticing patterns, we could compile what we find into a knowledge base in the wiki later.

Thoughts?

Best,
Josh

Great article, thanks for sharing!

I do agree we could be doing more about randomly failing tests (that’s what we call “flaky tests” in OpenLMIS). I like what they mention about the “If the build is not green, nothing gets deployed” rule. This is exactly the approach we had in OpenLMIS, and one we had to move away from because of the number of flaky tests we started seeing, which in turn paralyzed the whole workflow.

“Quarantine and fix” is what we mostly do for the performance tests. If any of the tests starts exceeding the allowed threshold or failing due to demo data changes, we usually raise the threshold or comment the test out and create a ticket to fix the issue. For other types of tests the approach is often “re-run until it passes”, and we see that mostly in the functional tests. Thanks to the longer code freeze period this release we have managed to focus more on their stability - several issues were identified, and we now also record the tests and capture browser console logs, both of which are saved as artifacts for failed tests and should help us do a better job of fixing them in the future.
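For reference, capturing the console logs looks roughly like the sketch below. This assumes a selenium-webdriver based setup with Chrome and Jasmine-style hooks; the ARTIFACT_DIR variable is made up for illustration, and in practice CI would only archive the directory for failed builds.

```js
const fs = require('fs');
const path = require('path');
const { Builder, logging } = require('selenium-webdriver');

let driver;

beforeAll(async () => {
  // Ask Chrome to capture everything the page writes to the console.
  const prefs = new logging.Preferences();
  prefs.setLevel(logging.Type.BROWSER, logging.Level.ALL);
  driver = await new Builder()
    .forBrowser('chrome')
    .setLoggingPrefs(prefs)
    .build();
});

afterEach(async () => {
  // Dump whatever the browser logged during the spec; CI can then archive
  // the directory as an artifact when the build fails.
  const entries = await driver.manage().logs().get(logging.Type.BROWSER);
  const lines = entries.map(e => `[${e.level.name}] ${e.message}`);
  const file = path.join(process.env.ARTIFACT_DIR || '.', `browser-console-${Date.now()}.log`);
  fs.appendFileSync(file, lines.join('\n') + '\n');
});

afterAll(async () => {
  await driver.quit();
});
```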

The problem with “quarantine and fix” is that sometimes it isn’t a single test that fails randomly - sometimes it’s a single cause that makes various tests fail randomly. That’s what we have recently been seeing in the functional tests: across various scenarios, the test would fail to move to the home screen after logging in.
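For a shared cause like that, it seems better to fix it once in the shared helper than to quarantine every spec that happens to hit it. Below is only a sketch of the idea, again assuming a selenium-webdriver based helper; the selectors and the URL fragment are made up.

```js
const { By, until } = require('selenium-webdriver');

// Hypothetical shared login helper used by many functional scenarios.
async function logIn(driver, username, password) {
  await driver.findElement(By.id('username')).sendKeys(username);
  await driver.findElement(By.id('password')).sendKeys(password);
  await driver.findElement(By.css('button[type="submit"]')).click();

  // Explicitly wait for the navigation instead of assuming it has already
  // happened by the time the next step runs - this is where the "stuck on
  // the login page" flakiness would otherwise surface in many scenarios.
  await driver.wait(until.urlContains('home'), 10000,
    'Home screen did not load after login');
}
```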

I like what they said they did as a first step - have the developer who originally introduced the flaky test fix it, and then document the problem (it could take the form of a root-cause analysis). First, it may be easiest and quickest for the original developer to identify what is causing the problem; second, it may help everyone focus more on the quality of the tests we are adding.

Having the test suite run constantly sounds interesting, but we already run the tests so often that we could probably identify flaky tests without that overhead. Finally, as I mentioned above, sometimes it isn’t a single test - sometimes it’s a single cause that affects many tests, and those need a different approach than “quarantine and fix”.

Best,
Sebastian

Great article indeed!

I think that the “quarantine and fix” approach combined with the “If the build is not green, nothing gets deployed” rule is the way to go. As mentioned in the article:

I think there is a very strong argument to say a test suite of 100 tests that passes 100% of the time when you rerun it against the same code base is better than a test suite of 200 tests where passing depends on a coin toss.

I totally agree with that point. By skipping some of the tests we are not necessarily reducing the number of tests (we can always fix them), but we would stabilize the rest of the suite and finally make a failure mean something instead of being annoying background noise.
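In the Jasmine suites, quarantining a spec is just a matter of switching it to xit (or xdescribe for a whole block) and leaving a pointer to the tracking ticket, so it shows up as pending in the report instead of being silently forgotten. A minimal sketch, with placeholder spec names and ticket number:

```js
describe('requisition approval', function() {

  it('shows the approval queue', function() {
    // stable spec, keeps running
  });

  // Quarantined: fails intermittently, see OLMIS-XXXX (placeholder)
  // for the root-cause analysis.
  xit('submits an approval with a comment', function() {
    // flaky spec, skipped until the underlying cause is fixed
  });
});
```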

Also, quarantining the flaky features in the functional tests could give us a clear picture of how many of them are actually flaky. Looking at the build artifacts we can see there are around a dozen, but is it always the same tests failing, or is it only the number of flaky tests that stays the same? I’m not sure, and I must admit I never actually checked.
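Something like the rough script below could answer that from the test reports Jenkins already keeps. It is only a sketch: the job URL and build numbers are placeholders, it assumes the standard JUnit test-report JSON at `<job>/<build>/testReport/api/json`, and the exact JSON shape can differ per job type.

```js
// Node 18+ (global fetch). Counts how often each test failed across builds;
// tests that failed in some builds but not all are the flaky candidates.
const JOB = 'https://ci.example.org/job/OpenLMIS-functional-tests'; // placeholder
const BUILDS = [120, 121, 122, 123, 124];                           // placeholder

async function failedTests(build) {
  const res = await fetch(`${JOB}/${build}/testReport/api/json`);
  if (!res.ok) return [];
  const report = await res.json();
  const failed = [];
  for (const suite of report.suites || []) {
    for (const c of suite.cases || []) {
      if (c.status === 'FAILED' || c.status === 'REGRESSION') {
        failed.push(`${c.className}.${c.name}`);
      }
    }
  }
  return failed;
}

(async () => {
  const counts = new Map();
  for (const build of BUILDS) {
    for (const test of await failedTests(build)) {
      counts.set(test, (counts.get(test) || 0) + 1);
    }
  }
  for (const [test, count] of [...counts].sort((a, b) => b[1] - a[1])) {
    console.log(`${count}/${BUILDS.length}  ${test}`);
  }
})();
```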

The article also contains some points that could be applied to the UI, like:

Our JavaScript test suite integration tests have been amongst the most difficult tests to stabilise. They cover large amounts of code in the application and require Chrome web driver to run. If you forget to properly clean up a few event handlers, over thousands of tests this can lead to leaks that make fast tests gradually become very slow or even break inconsistently.

This, together with the issue on GitHub below, makes me pretty sure we’re encountering some heavy memory leaks in our tests. We have already seen tests break in the past because of the amount of memory used.
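The fix the article hints at is simply making sure everything a spec registers gets torn down again in afterEach - listeners, watchers, deregistration functions - so nothing accumulates across thousands of specs. A minimal sketch of the pattern in a Jasmine spec (the event and spec names are made up):

```js
describe('offline banner', function() {

  let onOffline;

  beforeEach(function() {
    onOffline = jasmine.createSpy('onOffline');
    window.addEventListener('offline', onOffline);
  });

  afterEach(function() {
    // Without this, the listener (and everything it closes over) stays
    // attached for the rest of the run.
    window.removeEventListener('offline', onOffline);
  });

  it('reacts to the offline event', function() {
    window.dispatchEvent(new Event('offline'));
    expect(onOffline).toHaveBeenCalled();
  });
});
```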

Other than that, the article suggests running your tests concurrently and in random order. I’ve already done some investigation and created the following ticket for introducing concurrency to the unit tests on the UI:
https://openlmis.atlassian.net/browse/OLMIS-6354
I’ve yet to investigate how to run the UI unit tests in random order. Looking at the Jasmine documentation it should be fairly easy, but so far I haven’t been able to validate that the solution actually works.
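If we stick with Karma, it looks like random order could be switched on from karma.conf.js via the options karma-jasmine forwards to Jasmine - a sketch, assuming a reasonably recent karma-jasmine version:

```js
// karma.conf.js excerpt (sketch). Options under client.jasmine are passed
// straight through to Jasmine itself.
module.exports = function(config) {
  config.set({
    frameworks: ['jasmine'],
    client: {
      jasmine: {
        random: true,  // run specs in random order
        seed: 4321     // pin the seed to reproduce a given order locally
      }
    }
  });
};
```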

Best regards,
Nikodem