The tests and the build
The concept of having extensive tests as part of the build is not flawed; this is what the build is for. What is flawed is to have tests “that fail frequently but intermittently.” A test has value only when green reliably indicates success and red reliably indicates failure, whether it is a unit test that executes a one-line method doing some elementary stuff and asserts that the result equals a given value, or a system test that relies on dozens of different components, every one of which can fail. If the test fails randomly, that randomness makes it not just useless, but harmful: harmful because your team will learn to mistrust the build.
— Hey, I think we shouldn’t push this hotfix to production, because our build is red.
— Oh, come on, it’s probably some flaky test, as usual. Just push it to production manually.
And then you spend the next four hours trying to undo the catastrophic consequences of a mistake that could have been avoided by just looking at the build.
If you remove the tests from the build, then why have those tests in the first place? Imagine you run them by hand once per day (and you run them several times, since they are flaky). One of the tests appears to be consistently red. What now? How would you find which one of today’s fifty commits broke it? And how do you expect the developer who actually broke something to remember exactly what they were working on yesterday?
Flakiness in tests
Flakiness can come from several sources:
Individual components in a system fail. For instance, under heavy load one system can make another system fail, even when both systems are third-party (and you can’t change them) and you configured them correctly.
If this is the reason for a failure, it may indicate that your product doesn’t cope well with failures coming from outside. The solution is to make it more robust. There are plenty of different cases, and plenty of different solutions, such as failover, retry policies, etc.
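As a sketch of one such mitigation, here is what a retry policy with exponential backoff can look like. The function and parameter names are illustrative, not from any particular library:

```python
import time

def call_with_retry(operation, max_attempts=4, base_delay=0.1):
    """Retry a failing operation with exponential backoff.

    `operation` is any zero-argument callable; the delay doubles
    after each failed attempt (0.1 s, 0.2 s, 0.4 s, ...).
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # give up after the last attempt
            time.sleep(base_delay * 2 ** attempt)

# Example: a dependency that fails twice, then succeeds.
calls = {"count": 0}

def flaky_dependency():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary outage")
    return "ok"

result = call_with_retry(flaky_dependency)
print(result)  # ok
```

Real projects would usually reach for a dedicated library rather than hand-rolling this, but the principle is the same: a transient failure of an external component should not surface as a failure of your product.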
A system fails because of its interactions with the outside world. Imagine that the system tests run on infrastructure that is shared with three other products. When another team runs stress tests, the network may become so slow that your tests fail simply because parts of your product time out on the most basic things, such as waiting for a response from the database.
In this case, the solution is more isolation: move to dedicated infrastructure, or set up quotas to guarantee that every project has enough computing, network, and memory resources, no matter how other teams use the infrastructure.
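If the shared infrastructure happens to run on Kubernetes (an assumption; your platform may differ, and the namespace and figures below are made up), a per-team ResourceQuota is one way to enforce such guarantees:

```yaml
# Caps what the "team-a" namespace can request in total,
# so another team's stress tests can't starve your test runs.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-a-quota
  namespace: team-a
spec:
  hard:
    requests.cpu: "4"       # total CPU the namespace may request
    requests.memory: 8Gi    # total memory the namespace may request
    limits.cpu: "8"
    limits.memory: 16Gi
```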
A test fails because of the complexity of the system itself, or because the test platform is unreliable. I’ve seen this on several web projects, with tests running through an emulated browser. The complexity of the product itself meant that occasionally, an element wouldn’t be shown on a page as fast as needed, and even more worrisome, sometimes a test would simply misbehave for no apparent reason.
If this is what you have, you might move to a more reliable testing platform, as well as try to simplify the product itself as much as possible.
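One common mitigation for the timing part of this problem is to replace fixed sleeps with explicit polling: wait for the condition you actually need, up to a deadline. A minimal sketch, with an illustrative helper name and a fake “element” standing in for a slow-to-render page element:

```python
import time

def wait_until(condition, timeout=5.0, interval=0.05):
    """Poll `condition` (a zero-argument callable) until it returns
    a truthy value or `timeout` seconds elapse; raise on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition():
            return True
        time.sleep(interval)
    raise TimeoutError("condition not met within %.1f s" % timeout)

# Example: a condition that only becomes true after a short delay,
# standing in for an element that takes a moment to appear on the page.
start = time.monotonic()
element_visible = lambda: time.monotonic() - start > 0.2

wait_until(element_visible)  # returns once the element "appears"
print("element found")
```

Browser-testing frameworks generally ship a built-in version of this idea (explicit waits); the point is that the test encodes “wait for X, at most N seconds” instead of “sleep and hope,” which removes one large source of flakiness.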