Saturday, June 17, 2017

CI/CD and Second Order Test Concerns

Cisco has some reasonably mature media products (phone and video) built using the microservices approach, with Continuous Integration/Continuous Delivery and plenty of automated testing. As our products matured, the nature of the challenges changed: we ran into second order test effects. The first order effect of the tests is to exercise our production source code, catching bugs and increasing the production code's quality. The second order effect is the increasing overhead of designing, building, operating, modifying, cleaning up and eliminating the automated tests themselves. As the total number of tests increases, both the performance and the reliability of the tests become critical to your ability to turn the CI/CD crank on each new change. To make life interesting, we have a world of great techniques we use to improve our production code and apply almost none of them to our test code.
Cisco's agile process uses a fairly rigidly defined "definition of done" with a long list of requirements. It's a bit of a pain, but it did indeed yield code that had appropriate unit, sanity, regression, integration, feature, system, load, performance and soak tests. Code was always fairly modular due to a hard cyclomatic complexity requirement, and we used all the latest bug-scanning tools and so forth. Coverage was kept high, and we got large benefits from the careful and frequent testing.
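To make the complexity gate concrete, here's a rough sketch of the kind of check I mean. This is not Cisco's actual tooling; the threshold, the use of Python's ast module and the branch-counting approximation are all just for illustration.

# Simplified sketch of a cyclomatic complexity gate: count branch points in
# each function and fail the build if any function is too tangled.
# Threshold and heuristic are illustrative only.
import ast
import sys

MAX_COMPLEXITY = 10  # hypothetical limit

BRANCH_NODES = (ast.If, ast.For, ast.While, ast.Try,
                ast.With, ast.BoolOp, ast.ExceptHandler)

def complexity(func_node):
    """Approximate cyclomatic complexity: 1 + number of decision points."""
    return 1 + sum(isinstance(node, BRANCH_NODES)
                   for node in ast.walk(func_node))

def check_file(path):
    tree = ast.parse(open(path).read(), filename=path)
    failures = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            score = complexity(node)
            if score > MAX_COMPLEXITY:
                failures.append((node.name, score))
    return failures

if __name__ == "__main__":
    bad = [(path, name, score) for path in sys.argv[1:]
           for name, score in check_file(path)]
    for path, name, score in bad:
        print(f"{path}: {name} has complexity {score} > {MAX_COMPLEXITY}")
    sys.exit(1 if bad else 0)

The point is that every production file had to pass something like this before it could merge; as I'll get to below, nothing equivalent ever ran against the test code.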
This allowed us to deliver changes and features much more quickly at first. We each built our handful of microservices and their little universes of tests, then added tests for the microservices we depended on. Every time new features are added, multiple new automated tests of various sorts are needed. As time passes and you grow features in an agile manner, you end up depending on more and more microservices, and you only have to get burned a couple of times to realize you need to add tests that verify the features of the other microservices you rely on do indeed work. This leads to fuzzier lines of responsibility, reinvented test approaches without shared best practices, and hard-to-maintain tests. Communication across teams helps, but is time consuming.
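A made-up example of what those "tests for someone else's microservice" looked like in spirit; the service name, URL and response fields here are invented for illustration, not anything we actually shipped.

# Hypothetical consumer-side check against a dependency's API: call the
# other team's endpoint and assert the feature we rely on still behaves.
import requests

ROSTER_URL = "http://roster-service.internal.example/v1/conferences/{id}/participants"

def test_dependency_returns_participant_list():
    # We only consume this service, but we got burned enough times
    # that we verify its behavior in our own suite.
    resp = requests.get(ROSTER_URL.format(id="smoke-test-conf"), timeout=5)
    assert resp.status_code == 200
    participants = resp.json()
    assert isinstance(participants, list)
    assert all("displayName" in p for p in participants)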
Every time a customer issue is fixed, a regression test is added. Tests accumulate, and when a large organization is applying thousands of developers to building new interdependent microservices, the tests multiply at an amazing rate.
Like anything else, writing good tests takes time to learn and master. Since the production code is the actual shipping item, much less time is spent revisiting tests, cleaning them up, and making them modular and less complex. Get the code looking good, get the test working (it does not have to look good), and check it all in. This also means you're slower to master the test-coding process: it's lower priority than the features, since features get your team those critical velocity points.
Given the requirements, maximizing velocity means skimping on the tests and mostly leaving them in a moderately functional state, not the desirable well-tested, cleaned-up state that increases quality and maintainability. Production code coverage is checked; test code coverage itself is never looked at. Production code is measured for cyclomatic complexity and rejected if it isn't fairly simple, but that is not done with test code. No automated bug checkers for test code!
Over time you get some sweet microservices providing awesomely scalable, performant and reliable features in a manner that simply can't be done in an old-school behemoth solution. The pattern works extremely well, but it also accumulates a huge amount of technical debt: the test code turns into a world of hurt. This is the most painful second order test concern of CI/CD systems that I've seen. Favoring production code over test code gets increasingly expensive over time, especially as you scale up the number of contributors.
Just as we are mastering the architecture and the approach, delivering new features and bug fixes at a rapid pace, just as our "velocity" starts peaking (boy, did Cisco go on about velocity), the tests have accumulated huge amounts of poorly designed, monolithic, non-modular, error- and breakage-prone code.
Our CI/CD systems refuse to integrate a change if the tests fail. The first wave of pain came when scale started increasing massively and performance (as expected) dropped a bit. All was comfortably within expectations, but a few tests would break due to poor design and timing dependencies. Occasionally a code submission failed to go through because a test that had nothing to do with your code failed; having never seen that test code, you have no idea why. Rather than check carefully, you immediately rerun the test. If it passes this time, given that it has nothing to do with your code, the temptation to ignore the failure is pretty much overwhelming: try best two out of three, and if it passes, it's in! While this is an insidious practice, the nature of timing dependencies in tests is that failures are intermittent. If a test fails too frequently, the team responsible for it will notice and fix it; if it always fails, the team responsible will be found and told to fix it so that code can be promoted. It's the in-between, occasionally failing tests that linger, and this is the sort of situation that gets you to switch off tests so you can promote a change. If you find yourself switching off tests, you're probably not spending enough time maintaining your test code.
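For what it's worth, most of the timing-dependent breakage came down to tests that assumed an asynchronous operation would finish within a fixed sleep. Here's a minimal sketch of the usual fix, polling with a generous deadline instead of hard-coding a delay; the helper, function names and timings are illustrative, not our actual test framework.

import time

def wait_until(condition, timeout=30.0, interval=0.5):
    """Poll `condition` until it returns True or the deadline passes."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition():
            return True
        time.sleep(interval)
    return False

# Fragile version: breaks as soon as the system slows down a little.
#   start_call(); time.sleep(2); assert call_is_connected()
#
# More robust version: tolerates slowdowns up to the deadline.
#   start_call()
#   assert wait_until(call_is_connected, timeout=30)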
Now there are thousands of tests, and on top of the random test failures, the tests themselves start taking longer and longer, pretty quickly to an unacceptable degree. Time now has to be spent going back and sorting tests into ones to run very frequently vs. occasionally vs. rarely, to get acceptable performance out of the different phases of the test suites without losing the coverage and quality benefits. Cisco's product managers weren't about to assign us user stories for things they didn't care about and didn't feel responsible for, so the problem would fester until enough engineers on enough different teams were complaining that it finally percolated a few levels up and some VP had to step in and re-purpose efforts, assembling a team across the groups with the offending test suites to spend a week or two cleaning up. After the slow downward creep in velocity caused by the problem, velocities drop even further as teams change focus and temporarily lose members, and executives are unhappy.
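The mechanics of that sorting are simple enough if you're using something like pytest markers; this is an illustrative sketch, not our actual setup. Cheap checks gate every commit, the heavy tiers run nightly or weekly.

import pytest

def test_participant_count_updates():
    # cheap unit-level check: runs on every commit
    assert 1 + 1 == 2

@pytest.mark.soak
def test_call_survives_hours_of_participant_churn():
    # expensive soak test: only run in the nightly/weekly pipeline
    ...

# Per-commit gate:     pytest -m "not soak"
# Nightly/weekly run:  pytest -m soak
# (custom markers like "soak" get registered in pytest.ini to avoid warnings)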
Pretty soon the occasional build failure becomes a reliable build failure, sometimes with two or three random cases failing. Once again, no single scrum team is in a position to address all of the issues: we haven't noticed our own intermittently failing tests blocking us (or if we do, we fix that one); we just get stuck by everyone else's blocking us. Note that this is an evil network effect: the bigger you are, the worse the problem any given level of unreliable tests will cause you, and it goes up faster than linearly, I'm pretty sure. At companies as large as Cisco this becomes a major concern.
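Some back-of-the-envelope arithmetic (illustrative numbers, not measurements) shows why even a tiny flake rate hurts so much at scale.

# Even a 0.1% spurious failure rate per flaky test becomes a near-certain
# build failure once enough tests sit in the gate.
flake_rate = 0.001
for num_tests in (100, 1_000, 5_000):
    p_build_breaks = 1 - (1 - flake_rate) ** num_tests
    print(f"{num_tests:>5} tests -> {p_build_breaks:.0%} of builds fail spuriously")

# Prints roughly: 100 tests -> 10%, 1,000 tests -> 63%, 5,000 tests -> 99%.
# A bigger organization also produces more commits per day, so the total
# time lost to reruns grows much faster than the size of any one team.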
Once again it waits for a VP to crack the whip, and teams get raided, and velocity again drops, and executives are again annoyed, and Cisco kicks off another round of layoffs. Not that the problems caused the layoffs, mind you; they were just a regular feature of Cisco life. But I digress.
The simple rule of thumb I learned doing CI/CD at Cisco is that, done right in a large and mature microservices cloud, you spend quite a bit more time writing and maintaining all the different test cases for all the different types of testing than you spend writing the actual production code being tested.
