Testing in Production – Risks and Rewards

In traditional waterfall software development models, testing was limited to a distinct lifecycle phase that occurred after implementation and before deployment. While this approach has its advantages, such as clearly defined stages and milestones, modern software development methodologies favor continuous testing because of the speed and agility it provides. Many software testing activities have become automated, from unit and smoke tests to API, system, and performance tests.

Increased automation improves testing efficiency and can produce more stable and secure software; however, it does not eliminate the uncertainty of deploying code to production environments, especially for cloud workloads. In addition, the automated test suites that support software development are generally not appropriate for execution in production environments. For these reasons, functional and exploratory testing after deployment to production has become an afterthought. In some IT and QA circles, ‘testing in production’ has become a euphemism for cutting corners, taking unnecessary risks, and behaving irresponsibly with code in a rush to deploy early. This thinking is gradually changing.

Why Test in Production?

For better or for worse, once your code gets deployed and goes live, it becomes more than the sum of its parts. It is now an intricate system of infrastructure, code, microservices, users, and environments whose interactions can cause unpredictable and undesired effects. Testing in production helps to build a level of real-world confidence and lets you see how your code performs ‘in the wild’. With agile DevOps strategies, reverting to a previous stable release if all else fails has never been quicker or easier. This takes much of the fear and risk out of testing in production.

Shift-left testing, or testing early in the pipeline, is highly recommended and necessary, but it is not enough on its own. For rigorous, full-coverage testing, it is increasingly necessary to test in production as well. Code can behave differently with live data than it does in development or staging environments. Testing in production lets QA engineers and stakeholders alike engage in real user monitoring, or RUM. RUM looks at the user requests that come in as well as the responses that are delivered as a result. Because all of this happens in real time, it provides valuable data and insight into how real-world users interact with your product.

Individual transactions can be monitored throughout different layers and levels of complexity within the application. This provides a deeper understanding of the code pathways and pinpoints which pieces of code are being used during user sessions. The data gathered during transaction monitoring in production is extremely useful for unearthing defects and locating the pain points and bottlenecks that slow the application down. Furthermore, it shows which features are most commonly used and how users work with them.
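As a rough illustration of transaction monitoring, the following Python sketch times named transactions and ranks them by average duration. The class name TransactionMonitor, its methods, and the transaction names are hypothetical and chosen only for this example; a real deployment would normally rely on an APM or tracing product rather than an in-memory store.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class TransactionMonitor:
    """Hypothetical in-memory recorder of per-transaction durations."""

    def __init__(self, clock=time.perf_counter):
        # The clock is injectable so the monitor can be exercised deterministically.
        self._clock = clock
        self._durations = defaultdict(list)

    @contextmanager
    def track(self, name):
        """Time the wrapped block and store the duration under `name`."""
        start = self._clock()
        try:
            yield
        finally:
            # Record the duration even if the wrapped block raises.
            self._durations[name].append(self._clock() - start)

    def report(self):
        """Return (name, call_count, mean_seconds) rows, slowest mean first."""
        rows = [
            (name, len(samples), sum(samples) / len(samples))
            for name, samples in self._durations.items()
        ]
        return sorted(rows, key=lambda row: row[2], reverse=True)
```

Wrapping a request handler in `with monitor.track("checkout"):` and reading `monitor.report()` periodically would surface the slowest and most frequently used transactions, which is the kind of bottleneck and usage data described above.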

Advantages of Testing in Production

Introducing production testing as a regular activity in the software development lifecycle extends the continuous feedback loop with live data and real-time, real-world traffic. Testing in production often uncovers bugs that would otherwise be overlooked, and it strengthens recovery and resiliency testing. It also provides comprehensive insight into the overall state of the software product.

Beta programs take live testing to the next level by encouraging early adopters to provide feedback on newly released features and on the overall user experience. Not only does beta testing expose bugs that might have slipped under the radar, but it also often uncovers edge cases and usage paths that the initial software design did not anticipate. The user is always right, and even the most perfect code means little if the target audience finds it difficult, unpleasant, or confusing to use. Acting on user feedback leads to continuous improvement.

Continuous monitoring activities in production, using tools like New Relic, provide a realistic, real-time picture of the overall health of your software. This insight helps minimize the risk of bad deployments breaking the production environment. Being aware of weak points and vulnerabilities can help your team prioritize tasks and improve system performance. This awareness is extremely valuable for DevOps teams, as it allows them to quickly apply the correct disaster recovery protocols and keep downtime to a minimum.

Disadvantages of Testing in Production

While testing in production has several advantages, that does not mean other environments should be overlooked. Testing in production is only effective if all QA personnel are in sync and take ownership and responsibility not only for production, but for staging and development environments as well. In Agile development, testing activities are not limited to the QA team; developers are also encouraged to perform unit and functional tests.

Some types of testing in production can adversely affect the performance of a software product. For example, running an end-to-end load test in production can significantly degrade the user experience. This can manifest as slow load times, malfunctions, or unresponsiveness. To minimize fallout, it is best to run these tests outside business hours, when there are fewer active users. Poor website performance can lead to lost revenue, so it is extremely important for QA engineers to design tests that are quick and resource-efficient.
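One way to enforce the off-hours rule is a small guard that a load-test job checks before it starts. The Python sketch below is illustrative only: the 22:00 to 05:00 window, the limit of 50 active users, and the function names are assumptions made for this example and should be replaced with values drawn from your own traffic data.

```python
from datetime import datetime, time

# Assumed quiet window (22:00 to 05:00 local time); tune it to real traffic patterns.
OFF_PEAK_START = time(22, 0)
OFF_PEAK_END = time(5, 0)


def in_off_peak_window(now, start=OFF_PEAK_START, end=OFF_PEAK_END):
    """Return True if `now` falls inside the window, which may wrap past midnight."""
    current = now.time()
    if start <= end:
        return start <= current < end
    # The window crosses midnight, e.g. 22:00 to 05:00.
    return current >= start or current < end


def should_run_load_test(now, active_users, max_active_users=50):
    """Allow a production load test only off-peak and while few users are active."""
    return in_off_peak_window(now) and active_users <= max_active_users
```

A scheduler could call `should_run_load_test(datetime.now(), active_users)` and skip the run when it returns False. Checking the live user count as well as the clock guards against unusually busy nights.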

One of the riskiest production resilience testing methods was introduced by Netflix in 2010 and is known as Chaos Monkey. It is a tool purposely designed to wreak havoc in the production environment, for example by randomly terminating running service instances. This technique is extremely valuable for verifying that the system as a whole keeps working as expected when individual components fail. It is, however, a double-edged sword and needs to be carried out with extreme caution. Deliberately injecting failures can cause serious problems, including outages that affect real users. When implemented correctly, chaos testing is one of the most effective ways to build confidence in the resiliency of a system and to ensure that it can withstand turbulent conditions in the live environment.
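The idea of deliberate failure injection can be sketched at a much smaller scale than Netflix's tool. The Python example below is not Chaos Monkey and does not terminate instances; it is a hypothetical call-level fault injector with a configurable failure rate and a kill switch, used to check that a caller falls back gracefully. All names in it (FaultInjector, InjectedFailure, fetch_with_fallback) are invented for this illustration.

```python
import random


class InjectedFailure(Exception):
    """Raised in place of a real call to simulate a dependency outage."""


class FaultInjector:
    """Hypothetical call-level fault injector with a kill switch."""

    def __init__(self, failure_rate, enabled=True, rng=None):
        if not 0.0 <= failure_rate <= 1.0:
            raise ValueError("failure_rate must be between 0 and 1")
        self.failure_rate = failure_rate
        self.enabled = enabled  # Set to False to stop injecting immediately.
        self._rng = rng or random.Random()

    def call(self, func, *args, **kwargs):
        """Invoke func, or raise InjectedFailure for a random fraction of calls."""
        if self.enabled and self._rng.random() < self.failure_rate:
            name = getattr(func, "__name__", "call")
            raise InjectedFailure(f"injected failure before {name}")
        return func(*args, **kwargs)


def fetch_with_fallback(injector, fetch, fallback):
    """Return the live result, or the fallback when a failure is injected."""
    try:
        return injector.call(fetch)
    except InjectedFailure:
        return fallback
```

Starting with a very low failure_rate and keeping the enabled flag easy to flip are two simple ways to limit how many real users an experiment can affect, in line with the caution urged above.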