Reliable tests require robust datasets
We frequently see test automation efforts fail to provide adequate quality assurance because of inadequate test data. Low-quality tests often produce false positives or false negatives, leading to low confidence in test results and significant time spent on analysis and maintenance. One of the primary reasons behind subpar tests is a lack of attention to test datasets. Uncontrolled changes to test data, hardcoded identifiers in test code, and test cases skipped because production data is unavailable all contribute to unreliable tests and low confidence in test results. To solve these problems, teams need to invest in robust test data management techniques and tools.
Experts in test data management
Since the inception of Grid Dynamics in 2006, we have never employed manual test engineers; instead, we have focused exclusively on test automation. We quickly became familiar with the "flaky tests" problem and began solving it successfully with test data management and service virtualization. For more than 10 years, we have helped multi-billion-dollar enterprises, whose technology departments comprise thousands of developers and testers, solve these problems in a lean and agile way. We bring our expertise and blueprints to analyze application portfolios, choose early candidates for adoption, and scale test data management techniques.
Synthetic data generation
Synthetic data generation is the most predictable approach to test data management. Datasets are created in a reliable, controlled fashion when tests are implemented, and change only when needed. The synthetic data is loaded into the system under test on demand during every test execution, simplifying environment management.
An extra benefit of synthetic data is that quality engineers can model corner cases that may not appear in the production data.
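As a minimal sketch of this idea, the generator below produces a deterministic synthetic dataset (same seed, same data on every run) and appends a few corner cases of the kind that rarely show up in production. All names here (`Order`, `generate_orders`) are illustrative, not part of any specific tool:

```python
import random
from dataclasses import dataclass

@dataclass
class Order:
    order_id: str
    customer_name: str
    amount_cents: int

def generate_orders(count: int, seed: int = 42) -> list[Order]:
    """Deterministically generate synthetic orders: the same seed always
    yields the same dataset, so tests see controlled, repeatable data."""
    rng = random.Random(seed)
    orders = [
        Order(f"ORD-{i:05d}", f"Customer {i}", rng.randint(100, 500_000))
        for i in range(count)
    ]
    # Corner cases modeled explicitly because production data may lack them:
    orders.append(Order("ORD-ZERO", "Zero Amount", 0))               # boundary value
    orders.append(Order("ORD-UNI", "Zoë Müller-Łukasiewicz", 999))   # non-ASCII name
    orders.append(Order("ORD-MAX", "X" * 255, 2**31 - 1))            # max-length / max-int
    return orders

orders = generate_orders(10)
```

Because generation is seeded, the dataset can be reloaded into the system under test before every execution without drift between runs.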
Production data curation
Testing with production data has its own benefits: it may uncover defects that synthetic datasets miss. Using production data for testing, however, requires significant curation: sensitive data must be tokenized or obfuscated, the dataset reduced to a manageable size, and the copies in test environments refreshed periodically.
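One common curation technique is deterministic tokenization: each sensitive value is replaced by a keyed hash so the original cannot be recovered, yet the same input always maps to the same token, preserving joins across tables. A minimal sketch (the key and token format are hypothetical placeholders):

```python
import hashlib
import hmac

# Hypothetical secret; in practice this would come from a managed secret store.
TOKENIZATION_KEY = b"replace-with-a-managed-secret"

def tokenize_email(email: str) -> str:
    """Replace an email address with a deterministic, irreversible token.
    Identical inputs yield identical tokens, so foreign-key relationships
    in the curated dataset still line up after obfuscation."""
    digest = hmac.new(TOKENIZATION_KEY, email.lower().encode(), hashlib.sha256)
    return f"user-{digest.hexdigest()[:12]}@example.invalid"
```

Running the same curation job twice over a production snapshot therefore produces consistent, linkable, de-identified data.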
Additionally, test cases need to avoid relying on hardcoded identifiers to locate the right combinations of data items. Instead, each test case should implement a mechanism to find data items that satisfy its requirements.
Unified data interface
To simplify working with both synthetic and production data, and to enable tests to be reused against both datasets, a unified data retrieval interface should be implemented. We call this interface "data pools". Through it, a test requests the data items it needs, and the matching data is returned. When the system under test works with synthetic data, the interface retrieves synthetic data or generates it on the fly.
When it works with a production dataset, the interface queries the production data and fails gracefully when nothing matches, clearly stating that the failure is due to missing data rather than a defect.
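The pattern above can be sketched in a few lines of Python. Both pool classes expose the same `acquire(criteria)` method, a test describes the data it needs instead of hardcoding an identifier, and a distinct `TestDataNotFound` exception separates "data is missing" from "the product is defective". The class and method names are illustrative assumptions, not a prescribed API:

```python
from typing import Callable, Iterable

Item = dict
Criteria = Callable[[Item], bool]

class TestDataNotFound(Exception):
    """Signals that no suitable test data exists, not a product defect,
    so test reports can distinguish the two failure modes."""

class SyntheticPool:
    """Data pool backed by a generator: items are created on the fly."""
    def __init__(self, generate: Callable[[], Iterable[Item]]):
        self._generate = generate

    def acquire(self, criteria: Criteria) -> Item:
        for item in self._generate():
            if criteria(item):
                return item
        raise TestDataNotFound("synthetic generator produced no matching item")

class ProductionPool:
    """Data pool backed by a curated production snapshot (here, a list)."""
    def __init__(self, snapshot: list[Item]):
        self._snapshot = snapshot

    def acquire(self, criteria: Criteria) -> Item:
        for item in self._snapshot:
            if criteria(item):
                return item
        raise TestDataNotFound("no production record satisfies the criteria")

# A test asks for "a gold-tier customer" instead of hardcoding customer #7:
pool = ProductionPool([{"id": 7, "tier": "gold"}, {"id": 8, "tier": "basic"}])
customer = pool.acquire(lambda c: c["tier"] == "gold")
```

Because the test only depends on `acquire`, the same test code runs unchanged against either pool, which is the point of the unified interface.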
How it works
We prefer a hands-on approach in our engagements from day one. More often than not, the clients we engage with have already implemented some level of test automation. In that case, our architects and engineers join the most challenging applications and services, starting with a review of the existing test suite implementations, the data in test datasets, and the integrations.
After we understand the current situation in depth, we suggest the improvements that bring the most value with the least effort in the shortest period of time. Once the bleeding has stopped, we move on to long-term improvements in process and tooling, adopt modern techniques across the board, and educate client teams to spread knowledge and know-how throughout the organization.