Big Data requires a specialized approach to CI/CD
As analytical data platforms become mainstream and receive significant funding, enterprises need to onboard sizable data engineering teams and establish efficient delivery processes to ensure these platforms are successful. While the central ideas from traditional continuous delivery and DevOps can be reused for Big Data implementations, analytical data platforms require a specialized approach because of the sheer volume of data they process.
Implementing Big Data and CI/CD at scale since 2010
At Grid Dynamics, we have expertise in building analytical data platforms, data engineering, CI/CD, quality engineering, and DevOps. When we started implementing analytical data platforms for our clients, we applied full automation and continuous delivery best practices from the start. In our data platform development work, each data processing pipeline is treated as a microservice, the same technique we use when developing transactional microservices architectures. We then automate the delivery process for data pipelines, drawing on our expertise in test data management, test automation, and environment management.
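One way to picture the pipeline-as-a-microservice idea is a pipeline packaged as a small, independently deployable unit with its own entry point and pure, testable transformation logic. The sketch below is illustrative only; the function and field names are hypothetical, not taken from any real pipeline.

```python
# A data pipeline packaged like a microservice: one self-contained
# unit with its own entry point, so it can be built, tested, and
# deployed independently of other pipelines.

def extract():
    # Stand-in for reading from a source system.
    return [{"id": 1, "value": 10}, {"id": 2, "value": -3}]

def transform(rows):
    # Business logic isolated in a pure function -- easy to unit test.
    return [r for r in rows if r["value"] >= 0]

def load(rows):
    # Stand-in for writing to a warehouse or data lake.
    return len(rows)

def run():
    return load(transform(extract()))

if __name__ == "__main__":
    print(f"loaded {run()} rows")
```

Because each stage is an ordinary function, the same unit-test and deployment automation used for transactional microservices can be applied to each pipeline in isolation.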
In order to implement a CI/CD pipeline for an analytical data platform, coordination is needed across several skill sets: data engineering, deployment engineering, and quality engineering. It is difficult for one person to have all of these skills, so our approach is to bring all the necessary skills together within the same team. That way, the team can function as a unit and release high-quality, efficient, and reliable data pipelines.
Full automation of delivery process
In order to attain a delivery process that is efficient, high in quality, and reliable, we automate all aspects of data pipeline delivery, including testing, deployment, and release. When applied to Big Data, CI/CD must be handled differently because of its heavy dependency on data, and specifically on test data management. Our approach to test data management in Big Data development includes creating synthetic datasets for unit testing and providing production-like data for integration testing. Our usual recommendation for Big Data projects is to start testing with production-like data as early in the process as possible.
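A minimal sketch of the synthetic-dataset approach for unit testing: generate deterministic fake records with a fixed seed, then assert on a pipeline transformation's output. The schema and value ranges here are invented for illustration, not drawn from any real project.

```python
import random
import string

def make_synthetic_orders(n, seed=42):
    """Generate a synthetic dataset of order records for unit tests.

    A fixed seed keeps the data deterministic, so test runs are
    reproducible. Field names and ranges are illustrative only.
    """
    rng = random.Random(seed)
    return [
        {
            "order_id": i,
            "customer_id": "".join(rng.choices(string.ascii_uppercase, k=6)),
            "amount": round(rng.uniform(1.0, 500.0), 2),
        }
        for i in range(n)
    ]

def total_revenue(orders):
    """A trivial pipeline transformation under test."""
    return round(sum(o["amount"] for o in orders), 2)

# Unit test against the synthetic dataset -- no production data needed.
orders = make_synthetic_orders(100)
assert len(orders) == 100
assert total_revenue(orders) > 0
```

Synthetic data like this covers fast unit tests; the production-like data mentioned above is then reserved for later integration testing.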
Protection of sensitive data
In many cases, data engineers need to work on data pipelines that process sensitive data. On one hand, it is better to provide teams with production-like data, but on the other hand, developers and quality engineers shouldn't have access to sensitive data. In some cases, there are additional requirements related to accessing sensitive data from offshore locations. To address this challenge, we use tokenization and encryption to obfuscate sensitive data before it is given to developers. If additional security is required, we may use fully synthetic datasets created from production data patterns.
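One common way to implement this kind of tokenization is a keyed, deterministic hash over the sensitive fields: the same input always maps to the same token, so joins across tables still work, but the original value cannot be recovered without the key. The sketch below assumes a hypothetical record schema and a placeholder key; in practice the key would come from a secrets manager, not source code.

```python
import hmac
import hashlib

# Illustrative secret; in practice this would live in a secrets manager.
TOKEN_KEY = b"replace-with-managed-secret"

# Hypothetical sensitive fields in a customer record.
SENSITIVE_FIELDS = {"ssn", "email"}

def tokenize(value: str) -> str:
    """Deterministically tokenize a value with keyed HMAC-SHA256."""
    return hmac.new(TOKEN_KEY, value.encode(), hashlib.sha256).hexdigest()[:16]

def obfuscate_record(record: dict) -> dict:
    """Replace sensitive fields with tokens; pass other fields through."""
    return {
        k: tokenize(v) if k in SENSITIVE_FIELDS else v
        for k, v in record.items()
    }

row = {"customer_id": 7, "email": "jane@example.com", "ssn": "123-45-6789"}
safe = obfuscate_record(row)
assert safe["email"] != row["email"]          # sensitive value is hidden
assert obfuscate_record(row) == safe          # deterministic mapping
```

Determinism is the design choice to note: it preserves referential integrity across obfuscated datasets, which plain random masking would break.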
Transparency and visibility
Security and compliance
How it works
Because we value a hands-on approach, we embed our architects and engineers into development and release teams to help with the tasks at hand from the beginning of the project. This ranges from choosing the right mix of roles and skills in a development team, to building a continuous delivery platform and automating the deployment and change management process. We have a deep understanding of enterprise change management, production operations and security policies.
Our approach is not to disrupt existing