Machine learning complexity slows development
Creating predictive models and deploying them to production is an essential function for data science teams, but it is challenging given the many tools, resources and integrations needed to build data pipelines and manage every step of model development. Data scientists must provision and control a wide range of compute and storage resources to build models and put them into production, and they work with different tools at each stage of development that must integrate and run in the same ecosystem. Additionally, DevOps teams need access to the data scientists' work in order to deploy models to production.
The complexity of managing all these tools and processes slows the creation of data pipelines and, ultimately, the delivery of models to production.
Experience as DevOps experts and machine learning pioneers
Since 2010, Grid Dynamics has worked with machine learning to help our customers build predictive models, recommendation engines and visual search. Our expertise in DevOps and in building modern infrastructure platforms has given us the experience needed to piece together the tools, processes and workflows of the data science lifecycle.
SRE practices help us build robust, resilient, business-oriented services, reducing operational cost and increasing ROI. We have developed top talent in the field of machine learning for DevOps, and some of our employees have even written books on the subject.
Kubernetes-based ML pipeline
A Kubernetes-based machine learning platform can deploy and reliably run complex machine learning and deep learning workloads (TensorFlow, PyTorch, Keras, scikit-learn, Chainer, etc.). Kubernetes supports various data processing and machine learning tools either natively or through container orchestration, providing a powerful, universal computation platform without the compromises of a classical data processing platform such as Hadoop.
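As an illustration, a containerized training step can be described declaratively and submitted to Kubernetes as a Job. The sketch below builds such a manifest as a plain Python dict; the job name, image, command and resource figures are hypothetical placeholders, not part of the platform.

```python
# Illustrative sketch: one containerized ML training run expressed as a
# Kubernetes Job manifest. All names and resource figures are hypothetical.
def training_job_manifest(name, image, command, gpus=0):
    """Build a Kubernetes batch/v1 Job manifest (as a plain dict)."""
    resources = {"limits": {"cpu": "4", "memory": "16Gi"}}
    if gpus:
        # GPU scheduling via the NVIDIA device plugin resource name
        resources["limits"]["nvidia.com/gpu"] = str(gpus)
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            "template": {
                "spec": {
                    "containers": [{
                        "name": name,
                        "image": image,
                        "command": command,
                        "resources": resources,
                    }],
                    # failed runs are retried by the Job controller, not restarted in place
                    "restartPolicy": "Never",
                }
            },
            "backoffLimit": 2,
        },
    }

job = training_job_manifest(
    "train-recommender",                   # hypothetical job name
    "registry.example.com/train:latest",   # hypothetical image
    ["python", "train.py"],
    gpus=1,
)
```

In practice the same dict could be serialized to YAML and applied with `kubectl`, or submitted through the Kubernetes API; the point is that each pipeline step is a self-contained, schedulable unit.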
True scalability with built-in resilience means the platform consumes near-zero resources when the ML pipeline is not running. Businesses pay for capacity only when they need it, rather than paying for constantly running HDFS nodes.
Easier data analysis
Data scientists can conveniently analyze data and prototype models using JupyterHub. Combined with a rich set of supporting tools (Data Version Control, Featuretools, etc.), this creates an ideal sandbox for development and experimentation.
Complex workflows and data pipelines
The platform uses native Kubernetes capabilities to run complex workflows and data pipelines based on the Argo framework, and can also be integrated with external schedulers and workflow managers such as Airflow or Google Cloud Composer.
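The common model behind Argo and Airflow is a pipeline expressed as a directed acyclic graph of steps. The minimal sketch below shows that idea in plain Python; the step names and bodies are illustrative, and a real engine would run each step in its own container.

```python
# Minimal sketch of a data pipeline as a DAG of steps, the execution
# model shared by Argo and Airflow. Steps here are illustrative stand-ins.
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

def ingest(ctx):    ctx["raw"] = [1, 2, 3]
def transform(ctx): ctx["features"] = [x * 10 for x in ctx["raw"]]
def train(ctx):     ctx["model"] = sum(ctx["features"])

# Each step maps to the set of steps it depends on.
dag = {"ingest": set(), "transform": {"ingest"}, "train": {"transform"}}
steps = {"ingest": ingest, "transform": transform, "train": train}

def run_pipeline(dag, steps):
    ctx = {}
    # Execute steps in dependency order; a workflow engine would instead
    # schedule each step as a container and pass artifacts between them.
    for name in TopologicalSorter(dag).static_order():
        steps[name](ctx)
    return ctx

result = run_pipeline(dag, steps)
```

A workflow manager adds scheduling, retries and observability on top of this ordering, but the DAG definition itself stays this simple.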
Models are deployed to production and scaled with TensorFlow Serving, Seldon (via Kubeflow) or Clipper. Alternatively, teams can build a custom serving service on Kubernetes and get key quality attributes (resilience, elasticity, high availability and security) out of the box.
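From a client's perspective, a model served this way is just an HTTP endpoint. The sketch below builds a request for TensorFlow Serving's documented REST predict API; the host, model name and feature values are hypothetical.

```python
import json

# Sketch of a client call to a model behind TensorFlow Serving's REST API.
# The URL pattern and "instances" payload follow TF Serving's predict API;
# host, model name and feature values below are hypothetical.
MODEL = "recommender"               # hypothetical model name
HOST = "serving.example.com:8501"   # hypothetical serving endpoint

url = f"http://{HOST}/v1/models/{MODEL}:predict"
payload = json.dumps({"instances": [[0.3, 1.7, 2.5]]})

# In a real client one would then POST the payload, e.g.:
#   resp = requests.post(url, data=payload)
#   predictions = resp.json()["predictions"]
```

Because the endpoint sits behind a standard Kubernetes Service, scaling the model is a matter of adjusting replica counts rather than changing client code.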
Key technical features
Cloud agnostic architecture
Containerized ML application packaging
Supports mainstream ML tools
Onboarding for niche and emerging ML tools
CI/CD integration for full life cycle of the model
Kubernetes native workflows for data pipelines
Scalable model serving
Efficient use of GPU resources
Business continuity and disaster recovery
Solution reference architecture
Our open source solutions
Why you should build this and how to get started
Grid Dynamics' broad experience in big data and cloud architectures has enabled us to create a new approach to building an ML platform.
Unlike a more traditional Hadoop-based big data platform with a data science add-on and a limited selection of supported ML tools, this platform allows teams to easily choose the best tools for the job end-to-end. Its resource management features also make deep learning projects faster and more efficient to run.