In-Stream Processing Service Blueprint

What is In-Stream Processing?

In-Stream Processing is a powerful new technology that can scan enormous volumes of data coming from sensors, credit card swipes and web clicks, and find patterns of behavior that lead to actionable insights nearly instantaneously. Companies across all industries are exploring new ways of processing information in real time, and In-Stream Processing is emerging as the leading framework to enable a wide range of real-time applications.

What do organizations want from In-Stream Processing infrastructure?

Customers want the ease of getting started with developer-friendly, inexpensive infrastructure that can rapidly scale for massive production workloads as the system acquires more data sources, applications and customers, all in the same platform. We are constantly asked the same two questions:

1) What is the simplest, cheapest and fastest way to get my team up and running?

2) How do I design massively scalable and highly available production infrastructure?

The open source community has been at the forefront of innovation in the In-Stream Processing space, with dozens of companies and thousands of developers contributing to the ever-growing array of technologies. The use of open source components assures the lowest total cost of ownership, the widest access to the developer market and the least amount of vendor lock-in.

History of In-Stream Processing Projects

Leading cloud providers have been working hard to integrate In-Stream Processing technologies into their offerings. Customers want the cloud for fast developer access to a small infrastructure footprint and for ease of scaling to production workloads.

As a part of our Blueprint program, Grid Dynamics provides a well-documented, tried-and-true reference architecture and reference implementation for an In-Stream Processing Service that is built with 100% open source components and runs on any cloud platform, absolutely free.

What is in our In-Stream Processing Service Blueprint?

We’ve taken lessons learned, best practices and proven configurations from our experience in implementing large-scale In-Stream Processing systems for many customers and created a single reference architecture: a complete end-to-end blueprint for an In-Stream Processing Service. It consists of 100% open source components, runs on any public cloud and scales from developer sandboxes that can be spun up at the click of a button to always-on production configurations.

The use of our blueprint is completely free. The blueprint’s reference architecture is well-documented in a series of blog posts available at blog.griddynamics.com/topic/big-data. We are also in the process of releasing a reference implementation, soon to be available as an open source binding for deployment of the complete blueprint on Amazon AWS with the push of a button.

Customer Success Stories

Ad agency

The business opportunity: Integrate new data sources (web and browsing behavior) provided by the agency's partners. Properly integrated, this data would allow the agency to put forward offers aligned with its users' interests.

The task: Incorporate new data and identify patterns to improve understanding of the customer's users and increase click-through rate.

Outcome: Designed and implemented a Hadoop-based platform for storing billions of profiles. Conducted an analysis of search patterns in browsing histories to identify users with a high probability of converting. Built a facility for on-demand data analysis. Created a set of reports for downstream consumers.

Leading telecom provider

The business opportunity: Operational reporting was both time-consuming and prone to errors due to the high number of distributed data sources. Substantial employee effort, the high risk of error and the significant time lag between when the data arrived and when reports were produced negatively impacted the business.

The task: Create a timely and accurate reporting system to provide insight for improved business operations.

Outcome: Designed and implemented a Hadoop/Hive-based data warehouse for historical and ongoing call records. Cleaned up, enriched, and prepared the data for exploration and visualization.

Digital ad agency

The business opportunity: One of the biggest headaches for advertisers on the Internet is fake traffic. A robot "sees" an impression but definitely won't buy anything, wasting advertisers' money. Timely identification of these fraudulent impressions significantly increases ad efficiency.

The task: Architect an In-Stream Processing Engine in order to detect and eliminate false impressions in real time.

Outcome: We designed and deployed In-Stream Processing infrastructure and then implemented models, designed by the customer's Data Science team, at scale. The resulting solution handled millions of events per second.

Data management company

The business opportunity: Collect online information from a broad partnership network, then correlate and analyze it to build user profiles that allow retailers, advertisers and other digital companies to deliver relevant customer experiences.

The task: Rearchitect a private data center for integration with the cloud, in order to enable integration, improve scalability and future-proof the platform.

Outcome: The data processing pipeline was split into several phases. The first phase, responsible for the initial data collection from different sources and for integration, was moved to the cloud infrastructure. This project has cloud-enabled the future phases of the data processing pipeline.

Blueprint Goals

The blueprint provides engineering teams with pre-made, self-deployable cloud infrastructure for developing and testing real-time in-stream processing applications. At the same time, it enables operations teams to deploy, operate and grow enterprise-grade production infrastructure. Our design goals for the blueprint are as follows:

  • Pre-integrate event queueing, stream processing, data storage, insight delivery and result visualization into a single platform
  • Support high throughput (up to 100,000 events/second), low-latency (under 60 seconds from event to insight) stream processing
  • Fault tolerant, highly available, dynamically scalable computational platform
  • Programmable in Spark Streaming API in Java or Scala
  • Support algorithms supported by Spark Streaming, including Spark SQL Streaming and machine learning
  • Store up to 30 days of raw data and insights
  • Support in-stream, batch and on-demand insight delivery
  • Composed of 100% free, open source software supported by an active community
  • Cloud-ready and portable across public and private clouds
  • Developer-friendly
  • Production-ready
  • Proven in mission-critical implementations
  • Interoperable with any big data platform
  • Extendable to support new use cases and unique requirements
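
As a rough illustration of what the throughput and retention goals above imply for storage, the sketch below multiplies the peak rate (100,000 events/second) by the retention window (30 days). The average event size of 1,000 bytes is an assumption made purely for illustration and is not a figure from the blueprint. The class and method names are likewise hypothetical. The result is an upper bound for raw events only: it assumes the peak rate is sustained around the clock and ignores insights, replication and compression.

```java
public class RetentionSizing {
    static final long SECONDS_PER_DAY = 86_400L;

    // Total events retained if the given rate is sustained for the whole window.
    static long eventsRetained(long eventsPerSecond, int retentionDays) {
        return eventsPerSecond * SECONDS_PER_DAY * retentionDays;
    }

    // Raw bytes before replication or compression.
    static long rawBytes(long events, long avgEventBytes) {
        return events * avgEventBytes;
    }

    public static void main(String[] args) {
        long peakRate = 100_000L;     // events/second, from the design goals
        int retentionDays = 30;       // days of raw data, from the design goals
        long avgEventBytes = 1_000L;  // ASSUMPTION: illustrative event size only

        long events = eventsRetained(peakRate, retentionDays);
        long bytes = rawBytes(events, avgEventBytes);

        System.out.println("Events retained: " + events);
        System.out.println("Raw storage (decimal TB): " + bytes / 1e12);
    }
}
```

Under these assumptions the window holds about 259 billion events, or roughly 259 TB of raw data. Actual sizing depends on the real event size, the average rather than peak rate, and the replication settings of the chosen data store.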

Read the Blueprint

Post 4. In-Stream Processing Service Blueprint

Post 3. Overview of In-Stream Processing Solutions On the Market

Post 2. How In-Stream Processing Works

Post 1. What is In-Stream Processing?

Subscribe to Our Blog

Professional Services

Grid Dynamics is here to help with architecture, design, implementation or operational support of In-Stream Processing platforms.

Our services cover the full lifecycle of In-Stream Processing platforms, including:

  • Business needs analysis and recommendation on the technology stack selection,
  • Blueprint customization based on specific business needs,
  • Recommendations on selection of a cloud provider, including cost estimation for the required infrastructure,
  • Design of a continuous integration and continuous deployment (CI/CD) pipeline,
  • Design for multi-datacenter deployment and disaster recovery,
  • Development of dashboards for visualization, monitoring and reporting of In-Stream Processing results,
  • Recommendations for a testing strategy of In-Stream Processing applications (including test data management, workload modeling and test automation),
  • Complete implementation of the In-Stream Processing Service (including implementation of custom business logic, automated QA and deployment to the selected cloud),
  • Integration of the In-Stream Processing Service into customer infrastructure (including upstream and downstream systems as well as monitoring and alerting services); and
  • Architectural supervision in instances of self-implementation of our Blueprint.

Our professional services also extend to production support for the open source components used within the Blueprint, namely Apache Kafka, Apache Spark Streaming, Redis and Apache Cassandra.

Contact Us to Learn More