Data pipeline and stream processing

A data pipeline is method for shipping data efficiently to various services throughout your system. It also provides a framework that supports stream processing, enable things like:

Real-time dashboards
Iterative recommender systems
Data warehouses without complex ETL processes

The producer consumer model

The idea is that you have a set of producer and consumer applications, where the producer will publish data to your pipeline and the consumer will subscribe from your pipeline. That means that you can integrate your pipeline with multiple systems by creating multiple consumer applications.

graph LR A(Mobile) -.publish.-> E((Pipeline)) B(Platform) -.publish.-> E((Pipeline)) C(CRM) -.publish.-> E((Pipeline)) D(Salesforce) -.publish.-> E((Pipeline)) E -.subscribe.-> G(Data Warehouse) E -.subscribe.-> I(Real-time Dashboards) E -.subscribe.-> J(Recommender Systems) E -.subscribe.-> K(Third-party Integration)

It also support a time to live (TTL) parameter which allows you to persist data in the pipeline for days to years. As a result you can replay messages as needed. This is particularly useful if your consumer application logic changes.

The pipeline is essentially a high-throughput distributed messaging system such as Apache Kafka or Amazon Kinesis. They are built to support a huge number of incoming and outgoing messages.

Mike Trienis

Data pipeline and stream processing

The producer consumer model

You might also enjoy (View all posts)