A data pipeline is a method for shipping data efficiently to the various services in your system. It also provides a framework for stream processing, enabling things like:
- Real-time dashboards
- Iterative recommender systems
- Data warehouses without complex ETL processes
The producer-consumer model
The idea is that you have a set of producer and consumer applications: producers publish data to the pipeline, and consumers subscribe to it. This means you can integrate the pipeline with multiple systems simply by adding more consumer applications.
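A minimal in-memory sketch of the model above. The `Pipeline`, `publish`, and `subscribe` names are illustrative, not a real messaging API; the point is that each subscriber gets its own copy of every published message, so adding a new integration is just adding another consumer.

```python
from collections import defaultdict, deque

class Pipeline:
    """Toy producer/consumer pipeline (illustrative names, not a real API)."""

    def __init__(self):
        # topic -> list of per-consumer queues
        self.subscribers = defaultdict(list)

    def subscribe(self, topic):
        """Register a new consumer on a topic; returns its private queue."""
        queue = deque()
        self.subscribers[topic].append(queue)
        return queue

    def publish(self, topic, message):
        """Fan out: every subscribed consumer receives its own copy."""
        for queue in self.subscribers[topic]:
            queue.append(message)

pipeline = Pipeline()
dashboard = pipeline.subscribe("clicks")   # e.g. a real-time dashboard
warehouse = pipeline.subscribe("clicks")   # e.g. a data warehouse loader
pipeline.publish("clicks", {"user": 1, "page": "/home"})
event = dashboard.popleft()                # both consumers see the same event
```

Because producers never talk to consumers directly, either side can be added or removed without changing the other.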
The pipeline also supports a time-to-live (TTL) parameter, which allows data to persist in the pipeline for anywhere from days to years. As a result, you can replay messages as needed, which is particularly useful when your consumer application logic changes.
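Retention and replay can be sketched as an append-only log with timestamps: entries older than the TTL are dropped, and a consumer replays history by reading from an offset it tracks itself. The `Log` class and explicit `now` arguments below are hypothetical, kept deterministic for illustration.

```python
import time

class Log:
    """Toy append-only log with TTL retention and offset-based replay."""

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self.entries = []  # list of (timestamp, message), append-only

    def append(self, message, now=None):
        self.entries.append((now if now is not None else time.time(), message))

    def expire(self, now=None):
        """Drop entries older than the TTL (real systems do this in the background)."""
        now = now if now is not None else time.time()
        self.entries = [(t, m) for t, m in self.entries if now - t < self.ttl]

    def replay(self, from_offset=0):
        """Re-read retained messages; consumers track their own offsets."""
        return [m for _, m in self.entries[from_offset:]]

log = Log(ttl_seconds=3600)
log.append("event-1", now=1000.0)
log.append("event-2", now=2000.0)
log.expire(now=5000.0)        # event-1 is now older than the TTL and is dropped
replayed = log.replay()       # a redeployed consumer re-reads what remains
```

This is why a changed consumer can reprocess history: the pipeline keeps the messages, and only the consumer's read position moves.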
In practice, the pipeline is a high-throughput distributed messaging system such as Apache Kafka or Amazon Kinesis. These systems are built to handle an enormous volume of incoming and outgoing messages.