
Google Cloud Dataflow

Analytics · Apache Beam-Based Stream and Batch Processing

Google Cloud Dataflow is a managed data processing service that runs both streaming and batch pipelines through one model. You submit Apache Beam code, and Dataflow manages the worker infrastructure and scaling.

Architecture Diagram

[Diagram: data flow through the pipeline]

Why do you need it?

When streaming events and batch files coexist, teams often end up maintaining separate processing code for each ingestion style. As pipeline scale grows, worker management, retries, and late-data handling quickly become complicated.

Why did this approach emerge?

Older architectures often split batch processing and stream processing into entirely separate systems. As data became more real-time, the demand for one programming model that could handle both flows pushed Beam and Dataflow into the mainstream.

How does it work inside?

Dataflow runs Apache Beam transforms across managed workers. Apache Beam is a programming model that lets you write one pipeline definition for both batch and streaming data — you describe the steps once, and Dataflow decides how to distribute them. It reads from sources such as Pub/Sub or Cloud Storage, applies filtering, windowing, and transformation steps, then writes the results into sinks such as BigQuery. Windowing groups events by time intervals so you can compute aggregates over defined periods (e.g., "orders per minute"), including handling events that arrive late.

What is it often confused with?

Dataflow and BigQuery both matter in data systems, but BigQuery analyzes data after it is loaded, while Dataflow transforms and moves data before or while it is loaded. If stored-data analysis is the main goal, BigQuery is central; if data movement and transformation are the main problem, Dataflow is central.

When should you use it?

Dataflow is a strong fit for ETL, streaming aggregation, and large-scale log-processing pipelines. For small or infrequent data moves, the development overhead can outweigh the benefit.

Real-time event processing · Batch ETL · Windowed aggregation · Data cleansing