The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing
A single unified model
- Decomposes the data pipeline along four dimensions, providing clarity, composability, and flexibility:
- What results are being computed.
- Where in event time they are being computed.
- When in processing time they are materialized.
- How earlier results relate to later refinements.
- Separates the logical notion of data processing from the underlying physical implementation, allowing the choice of batch, micro-batch, or streaming engine to be made based on correctness, latency, and cost trade-offs.
- A windowing model which supports unaligned event-time windows, and a simple API for their creation and use.
- A triggering model that binds the output times of results to runtime characteristics of the pipeline, with a powerful and flexible declarative API for describing desired triggering semantics.
- An incremental processing model that integrates retractions and updates into the windowing and triggering models described above.
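The four dimensions can be illustrated with a minimal, self-contained Python sketch (not the paper's API; the input triples and the per-record trigger are illustrative assumptions): fixed event-time windows answer "where", the per-window sum is "what", emitting a pane after every element stands in for "when", and accumulating panes are one answer to "how".

```python
from collections import defaultdict

# Hypothetical out-of-order input: (value, event_time, processing_time).
events = [
    (5, 12, 1),   # arrives first in processing time
    (7, 1, 2),    # datum from an earlier event-time window arrives later
    (3, 13, 3),
    (8, 2, 4),
]

WINDOW = 10  # fixed 10-unit event-time windows ("where")

def window_of(event_time):
    start = (event_time // WINDOW) * WINDOW
    return (start, start + WINDOW)

# "What": a per-window sum. "When": emit a pane after every element
# (a per-record trigger). "How": accumulating -- each pane reflects all
# data seen so far for its window.
sums = defaultdict(int)
panes = []
for value, event_time, _proc_time in sorted(events, key=lambda e: e[2]):
    w = window_of(event_time)
    sums[w] += value
    panes.append((w, sums[w]))

print(panes)
# -> [((10, 20), 5), ((0, 10), 7), ((10, 20), 8), ((0, 10), 15)]
```

Note how the second pane for each window repeats and extends the first: that is the "how" axis, which the refinement modes below make explicit.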
How the windowing model can be broken apart:
- Elements are initially assigned to a default global window, covering all of event time, providing semantics that match the defaults in the standard batch model.
- Since windows are associated directly with the elements to which they belong, window assignment can happen anywhere in the pipeline before grouping is applied. This is important because the grouping operation may be buried downstream inside a composite transformation.
- The implementation of AssignWindows then assigns each element to one or more windows, creating a copy of the element for each window it belongs to (e.g. sliding windows place one element in several overlapping windows).
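A sketch of sliding-window assignment (a hypothetical helper, not the paper's actual AssignWindows code) shows why one element may be copied into several windows:

```python
def assign_sliding_windows(event_time, size, period):
    """Return every (start, end) sliding window containing event_time,
    for windows of the given size that start every `period` units.
    Illustrative sketch; names and signature are assumptions."""
    windows = []
    # Latest window start at or before the timestamp, aligned to the period.
    start = (event_time // period) * period
    # Walk backwards while the window still covers the timestamp.
    while start > event_time - size:
        windows.append((start, start + size))
        start -= period
    return windows

# A size-2 window sliding every 1 unit: each element lands in 2 windows.
print(assign_sliding_windows(event_time=5, size=2, period=1))
# -> [(5, 7), (4, 6)]
```

Because each copy carries its window with it, the downstream grouping step can group by (key, window) without knowing how assignment was done.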
Triggers & Incremental Processing
Triggers are complementary to the windowing model, in that they each affect system behaviour along a different axis of time:
Windowing determines where in event time data are grouped together for processing.
Triggering determines when in processing time the results of groupings are emitted as panes.
The trigger system also provides a way to control how multiple panes for the same window relate to each other.
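The paper names three refinement modes for successive panes of one window: discarding, accumulating, and accumulating & retracting. A minimal sketch with illustrative arrival batches and a sum aggregation (not the paper's API):

```python
def panes(firings, mode):
    """Emit one pane per trigger firing for a single window's sum,
    under one of the three refinement modes. Sketch only."""
    out, total = [], 0
    for batch in firings:
        delta = sum(batch)
        if mode == "discarding":
            out.append(delta)          # each pane: only the new data
        elif mode == "accumulating":
            total += delta
            out.append(total)          # each pane: the running total
        elif mode == "accumulating_retracting":
            if total:
                out.append(-total)     # first retract the previous pane
            total += delta
            out.append(total)          # then emit the new running total
    return out

# Hypothetical arrivals for one window, grouped into three trigger firings.
firings = [[7], [3, 4], [8]]
print(panes(firings, "discarding"))                # -> [7, 7, 8]
print(panes(firings, "accumulating"))              # -> [7, 14, 22]
print(panes(firings, "accumulating_retracting"))   # -> [7, -7, 14, -14, 22]
```

Discarding suits downstream consumers that sum panes themselves; accumulating suits consumers that overwrite prior results; retractions are needed when panes feed a further aggregation that cannot simply overwrite.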
Refinement modes: Discarding, Accumulating, and Accumulating & Retracting

Design principles:
- Never rely on any notion of completeness.
- Be flexible, to accommodate the diversity of known use cases, and those to come in the future.
- Not only make sense, but also add value, in the context of each of the envisioned execution engines.
- Encourage clarity of implementation.
- Support robust analysis of data in the context in which they occurred.