Skip to content
Frank McSherry edited this page Aug 1, 2015 · 13 revisions

The main feature timely dataflow provides, in addition to moving data along dataflow edges, is the indication of progress through the stream of data.

The system alerts each recipient of timestamped data once it can determine that some previously active timestamp will never be seen again. This allows the operator to react with final messages and actions, for example sending aggregates, flushing state, or releasing held resources. The same mechanisms also alert the user as output data emerge from the dataflow graph.

The language of progress

We first need to develop timely dataflow's progress vocabulary.

A timely dataflow graph consists of operators, whose inputs and outputs are connected by channels. Each operator input has one associated channel, but an operator output may have many channels leading from it: multiple other operators may want to see the data it produces. Each group of records transmitted in timely dataflow have an associated timestamp.

Timely dataflow is responsible for producing information that would let each operator answer the question

Will I ever see data with this specific timestamp on this specific input of mine?

To provide this information, timely dataflow imposes some constraints on the structure of the dataflow graph and the behavior of operators.

Constraints

The high-level constraint that timely dataflow imposes is that that there should be no cycles in the timely dataflow graph which permit data on some channel to eventually produce data on that channel with the same timestamp. This will let the system reason that once some timestamp has gone away, it will not be seen again.

This property can be difficult to reason about directly, so instead we will impose simpler, local constraints on the graph structure and operator behavior, and argue that these constraints imply the desired global property.

Graph structure

With one exception, operators may only be constructed from the full set of channels connected to their inputs, and their outputs are not available until this happens.

This constraint by itself would ensure that dataflow graphs are acyclic, as we could totally order the operators by their construction time, and edges would not be able to go backwards along this order.

Acyclic dataflow graphs are a limitation we hope to avoid. To create cycles, timely dataflow permits one specific system-provided operator to be constructed before the channel connecting its input is available. This is the feedback operator, and it ensures that timestamps are strictly advanced for all data passing through.

Operator behavior

Each operator must respond to input messages only with output messages whose timestamps are at least as large.

To maintain the strict advancement the feedback operator introduces, each other operator must never "roll back" timestamps. This property is a bit subtle, in that more complicated operators may have multiple inputs and outputs, not all of which should be thought of as directly connected.

Interfaces

Each operator must implement the Scope<T> trait, parameterized by the timestamp T their inputs and outputs use for data:

pub trait Scope<T: Timestamp> {
    fn name(&self) -> String;
    fn inputs(&self) -> u64;
    fn outputs(&self) -> u64;

    // initialization methods, describing internal/external structure and capabilities.
    fn get_internal_summary(&mut self) -> (Vec<Vec<Antichain<T::Summary>>>, Vec<CountMap<T>>);
    fn set_external_summary(&mut self, Vec<Vec<Antichain<T::Summary>>>, &mut [CountMap<T>]);

    // run-time methods, indicating changes in capabilities.
    fn push_external_progress(&mut self, &mut [CountMap<T>])
    fn pull_internal_progress(&mut self, &mut [CountMap<T>], &mut [CountMap<T>], &mut [CountMap<T>]) -> bool;
}
Clone this wiki locally