Current

                                              ┌─────────────┐
                                              │             │
┌─────────┐                                   │             │
│         │               ┌─────────────┐     │             │
│  source │──────────────▶│ BatcherNode │────▶│  Persistent │
│         │               └─────────────┘     │    store    │
└─────────┘                                   │             │
                                              │             │
                                              └─────────────┘
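
The single-node flow in the diagram can be sketched with a standard-library channel. This is a minimal sketch under stated assumptions, not the project's implementation: `Event`, `InMemoryStore`, and `run_pipeline` are hypothetical names, the store is an in-memory stand-in for the persistent store, and batches are flushed by size only.

```rust
use std::sync::mpsc;
use std::thread;

/// Hypothetical event type; the real project may use a different schema.
#[derive(Debug, Clone, PartialEq)]
pub struct Event {
    pub name: String,
    pub count: u32,
}

/// Stand-in for the persistent store: it only records the batches it receives.
#[derive(Default)]
pub struct InMemoryStore {
    pub batches: Vec<Vec<Event>>,
}

impl InMemoryStore {
    pub fn write(&mut self, batch: Vec<Event>) {
        self.batches.push(batch);
    }
}

/// Runs source -> batcher -> store once and returns the store.
/// The source runs on its own thread; the batcher flushes every `batch_size`
/// events and once more when the source closes the channel.
pub fn run_pipeline(source: Vec<Event>, batch_size: usize) -> InMemoryStore {
    assert!(batch_size > 0, "batch_size must be positive");
    let (tx, rx) = mpsc::channel::<Event>();

    let producer = thread::spawn(move || {
        for event in source {
            // `send` only fails if the receiver was dropped.
            if tx.send(event).is_err() {
                break;
            }
        }
        // `tx` is dropped here, which closes the channel.
    });

    let mut store = InMemoryStore::default();
    let mut buffer = Vec::with_capacity(batch_size);
    for event in rx {
        buffer.push(event);
        if buffer.len() == batch_size {
            // Hand the full buffer to the store and start a fresh one.
            store.write(std::mem::replace(&mut buffer, Vec::with_capacity(batch_size)));
        }
    }
    // Flush the final partial batch, if any.
    if !buffer.is_empty() {
        store.write(buffer);
    }
    producer.join().expect("source thread panicked");
    store
}
```

Calling `run_pipeline` with five events and a batch size of 2 yields three batches, the last holding a single event.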

Goal

The goal of this project is to implement certain functionalities defined by the Dataflow paper (the foundation for Apache Beam) to build an Iceberg writer/persis

Functionality (ToDo)

Time based windows (adjustable)
Window rotation
Processing the window based on partitioned values
Handling CDC data (sub-partitioning the batches based on CDC timestamp to maintain order)
Backpressure
Adjusts batch sizes based on processing performance
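
Several of the items above can be sketched together: fixed time-based windows, grouping by partition value, ordering each group by CDC timestamp, and a simple batch-size adjustment. This is an illustrative sketch only. `CdcEvent`, its field names, and all three functions are hypothetical, and the halve-or-grow-by-a-quarter rule is one possible policy rather than the project's.

```rust
use std::collections::BTreeMap;

/// Hypothetical CDC record; field names are assumptions for illustration.
#[derive(Debug, Clone, PartialEq)]
pub struct CdcEvent {
    pub partition_key: String,
    pub event_time_ms: u64, // drives window assignment
    pub cdc_ts_ms: u64,     // change timestamp, drives ordering within a group
    pub payload: String,
}

/// Start of the fixed (tumbling) window that contains `event_time_ms`.
pub fn window_start(event_time_ms: u64, window_size_ms: u64) -> u64 {
    assert!(window_size_ms > 0, "window size must be positive");
    event_time_ms - (event_time_ms % window_size_ms)
}

/// Groups events by (window start, partition key), then sorts each group
/// by CDC timestamp so changes are applied in order.
pub fn window_and_partition(
    events: Vec<CdcEvent>,
    window_size_ms: u64,
) -> BTreeMap<(u64, String), Vec<CdcEvent>> {
    let mut groups: BTreeMap<(u64, String), Vec<CdcEvent>> = BTreeMap::new();
    for e in events {
        let key = (window_start(e.event_time_ms, window_size_ms), e.partition_key.clone());
        groups.entry(key).or_insert_with(Vec::new).push(e);
    }
    for group in groups.values_mut() {
        // Stable sort: events with equal timestamps keep their arrival order.
        group.sort_by_key(|e| e.cdc_ts_ms);
    }
    groups
}

/// Assumed policy: halve the batch size when a batch took longer than the
/// target, otherwise grow it by 25%, always staying within [min, max].
pub fn next_batch_size(current: usize, elapsed_ms: u64, target_ms: u64, min: usize, max: usize) -> usize {
    let proposed = if elapsed_ms > target_ms {
        current / 2
    } else {
        current + current / 4
    };
    proposed.clamp(min, max)
}
```

The window size is a plain parameter, so it stays adjustable. Window rotation would then amount to flushing every group whose window start is older than the current window.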

Benchmark

Sample data looked like {"name": "A", "size": "small", "count": 2}

Total number of events in 1 batch = 10,000

The data was partitioned on count, and the data generator limited the random count values to 10.

Arrow conversion took: 170.850ms
Partitioning took: 34.328ms
Partitioned length: 10
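
The partitioning step of the benchmark can be approximated with the standard library alone. This is a rough sketch under assumptions: `Sample`, `generate`, `partition_by_count`, and `timed_partition` are hypothetical names, the generator is deterministic rather than random, and the Arrow conversion step is not reproduced, so timings from it are not comparable to the figures above.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Mirrors the sample record {"name": "A", "size": "small", "count": 2}.
#[derive(Debug, Clone)]
pub struct Sample {
    pub name: String,
    pub size: String,
    pub count: u32,
}

/// Generates `n` events whose `count` cycles through `0..distinct_counts`.
/// Deterministic stand-in for the random data generator.
pub fn generate(n: usize, distinct_counts: u32) -> Vec<Sample> {
    assert!(distinct_counts > 0, "need at least one distinct count");
    (0..n)
        .map(|i| Sample {
            name: "A".to_string(),
            size: "small".to_string(),
            count: (i as u32) % distinct_counts,
        })
        .collect()
}

/// Groups events by their `count` value.
pub fn partition_by_count(events: Vec<Sample>) -> HashMap<u32, Vec<Sample>> {
    let mut partitions: HashMap<u32, Vec<Sample>> = HashMap::new();
    for e in events {
        partitions.entry(e.count).or_default().push(e);
    }
    partitions
}

/// Partitions the batch and reports how long the partitioning took.
pub fn timed_partition(events: Vec<Sample>) -> (HashMap<u32, Vec<Sample>>, Duration) {
    let start = Instant::now();
    let partitions = partition_by_count(events);
    (partitions, start.elapsed())
}
```

With `generate(10_000, 10)` as input, `timed_partition` returns 10 partitions, matching the partitioned length reported above. The elapsed time depends on the machine and build profile.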

Ideally


                          ┌─────────────┐     ┌─────────────┐
                     ┌───▶│ BatcherNode │────▶│             │
┌─────────┐          │    └─────────────┘     │             │
│         │          │    ┌─────────────┐     │             │
│  source │─────────▶│───▶│ BatcherNode │────▶│  Persistent │
│         │          │    └─────────────┘     │    store    │
└─────────┘          │    ┌─────────────┐     │             │
                     └───▶│ BatcherNode │────▶│             │
                          └─────────────┘     └─────────────┘
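
One way to get the fan-out in this diagram is to route each event to a BatcherNode by hashing its partition key, so events with the same key always reach the same node in arrival order. This is a sketch of that assumption, not a description of the project: `node_for_key` and `fan_out` are hypothetical, and plain vectors stand in for the per-node input queues.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Picks a node index for a key by hashing it modulo the node count.
pub fn node_for_key<K: Hash>(key: &K, node_count: usize) -> usize {
    assert!(node_count > 0, "need at least one node");
    let mut hasher = DefaultHasher::new();
    key.hash(&mut hasher);
    (hasher.finish() % node_count as u64) as usize
}

/// Splits a stream of events into one queue per node.
/// `key_of` extracts the partition key from an event.
pub fn fan_out<T, K: Hash>(
    events: Vec<T>,
    node_count: usize,
    key_of: impl Fn(&T) -> K,
) -> Vec<Vec<T>> {
    let mut queues: Vec<Vec<T>> = (0..node_count).map(|_| Vec::new()).collect();
    for e in events {
        let idx = node_for_key(&key_of(&e), node_count);
        // Same key -> same queue, so per-key order is preserved.
        queues[idx].push(e);
    }
    queues
}
```

Hash routing keeps per-key ordering without coordination between nodes, but a skewed key distribution can overload one node. A round-robin split would balance load at the cost of per-key ordering.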

Ref

Dataflow paper: https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/43864.pdf


Shreyas220/Dataflow

Folders and files

Latest commit

History

Repository files navigation

Goal

Functioanlity (ToDo)

Benchmark

Ref

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages