
kgen: parallelize #9841


Merged: 7 commits, Jan 1, 2022

Conversation


@benesch benesch commented Jan 1, 2022

Parallelize kgen and use it in cloudbench. More details in individual commits.

The goal here is to remove mzbench (see #9779).

Motivation

  • This PR refactors existing code.

Tips for reviewer

  • The diff is a bit smaller if viewed with whitespace hidden.

Checklist

  • This PR has adequate test coverage / QA involvement has been duly considered.
  • This PR adds a release note for any user-facing behavior changes.

@benesch benesch requested a review from umanwizard January 1, 2022 19:59

benesch commented Jan 1, 2022

I verified the changes to cloudbench—the avro_ingest benchmark still works just fine. I also did a few quick experiments and the new kgen is as fast as kafka-avro-generator, if not a touch faster. They both produce 100MM records in ~25s on my 32-CPU machine.

Comment on lines -119 to -121
let f = self.bytes.get_mut(&p).unwrap();
let mut val = vec![];
f(&mut val);
umanwizard (Contributor):

This is strange; I don't remember at all why I had these take the output vector as a function argument rather than just returning Vec<u8>. Any insight?

benesch (Contributor Author):

Not really! I can only assume you wrote the signature of the generator functions to re-use buffers before you realized that the Avro API made that infeasible.

umanwizard (Contributor):

That sounds plausible.
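For readers following along, the two signatures under discussion can be sketched as follows. This is a minimal illustration; `write_record` and `make_record` are hypothetical stand-ins for kgen's generator closures, not the actual code:

```rust
// Style used in the old code: the caller supplies the output buffer,
// which would allow buffer reuse across records.
fn write_record(out: &mut Vec<u8>) {
    out.extend_from_slice(b"record");
}

// Simpler style: just return the bytes. Equivalent unless the caller
// actually reuses the buffer between calls.
fn make_record() -> Vec<u8> {
    b"record".to_vec()
}

fn main() {
    let mut buf = Vec::new();
    write_record(&mut buf);
    assert_eq!(buf, make_record());
}
```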

- return Err(e.into());
+ Retry::default()
+     .clamp_backoff(Duration::from_secs(1))
+     .retry(|_| match producer.send(rec.take().unwrap()) {
umanwizard (Contributor):

This unwrap will cause a panic when we get a non-"QueueFull" error, because we will retry without resetting rec (in the previous code, we would return the error in that case).

That's probably fine because the behavior in either case is the program failing, but if we do choose to have that behavior, we should make it more explicit (e.g., panic on receiving the error, rather than here).

benesch (Contributor Author):

Oh good point! I'll address.
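A minimal sketch of the hazard and one possible fix, using a hypothetical `SendError` type rather than the real rdkafka API: on a retryable queue-full error the record is handed back and restored, while any other error fails loudly instead of panicking later on the `unwrap`:

```rust
#[derive(Debug)]
enum SendError<T> {
    // Queue full: the record is handed back, so retrying is safe.
    QueueFull(T),
    // Any other error: the record is gone; retrying would hit the
    // `unwrap` panic discussed above, so fail loudly instead.
    Fatal,
}

fn send_with_retry<T>(
    mut rec: Option<T>,
    mut send: impl FnMut(T) -> Result<(), SendError<T>>,
) {
    loop {
        match send(rec.take().expect("record already consumed")) {
            Ok(()) => return,
            Err(SendError::QueueFull(r)) => rec = Some(r), // restore, then retry
            Err(SendError::Fatal) => panic!("fatal error while sending record"),
        }
    }
}

fn main() {
    let mut attempts = 0;
    send_with_retry(Some("record"), |r| {
        attempts += 1;
        if attempts < 3 {
            Err(SendError::QueueFull(r)) // simulate a full producer queue
        } else {
            Ok(())
        }
    });
    assert_eq!(attempts, 3);
}
```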


use futures::{ready, Stream, StreamExt};
use pin_project::pin_project;
use tokio::io::{AsyncRead, ReadBuf};
- use tokio::time::Duration;
+ use tokio::time::{self, Duration, Instant, Sleep};

// TODO(benesch): remove this if the `duration_constants` feature stabilizes.
umanwizard (Contributor):

benesch (Contributor Author):

zomg!

benesch (Contributor Author):

I actually went to see if duration_constants was stabilized, saw that it hadn't been, and got sad. Did not think to see if Duration::MAX had escaped separately! Tyvm!
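For reference, a quick illustration that `Duration::MAX` is usable on stable Rust without the `duration_constants` feature gate:

```rust
use std::time::Duration;

fn main() {
    // `Duration::MAX` is stable on its own (since Rust 1.53), even though
    // the broader `duration_constants` feature never stabilized.
    assert_eq!(Duration::MAX.as_secs(), u64::MAX);
    assert!(Duration::MAX > Duration::from_secs(u64::MAX - 1));
}
```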

benesch added a commit to benesch/materialize that referenced this pull request on Jan 1, 2022:

mzbench has been obsoleted by the feature benchmark framework (in
test/feature-benchmark) and the cloudbench tool (bin/cloudbench).  The
kafka-avro-generator tool has been obsoleted by parallelizing kgen
directly (MaterializeInc#9841). So this commit removes mzbench.

To expound on the rationale for removing mzbench:

  * mzbench configurations require an unmaintainable duplication of
    mzcompose.yml files. Each mzbench configuration contains 300+ lines of
    nearly identical definitions. There was talk of improving this (see
    MaterializeInc#6676), but the plans never came to fruition.

  * The interplay between mzbench and mzcompose is unnecessarily
    delicate. mzbench expects a composition with workflows named just
    so, and then parses their output. This makes it very difficult to
    refactor the underlying compositions, since you don't know if you're
    breaking the contract with mzbench. I think most of mzbench's
    features could be recreated much more simply with an e.g.
    `--num-trials` parameter to mzcompose.

  * mzbench introduced quite a bit of complexity by trying to be both a demo
    of using Materialize to power a real-time dashboard [0] and a
    benchmarking framework. Experience suggests that this results in a
    tool that is a suboptimal dashboard and a suboptimal benchmarking
    framework. Better to have two separate tools optimized for their
    specific purpose.

The new feature benchmarking framework resolves the above concerns. It
is only focused on being a benchmarking framework and does not suffer
from the code duplication problem.

[0]: https://github.com/MaterializeInc/materialize/blob/45586f38a/doc/developer/mzbench.md#worker-balance-visualization
Having a separate `RetryStream` cleanly separates the specification of a
retry policy from the state required to execute that policy. A
forthcoming commit will add a synchronous retry API which makes the
separation of policy from implementation more important.

This commit makes the new `RetryStream` an internal implementation
detail rather than part of the public API, as it doesn't seem useful
outside of `Retry::retry` and `RetryReader`. If we *do* discover a use
for it, it's easy to slap a `pub` on the new `RetryStream` type down the
road.
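The policy/state separation described here can be sketched roughly as follows (illustrative names only, not the actual Materialize `Retry` API): the policy is an immutable value, and each retry loop spins up its own fresh state from it.

```rust
use std::time::Duration;

// The retry *policy*: immutable, cheap to copy, reusable.
#[derive(Clone, Copy)]
struct Retry {
    initial_backoff: Duration,
    clamp: Duration,
}

// The per-loop *state*: attempt counter and current backoff.
struct RetryState {
    policy: Retry,
    attempt: usize,
    next_backoff: Duration,
}

impl Retry {
    // Each call produces fresh state, so one policy value can drive
    // many independent retry loops.
    fn start(&self) -> RetryState {
        RetryState {
            policy: *self,
            attempt: 0,
            next_backoff: self.initial_backoff,
        }
    }
}

impl RetryState {
    // Returns the backoff to sleep before the next attempt,
    // doubling each time but never exceeding the clamp.
    fn next(&mut self) -> Duration {
        self.attempt += 1;
        let b = self.next_backoff.min(self.policy.clamp);
        self.next_backoff *= 2;
        b
    }
}

fn main() {
    let policy = Retry {
        initial_backoff: Duration::from_millis(10),
        clamp: Duration::from_millis(25),
    };
    let mut state = policy.start();
    assert_eq!(state.next(), Duration::from_millis(10));
    assert_eq!(state.next(), Duration::from_millis(20));
    assert_eq!(state.next(), Duration::from_millis(25)); // clamped
    assert_eq!(state.attempt, 3);
}
```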

To make space for a synchronous `Retry::retry` method.

This operates identically to `Retry::retry_async` but uses
`std::thread::sleep` to wait rather than Tokio timers.
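A hedged sketch of what such a synchronous retry loop might look like (the `retry_sync` function and its parameters are illustrative, not the actual `Retry::retry` signature):

```rust
use std::thread;
use std::time::Duration;

// Synchronous retry with clamped exponential backoff, waiting via
// `std::thread::sleep` rather than Tokio timers.
fn retry_sync<T, E>(
    max_tries: usize,
    clamp: Duration,
    mut f: impl FnMut() -> Result<T, E>,
) -> Result<T, E> {
    let mut backoff = Duration::from_millis(10);
    for attempt in 1usize.. {
        match f() {
            Ok(v) => return Ok(v),
            Err(e) if attempt >= max_tries => return Err(e),
            Err(_) => {
                thread::sleep(backoff.min(clamp));
                backoff *= 2; // exponential backoff, clamped above
            }
        }
    }
    unreachable!()
}

fn main() {
    let mut tries = 0;
    let out: Result<i32, &str> = retry_sync(5, Duration::from_millis(20), || {
        tries += 1;
        if tries < 3 { Err("transient") } else { Ok(42) }
    });
    assert_eq!(out.unwrap(), 42);
    assert_eq!(tries, 3);
}
```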

The `duration_constants` feature actually isn't stable yet, but
`Duration::MAX` was stabilized separately.

h/t @umanwizard

This should improve throughput when the rdkafka producer queue fills
up.

Teach kgen to optionally spawn multiple threads, defaulting to the
number of physical CPUs available on the machine.

Thread safety made this surprisingly irritating. This commit refactors
the Avro generator so that the ThreadRng is only ever passed as a
parameter, never stored, as otherwise the Avro generator does not
implement `Send`. It also introduces a rather goofy `Generator` trait
whose only purpose is to make it possible to clone the generator
closures.
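The clone-a-boxed-closure pattern that commit describes can be sketched like this (illustrative types, not the actual kgen `Generator` trait): a plain `Box<dyn FnMut(...)>` is not `Clone`, so a trait with a `clone_box` method lets each worker thread take its own copy of the generator.

```rust
trait Generator: Send {
    fn generate(&mut self, out: &mut Vec<u8>);
    // Object-safe cloning: a `Box<dyn FnMut(...)>` has no `Clone`,
    // but a trait method returning a fresh box works.
    fn clone_box(&self) -> Box<dyn Generator>;
}

// Blanket impl: any cloneable, sendable closure is a `Generator`.
impl<F> Generator for F
where
    F: FnMut(&mut Vec<u8>) + Clone + Send + 'static,
{
    fn generate(&mut self, out: &mut Vec<u8>) {
        self(out)
    }
    fn clone_box(&self) -> Box<dyn Generator> {
        Box::new(self.clone())
    }
}

fn main() {
    let generator: Box<dyn Generator> = Box::new(|out: &mut Vec<u8>| {
        out.extend_from_slice(b"row");
    });
    // Each worker thread gets its own clone of the generator.
    let handles: Vec<_> = (0..4)
        .map(|_| {
            let mut g = generator.clone_box();
            std::thread::spawn(move || {
                let mut buf = Vec::new();
                g.generate(&mut buf);
                buf
            })
        })
        .collect();
    for h in handles {
        assert_eq!(h.join().unwrap(), b"row".to_vec());
    }
}
```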

kafka-avro-generator is going away soon. Use kgen directly instead.
@benesch benesch mentioned this pull request Jan 1, 2022
@benesch benesch enabled auto-merge January 1, 2022 21:54
@benesch benesch merged commit 19b3d67 into MaterializeInc:main Jan 1, 2022
@benesch benesch deleted the kgen-refactor branch January 1, 2022 22:11
benesch added a commit to benesch/materialize that referenced this pull request Jan 4, 2022