Batch loading with speed control with slow consumer #348

ArsenyClean · 2022-09-18T17:34:15Z

We need to load a big part of data from HBase using Spark.

Then we put it into Kafka and read by consumer. But consumer is too slow

At the same time Kafka memory is not enough to keep all scan result.

Our key contain ...yyyy.MM.dd, and now we load 30 days in one Spark job, using operator filter. But we cant split job to many jobs, (30 jobs filtering each day), cause then each job will have to scan all HBase, and it will make summary scan to slow.

Now we launch Spark job with 100 threads, but we cant make speed slower by set less threads (for example 7 threads). Cause Kafka is used by third hands developers, that make Kafka sometimes too busy to keep any data. So, we need to control HBase scan speed, checking all time is there a memory in Kafka to store our data

We try to save scan result before load to Kafka into some place, for example in ORC files in hdfs, but scan result make many little files, it is problem to group them by memory (or there is a way, if you know please tell me how?), and store into hdfs little files bad. And merging such a files is very expensive operation and spend a lot of time that will make total time too slow

Sugess solutions:

Maybe it is possible to store scan result in hdfs by spark, by set some special flag in filter operator and then run 30 spark jobs to select data from saved result and put each result to Kafka when it possible

Maybe there is some existed mechanism in spark to stop and continue launched jobs

Maybe there is some existed mechanism in spark to separate result by batches (without control to stop and continue loading)

Maybe there is some existed mechanism in spark to separate result by batches (with control to stop and continue loading by external condition)

Maybe when Kafka will throw an exception (that there is no place to store data), there is some backpressure mechanism in spark that will stop scan for some time if there some exceptions appear in execution (but i guess that there is will be limited retry of restarting to execute operator, is it possible to set restart operation forever, if it is a real solution?). But better to keep some free place in Kafka, and not to wait untill it will be overloaded

Do using PageFilter in HBase (but i guess that it is hard to realize), or other solutions variants? And i guess that there is too many objects in memory to use PageFilter

P.S This #108 will not help, we already use filter

Any ideas would be helpful

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Batch loading with speed control with slow consumer #348

Batch loading with speed control with slow consumer #348

ArsenyClean commented Sep 18, 2022 •

edited

Loading

Batch loading with speed control with slow consumer #348

Batch loading with speed control with slow consumer #348

Comments

ArsenyClean commented Sep 18, 2022 • edited Loading

ArsenyClean commented Sep 18, 2022 •

edited

Loading