Batch loading with speed control with slow consumer #348

Open
ArsenyClean opened this issue Sep 18, 2022 · 0 comments

Comments

ArsenyClean commented Sep 18, 2022

We need to load a large amount of data from HBase using Spark.

We then write it to Kafka, where a consumer reads it. But the consumer is too slow.

At the same time, Kafka does not have enough space to hold the whole scan result.

Our row key contains a date (...yyyy.MM.dd), and we currently load 30 days in one Spark job using a filter operator. We can't split this into many jobs (30 jobs, each filtering one day), because each job would then have to scan all of HBase, which would make the total scan too slow.

We currently launch the Spark job with 100 threads. We can't simply slow it down by using fewer threads (for example, 7), because Kafka is also used by third-party developers, which sometimes makes Kafka too busy to accept any data. So we need to control the HBase scan speed by continuously checking whether Kafka has space to store our data.

We tried saving the scan result somewhere before loading it into Kafka, for example as ORC files in HDFS. But the scan result produces many small files, which are hard to group by size (if there is a way, please tell me how), and storing small files in HDFS is bad. Merging such files is a very expensive operation and would make the total time too long.
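To show what we mean by grouping small files by size, here is a minimal sketch in plain Python (not Spark or HDFS code). It greedily packs files into groups of roughly a target size. The function name `group_by_size`, the file names, and the sizes are all made up for illustration; a real version would get the listing from HDFS.

```python
from typing import Iterable, List, Tuple


def group_by_size(files: Iterable[Tuple[str, int]], target_bytes: int) -> List[List[str]]:
    """Greedily pack (path, size_in_bytes) pairs into groups of about target_bytes."""
    groups: List[List[str]] = []
    current: List[str] = []
    current_size = 0
    for path, size in files:
        # Close the current group if adding this file would exceed the target.
        if current and current_size + size > target_bytes:
            groups.append(current)
            current, current_size = [], 0
        current.append(path)
        current_size += size
    if current:
        groups.append(current)
    return groups


# Hypothetical listing of small files: (path, size in bytes).
listing = [("part-0.orc", 40), ("part-1.orc", 70), ("part-2.orc", 30), ("part-3.orc", 90)]
groups = group_by_size(listing, target_bytes=100)
print(groups)
```

This only decides which files belong together. It does not avoid the cost of actually merging them, which is the expensive part for us.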

Suggested solutions:

Maybe Spark could store the scan result in HDFS by setting some special flag on the filter operator, and then we could run 30 Spark jobs that select data from the saved result and put each result into Kafka when possible.
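The idea here is to scan once and split the result by day, so that each later job only touches its own day. A minimal sketch of the split in plain Python follows. It assumes the date is exactly the last 10 characters of the row key, and the example keys are invented. In Spark the analogous step would presumably be deriving a day column and writing with `DataFrameWriter.partitionBy`, but that part is not shown or tested here.

```python
from collections import defaultdict
from datetime import date, datetime
from typing import Dict, Iterable, List


def day_of(row_key: str) -> date:
    """Parse the trailing yyyy.MM.dd of a row key (assumed: the last 10 characters)."""
    return datetime.strptime(row_key[-10:], "%Y.%m.%d").date()


def bucket_by_day(row_keys: Iterable[str]) -> Dict[date, List[str]]:
    """Group row keys by the day encoded at the end of each key, in a single pass."""
    buckets: Dict[date, List[str]] = defaultdict(list)
    for key in row_keys:
        buckets[day_of(key)].append(key)
    return dict(buckets)


# Hypothetical row keys; only the trailing date format comes from our real keys.
keys = ["user1-2022.09.17", "user2-2022.09.18", "user3-2022.09.17"]
buckets = bucket_by_day(keys)
for day, day_keys in sorted(buckets.items()):
    print(day, day_keys)
```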

Maybe Spark has an existing mechanism to pause and resume running jobs.

Maybe Spark has an existing mechanism to split the result into batches (without control over pausing and resuming the load).

Maybe Spark has an existing mechanism to split the result into batches (with pausing and resuming controlled by an external condition).
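To make the "batches with an external condition" idea concrete, here is a minimal sketch in plain Python. It is not an existing Spark mechanism. `gated_batches` and `can_proceed` are hypothetical names, and `can_proceed` stands in for whatever check would tell us that Kafka has free space. How to implement that check is exactly what we don't know yet.

```python
import time
from itertools import islice
from typing import Callable, Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def gated_batches(
    records: Iterable[T],
    batch_size: int,
    can_proceed: Callable[[], bool],
    poll_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[List[T]]:
    """Yield fixed-size batches, pausing before each one until can_proceed() is true."""
    it = iter(records)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        # Hold the batch that was just read and poll the external condition.
        while not can_proceed():
            sleep(poll_seconds)
        yield batch


# Simulated condition: Kafka is "full" for the first two checks, then has space.
answers = iter([False, False, True, True, True])
waits: List[float] = []
batches = list(gated_batches(range(5), 2, lambda: next(answers), sleep=waits.append))
print(batches, waits)
```

Because the records are pulled lazily from the iterator, the source is only read as fast as the batches are released, which is the kind of scan speed control we are looking for.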

Maybe when Kafka throws an exception (because there is no space to store data), Spark has some backpressure mechanism that pauses the scan for a while when exceptions occur during execution. (I guess the number of retries for an operator is limited, though. Is it possible to make it retry forever, if this is a real solution?) It would be better to keep some free space in Kafka rather than wait until it is overloaded.
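For the retry idea, this is roughly the behaviour we have in mind, sketched in plain Python rather than as a Spark or Kafka feature. `send_with_retry` is a hypothetical helper and `flaky_send` simulates a producer call that fails while the broker is full. Passing `max_attempts=None` means "retry forever".

```python
import time
from typing import Callable, List, Optional


def send_with_retry(
    send: Callable[[], None],
    max_attempts: Optional[int] = None,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call send() until it succeeds and return the number of attempts made."""
    attempt = 0
    while True:
        attempt += 1
        try:
            send()
            return attempt
        except Exception:
            # Give up only if a limit was set and has been reached.
            if max_attempts is not None and attempt >= max_attempts:
                raise
            # Exponential backoff, capped at max_delay seconds.
            sleep(min(max_delay, base_delay * 2 ** (attempt - 1)))


# Simulated send: fails three times, then succeeds.
failures = iter([True, True, True, False])


def flaky_send() -> None:
    if next(failures):
        raise RuntimeError("broker full (simulated)")


delays: List[float] = []
attempts = send_with_retry(flaky_send, sleep=delays.append)
print(attempts, delays)
```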

Could we use PageFilter in HBase (though I guess it is hard to implement), or are there other options? I also guess there would be too many objects in memory to use PageFilter.

P.S. #108 will not help; we already use a filter.

Any ideas would be helpful
