Optimizing TF-Serving/Redis-Consumer Interaction #286
In a conference with @vanvalen, @willgraf, and @MekWarrior, we all agreed on a few things:
Now, we need to work out the details of the logic behind these variables. The first step is listing all the variables we need to set, and the second is determining a formula for each.
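Purely as an illustration of the "variable plus formula" idea, here is a sketch in Python; the variable names, the 50% overhead factor, and the formula itself are invented for this sketch and are not the list agreed on in the call.

```python
# Illustrative sketch only: hypothetical variable names and a naive formula,
# meant to show the shape of "variable + formula", not the agreed-upon list.

GPU_MEMORY_BYTES = 16 * 1024 ** 3  # V100 / T4: nominally 16 GB of GPU memory
BYTES_PER_VALUE = 4                # float32
MODEL_FRACTION = 0.5               # assumed share of memory consumed by the model itself


def max_batch_size(height, width, channels=1):
    """Naive upper bound on batch size from raw input size alone.

    Real GPU usage is dominated by intermediate activations, so a usable
    formula would need an empirically measured per-image memory factor.
    """
    image_bytes = height * width * channels * BYTES_PER_VALUE
    usable_bytes = GPU_MEMORY_BYTES * (1 - MODEL_FRACTION)
    return int(usable_bytes // image_bytes)


# Hypothetical settings the pods might derive at startup instead of hardcoding.
DERIVED_SETTINGS = {
    'TF_MAX_BATCH_SIZE': max_batch_size(1280, 1080),
    'GRPC_TIMEOUT_SECONDS': 30,  # would also need a formula, e.g. f(batch size, model latency)
}

if __name__ == '__main__':
    print(DERIVED_SETTINGS)
```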
Figuring out values for the relevant variables can be a four-step process:
Re goal 1: @willgraf and I came up with the following facts:
- GPU memory: V100 and T4 GPUs have 16 GB of GPU memory
- uncompressed image sizes: a 128x128 float32 image = 65,536 bytes = 64 KiB (see the quick check after this list)

questions:

preliminary answers:

resources:
- https://stackoverflow.com/questions/40190510/tensorflow-how-to-log-gpu-memory-vram-utilization
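A quick check of the figures above (raw input bytes only; real GPU usage is dominated by model weights and activations, so the frame count is just a loose upper bound):

```python
# Sanity check of the figures above, counting raw input bytes only.
BYTES_PER_VALUE = 4                         # float32

image_bytes = 128 * 128 * BYTES_PER_VALUE   # 65,536 bytes
gpu_memory = 16 * 1024 ** 3                 # nominal 16 GB, treated as 16 GiB here

print(image_bytes, image_bytes / 1024)      # 65536 64.0 -> 64 KiB, as stated
print(gpu_memory // image_bytes)            # 262144 raw 128x128 frames, ignoring the model itself
```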
With these settings and the updates from #304, the 100-image benchmarking jobs have gone from taking ~1 hour to consistently taking ~30-40 minutes. This is indeed for float32 data. float16 data, strangely, is taking almost twice as long to complete.
Describe the bug
@willgraf observed the following while trying to run benchmarking recently:
When benchmarking with 100 jobs, there is some variance in how long each job takes to finish: anywhere from 1 hour to ~75 minutes. However, all jobs completed successfully.
By contrast, when benchmarking with 1000 jobs, there were `DEADLINE_EXCEEDED` errors coming from `redis-consumer` pods, presumably due to `tf-serving` pods taking too long to process images, and many jobs failed.

Details:
Following the standard benchmarking protocol we've developed, each job consists of a zip file containing 100 images. Each image is 1280x1080 pixels and is passed to TensorFlow Serving as float32 data.
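For scale, a back-of-the-envelope payload calculation (this assumes single-channel images, which is not stated above):

```python
# Rough payload size per image and per job, assuming single-channel
# 1280x1080 float32 images (the channel count is an assumption).
BYTES_PER_VALUE = 4                                # float32

image_bytes = 1280 * 1080 * BYTES_PER_VALUE        # 5,529,600 bytes
job_bytes = 100 * image_bytes                      # 100 images per zip

print(image_bytes / 2 ** 20, job_bytes / 2 ** 20)  # ~5.27 MiB per image, ~527 MiB per job
```

Under that assumption, a 1000-job run pushes on the order of half a terabyte of uncompressed tensor data through the `redis-consumer`/`tf-serving` pipeline.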
The model used is the `NuclearSegmentation` model, version 1 (which can be found in the `kiosk-benchmarks` bucket on Google Cloud). This model was built with the deepcell-tf PanopticNet model.

Analysis:
The `deadline exceeded` warnings led us to think, based on past experience, that `tf-serving` was taking too long to process images with this model. This is a problem that we encountered at various points during the first round of benchmarking (for the paper), almost a year ago. Our solution then was to hardcode a series of parameters (in the `redis-consumer` pods, the `tf-serving` pods, and the `prometheus` pod) so that we could get maximal throughput while minimizing errors like the one just observed. This solution worked well for the model and cluster version used in the first round of benchmarking. Unfortunately, hardcoding heuristically-determined values for these parameters is a brittle solution, and it appears not to work for the current model.
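For concreteness, `DEADLINE_EXCEEDED` is the status a gRPC client reports when its deadline elapses before TensorFlow Serving responds. A minimal sketch of such a call is below; the host, port, tensor key, signature name, input shape, and 30-second timeout are assumptions for illustration, not the consumer's actual settings.

```python
# Sketch of a gRPC Predict call with an explicit deadline, to show where
# DEADLINE_EXCEEDED comes from. Host/port, tensor key, signature name, input
# shape, and the 30 s timeout are illustrative assumptions.
import grpc
import numpy as np
import tensorflow as tf
from tensorflow_serving.apis import predict_pb2, prediction_service_pb2_grpc

# Raise the gRPC message limits: a single 1280x1080 float32 image (~5.3 MB)
# already exceeds gRPC's default 4 MB receive limit.
channel = grpc.insecure_channel(
    'tf-serving:8500',
    options=[('grpc.max_send_message_length', -1),
             ('grpc.max_receive_message_length', -1)])
stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)

image = np.zeros((1, 1280, 1080, 1), dtype=np.float32)  # dummy batch of one image

request = predict_pb2.PredictRequest()
request.model_spec.name = 'NuclearSegmentation'
request.model_spec.signature_name = 'serving_default'          # assumed
request.inputs['image'].CopyFrom(tf.make_tensor_proto(image))  # assumed tensor key

try:
    # Second argument is the deadline in seconds: if the model takes longer
    # than this to respond, the client raises DEADLINE_EXCEEDED.
    response = stub.Predict(request, 30.0)
except grpc.RpcError as err:
    if err.code() == grpc.StatusCode.DEADLINE_EXCEEDED:
        print('tf-serving did not respond within the deadline')
    else:
        raise
```

The per-call deadline (however it is actually configured in the consumer) is presumably one of the parameters referenced above.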
tl;dr

Getting the best performance out of the pipeline between the `redis-consumer` pods and `tf-serving` is tough, and the appropriate settings depend not only on the respective versions of the pods being used, but also on the exact model being used.

Expected behavior
This is not well-defined. Perhaps a first step in solving this issue is to define performance thresholds that the kiosk must meet.