
Optimizing Tf-Serving/Redis-Consumer Interaction #286

Closed
dylanbannon opened this issue Mar 6, 2020 · 5 comments
Labels
enhancement New feature or request

Comments

@dylanbannon
Contributor

dylanbannon commented Mar 6, 2020

Describe the bug
@willgraf observed the following while trying to run benchmarking recently:

When benchmarking with 100 jobs, there is some variance in how long each job takes to finish: anywhere from 1 hour to ~75 minutes. However, all jobs completed successfully.

By contrast, when benchmarking with 1000 jobs, there were DEADLINE_EXCEEDED errors coming from redis-consumer pods, presumably due to tf-serving pods taking too long to process images, and many jobs failed.

Details:

Following the standard benchmarking protocol we've developed, each job consists of a zip file containing 100 images. Each image is 1280x1080 pixels and is passed to Tensorflow-Serving in float32 encoding.
The model used is the NuclearSegmentation model, version 1 (which can be found in the kiosk-benchmarks bucket on Google Cloud). This model was built with the deepcell-tf PanopticNet model.
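For a rough sense of the request sizes involved, here is a back-of-the-envelope calculation of the raw pixel data per image and per job (a sketch only; it assumes single-channel images, since the channel count is not stated above):

```python
# Rough size of one benchmarking image as sent to tf-serving.
# Assumption: single channel; float32 = 4 bytes per value.
width, height, channels, bytes_per_value = 1280, 1080, 1, 4

image_bytes = width * height * channels * bytes_per_value
print(f"one image: {image_bytes:,} bytes (~{image_bytes / 2**20:.1f} MiB)")
# one image: 5,529,600 bytes (~5.3 MiB)

# Each job zips 100 such images.
print(f"one job:   {100 * image_bytes / 2**20:.0f} MiB of raw pixel data")
# one job:   527 MiB of raw pixel data
```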

Analysis:

The deadline exceeded warnings led us to think, based on past experience, that tf-serving was taking too long to process images with this model. This is a problem that we encountered at various points during the first round of benchmarking (for the paper), almost a year ago. Our solution then was to hardcode a series of parameters (in the redis-consumer pods, the tf-serving pods, and the prometheus pod) so that we could get maximal throughput while minimizing errors like the one just observed. This solution worked well for the model and cluster version used in the first round of benchmarking. Unfortunately, hardcoding heuristically-determined values for these parameters is a brittle solution, and it appears to not work for the current model.

tl;dr

Getting the best performance out of the pipeline between the redis-consumer pods and tf-serving is tough, and the appropriate settings depend not only on the respective versions of the pods being used, but also on the exact model being served.

Expected behavior
This is not well-defined. Perhaps a first step in solving this issue is defining performance thresholds that the kiosk must meet.

dylanbannon added the bug (Something isn't working) label on Mar 6, 2020
willgraf changed the title from "Excessive Redis-Consumer Timeouts during Benchamrking" to "Excessive Redis-Consumer Timeouts during Benchmarking" on Mar 9, 2020
@dylanbannon
Contributor Author

dylanbannon commented Mar 9, 2020

In a conference with @vanvalen, @willgraf, and @MekWarrior, we all agreed on a few things:

  1. In order to handle all possible Tensorflow models, we would need to grab two pieces of information from each model in the models bucket, namely the input/output image size and the model's throughput with optimal batching on a known GPU, and use that information to inform the choice of environment variables, as discussed above.
  2. We don't need to handle all possible Tensorflow models. We don't even need to handle all CNNs written in Tensorflow. For now, our supported model range can be much smaller, since the models in the deepcell-tf repo have more or less uniform input shapes and throughputs.
  3. As an explicit spec of supported model types, we will say that we support all models that have
    • an input shape of either 128x128 at float32 or 512x512 at float32, and
    • a throughput of between 5 fps and 10 fps with optimal batching on a Titan V GPU.
  4. Using those assumptions, we can (in theory) set all necessary variables during cluster startup for any chosen GPU type.

Now, we need to work out the details of the logic behind these variables. The first step is listing all variables we need to set and the second is determining a formula for each.
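As a starting point, the supported-model assumptions above could be captured as explicit constants for the startup logic to read (a minimal sketch; these names are hypothetical, not existing kiosk variables):

```python
# Hypothetical constants capturing the supported-model spec agreed on above.
SUPPORTED_INPUT_SHAPES = [(128, 128), (512, 512)]  # pixels, float32 inputs
FLOAT32_BYTES = 4                                  # 4 bytes per value

# Assumed throughput range with optimal batching on a Titan V GPU.
MIN_FPS_TITAN_V = 5
MAX_FPS_TITAN_V = 10
```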

dylanbannon changed the title from "Excessive Redis-Consumer Timeouts during Benchmarking" to "Optimizing Tf-Serving/Redis-Consumer Interaction (was Excessive Redis-Consumer Timeouts during Benchmarking)" on Mar 10, 2020
@dylanbannon
Contributor Author

dylanbannon commented Mar 10, 2020

Figuring out values for the relevant variables can be a four-step process (plus an optional fifth step); a sketch of the formulas from steps 1-3 follows the list:

  1. In order to optimize usage of GPU memory, we need to determine an appropriate value for the MAX_BATCH_SIZE env var in the tf-serving pods. This requires us to know how much memory a given image being processed by a given model will take up in the GPU (see the comment below for some guidance on this). We also need to know the available memory of the GPU attached to the tf-serving pod. To compute the optimal batch size, divide GPU memory by the memory footprint per image and round down to the nearest whole number. (There is a corresponding variable in the redis-consumer pods called TF_SERVING_MAX_BATCH_SIZE. This may need to be set to the same value as MAX_BATCH_SIZE, but it's unclear; it all depends on whether tf-serving will process multiple requests simultaneously so long as they fit in memory. We should consult the Tensorflow Serving docs on this.)
  2. Once 1. is done, we can set a limit on CPU memory usage in the tf-serving pod using the MAX_ENQUEUED_BATCHES variable. Our current understanding is that queue_memory_allocation = memory_footprint_of_MAX_BATCH_SIZE * MAX_ENQUEUED_BATCHES. Armed with this formula, we should pick a target queue size (just choose something reasonable to start with and maybe tune it heuristically later on), and then the MAX_ENQUEUED_BATCHES computation is simple.
  3. Once 1. and 2. are done, we can optimize the usage of redis-consumer CPU time by setting GRPC_TIMEOUT properly. To do this, though, we need to know what the model throughput is on the given GPU (attached to tf-serving) with optimal batching. For our current purposes, we'll assume that all models run between 5 fps and 10 fps on a Titan V with optimal batching. (For other GPUs, we'll need to add a conversion factor.) Since the slower rate leads to a more conservative estimate, whereas the faster rate risks throwing DEADLINE_EXCEEDED errors, we'll assume the slower rate in the following: GRPC_TIMEOUT = MAX_ENQUEUED_BATCHES * time_to_process_a_batch_on_Titan_V * conversion_factor_from_Titan_V_to_other_GPU. (NB: The conversion factor could be determined empirically or, maybe, theoretically, based on differences in GPU processing speeds.)
  4. Once 1., 2., and 3. are done, the final challenge is to optimize the usage of all other cluster resources via proper scaling of the redis-consumer and tf-serving pods. The best solution here is not obvious. It might require finding a new source of metrics for Prometheus; Istio is a potential source, and redis-consumer logs (via a plugin like mtail or grok_exporter) are another possibility. Maybe all the metrics we need are already being exported by tf-serving. Who knows? In any case, custom scaling metrics should be constructed for both redis-consumer and tf-serving and used in Horizontal Pod Autoscaler resources to control their scaling, once everything else is in place.
  5. (Optional) Once the autoscaling has been figured out, we can probably go back and tweak the queue length parameter from step 2 (with GRPC_TIMEOUT hopefully being updated automatically) to optimize redis-consumer throughput.
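Here is a minimal sketch of the formulas from steps 1-3. The function names are hypothetical, and the per-image footprint and GPU conversion factor are assumed inputs rather than values the kiosk currently measures:

```python
import math


def max_batch_size(gpu_memory_bytes, per_image_footprint_bytes):
    """Step 1: divide GPU memory by the per-image footprint, round down."""
    return math.floor(gpu_memory_bytes / per_image_footprint_bytes)


def max_enqueued_batches(target_queue_bytes, batch_footprint_bytes):
    """Step 2: queue_memory_allocation = batch_footprint * MAX_ENQUEUED_BATCHES,
    solved here for MAX_ENQUEUED_BATCHES given a chosen target queue size."""
    return math.floor(target_queue_bytes / batch_footprint_bytes)


def grpc_timeout_seconds(enqueued_batches, batch_seconds_titan_v, gpu_conversion=1.0):
    """Step 3: GRPC_TIMEOUT = MAX_ENQUEUED_BATCHES * time per batch on a Titan V
    * conversion factor from the Titan V to the GPU actually in use."""
    return enqueued_batches * batch_seconds_titan_v * gpu_conversion
```

For instance, with 16,000,000,000 bytes of GPU memory and a purely illustrative per-image footprint of 250,000,000 bytes, `max_batch_size` would return 64.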

@dylanbannon
Contributor Author

dylanbannon commented Mar 10, 2020

Re goal 1:

@willgraf and I came up with the following facts:

GPU memory:

V100 and T4 GPUs have 16 GB of GPU memory

  • 16 GB = 16,000,000,000 bytes
  • the T4 uses GDDR6 memory and the V100 uses HBM memory

uncompressed image sizes:

128x128 at float32: 65,536 bytes = 64 KiB per image
512x512 at float32: 1,048,576 bytes = 1 MiB per image
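The arithmetic behind these numbers, plus the naive upper bound on how many raw images fit in 16 GB (a sketch only; it assumes single-channel float32 images and ignores activations and other intermediate values):

```python
BYTES_PER_FLOAT32 = 4  # uncompressed float32 pixel, single channel

for side in (128, 512):
    image_bytes = side * side * BYTES_PER_FLOAT32
    print(f"{side}x{side} float32: {image_bytes:,} bytes = {image_bytes / 1024:g} KiB")
# 128x128 float32: 65,536 bytes = 64 KiB
# 512x512 float32: 1,048,576 bytes = 1024 KiB  (i.e. 1 MiB)

# Naive upper bound on raw 512x512 images per 16 GB GPU; see the preliminary
# answers below for why the real per-image footprint is much larger than this.
GPU_BYTES = 16_000_000_000
print(GPU_BYTES // (512 * 512 * BYTES_PER_FLOAT32))  # 15258
```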

questions:

  • Are images loaded into GPUs uncompressed? Probably, but we can't find documentation to this effect.
  • We know how much memory the GPU has and we know how big an image is in memory, but how much memory does a GPU need per image to carry out all its calculations? Is this amount dependent on the model, or not?

preliminary answers:

  • Images are loaded into GPUs uncompressed.
  • Apparently, GPUs store images and all intermediate node values, gradients, etc. in memory simultaneously. This means that the memory footprint for a single image being processed is much greater than just the image size itself. What's more, the memory footprint per image is heavily dependent on model architecture. (The cs231n link in the resources below goes over an example, as does the datascience.stackexchange.com link.)

resources:

https://stackoverflow.com/questions/40190510/tensorflow-how-to-log-gpu-memory-vram-utilization
https://datascience.stackexchange.com/questions/17286/cnn-memory-consumption
http://cs231n.github.io/convolutional-networks/#case

@willgraf
Contributor

willgraf commented Mar 20, 2020

  • Through some trial-and-error testing, I've found that a batch size of 64 works well (for 128x128x1 images). A batch size of 128 also works, though it slows down the response time a bit. I think 64 is a safe way to go.

  • The MAX_ENQUEUED_BATCHES looks like it works differently than we thought. I'm still unsure of the details, but using a large value (10,000) does not yield any evictions for 128x128 images or 512x512 images.

  • I am seeing most predictions come back within 2-3 seconds (or less). This means the GRPC_TIMEOUT could be reduced, but the default value of 20s seems to work fine without drawbacks.

With these settings and the updates from #304, the 100-image benchmarking jobs have gone from taking ~1 hour to consistently taking ~30-40 minutes. These results are for float32 data; float16 data, strangely, is taking almost twice as long to complete.
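Collected in one place, the empirically determined values from this comment, expressed as a simple mapping to the env vars named earlier in the thread (the grouping into pods is an assumption for illustration, not a config file the kiosk uses):

```python
# Empirically determined settings from the comment above (128x128x1 images).
TF_SERVING_ENV = {
    "MAX_BATCH_SIZE": 64,            # 128 also works, but slows responses a bit
    "MAX_ENQUEUED_BATCHES": 10_000,  # large value; no evictions observed
}

REDIS_CONSUMER_ENV = {
    "TF_SERVING_MAX_BATCH_SIZE": 64, # may need to match MAX_BATCH_SIZE (unclear above)
    "GRPC_TIMEOUT": 20,              # seconds; most predictions return in 2-3 s
}
```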

willgraf changed the title from "Optimizing Tf-Serving/Redis-Consumer Interaction (was Excessive Redis-Consumer Timeouts during Benchmarking)" to "Optimizing Tf-Serving/Redis-Consumer Interaction" on May 4, 2020
willgraf added the enhancement (New feature or request) label and removed the bug (Something isn't working) label on May 4, 2020
@willgraf
Contributor

willgraf commented Jun 9, 2020

Closing this, as most of the action items have been resolved; the exception is 4), which lives on in an already existing issue: #278

The summaries of the interplay between these issues will also be moved over as a warning for new workflow users: #356

willgraf closed this as completed on Jun 9, 2020