Backend Batch Processing With Nomad
http://bit.ly/jhancock_nomad
19 Sep 2017

Jason Hancock
Operations Engineer, GrayMeta

https://jasonhancock.com
@jsnby

* Ops Engineer at GrayMeta

.image images/GrayMeta.png
.caption [[https://www.graymeta.com][graymeta.com]]

* What we do...

We index large filesystems, analyze content, extract metadata, etc.

* Our Workload Composition

Lots of backend jobs, essentially one job per file.

Some jobs are small and fast, others are large and can take multiple hours to complete.

: example: harvesting a 20kB jpeg vs a 3 hour 4k resolution video

* Our original architecture

.image images/diagrams/original.png

: A user tells our API to process some storage, like an S3 bucket.
: We walk that filesystem, queuing up jobs to process.
: Our backend worker reads from the queue, processes jobs, and pushes results into the results queue.
: A persistence process reads jobs from the results queue and pushes them into the database through our API.

* Scaling the original architecture

The original architecture scales horizontally

.image images/diagrams/original_singlenode.png

* Things were good, but we had operational issues

Some customers have really large files, and our scheduling algorithm was "dump it into a queue and hope for the best"

: A job could get launched on a box that didn't have enough disk and it would just fail.

* To counter our lack of intelligent bin-packing

We over-provisioned all our nodes on disk

.image images/diagrams/overprovisioned.png

* Our CFO wasn't pleased

.image images/money_in_wallet.jpg
.caption [[https://www.flickr.com/photos/68751915@N05/6722569541/][Money In Wallet]] by [[http://401kcalculator.org/][401kcalculator.org]], CC BY-SA 2.0

* Other problems: Security

We didn't have a defense against processing malicious content that exploited 3rd party vulnerabilities like [[https://imagetragick.com/][ImageTragick]]

Our processing nodes had too much sensitive configuration data and could access the storage directly.
* Other problems: maintenance

- We could add nodes to the cluster easily
- We didn't have a way to enable "drain" mode to pull machines out of rotation without killing any running jobs

* Other problems: multi-tenancy

Our architecture solved the problem for 1 customer. We scaled by adding dedicated resources for each customer:

.image images/diagrams/multi_customers.png

We were now scaling up and down multiple compute clusters on demand.

* Other problems: multi-tenancy, idle machines

Each customer required at least one worker, but those workers were often idle.

We needed a way to share these backend processing nodes across multiple customers.

: needed a way to scale the processing cluster up or down based on overall customer demand (1 cluster to manage instead of 30)

* Coming around to a solution

- Wanted to do the processing in a container with limited privileges/credentials
- Wanted to try to solve a lot of the problems/pain points of our current architecture if possible
- Wanted to avoid writing our own scheduling algorithms
- Try to keep it simple

* Started shopping for a container orchestrator

.image images/shopping_cart.jpg
.caption [[https://www.flickr.com/photos/drb62/4273903651/][Shopping Cart]] By Daniel R. Blume, CC BY-SA 2.0

* Requirements

- Easy to operate
- Easy to use API
- Can schedule jobs based on CPU, memory, and disk resources

: easy to operate...we were going to be deploying this everywhere (cloud, on-prem, etc.) and operational ease was a primary concern. We couldn't waltz into a new customer's infrastructure and expect them to deploy and manage a complicated solution
: easy to use API - wanted to make the development experience as easy as possible
: ability to schedule based on several dimensions; most important to us at the time was disk space

* Nomad to the Rescue

.image images/nomad.svg _ 400

* What Nomad solved for us out-of-the-box

- Easy to deploy/operate
- Maintenance (removing/draining nodes)
- Bin-packing
.image images/diagrams/heterogeneous_nodes.png

* Easy to use API

.code launch.go /START 1OMIT/,/END 1OMIT/
: Pretty easy to define a job using the Go client
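
The launch.go from the talk isn't reproduced here, so the following is only a sketch of defining a batch job with the github.com/hashicorp/nomad/api client. The job ID, image name, and resource numbers are made up, and the exact field shapes (plain ints vs. pointers) vary between Nomad releases.

	package main

	import "github.com/hashicorp/nomad/api"

	func intPtr(i int) *int { return &i }

	// buildJob sketches a batch job for processing one file. The ID, image,
	// and resource numbers are illustrative, not the values from the talk.
	func buildJob() *api.Job {
		task := api.NewTask("process-file", "docker").
			SetConfig("image", "example/processor:latest").
			Require(&api.Resources{
				CPU:      intPtr(500), // MHz
				MemoryMB: intPtr(256),
			})

		group := api.NewTaskGroup("processors", 1).AddTask(task)
		// Disk was the dimension that hurt us most, so request it explicitly.
		group.EphemeralDisk = &api.EphemeralDisk{SizeMB: intPtr(1024)}

		return api.NewBatchJob("process-file-example", "process-file", "global", 50).
			AddTaskGroup(group)
	}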

* Launch the job

.code launch.go /START 2OMIT/,/END 2OMIT/
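
Again a sketch rather than the actual launch.go: this continues the same hypothetical file and registers (launches) the job defined on the previous slide.

	// main continues the sketch from the previous slide: connect to Nomad
	// and register the batch job. Real code would log and retry properly.
	func main() {
		// DefaultConfig honours NOMAD_ADDR and friends from the environment.
		client, err := api.NewClient(api.DefaultConfig())
		if err != nil {
			panic(err)
		}

		// Register hands the job to Nomad; its scheduler takes over placement.
		resp, _, err := client.Jobs().Register(buildJob(), nil)
		if err != nil {
			panic(err)
		}
		println("evaluation ID:", resp.EvalID)
	}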

* Nomad written in Go

If the docs weren't clear, there was a full client implementation (the Nomad CLI) to borrow from

* Re-architecting to use Nomad

.image images/diagrams/with_nomad.png

We decided the jobs running in containers wouldn't touch the source storage directly

* Accessing the API

- Container is given a narrowly scoped OAuth token (scoped to a single source file)
- Allows the job in the container to download the file and store the results for only that file (see the sketch below)
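
Inside the container, the worker only ever talks to our API with that token. A minimal sketch of the download step, assuming illustrative environment variable names and an illustrative endpoint path:

	package main

	import (
		"io"
		"log"
		"net/http"
		"os"
	)

	func main() {
		// Both values are injected into the container by the job definition.
		apiURL := os.Getenv("API_URL")   // hypothetical name for the API endpoint
		token := os.Getenv("AUTH_TOKEN") // OAuth token scoped to one source file

		// Hypothetical endpoint: fetch the single file this token can read.
		req, err := http.NewRequest("GET", apiURL+"/v1/file/contents", nil)
		if err != nil {
			log.Fatal(err)
		}
		req.Header.Set("Authorization", "Bearer "+token)

		resp, err := http.DefaultClient.Do(req)
		if err != nil {
			log.Fatal(err)
		}
		defer resp.Body.Close()

		// Stage the file on local disk so the processing step can run on it.
		out, err := os.Create("/tmp/input")
		if err != nil {
			log.Fatal(err)
		}
		defer out.Close()

		if _, err := io.Copy(out, resp.Body); err != nil {
			log.Fatal(err)
		}
	}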

* Multi-tenancy

We inject configuration into the container via environment variables:

- URLs of API endpoints
- Auth token

Doing this allows us to configure jobs for any customer on a shared Nomad cluster, as sketched below

: use the same container image for all customers
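
On the scheduling side that's just the task's Env map. A sketch that extends the earlier job-definition example; the variable names are illustrative, not our real ones.

	// withCustomerEnv attaches per-customer configuration to a task so the
	// same container image works for every customer on the shared cluster.
	func withCustomerEnv(task *api.Task, apiURL, token string) *api.Task {
		task.Env = map[string]string{
			"API_URL":    apiURL, // this customer's API endpoint
			"AUTH_TOKEN": token,  // OAuth token scoped to a single source file
		}
		return task
	}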

* Solved a lot of our problems

- Security (never a solved problem) - better isolation
- Maintenance
- Reduced operational overhead (managing one cluster vs. multiple single-tenant clusters)
- Heterogeneous nodes, more realistically provisioned
- Better utilization of resources

: We built an "unplaceable job" alert. We monitor for any jobs stuck in Nomad's "pending" state that can never be placed because the resources requested by the job exceed the resources available in the cluster.
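
One way to approximate the unplaceable-job check from the notes is to look for evaluations Nomad has marked "blocked", meaning the job can't be placed with the resources the cluster currently has. A sketch, not our actual monitoring code:

	package main

	import (
		"fmt"
		"log"

		"github.com/hashicorp/nomad/api"
	)

	func main() {
		client, err := api.NewClient(api.DefaultConfig())
		if err != nil {
			log.Fatal(err)
		}

		// A "blocked" evaluation means the scheduler could not place the job.
		evals, _, err := client.Evaluations().List(nil)
		if err != nil {
			log.Fatal(err)
		}
		for _, e := range evals {
			if e.Status == "blocked" {
				fmt.Printf("unplaceable: job %s (eval %s)\n", e.JobID, e.ID)
			}
		}
	}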

* Flexibility - Other Container Schedulers?

- Launch jobs in AWS ECS
- Launch jobs in k8s?

: Our code's written in Go.
: We built our scheduler as an interface. So there's a Nomad implementation, an ECS implementation, and we'll likely build a k8s implementation next.
: allows us to swap implementations based on customer's needs
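
As the notes say, the orchestrator sits behind a Go interface in our code. The real interface isn't shown in the talk; a hypothetical shape might look like this:

	package scheduler

	import "context"

	// Work describes one processing job to place. Fields are illustrative.
	type Work struct {
		ID     string
		Image  string
		Env    map[string]string
		CPU    int // MHz
		MemMB  int
		DiskMB int
	}

	// Scheduler is implemented once per orchestrator: Nomad today, ECS as
	// well, and possibly Kubernetes next.
	type Scheduler interface {
		Submit(ctx context.Context, w Work) (id string, err error)
		Status(ctx context.Context, id string) (string, error)
		Cancel(ctx context.Context, id string) error
	}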

* Flexibility - dev environment

- Big advocate of local development environments
- Full stack
- As close to production as possible

* Dev env

- We use a monolithic Docker image that contains our full stack
- Runs the same version of Nomad and Consul as production
- We use the Nomad "raw_exec" driver

: unless we're testing a new version of Nomad and/or Consul

* Why raw_exec?

- Avoids having to do something creative like Docker-in-Docker
- Avoids having to bundle up our processing binary into a container image every time it's recompiled
- Allows rapid iteration without faking anything (see the sketch below)

: confidence
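
In the dev image the only thing that changes is the task driver: the same job sketch from earlier uses raw_exec and points at the locally built binary. The path and flag below are made up.

	// devTask is the dev-environment variant of the processing task:
	// raw_exec runs the freshly compiled binary directly on the host, so
	// there is no container image to rebuild on every change.
	func devTask() *api.Task {
		return api.NewTask("process-file", "raw_exec").
			SetConfig("command", "/go/bin/processor").
			SetConfig("args", []string{"-once"})
	}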

* Summary - Moving our batch processing to Nomad...

- Required architectural changes that unlocked additional flexibility in our platform
- Saved us from having to implement a scheduler
- Helped us better utilize compute resources