forked from rust-lang/crates.io
Merge latest rust-lang changes, resolve conflicts #3
Open · BatmanAoD wants to merge 1,664 commits into PritiKumr:favorite-crate-owner from BatmanAoD:favorite-crate-owner
Conversation
…isher The publisher's email address is deliberately copied into this table rather than referencing the users table. If people delete their account, remove their email, etc., we still have a way to attempt to contact the publisher of a particular version in the event of a DMCA complaint.
If there is an email, that is: verified emails are still optional at this stage, but we can start recording them now when they're available.
…carols10cents Add support for Azure DevOps badge I added a project of mine that is already using `Azure DevOps` to `fixtures/crates.js`. Let me know if that's an inconvenience and I'll remove it. I just put it there to easily see that everything was working as expected. I also added some tests by following the existing code example. See also rust-lang/cargo#6264
Add jemalloc, lock Rust version

Short version: We have a memory issue after switching to 1.32, which came from the allocator switch. 1.31 used jemalloc 4. jemallocator uses jemalloc 5, which behaves differently. With jemalloc 5 as configured in this PR, we still have an issue which needs to be addressed eventually, but the behavior is improved to the point where I'm not concerned about it causing a service disruption. My current theory points to the problem coming from civet, which we're removing soon anyway. Looking into this further will require a lot of time and effort, so I'd like to see if we get a major improvement from removing civet before going further. I believe that the behavior we're seeing with this config is the same as a rare behavior we saw on Rust 1.31, so I do not think this is any riskier than locking back down to 1.31. This is also what has been in production for the last several days as we've been debugging.

The `let _` when enabling background threads is there because they're not supported on all platforms (notably macOS). The function is safe to call on unsupported platforms. Folks who are more familiar with jemalloc than I am recommended this over `#[cfg(not(macos))]`. (A sketch of this setup appears below, after this message.)

This PR also locks down the Rust version so we don't unexpectedly have issues from a version upgrade mixed with code changes again, or an unexpected full build with no cache during an incident. I expect that we will just bump this immediately when a new stable version comes out, but we should make sure we always deploy right after bumping it.

Everything after this is more detail on what I've said above, and/or documentation of what I've found so it doesn't get lost.

Long version: After upgrading to Rust 1.32, we saw what appeared to be a large memory leak. I've spent a lot of time over the last week attempting to narrow this down, as well as come up with a permanent fix. The fact that the 1.32 upgrade caused this makes sense, as it changed the default allocator from jemalloc 4 to the system allocator. What doesn't make sense is why the system allocator would cause us to have a major memory leak, while jemalloc 4 gives us completely stable usage and jemalloc 5 does something completely different (more on that later).

From looking into this we've learned a few interesting things:

- We can run out of memory and it's completely fine.
  - If we run out of memory, we will start swapping to disk, which in theory should cause major perf issues, but in practice it barely caused a noticeable effect on response times.
- We are more affected by our choice of allocator than we thought.
- Our memory usage is in fact coming from our server, not nginx.
  - I've long wondered why exactly we were using so much memory, but haven't had time to look into it in detail until now. I've always assumed it was nginx's buffers, but this makes it clear that most or all of our memory usage is in fact coming from the server.
- Whatever is causing this issue seems to require production levels of traffic, as we haven't been able to reproduce it locally or in staging.
- The underlying problem almost certainly comes from one of our C dependencies, which I'll elaborate on below.
- We are reading and writing pages to disk (aka swapping) even when we're nowhere near max memory usage. My understanding is this should not be happening.
- Our allocations don't appear to correlate to crate uploads, which is the only time I'd expect us to be allocating a ton of memory. The spikes *sometimes* come from uploads, but not always. And the amount that gets allocated is way larger than the size of the crate being uploaded (a 0.18MB crate correlated with a 113MB increase in RSS).

Here's the behavior we've seen with various allocators:

- 1.31 default allocator (jemalloc 4, unprefixed)
  - Memory usage very rapidly grows to *some* number (usually between 200MB and 350MB) and remains extremely stable at that number until the server restarts. The number it grows to is different every restart, but whatever number it reaches is the same between both servers.
  - On rare occasions, there is a spike in memory usage around 1-2 hours before restart. I've suspected in the past this is related to the connection pool exhaustion issue, since IIRC the memory usage spiked the last time we had that issue occur in production. The spike only affects a single server when it happens.
- 1.32 default allocator (system allocator)
  - Memory usage grows rapidly and unbounded, consuming all available RAM within 3 hours and continuing to grow into swap beyond that. I do not know if it eventually reaches an upper bound, as I've never let it run for long after it begins swapping.
- jemallocator (jemalloc 5, prefixed, no background threads)
  - Similar to the system allocator, but much slower growth.
- jemallocator unprefixed (jemalloc 5, unprefixed, no background threads)
  - Memory usage grows to ~150MB, and is either stable or very slowly grows towards ~200MB. Memory usage is similar between servers, but not exactly the same. "Stable" in this context means not obviously leaking or growing in unreasonable ways, but it moves around much more than jemalloc 4 did (25MB or so vs <5MB).
  - It's almost always only one server that grows, but I've seen both grow at least once.
  - After 16 hours, memory usage grows extremely rapidly on one or occasionally both servers to some number (usually between 400MB and 520MB) and stays there until server restart.
    - The number it grows to appears to be roughly 2.25x the usage before the growth. And by roughly I mean within 1-2% of that number; it exhibits this behavior extremely consistently.
  - Because unprefixing jemalloc's symbols caused the change here, it's pretty clear that it's a leak in memory allocated by one of our C dependencies. We have quite a few of these, but the ones that would make the most sense to be the source are civet, miniz (coming from flate2), libgit2, or libpq (coming from Diesel).
    - Diesel has been run through valgrind in the past, and I did a brief code audit as well, so I don't think it's libpq.
    - It wouldn't make a lot of sense for libgit2 to be the cause, but once we've moved index updates off the web server there's very little reason we can't just subshell out to git.
    - civet makes the most sense to be the cause, and is something we're intending to remove.
    - flate2 would also make sense and is much harder to remove.
  - The 16 hour time period before the spike was very consistent, and not correlated with a time of day, so I don't think it was triggered by specific traffic patterns.
  - Once the spike happens, memory usage tends to remain pretty stable there, with one notable exception on the first day this was deployed, when it grew to consume all available memory, then randomly decided to allocate 116MB of swap, and shortly afterwards freed that amount of RSS. At the time I thought the fact that it stopped at almost exactly 512MB of RAM was too specific to be a coincidence, but I now believe that it was a coincidence.
- jemallocator unprefixed, `jemalloc_ctl` (jemalloc 5, unprefixed, background threads enabled)
  - This is the config in this PR.
  - Similar behavior to background threads turned off, but it takes 22-23 hours for the spike to occur.
- The behavior of the jemallocator crate is the same on both 1.32 and 1.31.

My current working theory is that we do actually have a memory leak, but what is being leaked is something extremely small. The reason this causes such a huge difference in behavior on the system allocator is that these small leaked allocations are causing fragmentation issues. For this to be the case, it would require that we are slowly requesting larger and larger allocations over time, which is what leads me to believe that civet is the cause. I suspect the growth is from civet growing its request/response buffers (which should be the largest single allocations we have), but the leak is something very small that is just causing the allocator to be unable to re-use previously freed blocks. It's also possible that these re-allocations are just causing this much fragmentation naturally without the leak, but I have a hard time believing the system allocator is actually performing *that* much worse here. The slow, consistent memory growth before the huge spike also implies the small leak to me.

This would also explain some of the behavior of jemalloc, and why we're seeing that spike. jemalloc puts smaller allocations into the same arena, so it would prevent the fragmentation caused by leaking smaller allocations, and potentially would be able to hide it completely for a while. The fact that the spike is consistently 2.25x the previous max RSS is too specific not to be the growth factor on something. I think it is jemalloc growing one of its larger arenas, but it could also be the size of some internal buffer used by civet or something else.

I believe that the behavior we're seeing with jemalloc 5 isn't actually different overall from what we saw with jemalloc 4, other than the spike happening sooner. I think the spike we very rarely saw with jemalloc 4 would consistently have occurred if the server didn't restart every 24 hours.

I suspect switching from civet to hyper will remove this issue entirely. If it doesn't, I intend to dig in further. The next steps would be to log more from jemalloc on a regular interval, get some stack traces from jemalloc for large allocations, and possibly even run valgrind in production for brief periods. Most of these things will require a lot of work and have a very low signal to noise ratio, which is why I've chosen to stop here, at a point where I'm confident we don't have an impending outage, and rule a few more things out before digging deeper.

Other things that I have done which are not in this PR or reflected in the code base are:

- Enabling more verbose logging on Heroku's side
  - This gives us more granularity than the metrics graph gives us (every 20 seconds, and also separated by server)
  - Additional stats beyond the metrics graphs (disk cache, pages written to disk)
- Reached out to support to see if we can disable the daily restart
  - I don't intend to do this right now even if it's possible, but it's a good option to have to aid us in debugging this and the connection pooling issue.
  - It's not something they offer out of the box, but the platform team is looking into making it possible for us.
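For reference, here is a minimal sketch of the allocator setup described in the short version above. This is an illustration rather than the exact crates.io code, and the `jemalloc_ctl::background_thread::write` call is my best recollection of that crate's API, so treat the exact path as an assumption.

```rust
// Use jemalloc 5 (via the jemallocator crate) as the global allocator.
#[global_allocator]
static ALLOC: jemallocator::Jemalloc = jemallocator::Jemalloc;

fn main() {
    // Background threads are not supported on all platforms (notably macOS).
    // The call is safe everywhere and simply returns an error where they are
    // unsupported, which is why the result is deliberately discarded with
    // `let _` rather than gating the call behind `#[cfg(not(macos))]`.
    let _ = jemalloc_ctl::background_thread::write(true);

    // ... start the web server as usual ...
}
```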
…b, r=sgrif Invalid emails from GitHub This supersedes rust-lang#1251.
Rate limit the crate publish endpoint

This adds a restriction to the number of crates that can be published from a single IP address in a single time period. This is done per IP instead of per user, since that's what we can do with nginx alone. We may want to do additional rate limiting per token to force spammers to register additional accounts, and not just change their IP.

This will use what's called the "leaky bucket" strategy for rate limiting. Basically, every user has a bucket of tokens they use to publish crates. They start with 10 tokens available. Every time they hit this endpoint, they use one of the tokens. We give them a new token each minute. What this means is that you can upload 1 crate per minute, but we allow you to upload up to 10 in a short period of time before we start enforcing the rate limit. When someone does hit the rate limit, they will receive a 429 response. We could also allow it to instead just slow down the requests, refusing to process them until a token is available (queueing a max of 10 requests).

This reserves 10 megabytes of memory for the IP table, which means we can hold about 80,000 IPs. When the table is full, it will attempt to drop the oldest record, and if that doesn't give enough space, it'll give a 503. Keep in mind this is all in memory, not shared between our servers. This means that it is possible (but not guaranteed) that someone can upload 20 crates, and then send 2 requests per minute.

On Heroku and any system that involves proxies, the remote_addr of the request becomes useless, as it will be the proxy forwarding your request rather than the client itself. Most proxies will set or append an `X-Forwarded-For` header, putting whatever remote_addr they received at the end of the header. So if there are 3 proxies, the header will look like `real_ip, proxy1, proxy2`. However, if we're not careful this makes us vulnerable to IP spoofing. A lot of implementations just look at the first value in the header. All the proxies just append to the header, though (they don't know if they're getting traffic from the origin or a proxy), so you end up with `spoofed_ip, real_ip, proxy1, proxy2`. nginx, Rails, and many other implementations avoid this by requiring an allowlist of trusted IPs. With this configuration, the realip module only kicks in if remote_addr is a trusted proxy, and then it will replace it with the last non-trusted IP from the header.
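To make the header handling concrete, here is a small Rust sketch of the selection rule described above. The function and the allowlist-as-a-slice are hypothetical; in production this logic lives in nginx's realip configuration, not in our Rust code.

```rust
use std::net::IpAddr;

/// Pick the client IP the way realip does: only trust `X-Forwarded-For` when
/// the direct peer is a known proxy, then take the last entry in the header
/// that is not itself a trusted proxy.
fn client_ip(remote_addr: IpAddr, x_forwarded_for: &str, trusted_proxies: &[IpAddr]) -> IpAddr {
    if !trusted_proxies.contains(&remote_addr) {
        // The peer talked to us directly; the header cannot be trusted at all.
        return remote_addr;
    }
    x_forwarded_for
        .split(',')
        .rev() // walk from the right: proxies append, clients can only prepend
        .filter_map(|part| part.trim().parse::<IpAddr>().ok())
        .find(|ip| !trusted_proxies.contains(ip))
        .unwrap_or(remote_addr)
}
```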
…jtgeibel Record verified email, if present, of the publisher in the version record Connects to rust-lang#1620. We can start recording emails if we have them, even though we're not requiring verified emails yet. This builds on top of rust-lang#1561.
Switch to `conduit-hyper` in place of `civet` `conduit-hyper` now handles normalizing paths, such that `//api/v1` becomes `/api/v1` and doesn't break our router.
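As a rough illustration of what that normalization does (this is not `conduit-hyper`'s actual code, just a sketch of the behavior):

```rust
/// Collapse repeated slashes so that "//api/v1" routes the same as "/api/v1".
/// (This sketch also drops any trailing slash.)
fn normalize_path(path: &str) -> String {
    let mut normalized = String::with_capacity(path.len());
    for segment in path.split('/').filter(|s| !s.is_empty()) {
        normalized.push('/');
        normalized.push_str(segment);
    }
    if normalized.is_empty() {
        normalized.push('/');
    }
    normalized
}

#[test]
fn collapses_duplicate_slashes() {
    assert_eq!(normalize_path("//api/v1"), "/api/v1");
    assert_eq!(normalize_path("/api//v1"), "/api/v1");
}
```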
Because that's at least on 2019-02-28 (the date we announced) in all timezones, and should definitely be after the release scheduled for that day. As per the plan at rust-lang/crates-io-cargo-teams#8
The frontend Docker image appears to be following a pretty standard Node convention of adding the package.json, running `npm install`, and then copying the rest of the application, which makes rebuilding new docker images much faster if there are no dependency changes. However, in the great `yarn` cleanup of 2017, `yarn install` was removed but the corresponding `npm install` was not added back. This means that when the frontend docker image tries to run ember, ember is not available because it was never installed.
Revert "Auto merge of rust-lang#1618 - jtgeibel:conduit-hyper-round2, r=sgrif" This reverts commit 69368a6, reversing changes made to 6fa8ad3. When deploying this change, we saw response times nearly triple, and our CPU usage went through the roof. Additionally, this did not appear to fix the memory issues we were hoping to address by removing civet. /cc @jtgeibel
This column is never used in Rust code, and we're doing a lot of funkiness to work around its existence. Even if we wanted to use this column in Rust code, we wouldn't be able to since the diesel_ltree crate is using Diesel 0.16. There's a more recent version that uses Diesel 1.0, but I don't see us using this column in the future more than we are today. The column is still used in raw SQL, so we haven't removed it from the actual database schema.
Remove rand 0.3 from the dependencies
We originally logged this in order to help us separate malicious traffic. Ultimately we've never ended up using this at all. As we're reviewing our privacy policies, this seems like a good time to stop logging things we don't care about that expose additional information about our users.
When using `docker-compose up` to bring up everything for development, we mount the source directory as read-only so that changes can only happen on the host machine and not because of something that the code running in docker does. This read-only mount was causing problems because `patch`, called by `diesel migration run`, would try to write to the filesystem and fail. When bringing up a docker environment for the first time from a fresh clone, we expect that the schema.rs file is already pristine, and that we just need to run migrations. For that reason, use `diesel migration run`'s `--locked-schema` option to just run migrations. This change makes it so that a new contributor can clone the repo, run `docker-compose up`, and have an almost complete environment to poke at and start exploring. Fixes rust-lang#1631
…albini Further address performance regression in search

Even after rust-lang#1746, we're still seeing a performance issue with search in production. Now it's limited to searches that are a single letter, or short 2 letter words like 'do'. It's caused by any search that would cause PG to warn that the query contains only stopwords. It appears the code path taken when `plainto_tsquery` returns an empty query is substantially slower than it would be otherwise, even if the query contains stopwords.

The reason this has started causing problems now is that rust-lang#1560 caused the query to change from performing a nested loop join to a hash join. Due to what appears to be a bug in PG, `plainto_tsquery` is getting called once per row when a hash join is performed. When the query is passed as the only argument, the function is declared as `STABLE`, meaning that within a single statement it will always return the same result for the same arguments, so PG should only be calling it once (or at least only a handful of times).

There's a second form available where you explicitly pass the language as an argument. This form is marked as `IMMUTABLE`, so the query planner will just replace the call to the function with its results. Unfortunately, PG is picky about how we pass the language. It doesn't consider a cast from `text` to `regconfig` to be `IMMUTABLE`, only `STABLE` (which is valid, since it's based on a `pg_catalog` lookup; the fact that it accepts a string literal as `IMMUTABLE` actually seems wrong). The actual value is the OID of the row in `pg_ts_config`, which is *not* stable. Since `regconfig('english'::text)` is not considered `IMMUTABLE`, we just need to embed it as a string literal instead.
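For illustration, here is a hedged Diesel 1.x sketch of what embedding the language as a literal looks like from the Rust side. The query text and the `CrateId` struct are simplified stand-ins, not the actual crates.io search query.

```rust
#[macro_use]
extern crate diesel;

use diesel::pg::PgConnection;
use diesel::prelude::*;
use diesel::sql_query;
use diesel::sql_types::Text;

#[derive(QueryableByName)]
struct CrateId {
    #[sql_type = "diesel::sql_types::Integer"]
    id: i32,
}

/// The language is embedded as a string literal, selecting the two-argument
/// `plainto_tsquery(regconfig, text)` form that PG marks IMMUTABLE, so the
/// planner can fold the call instead of re-evaluating it per row.
fn matching_crate_ids(
    conn: &PgConnection,
    term: &str,
) -> Result<Vec<CrateId>, diesel::result::Error> {
    sql_query(
        "SELECT id FROM crates \
         WHERE plainto_tsquery('english', $1) @@ textsearchable_index_col",
    )
    .bind::<Text, _>(term)
    .load(conn)
}
```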
…geibel Add support for graceful shutdown of the server

The server process now intercepts SIGINT and SIGTERM to initiate a graceful shutdown of the server. This is useful when tracking memory leaks locally, as both `hyper` and `civet` are given a chance to return memory and shut down cleanly. However, this will not improve things in production, as we also run an instance of `nginx`, which will close all connections after receiving a SIGTERM from Heroku. From my preliminary investigation, it appears we may need to customize the buildpack to change this behavior.

Additionally, the server will now briefly sleep before notifying Heroku that it is ready to receive connections. Also, when using `hyper` the `Runtime` is configured to use 4 background threads. This overrides the default, which is one thread per CPU, and provides consistency between differently sized dynos. The number of `conduit` background threads is still configured via `SERVER_THREADS` and defaults to 50 in production.
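A small sketch of the signal-interception part, written against a current tokio/hyper rather than the exact versions in use here, so take the API details as assumptions:

```rust
use tokio::signal::unix::{signal, SignalKind};

/// Resolve once either SIGINT or SIGTERM arrives. The resulting future can be
/// handed to something like hyper's `Server::with_graceful_shutdown`, so that
/// in-flight requests are drained before the process exits.
async fn shutdown_signal() {
    let mut sigint = signal(SignalKind::interrupt()).expect("failed to install SIGINT handler");
    let mut sigterm = signal(SignalKind::terminate()).expect("failed to install SIGTERM handler");
    tokio::select! {
        _ = sigint.recv() => {}
        _ = sigterm.recv() => {}
    }
}
```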
…rols10cents Reduce the number of git clones in the test framework

The test framework spends a lot of I/O and allocations initializing and cloning git repositories, even though most tests do not enqueue a background job to modify the index. A `TestApp::full()` method is added which initializes the `TestApp` with an index, background job runner, and playback proxy. It is also now possible to enqueue multiple crate publishes before running a batch of background jobs. A `Drop` impl on `TestAppInner` ensures that all pending background jobs are run and that no jobs have been queued unless `TestApp::full()` was used.

Unlike the `enqueue_publish` helper, the `yank` and `unyank` methods automatically run the background jobs out of convenience, since otherwise the database is not updated. This can be changed in the future if a test cares about controlling exactly when the jobs are run.

Finally, this PR includes several other improvements:

* The initial commit for each remaining upstream index is now made directly against the bare repo, rather than doing a clone and push.
* The proxy only initializes a `Client` when recording.
* A playback proxy is not created when calling `app()`

Overall, when running the tests in `all`, these changes reduce the cumulative allocation size from 3.45 GB to 689 MB. Allocation counts are reduced from 4,625,804 to 1,611,351.
…ries, r=jtgeibel Use a stricter form of search for extremely short queries

Since rust-lang#1560 we've had various performance problems with search. While we've addressed most of the biggest issues, we still have poor performance when the search term is 2 characters or shorter (between 100-200ms at the time of writing this commit, though there is a PR being merged which can regress this another 30%). This level of performance isn't the end of the world, but the majority of our 1 or 2 character searches come from a handful of crawlers which hit us very frequently, and I'd prefer not to spend this much time on queries for crawlers. (There are quite a few of these; they all do things like search for 's', then 'se', then 'ser', etc. over and over again. The User Agent only identifies the HTTP lib they used, which varies.)

The performance issues come from two places. The first problem is that we can't use our indexes when the search term is 2 characters, due to how trigrams work. This means that we fall back to doing a sequential scan of the entire table, which will only get worse as time goes on. For single letter searches, the second issue comes from the sheer number of rows we get back, which have to go into an expensive hash join. If you search for 'a', you get back 13k results.

At the end of the day, getting every crate with 'a' in its name is not useful, so I've tried to go with a solution that both improves our performance and also returns more useful results. The operator I've used is meant to return whether any words are sufficiently similar, but since our search term is shorter than a trigram, the behavior is a little different. For 2 letter searches, it ends up being "any word begins with the term", and for 1 letter searches, it's "any word is equal to the term". Here are some example results for "do" and "a":

```
do
afi_docf alice-download async_docker avocado_derive cargo-build-docker
cargo_crates-io_docs-rs_test cargo_crates-io_docs-rs_test2 cargo-do
cargo-doc-coverage cargo-docgen cargo-dock cargo-docker cargo-docker-builder
cargo-docserve cargo-docserver cargo-doctor cargo-download cargo-external-doc
cargo-pack-docker cargo-serve-doc devboat-docker doapi do-async doc
doc_9_testing_12345 docbase_io doc-cfg doc-comment doccy doc_file docker
docker4rs
```

```
a
a-range cortex-a jacques_a_dit magic-number-a manish_this_is_a_test
poke-a-mango vmx-just-a-test-001-maincrate wasm-bindgen-test-crate-a
```

Drawbacks
---

The original motivation for switching to `LIKE` in search was to make sure libssh shows up when searching for ssh. This will regress that for any `lib*` crates with less than 2 letter names. There aren't very many of these:

- lib
- libc
- libcw
- libdw
- libgo
- libjp
- libm
- libnv
- libpm
- libr
- libs
- libsm
- libxm

I'm less concerned about the single letter cases, as those are already going to be buried on page 87, but a few of the 2 letter cases you might legitimately search for. None of these crates have high traffic, and fixing this generally isn't really possible without introducing some special case indexes *only* for this case. We could also work around this by always searching for "lib*" in addition to whatever you searched for.

This also means that searching for `a` will no longer include the crate `a1`. I'm not as concerned about this; if you want all crates starting with the letter a, we already have `/crates?letter=a` for that.

With this change, our performance should be back to reasonable levels for all search terms.
Before
---

```
                                                                  QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=3895.90..3896.04 rows=20 width=882) (actual time=164.823..164.838 rows=20 loops=1)
   ->  WindowAgg  (cost=3895.90..3985.84 rows=12848 width=882) (actual time=164.821..164.832 rows=20 loops=1)
         ->  Sort  (cost=3895.90..3902.32 rows=12848 width=874) (actual time=155.012..156.425 rows=12599 loops=1)
               Sort Key: ((replace(lower((crates.name)::text), '-'::text, '_'::text) = 'a'::text)) DESC, crates.name
               Sort Method: quicksort  Memory: 9996kB
               ->  Hash Right Join  (cost=3410.76..3720.54 rows=12848 width=874) (actual time=95.457..116.592 rows=12599 loops=1)
                     Hash Cond: (recent_crate_downloads.crate_id = crates.id)
                     ->  Seq Scan on recent_crate_downloads  (cost=0.00..276.87 rows=25958 width=12) (actual time=0.012..2.753 rows=25958 loops=1)
                     ->  Hash  (cost=3365.79..3365.79 rows=12848 width=865) (actual time=95.417..95.417 rows=12599 loops=1)
                           Buckets: 16384  Batches: 1  Memory Usage: 6985kB
                           ->  Seq Scan on crates  (cost=0.00..3365.79 rows=12848 width=865) (actual time=0.015..85.416 rows=12599 loops=1)
                                 Filter: ((''::tsquery @@ textsearchable_index_col) OR (replace(lower((name)::text), '-'::text, '_'::text) ~~ '%a%'::text))
                                 Rows Removed by Filter: 13359
 Planning Time: 0.555 ms
 Execution Time: 165.998 ms
```

After
---

```
                                                                            QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=159.61..159.75 rows=20 width=882) (actual time=0.321..0.325 rows=9 loops=1)
   ->  WindowAgg  (cost=159.61..159.80 rows=26 width=882) (actual time=0.320..0.324 rows=9 loops=1)
         ->  Sort  (cost=159.61..159.63 rows=26 width=874) (actual time=0.310..0.311 rows=9 loops=1)
               Sort Key: ((replace(lower((crates.name)::text), '-'::text, '_'::text) = 'a'::text)) DESC, crates.name
               Sort Method: quicksort  Memory: 30kB
               ->  Nested Loop Left Join  (cost=10.10..159.49 rows=26 width=874) (actual time=0.076..0.288 rows=9 loops=1)
                     ->  Bitmap Heap Scan on crates  (cost=10.04..59.84 rows=26 width=865) (actual time=0.057..0.196 rows=9 loops=1)
                           Recheck Cond: ((''::tsquery @@ textsearchable_index_col) OR (replace(lower((name)::text), '-'::text, '_'::text) %> 'a'::text))
                           Heap Blocks: exact=9
                           ->  BitmapOr  (cost=10.04..10.04 rows=26 width=0) (actual time=0.046..0.046 rows=0 loops=1)
                                 ->  Bitmap Index Scan on index_crates_name_search  (cost=0.00..0.00 rows=1 width=0) (actual time=0.001..0.001 rows=0 loops=1)
                                       Index Cond: (''::tsquery @@ textsearchable_index_col)
                                 ->  Bitmap Index Scan on index_crates_name_tgrm  (cost=0.00..10.04 rows=26 width=0) (actual time=0.044..0.044 rows=9 loops=1)
                                       Index Cond: (replace(lower((name)::text), '-'::text, '_'::text) %> 'a'::text)
                     ->  Index Scan using recent_crate_downloads_crate_id on recent_crate_downloads  (cost=0.06..3.83 rows=1 width=12) (actual time=0.008..0.008 rows=1 loops=9)
                           Index Cond: (crate_id = crates.id)
 Planning Time: 0.553 ms
 Execution Time: 0.386 ms
```
This makes the returned error the one we're really trying to test for.
Update to latest master for lettre and lettre_email This will fix our CI on the beta branch, which is currently failing due to a known breaking inference change. See rust-lang/rust#60958 for more info. I have reviewed the [upstream changes] relative to our current version and see no issues with the changes there (although we could alternatively backport just the needed fix). [upstream changes]: lettre/lettre@v0.9.0...0ead3cd
Fix yank test This test wasn't set up correctly, so the error message that was being checked wasn't quite right. Also, [based on this comment](rust-lang#1697 (comment)) it seems like the names of these tests weren't as descriptive as they could be: the previous test covers the success case of an owner being able to yank, and the next test checks that non-owners aren't allowed to yank.
Force clear the cache on CI so that cache uploads do not time out.
⚠️ This commit contains a destructive migration. ⚠️
⚠️ It must be deployed separately from the previous commit. ⚠️
This column was never used. It's no longer referenced in Rust code, so we can just remove it.
…geibel Move README rendering off the web server

:warning: This PR contains destructive migrations. The commits included must be deployed separately. :warning:

Rendering the README is a process that can take an arbitrarily long amount of time, and ends with uploading a file to S3, which can also take an arbitrarily long period of time. Since there are no README contents which we consider invalid (which I've indicated by changing the return type of the rendering functions to `String` instead of `Result`), there's no reason for us to do any of this work on the web server. Just like the index updates, we can offload this to our worker to be completed at some point in the future.

This does mean that a crate will render in our UI before the README has finished rendering. During this brief period, we will behave the same as if the crate had no README file, showing the description instead.

Eventually I think it'd make sense to have this job behave like the `render-readmes` worker, where it pulls the tar file down and reads the actual readme file. This would mean that Cargo can stop sending us the README as its own field when it's already sending us a tar file which contains the README. I've left it alone for now though, since there's no benefit while Cargo is already sending us the README, and we can't have it stop doing that until we move to publish v2.

A few implementation notes:

- The `readme` column on `crates` was never used for anything in Rust, but is used on the DB side to generate the search index. I've left the column in place, but removed most uses of it on the Rust side.
- The `readme_file` column on `crates` was never used for anything.
- `background_jobs::Environment` gained two fields that are just straight up copied from `App`. I've opted to do this rather than just giving it `App`, since I do want to separate "the stuff you need to do background jobs" from "the stuff you need to be the web server". Since we do eventually plan on not having S3 uploads happen on the server, I want to bundle all of this into a single field and just have it live on `background_jobs::Environment` eventually.
- I did some general cleanup of `krate::publish` while I was in there, and removed a lot of pre-1.0 style code.
- I've left the `render-readmes` binary alone for now since it's untested. I'll change it to just queue up jobs once we get to the point where jobs are doing all the tar file work as well.
Add more aggressive rate limiting for publishing new crates

This is still incomplete, but the bulk of the code has been written, so I figured I'd get some eyes on it. Right now this just panics instead of returning an error if the user is out of tokens. Still left to do are:

- The two ignored test cases
- Implementing the actual error type
- Per-user burst rate overrides
- A cron job to restrict the table size and clean up stale buckets (I probably won't land this in the initial PR; our users table needs to grow by 2 orders of magnitude for this to really matter, but I do want to land it as a followup PR, since I haven't tested this with cases where now - last_update is greater than a month. It should work fine, but I'd rather not have this run against poorly defined semantics.)

I think the limit we'll probably set to start is 1 req/10s with a burst of 30. The error message will tell folks they can either wait for {time until next token} or email us to get the limit increased for them. This is limited per user instead of per IP, since rotating your user is harder than rotating your IP. It's stored in the DB since this is only for publishing new crates, which is slow enough already that the DB load of rate limiting there shouldn't matter.

I needed to update to Rust 1.33 to get `Duration::as_millis` (note: the way we're using this feature causes UB if the rate limit is slower than 1 request per 292471208 years. I assume this is not a problem). I needed to update to Diesel 1.4.2 to get a fix for diesel-rs/diesel#2017.

The algorithm used is pretty much the standard token bucket algorithm. It's *slightly* different in how we set `tokens = max(0, tokens - 1) + tokens_to_add` instead of `tokens = max(0, tokens_to_add + 1)`. This is because the usual implementation checks available tokens before subtracting them (and thus never persists if there aren't enough tokens available). Since we're doing this in a single query, and we can *only* return the final, persisted value, we have to change the calculation slightly to make sure that a user who is out of tokens gets `1` back after the rate limit. A sketch of this arithmetic follows below.

A side effect of all of this is that our token count is actually offset by 1. 0 means the user is not only out of tokens, but that we just tried to take a token and couldn't. 1 means an empty bucket, and a full bucket would technically be burst + 1. The alternative would be -1 meaning the user is actually out of tokens, but since we only ever refill the bucket when we're trying to take a token, we never actually persist a full bucket. I figured a range of 0...burst made more sense than -1..burst.
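To make the single-statement arithmetic concrete, here is a plain-Rust sketch of the bucket update described above. The real implementation does this inside one SQL statement; the names, the clamp to `burst`, and the refill bookkeeping here are illustrative rather than exact.

```rust
use std::time::{Duration, Instant};

struct Bucket {
    tokens: u32,
    last_refill: Instant,
}

impl Bucket {
    /// Refill based on elapsed time, take one token, and report whether the
    /// request is allowed. Mirrors `tokens = max(0, tokens - 1) + tokens_to_add`:
    /// a persisted value of 0 means "we just tried to take a token and couldn't".
    fn take_token(&mut self, rate: Duration, burst: u32, now: Instant) -> bool {
        let elapsed = now.duration_since(self.last_refill);
        let per_token = rate.as_millis().max(1);
        let tokens_to_add = (elapsed.as_millis() / per_token) as u32;
        // Only advance by the whole intervals we credited, so partial progress
        // toward the next token is not lost.
        self.last_refill += rate * tokens_to_add;
        self.tokens = (self.tokens.saturating_sub(1) + tokens_to_add).min(burst);
        self.tokens > 0
    }
}
```

With something like `rate = Duration::from_secs(10)` and `burst = 30`, this approximates the 1 req/10s limit with a burst of 30 mentioned above.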
Run `cargo update` I have several local branches updating dependencies and addressing any breaking changes (towards closing out rust-lang#1265). I'd like to avoid lockfile merge conflicts between those branches so I'm bumping all semver compatible versions in this PR. I'd like to let this bake in production for at least a few days before opening PRs bumping major versions in `Cargo.toml`. If this approach is too aggressive, or needs to be reverted, I can open the other PRs more serially and look into breaking this upgrade into smaller pieces.
…=sgrif Return correct download link for requests for non-standard crate names. Fixes rust-lang#1687. Crates are uploaded to S3 under the name they were first published as, but the download endpoint always uses the name as written in the request. If these names differ in their use of dashes vs. underscores, the download endpoint returns an invalid link. To avoid adding another database request for every single crate download, this fix changes the database request that updates the download count to also return the original crate name. We will need to confirm that the performance impact of this change is negligible.
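As a sketch of the approach (the table below is a trimmed-down stand-in for the real schema, and this is not the exact production query), Diesel can express the combined update-and-return like this:

```rust
#[macro_use]
extern crate diesel;

use diesel::pg::PgConnection;
use diesel::prelude::*;

// Minimal stand-in for the columns that matter here.
table! {
    crates (id) {
        id -> Integer,
        name -> Text,
        downloads -> Integer,
    }
}

/// Bump the download count and get the crate's canonical name back from the
/// same UPDATE via RETURNING, so no extra round-trip is needed.
fn increment_downloads(conn: &PgConnection, crate_id: i32) -> Result<String, diesel::result::Error> {
    use self::crates::dsl;

    diesel::update(dsl::crates.find(crate_id))
        .set(dsl::downloads.eq(dsl::downloads + 1))
        .returning(dsl::name)
        .get_result(conn)
}
```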
Clean up README rendering code. When browsing the code, I noticed that the HTML sanitization code looked a bit messy, so I attempted to clean it up. This patch should not change the functionality in any way, but the new code may be easier to read and understand – not sure whether that's enough for you to spend time reviewing this. I also wonder whether there is an explicit list of allowed tags in the code. The [defaults in the `ammonia` crate](https://docs.rs/ammonia/2.1.1/src/ammonia/lib.rs.html#260-269) look quite reasonable to me (we'd only need to add `input` for check boxes).
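For context, here is a hedged sketch of what an explicit allowlist tweak looks like with `ammonia`'s `Builder`; the added tag mirrors the `input` suggestion above, and this is not the crates.io implementation.

```rust
use ammonia::Builder;

/// Sanitize rendered README HTML, additionally allowing <input> so that
/// GitHub-style task-list checkboxes survive sanitization.
fn sanitize_readme(html: &str) -> String {
    Builder::default()
        .add_tags(vec!["input"])
        .clean(html)
        .to_string()
}
```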
I merged in the changes from rust-lang and resolved all the `cargo build` errors, but `npm test` has failures, and it appears that I've somehow broken some of the UI elements. I don't really know anything about front-end work, though, so unfortunately I don't even really understand the `npm` failures.