-
Apologies for a long-winded comment, but I wanted to think this through properly. We can always put the repository URL (and other small metadata) into a static .json served from the object cache, as the .crate files already are, to optimise traffic. But regarding dynamic requests/responses and rate limits, it depends on the deployment:

For a self-hosted deployment

There may be just "several" or "a few" transitive crate dependencies that need to be queried. What is the deal here, the client has to wait only a few seconds? I don't think API rate limits are a problem in that case. What kind of deployment are we talking about where the rate limit, if using dynamic queries, starts causing problems like "waiting too long"? If there is data around this, it would be good to see it. Does anyone, and how many, really query 3000 dependencies one by one?

For a centralised deployment

Are we talking about pulling metadata for the whole ecosystem, roughly 500,000 crate versions? Hosting "a lot" of crates with "tons" of transitive dependencies requires pulling a lot of data, which might benefit from a bulk download and would justify the tradeoffs of those pesky .csv files.

I think these two scenarios, and everything between them, should be defined and quantified more. Is there any data on the average distribution of the consumption volume? What are we talking about, on average?
The difference here matters because, in general, both consumption pattern and volume matter. I get that there is the whole "nice clean RESTful endpoint" or Swagger/OpenAPI generator thing, but it's largely optimised for a targeted request/response pattern; preloading large amounts of data is typically not handled. I initially thought this pattern/volume might have pointed that way, where loading the metadata .csv would be justified for local centralised caching compared to half a million request/responses (or more, when you don't cache the responses). It's still a lot of back-and-forth processing compared to "do it once".

Benchmarks to the rescue

Specifics and hard data about access volumes would be really helpful, so this can be benchmarked and evaluated on that data rather than naive rough calculations. When "crawling" pops up, any reasonable person not deeply involved in the consumer application would think it needs a direct copy, or a mirror, which the metadata .csv files already provide: half a GB unpacked, easily consumed as a 138 MB archive, if you want them all, that is. Sure, there are significant traffic savings if all the client wants is the metadata of 10 crates...

138 MB .CSV over X Request / Responses

Quantifying the tradeoff between one 138 MB .csv download and X request/responses is the issue for both crates.io (considering this software is used by Y amount of people with varying access volumes) and the individual user of the consumer software. There is a difference between half a million request/responses over the public network and a single download followed by local processing of a couple of extracted standard .csv files that can easily be cached instead of queried every day. Also, when I see "rate limit" mentioned, a reasonable person does not necessarily know how hard that rate limit is hit and how big of a problem it really is. It's about quantifying tradeoffs, and I want this problem solved for both centralised and self-hosted deployments, which have differing volume patterns, whether it's 4000 crates, 10 crates, or 500,000 versions of metadata. What is our pain point here? 🙂

Also the usage pattern: it runs in the background, right? Speed may or may not be much of an issue ("as long as it completes under an hour", say). What kind of constraints should we set to balance the needs of crates.io collectively and of client performance? Is it a once-a-day thing? What is our time limit really, and do we really have to care? It's all relative.

For example, if it needs to poll everything once a day and it has, say, 100 crates with 1000 transitive dependencies between them, it needs metadata for 1000 crate dependencies (1000 seconds at one request per second). Or, across 5 VMs that don't share state, potentially 5000 requests, meaning 5000 seconds. The biggest quantified issue in that case, as you already suspect, is the lack of shared state, which can be solved with a tiny Docker container / local cache that serves as a static mirror from a memory-loaded .csv, removing the network overhead and allowing local processing in addition. Suddenly 1000 seconds may not be that big of an issue compared to pulling a 138 MB one-file copy, parsing it and then caching and sharing the uncompressed 500 MB CSV. And if there are only 10 transitive dependencies to begin with,
then why did we need that 500 MB CSV at all, when the client could have just hit the API directly and taken the rate-limit penalty of roughly 10 seconds of processing time? Not a big issue using the per-crate API then, right? And since it's all relative, how about 1000 people doing the same thing? If we have 1000 people doing the same thing to crates.io, that's 5 million request/responses caused by this piece of software, which could have been just 1000 requests fetching the 138 MB archive (unpacking to roughly 500 MB) separately for each of those 1000 people. Would it be worse to have 5 million request/responses, or to transfer 138 GB of a cloud-cached static file that clients could cache continuously and poll updates on top of? Even that case has tradeoffs, as the above demonstrates... it all depends, for both crates.io and the consumer end. That's why it is extremely important to frame the issue with specific benchmarks relevant to specific, supposed volumes of data and request/responses.

However, if crates.io were to offer something up in the future:

Dynamic Options - New (Scaling Up)

A dynamic API request/response would avoid the client consuming more data than it requires, at the cost of the client having to serialise separate requests for separate responses unless a collections pattern is used. Scaling the database could also be solved with read-only replicas that operate completely in RAM (it's only half a GB for the metadata) for serving these responses. That would be additional infrastructure to maintain as a going concern, but it's an option with another type of tradeoff. However, if you add collections to the mix, caching becomes difficult, as everyone will be requesting their own metadata collections. Optimal for the client, though not necessarily implementation friendly, could be HTTP/2 multiplexing (potentially with gRPC streaming) or QUIC, avoiding many serialised request/response pairs, but then again infrastructure costs and feature additions have to be weighed as tradeoffs of providing a dynamic API option, with or without collections.

Database Option - Existing, re-using Mirror functionality

The database dump is 138 MB compressed (518 MB uncompressed) and contains all the metadata in .csv form. As I understood, there are some unknown difficulties in re-using this already provided functionality together with keeping an up-to-date index and consuming the commits on top of it dynamically to stay current. Edge case: the current crates.io database keeps only one unique URL per crate, so there may be an edge case when diffing between versions where the repository URL has changed.

Let's say someone provides an easily consumable community crate abstracting that data and making it easy and efficient to consume on the client end. Would that help make it more tolerable (other than the dependency shock) in this type of consumption setting, given that it could be state-shared from the beginning and then only dynamically request URLs for crates that are less than 24 hours old and not yet present in the .csv provided to mirrors? It could have a transitional trait interface that gets changed as an implementation detail when/if there is a new, more client-friendly underlying API to consume; say a crates-metaindex SDK-type crate that provides an interface for any application requiring the most efficient available metadata consumption, like the crates-index crate currently aims to abstract crates.io-index.
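As a rough illustration of what such a community crate (or the small local mirror container mentioned earlier) might do internally, here is a sketch that loads the crate-to-repository-URL mapping from the database dump's crates.csv into memory and answers lookups locally with no network traffic. It assumes the dump has already been downloaded and extracted, and that data/crates.csv exposes name and repository columns; check the dump's own documentation for the actual layout. It uses the csv crate.

```rust
// Load the crate -> repository URL mapping from the extracted db dump into RAM once,
// then answer lookups locally without hitting crates.io at all.
use std::collections::HashMap;
use std::error::Error;

fn load_repository_urls(crates_csv: &str) -> Result<HashMap<String, String>, Box<dyn Error>> {
    let mut reader = csv::Reader::from_path(crates_csv)?;

    // Resolve column positions from the header row instead of hard-coding indices.
    let headers = reader.headers()?.clone();
    let name_idx = headers.iter().position(|h| h == "name").ok_or("no `name` column")?;
    let repo_idx = headers.iter().position(|h| h == "repository").ok_or("no `repository` column")?;

    let mut map = HashMap::new();
    for record in reader.records() {
        let record = record?;
        let name = record.get(name_idx).unwrap_or_default();
        let repo = record.get(repo_idx).unwrap_or_default();
        if !name.is_empty() && !repo.is_empty() {
            map.insert(name.to_string(), repo.to_string());
        }
    }
    Ok(map)
}

fn main() -> Result<(), Box<dyn Error>> {
    let urls = load_repository_urls("data/crates.csv")?;
    println!("loaded {} repository URLs", urls.len());
    println!("serde_json -> {:?}", urls.get("serde_json"));
    Ok(())
}
```

Re-running this once a day against a freshly downloaded dump, and only falling back to the rate-limited API for crates newer than the dump, is the shape of the hybrid I have in mind.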
Static Options

Static - Extending the current object cache

Just as people currently bypass the API redirect, there could be a .json alongside the .crate. It would carry some of the available static metadata relevant to the crate version, including the repository URL that was provided in that version's Cargo.toml manifest. It has to be optimised for commonly used data to reduce traffic; a balancing act. Also, for the sake of completeness, /api/v1/crates/crate/version/repository could simply do an HTTP redirect to the said repository URL. To consume that static .json from the object cache a client would have to serialise multiple requests and download more data than it potentially needs, but at least it would not be a dynamic response with server processing, traffic and opportunistic caching involved. Naturally the object count would double in order to provide the .json, but it would be relatively convenient to implement alongside pushing the .crate out.

Static - Using the current github rust-lang/crates.io-index

The current rust-lang/crates.io-index is primarily intended for fast cargo dependency resolution. Adding a URL of, say, 30-60 bytes per version would add 15-30 MB to it and goes outside its intended primary purpose.

Static - Adding another rust-lang/crates.io-metadata github index

If we keep metadata unique per crate, the repository URL can still change between crate versions, which makes it more difficult for diffing purposes and for tracking changes, with no guarantee everything is captured. Yes, this is an edge case, but something similar could work where, instead of providing static metadata as crate/crate-version.json, a single crate.json keys the URL per crate rather than per crate/version. Another alternative is to have a branch per metadata field, e.g. repository-url, and then index based on changes in the value part, like the advisory-db does for version matching at https://github.com/rustsec/advisory-db. Say the repository-url branch at path fo/ob/foobar could contain:

{"latest": "0.2.0", "current_val": "latest_version_url"}
{"versions_min": "0.2.0", "val": "version_url"}
{"versions_min": "0.1.0", "versions_max": "0.1.9", "val": "version_url"} The above would ensure
However, a client would have to understand this format, hence the first "latest" line, which naive clients could use if they assume a single unique repository URL per crate instead of per crate/version. Realistically nobody would be using this directly anyway, but rather via some abstracted interface like crates-index provides.

Abstract Common Interface consumer crate

I think, for now and also with the consumer's future in mind, the most efficient, easy and quick way would be to implement an abstracted community crate whose hidden implementation detail evolves based on what crates.io can provide and what the consumption pattern/volume is, given the current and future features and limitations of crates.io. I've started to detail the consumer crate here: #4505. It would also provide the motivation, as well as continuous visibility, to improve benchmark performance under specific, quantified access pattern/volume scenarios.
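For illustration, the transitional trait interface mentioned above could look roughly like this. All names are hypothetical; the point is only that consumers code against the trait while the backend can be swapped later.

```rust
// Hypothetical sketch of a crates-metaindex-style SDK trait. Consumers depend on this
// interface; the backing implementation (db dump mirror today, a dedicated crates.io
// endpoint tomorrow) can change without touching the callers.
pub trait RepositoryUrlSource {
    type Error;

    /// Repository URL for a specific crate version, if one is known.
    fn repository_url(&self, crate_name: &str, version: &str) -> Result<Option<String>, Self::Error>;

    /// Refresh whatever backing data the implementation uses (re-download the dump,
    /// pull the index, invalidate caches, ...). A no-op for purely dynamic backends.
    fn refresh(&mut self) -> Result<(), Self::Error>;
}

// One possible backend: the in-memory map loaded from the database dump, falling back
// to the rate-limited API only for crates newer than the dump.
// struct DumpBackedSource { urls: std::collections::HashMap<String, String>, /* ... */ }
// impl RepositoryUrlSource for DumpBackedSource { /* ... */ }
```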
-
FYI: There's now a Pull Request open for this: #4548
-
Hi,
this discussion got started (again) from an issue over in Renovate.
For those who don't know: Renovate is a bot that can automatically update your dependencies by creating Pull Requests, similar to GitHub's Dependabot.
EDIT: There's now a PR for this: #4548. It has since been merged.
These types of bots support showing changelogs/release notes, and they mostly get those by crawling the source repositories for CHANGELOG files or GitHub releases. To avoid confusion: this is not about the source code package that can be downloaded from crates.io, but about an actual link to the source repository.
To support this they need to translate a crate name into a URL.
Currently this source URL is not exposed in the package files in the index; it is, however, available via the metadata API (e.g. https://crates.io/api/v1/crates/serde_json). That API is subject to rate limiting.
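For illustration, this is roughly what such a tool has to do today for every crate it looks up: one rate-limited API call just to read a single field. This is a sketch; the reqwest-based client and the User-Agent string are illustrative, while the endpoint and the `crate.repository` field in the response come from the existing API.

```rust
// One rate-limited call to the crates.io metadata API, only to extract the repository link.
use serde_json::Value;

fn repository_url(name: &str) -> Result<Option<String>, Box<dyn std::error::Error>> {
    let url = format!("https://crates.io/api/v1/crates/{name}");
    let client = reqwest::blocking::Client::builder()
        // crates.io expects API clients to identify themselves.
        .user_agent("example-changelog-bot (contact@example.com)")
        .build()?;

    let body: Value = client.get(&url).send()?.error_for_status()?.json()?;

    // The full response carries much more (versions, keywords, download counts, ...);
    // this use case needs only the repository link.
    Ok(body
        .pointer("/crate/repository")
        .and_then(Value::as_str)
        .map(str::to_owned))
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    println!("{:?}", repository_url("serde_json")?);
    Ok(())
}
```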
Renovate has, for implementation reasons, a hard time implementing a global rate limit, because it uses multiple VMs for crawling which don't share any state.
@Turbo87 and I had a quick look at how that metadata response is constructed, and it looks like it makes eight database requests to construct that page, which is not optimal.
For the use case mentioned above, almost none of that data is needed though. We've contemplated the existing options, which you can see discussed in these two comments in particular and the whole thread in general:
Add sourceUrl/homepage lookup from crates API for cargo datasource renovatebot/renovate#3486 (comment)
Add sourceUrl/homepage lookup from crates API for cargo datasource renovatebot/renovate#3486 (comment)
I personally believe none of the existing options are a good solution for the task at hand:
Use the existing API -> not viable due to the rate limit (see below)
Use the DB dump -> this is not easy to integrate/use, definitely not as easy as calling an API endpoint; it requires having some DB, keeping it up to date, keeping shared state, etc.
Together we came up with these ideas on how changes in crates.io could help:
A robust registry API with CDN backing and no rate limiting, e.g. like npm
Change the package file JSON files in the index to include the necessary information (see the sketch after this list)
Add a separate endpoint to the API that has the bare minimum information needed (which should require only a single DB request) which could then be exempt from the rate limit or have a different limit applied
Improve the existing endpoint to allow selecting just a subset of fields needed (this could be done in a backwards compatible way I believe), but this would make the rate limiting harder
Change the rate limit policy to exclude a) this endpoint or b) these specific tools
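To make the index option above more concrete (the "see the sketch after this list" item), here is a hedged sketch of what an index line could look like if it carried the repository URL. The name/vers/deps/cksum/features/yanked fields exist in today's index lines; the `repository` field is the hypothetical addition and does not exist today.

```rust
// Parse a hypothetical index line that carries a repository field.
use serde::Deserialize;

#[derive(Debug, Deserialize)]
struct IndexLine {
    name: String,
    vers: String,
    // Hypothetical new field; `Option` plus a default keeps old lines parseable.
    #[serde(default)]
    repository: Option<String>,
    // deps, cksum, features, yanked, ... omitted for brevity.
}

fn main() -> Result<(), serde_json::Error> {
    // What a line in the index file `se/rd/serde_json` could look like with the change
    // (heavily truncated; real lines carry the full dependency list and checksum).
    let line = r#"{"name":"serde_json","vers":"1.0.79","repository":"https://github.com/serde-rs/json"}"#;
    let entry: IndexLine = serde_json::from_str(line)?;
    println!("{} {} -> {:?}", entry.name, entry.vers, entry.repository);
    Ok(())
}
```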
I'm sure there are more ideas; this is not a finished list and we're open to suggestions.
There are obviously also options to fix this on the user end (e.g. caching) but I believe that this is a worthwhile thing to improve on the crates.io side because chances are that other similar tools will have the same requirements.
Any opinions on whether this is worthwhile to tackle, and any preferences on which option you'd support?
Thank you!