Make json file format more robust to versioning changes #130

staktrace · 2018-02-24T19:00:20Z

Right now I'm having a problem where trying to do a read_analysis_from_files is silently dropping data. After some debugging it looks like the JSON is failing to decode in some of the files, even though the JSON itself is valid. Considering that I generated the analysis data using a rust stable build and am trying to read it in a different environment using rust nightly with a possibly different version of rls-data it seems like that might be the problem.

As far as I can tell the json files are just the JSON representation of whatever the Analysis struct looks like in the code that generated the JSON. So it's not guaranteed to be portable at all.

I would like to make it so that at the very least the version of rls-data is baked into the JSON so that a user who is trying to read the analysis can quickly figure out which version of rls-data to use during decode. Or even better, actually make the schema well-defined and forwards-compatible so that future modifications to the Analysis struct will still generate JSON that can be (lossily, but with graceful degradation) parsed by decoders using an older Analysis.

The text was updated successfully, but these errors were encountered:

nrc · 2018-02-24T23:36:09Z

rls-data is intended to be an implementation detail and I think it would really restrict what we can do with it to guarantee forward compatibility. However, adding the version to the structs seems a sensible thing to do.

staktrace · 2018-02-25T05:35:53Z

I'd like to better understand the overall approach to rust tooling that you guys are taking. Here's how I see things, please correct me if I'm wrong and/or add context that I'm missing.

Right now the rust tooling story seems to be primarily focused on the RLS as that basically gets rust support for free in a variety of editors/IDEs that support the LSP. And in a nutshell, the RLS flow looks like this:

(1) source code --> (2) compiled by rustc --> (3) generates save-analysis data --> (4) which is consumed by RLS --> (5) used by editor of choice

The protocol between (1) and (2) is a "standardized" protocol (the rust language), and the protocol between (4) and (5) is another "standardized" protocol (the LSP). The transitions from (2) -> (3) and (3) -> (4) are basically whatever rls-data does, and that's why you treat it as an implementation detail.

This seems great for the RLS, but the RLS only enables a certain class of tooling, and there are other classes of tooling that matter too. Let's use searchfox as an example, because that's what I care most about at the moment and it's fairly concrete in my head. Something like searchfox can fit into the above picture in one of a few ways:

(1) source code --> (2) parsed by searchfox

This gives us some basic syntactic analysis but it would be hard to do anything more in-depth (like linking to definitions in faraway crates) without replicating much of rustc. A better solution is:

(1) source code --> (2) parsed by a searchfox wrapper around rustc with hooks into the compiler

This gives us better analysis, probably equivalent to what a clang plugin would provide for C++. This seems reasonable enough. This option:

(1) source code --> (2) compiled by rustc --> (3) generates save-analysis data --> (4) consumed by searchfox

is what we're doing currently, but it relies on "implementation detail" because it consumes a non-standardized protocol. It would even be theoretically possible to do something like this:

(1) source code --> (2) compiled by rustc --> (3) generates save-analysis data --> (4) which is consumed by RLS --> (5) scraped by searchfox

In this case searchfox would build its static index by querying the RLS, effectively scraping all the relevant info out of the RLS and preserving it in a snapshot of sorts. The snapshot (i.e. the searchfox webpages) would then be available without the RLS.

This last approach seems overly convoluted to me, and the first one (searchfox parsing source code) would also be overly convoluted because it requires reimplementing chunks of rustc. So that leaves the middle two options. Let's call those option A (with the wrapper around rustc) and option B (which is what searchfox is doing now, relying on implementation detail). Given that you don't like B, and given the rls-rustc shim you wrote, I presume your preference is that searchfox use option A.

Is this a fair summary of the situation?

The problem I have is that it's much easier to do option B than it is to do option A, even if I were starting from scratch. The shim you wrote enables the built-in save-analysis feature of the compiler, which is in the non-standard rls-data format, and therefore doesn't really count as a true option A. In other words, using the shim still produces the same JSON files as option B, and I still have to parse them. So in order to truly do option A, I have to go one step further than the shim and write my own hooks into the compiler, and do all the same work that the save-analysis code already does, and then produce an output file that's basically equivalent (in terms of information content) to the JSON file, but is in a format that I can reliably parse.

I don't know how hard it is to write compiler hooks, but even if it's not very hard, it seems awfully inefficient to do all this work when we could just modify the existing JSON format slightly in order to suit our needs. The most straightforward way to do this, I think, would be a semver-style versioning of the file format. Which I think we basically get for free if the rls-data crate's version number is written into the file, and bumped following semver: fixing bugs in the produced file would be accompanied by a patch level version number bump; adding new fields to the structures that can be ignored by older consumers can be done with a minor version number bump, and any drastic changes can be done with a major version bump. We'd also want to actually detect incompatible version changes when deserializing and report that as an error.

This would still allow you to make whatever changes you need to the file format, as long as you bump the version number accordingly. It wouldn't be guaranteed forward compatibility, but it would go partway there. A major version bump would require older consumers to do some work to parse the new format, but it wouldn't be a huge burden, and presumably that kind of version bump wouldn't happen too often.

The other thing that goes with this is, of course, stabilizing -Zsave-analysis, which is something I still want. :)

Sorry this turned out so long but it was helpful to me to think through this as I wrote it down, and I would appreciate your comments/feedback as well as any additional context on this.

nrc · 2018-02-25T08:22:44Z

(4) which is consumed by RLS

The key thing is that the RLS is not a monolithic thing. Rather than consider the compiler as one component and the RLS as another with a data API between the two, consider the compiler, the save-analysis data, and rls-analysis as one component, and the rest of the RLS as a separate component. Other tools (including searchfox today, and the new rustdoc - doxidize) just rls-analysis as an API on to the compiler's knowledge of a program.

So, you should use rls-analysis and if we're missing useful API there we can probably add it.

The other thing that goes with this is, of course, stabilizing -Zsave-analysis, which is something I still want. :)

This is unlikely to happen any time soon. Although I'm pretty keen to see some option stabilised for it, it's not needed for any core tools and the compiler team are very anti-stabilisation until we know what the long-term situation looks like.

@staktrace what tool are you trying to implement?

staktrace · 2018-02-25T15:04:29Z

I'm trying to make changes to searchfox so that it can scale up better to what is effectively indexing multiple codebases. Right now in order to index the rust code in m-c, we do a full m-c build as part of the indexing run on the searchfox indexer, and use rls-analysis to read the save-analysis files that were generated. Using this approach, if we want to also index mozilla-beta, mozilla-release, ESR branches, etc. we would need to build those completely as well, which would make the indexing process many hours long and generally doesn't scale.

So instead the plan is to have a taskcluster cron job in gecko's existing automation (e.g. this one and that will run on whatever branches we want to index. These all run in parallel, and then when the searchfox indexer runs, it will just download the save-analysis files from those jobs and use them.

The key thing here is that each of the codebases being indexed may be using a different version of rustc, and so the save-analysis files have different versions, and right now there's no way for the searchfox indexer to know which save-analysis files have which version.

So, you should use rls-analysis and if we're missing useful API there we can probably add it.

Just to be clear - we are using rls-analysis to parse these files. The code is here. The useful API that we're missing is that we want to be able to hand it save-analysis files of an unknown version, and have it parse them all. If the save-analysis file has a newer version than is supported, then it should report an error, and we can update that code to use a newer rls-analysis, which would parse the new version and all the old versions. In other words, the API should be robust and backwards-compatible to older save-analysis files. Right now if the API is given a file that isn't the exact right version it wants, it just silently drops it, which is less than ideal.

nrc · 2018-02-25T19:42:50Z

Thanks for the explanation - it's much clearer now what is going on and what you want. I had not planned for rls-analysis to be able to handle multiple versions of the save-analysis data, but it should in general be possible. The good news is that even though the data is not stable, it does not change much in practice. The bad news is that even quite minor changes can screw up the deserialisation. Working out which version of the data is compatible with which version of rls-analysis sounds like it won't be fun.

Anyway, adding version numbers to the save-analysis data sounds like an easy and positive first step.

Possibly interesting to you: the largest cost in indexing is usually serialisation and deserialisation of the data. In particular running with -Zsave-analysis slows down compilation fairly significantly just due to the serialisation cost. You can somewhat address this by configuring save-analysis to include less data, e.g., only including partial docs rather than full docs helps for well-documented crates (like std). Only including public and accessible items (i.e., skipping local variables and private items) helps a lot so if there are crates which are only used via an API, turning that on is a big win. However, there is not a good way to configure the save-analysis per-crate using Cargo right now. You might be able to do it though, depending on how you're building.

The rustc serialisation (and deserialisation) is much slower than Serde. If we could replace the former with the latter, we would save a lot of time. However, that is more involved than it sounds (and I have forgotten the specifics), but I think it could be do-able if the time is important.

staktrace · 2018-07-24T12:18:21Z

@nrc Currently, without any version numbers baked into the generated JSON, is there any way to determine what version of rls-analysis I should be using with a given rustc version?

nrc · 2018-07-24T21:06:09Z

Currently, without any version numbers baked into the generated JSON, is there any way to determine what version of rls-analysis I should be using with a given rustc version?

Not in general. For any branch in the Rust repo (or I guess you can look for the tagged commits), you can see what version of rls-analysis is used in the Cargo.lock file and that will be compatible.

staktrace · 2018-07-26T15:52:47Z

Ok, thanks.

Just to provide an update on this issue: rust-dev-tools/rls-data#16 added the version field to the data structure, and I have #147 open to make rls-analysis use the new rls-data and do a better job of detecting incompatible versions. Once that is merged and published, I'll make a PR against rustc to use the new rls-analysis and rls-data, and so nightly compilers from that point on will have the version field. Older save-analysis files will not have the version field, but will still be loadable by the new rls-analysis, because the version field is an Option and will just get loaded as None.

Beyond that, for the purposes of searchfox, I intend to do the rls-analysis step as part of the taskcluster job that does the code build, so that we can use an rls-analysis version that's compatible with the rustc compiler that's generating the save-analysis files. That step will produce files in a well-defined format that searchfox can use directly, so we won't need to run the rls-analysis step as part of the searchfox indexing run. That work isn't blocked on this versioning stuff but it will be easier to catch versioning mismatches once these PRs are in. I don't think there's anything else to be done here so I might as well close this issue.

staktrace mentioned this issue Jul 24, 2018

Add a version field in the rls-data Analysis structure rust-dev-tools/rls-data#16

Merged

staktrace closed this as completed Jul 26, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Make json file format more robust to versioning changes #130

Make json file format more robust to versioning changes #130

staktrace commented Feb 24, 2018

nrc commented Feb 24, 2018

staktrace commented Feb 25, 2018

nrc commented Feb 25, 2018

staktrace commented Feb 25, 2018

nrc commented Feb 25, 2018

staktrace commented Jul 24, 2018

nrc commented Jul 24, 2018

staktrace commented Jul 26, 2018

Make json file format more robust to versioning changes #130

Make json file format more robust to versioning changes #130

Comments

staktrace commented Feb 24, 2018

nrc commented Feb 24, 2018

staktrace commented Feb 25, 2018

nrc commented Feb 25, 2018

staktrace commented Feb 25, 2018

nrc commented Feb 25, 2018

staktrace commented Jul 24, 2018

nrc commented Jul 24, 2018

staktrace commented Jul 26, 2018