Rewrite/update compiler source code chapter #765

Merged 10 commits on Aug 3, 2020
342 changes: 207 additions & 135 deletions src/compiler-src.md
@@ -1,137 +1,209 @@
# High-level overview of the compiler source

## Crate structure

The main Rust repository consists of a `src` directory, under which
there live many crates. These crates contain the sources for the
standard library and the compiler. This document, of course, focuses
on the latter.

Rustc consists of a number of crates, including `rustc_ast`,
`rustc`, `rustc_target`, `rustc_codegen`, `rustc_driver`, and
many more. The source for each crate can be found in a directory
like `src/libXXX`, where `XXX` is the crate name.

(N.B. The names and divisions of these crates are not set in
stone and may change over time. For the time being, we tend towards a
finer-grained division to help with compilation time, though as incremental
compilation improves, that may change.)

The dependency structure of these crates is roughly a diamond:

```text
rustc_driver
/ | \
/ | \
/ | \
/ v \
rustc_codegen rustc_borrowck ... rustc_metadata
\ | /
\ | /
\ | /
\ v /
rustc_middle
|
v
rustc_ast
/ \
/ \
rustc_span rustc_builtin_macros
```

The `rustc_driver` crate, at the top of this lattice, is effectively
the "main" function for the Rust compiler. It doesn't have much "real
code", but instead ties together all of the code defined in the other
crates and defines the overall flow of execution. (As we transition
more and more to the [query model], however, the
"flow" of compilation is becoming less centrally defined.)

At the other extreme, the `rustc_middle` crate defines the common and
pervasive data structures that all the rest of the compiler uses
(e.g. how to represent types, traits, and the program itself). It
also contains some amount of the compiler itself, although that is
relatively limited.

Finally, all the crates in the bulge in the middle define the bulk of
the compiler – they all depend on `rustc_middle`, so that they can make use
of the various types defined there, and they export public routines
that `rustc_driver` will invoke as needed (more and more, what these
crates export are "query definitions", but those are covered later
on).

Below `rustc_middle` lie various crates that make up the parser and error
reporting mechanism. They are also an internal part
of the compiler and not intended to be stable (though they do wind up
getting used by some crates in the wild; a practice we hope to
gradually phase out).

## The main stages of compilation

The Rust compiler is in a bit of transition right now. It used to be a
purely "pass-based" compiler, where we ran a number of passes over the
entire program, and each did a particular check or transformation. We
are gradually replacing this pass-based code with an alternative setup
based on on-demand **queries**. In the query-model, we work backwards,
executing a *query* that expresses our ultimate goal (e.g. "compile
this crate"). This query in turn may make other queries (e.g. "get me
a list of all modules in the crate"). Those queries make other queries
that ultimately bottom out in the base operations, like parsing the
input, running the type-checker, and so forth. This on-demand model
permits us to do exciting things like only do the minimal amount of
work needed to type-check a single function. It also helps with
incremental compilation. (For details on defining queries, check out
the [query model].)
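
To make the "work backwards" idea concrete, here is a minimal sketch of an on-demand, memoized query engine. It is a toy under stated assumptions, not the compiler's actual implementation: the `Query` enum, the `QueryCtxt` type, and the stand-in "parse" and "type-check" computations are all hypothetical names invented for illustration.

```rust
use std::collections::HashMap;

/// Hypothetical query keys; the real compiler's queries are far richer.
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
enum Query {
    /// "Parse" the whole input (here: count the functions).
    Parse,
    /// "Type-check" the function with the given index.
    TypeCheck(usize),
}

/// Toy query context: the input, a memoization cache, and a counter of
/// how many queries were actually computed rather than served from cache.
struct QueryCtxt {
    source: Vec<String>,
    cache: HashMap<Query, usize>,
    computed: usize,
}

impl QueryCtxt {
    fn new(source: &[&str]) -> Self {
        QueryCtxt {
            source: source.iter().map(|s| s.to_string()).collect(),
            cache: HashMap::new(),
            computed: 0,
        }
    }

    /// Answer a query, computing it only the first time it is requested.
    fn get(&mut self, q: Query) -> usize {
        if let Some(&v) = self.cache.get(&q) {
            return v;
        }
        let v = match &q {
            // Base operation: nothing else to ask for.
            Query::Parse => self.source.len(),
            // Derived query: it first demands the parse result.
            Query::TypeCheck(i) => {
                let n = self.get(Query::Parse);
                assert!(*i < n, "no such function");
                // Stand-in for real work: the "type" is the body length.
                self.source[*i].len()
            }
        };
        self.computed += 1;
        self.cache.insert(q, v);
        v
    }
}

fn main() {
    let mut tcx = QueryCtxt::new(&["fn a() {}", "fn bb() {}"]);
    // Asking about one function only runs the queries it depends on.
    let ty = tcx.get(Query::TypeCheck(1));
    println!("result = {}, queries computed = {}", ty, tcx.computed);
}
```

Asking for `TypeCheck(1)` never touches function `0`, and repeating the request is served from the cache, which is the property that makes this model a good fit for incremental compilation.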

Regardless of the general setup, the basic operations that the
compiler must perform are the same. The only thing that changes is
whether these operations are invoked front-to-back, or on demand. In
order to compile a Rust crate, these are the general steps that we
take:

1. **Parsing input**
- this processes the `.rs` files and produces the AST
("abstract syntax tree")
- the AST is defined in `src/librustc_ast/ast.rs`. It is intended to match the lexical
syntax of the Rust language quite closely.
2. **Name resolution, macro expansion, and configuration**
- once parsing is complete, we process the AST recursively, resolving
paths and expanding macros. This same process also processes `#[cfg]`
nodes, and hence may strip things out of the AST as well.
3. **Lowering to HIR**
- Once name resolution completes, we convert the AST into the HIR,
or "[high-level intermediate representation]". The HIR is defined in
`src/librustc_middle/hir/`; that module also includes the [lowering] code.
- The HIR is a lightly desugared variant of the AST. It is more processed
than the AST and more suitable for the analyses that follow.
It is **not** required to match the syntax of the Rust language.
- As a simple example, in the **AST**, we preserve the parentheses
that the user wrote, so `((1 + 2) + 3)` and `1 + 2 + 3` parse
into distinct trees, even though they are equivalent. In the
HIR, however, parentheses nodes are removed, and those two
expressions are represented in the same way.
3. **Type-checking and subsequent analyses**
- An important step in processing the HIR is to perform type
checking. This process assigns types to every HIR expression,
for example, and also is responsible for resolving some
"type-dependent" paths, such as field accesses (`x.f` – we
can't know what field `f` is being accessed until we know the
type of `x`) and associated type references (`T::Item` – we
can't know what type `Item` is until we know what `T` is).
- Type checking creates "side-tables" (`TypeckTables`) that include
the types of expressions, the way to resolve methods, and so forth.
- After type-checking, we can do other analyses, such as privacy checking.
4. **Lowering to MIR and post-processing**
- Once type-checking is done, we can lower the HIR into MIR ("middle IR"),
which is a **very** desugared version of Rust, well suited to borrowck
but also to certain high-level optimizations.
5. **Translation to LLVM and LLVM optimizations**
- From MIR, we can produce LLVM IR.
- LLVM then runs its various optimizations, which produces a number of
`.o` files (one for each "codegen unit").
6. **Linking**
- Finally, those `.o` files are linked together.
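
The parentheses example from the lowering step can be sketched with two toy expression types. This is only an illustration under assumptions: `AstExpr`, `HirExpr`, and `lower` are hypothetical names, and the real AST and HIR are much larger.

```rust
/// Toy AST: keeps parentheses exactly as the user wrote them.
#[derive(Debug, PartialEq)]
enum AstExpr {
    Lit(i64),
    Add(Box<AstExpr>, Box<AstExpr>),
    Paren(Box<AstExpr>),
}

/// Toy HIR: has no parenthesis node; grouping lives in the tree shape.
#[derive(Debug, PartialEq)]
enum HirExpr {
    Lit(i64),
    Add(Box<HirExpr>, Box<HirExpr>),
}

/// Lower the toy AST to the toy HIR, dropping `Paren` nodes.
fn lower(e: &AstExpr) -> HirExpr {
    match e {
        AstExpr::Lit(n) => HirExpr::Lit(*n),
        AstExpr::Add(l, r) => HirExpr::Add(Box::new(lower(l)), Box::new(lower(r))),
        AstExpr::Paren(inner) => lower(inner),
    }
}

// Small constructors to keep the examples readable.
fn lit(n: i64) -> Box<AstExpr> {
    Box::new(AstExpr::Lit(n))
}
fn add(l: Box<AstExpr>, r: Box<AstExpr>) -> Box<AstExpr> {
    Box::new(AstExpr::Add(l, r))
}
fn paren(e: Box<AstExpr>) -> Box<AstExpr> {
    Box::new(AstExpr::Paren(e))
}

/// `((1 + 2) + 3)` as a hand-built toy AST.
fn with_parens() -> Box<AstExpr> {
    paren(add(paren(add(lit(1), lit(2))), lit(3)))
}

/// `1 + 2 + 3` (left-associative) as a hand-built toy AST.
fn without_parens() -> Box<AstExpr> {
    add(add(lit(1), lit(2)), lit(3))
}

fn main() {
    // Distinct as ASTs, identical once lowered.
    assert_ne!(with_parens(), without_parens());
    assert_eq!(lower(&with_parens()), lower(&without_parens()));
    println!("{:?}", lower(&with_parens()));
}
```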


[query model]: query.html
[high-level intermediate representation]: hir.html
[lowering]: lowering.html
> **NOTE**: The structure of the repository is going through a lot of
> transitions. In particular, we want to get to a point eventually where the
> top-level directory has separate directories for the compiler, build-system,
> std libs, etc, rather than one huge `src/` directory.
>
> As of this writing, the std libs have been moved to `library/` and there is
> an ongoing MCP to move the compiler to `compiler/`.

Now that we have [seen what the compiler does](./overview.md), let's take a
look at the structure of the contents of the rust-lang/rust repo.

## Workspace structure

The `rust-lang/rust` repository consists of a single large cargo workspace
containing the compiler, the standard libraries (`core`, `alloc`, `std`,
`proc_macro`, etc), and `rustdoc`, along with the build system and a bunch of
tools and submodules for building a full Rust distribution.

As of this writing, this structure is gradually undergoing some transformation
to make it a bit less monolithic and more approachable, especially to
newcomers.

The repository consists of a `src` directory, under which there live many
crates, which are the source for the compiler, build system, tools, etc. This
directory is currently being broken up to be less monolithic. There is also a
`library/` directory, where the standard libraries (`core`, `alloc`, `std`,
`proc_macro`, etc) live.

## Standard library

The standard library crates are all in `library/`. They have intuitive names
like `std`, `core`, `alloc`, etc. There is also `proc_macro`, `test`, and
other runtime libraries.

This code is fairly similar to most other Rust crates except that it must be
built in a special way because it can use unstable features.

## Compiler

> You may find it helpful to read [The Overview Chapter](./overview.md) first,
> which gives an overview of how the compiler works. The crates mentioned in
> this section implement the compiler.
>
> NOTE: As of this writing, the crates all live in `src/`, but there is an MCP
> to move them to a new `compiler/` directory.

The compiler crates all have names starting with `librustc_*`. These are a
collection of around 50 interdependent crates ranging in size from tiny to
huge. There is also the `rustc` crate which is the actual binary (i.e. the
`main` function); it doesn't actually do anything besides calling the
`rustc_driver` crate, which drives the various parts of compilation in other
crates.

The dependency structure of these crates is complex, but roughly it is
something like this:

- `rustc` (the binary) calls [`rustc_driver::main`][main].
- [`rustc_driver`] depends on a lot of other crates, but the main one is
[`rustc_interface`].
- [`rustc_interface`] depends on most of the other compiler crates. It
is a fairly generic interface for driving the whole compilation.
- Most of the other `rustc_*` crates depend on [`rustc_middle`],
which defines a lot of central data structures in the compiler.
- [`rustc_middle`] and most of the other crates depend on a
handful of crates representing the early parts of the
compiler (e.g. the parser), fundamental data structures (e.g.
[`Span`]), or error reporting: [`rustc_data_structures`],
[`rustc_span`], [`rustc_errors`], etc.

[main]: https://doc.rust-lang.org/nightly/nightly-rustc/rustc_driver/fn.main.html
[`rustc_driver`]: https://doc.rust-lang.org/nightly/nightly-rustc/rustc_driver/index.html
[`rustc_interface`]: https://doc.rust-lang.org/nightly/nightly-rustc/rustc_interface/index.html
[`rustc_middle`]: https://doc.rust-lang.org/nightly/nightly-rustc/rustc_middle/index.html
[`rustc_data_structures`]: https://doc.rust-lang.org/nightly/nightly-rustc/rustc_data_structures/index.html
[`rustc_span`]: https://doc.rust-lang.org/nightly/nightly-rustc/rustc_span/index.html
[`Span`]: https://doc.rust-lang.org/nightly/nightly-rustc/rustc_span/struct.Span.html
[`rustc_errors`]: https://doc.rust-lang.org/nightly/nightly-rustc/rustc_errors/index.html

You can see the exact dependencies by reading the `Cargo.toml` for the various
crates, just like a normal Rust crate.

One final thing: [`src/llvm-project`] is a submodule for our fork of LLVM.
During bootstrapping, LLVM is built, and the [`src/librustc_llvm`] and
[`src/rustllvm`] crates contain Rust wrappers around LLVM (which is written in
C++) so that the compiler can interface with it.

Most of this book is about the compiler, so we won't have any further
explanation of these crates here.

[`src/llvm-project`]: https://github.com/rust-lang/rust/tree/master/src
[`src/librustc_llvm`]: https://github.com/rust-lang/rust/tree/master/src
[`src/rustllvm`]: https://github.com/rust-lang/rust/tree/master/src

### Big picture

The dependency structure is influenced strongly by two main factors:

1. Organization. The compiler is a _huge_ codebase; it would be an impossibly
Member

Out of curiosity I ran tokei on it - 1.7 million lines of code.

Member Author

Did you exclude LLVM? That sounds a bit higher than I remember (~500K), but I might be wrong.

Member

That was only counting lines of rust.

-------------------------------------------------------------------------------
 Language            Files        Lines         Code     Comments       Blanks
-------------------------------------------------------------------------------
 Alex                    4           85           67            0           18
 Assembly                3         1375          756          402          217
 GNU Style Assembly   6740      1516071       721657       473426       320988
 Autoconf              354        38316        22408        10153         5755
 Automake               12         6065         5149          162          754
 BASH                   10         1791         1387          226          178
 Batch                  18         1598         1342           50          206
 C                    6315       934950       529392       304222       101336
 C Header             7670      1448926       897283       335737       215906
 CMake                1196        59281        47496         5105         6680
 C#                      8          766          571          106           89
 C++                 20444      5295316      3766548       869578       659190
 C++ Header             29        11544         9675          809         1060
 CSS                    43         7378         6041          295         1042
 D                       1           18           16            0            2
 Dockerfile             89         2963         2174          308          481
 .NET Resource           2          269          132          113           24
 Emacs Lisp              9         1296          910          241          145
 FORTRAN Modern          8          398            0          371           27
 GDB Script              1           41           13           15           13
 Go                   1912       482175       362283        71149        48743
 Handlebars              4          138          116            7           15
 HEX                     1           15           15            0            0
 HTML                  186       462384       444556         4440        13388
 INI                     3           30           25            0            5
 JavaScript             55         8000         6409          886          705
 JSON                  144        10028        10028            0            0
 LLVM                23437      3668320      1477481      1865951       324888
 LD Script               1            4            4            0            0
 Lua                     1            7            5            0            2
 Makefile              804         6620         4481          697         1442
 Markdown             1936       171532       171532            0            0
 Module-Definition     118        25860        23250           72         2538
 MSBuild                 8          597          485           88           24
 OCaml                  89        11140         5997         3210         1933
 Objective-C          1721       111733        62368        30723        18642
 Objective-C++         496        37842        24784         6426         6632
 Pascal                 10          316           54          228           34
 Perl                   22         9025         5851         1612         1562
 Protocol Buffers        2          173          118           31           24
 Python               1673       237900       172805        24597        40498
 R                       1           15            7            5            3
 RPM Specfile            1          469          390            0           79
 ReStructuredText     1220       219333       219333            0            0
 Ruby                    1          125           89           23           13
 Rust                19032      1672465      1202002       276415       194048
 Scala                   6           54           54            0            0
 Shell                 246        65387        48964         9122         7301
 SVG                    53         5019         4661          358            0
 Swift                   2           23           17            0            6
 SWIG                   69        10949         7683          512         2754
 TeX                     6         3374         2992           22          360
 Plain Text           1506       517261       517261            0            0
 TOML                  797         8688         7335          179         1174
 TypeScript             25         3784         2906          345          533
 Vim Script             12          593          484           41           68
 Visual Studio Sol|      2           47           47            0            0
 XSL                     3          278          226           14           38
 XML                    51       135907       134752           18         1137
 YAML                  747        73902        59039        12268         2595
-------------------------------------------------------------------------------
 Total               99359     17289959     10993906      4310758      1985295
-------------------------------------------------------------------------------

large crate. In part, the dependency structure reflects the code structure
of the compiler.
2. Compile time. By breaking the compiler into multiple crates, we can take
better advantage of incremental/parallel compilation using cargo. In
particular, we try to have as few dependencies between crates as possible so
that we don't have to rebuild as many crates when one of them changes.
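
The compile-time argument can be illustrated by computing which crates need rebuilding after an edit. The sketch below is hedged: the edge list is a heavily simplified assumption based on the crates named in this chapter (the authoritative edges are in each crate's `Cargo.toml`), and `deps` and `rebuild_set` are hypothetical helper names.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Simplified, assumed edges: crate -> crates it depends on directly.
fn deps() -> BTreeMap<&'static str, Vec<&'static str>> {
    BTreeMap::from([
        ("rustc_driver", vec!["rustc_interface"]),
        ("rustc_interface", vec!["rustc_lint", "rustc_middle"]),
        ("rustc_lint", vec!["rustc_middle"]),
        ("rustc_middle", vec!["rustc_errors", "rustc_span"]),
        ("rustc_errors", vec!["rustc_span"]),
        ("rustc_span", vec![]),
    ])
}

/// Crates that must be rebuilt when `changed` is edited: the crate itself
/// plus everything that depends on it, directly or transitively.
fn rebuild_set<'a>(
    deps: &BTreeMap<&'a str, Vec<&'a str>>,
    changed: &'a str,
) -> BTreeSet<&'a str> {
    let mut dirty = BTreeSet::from([changed]);
    // Repeat until no new crate is marked dirty (a simple fixed point).
    let mut grew = true;
    while grew {
        grew = false;
        for (krate, ds) in deps {
            if !dirty.contains(krate) && ds.iter().any(|d| dirty.contains(d)) {
                dirty.insert(*krate);
                grew = true;
            }
        }
    }
    dirty
}

fn main() {
    let graph = deps();
    for changed in ["rustc_lint", "rustc_span"] {
        let set = rebuild_set(&graph, changed);
        println!("edit {} -> rebuild {} crate(s): {:?}", changed, set.len(), set);
    }
}
```

In this toy graph, editing a crate near the top dirties only a few crates, while editing one at the bottom dirties everything, which is why keeping edges between crates to a minimum pays off.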

At the very bottom of the dependency tree are a handful of crates that are used
by the whole compiler (e.g. [`rustc_span`]). The very early parts of the
compilation process (e.g. parsing and the AST) depend on only these.

Pretty soon after the AST is constructed, the compiler's [query system][query]
gets set up. The query system is set up in a clever way using function
pointers. This allows us to break dependencies between crates, allowing more
parallel compilation.
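
As a rough illustration of how function pointers can break a dependency edge, the sketch below uses modules as stand-ins for crates. Everything in it is a simplification invented for this example: the `middle`, `typeck`, and `lint` modules, the single `type_of` entry, and the `build_ctxt` helper are assumptions, not the compiler's real definitions.

```rust
/// Stand-in for a low-level crate: it declares the table of providers and
/// how to call through it, but implements no query itself.
mod middle {
    pub struct Providers {
        pub type_of: fn(&Ctxt, u32) -> String,
    }

    /// Placeholder used until a real provider is registered.
    fn missing(_: &Ctxt, _: u32) -> String {
        panic!("no provider registered for `type_of`")
    }

    impl Default for Providers {
        fn default() -> Self {
            Providers { type_of: missing }
        }
    }

    pub struct Ctxt {
        pub providers: Providers,
    }

    impl Ctxt {
        /// Callers go through the function pointer, not a concrete crate.
        pub fn type_of(&self, def_id: u32) -> String {
            (self.providers.type_of)(self, def_id)
        }
    }
}

/// Stand-in for a crate that implements a query. Depends only on `middle`.
mod typeck {
    use super::middle::{Ctxt, Providers};

    fn type_of(_cx: &Ctxt, def_id: u32) -> String {
        format!("Ty#{}", def_id)
    }

    pub fn provide(providers: &mut Providers) {
        providers.type_of = type_of;
    }
}

/// Stand-in for a crate that uses the query. It also depends only on
/// `middle`, so it could be compiled in parallel with `typeck`.
mod lint {
    use super::middle::Ctxt;

    pub fn describe(cx: &Ctxt, def_id: u32) -> String {
        format!("item {} has type {}", def_id, cx.type_of(def_id))
    }
}

/// Stand-in for the top-level wiring, which is the only place that knows
/// about every module.
fn build_ctxt() -> middle::Ctxt {
    let mut providers = middle::Providers::default();
    typeck::provide(&mut providers);
    middle::Ctxt { providers }
}

fn main() {
    println!("{}", lint::describe(&build_ctxt(), 7));
}
```

Note that `lint` and `typeck` never name each other; the price is that both depend on `middle`, which mirrors the trade-off described in the next paragraph.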

However, since the query system is defined in [`rustc_middle`], nearly all
subsequent parts of the compiler depend on this crate. It is a really large
crate, leading to long compile times. Some efforts have been made to move stuff
out of it with limited success. Another unfortunate side effect is that sometimes
related functionality gets scattered across different crates. For example,
linting functionality is scattered across earlier parts of the crate,
[`rustc_lint`], [`rustc_middle`], and other places.

[`rustc_lint`]: https://doc.rust-lang.org/nightly/nightly-rustc/rustc_lint/index.html

More generally, in an ideal world, it seems like there would be fewer, more
cohesive crates, with incremental and parallel compilation making sure compile
times stay reasonable. However, our incremental and parallel compilation haven't
gotten good enough for that yet, so breaking things into separate crates has
been our solution so far.

At the top of the dependency tree are the [`rustc_interface`] and
[`rustc_driver`] crates. [`rustc_interface`] is an unstable wrapper around the
query system that helps to drive the various stages of compilation. Other
consumers of the compiler may use this interface in different ways (e.g.
rustdoc or maybe eventually rust-analyzer). The [`rustc_driver`] crate first
parses command line arguments and then uses [`rustc_interface`] to drive the
compilation to completion.

[query]: ./query.md

[orgch]: ./overview.md

## rustdoc

The bulk of `rustdoc` is in [`librustdoc`]. However, the `rustdoc` binary
itself is [`src/tools/rustdoc`], which does nothing except call [`rustdoc::main`].

There is also JavaScript and CSS for the rustdocs in [`src/tools/rustdoc-js`]
and [`src/tools/rustdoc-themes`].

You can read more about rustdoc in [this chapter][rustdocch].

[`librustdoc`]: https://doc.rust-lang.org/nightly/nightly-rustc/rustdoc/index.html
[`rustdoc::main`]: https://doc.rust-lang.org/nightly/nightly-rustc/rustdoc/fn.main.html
[`src/tools/rustdoc`]: https://github.com/rust-lang/rust/tree/master/src/tools/rustdoc
[`src/tools/rustdoc-js`]: https://github.com/rust-lang/rust/tree/master/src/tools/rustdoc-js
[`src/tools/rustdoc-themes`]: https://github.com/rust-lang/rust/tree/master/src/tools/rustdoc-themes

[rustdocch]: ./rustdoc.md

## Tests

The test suite for all of the above is in [`src/test/`]. You can read more
about the test suite [in this chapter][testsch].

The test harness itself is in [`src/tools/compiletest`].

[testsch]: ./tests/intro.md

[`src/test/`]: https://github.com/rust-lang/rust/tree/master/src/test
[`src/tools/compiletest`]: https://github.com/rust-lang/rust/tree/master/src/tools/compiletest

## Build System

There are a number of tools in the repository just for building the compiler,
standard library, rustdoc, and so on, as well as for testing and for building
a full Rust distribution.

One of the primary tools is [`src/bootstrap`]. You can read more about
bootstrapping [in this chapter][bootstch]. The process may also use other tools
from `src/tools/`, such as [`tidy`] or [`compiletest`].

[`src/bootstrap`]: https://github.com/rust-lang/rust/tree/master/src/bootstrap
[`tidy`]: https://github.com/rust-lang/rust/tree/master/src/tools/tidy
[`compiletest`]: https://github.com/rust-lang/rust/tree/master/src/tools/compiletest

[bootstch]: ./building/bootstrapping.md

## Other

There are a lot of other things in the `rust-lang/rust` repo that are related
to building a full Rust distribution. Most of the time you don't need to worry
about them.

These include:
- [`src/ci`]: The CI configuration. This is actually quite extensive because we
run a lot of tests on a lot of platforms.
- [`src/doc`]: Various documentation, including submodules for a few books.
- [`src/etc`]: Miscellaneous utilities.
- [`src/tools/rustc-workspace-hack`], and others: Various workarounds to make
cargo work with bootstrapping.
- And more...

[`src/ci`]: https://github.com/rust-lang/rust/tree/master/src/ci
[`src/doc`]: https://github.com/rust-lang/rust/tree/master/src/doc
[`src/etc`]: https://github.com/rust-lang/rust/tree/master/src/etc
[`src/tools/rustc-workspace-hack`]: https://github.com/rust-lang/rust/tree/master/src/tools/rustc-workspace-hack