Propose code string literals #3450

Open
wants to merge 9 commits into
base: master
268 changes: 268 additions & 0 deletions text/0000-code-literals.md
@@ -0,0 +1,268 @@
- Feature Name: code_literals
- Start Date: 2023-06-18
- RFC PR: [rust-lang/rfcs#0000](https://github.com/rust-lang/rfcs/pull/0000)
- Rust Issue: [rust-lang/rust#0000](https://github.com/rust-lang/rust/issues/0000)

# Summary
[summary]: #summary

Add a new kind of multi-line string literal for embedding code which
plays nicely with `rustfmt`.

# Motivation
[motivation]: #motivation

- Embedding code as a literal string within a Rust program is often
necessary. A prominent example is the `sqlx` crate, which
has the user write SQL queries as string literals within the program.
- Rust already supports several kinds of multi-line string literal,
but none of them are well suited for embedding code.

1. Normal string literals, eg. `"a string literal"`. These can be
written over multiple lines, but require special characters
to be escaped. Whitespace is significant within the literal,
which means that `rustfmt` cannot fix the indentation of the
code block. For example, beginning with this code:

```rust
if some_condition {
    do_something_with(
        "
        a nicely
        indented code
        string
        "
    );
}
```

If the indentation is changed, such as by removing the
conditional, then `rustfmt` must re-format the code like so:

```rust
do_something_with(
    "
        a nicely
        indented code
        string
        "
);
```

To do otherwise would be to change the value of
the string literal.

2. Normal string literals with backslash escaping, eg.
```rust
"
    this way\
    whitespace at\
    the beginning\
    of lines can\
    be ignored\
"
```

This approach still suffers from the need to escape special
characters. The backslashes at the end of every line are
tedious to write, and are problematic if whitespace is
meaningful within the code. For example, if Python code
were being embedded, then the indentation would be lost.
Finally, although `rustfmt` could in principle reformat
these strings, in practice doing so in a reasonable way
is complicated and so this has never been enabled by default.

3. Raw string literals, eg. `r#"I can use "s!"#`

This solves the problem of special characters, but suffers
from the same inability to be reformatted, and the trick
of using an `\` at the end of each line cannot be applied
because escape characters are not recognised.
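The behaviour of the three existing literal kinds described above can be checked with today's Rust. The following is a minimal sketch; the function names `plain`, `continued` and `raw` are hypothetical and only serve the illustration.

```rust
/// 1. Normal literal: the line breaks and the leading spaces of each line
///    are part of the value, so re-indenting the source changes the string.
pub fn plain() -> &'static str {
    "
    SELECT 1
    "
}

/// 2. Backslash continuation: the newline and all leading whitespace on the
///    following line are skipped, so the embedded Python loses its indent.
pub fn continued() -> &'static str {
    "if a:\
        print(42)"
}

/// 3. Raw literal: quotes and backslashes need no escaping, but `\n` stays
///    a literal backslash followed by `n`, so the continuation trick from
///    case 2 is not available.
pub fn raw() -> &'static str {
    r#"say "hi" \n"#
}
```

Here `continued()` evaluates to `"if a:print(42)"`, which is not the Python that was intended, while `plain()` keeps whatever indentation the source file happens to have.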

# Guide-level explanation
[guide-level-explanation]: #guide-level-explanation

In addition to string literals and raw string literals, a third type
of string literal exists: code string literals.

```rust
let code = ```
This is a code string literal

I can use special characters like "" and \ freely.

Indentation is preserved *relative* to the indentation level
of the first line.

It is an error for a line to have "negative" indentation (ie. be
indented less than the indentation of the opening backticks) unless
the line is empty.
```;
```

`rustfmt` will automatically adjust the indentation of the code string
literal as a whole to match the surrounding context, but will never
change the relative indentation within such a literal.

Anything directly after the opening backticks is not considered
part of the string literal. It may be used as a language hint or
processed by macros (similar to the treatment of doc comments).

```rust
let sql = ```sql
SELECT * FROM table;
```;
```

Similar to raw string literals, there is no way to escape characters
within a code string literal. It is expected that procedural macros
would build upon code string literals to add support for such
functionality as required.

If it is necessary to include triple backticks within a code string
literal, more than three backticks may be used to enclose the
literal, eg.

```rust
let code = ````
    ```
    ````;
```

Member

Hmm, I'm torn here. Using the same thing as in doc comments makes sense, but at the same time, when we already have "use more #s to escape more", I don't feel amazing about also having a "use more `s to escape more" construct.

# Reference-level explanation
[reference-level-explanation]: #reference-level-explanation

A code string literal will begin and end with three or more backticks.
The number of backticks in the terminator must match the number used
to begin the literal.

The value of the string literal will be determined using the following
steps:

1. Start from the first newline after the opening backticks.
2. Take the string exactly as written until the closing backticks.
3. Remove equal numbers of spaces or tabs from every non-empty line
until the first character of the first non-empty line is neither
a space nor a tab, or until every line is empty.
Raise a compile error if this could not be done
due to a "negative" indent or inconsistent whitespace (eg. if
some lines are indented using tabs and some using spaces).

Here are some edge case examples:

```rust
// Empty string
assert_eq!(```foo
```, "");

// Newline
assert_eq!(```

```, "\n");

// No terminating newline
assert_eq!(```
bar```, "bar");

// Terminating newline
assert_eq!(```
bar
```, "bar\n");

// Preserved indent
assert_eq!(```
if a:
    print(42)
```, "if a:\n    print(42)\n");

// Relative indent
assert_eq!(```
    if a:
        print(42)
    ```, "if a:\n    print(42)\n");

// Relative to first non-empty line
assert_eq!(```


    if a:
        print(42)
    ```, "\n\nif a:\n    print(42)\n");
```

The text between the opening backticks and the first newline is
preserved within the AST, but is otherwise unused.
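The three steps above can be sketched as an ordinary function over the text found between the opening and closing backticks. This is only an illustration under stated assumptions, not the proposed compiler implementation: the names `code_literal_value` and `CodeLiteralError` are hypothetical, a line counts as "empty" only if it has zero characters, and `\r\n` line endings are not handled.

```rust
/// Errors the sketch can report (hypothetical names, not part of the RFC).
#[derive(Debug, PartialEq)]
pub enum CodeLiteralError {
    /// There is no newline after the opening backticks, so step 1 cannot start.
    MissingNewline,
    /// A non-empty line does not start with the common indent: either a
    /// "negative" indent or tabs mixed with spaces. `line` is 1-based
    /// within the literal's body.
    BadIndent { line: usize },
}

/// Takes the text between the opening and closing backticks and returns
/// `(info_text, value)`.
pub fn code_literal_value(inner: &str) -> Result<(&str, String), CodeLiteralError> {
    // Step 1: the value starts after the first newline; whatever precedes
    // it is the info text (for example a language hint).
    let (info, body) = inner
        .split_once('\n')
        .ok_or(CodeLiteralError::MissingNewline)?;

    // Step 3, part one: the indent to remove is the leading run of spaces
    // and tabs on the first non-empty line.
    let indent = body
        .split('\n')
        .find(|line| !line.is_empty())
        .map(|line| {
            let end = line
                .find(|c: char| c != ' ' && c != '\t')
                .unwrap_or(line.len());
            &line[..end]
        })
        .unwrap_or("");

    // Steps 2 and 3, part two: copy the body exactly as written, removing
    // `indent` from every non-empty line.
    let mut value = String::with_capacity(body.len());
    for (index, line) in body.split('\n').enumerate() {
        if index > 0 {
            value.push('\n');
        }
        if line.is_empty() {
            continue; // empty lines are exempt from the indent check
        }
        match line.strip_prefix(indent) {
            Some(rest) => value.push_str(rest),
            None => return Err(CodeLiteralError::BadIndent { line: index + 1 }),
        }
    }
    Ok((info, value))
}
```

Under these assumptions, `code_literal_value("sql\n    SELECT 1;\n    ")` yields the info text `"sql"` and the value `"SELECT 1;\n"`, and a body such as `"\n    a\n  b\n"` is rejected with `BadIndent { line: 2 }`. Comparing whole indent prefixes rather than counting characters is a design choice of the sketch: it makes a tab-indented line fail against a space-indented first line without a separate check.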

# Drawbacks
[drawbacks]: #drawbacks

The main drawback is increased complexity of the language:

1. It adds a new symbol to the language (the backtick), which was not previously used.
2. It adds a third way of writing string literals.

# Rationale and alternatives
Member

Another alternative I have mentioned on zulip:

Improve include! handling (when passed as literals to macros? in editors?) instead to make it more ergonomic to outline other-language code rather than inlining.

Pros:

  • works better with simple tools that don't handle nested languages well

  • establishes a new indent context, i.e. doesn't need to be adjusted with surrounding code which in my experience can be error-prone if the editor's indentation handling is imperfect. Examples of confusions:

    • inside comments
    • inside macros
    • inside doc comments
    • when current indentation is inconsistent with configured rules
    • when copy-pasting into a differently indented context
  • generally avoids stacking complexity

    some_proc_macro!{
       mod m {
          /// This is an example with nesting and several levels of indentation and whitespaces
          ///
          /// ```rust
          /// let p = h"python
          ///      def py():
          ///          a = '''Lorem ipsum dolor sit amet,
          ///          consectetur adipiscing elit,
          ///          sed do eiusmod tempor incididunt
          ///          ut labore et dolore magna aliqua.'''
          ///          print(a) 
          ///      ";
          /// ```
          ///
          fn nesting_fun() {}
       }
    }

Cons:

  • requires editor support if you want to view or even edit the included file in the context of its parent instead of opening a new view. But showing an overlay might be less complex than all the text nesting
  • context of substitutions may be harder to see

Since this is motivated by making things easier for rustfmt I recommend contacting the maintainers of other tools (syntax highlighters, editors, IDEs, ...) to see if this change helps or adds complexity for them.

Contributor Author

I don't consider this an alternative. Requiring powerful editor support to even use the feature makes it a no-go, and having to store things in separate files is a maintenance burden that's worse than the current situation, since it requires coming up with a naming scheme for those files that makes sense, makes it harder to resolve merge conflicts since tools like git will never understand this "magic include", and is way more complicated than what is proposed in this RFC.

The advantages you list I also consider to be problems with your approach. You say it works better with simple tools, but the opposite is true: you end up with something unworkable without powerful editor features. In contrast this RFC doesn't require any editor features at all to be an improvement over the status quo. Any support for nested language is an optional extra that doesn't affect the core functionality.

Your example of "stacking complexity" seems very straightforward tbh. Infinitely better than having to go to a separate file.

> Since this is motivated by making things easier for rustfmt I recommend contacting the maintainers of other tools (syntax highlighters, editors, IDEs, ...) to see if this change helps or adds complexity for them.

It by definition does not add any complexity for tools other than rustfmt, since the only required change as a result of this RFC is allowing a new prefix letter (h proposed here) and tools must already support that. Beside that, anything that is valid to do with a raw string literal is also valid to do with an h raw string literal.

Contributor

> It by definition does not add any complexity for tools other than rustfmt, since the only required change as a result of this RFC is allowing a new prefix letter (h proposed here) and tools must already support that. Beside that, anything that is valid to do with a raw string literal is also valid to do with an h raw string literal.

Anything that adds syntax complicates syn and any other tools that use it or otherwise parse rust code. I can't imagine that it would ever be safe to just assume that any string prefix acts like a regular string literal, since raw strings already violate that, hence individual new letters have to be added to anything that parses rust code, including syntax highlighters (although the backup behavior is usually good enough for these).

Member

> You end up with something unworkable without powerful editor features.

How? Even many simple editors at least have tabs, panes or similar UI elements to view more than one file at a time.

At its most primitive you rely on your window manager and file browser to open multiple files at the same time in separate windows and show them side by side.

> Any support for nested language is an optional extra that doesn't affect the core functionality.

A simple editor can have primitive syntax-highlighting that will work with separate files based on file extensions but won't work with inlined content. So this RFC makes things worse for simple editors

> makes it harder to resolve merge conflicts since tools like git will never understand this "magic include"

I don't see how it would make things more difficult for git? If anything it makes diffs simple due to fewer whitespace adjustments.

> Requiring powerful editor support to even use the feature makes it a no-go,

Where did I say that a powerful editor would be required? Rather, I'm suggesting:
a) improve powerful editors
b) keep things simple for simple editors

This covers both.

> Your example of "stacking complexity" seems very straightforward tbh. Infinitely better than having to go to a separate file.

What is straightforward about it? If you actually have to edit, indent, copy-paste, syntax-highlight or auto-complete that, there are lots of pitfalls.
Note the outer macro, which tends to make things more difficult for tools because at that point they might not even know anymore whether they're dealing with Rust or just things that happen to tokenize like Rust.
And it's rust -> macro -> markdown -> codeblock (with language annotation) -> multiline string (with another language annotation).
These languages could be configured to have different indent rules!

Member

I don't consider improving support for `include!`-like things to be a negative (the proc-macro-include RFC and the proc-macro-expand feature would both be great to have), but it's a feature for different use cases than this. This RFC improves support for things that people are already doing. Even if we had better forms of `include!`, I would not pull 3 lines of SQL out to a separate file just to get syntax highlighting; I would simply do what we do currently: use the existing string literals and fight with rustfmt every time the surrounding code changes.

Member

Separation of languages is the norm and should be encouraged. See the HTML/CSS/JS split that is encouraged instead of having inline script handlers and styles. See template files. See module trees.

You say my approach is a no-go because it makes things more difficult for simple editors. And yet you acknowledge that this RFC will primarily benefit complex editors. While I think my approach would benefit simple editors because they can then work with the outlined language.

> At the moment, these strings are in the file and so can be reviewed and have conflicts resolved in-place. By moving them to a separate file you can no longer perform these actions with any context about the surrounding code.

I assume they'd conventionally still be placed in the same directory and show up in the diffs next to each other.

> To make that at all workable you'd need a powerful editor to allow treating them as though they weren't in a separate file

Not necessarily. E.g. when you have an SQL query such as `query!(include!("query.psql"), param1="val", param2="val", ...)`, it has an API, like a function call. You edit functions separately and then fix their call sites.
So "jump to definition" + error messages from the query! macro about missing arguments would already cover that.

Member

@Nemo157 Jul 8, 2023

> And yet you acknowledge that this RFC will primarily benefit complex editors.

My expectation is that this RFC will not affect complex editors (in cases where they are not acting as simple editors).

A complex editor that is using heuristics to determine when to apply other-language syntax highlighting to a literal could similarly use those heuristics to determine when to apply other-language auto-formatting to a literal.

This RFC simply provides support for auto-indentation (but not formatting) of literals for simple editors (and complex editors where their heuristics don't apply) that use rustfmt.

EDIT: actually, I forgot that this RFC also included language hints, which would allow a very strong hint to the complex editor heuristics of what other-language to treat a literal as, but it also likely allows editors in between simple and complex to use very simple heuristics and start multi-language highlighting where they couldn't previously.

EDIT2: To clarify some of my categorical assumptions to make sure there's no misunderstanding:

  • simple editor: notepad -> notepad++ -> unconfigured vim
    • no code understanding or only simple regex based highlighting
  • complex editor: neovim/vscode + LSP, jetbrains
    • semantic code understanding, so it actually knows which macro literals are being passed to
  • in between: minimally configured vim/neovim without an LSP
    • still just syntactic code understanding, but better than the simple regexes, so it doesn't know which macro is which to use for multi-language heuristics, but it can parse and use language hints

Copy link


> Separation of languages is the norm and should be encouraged.

That is not my experience. I've almost never seen sql queries pulled out into separate files. Most assembly I've seen is inline. Shader languages are a bit of a mix, and I don't have as much familiarity with it, but I don't think it is at all unusual to include shader code inline, especially if it is small. And this feature would be very useful for help text for cli programs. I can't imagine using a separate file for the help comment for every option in my cli that uses clap.

> See the HTML/CSS/JS split that is encouraged instead of having inline script handlers and styles

But we also have frameworks like react, where html and css are embedded in Javascript. Or svelte where the JS is included in an html template.

Member

Shader languages are the one case I can think of where people actually care about "separation of languages", and then it has to do more with the fact that GPU code inherently has a modularity to it, because it is run in passes, and people tend to pull out modules into, well, modules. So you may as well have, e.g.

  • code.cpp
  • code.hpp
  • code.vert
  • code.frag

But ofc you may well just encounter something like

  • code.cs
  • code.hlsl

Depending.

Member

@the8472 Jul 9, 2023

> notepad++

Has syntax highlighting.

> But we also have frameworks like react, where html and css are embedded in Javascript. Or svelte where the JS is included in an html template.

Yes, and I have encountered issues with those kinds of multi-language, framework-specific file formats, which make me prefer separate files. Simple editors either didn't support them at all or mistook them for only one of the languages; complex editors had configuration issues because they picked up the wrong preprocessor version or something, which led to lots of bogus squiggles in those files while vanilla JS files had no issues.

> Most assembly I've seen is inline.

https://github.com/xiph/rav1e/tree/master/src/arm
https://github.com/memorysafety/rav1d/tree/main/src/x86
https://github.com/rust-lang/stacker/tree/master/psm/src/arch

Though none of that needs to be include!ed / act as a template in the first place, it's static code with a fixed interface and compiled separately. I can't think of a project that needs templated ASM.

[rationale-and-alternatives]: #rationale-and-alternatives

There is lots of room to bike-shed syntax.
If there is significant opposition to the backtick syntax, then an
alternative syntax such as:
```
code"
string
"
```
could be used.

Similarly, the use of more than three backticks may be unpopular.
It's not clear how important it is to be able to nest backticks
within backticks, but a syntax mirroring raw string literals could
be used instead, eg.
```
`# foo
string
#`
```

There is also the question of whether the backtick syntax would
interfere with the ability to paste Rust code snippets into such
blocks. Experimentally, markdown parsers do not seem to have any
problems with this (as demonstrated in this document).

# Prior art
Contributor

Python's cleandoc is often used for this purpose, and could set a good example of "expected behavior"

[prior-art]: #prior-art

The proposed syntax is primarily based on markdown code block syntax,
which is widely used and should be familiar to most programmers.


# Unresolved questions
[unresolved-questions]: #unresolved-questions

- None

# Future possibilities
[future-possibilities]: #future-possibilities

- Macro authors could perform further processing
on code string literals. These macros could add support for string
interpolation, escaping, etc. without needing to further complicate
the language itself.

- Procedural macros could look at the text following the opening triple
backticks and use that to influence code generation, eg.

```rust
query!(```postgresql
<query>
```)
```

could parse the query in a PostgreSQL specific way.

- Code literals could be used by crates like `html-macro`
or `quote` to provide better surface syntax and faster
compilation.

- Code literals could be used with the `asm!` macro to avoid
needing a new string on every line.