Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove the need of NodeJS for generating parsers #465

Open
XVilka opened this issue Oct 18, 2019 · 28 comments · May be fixed by #3780
Open

Remove the need of NodeJS for generating parsers #465

XVilka opened this issue Oct 18, 2019 · 28 comments · May be fixed by #3780
Labels
Milestone

Comments

@XVilka
Copy link
Contributor

XVilka commented Oct 18, 2019

There are no things that JavaScript can do that can't be done with Rust. It makes possible the next logical step - get rid of NodeJS for generating parsers from the description as well.

There is a number of JavaScript parsers in Rust:

@maxbrunsfeld
Copy link
Contributor

It'd be nice to avoid depending on external programs, but in order to evaluate grammar.js files, we need a full JavaScript runtime, not just a parser. Grammars can also use require to load external files, so the runtime needs to be compatible with Node.js in that way.

We could embed a V8 or Spidermonkey interpreter inside of the Tree-sitter CLI, but it seems like overkill to me. Node.js is very easy to install and many users already have it. In fact, I think most people use node.js (and npm) to install tree-sitter.

@thestr4ng3r
Copy link
Contributor

Duktape can potentially be used as a lightweight runtime.

@turbo
Copy link

turbo commented Oct 18, 2019

Throwing this in here, too:

Duktape can potentially be used as a lightweight runtime.

Or bellard's QuickJS, which is almost twice as fast as Duktape even after a recent round of security fixes. Plus the API is nice (subjectively).

@thestr4ng3r
Copy link
Contributor

I think the greatest problem with depending on nodejs is that any project that doesn't ship a pre-generated parser in their source, but generates it on the fly as part of their build system also has to depend on nodejs.
In our case, this is not acceptable. We would have to always keep an already generated parser in our source tree as a workaround, which is bad practice.

@maxbrunsfeld
Copy link
Contributor

Thanks for the explanation. Personally, I think it's fine to check generated code into a source repository, as it is platform-agnostic, and it makes it easier to use the project. But if you're ok with running tree-sitter generate as part of your build, that seems like a reasonable approach.

Just so I understand your use case - why are you ok with your build depending on the tree-sitter binary but not the node binary? Would you prefer to swap out Node for some other executable JavaScript interpreter like QuickJS, DukTape, V7, Rhino, or SpiderMonkey?

Currently, Tree-sitter just looks for node on your PATH and passes JavaScript to it via stdin. We do rely on process.env and require currently, but we could probably remove the dependence on those particular APIs.

@XVilka
Copy link
Contributor Author

XVilka commented Oct 19, 2019

Well, to my taste I don't understand why do you need the full power of JavaScript, why not just use a limited subset of it? So a parser in Rust can be written and no need for JavaScript interpreter whatsoever.

@maxbrunsfeld
Copy link
Contributor

Grammars define helper functions, call methods like Array.map, and .filter, merge objects using the spread operator, etc.

JavaScript is the most widely used programming language in the world. There’s just no reason to create some custom subset with its own interpreter.

@thestr4ng3r
Copy link
Contributor

Well popularity doesn't mean it's good 😜

Just so I understand your use case - why are you ok with your build depending on the tree-sitter binary but not the node binary? Would you prefer to swap out Node for some other executable JavaScript interpreter like QuickJS, DukTape, V7, Rhino, or SpiderMonkey?

The tree-sitter binary will be required in any case, but depending on nodejs just for reading in the grammar seems way too much to me. If you say you do need full javascript, I think such a small and embedded interpreter, which would then be part of the tree-sitter binary, could be a good compromise.

@turbo
Copy link

turbo commented Oct 19, 2019

No need to devolve into JS bashing. FWIW I agree that in general, I like programs to be self-contained, but I can also understand that this is the lowest of low priorities.

I guess the better question would be: If someone were to do the work of integrating one of the smaller JS engines into tree-sitter, would that work be accepted?

@Arcanemagus
Copy link

I guess the better question would be: If someone were to do the work of integrating one of the smaller JS engines into tree-sitter, would that work be accepted?

The issue I see with this is later on when somebody not familiar with this discussion comes in and starts working on a Tree-sitter grammar, sees that they can use JavaScript files and builds it with that... only to use some JavaScript feature that the tiny custom parser doesn't understand. How are they supposed to know that it's a problem in the miniature parser and not with their code? How are they supposed to fix that?

I don't see the benefit of removing the Node.js dependency here, without entirely removing the ability to use JavaScript.

@maxbrunsfeld
Copy link
Contributor

Yeah, I think Node.js needs to remain the default. I am open to generalizing the grammar-evaluating code path so that you can use an alternative JavaScript interpreter if you want. I don't think I'm interested in embedding the JavaScript interpreter into the tree-sitter binary. The vast majority of users will want to use Node.js, so it seems like unnecessary bloat.

@maxbrunsfeld
Copy link
Contributor

Are folks interested in this option: allowing Tree-sitter to use different (and possibly lighter-weight) executables besides node as the JavaScript engine? If not, I think we should close this out.

@turbo
Copy link

turbo commented Nov 5, 2019

use an alternative JavaScript interpreter if you want

Just the choice would be great. I for example use deno for JS scripting (https://github.com/denoland/deno) and not node.

@thestr4ng3r
Copy link
Contributor

Yes, I think that could be a good option.

@maxbrunsfeld
Copy link
Contributor

Ok, sounds good; I will leave this open, and if anyone wants to tweak the javascript evaluation logic to not depend on Node-specific APIs, that'd be great.

@jurov
Copy link

jurov commented Dec 12, 2019

I don't understand, if you need to generate grammar at runtime, why not directly generate the JSON format instead?

@mingodad
Copy link

Just did a test with quickjs with small changes to dsl.js/grammar.js and joining then the end result is identical to the output of nodejs.

/* commenting this in dsl.js
global.alias = alias;
global.blank = blank;
global.choice = choice;
global.optional = optional;
global.prec = prec;
global.repeat = repeat;
global.repeat1 = repeat1;
global.seq = seq;
global.sym = sym;
global.token = token;
global.grammar = grammar;
global.field = field;

const result = require(process.env.TREE_SITTER_GRAMMAR_PATH);
console.log(JSON.stringify(result, null, 2));
*/
//module.exports = grammar({...}); //replacing this in grammar.js
console.log(JSON.stringify(grammar({...}));

@Himujjal
Copy link
Contributor

Himujjal commented Jan 31, 2021

What are the limitations of quickjs that prevents it to be default javascript engine for the purpose of Tree-sitter? AFAIK, everything is synchronous. File operations are already available in QuickJS or could be easily put to use using FFI. Performance isnt a concern either in this case.

So, would a QuickJS engine based pull request be accepted?

With QuickJS, tree-sitter binary could easily be published to OS package managers without the NodeJS dependency.

@carueda
Copy link

carueda commented Feb 4, 2021

Great discussion here. I have found tree-sitter really amazing, however, the dependency on node (besides the tree-sitter binary itself) is practically a showstopper if I wanted to integrate the code generation as part of my build, which is for an embedded system, basically within the context of a typical gcc/makefile workflow where there's no javascript/node involved whatsoever. An alternative strategy -as mentioned above already- is to generate such code in a separate project whose artifacts would then be included basically as external library dependencies; yes, feasible, but very far from ideal imo.

@maxbrunsfeld
Copy link
Contributor

I would like to keep Node.js as the default because for most users, it will be by far the most familiar and least confusing. Error messages and built-in APIs will likely be different from engine to engine.

But, expanding on what I said earlier, I'd be open to pull requests adding either of:

  • A compile time feature that lets you build in a different JavaScript engine
  • A runtime feature that lets you specify a different binary as the javascript interpreter.

The second option would probably be the most flexible. It may take a little design work to decide exactly how Tree-sitter should invoke the JavaScript interpreter and receive the JSON output.

@razzeee
Copy link
Contributor

razzeee commented Mar 3, 2021

It's way easier to write these in javascript/typescript. It get's converted into a json right after that.
The json looks like this https://github.com/elm-tooling/tree-sitter-elm/blob/master/src/grammar.json
I guess you could skip the js part in theory if you wanted. Still I would be surprised, if you could even maintain a grammar like that.

@gwerbin
Copy link

gwerbin commented Nov 26, 2021

Perhaps JavaScript is problematic because it's not easy to embed, and its runtimes are generally quite "heavy", even though grammars use only a small subset of their features.

Whereas something like Lua is designed to be embedded.

There is a system for embedding NodeJS in C++ applications, but I don't know if that can then be used from other languages like Rust, Go, etc. https://nodejs.org/api/embedding.html

If the grammar ultimately gets emitted as JSON, then maybe Tree Sitter should expose a "grammar builder" API, which could be used from any language with an FFI. Then you can generate grammars in Node, Python, Haskell, whatever you want.

@kalkin
Copy link

kalkin commented Nov 26, 2021

Lua would definitely be a way to go, but as long there is no code the whole discussion is useless

@maxbrunsfeld
Copy link
Contributor

👋 Hi everyone, this thread is getting a lot of drive-by comments that aren't very relevant to the specific problem that we're discussing. I'd like to lock this issue in order to reduce unnecessary notifications for myself and other Tree-sitter maintainers.

I'll leave the discussion open for another day or two, as a courtesy to anyone who feels they have important feedback to add. And I'll reiterate few points of my own:

  1. Today, Node.js is not required in any way if you are using a Tree-sitter parser in your application. It is only required if you are re-generating a parser from the grammar. Typically, this is something you do only during development of your grammar.
  2. Even if you do need to regenerate a parser again later, you can still avoid needing Node.js. You can run tree-sitter generate on a JSON file instead of a JavaScript file. The JSON file is much smaller than the parser.c and can be generated from the JavaScript file during development.
  3. If we want to remove the dependency on Node.js during grammar development, we have an easy pathway towards doing that, which won't require much change to the JavaScript-based grammar DSL. I've outlined that above.
  4. There are many JavaScript engines to choose from. Most OS's ship one by default, as part of their web view system.
  5. JavaScript is the world's most commonly-known programming language by far. And grammar development does not need to be embedded in Tree-sitter users' applications. So there's no reason for us to switch the grammar DSL to use a different language.

If (and only if) you have exact, concrete use-cases for the Tree-sitter CLI and Node.js is a problem for you, and you don't think the above proposal can solve your problem, please continue to comment. Thanks!

@sogaiu
Copy link
Collaborator

sogaiu commented Jan 5, 2023

Specifically regarding point 3 (runtime configurability of alternate js runtime) in this comment, below is a draft that lead to success locally.

Is this close to what was referred to in point 3?

(Possibly as a side-benefit this kind of thing might address #1686 as well.)


Tried patching mod.rs like this:

diff --git a/cli/src/generate/mod.rs b/cli/src/generate/mod.rs
index 4838828b..053ffbb0 100644
--- a/cli/src/generate/mod.rs
+++ b/cli/src/generate/mod.rs
@@ -25,6 +25,7 @@ use std::fs;
 use std::io::Write;
 use std::path::{Path, PathBuf};
 use std::process::{Command, Stdio};
+use std::env;
 
 lazy_static! {
     static ref JSON_COMMENT_REGEX: Regex = RegexBuilder::new("^\\s*//.*")
@@ -168,12 +169,16 @@ pub fn load_grammar_file(grammar_path: &Path) -> Result<String> {
 
 fn load_js_grammar_file(grammar_path: &Path) -> Result<String> {
     let grammar_path = fs::canonicalize(grammar_path)?;
-    let mut node_process = Command::new("node")
+    let js_runtime_path = match env::var("TREE_SITTER_JS_RUNTIME_PATH") {
+        Ok(s) => s,
+        Err(_) => "node".to_string(),
+    };
+    let mut node_process = Command::new(js_runtime_path)
         .env("TREE_SITTER_GRAMMAR_PATH", grammar_path)
         .stdin(Stdio::piped())
         .stdout(Stdio::piped())
         .spawn()
-        .expect("Failed to run `node`");
+        .expect("Failed to run js runtime");  // XXX: yeah not right
 
     let mut node_stdin = node_process
         .stdin

To test on a *nix machine:

  • Ensure a local installation of quickjs (see here for availability)
  • Build patched tree-sitter cli via cargo build --release or other appropriate method (will use later so save path somewhere -- typically it's target/release/tree-sitter under the tree-sitter repository root)
  • Fetch some tree-sitter grammar locally (e.g. via git clone)
  • Place make-grammar-json.js and quickjs-wrapper.sh (make executable) in grammar repository root (see bottom of post for .zip files for these files)
  • Set TREE_SITTER_JS_RUNTIME_PATH environment variable to point at quickjs-wrapper.sh
  • Move src directory out of the way
  • Invoke tree-sitter generate (using custom tree-sitter cli binary built earlier)

Success should yield a src directory with appropriate content.


simple .js file - make-grammar-json.zip
wrapper invocation script - quickjs-wrapper.zip

@jakesarjeant
Copy link
Contributor

I've implemented the latter approach from point 3 in a more complete manner in this PR. I'd love to hear whether or not this satisfies the concerns discussed here and/or whether anyone thinks further changes are necessary!

@tgross35
Copy link

tgross35 commented Jan 9, 2024

Can this be closed since #2403 is merged, or are the maintainers interested in an embedded option still?

If anyone is testing around, the best option at this time is probably deno_core https://docs.rs/deno_core/0.244.0/deno_core/. Deno is a pretty well proven alternative to node at this point, Rust runtime around the v8 engine. Kind of perfect for this use case.

The downside is the binary would probably 4x in size, but that may be an acceptable tradeoff.

@tree-sitter tree-sitter deleted a comment from issue-label-bot bot Feb 7, 2024
@dundargoc dundargoc added enhancement Feature request and removed feature_request labels Feb 8, 2024
@ObserverOfTime ObserverOfTime added this to the 1.0 milestone May 7, 2024
@Thaodan
Copy link

Thaodan commented Jul 14, 2024

How does this goal interact with the memory usage required to generate parser?

I'm packaging the parsers tree-sittter of tree-sitter grammars to be used in text editors for openSUSE. To ensure that the parser is compatible and to not trust the generated I generate the parser on each built. The argument that you just need to generate parsers during development doesn't add up to me as the chain of dependencies on what treesitter was used to generate the parser shouldn't be with the upstream from my point of view.

@ObserverOfTime ObserverOfTime modified the milestones: 1.0, 0.26 Jan 20, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.