[70_11] AST Syntax Highlighting for C++ #1993

Merged: 27 commits, Aug 21, 2024

Conversation

UnbSky
Contributor

@UnbSky UnbSky commented Aug 7, 2024

What

Use xmake to import tree-sitter and tree-sitter-cpp.
Implement a highlighting system using tree-sitter, supporting AST syntax highlighting for C++.
Advantages include cross-line string highlighting, recognition of escape sequences within strings, and colored bracket matching.

  • cpp-ast-lang.scm : Records the highlighting colors for keywords and different token types.
  • lang_parser.h & lang_parser.cpp : Programming-language parser; uses tree-sitter to parse the code into an AST, then extracts a sequence of tokens and their attributes from the AST.
  • ast_language.cpp : Runs the parser from advance and looks up the color of each token with get_color.

Why

The old lexical highlighter could not distinguish many cases that need different highlighting.
Compare

How to test your changes?

  1. Build and run.
  2. Turn on "Edit" -> "Preference" -> "Other" -> "AST syntax highlighting".
  3. New cpp code blocks and code lines will use AST syntax highlighting.
  4. AST syntax highlighting can be turned off at any time.

@da-liii da-liii requested a review from jingkaimori August 7, 2024 05:48
@jingkaimori
Contributor

If the embedded code contains three dots like this, the rendering is corrupted:

template<class T>
void function(T a, T b){
    // three dots here corrupts the next function ...
};

function((char)a, (int)b);

@UnbSky
Contributor Author

UnbSky commented Aug 7, 2024

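"..." is converted to U+2026 by cork_to_utf8.
U+2026 is converted to "<ldots>" by utf8_to_cork.
Because "<ldots>" differs from "...", the rendering breaks.
For now, a replace is added at the point where the UTF-8 code is obtained, turning U+2026 back into "..." to avoid the problem.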
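
The workaround described above can be sketched as follows. This is a minimal illustration using std::string, not the code from the PR; the function name restore_ascii_ellipsis is hypothetical.

```cpp
#include <string>

// Hypothetical helper (not from the PR): replace every UTF-8 encoded
// U+2026 (bytes E2 80 A6) with three ASCII dots.
std::string restore_ascii_ellipsis(std::string utf8) {
    const std::string ellipsis = "\xE2\x80\xA6";
    const std::string dots = "...";
    std::string::size_type pos = 0;
    while ((pos = utf8.find(ellipsis, pos)) != std::string::npos) {
        utf8.replace(pos, ellipsis.size(), dots);
        pos += dots.size();  // continue after the inserted dots
    }
    return utf8;
}
```

U+2026 occupies three bytes in UTF-8 and "..." is also three bytes, so this substitution leaves the byte offsets of the surrounding text unchanged.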

@da-liii
Contributor

da-liii commented Aug 7, 2024

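@jingkaimori We need to think about how to avoid Cork encoding at the interface level, especially in code mode. Working around Cork encoding should be done by us developers rather than by the OSPP developer, because this part is both complicated and critical.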

@jingkaimori
Contributor

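Re: "how to avoid Cork encoding at the interface level, especially in code mode":

Very hard. If Cork is handled at an earlier stage rather than in the AST processing, then either the subtree taken from the document tree may be UTF-8 encoded, or utf8_to_cork is simply moved into typeset, which would also affect other languages.

The essence of the problem is that, with the current dictionary, utf8_to_cork(cork_to_utf8(str)) is not necessarily equal to the source string str. Leaving it unhandled is not an option either: in the document's leaf nodes, angle brackets are represented by entities such as <gtr>, and the angle-bracket mapping lives in the same dictionary as the three-dot mapping. We need to find a way to get rid of the current converter.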
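
The round-trip property can be illustrated with a toy converter. The dictionaries below are assumptions made up for this sketch (they are not the real TeXmacs tables), and toy_cork_to_utf8, toy_utf8_to_cork and round_trip are hypothetical names. The point is only that when two Cork spellings map to the same UTF-8 character, the reverse mapping cannot restore both.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

using Table = std::vector<std::pair<std::string, std::string>>;

// Single left-to-right pass: at each position, apply the first matching entry.
std::string translate(const std::string& in, const Table& table) {
    std::string out;
    std::size_t i = 0;
    while (i < in.size()) {
        bool matched = false;
        for (const auto& entry : table) {
            if (in.compare(i, entry.first.size(), entry.first) == 0) {
                out += entry.second;
                i += entry.first.size();
                matched = true;
                break;
            }
        }
        if (!matched) out += in[i++];
    }
    return out;
}

// Toy dictionaries (assumed): both "<ldots>" and "..." become U+2026.
std::string toy_cork_to_utf8(const std::string& s) {
    static const Table t = {{"<ldots>", "\xE2\x80\xA6"}, {"...", "\xE2\x80\xA6"},
                            {"<less>", "<"}, {"<gtr>", ">"}};
    return translate(s, t);
}

// The reverse direction can only pick one spelling for U+2026.
std::string toy_utf8_to_cork(const std::string& s) {
    static const Table t = {{"\xE2\x80\xA6", "<ldots>"},
                            {"<", "<less>"}, {">", "<gtr>"}};
    return translate(s, t);
}

std::string round_trip(const std::string& cork) {
    return toy_utf8_to_cork(toy_cork_to_utf8(cork));
}
```

In this toy model, "a<less>b" survives the round trip, while "// ..." comes back as "// <ldots>", which is the kind of mismatch described above.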

@da-liii
Contributor

da-liii commented Aug 7, 2024

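If the highlighting plugin for every programming language is to be pulled in as a dynamic library, a compile-time switch is recommended. That approach takes more work.

A runtime switch is used at the moment; in that case, I suggest pulling in tree-sitter-cpp directly as a static library.

Related optimizations can be done later. Everything under src/Plugins is a C++ plugin and must be excludable through a compile-time switch.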
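
As a rough sketch of how the two kinds of switch can coexist: the macro ENABLE_AST_HIGHLIGHT and both function names are hypothetical and do not come from the PR. A build system would define the macro to 0 to exclude the plugin, while the user preference is still checked at runtime.

```cpp
#include <string>

// Hypothetical compile-time switch; defaults to enabled when not set by the build.
#ifndef ENABLE_AST_HIGHLIGHT
#define ENABLE_AST_HIGHLIGHT 1
#endif

bool ast_highlight_compiled_in() { return ENABLE_AST_HIGHLIGHT != 0; }

// Runtime switch: the preference only has an effect when the plugin was compiled in.
std::string choose_highlighter(bool user_pref_enabled) {
#if ENABLE_AST_HIGHLIGHT
    if (user_pref_enabled) return "ast";
#else
    (void) user_pref_enabled;  // plugin excluded at build time
#endif
    return "lexical";
}
```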

@UnbSky
Contributor Author

UnbSky commented Aug 7, 2024

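I will switch back to a static library first. If problems or requirements come up later we can discuss them in detail; I do not yet understand how you handle dynamic libraries.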

@da-liii
Contributor

da-liii commented Aug 7, 2024

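This PR does not include some key test cases. Test cases all live in:
https://github.com/XmacsLabs/mogan/tree/branch-1.2/TeXmacs/tests/tmu

They can go in a single tmu file, for example TeXmacs/tests/tmu/70_2.tmu

Pick a few key cases for the tests, with as little code as possible.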

@da-liii
Contributor

da-liii commented Aug 8, 2024

LGTM on the interface part and code styles

Contributor

@jingkaimori jingkaimori left a comment

Quite a few details still have room for improvement.

Contributor

@jingkaimori jingkaimori left a comment

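The current processing based on string concatenation works, but it has a few small flaws:

  • Strings are not the only input form tree-sitter accepts, and string concatenation has potential performance problems
  • Representing TSNode node types as strings is not quite appropriate either
  • The hashing used to check whether the tree has already been parsed is also worth improving.

This PR implements the feature and meets the minimum bar for merging, but it is still some distance from the simplest implementation. Applying these changes may significantly alter the code structure, so I do not favor merging the PR as it stands. Since OSPP ends in a month, merging this version is acceptable if time runs short.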
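
For the third point, one possible shape for an "already parsed?" check is sketched below. ParseCache and ensure_parsed are hypothetical names, the parse itself is replaced by a counter, and nothing here reflects how the PR actually does it. The sketch compares the stored source as well as the hash, so a hash collision cannot cause a stale tree to be reused.

```cpp
#include <cstddef>
#include <functional>
#include <string>

// Hypothetical cache in front of an expensive parse step.
struct ParseCache {
    std::size_t last_hash = 0;
    std::string last_source;
    bool has_tree = false;
    int parse_count = 0;  // stands in for the real parser call

    // Returns true when a (re)parse was actually performed.
    bool ensure_parsed(const std::string& source) {
        const std::size_t h = std::hash<std::string>{}(source);
        if (has_tree && h == last_hash && source == last_source) {
            return false;  // same text as last time: reuse the existing tree
        }
        ++parse_count;     // a real implementation would invoke the parser here
        last_hash = h;
        last_source = source;
        has_tree = true;
        return true;
    }
};
```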

@jingkaimori jingkaimori marked this pull request as draft August 8, 2024 08:55
@jingkaimori jingkaimori self-requested a review August 18, 2024 03:50
@jingkaimori jingkaimori marked this pull request as ready for review August 18, 2024 03:54
Contributor

@jingkaimori jingkaimori left a comment

A few detail issues:

Comment on lines 34 to 40
barcket_symbol_list= list<TSSymbol> (); // '(', ')', '[', ']', '{', '}' ;
barcket_symbol_list << ts_language_symbol_for_name (ts_lang, "(", 1, false);
barcket_symbol_list << ts_language_symbol_for_name (ts_lang, ")", 1, false);
barcket_symbol_list << ts_language_symbol_for_name (ts_lang, "[", 1, false);
barcket_symbol_list << ts_language_symbol_for_name (ts_lang, "]", 1, false);
barcket_symbol_list << ts_language_symbol_for_name (ts_lang, "{", 1, false);
barcket_symbol_list << ts_language_symbol_for_name (ts_lang, "}", 1, false);
Contributor

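From tree-sitter's header files alone, there is no way to tell which symbols are paired brackets.

Consider putting these six symbols into the programming language's scm file. Zed's solution likewise puts these symbols in the language configuration.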
https://github.com/zed-industries/zed/blob/911112d94a168b6455268cb34c57f10d3455b175/crates/editor/src/highlight_matching_bracket.rs#L7

Contributor Author

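Moved into the scm file. The bracket groups to match, and the period of the color cycle for matched bracket groups, can now be configured there.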
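
To make "configurable bracket groups with a color cycle" concrete, here is a simplified sketch. All names are hypothetical. It scans raw characters, so unlike the AST-based implementation discussed in this PR it does not skip brackets inside strings or comments. The openers, closers and cycle arguments stand in for values that would be read from the scm configuration.

```cpp
#include <cstddef>
#include <string>
#include <vector>

struct BracketColor {
    std::size_t offset;  // byte offset of the bracket in the code
    int color_index;     // index into a palette of `cycle` colors
};

// Assign each bracket a color index from its nesting depth, modulo `cycle`.
std::vector<BracketColor> color_brackets(const std::string& code,
                                         const std::string& openers,
                                         const std::string& closers,
                                         int cycle) {
    if (cycle <= 0) cycle = 1;  // guard against an invalid configuration
    std::vector<BracketColor> out;
    int depth = 0;
    for (std::size_t i = 0; i < code.size(); ++i) {
        const char c = code[i];
        if (openers.find(c) != std::string::npos) {
            out.push_back({i, depth % cycle});
            ++depth;
        } else if (closers.find(c) != std::string::npos) {
            if (depth > 0) --depth;  // tolerate unbalanced closers
            out.push_back({i, depth % cycle});
        }
    }
    return out;
}
```

With openers "([{", closers ")]}" and a cycle of 2, the brackets in f(a[0]) get color indices 0, 1, 1, 0, so each closing bracket shares the color of its opener.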

@jingkaimori jingkaimori self-requested a review August 19, 2024 13:28
@UnbSky
Contributor Author

UnbSky commented Aug 20, 2024

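Here is an approach that finds the code root node by inspecting part of the tree inside typeset, so it no longer relies on the_et. It fixes the "copy to image" problem with AST highlighting, and the way the node is obtained has little impact on the existing typeset code.

  1. bridge_compound_rep::my_typeset intercepts the code node during the advance of a normal rendering pass
  2. concater_rep::typeset intercepts the code node for the first advance of "copy to image"
  3. make_lazy_compound intercepts the code node for the second advance of "copy to image"

Previously, "copy to image" could not obtain the code node and produced wrong IPs because print_snippet causes advance to be called twice. In the first call, the IP is relative to the selected tree passed to print_snippet; in the second call, the IP is not what is expected because of the make_lazy call. In both calls, the IP of the intercepted node relative to the node passed to advance is correct, so the fix only needs to obtain the code node.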

Contributor

@jingkaimori jingkaimori left a comment

LGTM; further tweaks can be added in a separate PR.

@jingkaimori jingkaimori merged commit 23aba63 into branch-1.2 Aug 21, 2024
7 checks passed
@jingkaimori jingkaimori deleted the unb/70_2/ast_highlight branch August 21, 2024 05:22
Labels
SoC Summer of Code
3 participants