Skip to content

Conversation

@haoqixu
Copy link
Contributor

@haoqixu haoqixu commented Nov 7, 2025

Some characters can be represented as a pair of UTF16 code units in TS/JS, for example, '😃' == '\ud83d\ude03'. tsgo does not handle these strings correctly and fails to produce valid binary ast via api's encoder.

This PR might be related to #1701, but it only focuses on fixing the valid surrogate pairs.

} else if codePointIsHighSurrogate(codePoint) && s.char() == '\\' && s.charAt(1) == 'u' {
savedPos := s.pos
nextCodePoint := s.scanUnicodeEscape(flags&EscapeSequenceScanningFlagsReportInvalidEscapeErrors != 0)
if codePointIsLowSurrogate(nextCodePoint) {
Copy link
Member

@DanielRosenwasser DanielRosenwasser Nov 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I need to see what happens when a user has provided a single high surrogate, a low surrogate, or mismatched surrogates.

Maybe we need a compiler test for this in tests/cases/compiler or tests/cases/conformance with the following:

// @declaration: true

// low-high surrogate pair - the "correct" case
export const highLow = "\ud83d\ude03" as const;

// high surrogate
export const high = "\ud83d" as const;

// low surrogate
export const low = "\ude03" as const;

// two high surrogates
export const highHigh = "\ud83d\ud83d" as const;

// two low surrogates
export const lowLow = "\ude03\ude03" as const;

// swapped expected order of surrogates
export const lowHigh = "\ude03\ud83d" as const;

I think you are currently doing the right thing by only consuming when you have a correct pair, but I'm guessing that we're missing coverage here, so might as well add the test now.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, I think I have done some very similar stuff in #2026, which (unfortunately) duplicates some of this scanning code in another package and fixes the bugs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we need a compiler test for this in tests/cases/compiler or tests/cases/conformance with the following:

I have added these testcases to tests/cases/compiler.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, I think I have done some very similar stuff in #2026, which (unfortunately) duplicates some of this scanning code in another package and fixes the bugs.

Got it. I noticed that internal/regexpchecker/utf16.go in #2026 handles some similar unicode escapes processing, but I've verified that it doesn't actually fix the issue here. The internal/scanner stores the interpreted string for string literal while preserving the original text for regex literal in tokenValue, which is used by internal/api/encoder for encoding ast into binary foramt. We still need to update the scanner's code to properly decode valid surrogate pairs for strings.

As for invalid escapes, it involves encoding them into some kind of non-strict utf8 without data loss, transmitting them through the binary format and restoring them in the TypeScript API. I think it would be better to track that in a separate issue (perhaps in #1701) 🤔

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I noticed that internal/regexpchecker/utf16.go in #2026 handles some similar unicode escapes processing

We still need to update the scanner's code to properly decode valid surrogate pairs for strings.

Perhaps we could consider refactoring by extracting the code from internal/regexpchecker/utf16.go into a reusable package. This way, it could be repurposed for use within internal/scanner as well

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suspect all of the code in that PR needs to move back into the scanner package anyway

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants