Fix scanning of valid surrogate pairs #2032

haoqixu · 2025-11-07T09:20:11Z

Some characters can be represented as a pair of UTF16 code units in TS/JS, for example, '😃' == '\ud83d\ude03'. tsgo does not handle these strings correctly and fails to produce valid binary ast via api's encoder.

This PR might be related to #1701, but it only focuses on fixing the valid surrogate pairs.

DanielRosenwasser · 2025-11-07T23:30:10Z

internal/scanner/scanner.go

+		} else if codePointIsHighSurrogate(codePoint) && s.char() == '\\' && s.charAt(1) == 'u' {
+			savedPos := s.pos
+			nextCodePoint := s.scanUnicodeEscape(flags&EscapeSequenceScanningFlagsReportInvalidEscapeErrors != 0)
+			if codePointIsLowSurrogate(nextCodePoint) {


I think I need to see what happens when a user has provided a single high surrogate, a low surrogate, or mismatched surrogates.

Maybe we need a compiler test for this in tests/cases/compiler or tests/cases/conformance with the following:

// @declaration: true // low-high surrogate pair - the "correct" case export const highLow = "\ud83d\ude03" as const; // high surrogate export const high = "\ud83d" as const; // low surrogate export const low = "\ude03" as const; // two high surrogates export const highHigh = "\ud83d\ud83d" as const; // two low surrogates export const lowLow = "\ude03\ude03" as const; // swapped expected order of surrogates export const lowHigh = "\ude03\ud83d" as const;

I think you are currently doing the right thing by only consuming when you have a correct pair, but I'm guessing that we're missing coverage here, so might as well add the test now.

Also, I think I have done some very similar stuff in #2026, which (unfortunately) duplicates some of this scanning code in another package and fixes the bugs.

Maybe we need a compiler test for this in tests/cases/compiler or tests/cases/conformance with the following:

I have added these testcases to tests/cases/compiler.

Also, I think I have done some very similar stuff in #2026, which (unfortunately) duplicates some of this scanning code in another package and fixes the bugs.

Got it. I noticed that internal/regexpchecker/utf16.go in #2026 handles some similar unicode escapes processing, but I've verified that it doesn't actually fix the issue here. The internal/scanner stores the interpreted string for string literal while preserving the original text for regex literal in tokenValue, which is used by internal/api/encoder for encoding ast into binary foramt. We still need to update the scanner's code to properly decode valid surrogate pairs for strings.

As for invalid escapes, it involves encoding them into some kind of non-strict utf8 without data loss, transmitting them through the binary format and restoring them in the TypeScript API. I think it would be better to track that in a separate issue (perhaps in #1701) 🤔

I noticed that internal/regexpchecker/utf16.go in #2026 handles some similar unicode escapes processing

We still need to update the scanner's code to properly decode valid surrogate pairs for strings.

Perhaps we could consider refactoring by extracting the code from internal/regexpchecker/utf16.go into a reusable package. This way, it could be repurposed for use within internal/scanner as well

I suspect all of the code in that PR needs to move back into the scanner package anyway

haoqixu added 3 commits November 7, 2025 16:56

add testcases

f138528

Fix scanning of valid surrogate pairs

bc121e4

npx hereby format

9117cb9

DanielRosenwasser reviewed Nov 7, 2025

View reviewed changes

add testcases

c949dc5

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Fix scanning of valid surrogate pairs #2032

Fix scanning of valid surrogate pairs #2032

Uh oh!

haoqixu commented Nov 7, 2025

Uh oh!

DanielRosenwasser Nov 7, 2025 •

edited

Loading

Uh oh!

jakebailey Nov 7, 2025

Uh oh!

haoqixu Nov 10, 2025

Uh oh!

haoqixu Nov 10, 2025

Uh oh!

haoqixu Nov 10, 2025

Uh oh!

jakebailey Nov 10, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Fix scanning of valid surrogate pairs #2032

Are you sure you want to change the base?

Fix scanning of valid surrogate pairs #2032

Uh oh!

Conversation

haoqixu commented Nov 7, 2025

Uh oh!

DanielRosenwasser Nov 7, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jakebailey Nov 7, 2025

Choose a reason for hiding this comment

Uh oh!

haoqixu Nov 10, 2025

Choose a reason for hiding this comment

Uh oh!

haoqixu Nov 10, 2025

Choose a reason for hiding this comment

Uh oh!

haoqixu Nov 10, 2025

Choose a reason for hiding this comment

Uh oh!

jakebailey Nov 10, 2025

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

DanielRosenwasser Nov 7, 2025 •

edited

Loading