Cannot access file and directory _names_ that are invalid UTF-8 #919

Open
cosmos72 opened this issue Feb 20, 2025 · 10 comments

cosmos72 commented Feb 20, 2025

The functions

(current-directory) (cd) (directory-list)
(file-exists?) (file-regular?) (file-directory?) (file-symbolic-link?) (file-access-time) (file-change-time)
(file-modification-time) (mkdir) (delete-file) (delete-directory) (rename-file) (chmod) (get-mode)

described in Section 9.16. File System Interface https://cisco.github.io/ChezScheme/csug10.0/io.html#./io:h16 operate on file names and directory names represented as Scheme strings.

When actually accessing the file system on POSIX systems, such strings are automatically converted from/to UTF-8.

This has the side effect that existing files and directories whose names are invalid UTF-8 cannot be accessed by the functions listed above.

Example:

$ mkdir example
$ cd example
$ touch $(printf 'AAA\xffzzz')
$ ls -l AAA*
-rw-r--r-- 1 user users 0 Feb 20 10:15 'AAA'$'\377''zzz'
$ chezscheme
> (define x (car (sort string<? (directory-list "."))))

> x
"AAA�zzz"

> (char->integer (string-ref x 3))
65533

> (delete-file x)
#f

> (delete-file x #t)
Exception in delete-file: failed for AAA�zzz: no such file or directory
Type (debug) to enter the debugger.

The problem is: byte #xff is not a valid UTF-8 sequence,
and (directory-list) converts it to replacement character #xFFFD as per UTF-8 error-handling rules.

As a consequence, the file created with shell command touch $(printf 'AAA\xffzzz')
and all other files or directories whose names are invalid UTF-8
cannot be accessed with Chez Scheme functions listed above.
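
The lossy step can be reproduced outside Chez Scheme. The following Python sketch is only an illustration: it assumes that decoding with `errors="replace"` behaves like the replacement described above, and the variable names are made up for this example.

```python
# File name bytes as created by: touch $(printf 'AAA\xffzzz')
raw_name = b"AAA\xffzzz"

# UTF-8 decoding with replacement: the invalid byte 0xff becomes U+FFFD.
decoded = raw_name.decode("utf-8", errors="replace")

# Encoding the string again yields the UTF-8 encoding of U+FFFD (ef bf bd),
# not the original byte, so the result no longer names the file on disk.
re_encoded = decoded.encode("utf-8")

print(hex(ord(decoded[3])))    # 0xfffd
print(re_encoded == raw_name)  # False
```

Once the original byte has been replaced, no function that receives only the string can recover the name stored in the file system.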

Since POSIX file system specifications do not require that files or directory names are valid UTF-8,
this leaves the above Chez Scheme functions in the uncomfortable position of failing on some valid POSIX file and directory names.

A solution could be to convert file and directory names from/to UTF-8b (note the 'b') instead of UTF-8,
because UTF-8b is an extension of UTF-8 designed exactly to losslessly convert any byte or byte sequence.

For a definition of UTF-8b, see
https://peps.python.org/pep-0383
https://web.archive.org/web/20090830064219/http://mail.nl.linux.org/linux-utf8/2000-07/msg00040.html
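
As a minimal sketch of the lossless behaviour, here is the same file name decoded with Python's `surrogateescape` error handler, which is the mechanism PEP 383 defines. It illustrates the encoding scheme only and says nothing about how Chez Scheme would implement it.

```python
raw_name = b"AAA\xffzzz"

# PEP 383: each undecodable byte 0x80..0xff is mapped to the lone
# surrogate code point U+DC80..U+DCFF instead of U+FFFD.
decoded = raw_name.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler turns the surrogate back into the byte.
restored = decoded.encode("utf-8", errors="surrogateescape")

print(hex(ord(decoded[3])))  # 0xdcff
print(restored == raw_name)  # True
```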

@cosmos72
Author

[ADDENDUM]

In case it's not explained clearly enough,
I am talking about the name of files and directories in a POSIX file system - not their content.

cosmos72 changed the title from "Cannot access file and directory names containing invalid UTF-8" to "Cannot access file and directory _names_ that are invalid UTF-8" on Feb 20, 2025
@melted
Contributor

melted commented Feb 24, 2025

I don't think UTF-8b as such will help with roundtripping through Chez Scheme strings, as they are UTF-32 internally. But there is of course plenty of room in them to represent invalid UTF-8 bytes. I wonder what happens if someone tries to use such a string rather than just roundtrip it? As far as I can see, the minimum interface is a predicate that tells whether a string is improper, plus a way to decode it to a bytevector; most other string functions would just throw if you try to use them.

Wouldn't it be more explicit if these file functions alternatively took or returned a bytevector?
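
A rough sketch of what such a dual interface could look like, written in Python for illustration. The helper name `to_os_bytes` is hypothetical; the assumption is that strings are encoded as strict UTF-8 while byte sequences are passed through untouched.

```python
from typing import Union


def to_os_bytes(name: Union[str, bytes]) -> bytes:
    """Return the bytes to hand to the operating system for a file name."""
    if isinstance(name, bytes):
        # Already raw bytes: pass through unchanged, valid UTF-8 or not.
        return name
    # A string holds only valid characters, so strict UTF-8 encoding is enough.
    return name.encode("utf-8")


print(to_os_bytes("AAAzzz"))        # b'AAAzzz'
print(to_os_bytes(b"AAA\xffzzz"))   # b'AAA\xffzzz'
```

With this shape, only callers that need names which are not valid UTF-8 have to deal with byte sequences.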

@cosmos72
Author

cosmos72 commented Feb 25, 2025

In my shell "schemesh" written in Chez Scheme, I am currently converting byte sequences (interpreted as UTF-8b) to Chez Scheme strings (yes, they are UTF-32 internally) and back. It works flawlessly.

The trick I used is a custom C function that calls Schar(0xdc80...0xdcff) and returns to Chez the produced characters, because vanilla (integer->char) intentionally throws for those values.

Only 128 characters need to be produced in this way, so they can be cached in Chez - no need to call C every time.

The other direction, (char->integer), is trivial: it already correctly converts characters in the range #\xdc80 ... #\xdcff
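
The caching idea can be sketched as follows. This is Python rather than the actual schemesh code, and the names `ESCAPED_CHARS`, `byte_to_char` and `char_to_byte` are hypothetical. Python's `chr` accepts surrogate code points directly, so no C helper is needed here.

```python
# One cached character per undecodable byte 0x80..0xff,
# i.e. the 128 code points 0xdc80..0xdcff.
ESCAPED_CHARS = [chr(0xDC00 + byte) for byte in range(0x80, 0x100)]


def byte_to_char(byte: int) -> str:
    """Map an undecodable byte (0x80..0xff) to its cached escape character."""
    if not 0x80 <= byte <= 0xFF:
        raise ValueError("byte outside the escapable range")
    return ESCAPED_CHARS[byte - 0x80]


def char_to_byte(char: str) -> int:
    """Inverse direction: recover the original byte from an escape character."""
    code = ord(char)
    if not 0xDC80 <= code <= 0xDCFF:
        raise ValueError("not a UTF-8b escape character")
    return code - 0xDC00


print(len(ESCAPED_CHARS))             # 128
print(hex(ord(byte_to_char(0xFF))))   # 0xdcff
```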

[UPDATE] about modifying the functions to also accept bytevectors: I have implemented that too in my schemesh, but in Scheme strings are more convenient. And existing programs would need to be updated to take advantage of them.

Clearly, returning bytevectors instead of strings from (directory-list) would break compatibility with existing programs.

All things considered, my proposal to convert file names from/to UTF-8b is equivalent to saying that these functions accept or return UTF-32b Scheme strings: most programs will not even notice the change and will benefit from the fix, while the ones that do notice the UTF-32b characters are currently broken anyway because of this bug.

@cosmos72
Author

cosmos72 commented Feb 26, 2025

[UPDATE 2] This issue is actually more widespread than initially reported.

It also affects Chez Scheme on Windows because filenames there are not required to be valid UTF-16, see https://zaferbalkan.com/surrogates/

And on POSIX systems it also affects all string-based interfaces to the operating system, including at least:

  • (command-line-arguments): a user or a script may launch Chez Scheme with command-line arguments that are not valid UTF-8, for example the name of a file to be loaded and executed. But (command-line-arguments) returns the arguments after replacing any invalid UTF-8 with the replacement character #\xFFFD (this is performed internally by (utf8->string)), so the arguments are garbled, and if they refer to a file name, the file will not be found.

  • (open-file-input-port) (open-file-output-port) (open-file-input/output-port) accept file paths as UTF-32 Scheme strings, thus they cannot represent - much less open - any file whose name, when represented in bytes as POSIX file systems do, contains invalid UTF-8.

  • (open-process-ports) and (process) accept arguments as UTF-32 Scheme strings, thus cannot launch executables whose path contains invalid UTF-8.
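
For comparison, Python applies the PEP 383 scheme uniformly to the strings it exchanges with the operating system on POSIX (file names, command-line arguments, environment variables), and exposes the conversion as `os.fsdecode` and `os.fsencode`. The sketch below assumes a POSIX system; the variable names are illustrative.

```python
import os

# Bytes as the kernel would deliver them, e.g. a command-line argument
# naming the file created with: touch $(printf 'AAA\xffzzz')
raw_arg = b"AAA\xffzzz"

# Convert to a string the way Python does for OS-provided data on POSIX.
as_text = os.fsdecode(raw_arg)

# Convert back before handing the name to the operating system again.
round_tripped = os.fsencode(as_text)

print(round_tripped == raw_arg)  # True on POSIX
```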

@dpk

dpk commented Mar 2, 2025

UTF-8b doesn’t work for Scheme because it requires unpaired surrogates to be supported as character objects and in strings, which isn’t allowed by R6RS.

The solution devised for R7RS large by John Cowan (a contributor to the Unicode Standard for many years) is described here: https://codeberg.org/scheme/r7rs/wiki/Noncharacter-error-handling

@cosmos72
Copy link
Author

cosmos72 commented Mar 2, 2025

Could you kindly help me find the relevant R6RS section that implies unpaired surrogates are not allowed as character objects and in strings?

If that's the case, then strict R6RS compliance is surely difficult to obtain.

On the other hand, the proposal you cited has an unpleasant side effect: it mangles valid UTF-8 file names that happen to decode to valid Unicode "noncharacters" (i.e. codepoints in the range U+FDD0..U+FDEF).

Unicode standard https://www.unicode.org/versions/Unicode15.0.0/ch23.pdf states:

> Applications are free to use any of these noncharacter code points internally.
> They have no standard interpretation when exchanged outside the context of internal use.

If "application" is taken to mean an R6RS-compliant Scheme implementation,
then the proposal above is acceptable - although personally I still find it complicated and invasive.

If "application" is taken to mean an R6RS-compliant Scheme program - and I find this interpretation more likely -
then in this context the R6RS-compliant Scheme implementation has the role of a library,
and should not mangle noncharacters because they are reserved for the application.

@dpk

dpk commented Mar 2, 2025

Noncharacter error handling does not mangle anything: mangling implies an irreversible process because some ambiguity would be created. The affected noncharacters (which are not all of the noncharacters in Unicode, only a small and well-defined subset) are safely quoted and unquoted, a reversible and unambiguous process.

I acknowledge there is a type aliasing issue, but this is the best solution available under the constraints that 1. file name objects (and many other strings that come from the operating system, such as environment variables) have to be representable as Scheme strings for backwards compatibility and 2. Scheme strings must be pure sequences of Unicode scalar values
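
Constraint 2 can be stated as a small predicate. This Python sketch uses the Unicode definition of a scalar value (any code point except the surrogates U+D800..U+DFFF); the function name `is_scalar_value` is hypothetical.

```python
def is_scalar_value(code_point: int) -> bool:
    """True if code_point is a Unicode scalar value (a non-surrogate code point)."""
    in_range = 0 <= code_point <= 0x10FFFF
    is_surrogate = 0xD800 <= code_point <= 0xDFFF
    return in_range and not is_surrogate


print(is_scalar_value(0xDCFF))  # False: a UTF-8b escape is a lone surrogate
print(is_scalar_value(0xFFFD))  # True: the replacement character
print(is_scalar_value(0xFDD0))  # True: a noncharacter is still a scalar value
```

The UTF-8b escape characters fail this test, while noncharacters pass it, so a scheme built on noncharacters stays within strings that satisfy the constraint.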

@cosmos72
Author

cosmos72 commented Mar 2, 2025

I meant mangling as in "C++ function names mangling", which is reversible.
Yes, it's a kind of escaping, since noncharacter sequences are reversibly replaced with longer sequences.
Still, they are replaced, which means an application can no longer transparently use them.

@LiberalArtist
Contributor

> under the constraints that 1. file name objects (and many other strings that come from the operating system, such as environment variables) have to be representable as Scheme strings for backwards compatibility

FWIW Racket extends procedures that operate on file paths to accept path-string?s: either strings, treated the usual way, or path? values, which are essentially bytevectors with some invariants.
