Cannot access file and directory _names_ that are invalid UTF-8 #919
Comments
[ADDENDUM] In case it's not explained clearly enough, |
I don't think UTF-8b as such will help with roundtripping through Chez Scheme strings as they are UTF-32 internally. But there is of course plenty of room in to represent invalid UTF-8 bytes in them. I wonder, what happens if someone tries to use a string and not just roundtrip a string that contains bad bytes? As far as I can see the minimum interface is a predicate for if a string is improper and that you can decode it to a bytevector, and most other string functions would just throw if you try to use them. Wouldn't it be more explicit if these file functions alternatively took or returned a bytevector? |
In my shell "schemesh" written in Chez Scheme, I am currently converting byte sequences (interpreted as UTF-8b) to Chez Scheme strings (yes, they are UTF-32 internally) and back. It works flawlessly. The trick I used is a custom C function that calls Only 128 characters need to be produced in this way, so they can be cached in Chez - no need to call C every time. The other direction,

[UPDATE] about modifying the functions to also accept bytevectors: I have implemented that too in my schemesh, but in Scheme strings are more convenient. And existing programs would need to be updated to take advantage of them. Clearly, returning bytevectors instead of strings from

All considered, my proposal to convert file names from/to UTF-8b is equivalent to saying that these functions accept or return UTF-32b Scheme strings: most programs will not even notice the change and will benefit from the fix, while the ones that do notice the UTF-32b characters are currently broken anyway because of this bug |
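The 128-character cache mentioned in this comment can be illustrated with a short sketch. It is written in Python and is not taken from schemesh. It assumes the PEP 383 convention in which an undecodable byte 0x80+n is represented by the code point U+DC80+n, and the names `ESCAPED`, `escaped_char` and `original_byte` are hypothetical.

```python
# Hypothetical illustration of a 128-entry cache; not schemesh's actual code.
# Assumption: byte 0x80+n is represented by code point U+DC80+n (PEP 383 convention).
ESCAPED = tuple(chr(0xDC00 + b) for b in range(0x80, 0x100))


def escaped_char(byte: int) -> str:
    """Return the cached stand-in character for an undecodable byte (0x80..0xFF)."""
    if not 0x80 <= byte <= 0xFF:
        raise ValueError("only bytes 0x80..0xFF need an escape character")
    return ESCAPED[byte - 0x80]


def original_byte(ch: str) -> int:
    """Inverse mapping: recover the byte from a stand-in character."""
    code = ord(ch)
    if not 0xDC80 <= code <= 0xDCFF:
        raise ValueError("not an escape character")
    return code - 0xDC00
```

Because the table is built once, looking up a stand-in character is a plain index operation and needs no foreign call.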
[UPDATE 2] This issue is actually more widespread than initially reported. It also affects Chez Scheme on Windows, because filenames there are not required to be valid UTF-16; see https://zaferbalkan.com/surrogates/
On POSIX systems it also affects all string-based interfaces to the operating system, including at least:
|
UTF-8b doesn’t work for Scheme because it requires unpaired surrogates to be supported as character objects and in strings, which isn’t allowed by R6RS. The solution devised for R7RS large by John Cowan (a longtime contributor to the Unicode Standard) is described here: https://codeberg.org/scheme/r7rs/wiki/Noncharacter-error-handling |
Could you kindly help me find the relevant R6RS section that implies unpaired surrogates are not allowed as character objects and in strings? If that's the case, then strict R6RS compliance is surely difficult to obtain. On the other hand, the proposal you cited has an unpleasant side effect: it mangles valid UTF-8 file names that happen to decode to valid Unicode "noncharacters" (i.e. codepoints in the range U+FDD0..U+FDEF). The Unicode Standard https://www.unicode.org/versions/Unicode15.0.0/ch23.pdf states:
If "application" is taken to mean an R6RS-compliant Scheme implementation, If "application" is taken to mean an R6RS-compliant Scheme program - and I find this interpretation more likely - |
Noncharacter error handling does not mangle anything: mangling implies an irreversible process because some ambiguity would be created. The affected noncharacters (which are not all of the noncharacters in Unicode, only a small and well-defined subset) are safely quoted and unquoted, a reversible and unambiguous process. I acknowledge there is a type aliasing issue, but this is the best solution available under the constraints that 1. file name objects (and many other strings that come from the operating system, such as environment variables) have to be representable as Scheme strings for backwards compatibility and 2. Scheme strings must be pure sequences of Unicode scalar values |
I meant mangling as in "C++ function name mangling", which is reversible. |
FWIW Racket extends procedures that operate on file paths to accept |
The functions
described in Section 9.16. File System Interface https://cisco.github.io/ChezScheme/csug10.0/io.html#./io:h16 operate on file names and directory names represented as Scheme strings.
When actually accessing the file system on POSIX systems, such strings are automatically converted from/to UTF-8.
This has the side effect that existing files and directories whose names are invalid UTF-8 cannot be accessed by the functions listed above.
Example:
The problem is: byte #xff is not a valid UTF-8 sequence,
and (directory-list) converts it to the replacement character #xFFFD as per UTF-8 error-handling rules.
As a consequence, the file created with shell command
touch $(printf 'AAA\xffzzz')
and all other files or directories whose names are invalid UTF-8
cannot be accessed with Chez Scheme functions listed above.
Since POSIX file system specifications do not require file or directory names to be valid UTF-8,
this leaves the above Chez Scheme functions in the uncomfortable position of failing on some valid POSIX file and directory names.
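The lossy step can be reproduced outside Chez Scheme. The following is a minimal Python sketch, not Chez Scheme's conversion code; it assumes only that the decoder substitutes U+FFFD for invalid bytes, as described above.

```python
# Illustration only: mimics a UTF-8 decoder that substitutes U+FFFD for invalid bytes.
raw_name = b"AAA\xffzzz"  # file name bytes as created by the touch command above

# Decode the way a replacing UTF-8 decoder would, then convert back to bytes.
decoded = raw_name.decode("utf-8", errors="replace")
reencoded = decoded.encode("utf-8")

print(repr(decoded))          # 'AAA\ufffdzzz': the #xff byte is gone
print(reencoded)              # b'AAA\xef\xbf\xbdzzz': U+FFFD encodes as EF BF BD
print(reencoded == raw_name)  # False: the original name cannot be recovered
```

Any name passed back to the operating system therefore refers to a different file, or to none at all.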
A solution could be to convert file and directory names from/to UTF-8b (note the 'b') instead of UTF-8,
because UTF-8b is an extension of UTF-8 designed exactly to losslessly convert any byte or byte sequence.
For a definition of UTF-8b, see
https://peps.python.org/pep-0383
https://web.archive.org/web/20090830064219/http://mail.nl.linux.org/linux-utf8/2000-07/msg00040.html
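For comparison, Python implements the PEP 383 scheme as the `surrogateescape` error handler, so the round trip can be sketched there. This is only an illustration of the encoding, not a proposed Chez Scheme implementation, and the helper names `utf8b_decode` and `utf8b_encode` are hypothetical.

```python
# Illustration of UTF-8b using Python's "surrogateescape" handler (PEP 383).
# The helper names are hypothetical and exist only for this example.

def utf8b_decode(raw: bytes) -> str:
    """Decode bytes as UTF-8, mapping each invalid byte 0x80..0xFF to U+DC80..U+DCFF."""
    return raw.decode("utf-8", errors="surrogateescape")


def utf8b_encode(text: str) -> bytes:
    """Encode to UTF-8, turning U+DC80..U+DCFF back into the original bytes."""
    return text.encode("utf-8", errors="surrogateescape")


raw_name = b"AAA\xffzzz"
as_string = utf8b_decode(raw_name)

print(repr(as_string))                      # 'AAA\udcffzzz'
print(utf8b_encode(as_string) == raw_name)  # True: the round trip is lossless

# Valid UTF-8 input decodes exactly as it does with plain UTF-8.
print(utf8b_decode("naïve".encode("utf-8")) == "naïve")  # True
```

The resulting string contains a lone surrogate, which is the point debated in the comments: it round-trips, but it is not a sequence of Unicode scalar values.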