perlunicode: Add discussion about malformations #23553

Open
wants to merge 1 commit into
base: blead
179 changes: 151 additions & 28 deletions pod/perlunicode.pod
@@ -1817,50 +1817,173 @@ through C<0x10FFFF>.)

=head2 Security Implications of Unicode

First, read
L<Unicode Security Considerations|https://www.unicode.org/reports/tr36>.
The security implications of Unicode are quite complicated, as you might
expect from it trying to handle all the world's scripts. An
introduction is in
L<Unicode Security Methods|https://www.unicode.org/reports/tr39>.

Also, note the following:
Here are a few examples of pitfalls:

=over 4

=item *

Malformed UTF-8

UTF-8 is very structured, so many combinations of bytes are invalid. In
the past, Perl tried to soldier on and make some sense of invalid
combinations, but this can lead to security holes, so now, if the Perl
core needs to process an invalid combination, it will either raise a
fatal error, or will replace those bytes by the sequence that forms the
Unicode REPLACEMENT CHARACTER, for which purpose Unicode created it.

Every code point can be represented by more than one possible
syntactically valid UTF-8 sequence. Early on, both Unicode and Perl
considered any of these to be valid, but now, all sequences longer
than the shortest possible one are considered to be malformed.
=item Confusables

Many characters in Unicode look similar enough to other characters that
they are easily confused with each other. This is true even within a
single script: in the Latin script used for English, the digit C<0> and
the capital letter C<O> may look alike. But people who use that script
know to look out for that.

In Unicode, a digit in one script may be confusable with a digit having
a different numeric value in another. A malicious website could use
this to make the price of something appear lower than what you are
actually charged. (You can use L<perlre/Script Runs> to make sure such
digits are not being inter-mixed.)
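As a minimal sketch of the digit check described above, the following uses the C<(*script_run:...)> construct from L<perlre/Script Runs>. It assumes Perl 5.32 or later, and the helper name C<is_unmixed_digits> is made up for this illustration; it is not part of any core API.

```perl
use v5.32;
use warnings;

# Hypothetical helper: true only if $str consists entirely of decimal
# digits that form a single script run, which requires all of them to
# come from the same set of ten digits.
sub is_unmixed_digits {
    my ($str) = @_;
    return $str =~ /\A(*script_run:\d+)\z/ ? 1 : 0;
}

my $ascii   = "18";
my $bengali = "\x{09E7}\x{09EE}";   # BENGALI DIGIT ONE, BENGALI DIGIT EIGHT
my $mixed   = "1\x{09EA}";          # ASCII 1, then BENGALI DIGIT FOUR

for my $candidate ($ascii, $bengali, $mixed) {
    my $code_points = join " ", map { sprintf "U+%04X", ord } split //, $candidate;
    say "$code_points: ", is_unmixed_digits($candidate) ? "accepted" : "rejected";
}
```

A plain C<\d+> would accept the mixed string, because every character in it is a decimal digit; only the script-run wrapper notices that the digits come from two different sets.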

This is a general problem with internet addresses. The people who give out
domain names need to be careful to not give out ones that spoof other ones
(examples in L<perlre/Script Runs>).

And computer program identifier names can be such that they look like
something they're not, and hence could fool a code reviewer, for
example. Script runs on the individual identifiers can catch many of
these, but not all. All the letters in the ASCII word C<scope> have
look-alikes in Cyrillic, though those do not form a real word. Using
those Cyrillic letters in that order would almost certainly be an
attempt at spoofing.
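The limits of script runs for identifiers can be sketched as follows. This assumes Perl 5.32 or later; the helper names C<is_single_script> and C<is_latin_only> are hypothetical, and the Latin-only rule is just one possible project policy, not something Perl enforces.

```perl
use v5.32;
use warnings;

# Hypothetical helper: the identifier must form a single script run.
sub is_single_script {
    my ($identifier) = @_;
    return $identifier =~ /\A(*sr:\w+)\z/ ? 1 : 0;
}

# Hypothetical, stricter policy: every character must be usable in the
# Latin script (judged by the Script_Extensions property).
sub is_latin_only {
    my ($identifier) = @_;
    return $identifier =~ /\A\p{Script_Extensions=Latin}+\z/ ? 1 : 0;
}

my $latin    = "scope";
my $mixed    = "s\x{0441}ope";      # CYRILLIC SMALL LETTER ES replaces "c"
my $cyrillic = "\x{0455}\x{0441}\x{043E}\x{0440}\x{0435}";  # all Cyrillic look-alikes

for my $identifier ($latin, $mixed, $cyrillic) {
    my $code_points = join " ", map { sprintf "U+%04X", ord } split //, $identifier;
    say "$code_points: single script = ", is_single_script($identifier),
        ", Latin only = ", is_latin_only($identifier);
}
```

The mixed identifier fails the script-run test, but the all-Cyrillic one passes it, which is the case the text warns about. Catching it needs an additional rule such as the Latin-only check sketched here.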

=item Malformed text
X<REPLACEMENT CHARACTER>

Successful attacks have been made against websites and databases by
passing them strings that aren't actually legal. The receiver fails
to realize this, and performs an action it otherwise wouldn't, based on
what it thinks the input meant. Such strings are said to be
"malformed" or "ill-formed".

Vast sums of money have been lost to such attacks. It became important
to not fall for them, which involves detecting malformed text and taking
appropriate action (or inaction).

The Unicode REPLACEMENT CHARACTER (U+FFFD) is crucial to detecting and
handling malformed text. Its only purpose is to indicate that it is a
substitute for something else. It is generally displayed as a white
question mark inside a dark diamond (a square rotated 45 degrees).

When a malformed string is encountered, the code processing it should
substitute the REPLACEMENT CHARACTER for it. There are now strict rules
as to what parts get replaced. You should never try to infer what the
replaced part was meant to be. To do so could be falling into an
attacker's trap.
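From pure Perl, one way to see this substitution is through the L<Encode> module. The sketch below is illustrative only; the sample bytes are made up. By default C<decode> substitutes the REPLACEMENT CHARACTER for the malformed byte, while passing C<Encode::FB_CROAK> makes it die instead.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The byte 0xFF can never appear in well-formed UTF-8.
my $malformed = "ab\xFFcd";

# With no CHECK argument, the malformed byte is replaced by U+FFFD.
my $decoded = decode('UTF-8', $malformed);

# With FB_CROAK, decoding fails loudly. A copy is passed because this
# mode is allowed to modify its source argument.
my $ok = eval {
    my $copy = $malformed;
    decode('UTF-8', $copy, Encode::FB_CROAK);
    1;
};
my $error = $ok ? '' : $@;

printf "decoded code points: %s\n",
    join " ", map { sprintf "U+%04X", ord } split //, $decoded;
print "strict decoding ", ($ok ? "succeeded" : "failed: $error"), "\n";
```

Which of the two behaviors is appropriate depends on the application: substitution keeps the rest of the text usable, while failing outright may be the safer choice for security-sensitive input.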

Many of the attack vectors were not originally envisioned by Unicode's
creators, nor by its implementers, such as Perl. The rules about which
strings are acceptable were originally laxer, and were tightened as the
school of hard knocks dictated.

Unfortunately, the Perl interpreter's methods of working with Unicode
strings were developed when we too were naive about the possibility of
attacks. Because of concerns about breaking existing code and continued
naivety about the consequences, there has been resistance to changing
it, and so our implementation has lagged behind Unicode's requirements
and recommendations. But, over the years, various improvements have
been added to minimize the issues.

Therefore, it is important to use the latest version of the Perl
interpreter when working with Unicode. Not until Perl v5.44 is it fully
hardened against known attack vectors. And who knows what new ideas
clever attackers may come up with in the future that we will have to
change to counter. (Although no new ones have become known recently
that the Unicode Standard wasn't prepared for.) And CPAN modules can
easily lag behind the interpreter itself.

Note that finding a REPLACEMENT CHARACTER in your string doesn't
necessarily mean there is an attack. It is a perfectly legal input
character, for whatever reason.

Contributor

I disagree, WinPerl's generic C coding policy is and MS's help docs say finding REPLACEMENT CHARACTER is I/O error, the SATA/SCSI/IDE cable was yanked, no more information is available. Imagine feeding all of BMP, or 100s of invalid utf8 surrogates into a state machine/de-dupe logic/HV* hash, and every last string pops out the front end or backend end of the API being memcmp(username1, username2, len) == 0 identical. Yet they were all unique different customers/ip addresses/zip codes/email addresses 1 millisecond ago.

Contributor Author

What I meant is that TUS does not prohibit someone from placing the REPLACEMENT CHARACTER in Unicode strings. It probably is a bad idea, but it isn't illegal

Contributor

I forgot the exact experiments I did with MS's multiple in-house, competing, UTF16 <- and -> UTF8 converters, located on the Windows installer .iso in various .dlls. But at least one of their .dlls doesn't return U+FFFD on the UTF8 side, but instead starts returning UTF8 code points from this range https://en.wikipedia.org/wiki/Private_Use_Areas which I think MS is using to return their secret error codes hidden inside PUA code points on how the UTF16 input byte stream failed "validation" whatever that particular MS .dll is calling "validating an input UTF16 bytestream".

Hence I've never discussed or considered beyond 10 seconds for blindly dropping MS algorithms/const RO static array tables of metadata on WinPerl, into where Perl brews its own in-house const RO static array tables of metadata. I don't trust the MS APIs, they are there to serve MS's in-house and users of MS OSes and the public commercially distributed Win32-only ecosystem software.

That isn't the same scope or venn diagram of users as Perl 5 users. Perhaps it could be proven a certain MS API is identical for all possible inputs and outputs as a perl in house API, but im not qualified to prove that through unit tests. Prob not worth the maint either.

One of those reasons is when there is an encoding for which there are
Unicode equivalents for most, but not all characters in it. You just
use the REPLACEMENT CHARACTER for the missing ones. As long as most of
the text is translatable, the results could be intelligible to a human
reader.

Contributor

ыОУ СХОУЛД ГИЖЕ АН ХОНЕРАБЛЕ МЕНТИОН ТО КОИ╦ АС ТХЕ ОНЛЫ КОДЕ ПАГЕ ВХЕРЕ ТХИС ИС ТРУЕ

Contributor Author

I'm unsure of your point here. I happen to be able to read Cyrillic, and surprisingly several of the words you gave here are pronounceable. My point is that if you have valid text in some encoding that you are translating to Unicode, and you encounter a character that doesn't have a Unicode equivalent, TUS explicitly says to substitute the REPLACEMENT CHARACTER in your translation, rather than doing anything else. (Maybe you could return failure, I suppose.)

Contributor

My opinion is 0xFFEE and low 7b "?" is not readable with human eyes. Its not an overcompressed jpeg image, or what I did with KOI8 bitwise math logic up above. Its a bad disk sector, you will never know what was behind the square, and no amount of $ to a data recovery firm will get that character back, or give you a good enough guess what it used to be.

If its AI generated auto captions, "???" "..." or nowadays AI algos just print "*music*" for 8 minutes instead of low 7b "?" on the screen. I just dont want anyone to think a 0xFFEE or a '?' in a CSV file is ever acceptable coding practices and to walk away and move on with daily life after seeing it for half a second in a system tracing log/sql table/dev tools console.

This, in fact, was one of the main reasons (besides malformation
substitution) for the creation of the REPLACEMENT CHARACTER in the very
first version of the Unicode Standard (often abbreviated TUS). Back
then, many characters were missing that have since been added (the first
release had roughly 7,000 characters; Unicode 16.0 has nearly 155,000).

Transmission errors in which bits get dropped, and disk sector
failures, can also cause malformations. The REPLACEMENT CHARACTER gets
substituted, and the results may be legible to a human as long as the
error rate is low enough.

Now that Unicode is far more complete, and the odds of finding a
character that Unicode doesn't know about are far lower, the primary use
of the REPLACEMENT CHARACTER is to substitute for malformations in
strings.

When you are programming in pure Perl, you end up relying on the
underlying interpreter and modules to handle these kinds of nuances.
You are responsible, however, for knowing the encoding(s) needed for
your program to interact with the outside world. For example, a common
encoding for files is UTF-8. You could use the following to read one:

    use PerlIO::encoding;
    my $path = "path-to-UTF-8-file";
    open my $fh, "<:encoding(UTF-8)", $path
        or die "Couldn't open $path: $!";

Behind the scenes, this uses the L<Encode> module to translate the
contents of the file named by C<$path> into something Perl can
understand. L<C<Encode>> knows how to handle a wide variety of
encodings. Use the same paradigm to output to a file:

    use PerlIO::encoding;
    my $out_path = "path-to-UTF-8-file";
    open my $fh, ">:encoding(UTF-8)", $out_path
        or die "Couldn't open $out_path: $!";

(You can also use L<perlfunc/C<binmode>> to change the encoding of an
already-open file.)
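A small sketch of the C<binmode> approach follows. It writes a temporary file and reads it back, setting the encoding layer on handles that are already open. The use of L<File::Temp> and the sample text are assumptions made only to keep the example self-contained.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Create a scratch file; File::Temp removes it when the program exits.
my ($out_fh, $tmp_path) = tempfile(UNLINK => 1);

# Add the encoding layer to the already-open output handle.
binmode($out_fh, ":encoding(UTF-8)")
    or die "Couldn't set output encoding: $!";
print {$out_fh} "caf\x{E9} \x{263A}\n";
close $out_fh or die "Couldn't close $tmp_path: $!";

# Open without any layer, then add the encoding layer afterwards.
open my $in_fh, "<", $tmp_path
    or die "Couldn't open $tmp_path: $!";
binmode($in_fh, ":encoding(UTF-8)")
    or die "Couldn't set input encoding: $!";
my $line = <$in_fh>;
chomp $line;
close $in_fh;

printf "read %d characters\n", length $line;
```

Because both handles carry the encoding layer, the string read back holds characters rather than the raw bytes stored in the file.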

(There are fewer options for specifying the encoding of arguments passed
to your Perl program or for interacting with environment variables. See
L<perlrun/PERL_UNICODE>.)

Skip to the end of this item unless you are writing in XS.

When writing in XS and manipulating Unicode strings, you need to know
more about the internals. L<perlapi/Unicode Support> lists the
available API elements for working with them. UTF-8 is how Unicode
strings are currently stored internally by the interpreter. You B<need>
to be using functions in the C<utf8_to_uv> family when parsing a string
encoded in UTF-8. This avoids the pitfalls of earlier API functions,
whose names contained C<to_uvchr> instead of plain C<to_uv>.

=item Illegal code points

Unicode considers many code points to be illegal, or to be avoided.
Perl generally accepts them, once they have passed through any input
Perl generally accepts them anyway, once they have passed through any input
filters that may try to exclude them. These have been discussed above
(see "Surrogates" under UTF-16 in L</Unicode Encodings>,
L</Noncharacter code points>, and L</Beyond Unicode code points>).

=item *
If you are writing in XS, the L<perlapi/utf8_to_uv> family of functions
includes ones that can exclude common varieties of these. In
particular, C<strict_utf8_to_uv> excludes everything outside the most
restrictive set defined by TUS.
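For pure-Perl code that wants to screen for these itself, the three problematic classes can be told apart arithmetically. This is only a sketch: the function name C<code_point_category> and the labels it returns are made up for illustration, while the ranges are the ones defined by the Unicode Standard.

```perl
use strict;
use warnings;

# Hypothetical helper: classify a code point given as an integer.
sub code_point_category {
    my ($cp) = @_;

    # Perl allows code points beyond the Unicode maximum.
    return 'above-Unicode' if $cp > 0x10FFFF;

    # Surrogates are reserved for use by UTF-16.
    return 'surrogate' if $cp >= 0xD800 && $cp <= 0xDFFF;

    # Noncharacters: U+FDD0..U+FDEF, plus the last two code points
    # of every plane (U+xxFFFE and U+xxFFFF).
    return 'noncharacter'
        if ($cp >= 0xFDD0 && $cp <= 0xFDEF) || ($cp & 0xFFFE) == 0xFFFE;

    return 'ordinary';
}

for my $cp (0x41, 0xD800, 0xFDD0, 0xFFFD, 0xFFFF, 0x1FFFE, 0x110000) {
    printf "U+%04X: %s\n", $cp, code_point_category($cp);
}
```

Note that the REPLACEMENT CHARACTER itself (U+FFFD) is an ordinary code point under this classification.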

=back

=head2 Regular Expressions

Regular expression pattern matching may surprise you if you're not
accustomed to Unicode. Starting in Perl 5.14, several pattern
modifiers are available to control this, called the character set
modifiers. Details are given in L<perlre/Character set modifiers>.

=back

As discussed elsewhere, Perl has one foot (two hooves?) planted in
each of two worlds: the old world of ASCII and single-byte locales, and
the new world of Unicode, upgrading when necessary.
If your legacy code does not explicitly use Unicode, no automatic
switch-over to Unicode should happen.

=head2 Unicode in Perl on EBCDIC

Unicode is supported on EBCDIC platforms. See L<perlebcdic>.