Commit 363a10e

perlrun: add caution that the -C flag does not validate nor produce UTF-8
1 parent 6fbe2c7 commit 363a10e

1 file changed

pod/perlrun.pod

Lines changed: 21 additions & 10 deletions
@@ -279,19 +279,30 @@ X<-C>
 
 The B<-C> flag controls some of the Perl Unicode features.
 
+B<CAUTION:> As with the L<C<:utf8> PerlIO layer|PerlIO/:utf8>, none of
+the features enabled by this flag or the equivalent C<PERL_UNICODE>
+environment variable validate that input is valid UTF-8, nor guarantee
+to produce valid UTF-8. Instead it will assume input is provided in
+Perl's internal upgraded byte encoding, and provide output in this
+encoding, which is a superset of UTF-8 that can encode any character
+allowed in Perl strings. This can result in broken Perl strings or
+output bytes which are not valid UTF-8. This internal encoding will be
+referred to as C<utf8> below to differentiate it from a strict UTF-8
+encoding format.
+
 As of 5.8.1, the B<-C> can be followed either by a number or a list
 of option letters. The letters, their numeric values, and effects
 are as follows; listing the letters is equal to summing the numbers.
 
-    I     1   STDIN is assumed to be in UTF-8
-    O     2   STDOUT will be in UTF-8
-    E     4   STDERR will be in UTF-8
+    I     1   STDIN is assumed to be in utf8
+    O     2   STDOUT will be in utf8
+    E     4   STDERR will be in utf8
     S     7   I + O + E
-    i     8   UTF-8 is the default PerlIO layer for input streams
-    o    16   UTF-8 is the default PerlIO layer for output streams
+    i     8   :utf8 is the default PerlIO layer for input streams
+    o    16   :utf8 is the default PerlIO layer for output streams
     D    24   i + o
     A    32   the @ARGV elements are expected to be strings encoded
-              in UTF-8
+              in utf8
     L    64   normally the "IOEioA" are unconditional, the L makes
               them conditional on the locale environment variables
               (the LC_ALL, LC_CTYPE, and LANG, in the order of
@@ -307,22 +318,22 @@ perl.h gives W/128 as PERL_UNICODE_WIDESYSCALLS "/* for Sarathy */"
 perltodo mentions Unicode in %ENV and filenames. I guess that these will be
 options e and f (or F).
 
-For example, B<-COE> and B<-C6> will both turn on UTF-8-ness on both
+For example, B<-COE> and B<-C6> will both turn on utf8-ness on both
 STDOUT and STDERR. Repeating letters is just redundant, not cumulative
 nor toggling.
 
 The C<io> options mean that any subsequent open() (or similar I/O
 operations) in main program scope will have the C<:utf8> PerlIO layer
-implicitly applied to them, in other words, UTF-8 is expected from any
-input stream, and UTF-8 is produced to any output stream. This is just
+implicitly applied to them, in other words, utf8 is expected from any
+input stream, and utf8 is produced to any output stream. This is just
 the default set via L<C<${^OPEN}>|perlvar/${^OPEN}>,
 with explicit layers in open() and with binmode() one can
 manipulate streams as usual. This has no effect on code run in modules.
 
 B<-C> on its own (not followed by any number or option list), or the
 empty string C<""> for the L</PERL_UNICODE> environment variable, has the
 same effect as B<-CSDL>. In other words, the standard I/O handles and
-the default C<open()> layer are UTF-8-fied I<but> only if the locale
+the default C<open()> layer are utf8-fied I<but> only if the locale
 environment variables indicate a UTF-8 locale. This behaviour follows
 the I<implicit> (and problematic) UTF-8 behaviour of Perl 5.8.0.
 (See L<perl581delta/UTF-8 no longer default under UTF-8 locales>.)
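
The patched text says that listing option letters is equal to summing their numbers, and that repeated letters are redundant rather than cumulative. The sketch below works through that arithmetic in Python for illustration. It is not how perl parses B<-C>: the name c_flag_number and the dictionary are hypothetical, and the values are copied only from the table rows visible in this diff. A bitwise OR is used instead of a plain sum so that a repeated or overlapping letter (such as S together with I) is not counted twice.

```python
# Illustrative sketch only. Values are copied from the option table shown
# in the diff; the names here are hypothetical and not part of perl.
C_FLAG_VALUES = {
    "I": 1,    # STDIN
    "O": 2,    # STDOUT
    "E": 4,    # STDERR
    "S": 7,    # I + O + E
    "i": 8,    # default layer for input streams
    "o": 16,   # default layer for output streams
    "D": 24,   # i + o
    "A": 32,   # @ARGV elements
    "L": 64,   # make the above conditional on the locale
}


def c_flag_number(letters: str) -> int:
    """Return the number equivalent to a list of -C option letters."""
    mask = 0
    for letter in letters:
        if letter not in C_FLAG_VALUES:
            raise ValueError(f"letter not in the table from the diff: {letter!r}")
        # OR rather than add: repeating a letter must not change the result.
        mask |= C_FLAG_VALUES[letter]
    return mask


if __name__ == "__main__":
    print(c_flag_number("OE"))   # the documented example: -COE equals -C6
    print(c_flag_number("SDL"))  # what a bare -C is documented to mean
```

Under these assumptions "OE" maps to 6, matching the B<-COE> and B<-C6> example in the text, and "SDL" maps to 7 + 24 + 64 = 95.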
