Commit 1e146e7

Doc update: clarify ASCII options and update ChangeLog and HTML
1 parent 231824f commit 1e146e7

11 files changed, with 2038 additions and 1959 deletions

ChangeLog

Lines changed: 7 additions & 2 deletions
Original file line number | Diff line number | Diff line change
@@ -130,13 +130,18 @@ includes underscore.
130130

131131
33. Changed the meaning of [:xdigit:] in UCP mode to match Perl. It now also
132132
matches the "fullwidth" versions of the hex digits. Just like it is done for
133-
[:digit:], PCRE2_EXTRA_ASCII_DIGIT can be used to keep this class ASCII only.
133+
[:digit:], PCRE2_EXTRA_ASCII_DIGIT can be used to keep this class ASCII only
134+
without affecting other POSIX classes.
134135

135136
34. GitHub PR305 fixes a potential integer overflow in pcre2_dfa_match().
136137

137-
35. Updated handling of \b and \B in UCP mode to match the changes to \w in 32
138+
35. Updated handling of \b and \B in UCP mode to match the changes to \w in 32
138139
above because \b and \B are defined in terms of \w.
139140

141+
36. Within a pattern (?aT) and (?-aT) set and reset the PCRE2_EXTRA_ASCII_DIGIT
142+
option, and (?aP) also sets (?aT) so that (?-aP) disables all ASCII
143+
restrictions on POSIX classes.
144+
140145

141146
Version 10.42 11-December-2022
142147
------------------------------
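
ChangeLog item 36 above describes how the in-pattern settings map onto option bits. The following is a minimal Python sketch of that bookkeeping only, not PCRE2 code: the `Extra` flag class, the `LETTERS` table, and `apply_setting` are hypothetical names invented for illustration, and the sketch handles just the two letter pairs discussed in this commit.

```python
from enum import Flag

class Extra(Flag):
    """Hypothetical stand-ins for the two PCRE2_EXTRA_ASCII_* bits discussed here."""
    ASCII_POSIX = 1
    ASCII_DIGIT = 2

# Which bits each option-letter pair touches, as stated in ChangeLog item 36:
# (?aT) controls ASCII_DIGIT alone, while (?aP) controls both bits.
LETTERS = {
    "aT": Extra.ASCII_DIGIT,
    "aP": Extra.ASCII_POSIX | Extra.ASCII_DIGIT,
}

def apply_setting(flags, setting):
    """Apply one setting such as '(?aT)' or '(?-aP)' to a set of flags."""
    body = setting[2:-1]              # strip the leading '(?' and trailing ')'
    unset = body.startswith("-")      # a hyphen means "unset these options"
    bits = LETTERS[body.lstrip("-")]
    return flags & ~bits if unset else flags | bits
```

In this model, applying `(?aT)` and then `(?-aP)` leaves no bits set, which is the consistency property the commit message aims for: `(?-aP)` removes every ASCII restriction on POSIX classes.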

doc/html/pcre2api.html

Lines changed: 8 additions & 5 deletions
@@ -2014,13 +2014,16 @@ <h1>pcre2api man page</h1>
20142014
PCRE2_EXTRA_ASCII_DIGIT
20152015
</pre>
20162016
This option forces the POSIX character classes [:digit:] and [:xdigit:] to
2017-
match only ASCII digits, even when PCRE2_UCP is set.
2017+
match only ASCII digits, even when PCRE2_UCP is set. It can be changed within
2018+
a pattern by means of the (?aT) option setting.
20182019
<pre>
20192020
PCRE2_EXTRA_ASCII_POSIX
20202021
</pre>
2021-
This option forces the POSIX character classes to match only ASCII characters,
2022-
even when PCRE2_UCP is set. It can be changed within a pattern by means of the
2023-
(?aP) option setting.
2022+
This option forces all the POSIX character classes, including [:digit:] and
2023+
[:xdigit:], to match only ASCII characters, even when PCRE2_UCP is set. It can
2024+
be changed within a pattern by means of the (?aP) option setting, but note that
2025+
this also sets PCRE2_EXTRA_ASCII_DIGIT in order to ensure that (?-aP) unsets
2026+
all ASCII restrictions for POSIX classes.
20242027
<pre>
20252028
PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL
20262029
</pre>
@@ -4137,7 +4140,7 @@ <h1>pcre2api man page</h1>
41374140
</P>
41384141
<br><a name="SEC43" href="#TOC1">REVISION</a><br>
41394142
<P>
4140-
Last updated: 19 September 2023
4143+
Last updated: 12 October 2023
41414144
<br>
41424145
Copyright &copy; 1997-2023 University of Cambridge.
41434146
<br>

doc/html/pcre2compat.html

Lines changed: 11 additions & 4 deletions
@@ -71,9 +71,10 @@ <h1>pcre2compat man page</h1>
7171
7. The Perl escape sequences \p, \P, and \X are supported only if PCRE2 is
7272
built with Unicode support (the default). The properties that can be tested
7373
with \p and \P are limited to the general category properties such as Lu and
74-
Nd, script names such as Greek or Han, Bidi_Class, Bidi_Control, and the
75-
derived properties Any and LC (synonym L&). Both PCRE2 and Perl support the Cs
76-
(surrogate) property, but in PCRE2 its use is limited. See the
74+
Nd, the derived properties Any and LC (synonym L&), script names such as Greek
75+
or Han, Bidi_Class, Bidi_Control, and a few binary properties. Both PCRE2 and
76+
Perl support the Cs (surrogate) property, but in PCRE2 its use is limited. See
77+
the
7778
<a href="pcre2pattern.html"><b>pcre2pattern</b></a>
7879
documentation for details. The long synonyms for property names that Perl
7980
supports (such as \p{Letter}) are not supported by PCRE2, nor is it permitted
@@ -239,6 +240,12 @@ <h1>pcre2compat man page</h1>
239240
fall into any stack-overflow limit. PCRE2 made a similar change at release
240241
10.30, and also has many build-time and run-time customizable limits.
241242
</P>
243+
<P>
244+
21. Unlike Perl, PCRE2 does not have character set modifiers, and in particular has no way
245+
to set characters by context as Perl's "/d" does. A regular expression using
246+
PCRE2_UTF and PCRE2_UCP will use similar rules to Perl's "/u"; something closer
247+
to "/a" could be selected by adding other PCRE2_EXTRA_ASCII* options on top.
248+
</P>
242249
<br><b>
243250
AUTHOR
244251
</b><br>
@@ -254,7 +261,7 @@ <h1>pcre2compat man page</h1>
254261
REVISION
255262
</b><br>
256263
<P>
257-
Last updated: 19 September 2023
264+
Last updated: 12 October 2023
258265
<br>
259266
Copyright &copy; 1997-2023 University of Cambridge.
260267
<br>

doc/html/pcre2pattern.html

Lines changed: 29 additions & 17 deletions
@@ -1521,7 +1521,7 @@ <h1>pcre2pattern man page</h1>
15211521
The default "space" characters are HT (9), LF (10), VT (11), FF (12), CR (13),
15221522
and space (32). If locale-specific matching is taking place, the list of space
15231523
characters may be different; there may be fewer or more of them. "Space" and
1524-
\s match the same set of characters.
1524+
\s match the same set of characters, as do "word" and \w.
15251525
</P>
15261526
<P>
15271527
The name "word" is a Perl extension, and "blank" is a GNU extension from Perl
@@ -1538,15 +1538,15 @@ <h1>pcre2pattern man page</h1>
15381538
By default, characters with values greater than 127 do not match any of the
15391539
POSIX character classes, although this may be different for characters in the
15401540
range 128-255 when locale-specific matching is happening. However, in UCP mode,
1541-
some of the classes are changed so that Unicode character properties are used.
1542-
This is achieved by replacing certain POSIX classes with other sequences, as
1543-
follows:
1541+
unless certain options are set (see below), some of the classes are changed so
1542+
that Unicode character properties are used. This is achieved by replacing
1543+
POSIX classes with other sequences, as follows:
15441544
<pre>
15451545
[:alnum:] becomes \p{Xan}
15461546
[:alpha:] becomes \p{L}
15471547
[:blank:] becomes \h
15481548
[:cntrl:] becomes \p{Cc}
1549-
[:digit:] becomes \p{Nd} unless PCRE2_EXTRA_ASCII_DIGIT is set
1549+
[:digit:] becomes \p{Nd}
15501550
[:lower:] becomes \p{Ll}
15511551
[:space:] becomes \p{Xps}
15521552
[:upper:] becomes \p{Lu}
@@ -1581,16 +1581,20 @@ <h1>pcre2pattern man page</h1>
15811581
<P>
15821582
[:xdigit:]
15831583
In addition to the ASCII hexadecimal digits, this also matches the "fullwidth"
1584-
versions of those characters, whose Unicode code points start at U+FF10. The
1585-
effect of PCRE2_UCP can be negated by setting the PCRE2_EXTRA_ASCII_DIGIT
1586-
option, just like it does for [:digit]. This is a change that was made in
1587-
PCRE release 10.43 for Perl compatibility.
1584+
versions of those characters, whose Unicode code points start at U+FF10. This
1585+
is a change that was made in PCRE release 10.43 for Perl compatibility.
15881586
</P>
15891587
<P>
15901588
The other POSIX classes are unchanged by PCRE2_UCP, and match only characters
1591-
with code points less than 256. The effect of PCRE2_UCP on all POSIX classes
1592-
can be negated by setting the PCRE2_EXTRA_ASCII_POSIX option, either when
1593-
calling <b>pcre2_compile()</b> or internally within the pattern.
1589+
with code points less than 256.
1590+
</P>
1591+
<P>
1592+
There are two options that can be used to restrict the POSIX classes to ASCII
1593+
characters when PCRE2_UCP is set. The option PCRE2_EXTRA_ASCII_DIGIT affects
1594+
just [:digit:] and [:xdigit:]. Within a pattern, this can be set and unset by
1595+
(?aT) and (?-aT). The PCRE2_EXTRA_ASCII_POSIX option disables UCP processing
1596+
for all POSIX classes, including [:digit:] and [:xdigit:]. Within a pattern,
1597+
(?aP) and (?-aP) set and unset both these options for consistency.
15941598
</P>
15951599
<br><a name="SEC11" href="#TOC1">COMPATIBILITY FEATURE FOR WORD BOUNDARIES</a><br>
15961600
<P>
@@ -1609,7 +1613,9 @@ <h1>pcre2pattern man page</h1>
16091613
<a href="#smallassertions">"Simple assertions"</a>
16101614
above), and in a Perl-style pattern the preceding or following character
16111615
normally shows which is wanted, without the need for the assertions that are
1612-
used above in order to give exactly the POSIX behaviour.
1616+
used above in order to give exactly the POSIX behaviour. Note also that the
1617+
PCRE2_UCP option changes the meaning of \w (and therefore \b) by default, so
1618+
it also affects these POSIX sequences.
16131619
</P>
16141620
<br><a name="SEC12" href="#TOC1">VERTICAL BAR</a><br>
16151621
<P>
@@ -1643,8 +1649,8 @@ <h1>pcre2pattern man page</h1>
16431649
</pre>
16441650
For example, (?im) sets caseless, multiline matching. It is also possible to
16451651
unset these options by preceding the relevant letters with a hyphen, for
1646-
example (?-im). The two "extended" options are not independent; unsetting either
1647-
one cancels the effects of both of them.
1652+
example (?-im). The two "extended" options are not independent; unsetting
1653+
either one cancels the effects of both of them.
16481654
</P>
16491655
<P>
16501656
A combined setting and unsetting such as (?im-sx), which sets PCRE2_CASELESS
@@ -1665,7 +1671,8 @@ <h1>pcre2pattern man page</h1>
16651671
aD for PCRE2_EXTRA_ASCII_BSD
16661672
aS for PCRE2_EXTRA_ASCII_BSS
16671673
aW for PCRE2_EXTRA_ASCII_BSW
1668-
aP for PCRE2_EXTRA_ASCII_POSIX
1674+
aP for PCRE2_EXTRA_ASCII_POSIX and PCRE2_EXTRA_ASCII_DIGIT
1675+
aT for PCRE2_EXTRA_ASCII_DIGIT
16691676
r for PCRE2_EXTRA_CASELESS_RESTRICT
16701677
J for PCRE2_DUPNAMES
16711678
U for PCRE2_UNGREEDY
@@ -1675,6 +1682,11 @@ <h1>pcre2pattern man page</h1>
16751682
above, it sets (or unsets) all the ASCII options.
16761683
</P>
16771684
<P>
1685+
PCRE2_EXTRA_ASCII_DIGIT has no additional effect when PCRE2_EXTRA_ASCII_POSIX
1686+
is set, but including it in (?aP) means that (?-aP) suppresses all ASCII
1687+
restrictions for POSIX classes.
1688+
</P>
1689+
<P>
16781690
When one of these option changes occurs at top level (that is, not inside group
16791691
parentheses), the change applies until a subsequent change, or the end of the
16801692
pattern. An option change within a group (see below for a description of
@@ -3832,7 +3844,7 @@ <h1>pcre2pattern man page</h1>
38323844
</P>
38333845
<br><a name="SEC32" href="#TOC1">REVISION</a><br>
38343846
<P>
3835-
Last updated: 04 October 2023
3847+
Last updated: 12 October 2023
38363848
<br>
38373849
Copyright &copy; 1997-2023 University of Cambridge.
38383850
<br>
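
The pcre2pattern.html hunks above describe which POSIX classes are replaced in UCP mode and how the two ASCII options suppress that. As a rough illustration, the rules can be modelled as a lookup. This is a sketch under stated assumptions, not PCRE2's implementation: `effective_class` and its keyword arguments are hypothetical names, the table holds only the replacements visible in the diff context, and the `[:xdigit:]` entry is a descriptive label rather than pattern syntax.

```python
# Replacements visible in the diff context above; classes not listed are returned unchanged.
UCP_REPLACEMENTS = {
    "[:alnum:]": r"\p{Xan}",
    "[:alpha:]": r"\p{L}",
    "[:blank:]": r"\h",
    "[:cntrl:]": r"\p{Cc}",
    "[:digit:]": r"\p{Nd}",
    "[:lower:]": r"\p{Ll}",
    "[:space:]": r"\p{Xps}",
    "[:upper:]": r"\p{Lu}",
    # Descriptive label only: in UCP mode [:xdigit:] also matches fullwidth hex digits.
    "[:xdigit:]": "[:xdigit:] plus fullwidth hex digits",
}

# The classes that PCRE2_EXTRA_ASCII_DIGIT restricts, according to the text above.
DIGIT_CLASSES = {"[:digit:]", "[:xdigit:]"}

def effective_class(name, ucp=False, ascii_posix=False, ascii_digit=False):
    """Return what a POSIX class behaves as under the given (modelled) options."""
    if not ucp or ascii_posix:
        return name                   # no UCP processing for any POSIX class
    if ascii_digit and name in DIGIT_CLASSES:
        return name                   # digit classes stay ASCII-only
    return UCP_REPLACEMENTS.get(name, name)
```

For example, with `ucp=True, ascii_digit=True` the model leaves `[:digit:]` alone but still maps `[:alpha:]` to `\p{L}`, matching the statement that PCRE2_EXTRA_ASCII_DIGIT affects just `[:digit:]` and `[:xdigit:]`.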

doc/html/pcre2syntax.html

Lines changed: 12 additions & 6 deletions
@@ -391,10 +391,11 @@ <h1>pcre2syntax man page</h1>
391391
of the group.
392392
<pre>
393393
(?a) all ASCII options
394-
(?aD) restrict \d to ASCII, even in UCP mode
395-
(?aS) restrict \s to ASCII, even in UCP mode
396-
(?aW) restrict \w to ASCII, even in UCP mode
397-
(?aP) restrict POSIX classes to ASCII even in UCP mode
394+
(?aD) restrict \d to ASCII in UCP mode
395+
(?aS) restrict \s to ASCII in UCP mode
396+
(?aW) restrict \w to ASCII in UCP mode
397+
(?aP) restrict all POSIX classes to ASCII in UCP mode
398+
(?aT) restrict POSIX digit classes to ASCII in UCP mode
398399
(?i) caseless
399400
(?J) allow duplicate named groups
400401
(?m) multiline
@@ -404,9 +405,14 @@ <h1>pcre2syntax man page</h1>
404405
(?U) default ungreedy (lazy)
405406
(?x) ignore white space except in classes or \Q...\E
406407
(?xx) as (?x) but also ignore space and tab in classes
407-
(?-...) unset option(s)
408+
(?-...) unset the given option(s)
408409
(?^) unset imnrsx options
409410
</pre>
411+
(?aP) implies (?aT) as well, though this has no additional effect. However, it
412+
means that (?-aP) is really (?-PT) which disables all ASCII restrictions for
413+
POSIX classes.
414+
</P>
415+
<P>
410416
Unsetting x or xx unsets both. Several options may be set at once, and a
411417
mixture of setting and unsetting such as (?i-x) is allowed, but there may be
412418
only one hyphen. Setting (but no unsetting) is allowed after (?^ for example
@@ -620,7 +626,7 @@ <h1>pcre2syntax man page</h1>
620626
</P>
621627
<br><a name="SEC32" href="#TOC1">REVISION</a><br>
622628
<P>
623-
Last updated: 30 September 2023
629+
Last updated: 12 October 2023
624630
<br>
625631
Copyright &copy; 1997-2023 University of Cambridge.
626632
<br>

doc/html/pcre2unicode.html

Lines changed: 7 additions & 4 deletions
@@ -52,9 +52,12 @@ <h1>pcre2unicode man page</h1>
5252
\P{..}, and \X can be used. This is not dependent on the PCRE2_UTF setting.
5353
The Unicode properties that can be tested are a subset of those that Perl
5454
supports. Currently they are limited to the general category properties such as
55-
Lu for an upper case letter or Nd for a decimal number, the Unicode script
56-
names such as Arabic or Han, Bidi_Class, Bidi_Control, and the derived
57-
properties Any and LC (synonym L&). Full lists are given in the
55+
Lu for an upper case letter or Nd for a decimal number, the derived properties
56+
Any and LC (synonym L&), the Unicode script names such as Arabic or Han,
57+
Bidi_Class, Bidi_Control, and a few binary properties.
58+
</P>
59+
<P>
60+
The full lists are given in the
5861
<a href="pcre2pattern.html"><b>pcre2pattern</b></a>
5962
and
6063
<a href="pcre2syntax.html"><b>pcre2syntax</b></a>
@@ -510,7 +513,7 @@ <h1>pcre2unicode man page</h1>
510513
REVISION
511514
</b><br>
512515
<P>
513-
Last updated: 04 February 2023
516+
Last updated: 12 October 2023
514517
<br>
515518
Copyright &copy; 1997-2023 University of Cambridge.
516519
<br>
