@@ -1521,7 +1521,7 @@ <h1>pcre2pattern man page</h1>
The default "space" characters are HT (9), LF (10), VT (11), FF (12), CR (13),
and space (32). If locale-specific matching is taking place, the list of space
characters may be different; there may be fewer or more of them. "Space" and
- \s match the same set of characters.
+ \s match the same set of characters, as do "word" and \w.
</P>
<P>
The name "word" is a Perl extension, and "blank" is a GNU extension from Perl
@@ -1538,15 +1538,15 @@ <h1>pcre2pattern man page</h1>
By default, characters with values greater than 127 do not match any of the
POSIX character classes, although this may be different for characters in the
range 128-255 when locale-specific matching is happening. However, in UCP mode,
- some of the classes are changed so that Unicode character properties are used.
- This is achieved by replacing certain POSIX classes with other sequences, as
- follows:
+ unless certain options are set (see below), some of the classes are changed so
+ that Unicode character properties are used. This is achieved by replacing
+ POSIX classes with other sequences, as follows:
<pre>
[:alnum:] becomes \p{Xan}
[:alpha:] becomes \p{L}
[:blank:] becomes \h
[:cntrl:] becomes \p{Cc}
- [:digit:] becomes \p{Nd} unless PCRE2_EXTRA_ASCII_DIGIT is set
+ [:digit:] becomes \p{Nd}
[:lower:] becomes \p{Ll}
[:space:] becomes \p{Xps}
[:upper:] becomes \p{Lu}
@@ -1581,16 +1581,20 @@ <h1>pcre2pattern man page</h1>
<P>
[:xdigit:]
In addition to the ASCII hexadecimal digits, this also matches the "fullwidth"
- versions of those characters, whose Unicode code points start at U+FF10. The
- effect of PCRE2_UCP can be negated by setting the PCRE2_EXTRA_ASCII_DIGIT
- option, just like it does for [:digit]. This is a change that was made in
- PCRE release 10.43 for Perl compatibility.
+ versions of those characters, whose Unicode code points start at U+FF10. This
+ is a change that was made in PCRE release 10.43 for Perl compatibility.
</P>
<P>
The other POSIX classes are unchanged by PCRE2_UCP, and match only characters
- with code points less than 256. The effect of PCRE2_UCP on all POSIX classes
- can be negated by setting the PCRE2_EXTRA_ASCII_POSIX option, either when
- calling <b>pcre2_compile()</b> or internally within the pattern.
+ with code points less than 256.
+ </P>
+ <P>
+ There are two options that can be used to restrict the POSIX classes to ASCII
+ characters when PCRE2_UCP is set. The option PCRE2_EXTRA_ASCII_DIGIT affects
+ just [:digit:] and [:xdigit:]. Within a pattern, this can be set and unset by
+ (?aT) and (?-aT). The PCRE2_EXTRA_ASCII_POSIX option disables UCP processing
+ for all POSIX classes, including [:digit:] and [:xdigit:]. Within a pattern,
+ (?aP) and (?-aP) set and unset both these options for consistency.
</P>
<br><a name="SEC11" href="#TOC1">COMPATIBILITY FEATURE FOR WORD BOUNDARIES</a><br>
<P>
@@ -1609,7 +1613,9 @@ <h1>pcre2pattern man page</h1>
<a href="#smallassertions">"Simple assertions"</a>
above), and in a Perl-style pattern the preceding or following character
normally shows which is wanted, without the need for the assertions that are
- used above in order to give exactly the POSIX behaviour.
+ used above in order to give exactly the POSIX behaviour. Note also that the
+ PCRE2_UCP option changes the meaning of \w (and therefore \b) by default, so
+ it also affects these POSIX sequences.
</P>
<br><a name="SEC12" href="#TOC1">VERTICAL BAR</a><br>
<P>
@@ -1643,8 +1649,8 @@ <h1>pcre2pattern man page</h1>
</pre>
For example, (?im) sets caseless, multiline matching. It is also possible to
unset these options by preceding the relevant letters with a hyphen, for
- example (?-im). The two "extended" options are not independent; unsetting either
- one cancels the effects of both of them.
+ example (?-im). The two "extended" options are not independent; unsetting
+ either one cancels the effects of both of them.
</P>
<P>
A combined setting and unsetting such as (?im-sx), which sets PCRE2_CASELESS
@@ -1665,7 +1671,8 @@ <h1>pcre2pattern man page</h1>
aD for PCRE2_EXTRA_ASCII_BSD
aS for PCRE2_EXTRA_ASCII_BSS
aW for PCRE2_EXTRA_ASCII_BSW
- aP for PCRE2_EXTRA_ASCII_POSIX
+ aP for PCRE2_EXTRA_ASCII_POSIX and PCRE2_EXTRA_ASCII_DIGIT
+ aT for PCRE2_EXTRA_ASCII_DIGIT
r for PCRE2_EXTRA_CASELESS_RESTRICT
J for PCRE2_DUPNAMES
U for PCRE2_UNGREEDY
@@ -1675,6 +1682,11 @@ <h1>pcre2pattern man page</h1>
above, it sets (or unsets) all the ASCII options.
</P>
<P>
+ PCRE2_EXTRA_ASCII_DIGIT has no additional effect when PCRE2_EXTRA_ASCII_POSIX
+ is set, but including it in (?aP) means that (?-aP) suppresses all ASCII
+ restrictions for POSIX classes.
+ </P>
+ <P>
When one of these option changes occurs at top level (that is, not inside group
parentheses), the change applies until a subsequent change, or the end of the
pattern. An option change within a group (see below for a description of
@@ -3832,7 +3844,7 @@ <h1>pcre2pattern man page</h1>
</P>
<br><a name="SEC32" href="#TOC1">REVISION</a><br>
<P>
- Last updated: 04 October 2023
+ Last updated: 12 October 2023
<br>
Copyright © 1997-2023 University of Cambridge.
<br>
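
The interaction between (?aT) and (?aP) described in the added paragraphs can be sketched with a small toy model. This is only an illustration in Python, assuming PCRE2_UCP is already set; it does not call PCRE2, and the class `AsciiOptions` and its methods `apply` and `restricted_to_ascii` are hypothetical names invented for this sketch.

```python
from dataclasses import dataclass


@dataclass
class AsciiOptions:
    """Toy model of two PCRE2 extra options (hypothetical, not the PCRE2 API).

    Assumes PCRE2_UCP is set, since the options only matter in UCP mode.
    """
    ascii_posix: bool = False  # stands in for PCRE2_EXTRA_ASCII_POSIX
    ascii_digit: bool = False  # stands in for PCRE2_EXTRA_ASCII_DIGIT

    def apply(self, setting: str) -> None:
        """Apply an inline setting: (?aP), (?-aP), (?aT) or (?-aT)."""
        body = setting[2:-1]              # strip "(?" and ")"
        value = not body.startswith("-")  # a leading hyphen unsets
        letters = body.lstrip("-")
        if letters == "aP":
            # (?aP) and (?-aP) change both options together.
            self.ascii_posix = value
            self.ascii_digit = value
        elif letters == "aT":
            # (?aT) and (?-aT) change only the digit option.
            self.ascii_digit = value
        else:
            raise ValueError(f"setting not modelled here: {setting}")

    def restricted_to_ascii(self, posix_class: str) -> bool:
        """True if the named POSIX class is held to ASCII in this model."""
        if self.ascii_posix:
            return True
        return self.ascii_digit and posix_class in ("digit", "xdigit")


opts = AsciiOptions()
opts.apply("(?aT)")
print(opts.restricted_to_ascii("digit"), opts.restricted_to_ascii("alpha"))
opts.apply("(?-aP)")
print(opts.restricted_to_ascii("digit"))
```

In this model, unsetting with (?-aP) clears the digit restriction that (?aT) had set, which mirrors the stated reason for including PCRE2_EXTRA_ASCII_DIGIT in (?aP).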