
Commit 794b42f

gh-95555: Support Unicode property escapes \p{...} in regular expressions (GH-151969)
Add support for \p{property} and \P{property} escapes in Unicode (str) regular expressions, for the properties the engine can resolve without the unicodedata database. They are matched as CATEGORY opcodes or as fixed sets of character ranges.

Supported in this change: many General_Category values (the groups L, N, Z, C and the values Lu, Lt, Lm, Nd, Nl, No, Zs, Zl, Zp, Cc, Cf, Cs, Co and Cn); the binary properties Alphabetic, Lowercase, Uppercase, Numeric, Printable, XID_Start, XID_Continue, Cased and Case_Ignorable; the POSIX compatibility classes; the code-point classes ASCII, Any, Assigned, Noncharacter_Code_Point, Join_Control, Pattern_Syntax and Pattern_White_Space; and Regional_Indicator, ASCII_Hex_Digit and Hex_Digit.

Property and value names use loose matching (UAX #44 UAX44-LM3), so a property may be spelled \p{Lu}, \p{gc=Lu} or \p{name=yes}.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
1 parent 908f438 commit 794b42f
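
Reviewer note: the loose matching named in the commit message can be pictured with a small sketch. The helper names `loose_key` and `split_property` are hypothetical and are not taken from this commit. The sketch folds only what the documentation change lists (case, whitespace, `-` and `_`) and does not model the rest of UAX44-LM3, such as the optional `is` prefix.

```python
from typing import Optional, Tuple


def loose_key(name: str) -> str:
    """Fold a property or value name for loose comparison.

    Hypothetical helper: drops whitespace, '-' and '_' and ignores case.
    """
    kept = (ch for ch in name if not ch.isspace() and ch not in "-_")
    return "".join(kept).casefold()


def split_property(spec: str) -> Tuple[Optional[str], str]:
    """Split the text between the braces of \\p{...}.

    'gc=Lu' gives ('gc', 'lu'); a bare 'Lu' gives (None, 'lu').
    """
    name, sep, value = spec.partition("=")
    if not sep:
        return None, loose_key(name)
    return loose_key(name), loose_key(value)
```

Under this folding, `General_Category`, `general category` and `GENERAL-CATEGORY` all compare equal, which is the behaviour the commit message describes for `\p{Lu}` versus `\p{gc=Lu}`.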

10 files changed

Lines changed: 854 additions & 7 deletions


Doc/library/re.rst

Lines changed: 46 additions & 1 deletion
@@ -613,7 +613,7 @@ character ``'$'``.

       Matches ``[0-9]`` if the :py:const:`~re.ASCII` flag is used.

-      __ https://www.unicode.org/versions/Unicode15.0.0/ch04.pdf#G134153
+      __ https://www.unicode.org/versions/Unicode17.0.0/core-spec/chapter-4/#G124142

    For 8-bit (bytes) patterns:
       Matches any decimal digit in the ASCII character set;
@@ -680,6 +680,51 @@ character ``'$'``.
    matches characters which are neither alphanumeric in the current locale
    nor the underscore.

+.. index:: single: \p; in regular expressions
+           single: \P; in regular expressions
+
+``\p{property=value}``, ``\p{value}``
+   Matches any character with the given Unicode property
+   (see `Unicode Technical Standard #18
+   <https://unicode.org/reports/tr18/>`_, requirement RL1.2 "Properties").
+   Property and value names are matched loosely:
+   case, whitespace, ``'-'`` and ``'_'`` are ignored.
+   The following properties are supported:
+
+   * The ``General_Category`` property (short name ``gc``),
+     spelled ``\p{Lu}``, ``\p{gc=Lu}`` or, for a one-letter group, ``\p{L}``.
+     The supported values are the groups ``L``, ``N``, ``Z`` and ``C`` and the
+     values ``Lu``, ``Lt``, ``Lm``, ``Nd``, ``Nl``, ``No``, ``Zs``, ``Zl``,
+     ``Zp``, ``Cc``, ``Cf``, ``Cs``, ``Co`` and ``Cn``.
+   * The binary properties ``XID_Start``, ``XID_Continue``, ``Alphabetic``,
+     ``Lowercase``, ``Uppercase``, ``Numeric``, ``Printable``, ``Cased`` and
+     ``Case_Ignorable``. A binary property may also be spelled
+     ``\p{name=yes}`` or ``\p{name=no}``.
+   * The POSIX compatibility classes ``alpha``, ``alnum``, ``blank``,
+     ``cntrl``, ``digit``, ``graph``, ``lower``, ``print``, ``space``,
+     ``upper``, ``word`` and ``xdigit``.
+   * The properties ``ASCII``, ``Any``, ``Assigned``,
+     ``Noncharacter_Code_Point``, ``Join_Control``, ``Regional_Indicator``,
+     ``ASCII_Hex_Digit``, ``Hex_Digit``, ``Pattern_Syntax`` and
+     ``Pattern_White_Space``.
+
+   Where a supported property corresponds to a :mod:`unicodedata` accessor or
+   :class:`str` method, the set of characters it matches is exactly the one
+   they report. For consistency with these, ``space`` follows
+   :py:meth:`str.isspace` (like ``\s``) and ``xdigit`` matches only the ASCII
+   hexadecimal digits.
+
+   This is only recognized in Unicode (str) patterns.
+   In bytes patterns it is an error.
+
+   .. versionadded:: next
+
+``\P{...}``
+   Matches any character which does *not* have the given Unicode property.
+   This is the opposite of ``\p``.
+
+   .. versionadded:: next
+
 .. index:: single: \z; in regular expressions
            single: \Z; in regular expressions
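
Reviewer note: the documentation above ties the matched sets to existing `str` methods and `unicodedata` accessors. A minimal sketch of how such reference sets could be built for comparison follows. The helper `reference_set` and its `limit` parameter are hypothetical, and the pairing of `\p{Lu}` with `unicodedata.category()` is an assumption; only the `space` and `xdigit` behaviour is stated in the text.

```python
import string
import unicodedata


def reference_set(predicate, limit=0x3000):
    """Characters below `limit` for which `predicate` holds (hypothetical helper)."""
    return {chr(cp) for cp in range(limit) if predicate(chr(cp))}


# Stated in the documentation: ``space`` follows str.isspace().
space = reference_set(str.isspace)

# Stated in the documentation: ``xdigit`` is only the ASCII hexadecimal digits.
xdigit = set(string.hexdigits)

# Assumption: \p{Lu} corresponds to unicodedata.category(ch) == "Lu".
lu = reference_set(lambda ch: unicodedata.category(ch) == "Lu")
```

Sets like these could serve as the expected side of a test once a build with `\p{...}` support is available; the sketch itself does not use the new syntax.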

Doc/whatsnew/3.16.rst

Lines changed: 7 additions & 0 deletions
@@ -192,6 +192,13 @@ re
   matches an ASCII lowercase consonant.
   (Contributed by Serhiy Storchaka in :gh:`152100`.)

+* Regular expressions now support Unicode property escapes ``\p{...}`` and
+  ``\P{...}``, which match a character by a Unicode property -- for example
+  ``\p{Lu}`` (an uppercase letter), ``\p{Cased}`` or ``\p{ASCII}``. See
+  :ref:`the regular expression syntax <re-syntax>` for the supported
+  properties.
+  (Contributed by Serhiy Storchaka in :gh:`95555`.)
+

 shlex
 -----

Lib/re/_constants.py

Lines changed: 63 additions & 1 deletion
@@ -13,7 +13,7 @@

 # update when constants are added or removed

-MAGIC = 20230612
+MAGIC = 20260622

 from _sre import MAXREPEAT, MAXGROUPS  # noqa: F401

@@ -150,6 +150,35 @@ def _makecodes(*names):
     'CATEGORY_UNI_SPACE', 'CATEGORY_UNI_NOT_SPACE',
     'CATEGORY_UNI_WORD', 'CATEGORY_UNI_NOT_WORD',
     'CATEGORY_UNI_LINEBREAK', 'CATEGORY_UNI_NOT_LINEBREAK',
+
+    # Unicode property categories. These are not affected by the ASCII,
+    # LOCALE or UNICODE flags.
+    'CATEGORY_ALPHA', 'CATEGORY_NOT_ALPHA',
+    'CATEGORY_LOWER', 'CATEGORY_NOT_LOWER',
+    'CATEGORY_UPPER', 'CATEGORY_NOT_UPPER',
+    'CATEGORY_NUMERIC', 'CATEGORY_NOT_NUMERIC',
+    'CATEGORY_PRINTABLE', 'CATEGORY_NOT_PRINTABLE',
+    'CATEGORY_ALNUM', 'CATEGORY_NOT_ALNUM',
+    'CATEGORY_XID_START', 'CATEGORY_NOT_XID_START',
+    'CATEGORY_XID_CONTINUE', 'CATEGORY_NOT_XID_CONTINUE',
+    'CATEGORY_TITLE', 'CATEGORY_NOT_TITLE',
+    'CATEGORY_CASED', 'CATEGORY_NOT_CASED',
+    'CATEGORY_CASE_IGNORABLE', 'CATEGORY_NOT_CASE_IGNORABLE',
+    # Compound categories: Lu = uppercase letter, N = number.
+    'CATEGORY_LU', 'CATEGORY_NOT_LU',
+    'CATEGORY_N', 'CATEGORY_NOT_N',
+    'CATEGORY_LM', 'CATEGORY_NOT_LM',
+    'CATEGORY_NL', 'CATEGORY_NOT_NL',
+    'CATEGORY_NO', 'CATEGORY_NOT_NO',
+    'CATEGORY_CF', 'CATEGORY_NOT_CF',
+    'CATEGORY_Z', 'CATEGORY_NOT_Z',
+    'CATEGORY_ZS', 'CATEGORY_NOT_ZS',
+    'CATEGORY_C', 'CATEGORY_NOT_C',
+    'CATEGORY_CN', 'CATEGORY_NOT_CN',
+    'CATEGORY_ASSIGNED', 'CATEGORY_NOT_ASSIGNED',
+    'CATEGORY_BLANK', 'CATEGORY_NOT_BLANK',
+    'CATEGORY_GRAPH', 'CATEGORY_NOT_GRAPH',
+    'CATEGORY_PRINT', 'CATEGORY_NOT_PRINT',
 )


@@ -206,6 +235,39 @@ def _makecodes(*names):
     CATEGORY_NOT_LINEBREAK: CATEGORY_UNI_NOT_LINEBREAK
 }

+# The Unicode property categories are the same regardless of the flags.
+CH_PROPERTY = (
+    CATEGORY_ALPHA, CATEGORY_NOT_ALPHA,
+    CATEGORY_LOWER, CATEGORY_NOT_LOWER,
+    CATEGORY_UPPER, CATEGORY_NOT_UPPER,
+    CATEGORY_NUMERIC, CATEGORY_NOT_NUMERIC,
+    CATEGORY_PRINTABLE, CATEGORY_NOT_PRINTABLE,
+    CATEGORY_ALNUM, CATEGORY_NOT_ALNUM,
+    CATEGORY_XID_START, CATEGORY_NOT_XID_START,
+    CATEGORY_XID_CONTINUE, CATEGORY_NOT_XID_CONTINUE,
+    CATEGORY_TITLE, CATEGORY_NOT_TITLE,
+    CATEGORY_CASED, CATEGORY_NOT_CASED,
+    CATEGORY_CASE_IGNORABLE, CATEGORY_NOT_CASE_IGNORABLE,
+    CATEGORY_LU, CATEGORY_NOT_LU,
+    CATEGORY_N, CATEGORY_NOT_N,
+    CATEGORY_LM, CATEGORY_NOT_LM,
+    CATEGORY_NL, CATEGORY_NOT_NL,
+    CATEGORY_NO, CATEGORY_NOT_NO,
+    CATEGORY_CF, CATEGORY_NOT_CF,
+    CATEGORY_Z, CATEGORY_NOT_Z,
+    CATEGORY_ZS, CATEGORY_NOT_ZS,
+    CATEGORY_C, CATEGORY_NOT_C,
+    CATEGORY_CN, CATEGORY_NOT_CN,
+    CATEGORY_ASSIGNED, CATEGORY_NOT_ASSIGNED,
+    CATEGORY_BLANK, CATEGORY_NOT_BLANK,
+    CATEGORY_GRAPH, CATEGORY_NOT_GRAPH,
+    CATEGORY_PRINT, CATEGORY_NOT_PRINT,
+)
+for _cat in CH_PROPERTY:
+    CH_LOCALE[_cat] = _cat
+    CH_UNICODE[_cat] = _cat
+del _cat
+
 CH_NEGATE = dict(zip(CHCODES[::2] + CHCODES[1::2], CHCODES[1::2] + CHCODES[::2]))

 # flags
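
Reviewer note: the two mechanisms in this hunk can be pictured with a reduced sketch. The property categories map to themselves in the flag tables, so no flag changes them, and `CH_NEGATE` pairs each code with its neighbour, which only works if every category is declared next to its `NOT_` counterpart. Plain strings stand in for the opcode constants here, and the shortened tables are illustrative, not the module's real contents.

```python
# Stand-ins for the opcode constants; the real module creates them
# with _makecodes() as named integer constants.
CHCODES = [
    "CATEGORY_DIGIT", "CATEGORY_NOT_DIGIT",
    "CATEGORY_UNI_DIGIT", "CATEGORY_UNI_NOT_DIGIT",
    "CATEGORY_LU", "CATEGORY_NOT_LU",
]

# Flag-dependent remapping, as for \d under the UNICODE flag.
CH_UNICODE = {
    "CATEGORY_DIGIT": "CATEGORY_UNI_DIGIT",
    "CATEGORY_NOT_DIGIT": "CATEGORY_UNI_NOT_DIGIT",
}

# Property categories map to themselves, so the flag lookup is a no-op.
CH_PROPERTY = ("CATEGORY_LU", "CATEGORY_NOT_LU")
for cat in CH_PROPERTY:
    CH_UNICODE[cat] = cat

# Even-indexed codes pair with the odd-indexed code that follows them,
# in both directions.
CH_NEGATE = dict(zip(CHCODES[::2] + CHCODES[1::2],
                     CHCODES[1::2] + CHCODES[::2]))
```

Because the pairing is positional, adding a category without its negated twin directly after it would shift every later pair; that appears to be why the new names are appended strictly in twos.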

Lib/re/_parser.py

Lines changed: 29 additions & 3 deletions
@@ -310,6 +310,22 @@ def checkgroupname(self, name, offset):
             msg = "bad character in group name %r" % name
             raise self.error(msg, len(name) + offset)

+def _property_escape(source, escape, in_set=False):
+    # handle \p{...} and \P{...} (UTS #18 1.2.4, "Property Syntax")
+    from . import _properties
+    if not source.match('{'):
+        raise source.error("missing {, expected property name")
+    name = source.getuntil('}', 'property name')
+    code = _properties.parse_property(name, escape[1] == 'P')
+    if code is None:
+        raise source.error("unknown property name %r" % name,
+                           len(name) + len(r'\p{}'))
+    if in_set and code[1][0] == (NEGATE, None):
+        # A negated multi-range property cannot be a member of a set.
+        raise source.error("bad escape %s in character class" % escape,
+                           len(name) + len(r'\p{}'))
+    return code
+
 def _class_escape(source, escape):
     # handle escape code inside character class
     code = ESCAPES.get(escape)
@@ -352,6 +368,8 @@ def _class_escape(source, escape):
                 raise source.error("undefined character name %r" % charname,
                                    len(charname) + len(r'\N{}')) from None
             return LITERAL, c
+        elif c in "pP" and source.istext:
+            return _property_escape(source, escape, in_set=True)
         elif c in OCTDIGITS:
             # octal escape (up to three digits)
             escape += source.getwhile(2, OCTDIGITS)
@@ -412,6 +430,8 @@ def _escape(source, escape, state):
                 raise source.error("undefined character name %r" % charname,
                                    len(charname) + len(r'\N{}')) from None
             return LITERAL, c
+        elif c in "pP" and source.istext:
+            return _property_escape(source, escape)
         elif c == "0":
             # octal escape
             escape += source.getwhile(2, OCTDIGITS)
@@ -566,6 +586,12 @@ def _parse_operand(source, state, nested, here, allow_nested):
     sourcematch = source.match
     set = []
     setappend = set.append
+    def addmember(code):
+        # Flatten a \p{...} property's IN into the member set.
+        if code[0] is IN:
+            set.extend(code[1])
+        else:
+            setappend(code)
     compound = None  # elements of a standalone nested-set operand
     if allow_nested and sourcematch("["):
         # A nested set after an operator is the whole operand, used as-is (not
@@ -608,13 +634,13 @@ def _parse_operand(source, state, nested, here, allow_nested):
                                    source.tell() - here)
             if that == "]":
                 # A trailing '-' is a literal.
-                setappend(code1)
+                addmember(code1)
                 setappend((LITERAL, _ord("-")))
                 return [_charset_node(_uniq(set))], None
             if that == "-":
                 # 'X--': difference, not a range. '--' after a single member
                 # lands here because the range probe consumed the first '-'.
-                setappend(code1)
+                addmember(code1)
                 return [_charset_node(_uniq(set))], "--"
             if that[0] == "\\":
                 code2 = _class_escape(source, that)
@@ -630,7 +656,7 @@ def _parse_operand(source, state, nested, here, allow_nested):
                 raise source.error(msg, len(this) + 1 + len(that))
             setappend((RANGE, (lo, hi)))
         else:
-            setappend(code1)
+            addmember(code1)

 def _complement(elements, state):
     # The complement of `elements` (a single matcher, or a set operation as a
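
Reviewer note: the `addmember` helper introduced in `_parse_operand` can be exercised in isolation. The sketch below uses string stand-ins for the opcodes and a hypothetical factory `make_addmember`; the `(IN, [...])` shape of a property node is inferred from the `code[0] is IN` check in the diff, and the three ranges chosen for `ASCII_Hex_Digit` are an illustration rather than the output of `_properties.parse_property`.

```python
# Stand-ins for the sre opcode constants.
IN, LITERAL, RANGE = "IN", "LITERAL", "RANGE"


def make_addmember(members):
    """Return an addmember() bound to `members` (hypothetical factory)."""
    def addmember(code):
        # A property escape arrives as (IN, [items]); splice its items in
        # so the enclosing set stays a flat list of members.
        if code[0] is IN:
            members.extend(code[1])
        else:
            members.append(code)
    return addmember


# Assumed node for \p{ASCII_Hex_Digit}: digits, A-F and a-f as ranges.
hex_digit = (IN, [(RANGE, (0x30, 0x39)),
                  (RANGE, (0x41, 0x46)),
                  (RANGE, (0x61, 0x66))])

members = []
addmember = make_addmember(members)
addmember((LITERAL, ord("_")))  # an ordinary member is appended as-is
addmember(hex_digit)            # a property is flattened into the set
```

After these calls `members` holds the literal followed by the three ranges, which is the flat form a set such as `[_\p{ASCII_Hex_Digit}]` would need.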
