uniprop and friends are buggy, inconsistent, and potentially replaceable #437

Open
@ab5tract

The Problem

Raku currently very lightly wraps internal NQP and VM operations for:

  • getting Unicode properties (uniprop)
  • getting Unicode names (uniname / uninames)
  • matching the first-or-only character of a string to a property or a property/property category pair (unimatch)
  • producing Unicode characters from their names (uniparse)
  • producing numerical values from Unicode numbers (unival)

As currently implemented, these routines -- methods and subs alike -- do not perform their functions as well as they could.

Issues

  • uniprop returns 0 if the given property is not found. A Failure or Exception would serve better here, or at the very least an empty string, so that the return type is consistent.

  • 'Properties' and 'Property Categories' are served via the same mechanism, even though they are only interchangeable from category to property and not the other way around.

    "w".unimatch("Latin", "Script") ==> say() # True
    "w".unimatch("Script", "Latin") ==> say() # False
    
    "w".uniprop("Script") ==> say() # Latin
    "w".uniprop("Latin")  ==> say() # Latin
    "w".uniprop           ==> say() # Ll      
    
  • All of the methods which take properties could fail at compile time when given a literal property name that is known not to exist (i.e., one not stored in a variable).

  • Properties are returned from NQP with spaces separating words, but only underscore-separated property names are accepted going the other direction.

    "e".uniprop("Block")        # "Basic Latin"
    "e".uniprop("Basic Latin")  # 0
    "e".uniprop("Basic_Latin")  # "Basic Latin"
    
    use nqp;
    my $a = nqp::unipropcode("Basic Latin");
    my $b = nqp::unipropcode("Basic_Latin");
    $a, $b, $a == $b ==> say()  # (0 6 False)
    
  • The current implementation is not thread-safe (see "Unicode property methods/subs are not thread-safe", rakudo/rakudo#4871). See also, at the MoarVM level, "Unicode ops are critically thread-dangerous" (MoarVM/MoarVM#1717).

  • There are separate methods for "first character" (uniprop, uniname) versus "full string" (uniprops, uninames). It feels far too Perl-ish at this point to accept a multi-character string as an argument but operate on only a single character.

    • Wouldn't it be reasonable to expect uniprops to return a comprehensive list of applicable properties for a single character? Currently there is no way (that I am aware of) to get such a comprehensive list.
  • (Nit-level deficiency) smashedtogetherlowercase is an ugly (at worst) or a non-conformant (at best) naming scheme.
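
To make the first and fourth points concrete, here is a minimal sketch of a stricter lookup. The name strict-uniprop is hypothetical and not part of Rakudo. The sketch assumes that nqp::unipropcode returns 0 for any name it does not recognise, as in the example above, and it only illustrates the desired behaviour (a Failure instead of 0, and spaces accepted alongside underscores); it does nothing about thread safety.

```raku
use nqp;

# Hypothetical helper for illustration only; not part of Rakudo.
sub strict-uniprop(Str:D $char, Str:D $property) {
    # Accept both the "Basic Latin" and the "Basic_Latin" spelling.
    my $name = $property.subst(' ', '_', :g);

    # Assumption: nqp::unipropcode returns 0 for names it does not know,
    # as shown in the example above.
    fail "Unknown Unicode property: '$property'"
        if nqp::unipropcode($name) == 0;

    $char.uniprop($name)
}

# Both spellings now resolve to the same lookup.
say strict-uniprop("e", "Basic Latin");
say strict-uniprop("e", "Basic_Latin");

# An unknown name yields a Failure rather than 0.
my $result = strict-uniprop("e", "No_Such_Property");
say $result.defined ?? $result !! "no such property";
```

Returning a Failure rather than throwing immediately leaves the caller free to test the result with .defined or to let it explode on use, which keeps the return type consistent without forcing a try block on every call.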

The Solution

The instructions for problem-solving state I'm not meant to spend any space on this in the initial post.

So for now, I'll just mention that -- at the HLL level -- I think we could achieve all current functionality as well as much more out of even a single method that utilizes adverbs.

Labels: unicode (Unicode and encoding/decoding)