Skip to content

[CDRIVER-5983] Refactor String Handling Around URI Parsing #2047

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 38 commits into
base: master
Choose a base branch
from

Conversation

vector-of-bool
Copy link
Contributor

@vector-of-bool vector-of-bool commented Jun 26, 2025

Background

Refer: CDRIVER-5983

This PR is a subset of the changes that were original slated for CDRIVER-5983. The URI parameter refactor turned out to be significantly more impractical than initially though, so the changes related to parameter handling have been deferred until the relevant spec documents can consolidate on a cohesive behavior.

Regardless, some of the changes from that work can still be extracted and is useful in general throughout the codebase, specifically related to string handling and parsing.

This PR introduces minor behavioral changes surrounding some edge cases related to parsing.

While initially the commit history was going to be clean for the overall changes, the partial rollback around parameter handling means that commits themselves are not entirely useful in isolation.

Change Summary

String Views

This PR introduces the mlib/str.h header. This header has been re-introduced several times in libmongocrypt and amongoc, and this is the latest incarnation, which I believe is the most solid as it can build upon existing work regarding safe arithmetic and integer utilities.

  • mstr_view is a "string view" type. It consists of a const char* data pointer and a size_t len, wheren len refers to the length of the pointed-to string view.

  • Importantly, the data in mstr_view NOT necessarily null terminated! This means that mstr_view requires new string handling routines that respect the length.

  • Because we can pass slices of strings, this greatly reduces the need to strndup every time we want to take a string slice to pass to another function, further reducing the occurrences of memory leaking, goto fail, and buffer overruns. Code using mtsr_view is almost always far simpler than trying to juggle C strings.

  • There are several string algorithms: mstr_find, mstr_find_first_of, mlib_substr, mstr_contains, which inspect and manipulate string views.

  • mstr_cmp is a string comparison function like mlib_cmp, which supports natural syntax like mstr_cmp(a, !=, b)

  • mstr_split_at(S, Pos, [Drop,] *Pfx, *Sfx) is the most useful algorithm. It takes an index and optional skip parameter, and produces two new string views from an input.

  • mstr_split_around(S, Infix, *Pfx, *Sfx) -> bool builds upon mstr_split_at, and splits the string S around the first occurrence of Infix, returning true if it found Infix. This one API produces the greatest complexity reduction of all. For parsing a CSV string:

    mstr_view remain = csv;
    while (remain.len) {
      mstr_view entry;
      mstr_split_around(remain, mlib_cstring(","), &entry, &remain);
      // Handle `entry`
    }
  • mstr_cstring creates an mstr_view from a C string. This is unfortunately necessary for calling a lot of APIs, but may be unnecessary in the future (see the note on _Generic in mlib/str.h)

  • All algorithms that accept a string position index also accept a negative value to index from the end of the string. They also use upsize_integer in order to inhibit any integer conversion warnings. No need to worry about passing the right type to these functions, it will just Do the Right Thing without any implicit conversions.

  • To printf-format a sized string, one must use the %.*s printf-specifier. This expect an int to specify the string length, followed by the string pointer. Because this is common, the macro MSTR_FMT(X) will expand to the two arguments required for the %.*s specifier.

Case Sensitivity

  • mlib_latin_tolower: Transforms a Basic Latin codepoint to a lowercase letter if it is an uppercase letter. Unlike standard tolower, this function is total in 32-bit integers, is very clear about what "lowercasing" it does, is not sensitive to the locale, and has no undefined behavior.
  • mlib_latin_charcasecmp: Compare two codepoints for equivalence with case-insensitivity in Basic Latin. Uses mlib_latin_tolower.
  • mstr_latin_casecmp: Like mstr_cmp, but case-insensitive using mlib_latin_charcasecmp.

bson_error_t

bson_error_clear

This function takes a bson_error_t* and resets it to a non-error value if the pointer is not null.

bson_error_reset

The new function-like macro bson_error_reset is defined. It is equivalent to bson_error_clear and takes an l-value to a pointer to bson_error_t. If the pointer is null, it is made to point to an anonymous bson_error_t in the local scope, allowing the function to rely on the present of an error object, even if the caller didn't pass one:

bool foo(bson_error_t* err) {
  bson_error_reset(err);  // Clears any "err"
  // `err` is now guaranteed to be non-null
}

This feature relies on C99 compound literals, so this function-like macro cannot be used in C++ 😢.

bson/mongoc_set_error

The set_error functions that do a string-format into the error message now support passing the error message of the object as an input parameter to the format string itself:

bson_set_error(&error, Cat, Code, "Got an error while creating a widget: %s", error.message);

Previously, this would garble the content of error.message. Now, the code uses a temporary buffer to format the message, then copies it over the error.message.

Integer Parsing

Because we are now using sized strings and not C strings, we need new integer parsing, because strtoll and friends expect a null terminator. Additionally, we can do better than strtoll:

  • mlib_i64_parse(Str, Base, *Out) -> int parses a 64-bit signed integer in Str. It differs from strtoll in a few important ways:
    • It does not skip leading whitespace. That's an error.
    • It doesn't immediately stop on a non-digit. That's an error.
    • Doesn't silently reject empty strings. That's an error.
    • It doesn't require a null terminator, of course.
  • mlib_i32_parse(Str, Base, *Out) -> int int32_t version of mlib_i64_parse
  • mlib_nat64_parse(Str, Base, *Out) -> int like mlib_i64_parse, but does not allow sign or base prefixes. Base must be non-zero. The input must be a string of only digits. This is the core of integer parsing.

Previousy, our code using strtoll was silently relying on whitespace trimming and had clunky or incomplete detecting whether strtoll stopped early. These parsing APIs are much more strict.

The implementation may be less efficient than the hyper-optimized strtoll, but it's much more easy to use correctly. Optimizations can come later.

Error Messages

Several APIs have been modified to return better error messages, including APIs that can fail with true/false. For example, %-decoding now explains why %-string is invalid, and that will be included when a URI is rejected.

Other Changes

The following other minor changes were made:

  • Passing an empty string to mongoc_uri_set_compressors was previously an error. The new behavior is to clear the compressors on the URI (as if passing a null pointer).
  • The failure strings in test assertions macros were modified so that VSCode generates proper clickable file links in terminal output.
  • Invalid UTF-8 following %-decoding generated a descriptive log warning in addition to a vague error message. This is changed to just generate a descriptive error without logging.

Not Included

Owning Sized-String mstr

There was going to be a type mstr that owned a sized mutable null-terminated string, with its own set of useful "Does the Right Thing" algorithms (append, splice, copy, erase, replace, push, etc.). I have the code for this, but it can wait for a future PR.

UTF-8 Validation

It may be useful to force mstr_view to only contain UTF-8 validation (i.e. you cannot create an mstr_view without checking that you have valid UTF-8), but that's a larger change for the potential future.

Logging/Errors/Warning Mess

The behavior around logging, warning, and erroring in URI handling is a mess. There are cases where we test that an API logs, and sometimes we expect it to be silent and put the error on an error object instead, and sometimes we expect both!

This inconsistency has been mostly retained. In the future, any API that accepts a bson_error_t* probably shouldn't log anything.

This commit adds `mlib_str_view` and utlities for manipulating sized
string views.
- This change adds new functions to *unset* certain attributes on
  `mongoc_write_concern_t` objects, where previously it was only possible to set
  the value, but there was no way to clear the value.
- This also changes the definition of `is_default()` to consider the values of
  the relevant attributes directly, rather than having an `is_default` field
  that gets set upon any modification. This allows a write concern to re-enter
  the default state after it has been modified.
- The `wtimeout` field of the struct was renamed to `_wtimeout` to
  emphasize that it is private and that no one should modify it directly.
  There were test cases to check what happens if `wtimeout` is directly
  assigned an invalid (negative) value, but this is not actually possible
  to happen from the external API, because the `wtimeout` setter already
  guards against this case. The test cases for this situation have been
  removed. The rename of `_wtimeout` should discourange any code from
  attempting to modify this in a way that could bypass the validation
  of the setter.
This greatly reduces the number of allocated strings
to manage, and simplifies parsing operations.

- Split the URI into all components before trying to apply any of those
  to the URI object.
- Update several internal APIs to pass sized string views, reducing the
  requirement to pass null-terminated strings, and reduing the number
  of redundant `strlen` calls.
- `mstr_split_at/around` greatly simplifies parsing of delimited strings.
- Put case normalization at a lower level to reduce need to case-fold
  strings which necessitates a strdup. Instead, use case-insensitive
  compares in more locations.
- Behavior change: Setting compressors to empty string `""` now clears
  compressors rather than it being an error.
- %-decoding now indicates the position and kind
  of error to the caller.
- %-decoding doesn't use sscanf
- %-decoding allocates the full string up-front
- Error messages related to %-decoding now explain
  the problem.
- Use new sized integer parsing functions.
Allow passing `error.message` as an input to
`_mongoc_set_error` by storing the temporary
format output in a temporary buffer, then
copying that over the `error.message`
This changes host parsing to use sized strings, and
adds more specific error messages in case of parse failures.
@vector-of-bool vector-of-bool marked this pull request as ready for review June 30, 2025 15:01
@vector-of-bool vector-of-bool requested a review from a team as a code owner June 30, 2025 15:01
@vector-of-bool vector-of-bool requested review from rcsanchez97 and kevinAlbs and removed request for rcsanchez97 June 30, 2025 15:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant