[CDRIVER-5983] Refactor String Handling Around URI Parsing #2047

vector-of-bool · 2025-06-26T22:08:08Z

Background

This PR is a subset of the changes that were originally slated for CDRIVER-5983. The URI parameter refactor turned out to be significantly more impractical than initially thought, so the changes related to parameter handling have been deferred until the relevant spec documents can consolidate on a cohesive behavior.

Regardless, some of the changes from that work can still be extracted and are useful in general throughout the codebase, specifically those related to string handling and parsing.

This PR introduces minor behavioral changes surrounding some edge cases related to parsing.

While initially the commit history was going to be clean for the overall changes, the partial rollback around parameter handling means that commits themselves are not entirely useful in isolation.

Change Summary

String Views

This PR introduces the mlib/str.h header. This header has been re-introduced several times in libmongocrypt and amongoc, and this is the latest incarnation, which I believe is the most solid as it can build upon existing work regarding safe arithmetic and integer utilities.

mstr_view is a "string view" type. It consists of a const char* data pointer and a size_t len, where len refers to the length of the pointed-to string view.
Importantly, the data in mstr_view is NOT necessarily null terminated! This means that mstr_view requires new string handling routines that respect the length.
Because we can pass slices of strings, this greatly reduces the need to strndup every time we want to take a string slice to pass to another function, further reducing the occurrences of memory leaks, goto fail, and buffer overruns. Code using mstr_view is almost always far simpler than trying to juggle C strings.
There are several string algorithms: mstr_find, mstr_find_first_of, mlib_substr, mstr_contains, which inspect and manipulate string views.
mstr_cmp is a string comparison function like mlib_cmp, which supports natural syntax like mstr_cmp(a, !=, b)
mstr_split_at(S, Pos, [Drop,] *Pfx, *Sfx) is the most useful algorithm. It takes an index and optional skip parameter, and produces two new string views from an input.
mstr_split_around(S, Infix, *Pfx, *Sfx) -> bool builds upon mstr_split_at, and splits the string S around the first occurrence of Infix, returning true if it found Infix. This one API produces the greatest complexity reduction of all. For parsing a CSV string:
```
mstr_view remain = csv;
while (remain.len) {
  mstr_view entry;
  mstr_split_around(remain, mstr_cstring(","), &entry, &remain);
  // Handle `entry`
}
```
mstr_cstring creates an mstr_view from a C string. This is unfortunately necessary for calling a lot of APIs, but may be unnecessary in the future (see the note on _Generic in mlib/str.h)
All algorithms that accept a string position index also accept a negative value to index from the end of the string. They also use upsize_integer in order to inhibit any integer conversion warnings. There is no need to worry about passing the right type to these functions: they will just Do the Right Thing without any implicit conversions.
To printf-format a sized string, one must use the %.*s printf-specifier. This expects an int to specify the string length, followed by the string pointer. Because this is common, the macro MSTR_FMT(X) will expand to the two arguments required for the %.*s specifier.
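
To make the split-and-print pattern above concrete, here is a minimal, self-contained sketch. The names `sv`, `sv_from_cstr`, `sv_split_around`, `SV_FMT`, and `bracket_fields` are hypothetical stand-ins written for this illustration, not the mlib/str.h API, and the behavior on a missing infix (whole input becomes the prefix, suffix becomes empty) is an assumption chosen so that the CSV loop terminates.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

// Hypothetical stand-in for a string view: pointer + length, no terminator needed.
typedef struct sv {
   const char *data;
   size_t len;
} sv;

// Expands to the two arguments that "%.*s" expects (assumes len fits in an int).
#define SV_FMT(S) (int)(S).len, (S).data

static sv sv_from_cstr(const char *s)
{
   assert(s != NULL);
   return (sv){s, strlen(s)};
}

// Split `s` around the first occurrence of `infix`. If `infix` is absent,
// *pfx receives all of `s` and *sfx becomes empty (assumed for this sketch).
static bool sv_split_around(sv s, sv infix, sv *pfx, sv *sfx)
{
   assert(pfx != NULL && sfx != NULL);
   for (size_t i = 0; i + infix.len <= s.len; ++i) {
      if (memcmp(s.data + i, infix.data, infix.len) == 0) {
         *pfx = (sv){s.data, i};
         *sfx = (sv){s.data + i + infix.len, s.len - i - infix.len};
         return true;
      }
   }
   *pfx = s;
   *sfx = (sv){s.data + s.len, 0};
   return false;
}

// Write every comma-separated field of `csv` into `out` as "[field]".
// Returns the number of fields, or -1 if `out` is too small.
static int bracket_fields(const char *csv, char *out, size_t out_size)
{
   assert(out != NULL && out_size > 0);
   sv remain = sv_from_cstr(csv);
   size_t used = 0;
   int count = 0;
   out[0] = '\0';
   while (remain.len) {
      sv entry;
      sv_split_around(remain, sv_from_cstr(","), &entry, &remain);
      int n = snprintf(out + used, out_size - used, "[%.*s]", SV_FMT(entry));
      if (n < 0 || (size_t)n >= out_size - used) {
         return -1;
      }
      used += (size_t)n;
      ++count;
   }
   return count;
}
```

No field is ever copied or null-terminated here: each `entry` is just a window into the original buffer. Note that with this loop shape a trailing comma does not produce a final empty field, because the loop stops as soon as `remain` is empty.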

Case Sensitivity

mlib_latin_tolower: Transforms a Basic Latin codepoint to a lowercase letter if it is an uppercase letter. Unlike standard tolower, this function is total in 32-bit integers, is very clear about what "lowercasing" it does, is not sensitive to the locale, and has no undefined behavior.
mlib_latin_charcasecmp: Compare two codepoints for equivalence with case-insensitivity in Basic Latin. Uses mlib_latin_tolower.
mstr_latin_casecmp: Like mstr_cmp, but case-insensitive using mlib_latin_charcasecmp.
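
As a rough illustration of what locale-independent, total lowercasing looks like, the sketch below lowercases only `A` through `Z` and builds a sized, case-insensitive three-way comparison on top of it. The names `latin_tolower` and `latin_casecmp` and their signatures are hypothetical; the real mlib functions may be shaped differently.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Lowercase only 'A'..'Z'. Every other 32-bit value, including negative
// ones, is returned unchanged, so there is no input with undefined behavior.
static int32_t latin_tolower(int32_t c)
{
   return (c >= 'A' && c <= 'Z') ? c + ('a' - 'A') : c;
}

// Three-way compare of two sized strings, ignoring Basic Latin case.
// Returns a negative value, zero, or a positive value.
static int latin_casecmp(const char *a, size_t alen, const char *b, size_t blen)
{
   assert(a != NULL && b != NULL);
   size_t n = alen < blen ? alen : blen;
   for (size_t i = 0; i < n; ++i) {
      int32_t x = latin_tolower((unsigned char)a[i]);
      int32_t y = latin_tolower((unsigned char)b[i]);
      if (x != y) {
         return x < y ? -1 : 1;
      }
   }
   // Equal up to the shorter length: the shorter string sorts first.
   return (alen > blen) - (alen < blen);
}
```

Comparing this way avoids allocating a lowercased copy of either string just to test something like whether an option name equals `ssl`.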

`bson_error_t`

`bson_error_clear`

This function takes a bson_error_t* and resets it to a non-error value if the pointer is not null.

`bson_error_reset`

The new function-like macro bson_error_reset is defined. It is equivalent to bson_error_clear and takes an l-value to a pointer to bson_error_t. If the pointer is null, it is made to point to an anonymous bson_error_t in the local scope, allowing the function to rely on the presence of an error object, even if the caller didn't pass one:

```
bool foo(bson_error_t* err) {
  bson_error_reset(err);  // Clears any "err"
  // `err` is now guaranteed to be non-null
}
```

This feature relies on C99 compound literals, so this function-like macro cannot be used in C++ 😢.
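
For readers unfamiliar with the compound-literal trick, the following is a minimal sketch of how such a macro could be written. `my_error`, `MY_ERROR_RESET`, and `require_nonempty` are hypothetical names invented for this example; this is not the definition of bson_error_reset.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

// Hypothetical error type standing in for bson_error_t.
typedef struct my_error {
   int code;
   char message[64];
} my_error;

// If the pointer l-value is null, repoint it at an anonymous object whose
// lifetime is the enclosing block (a C99 compound literal). Then zero the
// error, whichever object it now refers to.
#define MY_ERROR_RESET(ErrPtr)                                  \
   ((void)((ErrPtr) = (ErrPtr) ? (ErrPtr) : &(my_error){0}),    \
    (void)memset((ErrPtr), 0, sizeof *(ErrPtr)))

static bool require_nonempty(const char *s, my_error *err)
{
   assert(s != NULL);
   MY_ERROR_RESET(err);
   // `err` is non-null from here on, even if the caller passed NULL.
   if (s[0] == '\0') {
      err->code = 1;
      snprintf(err->message, sizeof err->message, "%s", "empty input");
      return false;
   }
   return true;
}
```

The macro must be invoked in the outermost block of the function body: the anonymous object only lives as long as the block in which the compound literal appears.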

`bson/mongoc_set_error`

The set_error functions that do a string-format into the error message now support passing the error message of the object as an input parameter to the format string itself:

```
bson_set_error(&error, Cat, Code, "Got an error while creating a widget: %s", error.message);
```

Previously, this would garble the content of error.message. Now, the code uses a temporary buffer to format the message, then copies it over the error.message.
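
The temporary-buffer approach can be sketched as follows. `demo_error` and `demo_set_error` are hypothetical names with a simplified signature; the sketch only illustrates why formatting into a scratch buffer first makes it safe for the message to appear among its own format arguments.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

// Hypothetical error type standing in for bson_error_t.
typedef struct demo_error {
   int code;
   char message[128];
} demo_error;

static void demo_set_error(demo_error *err, int code, const char *fmt, ...)
{
   assert(fmt != NULL);
   if (!err) {
      return;
   }
   // Format into a scratch buffer first. The variadic arguments may point
   // into err->message, so that buffer must not be written while it is read.
   char tmp[sizeof err->message] = {0};
   va_list args;
   va_start(args, fmt);
   vsnprintf(tmp, sizeof tmp, fmt, args);
   va_end(args);
   err->code = code;
   // Only now overwrite the old message with the finished text.
   memcpy(err->message, tmp, sizeof tmp);
}
```

Writing directly into `err->message` would make the source and destination of the format overlap, which is what produced the garbled output described above.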

Integer Parsing

Because we are now using sized strings and not C strings, we need new integer parsing, because strtoll and friends expect a null terminator. Additionally, we can do better than strtoll:

mlib_i64_parse(Str, Base, *Out) -> int parses a 64-bit signed integer in Str. It differs from strtoll in a few important ways:
- It does not skip leading whitespace. That's an error.
- It doesn't immediately stop on a non-digit. That's an error.
- It doesn't silently treat an empty string as zero. That's an error.
- It doesn't require a null terminator, of course.
mlib_i32_parse(Str, Base, *Out) -> int: the int32_t version of mlib_i64_parse.
mlib_nat64_parse(Str, Base, *Out) -> int: like mlib_i64_parse, but does not allow sign or base prefixes. Base must be non-zero. The input must be a string of only digits. This is the core of integer parsing.

Previously, our code using strtoll was silently relying on whitespace trimming and had clunky or incomplete detection of whether strtoll stopped early. These parsing APIs are much more strict.

The implementation may be less efficient than the hyper-optimized strtoll, but it's much easier to use correctly. Optimizations can come later.
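
The strictness rules above can be illustrated with a small sketch of a sized-string parser. `strict_u64_parse` and `strict_i64_parse` are hypothetical names, the return convention (0 on success, an `errno`-style code on failure) is an assumption, and base-prefix detection is deliberately left out; this is not the mlib implementation.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

// Digits only: no sign, no prefix, no whitespace. Empty input is an error.
static int strict_u64_parse(const char *s, size_t len, unsigned base, uint64_t *out)
{
   assert(base >= 2 && base <= 36);
   if (len == 0) {
      return EINVAL;
   }
   uint64_t acc = 0;
   for (size_t i = 0; i < len; ++i) {
      unsigned char c = (unsigned char)s[i];
      unsigned digit;
      if (c >= '0' && c <= '9') {
         digit = (unsigned)(c - '0');
      } else if (c >= 'a' && c <= 'z') {
         digit = 10u + (unsigned)(c - 'a');
      } else if (c >= 'A' && c <= 'Z') {
         digit = 10u + (unsigned)(c - 'A');
      } else {
         return EINVAL; // Any non-digit, anywhere, rejects the whole input
      }
      if (digit >= base) {
         return EINVAL;
      }
      if (acc > (UINT64_MAX - digit) / base) {
         return ERANGE; // acc * base + digit would overflow
      }
      acc = acc * base + digit;
   }
   *out = acc;
   return 0;
}

// Optional sign followed by digits. The magnitude of INT64_MIN is one more
// than INT64_MAX, so the two signs need different limits.
static int strict_i64_parse(const char *s, size_t len, unsigned base, int64_t *out)
{
   assert(s != NULL && out != NULL);
   int negative = 0;
   if (len > 0 && (s[0] == '-' || s[0] == '+')) {
      negative = (s[0] == '-');
      ++s;
      --len;
   }
   uint64_t mag;
   int err = strict_u64_parse(s, len, base, &mag);
   if (err) {
      return err;
   }
   const uint64_t limit = negative ? (uint64_t)INT64_MAX + 1u : (uint64_t)INT64_MAX;
   if (mag > limit) {
      return ERANGE;
   }
   if (negative) {
      *out = (mag == limit) ? INT64_MIN : -(int64_t)mag;
   } else {
      *out = (int64_t)mag;
   }
   return 0;
}
```

Because the length is explicit, a caller can parse a slice such as the port in `host:27017` without copying it into a null-terminated buffer first.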

Error Messages

Several APIs have been modified to return better error messages, including APIs that previously could only report failure as true/false. For example, %-decoding now explains why a %-encoded string is invalid, and that explanation will be included when a URI is rejected.
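
As an illustration of a decoder that reports what went wrong rather than only that something did, here is a minimal sketch. The `pct_status` enum, `pct_decode`, and its parameters are hypothetical and do not reflect the driver's internal API.

```c
#include <assert.h>
#include <stddef.h>

typedef enum pct_status {
   PCT_OK = 0,
   PCT_TRUNCATED, // '%' with fewer than two characters after it
   PCT_BAD_HEX    // '%' followed by something that is not a hex digit
} pct_status;

static int hex_value(char c)
{
   if (c >= '0' && c <= '9') return c - '0';
   if (c >= 'a' && c <= 'f') return 10 + (c - 'a');
   if (c >= 'A' && c <= 'F') return 10 + (c - 'A');
   return -1;
}

// Decode `len` bytes of `in` into `out`, which must hold at least `len`
// bytes (decoded output is never longer than the input). On failure,
// *err_pos is the offset of the offending '%'.
static pct_status
pct_decode(const char *in, size_t len, char *out, size_t *out_len, size_t *err_pos)
{
   assert(in && out && out_len && err_pos);
   size_t n = 0;
   for (size_t i = 0; i < len; ++i) {
      if (in[i] != '%') {
         out[n++] = in[i];
         continue;
      }
      if (len - i < 3) {
         *err_pos = i;
         return PCT_TRUNCATED;
      }
      int hi = hex_value(in[i + 1]);
      int lo = hex_value(in[i + 2]);
      if (hi < 0 || lo < 0) {
         *err_pos = i;
         return PCT_BAD_HEX;
      }
      out[n++] = (char)(unsigned char)(hi * 16 + lo);
      i += 2; // Skip the two hex digits just consumed
   }
   *out_len = n;
   return PCT_OK;
}
```

A caller can turn the status and position into a message along the lines of "truncated %-sequence at offset 2" and attach it to the error that rejects the URI.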

Other Changes

The following other minor changes were made:

Passing an empty string to mongoc_uri_set_compressors was previously an error. The new behavior is to clear the compressors on the URI (as if passing a null pointer).
The failure strings in test assertion macros were modified so that VSCode generates proper clickable file links in terminal output.
Invalid UTF-8 following %-decoding generated a descriptive log warning in addition to a vague error message. This is changed to just generate a descriptive error without logging.

Not Included

Owning Sized-String `mstr`

There was going to be a type mstr that owned a sized mutable null-terminated string, with its own set of useful "Does the Right Thing" algorithms (append, splice, copy, erase, replace, push, etc.). I have the code for this, but it can wait for a future PR.

UTF-8 Validation

It may be useful to force mstr_view to only contain valid UTF-8 (i.e. you cannot create an mstr_view without checking that you have valid UTF-8), but that's a larger change for the potential future.

Logging/Errors/Warning Mess

The behavior around logging, warning, and erroring in URI handling is a mess. There are cases where we test that an API logs, and sometimes we expect it to be silent and put the error on an error object instead, and sometimes we expect both!

This inconsistency has been mostly retained. In the future, any API that accepts a bson_error_t* probably shouldn't log anything.

- This change adds new functions to *unset* certain attributes on `mongoc_write_concern_t` objects. Previously it was only possible to set a value, with no way to clear it.
- This also changes the definition of `is_default()` to consider the values of the relevant attributes directly, rather than having an `is_default` field that gets set upon any modification. This allows a write concern to re-enter the default state after it has been modified.
- The `wtimeout` field of the struct was renamed to `_wtimeout` to emphasize that it is private and that no one should modify it directly. There were test cases to check what happens if `wtimeout` is directly assigned an invalid (negative) value, but this cannot actually happen through the external API, because the `wtimeout` setter already guards against this case. The test cases for this situation have been removed. The rename to `_wtimeout` should discourage any code from attempting to modify this in a way that could bypass the validation of the setter.

This greatly reduces the number of allocated strings to manage, and simplifies parsing operations.

- Split the URI into all components before trying to apply any of those to the URI object.
- Update several internal APIs to pass sized string views, reducing the requirement to pass null-terminated strings, and reducing the number of redundant `strlen` calls.
- `mstr_split_at/around` greatly simplifies parsing of delimited strings.
- Put case normalization at a lower level to reduce the need to case-fold strings, which necessitates a strdup. Instead, use case-insensitive compares in more locations.
- Behavior change: Setting compressors to empty string `""` now clears compressors rather than it being an error.

- %-decoding now indicates the position and kind of error to the caller.
- %-decoding doesn't use sscanf.
- %-decoding allocates the full string up-front.
- Error messages related to %-decoding now explain the problem.
- Use new sized integer parsing functions.

vector-of-bool added 30 commits June 9, 2025 09:01

mlib/str.h - String utilities

461d468

This commit adds `mlib_str_view` and utilities for manipulating sized string views.

Fix docs builds with older Sphinx 7.1

1f73472

Error utilities for clearing/resetting an error obj

764f439

A more robust integer parsing function

eb549b5

"because" assertions

d353579

Merge branch 'master' into CDRIVER-5983-uri-param-refactor

70ea5ed

str_split_around algorithm

97c35c3

find_first_of algorithm

8013f27

str_contains(_any_of) algo

842dfbf

Rename mlib_str... to mstr...

e7160fb

Support negative indexing

fc6bdd6

Case normalization in mlib/str

11bb078

Integer parsing using sized strings

aba486e

Allow passing an error string to reformat itself

4b96578

Allow passing `error.message` as an input to `_mongoc_set_error` by storing the temporary format output in a temporary buffer, then copying that over the `error.message`

Missing inline spec

3ebe7f8

Unused expr warnings

705680a

misc goofs

9eb3503

Fix missing-init warning

818096e

Tweaked error message for maxstalenessseconds

d0d7e09

uninit warnings

9d71a8f

Tweak logging behavior around URI parse errors

0f67e11

uninit vars

55cab60

Merge branch 'master' into CDRIVER-5983-uri-param-refactor

02657a7

Simplify parsing of host specifiers

058a2f2

This changes host parsing to use sized strings, and adds more specific error messages in case of parse failures.

Use formatting shorthand macro

c894c20

Merge branch 'master' into CDRIVER-5983-uri-param-refactor

7eeeda3

Minor formatting

727e3a7

vector-of-bool added 8 commits June 26, 2025 12:48

[fixup] error message for ipv6 parsing

f0952cc

Remove stray newline in test URI string

5044dd7

Fixup bson_set_error to use a temp string buffer

6f53c7a

Free the wtag on setting it

a5f06e1

Fix integer boundary condition around INT64_MIN

565d167

Minor tweaks and cleanup

7416c4f

Rename mlib_cstring and mstr_substr

eede34c

Revert "Modifications and extensions of write_concern"

fe504f1

This reverts commit ad481cc.

vector-of-bool marked this pull request as ready for review June 30, 2025 15:01

vector-of-bool requested a review from a team as a code owner June 30, 2025 15:01

vector-of-bool requested review from rcsanchez97 and kevinAlbs and removed request for rcsanchez97 June 30, 2025 15:01
