[CDRIVER-5983] Refactor String Handling Around URI Parsing #2047
Open: vector-of-bool wants to merge 38 commits into mongodb:master from vector-of-bool:CDRIVER-5983-uri-param-refactor.1
+1,888
−878
This commit adds `mlib_str_view` and utilities for manipulating sized string views.
- This change adds new functions to *unset* certain attributes on `mongoc_write_concern_t` objects. Previously it was only possible to set a value; there was no way to clear it.
- This also changes the definition of `is_default()` to consider the values of the relevant attributes directly, rather than having an `is_default` field that gets set upon any modification. This allows a write concern to re-enter the default state after it has been modified.
- The `wtimeout` field of the struct was renamed to `_wtimeout` to emphasize that it is private and that no one should modify it directly. There were test cases to check what happens if `wtimeout` is directly assigned an invalid (negative) value, but this cannot actually happen through the external API, because the `wtimeout` setter already guards against this case. The test cases for this situation have been removed. The rename to `_wtimeout` should discourage any code from attempting to modify this in a way that could bypass the validation of the setter.
This greatly reduces the number of allocated strings to manage, and simplifies parsing operations.
- Split the URI into all components before trying to apply any of those to the URI object.
- Update several internal APIs to pass sized string views, reducing the requirement to pass null-terminated strings, and reducing the number of redundant `strlen` calls.
- `mstr_split_at/around` greatly simplifies parsing of delimited strings.
- Put case normalization at a lower level to reduce the need to case-fold strings, which necessitates a strdup. Instead, use case-insensitive compares in more locations.
- Behavior change: Setting compressors to the empty string `""` now clears compressors rather than being an error.
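To illustrate why splitting a sized view simplifies parsing of delimited strings, here is a minimal sketch. It does not use the real `mlib/str.h` API: the `sv` type and the `sv_from_cstr`, `sv_split_around`, and `csv_print_items` functions are hypothetical stand-ins, and the separator is a single character rather than an infix string.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for a sized string view. NOT null-terminated. */
typedef struct sv {
   const char *data;
   size_t len;
} sv;

/* Create a view over a C string: one strlen, no copy. */
static sv sv_from_cstr(const char *s)
{
   assert(s != NULL);
   sv v = {s, strlen(s)};
   return v;
}

/* Split `s` around the first occurrence of `sep`.
 * If found: *pfx is the text before `sep`, *sfx the text after it, returns true.
 * If not found: *pfx is all of `s`, *sfx is empty, returns false. */
static bool sv_split_around(sv s, char sep, sv *pfx, sv *sfx)
{
   const char *hit = s.len ? memchr(s.data, sep, s.len) : NULL;
   if (!hit) {
      *pfx = s;
      sfx->data = s.data + s.len;
      sfx->len = 0;
      return false;
   }
   size_t pos = (size_t)(hit - s.data);
   pfx->data = s.data;
   pfx->len = pos;
   sfx->data = hit + 1;
   sfx->len = s.len - pos - 1;
   return true;
}

/* Print every comma-separated item and return how many there were.
 * No allocation: each item is a pointer and a length into `csv`. */
static size_t csv_print_items(const char *csv)
{
   sv rest = sv_from_cstr(csv);
   size_t count = 0;
   bool more = true;
   while (more) {
      sv item;
      more = sv_split_around(rest, ',', &item, &rest);
      /* "%.*s" takes an int length followed by the pointer. */
      printf("item: '%.*s'\n", (int)item.len, item.data);
      ++count;
   }
   return count;
}
```

Calling `csv_print_items("snappy,zlib,zstd")` prints three items and returns 3. There is no `strndup` or `free` anywhere, which is the point of passing views instead of copies.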
- %-decoding now indicates the position and kind of error to the caller.
- %-decoding doesn't use sscanf.
- %-decoding allocates the full string up-front.
- Error messages related to %-decoding now explain the problem.
- Use new sized integer parsing functions.
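A rough sketch of a %-decoder with these properties follows. It is not the driver's implementation: the `pct_error` enum, `hex_value`, and `pct_decode` are hypothetical names, and the code assumes an ASCII-compatible character set.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical error kinds; not the driver's actual names. */
typedef enum pct_error {
   PCT_OK = 0,
   PCT_TRUNCATED, /* '%' followed by fewer than two characters */
   PCT_BAD_DIGIT, /* a character after '%' is not a hex digit */
   PCT_NO_MEMORY
} pct_error;

/* Value of one hex digit, or -1 if `c` is not a hex digit (assumes ASCII). */
static int hex_value(char c)
{
   if (c >= '0' && c <= '9') return c - '0';
   if (c >= 'a' && c <= 'f') return c - 'a' + 10;
   if (c >= 'A' && c <= 'F') return c - 'A' + 10;
   return -1;
}

/* Decode `len` bytes at `in`. Returns a malloc'd, null-terminated string
 * (caller frees), or NULL on failure, with *err and *err_pos telling the
 * caller what went wrong and at which input offset. */
static char *pct_decode(const char *in, size_t len, pct_error *err, size_t *err_pos)
{
   assert(err != NULL && err_pos != NULL);
   /* The output is never longer than the input, so allocate once up front. */
   char *out = malloc(len + 1);
   if (!out) {
      *err = PCT_NO_MEMORY;
      *err_pos = 0;
      return NULL;
   }
   size_t o = 0;
   for (size_t i = 0; i < len; ++i) {
      if (in[i] != '%') {
         out[o++] = in[i];
         continue;
      }
      if (len - i < 3) {
         *err = PCT_TRUNCATED;
         *err_pos = i;
         free(out);
         return NULL;
      }
      int hi = hex_value(in[i + 1]);
      int lo = hex_value(in[i + 2]);
      if (hi < 0 || lo < 0) {
         *err = PCT_BAD_DIGIT;
         *err_pos = hi < 0 ? i + 1 : i + 2;
         free(out);
         return NULL;
      }
      out[o++] = (char)(hi * 16 + lo);
      i += 2; /* skip the two hex digits just consumed */
   }
   out[o] = '\0';
   *err = PCT_OK;
   *err_pos = 0;
   return out;
}
```

Because the caller gets both a kind and an offset, it can build a message such as "invalid hex digit at offset 3" instead of a bare failure.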
Allow passing `error.message` as an input to `_mongoc_set_error` by storing the format output in a temporary buffer, then copying that buffer over `error.message`.
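The temporary-buffer approach can be sketched as follows. This is an illustration under assumptions, not the code of `_mongoc_set_error`: the `demo_error` struct and `demo_set_error` function are hypothetical.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for an error object with a fixed-size message. */
typedef struct demo_error {
   int code;
   char message[128];
} demo_error;

/* Format into a temporary buffer first, then copy over err->message.
 * Formatting directly into err->message would overwrite the message
 * while it is still being read as a "%s" argument. */
static void demo_set_error(demo_error *err, int code, const char *fmt, ...)
{
   assert(fmt != NULL);
   if (!err) {
      return;
   }
   char tmp[sizeof err->message];
   va_list args;
   va_start(args, fmt);
   vsnprintf(tmp, sizeof tmp, fmt, args); /* always null-terminates */
   va_end(args);
   err->code = code;
   memcpy(err->message, tmp, strlen(tmp) + 1);
}
```

With this shape, a call like `demo_set_error(&err, 2, "Invalid URI: %s", err.message)` safely prefixes the existing message.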
This changes host parsing to use sized strings, and adds more specific error messages in case of parse failures.
This reverts commit ad481cc.
Background
Refer: CDRIVER-5983
This PR is a subset of the changes that were originally slated for CDRIVER-5983. The URI parameter refactor turned out to be significantly more impractical than initially thought, so the changes related to parameter handling have been deferred until the relevant spec documents can consolidate on a cohesive behavior.
Regardless, some of the changes from that work can still be extracted and are useful in general throughout the codebase, specifically those related to string handling and parsing.
This PR introduces minor behavioral changes surrounding some edge cases related to parsing.
While the commit history was initially going to be clean for the overall changes, the partial rollback around parameter handling means that the commits themselves are not entirely useful in isolation.
Change Summary
String Views
This PR introduces the `mlib/str.h` header. This header has been re-introduced several times in libmongocrypt and amongoc, and this is the latest incarnation, which I believe is the most solid as it can build upon existing work regarding safe arithmetic and integer utilities.
`mstr_view` is a "string view" type. It consists of a `const char* data` pointer and a `size_t len`, where `len` refers to the length of the pointed-to string view.
Importantly, the data in `mstr_view` is NOT necessarily null terminated! This means that `mstr_view` requires new string handling routines that respect the length.
Because we can pass slices of strings, this greatly reduces the need to `strndup` every time we want to take a string slice to pass to another function, further reducing the occurrences of memory leaks, `goto fail`, and buffer overruns. Code using `mstr_view` is almost always far simpler than trying to juggle C strings.
There are several string algorithms:
- `mstr_find`, `mstr_find_first_of`, `mlib_substr`, `mstr_contains`, which inspect and manipulate string views.
- `mstr_cmp` is a string comparison function like `mlib_cmp`, which supports natural syntax like `mstr_cmp(a, !=, b)`.
- `mstr_split_at(S, Pos, [Drop,] *Pfx, *Sfx)` is the most useful algorithm. It takes an index and optional skip parameter, and produces two new string views from an input.
- `mstr_split_around(S, Infix, *Pfx, *Sfx) -> bool` builds upon `mstr_split_at`, and splits the string `S` around the first occurrence of `Infix`, returning `true` if it found `Infix`. This one API produces the greatest complexity reduction of all, for example when parsing a CSV string.
- `mstr_cstring` creates an `mstr_view` from a C string. This is unfortunately necessary for calling a lot of APIs, but may be unnecessary in the future (see the note on `_Generic` in `mlib/str.h`).
All algorithms that accept a string position index also accept a negative value to index from the end of the string. They also use `upsize_integer` in order to inhibit any integer conversion warnings. There is no need to worry about passing the right type to these functions; they will just Do the Right Thing without any implicit conversions.
To printf-format a sized string, one must use the `%.*s` printf-specifier. This expects an `int` to specify the string length, followed by the string pointer. Because this is common, the macro `MSTR_FMT(X)` will expand to the two arguments required for the `%.*s` specifier.
Case Sensitivity
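Locale-independent case folding restricted to Basic Latin is simple to state in code. The following is a minimal sketch with hypothetical names (`latin_tolower`, `latin_caseeq`), not the PR's implementation, and it assumes an ASCII-compatible character set:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Lowercase only 'A'..'Z'. Every other 32-bit value, including negative
 * ones and non-Latin codepoints, is returned unchanged, so the function
 * is defined for all inputs and never consults the locale. */
static int32_t latin_tolower(int32_t c)
{
   return (c >= 'A' && c <= 'Z') ? c + ('a' - 'A') : c;
}

/* Case-insensitive equality of two sized (not null-terminated) strings. */
static bool latin_caseeq(const char *a, size_t alen, const char *b, size_t blen)
{
   assert((a != NULL || alen == 0) && (b != NULL || blen == 0));
   if (alen != blen) {
      return false;
   }
   for (size_t i = 0; i < alen; ++i) {
      if (latin_tolower((unsigned char)a[i]) != latin_tolower((unsigned char)b[i])) {
         return false;
      }
   }
   return true;
}
```

Comparing this way avoids the duplicate-then-lowercase pattern, since neither input has to be copied or modified.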
- `mlib_latin_tolower`: Transforms a Basic Latin codepoint to a lowercase letter if it is an uppercase letter. Unlike standard `tolower`, this function is total in 32-bit integers, is very clear about what "lowercasing" it does, is not sensitive to the locale, and has no undefined behavior.
- `mlib_latin_charcasecmp`: Compares two codepoints for equivalence with case-insensitivity in Basic Latin. Uses `mlib_latin_tolower`.
- `mstr_latin_casecmp`: Like `mstr_cmp`, but case-insensitive using `mlib_latin_charcasecmp`.
bson_error_t
bson_error_clear
This function takes a `bson_error_t*` and resets it to a non-error value if the pointer is not null.
bson_error_reset
The new function-like macro `bson_error_reset` is defined. It is equivalent to `bson_error_clear` and takes an l-value pointer to `bson_error_t`. If the pointer is null, it is made to point to an anonymous `bson_error_t` in the local scope, allowing the function to rely on the presence of an error object, even if the caller didn't pass one.
This feature relies on C99 compound literals, so this function-like macro cannot be used in C++ 😢.
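One way such a macro could be written is sketched below. This is an assumption about the general technique, not the definition from the PR: `demo_error`, `demo_error_clear`, `demo_error_reset`, and `demo_check_nonempty` are hypothetical names standing in for the libbson ones.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for an error object. */
typedef struct demo_error {
   int code;
   char message[64];
} demo_error;

/* Reset to a non-error value if the pointer is not null. */
static void demo_error_clear(demo_error *e)
{
   if (e) {
      e->code = 0;
      e->message[0] = '\0';
   }
}

/* If ErrPtr is null, point it at an anonymous object (a C99 compound
 * literal) that lives until the enclosing block ends, then clear it.
 * ErrPtr must be a modifiable l-value because it may be reassigned. */
#define demo_error_reset(ErrPtr) \
   demo_error_clear((ErrPtr) = (ErrPtr) ? (ErrPtr) : &(demo_error){0})

/* Example function that may be called with error == NULL. */
static bool demo_check_nonempty(const char *s, demo_error *error)
{
   assert(s != NULL);
   demo_error_reset(error);
   /* From here on, `error` is never null, so no further checks are needed. */
   if (s[0] == '\0') {
      error->code = 1;
      snprintf(error->message, sizeof error->message, "string is empty");
      return false;
   }
   return true;
}
```

The macro should be used at the top of the function body, so that the anonymous object outlives every later write through `error`.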
bson/mongoc_set_error
The `set_error` functions that do a string-format into the error message now support passing the error message of the object as an input parameter to the format string itself. Previously, this would garble the content of `error.message`. Now, the code uses a temporary buffer to format the message, then copies it over `error.message`.
Integer Parsing
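Standard `strtoll` needs a null terminator and skips leading whitespace, so a parser for sized strings has to be written differently. A minimal sketch of a strict digits-only parser is shown here; the name `nat64_parse_sized` and the errno-style return codes are assumptions made for illustration, and the code assumes an ASCII-compatible character set.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Parse exactly `len` bytes at `s` as an unsigned integer in `base`.
 * No sign, no prefix, no whitespace: every byte must be a valid digit.
 * Returns 0 on success, EINVAL on bad input, ERANGE on overflow.
 * *out is only written on success. */
static int nat64_parse_sized(const char *s, size_t len, unsigned base, uint64_t *out)
{
   assert(base >= 2 && base <= 36);
   assert(out != NULL);
   if (len == 0) {
      return EINVAL;
   }
   uint64_t acc = 0;
   for (size_t i = 0; i < len; ++i) {
      char c = s[i];
      unsigned digit;
      if (c >= '0' && c <= '9') {
         digit = (unsigned)(c - '0');
      } else if (c >= 'a' && c <= 'z') {
         digit = (unsigned)(c - 'a') + 10;
      } else if (c >= 'A' && c <= 'Z') {
         digit = (unsigned)(c - 'A') + 10;
      } else {
         return EINVAL;
      }
      if (digit >= base) {
         return EINVAL;
      }
      /* Check before multiplying: acc * base + digit must fit in 64 bits. */
      if (acc > (UINT64_MAX - digit) / base) {
         return ERANGE;
      }
      acc = acc * base + digit;
   }
   *out = acc;
   return 0;
}
```

Because the length is explicit, a slice such as the first five bytes of `"27017/db"` can be parsed in place, while inputs like `" 42"` or `"42x"` are rejected rather than partially consumed.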
Because we are now using sized strings and not C strings, we need new integer parsing, because `strtoll` and friends expect a null terminator. Additionally, we can do better than `strtoll`:
- `mlib_i64_parse(Str, Base, *Out) -> int` parses a 64-bit signed integer in `Str`. It differs from `strtoll` in a few important ways.
- `mlib_i32_parse(Str, Base, *Out) -> int`: `int32_t` version of `mlib_i64_parse`.
- `mlib_nat64_parse(Str, Base, *Out) -> int`: like `mlib_i64_parse`, but does not allow sign or base prefixes. `Base` must be non-zero. The input must be a string of only digits. This is the core of integer parsing.
Previously, our code using `strtoll` was silently relying on whitespace trimming and had clunky or incomplete detection of whether `strtoll` stopped early. These parsing APIs are much more strict.
The implementation may be less efficient than the hyper-optimized `strtoll`, but it's much easier to use correctly. Optimizations can come later.
Error Messages
Several APIs have been modified to return better error messages, including APIs that signal failure with `true/false`. For example, %-decoding now explains why a %-string is invalid, and that explanation will be included when a URI is rejected.
Other Changes
The following other minor changes were made:
- Passing an empty string `""` to `mongoc_uri_set_compressors` was previously an error. The new behavior is to clear the compressors on the URI (as if passing a null pointer).
Not Included
Owning Sized-String `mstr`
There was going to be a type `mstr` that owned a sized mutable null-terminated string, with its own set of useful "Does the Right Thing" algorithms (append, splice, copy, erase, replace, push, etc.). I have the code for this, but it can wait for a future PR.
UTF-8 Validation
It may be useful to force `mstr_view` to only contain valid UTF-8 (i.e. you cannot create an `mstr_view` without checking that you have valid UTF-8), but that's a larger change for the potential future.
Logging/Errors/Warning Mess
The behavior around logging, warning, and erroring in URI handling is a mess. There are cases where we test that an API logs, and sometimes we expect it to be silent and put the error on an error object instead, and sometimes we expect both!
This inconsistency has been mostly retained. In the future, any API that accepts a `bson_error_t*` probably shouldn't log anything.