Skip to content

CDRIVER-5869 fix bulk write crash with verbose results #1843

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 5 commits into from
Jan 30, 2025

Conversation

kevinAlbs
Copy link
Collaborator

Summary

Uses integral offsets, rather than a bson_iter_t, to refer to "_id" fields in the bulkWrite payload.

Verified with this patch build: https://spruce.mongodb.com/version/6798e244aed6b800073ef361

Background

See CDRIVER-5869 for background.

The `bson_iter_t` stores a pointer to the referred data. The payload buffer `mongoc_bulkwrite_t::ops` may be reallocated and invalidate existing iterators. Instead, store integral offsets into the payload.
@kevinAlbs kevinAlbs marked this pull request as ready for review January 29, 2025 13:57
Copy link
Collaborator

@eramongodb eramongodb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAICT the purpose of the id_loc (formerly id_iter) data member is ultimately to populate the insertedId field in the bulk write insert result document by _bulkwriteresult_set_insertresult in _bulkwritereturn_apply_result, which must correlate to the index of a bulk write insert operation.

According to db.collection.bulkWrite Examples (MongoDB Manual) there should be an insertedIds field in the reply document containing the list of index-id pairs (regardless whether the bulk write operation is ordered or unordered). Can this field be parsed instead, in a manner similar to what is currently being done with the upserted field to populate the upsertedId field for an bulk write update operation result? This would avoid the need to retain an iterator to the _id field of the insert document and its resulting implicit reference semantics.

If the insertedIds field cannot be used (I could not find discussion/rationale in #1590), can we avoid implicit reference semantics entirely by instead storing an owning copy (via bson_strdup) of the BSON UTF-8 string (as already done for the .ns data member)?

@kevinAlbs
Copy link
Collaborator Author

According to db.collection.bulkWrite Examples (MongoDB Manual) there should be an insertedIds field in the reply document

That example uses the mongosh helper. The server reply does not return the _id fields. The driver is expected to generate the _id field for inserted documents. Using mongosh:

db.coll.drop()
// Do an insert:
const reply = db.adminCommand({bulkWrite: "admin", ops: [{insert: 0, document: {_id: "foo" } }], nsInfo: [{ns: "test.coll"}] })
print (reply.cursor.firstBatch[0]) // { ok: 1, idx: 0, n: 1 }

For upserts, the server is expected to generate the _id and return it:

db.coll.drop()
// Do an upsert:
const reply = db.adminCommand({bulkWrite: "admin", errorsOnly: false, ops: [{update: 0, filter: { _id: "foo" }, updateMods: { $set: { x: 1 }}, upsert: true }], nsInfo: [{ns: "test.coll"}] })
print (reply.cursor.firstBatch[0]) // { ok: 1, idx: 0, n: 1, nModified: 0, upserted: { _id: 'foo' } }

can we avoid implicit reference semantics entirely by instead storing an owning copy

Possibly yes, but it may come at a non-neglible performance cost. Reporting inserted _ids is only needed when verbose results are requested. mongoc_bulkwrite_append_insertone does not know if the _ids will be needed:

mongoc_bulkwrite_t *bw = mongoc_client_bulkwrite_new (client);
ASSERT_OR_PRINT (mongoc_bulkwrite_append_insertone (bw, "db.c", tmp_bson ("{'_id': 'foo'}"), NULL, &error), error);
mongoc_bulkwriteopts_t *bw_opts = mongoc_bulkwriteopts_new ();
mongoc_bulkwriteopts_set_verboseresults (bw_opts, true); // Report insertedId's in result.
mongoc_bulkwritereturn_t bwr = mongoc_bulkwrite_execute (bw, bw_opts);
ASSERT (bwr.res);
const bson_t *insertresults = mongoc_bulkwriteresult_insertresults (bwr.res);
ASSERT_MATCH (insertresults, BSON_STR ({"0" : {"insertedId" : "foo"}}));

_id can be an arbitrarily BSON value. I expect requesting verbose results is an uncommon use case. I would rather avoid the cost of copying the _id of each inserted document to only be used for verbose results.

I considered a simpler alternative to avoid tracking the _id offset and iterating later: kevinAlbs@126e45c. But this does not avoid the implicit reference so I did not pursue it.

@kevinAlbs kevinAlbs requested a review from eramongodb January 29, 2025 18:34
@mdbmes
Copy link
Contributor

mdbmes commented Jan 29, 2025

Interesting bug. There might be a nonzero security impact (address space read) if the returned IDs and input documents were both accessible to untrusted code.

One early comment I have is that a copy of ObjectID should actually be smaller than this location reference. I don't think there's a performance argument for referencing instead of copying.

Edit: Never mind, I'm forgetting the _id can be an arbitrary type not just an ObjectID.

Copy link
Contributor

@mdbmes mdbmes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me.

// `id_loc` locates the "_id" field of an insert document.
struct {
size_t op_start; // Offset in `mongoc_bulkwrite_t::ops` to the BSON for the insert op: { "document": ... }
size_t op_len; // Length of insert op.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's tempting to suggest narrower types here to save a little cache space etc, but this is already SO much smaller than bson_iter_t :)

Co-authored-by: Ezra Chung <[email protected]>
@kevinAlbs kevinAlbs merged commit 56a89c5 into mongodb:master Jan 30, 2025
42 of 44 checks passed
kevinAlbs added a commit that referenced this pull request Jan 31, 2025
* do not use `bson_iter_t` to store `_id` location

The `bson_iter_t` stores a pointer to the referred data. The payload buffer `mongoc_bulkwrite_t::ops` may be reallocated and invalidate existing iterators. Instead, store integral offsets into the payload.

* fix comment describing `ops_byte_len` and `ops_doc_len`

---------

Co-authored-by: Ezra Chung <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants