CDRIVER-5869 fix bulk write crash with verbose results #1843

kevinAlbs · 2025-01-29T13:37:53Z

Summary

Uses integral offsets, rather than a bson_iter_t, to refer to "_id" fields in the bulkWrite payload.

Verified with this patch build: https://spruce.mongodb.com/version/6798e244aed6b800073ef361

Background

See CDRIVER-5869 for background.

The `bson_iter_t` stores a pointer to the referred data. The payload buffer `mongoc_bulkwrite_t::ops` may be reallocated and invalidate existing iterators. Instead, store integral offsets into the payload.

eramongodb

AFAICT the purpose of the id_loc (formerly id_iter) data member is ultimately to populate the insertedId field in the bulk write insert result document by _bulkwriteresult_set_insertresult in _bulkwritereturn_apply_result, which must correlate to the index of a bulk write insert operation.

According to db.collection.bulkWrite Examples (MongoDB Manual) there should be an insertedIds field in the reply document containing the list of index-id pairs (regardless whether the bulk write operation is ordered or unordered). Can this field be parsed instead, in a manner similar to what is currently being done with the upserted field to populate the upsertedId field for an bulk write update operation result? This would avoid the need to retain an iterator to the _id field of the insert document and its resulting implicit reference semantics.

If the insertedIds field cannot be used (I could not find discussion/rationale in #1590), can we avoid implicit reference semantics entirely by instead storing an owning copy (via bson_strdup) of the BSON UTF-8 string (as already done for the .ns data member)?

kevinAlbs · 2025-01-29T18:30:37Z

According to db.collection.bulkWrite Examples (MongoDB Manual) there should be an insertedIds field in the reply document

That example uses the mongosh helper. The server reply does not return the _id fields. The driver is expected to generate the _id field for inserted documents. Using mongosh:

db.coll.drop()
// Do an insert:
const reply = db.adminCommand({bulkWrite: "admin", ops: [{insert: 0, document: {_id: "foo" } }], nsInfo: [{ns: "test.coll"}] })
print (reply.cursor.firstBatch[0]) // { ok: 1, idx: 0, n: 1 }

For upserts, the server is expected to generate the _id and return it:

db.coll.drop()
// Do an upsert:
const reply = db.adminCommand({bulkWrite: "admin", errorsOnly: false, ops: [{update: 0, filter: { _id: "foo" }, updateMods: { $set: { x: 1 }}, upsert: true }], nsInfo: [{ns: "test.coll"}] })
print (reply.cursor.firstBatch[0]) // { ok: 1, idx: 0, n: 1, nModified: 0, upserted: { _id: 'foo' } }

can we avoid implicit reference semantics entirely by instead storing an owning copy

Possibly yes, but it may come at a non-neglible performance cost. Reporting inserted _ids is only needed when verbose results are requested. mongoc_bulkwrite_append_insertone does not know if the _ids will be needed:

mongoc_bulkwrite_t *bw = mongoc_client_bulkwrite_new (client);
ASSERT_OR_PRINT (mongoc_bulkwrite_append_insertone (bw, "db.c", tmp_bson ("{'_id': 'foo'}"), NULL, &error), error);
mongoc_bulkwriteopts_t *bw_opts = mongoc_bulkwriteopts_new ();
mongoc_bulkwriteopts_set_verboseresults (bw_opts, true); // Report insertedId's in result.
mongoc_bulkwritereturn_t bwr = mongoc_bulkwrite_execute (bw, bw_opts);
ASSERT (bwr.res);
const bson_t *insertresults = mongoc_bulkwriteresult_insertresults (bwr.res);
ASSERT_MATCH (insertresults, BSON_STR ({"0" : {"insertedId" : "foo"}}));

_id can be an arbitrarily BSON value. I expect requesting verbose results is an uncommon use case. I would rather avoid the cost of copying the _id of each inserted document to only be used for verbose results.

I considered a simpler alternative to avoid tracking the _id offset and iterating later: kevinAlbs@126e45c. But this does not avoid the implicit reference so I did not pursue it.

ghost · 2025-01-29T19:28:06Z

Interesting bug. There might be a nonzero security impact (address space read) if the returned IDs and input documents were both accessible to untrusted code.

One early comment I have is that a copy of ObjectID should actually be smaller than this location reference. I don't think there's a performance argument for referencing instead of copying.

Edit: Never mind, I'm forgetting the _id can be an arbitrary type not just an ObjectID.

ghost

Looks good to me.

ghost · 2025-01-29T19:47:05Z

src/libmongoc/src/mongoc/mongoc-bulkwrite.c

+   // `id_loc` locates the "_id" field of an insert document.
+   struct {
+      size_t op_start;    // Offset in `mongoc_bulkwrite_t::ops` to the BSON for the insert op: { "document": ... }
+      size_t op_len;      // Length of insert op.


It's tempting to suggest narrower types here to save a little cache space etc, but this is already SO much smaller than bson_iter_t :)

src/libmongoc/tests/test-mongoc-bulkwrite.c

Co-authored-by: Ezra Chung <[email protected]>

* do not use `bson_iter_t` to store `_id` location The `bson_iter_t` stores a pointer to the referred data. The payload buffer `mongoc_bulkwrite_t::ops` may be reallocated and invalidate existing iterators. Instead, store integral offsets into the payload. * fix comment describing `ops_byte_len` and `ops_doc_len` --------- Co-authored-by: Ezra Chung <[email protected]>

kevinAlbs added 2 commits January 29, 2025 08:36

add failing test

67bb7e8

do not use bson_iter_t to store _id location

3f3cb04

The `bson_iter_t` stores a pointer to the referred data. The payload buffer `mongoc_bulkwrite_t::ops` may be reallocated and invalidate existing iterators. Instead, store integral offsets into the payload.

kevinAlbs force-pushed the bulkWrite-crash branch from 9338849 to 3f3cb04 Compare January 29, 2025 13:44

kevinAlbs marked this pull request as ready for review January 29, 2025 13:57

kevinAlbs requested review from eramongodb and a user January 29, 2025 13:57

eramongodb requested changes Jan 29, 2025

View reviewed changes

kevinAlbs added 2 commits January 29, 2025 13:07

fix comment describing ops_byte_len and ops_doc_len

388d188

remove out-dated test comment

8337dda

kevinAlbs requested a review from eramongodb January 29, 2025 18:34

ghost approved these changes Jan 29, 2025

View reviewed changes

eramongodb approved these changes Jan 29, 2025

View reviewed changes

src/libmongoc/tests/test-mongoc-bulkwrite.c Outdated Show resolved Hide resolved

do safe cast in test

2fd969c

Co-authored-by: Ezra Chung <[email protected]>

kevinAlbs merged commit 56a89c5 into mongodb:master Jan 30, 2025
42 of 44 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

CDRIVER-5869 fix bulk write crash with verbose results #1843

CDRIVER-5869 fix bulk write crash with verbose results #1843

Uh oh!

kevinAlbs commented Jan 29, 2025

Uh oh!

eramongodb left a comment

Uh oh!

kevinAlbs commented Jan 29, 2025

Uh oh!

ghost commented Jan 29, 2025 •

edited by ghost

Loading

Uh oh!

ghost left a comment

Uh oh!

ghost Jan 29, 2025

Uh oh!

Uh oh!

Uh oh!

Uh oh!

CDRIVER-5869 fix bulk write crash with verbose results #1843

CDRIVER-5869 fix bulk write crash with verbose results #1843

Uh oh!

Conversation

kevinAlbs commented Jan 29, 2025

Summary

Background

Uh oh!

eramongodb left a comment

Choose a reason for hiding this comment

Uh oh!

kevinAlbs commented Jan 29, 2025

Uh oh!

ghost commented Jan 29, 2025 • edited by ghost Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ghost left a comment

Choose a reason for hiding this comment

Uh oh!

ghost Jan 29, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

ghost commented Jan 29, 2025 •

edited by ghost

Loading