Skip to content

[refine](exec/pipeline) replace std::mutex/std::unique_lock with annotated wrappers for thread safety analysis#63106

Draft
Mryange wants to merge 1 commit intoapache:masterfrom
Mryange:use-thread-safety-in-pipeline
Draft

[refine](exec/pipeline) replace std::mutex/std::unique_lock with annotated wrappers for thread safety analysis#63106
Mryange wants to merge 1 commit intoapache:masterfrom
Mryange:use-thread-safety-in-pipeline

Conversation

@Mryange
Copy link
Copy Markdown
Contributor

@Mryange Mryange commented May 9, 2026

What problem does this PR solve?

Problem Summary:

This is a follow-up to #63070, extending the thread safety annotation migration into the pipeline executor (be/src/exec/pipeline/). Plain std::mutex and std::unique_lock<std::mutex> were replaced with AnnotatedMutex and LockGuard/UniqueLock wrappers from thread_safety_annotations.h, enabling Clang's -Wthread-safety static analysis to catch lock ordering and data race bugs at compile time.

While applying annotations, two latent data races in debug_string() methods were fixed:

  • Dependency::debug_string() and CountedFinishDependency::debug_string() read _blocked_task.size() without holding _task_lock.

  • PipelineTask::debug_string() read _blocked_dep and all dependency vectors without holding _dependency_lifecycle_lock.

  • CountedFinishDependency::debug_string() read _counter without holding _mtx.

  • RuntimeFilterTimerQueue::start() held two separate mutexes (cv_m + _que_lock) simultaneously. Consolidated into a single AnnotatedMutex _que_lock with std::condition_variable_any and a UniqueLock, eliminating the redundant lock.

  • PriorityTaskQueue::push() checked _closed before acquiring the lock — a data race now caught because _closed is annotated GUARDED_BY(_work_size_mutex). Moved the _closed check inside the lock, and hoisted _compute_level() (which doesn't touch shared state) before the lock.

  • Replaced C-style <stddef.h>/<stdint.h> with <cstddef>/<cstdint>, and removed unused includes.

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@hello-stephen
Copy link
Copy Markdown
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@Mryange
Copy link
Copy Markdown
Contributor Author

Mryange commented May 9, 2026

/review

Copy link
Copy Markdown
Contributor

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found one blocking concurrency/lifetime regression. The PR's goal appears to be adding thread-safety annotations and replacing plain mutex guards in BE pipeline dependency/task queue code; the changes are generally focused, but PipelineTask::debug_string() now snapshots raw dependency pointers and releases the lifecycle lock before using them. That breaks the lock's previous lifetime protection against finalize() clearing the owning shared states.

Critical checkpoints: goal is partially achieved, but not safely due to the debug-string lifetime race; scope is small and focused; concurrency is involved and the new lock lifetime in debug_string() is unsafe; no new lifecycle/static initialization/config/serialization compatibility concerns found; parallel code paths touched consistently for the annotated mutex conversion; no new tests are included for this race; no test result changes; observability is the affected area and should not introduce crash risk; no transaction/data write/FE-BE variable passing concerns found; no additional performance concern beyond the small debug snapshot copy.

User focus: no additional user-provided review focus was specified.

fmt::format_to(debug_string_buffer, "{}. {}\n", i,
_read_dependencies[i][j]->debug_string(cast_set<int>(i) + 1));
read_dependencies[i][j]->debug_string(cast_set<int>(i) + 1));
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This now dereferences raw Dependency* values after _dependency_lifecycle_lock has been released. The lock comment says it protects the dependency containers and the raw pointers they contain, and finalize() takes the same lock before clearing _sink_shared_state, _op_shared_states, and _shared_state_map, which own these dependencies. A concrete race is: debug_string() copies read_dependencies and releases the lock, observes the task as not finalized, then another thread runs finalize() and clears the shared states; the subsequent read_dependencies[i][j]->debug_string() can use a freed Dependency. Please keep the lifetime protection until all raw dependency pointers are no longer dereferenced, or snapshot owning shared_ptr state under the lock instead of raw pointers.

@Mryange Mryange force-pushed the use-thread-safety-in-pipeline branch from e520e45 to 73b7805 Compare May 9, 2026 09:47
@Mryange Mryange marked this pull request as draft May 9, 2026 09:48
@Mryange
Copy link
Copy Markdown
Contributor Author

Mryange commented May 9, 2026

run buildall

@hello-stephen
Copy link
Copy Markdown
Contributor

TPC-H: Total hot run time: 29844 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 73b7805fb9aae3618ea8bf973b8196ad75b8e2ae, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17732	3915	3879	3879
q2	q3	10711	869	609	609
q4	4660	460	349	349
q5	7436	1369	1151	1151
q6	190	167	143	143
q7	907	938	772	772
q8	9303	1446	1285	1285
q9	5726	5451	5377	5377
q10	6259	2112	1814	1814
q11	473	271	259	259
q12	628	412	295	295
q13	18083	3347	2777	2777
q14	297	291	269	269
q15	q16	893	860	802	802
q17	936	975	720	720
q18	6598	5670	5653	5653
q19	1159	1288	1089	1089
q20	500	388	272	272
q21	4614	2476	1972	1972
q22	478	429	357	357
Total cold run time: 97583 ms
Total hot run time: 29844 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	4810	4685	4749	4685
q2	q3	4673	4810	4235	4235
q4	2147	2201	1423	1423
q5	5005	5029	5285	5029
q6	203	167	136	136
q7	2127	1803	1664	1664
q8	3368	3117	3167	3117
q9	8434	8424	8616	8424
q10	4459	4508	4238	4238
q11	627	417	422	417
q12	717	744	524	524
q13	3224	3621	2957	2957
q14	314	307	278	278
q15	q16	799	934	704	704
q17	1382	1313	1259	1259
q18	7844	7132	7091	7091
q19	1165	1154	1157	1154
q20	2252	2278	1952	1952
q21	6202	5473	4948	4948
q22	543	495	414	414
Total cold run time: 60295 ms
Total hot run time: 54649 ms

@hello-stephen
Copy link
Copy Markdown
Contributor

TPC-DS: Total hot run time: 172362 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 73b7805fb9aae3618ea8bf973b8196ad75b8e2ae, data reload: false

query5	4349	665	532	532
query6	350	235	212	212
query7	4316	581	327	327
query8	335	248	251	248
query9	8819	4062	4016	4016
query10	470	342	287	287
query11	5981	2436	2182	2182
query12	197	129	124	124
query13	1266	605	430	430
query14	6896	5304	5060	5060
query14_1	4314	4409	4379	4379
query15	219	205	183	183
query16	1031	471	438	438
query17	1406	778	640	640
query18	2756	496	358	358
query19	302	217	185	185
query20	144	134	132	132
query21	216	143	123	123
query22	13665	14154	14402	14154
query23	17453	16569	16281	16281
query23_1	16309	16366	16342	16342
query24	7503	1746	1337	1337
query24_1	1358	1358	1365	1358
query25	610	540	459	459
query26	1062	310	169	169
query27	2689	619	380	380
query28	4435	1975	1954	1954
query29	1009	635	514	514
query30	303	233	199	199
query31	1118	1061	922	922
query32	89	71	67	67
query33	532	346	285	285
query34	1176	1129	628	628
query35	767	790	686	686
query36	1320	1352	1190	1190
query37	148	95	85	85
query38	3186	3153	3067	3067
query39	944	911	887	887
query39_1	859	878	866	866
query40	232	152	135	135
query41	65	61	59	59
query42	109	110	108	108
query43	334	331	298	298
query44	
query45	213	201	192	192
query46	1099	1194	758	758
query47	2309	2383	2231	2231
query48	403	414	313	313
query49	634	531	423	423
query50	742	288	227	227
query51	4353	4311	4208	4208
query52	106	106	93	93
query53	253	278	205	205
query54	310	286	257	257
query55	92	90	83	83
query56	311	311	304	304
query57	1416	1404	1327	1327
query58	299	276	275	275
query59	1628	1676	1498	1498
query60	350	351	332	332
query61	156	165	172	165
query62	671	615	569	569
query63	239	198	207	198
query64	2316	828	679	679
query65	
query66	1688	510	386	386
query67	29984	29954	29826	29826
query68	
query69	467	358	318	318
query70	1038	1047	948	948
query71	309	283	277	277
query72	2960	2712	2426	2426
query73	834	787	411	411
query74	5091	4905	4725	4725
query75	2778	2642	2329	2329
query76	2307	1142	772	772
query77	425	410	344	344
query78	12771	13019	12402	12402
query79	1522	999	707	707
query80	664	580	479	479
query81	452	288	235	235
query82	1371	169	127	127
query83	361	290	243	243
query84	260	141	112	112
query85	842	517	451	451
query86	395	335	314	314
query87	3399	3354	3239	3239
query88	3584	2697	2697	2697
query89	450	383	352	352
query90	1836	185	189	185
query91	202	218	161	161
query92	84	76	77	76
query93	972	958	565	565
query94	599	363	275	275
query95	677	397	447	397
query96	1105	773	342	342
query97	2719	2729	2606	2606
query98	249	231	238	231
query99	1100	1106	993	993
Total cold run time: 255052 ms
Total hot run time: 172362 ms

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants