Skip to content

[refactor](profile) Decouple profile node reports from display tree#63299

Open
foxtail463 wants to merge 2 commits into
apache:masterfrom
foxtail463:refactor/profile-structured-node-report
Open

[refactor](profile) Decouple profile node reports from display tree#63299
foxtail463 wants to merge 2 commits into
apache:masterfrom
foxtail463:refactor/profile-structured-node-report

Conversation

@foxtail463
Copy link
Copy Markdown
Contributor

Problem Summary:

BE previously reported pipeline profiles as an ordered list, and FE inferred the
meaning of each profile node from its position in that list. The first element
was treated as the fragment-level profile, while the remaining elements were
treated as pipeline profiles.

This couples the profile transport protocol to the display tree layout and makes
the logic fragile when the profile tree structure changes. It also forces FE to
recover semantic information from implicit ordering instead of using explicit
metadata.

This change adds structured profile node reports with explicit node type and
pipeline id. BE now reports fragment-level and pipeline-level profiles with
their semantic type, and FE builds the RuntimeProfile display tree from that
structured data. The aggregated multi-BE pipeline profile only consumes pipeline
reports.

@hello-stephen
Copy link
Copy Markdown
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@morrySnow
Copy link
Copy Markdown
Contributor

/review

Copy link
Copy Markdown
Contributor

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found blocking rolling-upgrade compatibility issues in the profile-reporting protocol.

Critical checkpoint conclusions:

  • Goal and tests: The profile-node typing change is partially covered by a new FE unit test, but mixed-version FE/BE compatibility is not covered and currently regresses profile reporting during rolling upgrades.
  • Scope: The core profile struct change is focused, but compatibility handling for the old fragment_id_to_profile field and old FE exec-status behavior is missing.
  • Concurrency: The reviewed changes do not introduce new shared mutable state requiring new lock ordering; existing profile maps remain protected by their existing mutexes/locks.
  • Lifecycle/static initialization: No new static/global lifecycle hazards were found in the changed code.
  • Configuration: No new configuration items were added.
  • Compatibility: Blocking issues found for both old-BE/new-FE and new-BE/old-FE profile reporting during rolling upgrades.
  • Parallel paths: Both legacy fragment_id_to_profile and new fragment_id_to_profile_node_reports profile transport paths need to be handled while the old thrift field still exists.
  • Tests: Missing mixed-version compatibility coverage for profile reporting.
  • Observability: Existing logs are sufficient for investigation, but the failures would surface only as dropped/rejected profile updates or status-less exec report handling.
  • Transaction/persistence/data correctness: No direct transaction or storage visibility changes found.
  • Performance: No blocking performance regression found in the changed paths.
  • User focus: No additional user-provided review focus was specified.

Because these issues can break profile reporting or exec-status handling in rolling upgrades, I am requesting changes.

Comment thread be/src/runtime/runtime_query_statistics_mgr.cpp Outdated
if (!profile.isSetFragmentIdToProfile()) {
LOG.warn("{} FragmentIdToProfile is not set", DebugUtil.printId(profile.getQueryId()));
return new Status(TStatusCode.INVALID_ARGUMENT, "FragmentIdToProfile is not set");
if (!profile.isSetFragmentIdToProfileNodeReports()) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes a new FE reject profile reports from old BEs during a rolling upgrade. The thrift struct still has the legacy fragment_id_to_profile field, and old BEs will continue to send only that field until they are upgraded; with this guard, updateProfile() returns INVALID_ARGUMENT and drops those profiles instead of translating the old TDetailedReportParams format. Please keep a fallback for fragment_id_to_profile until mixed-version reporting is no longer supported.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The impact is limited to profile visibility during rolling
upgrade. Query execution status, load progress and correctness are not affected.
Since this report path is only for profile collection and the new structured field is
the intended format after decoupling profile node reports from the display tree, I
would prefer not to add legacy fallback

@foxtail463
Copy link
Copy Markdown
Contributor Author

run buildall

@hello-stephen
Copy link
Copy Markdown
Contributor

Cloud UT Coverage Report

Increment line coverage 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 78.06% (1854/2375)
Line Coverage 64.52% (33329/51653)
Region Coverage 65.23% (16527/25335)
Branch Coverage 55.71% (8829/15848)

@hello-stephen
Copy link
Copy Markdown
Contributor

TPC-H: Total hot run time: 31239 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit aa3a9619a518a52b3e6fd891d417be6539905d3e, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17665	3935	3881	3881
q2	q3	10776	1398	823	823
q4	4687	470	349	349
q5	7610	2234	2115	2115
q6	242	176	142	142
q7	939	781	637	637
q8	9420	1767	1703	1703
q9	5166	4989	4973	4973
q10	6378	2074	1798	1798
q11	442	277	253	253
q12	628	427	300	300
q13	18082	3456	2786	2786
q14	262	259	241	241
q15	q16	795	778	712	712
q17	936	889	1000	889
q18	7227	5668	5536	5536
q19	1159	1329	1149	1149
q20	556	413	270	270
q21	5608	2623	2376	2376
q22	438	371	306	306
Total cold run time: 99016 ms
Total hot run time: 31239 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	4180	4093	4119	4093
q2	q3	4518	4943	4272	4272
q4	2118	2206	1422	1422
q5	4393	4314	4306	4306
q6	231	184	130	130
q7	1964	1939	1727	1727
q8	2571	2187	2178	2178
q9	8012	7887	7552	7552
q10	4560	4530	4059	4059
q11	604	414	377	377
q12	726	747	516	516
q13	3293	3624	2950	2950
q14	290	292	272	272
q15	q16	713	735	644	644
q17	1347	1315	1330	1315
q18	8053	7338	7197	7197
q19	1187	1142	1134	1134
q20	2200	2204	1929	1929
q21	5301	4623	4497	4497
q22	530	480	428	428
Total cold run time: 56791 ms
Total hot run time: 50998 ms

@hello-stephen
Copy link
Copy Markdown
Contributor

TPC-DS: Total hot run time: 169145 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit aa3a9619a518a52b3e6fd891d417be6539905d3e, data reload: false

query5	4328	660	524	524
query6	343	227	201	201
query7	4229	590	318	318
query8	324	224	209	209
query9	8849	4012	4011	4011
query10	453	346	311	311
query11	5825	2376	2204	2204
query12	193	133	131	131
query13	1278	608	444	444
query14	6450	5358	5071	5071
query14_1	4373	4331	4323	4323
query15	216	205	179	179
query16	1039	461	470	461
query17	1163	776	609	609
query18	2716	491	371	371
query19	226	220	167	167
query20	140	135	135	135
query21	216	141	121	121
query22	13618	13583	13414	13414
query23	17273	16364	16069	16069
query23_1	16196	16273	16173	16173
query24	7543	1778	1285	1285
query24_1	1298	1293	1280	1280
query25	548	479	419	419
query26	1326	317	171	171
query27	2686	537	347	347
query28	4327	1934	1929	1929
query29	961	616	489	489
query30	308	237	201	201
query31	1111	1065	959	959
query32	89	75	73	73
query33	531	344	296	296
query34	1168	1111	639	639
query35	762	782	677	677
query36	1346	1337	1173	1173
query37	152	102	91	91
query38	3215	3135	3051	3051
query39	948	908	910	908
query39_1	876	874	883	874
query40	232	150	127	127
query41	67	77	63	63
query42	110	111	111	111
query43	319	318	296	296
query44	
query45	211	201	189	189
query46	1103	1219	710	710
query47	2290	2393	2231	2231
query48	362	430	298	298
query49	620	487	384	384
query50	1000	338	255	255
query51	4276	4379	4232	4232
query52	106	106	94	94
query53	249	279	203	203
query54	327	289	248	248
query55	94	91	85	85
query56	308	317	309	309
query57	1415	1408	1326	1326
query58	300	274	271	271
query59	1534	1590	1404	1404
query60	319	321	308	308
query61	156	156	150	150
query62	679	620	564	564
query63	244	208	202	202
query64	2367	795	626	626
query65	
query66	1651	472	351	351
query67	30103	29979	29281	29281
query68	
query69	465	343	305	305
query70	998	1058	973	973
query71	313	273	271	271
query72	3057	2663	2387	2387
query73	844	760	432	432
query74	5062	4902	4735	4735
query75	2668	2604	2226	2226
query76	2304	1129	743	743
query77	387	413	333	333
query78	12189	12087	11693	11693
query79	1446	1017	720	720
query80	663	558	448	448
query81	449	273	244	244
query82	1399	156	123	123
query83	362	282	249	249
query84	267	141	107	107
query85	881	532	445	445
query86	386	348	329	329
query87	3406	3450	3240	3240
query88	3500	2657	2639	2639
query89	450	388	337	337
query90	1975	179	183	179
query91	180	167	142	142
query92	82	80	76	76
query93	1493	1506	898	898
query94	533	337	289	289
query95	679	398	448	398
query96	1044	764	327	327
query97	2699	2707	2565	2565
query98	237	231	230	230
query99	1100	1099	997	997
Total cold run time: 253453 ms
Total hot run time: 169145 ms

@hello-stephen
Copy link
Copy Markdown
Contributor

BE UT Coverage Report

Increment line coverage 0.00% (0/57) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 53.49% (20640/38587)
Line Coverage 37.15% (195089/525082)
Region Coverage 33.55% (152799/455408)
Branch Coverage 34.57% (66580/192622)

@hello-stephen
Copy link
Copy Markdown
Contributor

BE Regression && UT Coverage Report

Increment line coverage 85.96% (49/57) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 73.68% (27843/37787)
Line Coverage 57.61% (301694/523705)
Region Coverage 54.83% (252109/459823)
Branch Coverage 56.36% (108980/193352)

@hello-stephen
Copy link
Copy Markdown
Contributor

FE Regression Coverage Report

Increment line coverage 43.33% (13/30) 🎉
Increment coverage report
Complete coverage report

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants