[fix] avoid concurrent tablet stat iteration failures #63298

Open
yx-keith wants to merge 3 commits into apache:master from yx-keith:fix-tablet-stat-concurrency

@yx-keith
Contributor

What problem does this PR solve?

Issue Number: #59138

Related PR: #xxx

Problem Summary:
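TabletStatMgr may hit concurrency issues when FE metadata changes during tablet stat collection. MaterializedIndex.getTablets() and LocalTablet.getReplicas() return their internal mutable lists directly, so a caller iterating over them can fail if the list is modified during the iteration. updateTabletStat() also has a stale-metadata window between getTabletMeta() and getReplica(): the tablet metadata can be removed after the first call and before the second.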

Solution:
This PR makes the tablet stat read path more robust:

  • Return snapshot lists from MaterializedIndex.getTablets() and LocalTablet.getReplicas().
  • Skip stale stat updates in TabletStatMgr.updateTabletStat() when tablet metadata is removed concurrently.
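
A minimal sketch of the two ideas above, for illustration only. The classes below (`TabletStatSketch`, `Tablet`, `Replica`) are hypothetical, simplified stand-ins and not the actual Doris classes; the use of `synchronized` to guard the copy and the boolean return value of `updateTabletStat()` are assumptions made for this example, not a description of the diff.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-ins for the FE metadata classes (assumption, not Doris code).
class TabletStatSketch {
    static class Replica {
        final long backendId;
        volatile long rowCount;

        Replica(long backendId) {
            this.backendId = backendId;
        }
    }

    static class Tablet {
        private final List<Replica> replicas = new ArrayList<>();

        synchronized void addReplica(Replica replica) {
            replicas.add(replica);
        }

        synchronized void removeReplica(long backendId) {
            replicas.removeIf(r -> r.backendId == backendId);
        }

        // Return a copy taken under the lock, so callers iterate over a
        // stable snapshot instead of the internal mutable list.
        synchronized List<Replica> getReplicas() {
            return new ArrayList<>(replicas);
        }

        synchronized Replica getReplica(long backendId) {
            for (Replica r : replicas) {
                if (r.backendId == backendId) {
                    return r;
                }
            }
            return null;
        }
    }

    // Hypothetical tablet registry keyed by tablet id.
    static final Map<Long, Tablet> tablets = new ConcurrentHashMap<>();

    // Apply a reported stat; return false when the tablet or replica was
    // removed between the report being produced and this update.
    static boolean updateTabletStat(long tabletId, long backendId, long rowCount) {
        Tablet tablet = tablets.get(tabletId);
        if (tablet == null) {
            return false; // tablet dropped concurrently: skip the stale stat
        }
        Replica replica = tablet.getReplica(backendId);
        if (replica == null) {
            return false; // replica dropped concurrently: skip the stale stat
        }
        replica.rowCount = rowCount;
        return true;
    }
}
```

With this shape, a loop over `getReplicas()` keeps working even if replicas are removed while it runs, at the cost of one list copy per call; whether that cost matters depends on how often the getters are called on hot paths.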

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@hello-stephen
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (ideally with the specific error message), and how it was fixed.
  2. Which behaviors were modified: what the previous behavior was, what it is now, why it was changed, and what the possible impacts are.
  3. What features were added, and why.
  4. Which code was refactored, and why.
  5. Which functions were optimized, and what the difference is before and after the optimization.

@yx-keith
Contributor Author

run buildall

@yx-keith
Contributor Author

run buildall

@hello-stephen
Contributor

TPC-H: Total hot run time: 30762 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
TPC-H sf100 test result on commit be901132f6069d74136c6c887890d4831902fc1f, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
Query	Cold (ms)	Hot run 1 (ms)	Hot run 2 (ms)	Best hot (ms)
q1	17684	3852	3886	3852
q2	q3	10794	1362	839	839
q4	4680	465	348	348
q5	7585	2254	2109	2109
q6	323	171	142	142
q7	942	761	640	640
q8	9454	1679	1601	1601
q9	6740	4943	4899	4899
q10	6445	2108	1803	1803
q11	426	282	236	236
q12	685	422	290	290
q13	18228	3381	2755	2755
q14	257	257	232	232
q15	q16	825	781	705	705
q17	947	958	927	927
q18	6750	5633	5453	5453
q19	1204	1219	1038	1038
q20	524	398	259	259
q21	5630	2617	2327	2327
q22	432	358	307	307
Total cold run time: 100555 ms
Total hot run time: 30762 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
Query	Cold (ms)	Hot run 1 (ms)	Hot run 2 (ms)	Best hot (ms)
q1	4267	4195	4164	4164
q2	q3	4432	4897	4353	4353
q4	2158	2189	1375	1375
q5	4416	4241	4284	4241
q6	224	173	127	127
q7	2090	1899	1611	1611
q8	2633	2153	2143	2143
q9	7781	7702	7721	7702
q10	4570	4477	4082	4082
q11	585	422	370	370
q12	894	750	515	515
q13	3286	3622	3039	3039
q14	301	301	277	277
q15	q16	720	727	651	651
q17	1381	1340	1377	1340
q18	7904	7359	7085	7085
q19	1091	1099	1071	1071
q20	2213	2199	1934	1934
q21	5343	4619	4507	4507
q22	531	459	400	400
Total cold run time: 56820 ms
Total hot run time: 50987 ms

@hello-stephen
Contributor

TPC-DS: Total hot run time: 167902 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit be901132f6069d74136c6c887890d4831902fc1f, data reload: false

Query	Cold (ms)	Hot run 1 (ms)	Hot run 2 (ms)	Best hot (ms)
query5	4350	651	509	509
query6	342	227	195	195
query7	4300	536	306	306
query8	319	226	218	218
query9	8800	3992	3995	3992
query10	448	332	305	305
query11	5794	2349	2212	2212
query12	180	131	128	128
query13	1311	616	404	404
query14	5900	5342	5043	5043
query14_1	4373	4345	4322	4322
query15	209	208	185	185
query16	989	458	472	458
query17	1161	742	612	612
query18	2512	482	368	368
query19	222	212	170	170
query20	146	135	134	134
query21	222	143	124	124
query22	13636	13525	13451	13451
query23	17127	16371	15996	15996
query23_1	16213	16078	16194	16078
query24	7408	1743	1284	1284
query24_1	1281	1293	1292	1292
query25	539	460	404	404
query26	1306	317	173	173
query27	2703	551	357	357
query28	4431	1964	1935	1935
query29	1001	623	496	496
query30	306	239	200	200
query31	1115	1049	933	933
query32	86	73	74	73
query33	530	354	292	292
query34	1157	1111	639	639
query35	774	769	679	679
query36	1340	1304	1204	1204
query37	164	106	88	88
query38	3212	3131	3038	3038
query39	921	916	903	903
query39_1	871	894	870	870
query40	234	148	125	125
query41	66	66	63	63
query42	108	109	116	109
query43	327	321	284	284
query44	
query45	208	204	189	189
query46	1096	1157	711	711
query47	2329	2332	2167	2167
query48	406	395	297	297
query49	629	489	365	365
query50	956	334	256	256
query51	4302	4293	4238	4238
query52	105	104	94	94
query53	252	286	207	207
query54	314	265	245	245
query55	89	86	86	86
query56	284	292	303	292
query57	1433	1470	1379	1379
query58	327	275	269	269
query59	1601	1723	1501	1501
query60	320	330	315	315
query61	162	154	151	151
query62	669	627	576	576
query63	241	204	211	204
query64	2415	813	621	621
query65	
query66	1742	488	366	366
query67	30252	30162	29001	29001
query68	
query69	455	334	299	299
query70	1015	949	995	949
query71	307	273	267	267
query72	2900	2684	2367	2367
query73	883	737	395	395
query74	5064	4945	4757	4757
query75	2681	2595	2249	2249
query76	2310	1134	773	773
query77	395	414	355	355
query78	12201	12121	11639	11639
query79	1381	1037	718	718
query80	656	581	476	476
query81	467	280	248	248
query82	427	156	124	124
query83	357	277	257	257
query84	268	144	110	110
query85	955	629	450	450
query86	407	345	313	313
query87	3378	3322	3238	3238
query88	3464	2664	2643	2643
query89	432	397	334	334
query90	1936	172	183	172
query91	176	173	143	143
query92	80	78	73	73
query93	1605	1475	881	881
query94	531	352	324	324
query95	681	387	341	341
query96	965	821	325	325
query97	2685	2687	2566	2566
query98	238	224	239	224
query99	1120	1072	945	945
Total cold run time: 251899 ms
Total hot run time: 167902 ms

@hello-stephen
Contributor

FE UT Coverage Report

Increment line coverage 14.29% (2/14) 🎉
Increment coverage report
Complete coverage report
