[Draft](be) push CHAR padding strip down to page decoder #63291

Open
csun5285 wants to merge 1 commit into apache:master from csun5285:feature/shrink-char-padding-pushdown
Conversation

@csun5285
Contributor

@csun5285 csun5285 commented May 15, 2026

CHAR padding is kept only inside KeyCoder, where it is used exclusively for comparing short-key index entries and PK index entries. Every other layer sees and produces unpadded CHAR.

What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary:

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@hello-stephen
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@csun5285
Contributor Author

run buildall

@csun5285 csun5285 closed this May 15, 2026
@csun5285 csun5285 reopened this May 15, 2026
@csun5285
Contributor Author

run buildall

@hello-stephen
Contributor

FE UT Coverage Report

Increment line coverage `` 🎉
Increment coverage report
Complete coverage report

@hello-stephen
Contributor

TPC-H: Total hot run time: 31623 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 80fd402175dd8b3a63a7726f77068e72641c52c3, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17804	3903	3869	3869
q2	q3	10920	1463	820	820
q4	4810	484	343	343
q5	10630	2320	2194	2194
q6	408	178	136	136
q7	977	791	624	624
q8	9590	1678	1697	1678
q9	7030	4974	4918	4918
q10	6454	2125	1792	1792
q11	442	276	242	242
q12	645	429	297	297
q13	18171	3371	2748	2748
q14	265	256	230	230
q15	q16	808	788	715	715
q17	998	982	922	922
q18	6848	5733	6007	5733
q19	1259	1361	1106	1106
q20	504	402	258	258
q21	5847	2678	2679	2678
q22	475	379	320	320
Total cold run time: 104885 ms
Total hot run time: 31623 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	4551	4673	4434	4434
q2	q3	4871	5216	4597	4597
q4	2149	2217	1470	1470
q5	4794	4652	4718	4652
q6	241	182	131	131
q7	1824	1600	1398	1398
q8	2228	1918	1921	1918
q9	7291	7307	7201	7201
q10	4509	4420	4018	4018
q11	533	380	344	344
q12	712	717	505	505
q13	3043	3407	2772	2772
q14	296	293	248	248
q15	q16	680	705	603	603
q17	1270	1235	1231	1231
q18	7842	6963	6952	6952
q19	1136	1164	1076	1076
q20	2214	2216	1915	1915
q21	5319	4576	4400	4400
q22	530	466	412	412
Total cold run time: 56033 ms
Total hot run time: 50277 ms

@hello-stephen
Contributor

TPC-DS: Total hot run time: 168893 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 80fd402175dd8b3a63a7726f77068e72641c52c3, data reload: false

query5	4328	642	515	515
query6	329	238	197	197
query7	4289	564	304	304
query8	315	227	219	219
query9	8823	4042	3975	3975
query10	461	341	289	289
query11	5841	2357	2130	2130
query12	179	131	125	125
query13	1263	610	439	439
query14	5891	5393	5035	5035
query14_1	4360	4364	4306	4306
query15	208	205	188	188
query16	1003	465	450	450
query17	1175	765	615	615
query18	2596	504	368	368
query19	217	207	181	181
query20	145	134	133	133
query21	215	144	118	118
query22	13591	13547	13418	13418
query23	17252	16354	15995	15995
query23_1	16220	16262	16114	16114
query24	7389	1782	1295	1295
query24_1	1293	1263	1303	1263
query25	539	481	414	414
query26	1300	326	185	185
query27	2664	566	347	347
query28	4458	1964	1980	1964
query29	979	608	491	491
query30	313	242	204	204
query31	1117	1068	929	929
query32	88	76	72	72
query33	541	372	303	303
query34	1160	1142	615	615
query35	770	782	674	674
query36	1366	1334	1202	1202
query37	151	101	90	90
query38	3198	3144	3043	3043
query39	931	912	904	904
query39_1	878	868	859	859
query40	228	149	126	126
query41	67	64	78	64
query42	111	110	109	109
query43	320	321	292	292
query44	
query45	208	200	192	192
query46	1033	1181	718	718
query47	2351	2342	2243	2243
query48	390	418	292	292
query49	631	488	391	391
query50	998	349	242	242
query51	4329	4325	4213	4213
query52	107	107	93	93
query53	258	289	197	197
query54	312	269	253	253
query55	95	97	88	88
query56	302	311	298	298
query57	1398	1406	1339	1339
query58	315	281	263	263
query59	1552	1616	1453	1453
query60	321	322	317	317
query61	160	158	152	152
query62	687	620	574	574
query63	256	201	198	198
query64	2396	829	637	637
query65	
query66	1679	483	357	357
query67	30053	30007	29891	29891
query68	
query69	461	344	306	306
query70	1044	969	940	940
query71	317	278	269	269
query72	3034	2812	2569	2569
query73	805	770	424	424
query74	5065	4914	4734	4734
query75	2666	2575	2284	2284
query76	2288	1142	749	749
query77	396	411	334	334
query78	12062	12116	11540	11540
query79	1208	1017	700	700
query80	580	534	448	448
query81	449	275	240	240
query82	239	159	120	120
query83	279	277	249	249
query84	261	142	111	111
query85	849	572	450	450
query86	382	344	321	321
query87	3371	3330	3287	3287
query88	3482	2661	2657	2657
query89	423	389	336	336
query90	2160	180	176	176
query91	173	164	141	141
query92	80	76	73	73
query93	1343	1523	944	944
query94	528	353	318	318
query95	666	478	349	349
query96	1002	760	305	305
query97	2686	2682	2585	2585
query98	236	227	223	223
query99	1126	1126	982	982
Total cold run time: 251038 ms
Total hot run time: 168893 ms

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 83.33% (125/150) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 73.60% (27805/37779)
Line Coverage 57.49% (301005/523543)
Region Coverage 54.59% (250928/459698)
Branch Coverage 56.20% (108639/193292)

@hello-stephen
Contributor

FE Regression Coverage Report

Increment line coverage 0.00% (0/33) 🎉
Increment coverage report
Complete coverage report

@csun5285 csun5285 force-pushed the feature/shrink-char-padding-pushdown branch from 80fd402 to fb80b35 Compare May 16, 2026 01:32
Previously CHAR(N) was stored on disk padded with '\0' to length N and
stripped again at the top of every read path by
Block::shrink_char_type_column_suffix_zero, while various query-time
paths re-padded predicate values to match. That meant every CHAR read
paid for an extra column scan, and the padding contract leaked across
many layers.

This commit pushes the strip down so segments hold unpadded CHAR
slices natively:

  - Binary*PageDecoder applies strnlen to CHAR slices on decode (dict
    pool + plain pages), so column reads emit unpadded data directly.
  - OlapColumnDataConvertorChar no longer pads on write; segments
    written by the new code contain natural-length slices.
  - ZoneMap from_olap_string applies strnlen to CHAR min/max on read.
  - Predicate creators (comparison / in-list / not-in) and
    delete_handler no longer pad CHAR predicate values.
  - segment_iterator drops _char_type_idx / _has_char_type and the
    three shrink_char_type_column_suffix_zero calls.
  - Block / ColumnArray / ColumnMap / ColumnStruct lose the now-unused
    shrink_padding_chars overrides; ColumnDictionary drops
    get_shrink_value.
  - RowCursor::pad_char_fields() removed.
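
The decode-side strip described in the first bullet can be pictured with a small standalone sketch. The helper name `strip_char_padding` is hypothetical and is not the Doris page decoder API; it only illustrates how `strnlen` bounds a stored CHAR slice to its natural length without reading past the slice.

```cpp
#include <string.h>  // strnlen (POSIX)

#include <cassert>
#include <cstddef>
#include <string_view>

// Hypothetical helper, not a Doris function: returns a view over the bytes
// that precede the first '\0' in `stored`, scanning at most stored.size()
// bytes. A slice with no '\0' is returned unchanged.
inline std::string_view strip_char_padding(std::string_view stored) {
    if (stored.empty()) {
        return stored;  // nothing to scan; also avoids touching a null pointer
    }
    const std::size_t natural_len = strnlen(stored.data(), stored.size());
    assert(natural_len <= stored.size());
    return stored.substr(0, natural_len);
}
```

Under this sketch a CHAR(8) value stored as `abc` followed by five NUL bytes comes back as the 3-byte view `abc`, while a slice that is already unpadded passes through untouched, which is presumably why one decode path can serve both old and new segments.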

Index byte-format stability is preserved by keeping the pad inside
the KeyCoder. KeyCoderTraits<CHAR>::encode_ascending /
full_encode_ascending pad to schema_length internally (new
schema_length parameter, default 0, only consulted by CHAR). Short-key
index, PK index and segment min-max keys therefore remain
byte-identical to old BE writes, so cross-version lookups keep working.
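
A minimal sketch of the "pad inside the encoder" idea follows. The function `encode_char_key` is a hypothetical stand-in, not the real `KeyCoderTraits<CHAR>` signature, and treating `schema_length == 0` as "do not pad" is an assumption based on the default value mentioned above.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical stand-in for the CHAR key encoder: appends the unpadded
// `value` to `buf`, then zero-pads it up to `schema_length` bytes so the
// encoded key matches what an already padded value would have produced.
// Assumption: schema_length == 0 (the default) means no padding is added.
inline void encode_char_key(std::string_view value, std::size_t schema_length,
                            std::string* buf) {
    assert(buf != nullptr);
    buf->append(value);
    if (value.size() < schema_length) {
        buf->append(schema_length - value.size(), '\0');
    }
}
```

With `schema_length` set to 8, encoding the unpadded value `abc` yields the same 8 bytes as the padded on-disk form used by older writers, which is the property the index formats rely on.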

BloomFilter requires a format flag: old segments hashed the
zero-padded CHAR, new segments hash the unpadded value, so the reader
honours BloomFilterIndexPB.unpadded_char_filter and only probes when
the predicate hashing matches the segment hashing. Old segments fall
back to skipping BF pruning for CHAR -- safe (no false negatives), just
slower.
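
The reader-side rule can be sketched as a small decision function. The enum and function names are hypothetical; `segment_has_unpadded_flag` models the value of `BloomFilterIndexPB.unpadded_char_filter` read from the segment, and the predicate side is assumed to always hash the unpadded value.

```cpp
#include <cassert>

enum class CharBloomFilterAction {
    kProbe,        // segment and predicate hash the same byte form
    kSkipPruning,  // hashes may differ; keep the data and let later filtering decide
};

// Hypothetical helper, not Doris code. Non-CHAR columns are unaffected by the
// padding change, so they always probe. For CHAR, probing is only meaningful
// when the segment was written with unpadded hashing.
inline CharBloomFilterAction choose_char_bloom_filter_action(
        bool column_is_char, bool segment_has_unpadded_flag) {
    if (!column_is_char) {
        return CharBloomFilterAction::kProbe;
    }
    return segment_has_unpadded_flag ? CharBloomFilterAction::kProbe
                                     : CharBloomFilterAction::kSkipPruning;
}
```

Skipping the probe never drops matching rows; it only gives up the chance to prune, which matches the "safe, just slower" trade-off stated above.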

Tests updated: zone_map_index_test CharColumnPadding now expects
unpadded min/max; key_coder_test passes schema_length=0;
segment_writer_full_encode_keys_test passes per-column
key_index_sizes; char_type_padding_test rewritten around the new
contract.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@csun5285 csun5285 force-pushed the feature/shrink-char-padding-pushdown branch from fb80b35 to 2fb1707 Compare May 16, 2026 01:46
@csun5285
Contributor Author

run buildall

@hello-stephen
Contributor

FE UT Coverage Report

Increment line coverage `` 🎉
Increment coverage report
Complete coverage report

@hello-stephen
Contributor

TPC-H: Total hot run time: 31114 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 2fb170769c6af38fe25a6a3cfc6da410c0cbca61, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17588	3958	3950	3950
q2	q3	10979	1368	805	805
q4	4741	477	361	361
q5	10357	2306	2112	2112
q6	370	180	138	138
q7	957	787	632	632
q8	9691	1829	1471	1471
q9	6947	4954	4970	4954
q10	6485	2122	1856	1856
q11	445	266	247	247
q12	686	418	291	291
q13	18283	3352	2803	2803
q14	263	258	237	237
q15	q16	813	767	711	711
q17	879	940	988	940
q18	7132	5791	5559	5559
q19	1181	1246	961	961
q20	521	407	258	258
q21	5756	2532	2609	2532
q22	432	359	296	296
Total cold run time: 104506 ms
Total hot run time: 31114 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	4195	4119	4211	4119
q2	q3	4535	4901	4365	4365
q4	2083	2200	1390	1390
q5	4377	4281	5107	4281
q6	243	190	140	140
q7	1980	1799	1585	1585
q8	2424	2132	2057	2057
q9	7844	7790	7570	7570
q10	4592	4531	4098	4098
q11	588	411	370	370
q12	750	742	516	516
q13	3302	3721	2948	2948
q14	303	309	272	272
q15	q16	731	738	659	659
q17	1316	1332	1322	1322
q18	7819	7336	6810	6810
q19	1167	1096	1135	1096
q20	2217	2218	1936	1936
q21	5339	4657	4480	4480
q22	520	466	407	407
Total cold run time: 56325 ms
Total hot run time: 50421 ms

@hello-stephen
Contributor

TPC-DS: Total hot run time: 168927 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 2fb170769c6af38fe25a6a3cfc6da410c0cbca61, data reload: false

query5	4338	655	537	537
query6	324	216	202	202
query7	4320	535	299	299
query8	322	222	206	206
query9	8848	4030	3988	3988
query10	454	344	301	301
query11	5750	2337	2147	2147
query12	191	130	127	127
query13	1320	595	447	447
query14	5879	5322	5037	5037
query14_1	4318	4282	4294	4282
query15	206	202	184	184
query16	1030	455	406	406
query17	1112	732	587	587
query18	2698	487	346	346
query19	213	192	161	161
query20	139	132	128	128
query21	210	137	120	120
query22	13549	13548	13345	13345
query23	17064	16434	16037	16037
query23_1	16220	16338	16025	16025
query24	7431	1764	1313	1313
query24_1	1292	1325	1320	1320
query25	592	510	451	451
query26	1321	322	174	174
query27	2728	567	340	340
query28	4434	1971	1967	1967
query29	1047	670	527	527
query30	312	242	204	204
query31	1126	1081	948	948
query32	95	77	76	76
query33	552	367	321	321
query34	1171	1124	659	659
query35	801	815	683	683
query36	1341	1380	1179	1179
query37	154	104	104	104
query38	3166	3142	3076	3076
query39	926	939	910	910
query39_1	893	880	892	880
query40	238	161	133	133
query41	77	68	69	68
query42	118	111	112	111
query43	334	332	293	293
query44	
query45	217	208	196	196
query46	1045	1222	746	746
query47	2357	2369	2217	2217
query48	404	423	293	293
query49	648	514	409	409
query50	1046	348	260	260
query51	4300	4299	4185	4185
query52	108	111	99	99
query53	262	286	211	211
query54	333	289	274	274
query55	96	93	85	85
query56	314	323	310	310
query57	1430	1428	1333	1333
query58	318	286	276	276
query59	1586	1708	1450	1450
query60	368	317	309	309
query61	160	159	157	157
query62	673	624	573	573
query63	244	197	204	197
query64	2385	804	629	629
query65	
query66	1679	478	367	367
query67	30131	30120	29885	29885
query68	
query69	474	366	295	295
query70	967	957	973	957
query71	313	276	267	267
query72	2909	2745	2387	2387
query73	866	758	444	444
query74	5033	4928	4776	4776
query75	2692	2593	2294	2294
query76	2287	1163	763	763
query77	401	398	329	329
query78	12169	12181	11650	11650
query79	1499	1038	761	761
query80	915	560	459	459
query81	513	282	247	247
query82	1362	161	127	127
query83	347	278	250	250
query84	262	140	110	110
query85	911	546	456	456
query86	426	359	321	321
query87	3370	3440	3213	3213
query88	3552	2708	2660	2660
query89	447	390	340	340
query90	1795	182	185	182
query91	177	164	140	140
query92	81	80	71	71
query93	1600	1518	841	841
query94	621	357	321	321
query95	679	383	349	349
query96	1044	822	323	323
query97	2718	2737	2569	2569
query98	242	225	230	225
query99	1120	1100	983	983
Total cold run time: 253512 ms
Total hot run time: 168927 ms

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 82.31% (121/147) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 71.77% (27115/37779)
Line Coverage 55.09% (288429/523537)
Region Coverage 52.32% (240497/459692)
Branch Coverage 53.50% (103416/193288)

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 82.31% (121/147) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 71.78% (27119/37779)
Line Coverage 55.11% (288532/523537)
Region Coverage 52.31% (240442/459692)
Branch Coverage 53.52% (103457/193288)



2 participants