[Draft](be) push CHAR padding strip down to page decoder#63291
Open
csun5285 wants to merge 1 commit into
TPC-H: Total hot run time: 31623 ms |
TPC-DS: Total hot run time: 168893 ms |
Previously, CHAR(N) values were stored on disk padded with '\0' to
length N, and Block::shrink_char_type_column_suffix_zero stripped the
padding at the top of every read path. Various query-time paths also
re-padded predicate values to match. Every CHAR read therefore paid
for an extra column scan, and the padding contract leaked across
many layers.
This commit pushes the strip down so segments hold unpadded CHAR
slices natively:
- Binary*PageDecoder applies strnlen to CHAR slices on decode (dict
pool + plain pages), so column reads emit unpadded data directly.
- OlapColumnDataConvertorChar no longer pads on write; segments
written by the new code contain natural-length slices.
- ZoneMap from_olap_string applies strnlen to CHAR min/max on read.
- Predicate creators (comparison / in-list / not-in) and
delete_handler no longer pad CHAR predicate values.
- segment_iterator drops _char_type_idx / _has_char_type and the
three shrink_char_type_column_suffix_zero calls.
- Block / ColumnArray / ColumnMap / ColumnStruct lose the now-unused
shrink_padding_chars overrides; ColumnDictionary drops
get_shrink_value.
- RowCursor::pad_char_fields() removed.
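The decode-side strip described in the list above can be illustrated with a small sketch. This is not the Doris page decoder code: `strip_char_padding` is a hypothetical helper, and it only assumes that a padded CHAR value arrives as a byte range of the schema length whose unused tail is filled with '\0'.

```cpp
#include <cassert>
#include <string.h>   // strnlen (POSIX)
#include <string_view>

// Hypothetical helper (not a Doris API): returns the CHAR value without its
// trailing '\0' padding. The view still points into the caller's buffer.
inline std::string_view strip_char_padding(std::string_view padded) {
    if (padded.empty()) {
        return padded;  // nothing to scan; also avoids touching a null data()
    }
    assert(padded.data() != nullptr);
    // strnlen stops at the first '\0' or after padded.size() bytes,
    // whichever comes first, so it never reads past the slice.
    return padded.substr(0, strnlen(padded.data(), padded.size()));
}
```

Because strnlen stops at the first '\0', a value containing an embedded NUL would be cut at that byte in this sketch.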
Index byte-format stability is preserved by keeping the pad inside
the KeyCoder. KeyCoderTraits<CHAR>::encode_ascending /
full_encode_ascending pad to schema_length internally (new
schema_length parameter, default 0, only consulted by CHAR). Short-key
index, PK index and segment min-max keys therefore remain
byte-identical to old BE writes, so cross-version lookups keep working.
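A minimal sketch of the pad-inside-the-encoder idea follows. `encode_char_key_ascending` is a hypothetical stand-in, not the real KeyCoderTraits<CHAR> signature; it only shows how an unpadded value plus a schema_length can reproduce the bytes an older padded write would have produced, with schema_length = 0 meaning "do not pad".

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical stand-in for a CHAR key encoder (not the Doris signature).
// Appends `value` to `buf`, then zero-pads up to `schema_length` bytes.
// A schema_length of 0 leaves the value unpadded.
inline void encode_char_key_ascending(std::string_view value,
                                      std::size_t schema_length,
                                      std::string* buf) {
    assert(buf != nullptr);
    buf->append(value);
    if (value.size() < schema_length) {
        // Re-create the historical on-disk padding for index keys only.
        buf->append(schema_length - value.size(), '\0');
    }
}
```

For example, encoding "ab" with a schema_length of 4 yields the four bytes 'a', 'b', '\0', '\0', which matches what a padded CHAR(4) column value would have contributed to the key.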
BloomFilter requires a format flag: old segments hashed the
zero-padded CHAR, new segments hash the unpadded value, so the reader
honours BloomFilterIndexPB.unpadded_char_filter and only probes when
the predicate hashing matches the segment hashing. Old segments fall
back to skipping BF pruning for CHAR -- safe (no false negatives), just
slower.
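The reader-side decision can be sketched as below. The struct and function names are hypothetical; only the flag name unpadded_char_filter comes from the commit message. The sketch assumes predicate values are always unpadded in the new code, so a probe is only meaningful against segments whose filter was built from unpadded values.

```cpp
#include <cassert>
#include <string_view>

// Hypothetical mirror of the per-segment flag named in the commit message.
struct CharBloomFilterMeta {
    bool unpadded_char_filter = false;  // false: filter hashed zero-padded CHAR
};

// Returns true if the CHAR bloom filter may be probed with `predicate_value`.
// Returning false means "skip BF pruning and read the data", which is always
// safe because it can never hide a matching row.
inline bool should_probe_char_bloom_filter(const CharBloomFilterMeta& meta,
                                           std::string_view predicate_value) {
    // Assumption of this sketch: predicate values carry no trailing padding.
    assert(predicate_value.empty() || predicate_value.back() != '\0');
    return meta.unpadded_char_filter;
}
```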
Tests updated: zone_map_index_test CharColumnPadding now expects
unpadded min/max; key_coder_test passes schema_length=0;
segment_writer_full_encode_keys_test passes per-column
key_index_sizes; char_type_padding_test rewritten around the new
contract.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
TPC-H: Total hot run time: 31114 ms |
TPC-DS: Total hot run time: 168927 ms |
CHAR padding is kept only inside KeyCoder, used exclusively when short-key index entries and PK index entries are compared. Every other layer sees and produces unpadded CHAR.
What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary:
Release note
None
Check List (For Author)
Test
Behavior changed:
Does this need documentation?
Check List (For Reviewer who merge this PR)