Add datasketches HLL sketch aggregate functions#63143
Conversation
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
There was a problem hiding this comment.
setting the Maven version constraint to [3.6.3,) is sufficient for normal compilation; [3.9.0,) is not required.
|
run buildall |
|
run buildall |
|
compile |
FE UT Coverage ReportIncrement line coverage |
Cloud UT Coverage ReportIncrement line coverage Increment coverage report
|
FE UT Coverage ReportIncrement line coverage |
|
run buildall |
|
run buildall |
Cloud UT Coverage ReportIncrement line coverage Increment coverage report
|
FE UT Coverage ReportIncrement line coverage |
|
run p0 |
|
run cloud_p0 |
FE UT Coverage ReportIncrement line coverage |
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
FE Regression Coverage ReportIncrement line coverage |
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
FE Regression Coverage ReportIncrement line coverage |
|
run buildall |
|
run buildall |
TPC-H: Total hot run time: 29922 ms |
TPC-DS: Total hot run time: 171756 ms |
BE UT Coverage ReportIncrement line coverage Increment coverage report
|
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
FE Regression Coverage ReportIncrement line coverage |
|
run buildall |
FE UT Coverage ReportIncrement line coverage |
TPC-H: Total hot run time: 29650 ms |
TPC-DS: Total hot run time: 171815 ms |
| branch = openblas | ||
| [submodule "contrib/datasketches-cpp"] | ||
| path = contrib/datasketches-cpp | ||
| url = https://github.com/apache/datasketches-cpp.git |
There was a problem hiding this comment.
I directly pointed the submodule to the Apache DataSketches GitHub repository. Later, if needed, we can consider adding DataSketches to the doris-thirdparty repository.
BE UT Coverage ReportIncrement line coverage Increment coverage report
|
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
FE Regression Coverage ReportIncrement line coverage |
|
/review |
1 similar comment
|
/review |
|
Hello, sorry to bother you @CalvinKirs @morningman . Could you please help review my PR when you have time? I typed "/review" but nothing seems to have happened... Please take a look. :) |
|
/review |
There was a problem hiding this comment.
Review result: request changes.
Critical checkpoint conclusions:
- Goal/test coverage: the PR adds a DataSketches HLL union aggregate and covers basic query/function behavior, aliases, invalid input, and BE unit cases, but it does not prove aggregate-key/pre-aggregation storage behavior or empty-state merge behavior.
- Scope/focus: the main query aggregate implementation is focused, but the table aggregation type/pre-agg additions are incomplete across FE/BE storage paths.
- Concurrency/lifecycle: no new runtime concurrency or non-obvious shared lifecycle was found in the reviewed query path.
- Configuration/compatibility: no config was added; adding a new thrift aggregation enum and table aggregation mode needs complete BE handling before use.
- Parallel paths: FE enables the aggregate type and pre-agg path, but BE tablet schema/storage aggregation parsing is not updated, so this path is inconsistent.
- Error handling/memory/data correctness: corruption is converted to exceptions, but merge has a null-state crash scenario for groups/partials with only NULL or empty inputs. Data visibility/version/MoW paths are not directly involved.
- Tests/results: regression and BE unit tests exist, but missing tests for aggregate-key table/pre-agg behavior and empty-state merge; regression cleanup drops tables after the test, which is against the repository regression-test standard but not the blocking issue here.
- Observability/performance: no additional observability requirement identified for this function; no confirmed hot-path performance regression beyond the correctness issues.
User focus: no additional user-provided review focus was supplied.
| } | ||
| void merge(AggregateDataPtr __restrict place, ConstAggregateDataPtr rhs, | ||
| Arena&) const override { | ||
| this->data(place).merge(this->data(rhs).hll_union_data->get_result(datasketches::HLL_8)); |
There was a problem hiding this comment.
This merge path assumes the rhs state has already seen a non-empty sketch, but add() leaves hll_union_data null for NULL inputs (via the default nullable handling) and for empty strings (line 130). A grouped/partial aggregate state can therefore exist for a key whose rows are all NULL/empty, and merging that state dereferences rhs.hll_union_data here. Please handle the empty rhs state the same way serialize()/insert_result_into() do, e.g. skip merging when rhs has no union data or merge an explicit empty sketch, and add a test that merges an empty state.
| case HLL_UNION: | ||
| return TAggregationType.HLL_UNION; | ||
| case DATASKETCHES_HLL_UNION_AGG: | ||
| return TAggregationType.DATASKETCHES_HLL_UNION; |
There was a problem hiding this comment.
This exposes DATASKETCHES_HLL_UNION_AGG as a table aggregation type, but the BE tablet schema/storage side is not updated to understand DATASKETCHES_HLL_UNION. tablet_meta.cpp converts this enum to the string DATASKETCHES_HLL_UNION, and TabletColumn::get_aggregation_type_by_string() currently treats any non-empty unknown string as OLAP_FIELD_AGGREGATION_GENERIC, which creates an AggStateField rather than a string/datasketches union aggregation field. A table or rollup using this aggregate type will therefore be created with the wrong BE aggregation implementation. Please either complete the BE storage aggregation support and add an aggregate-key/pre-agg regression test, or do not expose this as an AggregateType/pre-agg mode yet.
What problem does this PR solve?
Issue Number:
Summary:
Implemented a built-in aggregate function that integrates the Datasketches HLL sketch. This aggregate function cannot rely on the Java UDF environment. Considering that in the Java UDF environment, Strings are encoded in UTF-8, which corrupts the binary data of sketches, the serialization/deserialization operations for sketches must be implemented on the BE side. (additionally, since Apache Datasketches has been added to the contrib directory via a git submodule, it will become very easy to add other sketches such as theta sketch in the future.)
see: #63142
use case: see regression test & #63142
Release note
Check List (For Author)
Test
Behavior changed:
Does this need documentation?
Check List (For Reviewer who merge this PR)