Add datasketches HLL sketch aggregate functions #63143

Open
nooneuse wants to merge 21 commits into apache:master from nooneuse:add_datasketches_union_aggregate_functions

@nooneuse nooneuse commented May 11, 2026

What problem does this PR solve?

An aggregate function is needed to process user data that contains DataSketches HLL sketches. In many aggregation scenarios, users pre-aggregate detailed data in Hive using the sketches provided by Apache DataSketches and then analyze the resulting sketches in various OLAP engines. Compared with the HLL union aggregate functions that these engines offer natively, DataSketches HLL sketches differ in two key ways: first, the use cases differ; second, the sketches can be used across different engines, for example in ES, Doris, and ClickHouse at the same time. Such requirements are common in production environments.

Issue Number:

Summary:
Implemented a built-in aggregate function that integrates the DataSketches HLL sketch. This function cannot rely on the Java UDF environment: there, Strings are encoded in UTF-8, which corrupts the binary sketch data, so sketch serialization and deserialization must be implemented on the BE side. In addition, because Apache DataSketches has been added to the contrib directory as a git submodule, adding other sketches such as the theta sketch should be straightforward in the future.
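
To illustrate the UTF-8 concern above, here is a minimal sketch, assuming the serialized sketch is an arbitrary byte string. The function name and the sample bytes are hypothetical (they are not a real DataSketches preamble), and the check is deliberately simplified. Bytes that fail it cannot survive a decode/encode round trip through a Java String, because the Java UTF-8 decoder substitutes U+FFFD for malformed input.

```cpp
#include <cstddef>
#include <string>

// Simplified structural UTF-8 check: validates lead bytes and continuation
// bytes only. It does not reject overlong encodings or surrogate code points.
inline bool is_structurally_valid_utf8(const std::string& bytes) {
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        std::size_t extra = 0;  // number of continuation bytes expected
        if (lead < 0x80) {
            extra = 0;  // plain ASCII
        } else if ((lead & 0xE0) == 0xC0) {
            extra = 1;
        } else if ((lead & 0xF0) == 0xE0) {
            extra = 2;
        } else if ((lead & 0xF8) == 0xF0) {
            extra = 3;
        } else {
            return false;  // stray continuation byte or invalid lead byte
        }
        if (i + extra >= bytes.size()) {
            return false;  // multi-byte sequence truncated at end of input
        }
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(bytes[i + k]);
            if ((c & 0xC0) != 0x80) {
                return false;  // expected a 10xxxxxx continuation byte
            }
        }
        i += extra + 1;
    }
    return true;
}
```

A binary blob such as the five bytes 0x02 0x01 0x07 0x8C 0xFF fails this check, which is why treating sketch bytes as text is unsafe and the bytes are better handled as opaque binary on the BE side.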

See: #63142
Use case: see the regression tests and #63142

Release note

  1. Add the Apache DataSketches third-party submodule.
  2. Add an aggregate function that integrates the DataSketches HLL sketch.

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
  • Behavior changed:

    • No.
  • Does this need documentation?

    • No. No separate documentation is needed; the usage is easy to understand, and it is clearly explained in the regression tests.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@hello-stephen
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

Comment thread fe/pom.xml
Author

Setting the Maven version constraint to [3.6.3,) is sufficient for normal compilation; [3.9.0,) is not required.

@BePPPower
Contributor

run buildall

@nooneuse
Author

run buildall

@nooneuse
Author

compile

@hello-stephen
Contributor

FE UT Coverage Report

Increment line coverage 4.00% (1/25) 🎉
Increment coverage report
Complete coverage report

@hello-stephen
Contributor

Cloud UT Coverage Report

Increment line coverage 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 78.05% (1849/2369)
Line Coverage 64.73% (33222/51327)
Region Coverage 65.25% (16441/25198)
Branch Coverage 55.81% (8780/15732)

@hello-stephen
Contributor

FE UT Coverage Report

Increment line coverage 4.00% (1/25) 🎉
Increment coverage report
Complete coverage report

@nooneuse
Author

run buildall

@nooneuse
Author

run buildall

@hello-stephen
Contributor

Cloud UT Coverage Report

Increment line coverage 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 78.05% (1849/2369)
Line Coverage 64.73% (33225/51327)
Region Coverage 65.24% (16439/25198)
Branch Coverage 55.80% (8779/15732)

@hello-stephen
Contributor

FE UT Coverage Report

Increment line coverage 4.00% (1/25) 🎉
Increment coverage report
Complete coverage report

@nooneuse
Author

run p0

@nooneuse
Author

run cloud_p0

@hello-stephen
Contributor

FE UT Coverage Report

Increment line coverage 4.00% (1/25) 🎉
Increment coverage report
Complete coverage report

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 75.47% (80/106) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 57.70% (21753/37703)
Line Coverage 40.99% (214197/522617)
Region Coverage 37.63% (172419/458184)
Branch Coverage 38.36% (73975/192866)

@hello-stephen
Contributor

FE Regression Coverage Report

Increment line coverage 35.71% (15/42) 🎉
Increment coverage report
Complete coverage report

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 75.47% (80/106) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 57.71% (21760/37703)
Line Coverage 41.00% (214279/522617)
Region Coverage 37.70% (172752/458184)
Branch Coverage 38.40% (74070/192866)

@hello-stephen
Contributor

FE Regression Coverage Report

Increment line coverage 35.71% (15/42) 🎉
Increment coverage report
Complete coverage report

@nooneuse
Author

run buildall

@nooneuse
Author

run buildall

@hello-stephen
Contributor

TPC-H: Total hot run time: 29922 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 1b85ce6274571c48a029fe24052d537847536dde, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17620	3978	3858	3858
q2	q3	10722	899	613	613
q4	4662	460	347	347
q5	7452	1338	1147	1147
q6	191	172	143	143
q7	917	944	781	781
q8	9327	1422	1315	1315
q9	5701	5463	5398	5398
q10	6267	2098	1852	1852
q11	464	265	264	264
q12	627	418	305	305
q13	18073	3368	2701	2701
q14	291	290	268	268
q15	q16	870	872	806	806
q17	960	1033	685	685
q18	6545	5796	5681	5681
q19	1150	1216	1121	1121
q20	503	401	269	269
q21	4938	2380	2018	2018
q22	488	430	350	350
Total cold run time: 97768 ms
Total hot run time: 29922 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	4819	4819	4869	4819
q2	q3	4684	4814	4236	4236
q4	2135	2235	1442	1442
q5	5007	4994	5269	4994
q6	200	168	134	134
q7	2113	1899	1648	1648
q8	3399	3094	3150	3094
q9	8523	8472	8462	8462
q10	4476	4491	4269	4269
q11	606	433	421	421
q12	707	751	521	521
q13	3330	3697	2903	2903
q14	296	310	286	286
q15	q16	795	775	693	693
q17	1552	1334	1294	1294
q18	8051	7162	7238	7162
q19	1187	1165	1172	1165
q20	2243	2235	1932	1932
q21	6241	5495	4952	4952
q22	553	534	425	425
Total cold run time: 60917 ms
Total hot run time: 54852 ms

@hello-stephen
Contributor

TPC-DS: Total hot run time: 171756 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 1b85ce6274571c48a029fe24052d537847536dde, data reload: false

query5	4340	658	524	524
query6	318	218	201	201
query7	4255	555	304	304
query8	323	241	222	222
query9	8830	4132	4078	4078
query10	453	339	306	306
query11	5789	2404	2259	2259
query12	181	127	126	126
query13	1288	623	450	450
query14	6143	5379	5045	5045
query14_1	4397	4413	4347	4347
query15	212	202	181	181
query16	1036	453	483	453
query17	1214	778	649	649
query18	2732	502	375	375
query19	225	213	171	171
query20	137	135	130	130
query21	221	147	124	124
query22	13722	14245	14437	14245
query23	17485	16613	16250	16250
query23_1	16467	16412	16283	16283
query24	7486	1749	1358	1358
query24_1	1350	1328	1329	1328
query25	582	485	412	412
query26	1295	321	172	172
query27	2674	611	332	332
query28	4366	1951	1938	1938
query29	976	653	510	510
query30	304	236	201	201
query31	1104	1062	936	936
query32	84	71	72	71
query33	533	352	295	295
query34	1164	1106	650	650
query35	761	769	666	666
query36	1324	1364	1114	1114
query37	144	99	88	88
query38	3186	3158	3079	3079
query39	917	927	883	883
query39_1	869	878	876	876
query40	236	159	135	135
query41	62	62	61	61
query42	111	109	111	109
query43	337	327	283	283
query44	
query45	203	201	187	187
query46	1072	1175	717	717
query47	2300	2256	2116	2116
query48	383	402	288	288
query49	629	531	431	431
query50	698	294	221	221
query51	4285	4242	4220	4220
query52	106	107	95	95
query53	256	273	200	200
query54	312	272	256	256
query55	93	87	82	82
query56	313	300	299	299
query57	1402	1380	1294	1294
query58	300	267	263	263
query59	1597	1688	1471	1471
query60	340	335	325	325
query61	151	157	197	157
query62	670	617	571	571
query63	252	205	209	205
query64	2356	811	679	679
query65	
query66	1678	510	399	399
query67	30038	29886	29858	29858
query68	
query69	466	340	302	302
query70	1001	1040	1000	1000
query71	308	267	267	267
query72	3071	2989	2582	2582
query73	822	745	419	419
query74	5049	4883	4711	4711
query75	2761	2647	2310	2310
query76	2279	1135	767	767
query77	409	420	351	351
query78	13003	12934	12418	12418
query79	1556	945	751	751
query80	1354	592	483	483
query81	519	281	242	242
query82	975	161	124	124
query83	360	283	259	259
query84	263	139	109	109
query85	1062	516	440	440
query86	445	333	305	305
query87	3411	3312	3216	3216
query88	3522	2672	2674	2672
query89	434	377	335	335
query90	1983	176	183	176
query91	177	169	140	140
query92	82	81	74	74
query93	1293	969	560	560
query94	704	334	304	304
query95	674	390	359	359
query96	1044	765	350	350
query97	2723	2680	2606	2606
query98	239	228	228	228
query99	1112	1109	966	966
Total cold run time: 255087 ms
Total hot run time: 171756 ms

@hello-stephen
Contributor

BE UT Coverage Report

Increment line coverage 88.79% (95/107) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 53.59% (20645/38524)
Line Coverage 37.21% (195140/524363)
Region Coverage 33.63% (152766/454189)
Branch Coverage 34.62% (66559/192280)

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 89.72% (96/107) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 73.68% (27796/37724)
Line Coverage 57.55% (300966/522980)
Region Coverage 54.90% (251761/458603)
Branch Coverage 56.33% (108727/193008)

@hello-stephen
Contributor

FE Regression Coverage Report

Increment line coverage 1.50% (15/998) 🎉
Increment coverage report
Complete coverage report

@nooneuse
Author

run buildall

@hello-stephen
Contributor

FE UT Coverage Report

Increment line coverage 4.00% (1/25) 🎉
Increment coverage report
Complete coverage report

@hello-stephen
Contributor

TPC-H: Total hot run time: 29650 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 1873aecb7c14ab17c615d86a5d0c10d1ac87f085, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17671	3815	3785	3785
q2	q3	10710	876	618	618
q4	4664	459	340	340
q5	7459	1329	1134	1134
q6	182	168	142	142
q7	897	966	748	748
q8	9294	1395	1270	1270
q9	5637	5367	5351	5351
q10	6246	2089	1851	1851
q11	465	265	255	255
q12	639	412	288	288
q13	18197	3264	2770	2770
q14	290	285	261	261
q15	q16	868	863	795	795
q17	852	978	738	738
q18	6503	5764	5656	5656
q19	1217	1240	1079	1079
q20	495	398	261	261
q21	4449	2411	1971	1971
q22	481	419	337	337
Total cold run time: 97216 ms
Total hot run time: 29650 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	4726	4622	4797	4622
q2	q3	4662	4791	4236	4236
q4	2115	2160	1410	1410
q5	4961	4987	5208	4987
q6	200	167	133	133
q7	2091	1870	1627	1627
q8	3516	3067	3080	3067
q9	8399	8415	8460	8415
q10	4479	4480	4232	4232
q11	591	433	395	395
q12	704	750	527	527
q13	3173	3569	2935	2935
q14	299	303	378	303
q15	q16	814	798	699	699
q17	1361	1312	1263	1263
q18	7923	7157	7119	7119
q19	1170	1136	1169	1136
q20	2221	2246	1984	1984
q21	6137	5396	4901	4901
q22	545	506	420	420
Total cold run time: 60087 ms
Total hot run time: 54411 ms

@hello-stephen
Contributor

TPC-DS: Total hot run time: 171815 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 1873aecb7c14ab17c615d86a5d0c10d1ac87f085, data reload: false

query5	4311	656	511	511
query6	341	220	203	203
query7	4276	550	287	287
query8	331	239	225	225
query9	8817	4035	4059	4035
query10	449	344	307	307
query11	5860	2348	2218	2218
query12	183	129	128	128
query13	1306	623	441	441
query14	6655	5364	5065	5065
query14_1	4367	4343	4351	4343
query15	213	204	193	193
query16	1013	497	466	466
query17	1322	779	642	642
query18	2774	486	361	361
query19	289	207	168	168
query20	142	133	132	132
query21	218	142	120	120
query22	13672	14116	14448	14116
query23	17420	16641	16251	16251
query23_1	16318	16289	16301	16289
query24	7364	1766	1328	1328
query24_1	1333	1337	1339	1337
query25	546	484	406	406
query26	1307	299	162	162
query27	2674	564	326	326
query28	4307	1953	1934	1934
query29	988	603	508	508
query30	297	232	197	197
query31	1113	1058	936	936
query32	80	75	71	71
query33	542	353	281	281
query34	1146	1140	635	635
query35	771	790	678	678
query36	1329	1341	1141	1141
query37	146	95	85	85
query38	3188	3132	3038	3038
query39	948	938	905	905
query39_1	874	900	861	861
query40	240	163	140	140
query41	70	67	67	67
query42	111	110	120	110
query43	324	327	279	279
query44	
query45	211	208	192	192
query46	1077	1185	722	722
query47	2321	2332	2244	2244
query48	413	435	300	300
query49	662	543	452	452
query50	690	287	219	219
query51	4280	4225	4412	4225
query52	108	104	97	97
query53	253	278	209	209
query54	332	279	275	275
query55	94	91	91	91
query56	334	310	331	310
query57	1409	1410	1323	1323
query58	334	293	281	281
query59	1584	1756	1463	1463
query60	346	345	349	345
query61	179	179	210	179
query62	678	618	536	536
query63	246	200	205	200
query64	2307	821	690	690
query65	
query66	1660	518	400	400
query67	30044	29880	29857	29857
query68	
query69	453	322	300	300
query70	1017	929	974	929
query71	297	275	263	263
query72	2898	2671	2408	2408
query73	829	758	435	435
query74	5066	4917	4718	4718
query75	2769	2637	2287	2287
query76	2307	1163	766	766
query77	396	429	345	345
query78	13019	12910	12307	12307
query79	1504	974	746	746
query80	1017	563	490	490
query81	499	292	239	239
query82	1306	156	120	120
query83	344	285	254	254
query84	256	145	112	112
query85	900	542	442	442
query86	427	348	323	323
query87	3389	3357	3211	3211
query88	3516	2659	2612	2612
query89	429	376	337	337
query90	1824	183	176	176
query91	182	172	139	139
query92	78	77	76	76
query93	957	964	567	567
query94	615	343	310	310
query95	655	381	439	381
query96	1050	744	302	302
query97	2692	2703	2592	2592
query98	231	238	241	238
query99	1099	1095	989	989
Total cold run time: 254224 ms
Total hot run time: 171815 ms

Comment thread .gitmodules
	branch = openblas
[submodule "contrib/datasketches-cpp"]
	path = contrib/datasketches-cpp
	url = https://github.com/apache/datasketches-cpp.git
Author

I pointed the submodule directly at the Apache DataSketches GitHub repository. If needed, we can later consider adding DataSketches to the doris-thirdparty repository.

@hello-stephen
Contributor

BE UT Coverage Report

Increment line coverage 88.79% (95/107) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 53.51% (20656/38605)
Line Coverage 37.16% (195159/525179)
Region Coverage 33.50% (152599/455455)
Branch Coverage 34.55% (66557/192638)

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 89.72% (96/107) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 71.81% (27147/37805)
Line Coverage 55.17% (288976/523802)
Region Coverage 52.01% (239162/459870)
Branch Coverage 53.52% (103490/193368)

@hello-stephen
Contributor

FE Regression Coverage Report

Increment line coverage 10.00% (15/150) 🎉
Increment coverage report
Complete coverage report

@nooneuse
Author

/review

@nooneuse
Author

Hello, sorry to bother you, @CalvinKirs @morningman. Could you please review my PR when you have time? I typed "/review", but nothing seems to have happened. Please take a look. :)

@zclllyybb
Contributor

/review

Contributor

@github-actions github-actions Bot left a comment

Review result: request changes.

Critical checkpoint conclusions:

  • Goal/test coverage: the PR adds a DataSketches HLL union aggregate and covers basic query/function behavior, aliases, invalid input, and BE unit cases, but it does not prove aggregate-key/pre-aggregation storage behavior or empty-state merge behavior.
  • Scope/focus: the main query aggregate implementation is focused, but the table aggregation type/pre-agg additions are incomplete across FE/BE storage paths.
  • Concurrency/lifecycle: no new runtime concurrency or non-obvious shared lifecycle was found in the reviewed query path.
  • Configuration/compatibility: no config was added; adding a new thrift aggregation enum and table aggregation mode needs complete BE handling before use.
  • Parallel paths: FE enables the aggregate type and pre-agg path, but BE tablet schema/storage aggregation parsing is not updated, so this path is inconsistent.
  • Error handling/memory/data correctness: corruption is converted to exceptions, but merge has a null-state crash scenario for groups/partials with only NULL or empty inputs. Data visibility/version/MoW paths are not directly involved.
  • Tests/results: regression and BE unit tests exist, but missing tests for aggregate-key table/pre-agg behavior and empty-state merge; regression cleanup drops tables after the test, which is against the repository regression-test standard but not the blocking issue here.
  • Observability/performance: no additional observability requirement identified for this function; no confirmed hot-path performance regression beyond the correctness issues.

User focus: no additional user-provided review focus was supplied.

    }
    void merge(AggregateDataPtr __restrict place, ConstAggregateDataPtr rhs,
               Arena&) const override {
        this->data(place).merge(this->data(rhs).hll_union_data->get_result(datasketches::HLL_8));
Contributor

This merge path assumes the rhs state has already seen a non-empty sketch, but add() leaves hll_union_data null for NULL inputs (via the default nullable handling) and for empty strings (line 130). A grouped/partial aggregate state can therefore exist for a key whose rows are all NULL/empty, and merging that state dereferences rhs.hll_union_data here. Please handle the empty rhs state the same way serialize()/insert_result_into() do, e.g. skip merging when rhs has no union data or merge an explicit empty sketch, and add a test that merges an empty state.
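
The "skip merging when rhs has no union data" option could look roughly like the sketch below. It is a self-contained illustration, not the PR's code: `ToyUnion` and `ToyAggState` are hypothetical stand-ins, and a plain set replaces the real `datasketches::hll_union` so the null-state handling can be shown without the library.

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>
#include <set>

// Hypothetical stand-in for the HLL union: an exact set is enough to show
// the null-state handling. It is NOT an HLL implementation.
struct ToyUnion {
    std::set<std::uint64_t> items;
    void update(const ToyUnion& other) {
        items.insert(other.items.begin(), other.items.end());
    }
};

// Mirrors the shape described in the review: the union is allocated lazily,
// so a state that only saw NULL or empty inputs keeps a null pointer.
struct ToyAggState {
    std::unique_ptr<ToyUnion> union_data;

    void add(std::uint64_t value) {
        if (!union_data) {
            union_data = std::make_unique<ToyUnion>();
        }
        union_data->items.insert(value);
    }

    void merge(const ToyAggState& rhs) {
        if (!rhs.union_data) {
            return;  // rhs never saw a sketch: nothing to merge, nothing to dereference
        }
        if (!union_data) {
            union_data = std::make_unique<ToyUnion>();  // lhs was empty so far
        }
        union_data->update(*rhs.union_data);
    }

    std::size_t distinct() const {
        return union_data ? union_data->items.size() : 0;
    }
};
```

With this guard, merging an all-NULL group into a populated state leaves the result unchanged, and merging two empty states stays empty. The alternative the review mentions, merging an explicit empty sketch, would instead keep the pointer always non-null.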

            case HLL_UNION:
                return TAggregationType.HLL_UNION;
            case DATASKETCHES_HLL_UNION_AGG:
                return TAggregationType.DATASKETCHES_HLL_UNION;
Contributor

This exposes DATASKETCHES_HLL_UNION_AGG as a table aggregation type, but the BE tablet schema/storage side is not updated to understand DATASKETCHES_HLL_UNION. tablet_meta.cpp converts this enum to the string DATASKETCHES_HLL_UNION, and TabletColumn::get_aggregation_type_by_string() currently treats any non-empty unknown string as OLAP_FIELD_AGGREGATION_GENERIC, which creates an AggStateField rather than a string/datasketches union aggregation field. A table or rollup using this aggregate type will therefore be created with the wrong BE aggregation implementation. Please either complete the BE storage aggregation support and add an aggregate-key/pre-agg regression test, or do not expose this as an AggregateType/pre-agg mode yet.
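
The fallback behavior described above can be pictured with the reduced sketch below. The enum and function are hypothetical and do not mirror the real BE definitions; the only point is that a name without its own branch falls through to the generic case, so a dedicated branch has to be added before the fallback. Mapping the empty string to `UNKNOWN` is an assumption made for the sketch.

```cpp
#include <string_view>

// Hypothetical, reduced enum: the names are illustrative only.
enum class ToyFieldAggregation { UNKNOWN, HLL_UNION, DATASKETCHES_HLL_UNION, GENERIC };

inline ToyFieldAggregation toy_aggregation_from_string(std::string_view name) {
    if (name.empty()) {
        return ToyFieldAggregation::UNKNOWN;  // assumption: empty means "not set"
    }
    if (name == "HLL_UNION") {
        return ToyFieldAggregation::HLL_UNION;
    }
    // Without this branch, the new name would fall through to GENERIC below,
    // which is the mismatch the review comment describes.
    if (name == "DATASKETCHES_HLL_UNION") {
        return ToyFieldAggregation::DATASKETCHES_HLL_UNION;
    }
    return ToyFieldAggregation::GENERIC;  // any other non-empty string
}
```

An aggregate-key regression test would then exercise the path end to end, rather than relying on the string mapping alone.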


4 participants