feat(ClickHouse): Add ClickHouse dialect with native JSON traits#8
Closed
khvn26 wants to merge 52 commits into
Closed
feat(ClickHouse): Add ClickHouse dialect with native JSON traits#8khvn26 wants to merge 52 commits into
khvn26 wants to merge 52 commits into
Conversation
uv-managed Python package mirroring the structure of flagsmith-flag-engine (Python) and flagsmith-rust-flag-engine (Rust). Translates `SegmentContext` predicate trees into SQL WHERE expressions. The translator emits a predicate that goes against an `IDENTITIES` table (aliased `i`) directly; trait-bound conditions become `EXISTS` subqueries against `TRAITS`. `PERCENTAGE_SPLIT` and `:semver`-marked comparators compile to inline pure-SQL — no UDF call required at runtime. Public API takes the engine's own TypedDicts: TranslateContext(environment: EnvironmentContext, ...) translate_segment(segment: SegmentContext, ctx: TranslateContext) `EnvironmentContext.key` is used directly as the `environment_id` literal in emitted SQL — no separate integer PK; `IDENTITIES` and `TRAITS` hold a STRING `environment_id` column matching the engine's vocabulary. Includes: - engine-test-data as git submodule - Dialect protocol with SnowflakeDialect implementation - 11 unit tests (translator SQL-shape regression, no Snowflake required) - 510 parity tests parametrised over engine-test-data, gated behind a Snowflake test account via SNOWFLAKE_* env vars - CI workflow: lint+unit on every PR; parity job gated on secrets beep boop
Ruff (lint + format) now runs as a pre-commit hook on every commit; mypy runs as a `make typecheck` local hook on staged Python files. CI calls `make typecheck` and `make test` instead of running the tools directly. Pattern follows github-webhook-handler. Adds: - .pre-commit-config.yaml with ruff-pre-commit, uv-pre-commit (uv-lock), pre-commit-hooks (check-yaml/json/toml), and a local python-typecheck hook calling `make typecheck`. Pre-commit.ci skips the heavy ones. - Makefile with install/install-pre-commit/lint/test/typecheck/help. - pre-commit added to dev deps; ruff dropped (provided by the pre-commit hook with its own pinned version). - README dev section rewritten around the make targets. Translator type cleanups picked up by mypy --strict on the new path: private helpers take dict[str, Any], threshold extracted from val with a None guard, and translate_segment coerces SegmentRule to dict before recursing. beep boop
prek (https://github.com/j178/prek) is a Rust rewrite of pre-commit, drop-in compatible with .pre-commit-config.yaml. PyPI-distributed so it slots into the dev group cleanly. Also drops the pre-commit.ci `ci:` block from .pre-commit-config.yaml (pre-commit.ci runs the standard Python pre-commit, not prek; instead the CI workflow gains a `make lint` step that runs prek directly). beep boop
Node 20 deprecation; v5 runs on Node 24. beep boop
Earlier proposal was column-per-trait wide-form. Doesn't scale: Snowflake caps tables at ~3,000 columns, ALTER TABLE per new trait key on the write path needs elevated grants and locks, and SaaS-aggregated trait vocabularies cross those limits. Pivoting to a single IDENTITIES table with a `traits` VARIANT column. Trait keys live as object keys inside the VARIANT — data, not schema. New trait keys never require DDL; the CDC pipeline just merges them into the existing VARIANT. Translator emits VARIANT path-extraction (`i.traits:"<key>"`) for trait-bound predicates. The slow PoC shape this is sometimes confused with — OBJECT_AGG at query time over a long-form TRAITS table, 245s at 100M — is a different operation: that one joined and aggregated per query. Here the VARIANT is pre-materialised; query time is a single columnar read with subkey extraction. Bench follow-up to validate VARIANT path-extraction perf is in the same ballpark as typed columns. If it isn't, consider hybrid (typed columns for hot traits + VARIANT for cold), but ship VARIANT-only first. Translator changes: - TranslateContext takes `traits_col` (default "i.traits"), drops `traits_table` and `identity_id_col`. - `trait_path(key)` produces `<traits_col>:"<key>"`. - Trait-bound predicates emit VARIANT path access + cast, not EXISTS. Tests + conftest mirror the new shape: single transient IDENTITIES table with VARIANT traits column; cases load via PARSE_JSON of the trait map. beep boop
Numbers from the PoC bench against 870M identities × 50 segments on X-Small with the result cache disabled. VARIANT pays a 47-73% overhead for trait-bound queries (simple 1.5s→2.6s, multi 2.1s→3.1s) but stays sub-4s on the smallest warehouse. Pure %Split is unaffected since it doesn't touch traits. beep boop
One job, one pytest invocation that runs both unit and parity tests with -n auto for parallelism. Drops the two-job separation that was mostly artificial — the conftest already skips parity tests when SNOWFLAKE_ACCOUNT is unset, so forked-PR runs (no secrets) get a clean "X passed, 510 skipped" without needing the workflow's `if:` guard. beep boop
Previous shape ran 102 INSERTs (one per case) plus 510 SELECTs (one per
parametrised test) — ~612 Snowflake round-trips. With xdist on 4 CPUs
the parity job took ~2m24s in CI, dominated by per-query latency, not
compute.
Two batches:
- `loaded_cases` does one multi-row INSERT for all 102 cases.
- `parity_results` runs a single UNION-ALL'd SELECT that evaluates
every translated segment against every case's identity row, returns
a `(case_idx, seg_key) -> bool | None` dict. Per-test invocation is
a dict lookup; no Snowflake call in the parametrised body.
Local bench: 31s wall time for 510 passing tests, serial. Down from
~2m24s. Drops `-n auto` from the CI workflow — the bottleneck is one
Snowflake query, not pytest concurrency, so xdist would just multiply
the mega-query setup across workers without speedup.
beep boop
The numeric comparator family (GREATER_THAN / LESS_THAN / *_INCLUSIVE) was interpolating segment-supplied values raw into the WHERE clause — a SQL-injection path for any user with `MANAGE_SEGMENTS` permission on the project. Same risk class as the MODULO branch already guarded against. Fix is symmetrical: float() the value in Python first, return None on ValueError so the caller falls back. Three new unit tests assert the guard: - injection-shaped value on each comparator returns None - clean numeric value interpolates the parsed float - injection in MODULO's divisor returns None Also adds the flagsmith-lint-tests pre-commit hook (FT003 / FT004 from flagsmith-common). Renames the parity test and reshapes every unit test to the `test_subject__condition__expected` template plus Given/When/Then comments enforced by the hook. beep boop
Hardens against SQL injection without adding a query-builder dependency.
Every place a segment-supplied value (operand, trait key, segment key,
env constants) crosses into emitted SQL now routes through one of:
- Sanitiser.escape_string / Sanitiser.string_literal — single-quote
SQL string literal escaping
- Sanitiser.variant_path_key — double-quote VARIANT path key escaping
- Sanitiser.numeric_literal — float-validates before interpolation;
rejects non-numeric, None, and bool (bool would diverge from engine
semantics)
- Sanitiser.modulo_literal — `divisor|remainder` operand parser; both
sides float-validated
19 unit tests document the contract, including injection-shaped inputs
on every primitive. Translator now only interpolates Sanitiser outputs
or known-safe substrings. Future reviewers can grep for `Sanitiser.` to
audit the full surface.
beep boop
Sanitiser class was just a static-method namespace; collapse to module-level functions in utils.py. Move Snowflake-specific VARIANT path syntax (`traits:"key"`) off the translator and onto the dialect as `trait_path(traits_col, key)`, where Postgres JSONB / long-form-table variants will also live.
The dialect already owns the IDENTITIES schema (via `schema_ddl`), so
column references for `$.identity.identifier` / `$.identity.key` and the
trait path now come from dialect methods (`identifier_expr`,
`identity_key_expr`, `trait_path(alias, key)`). `TranslateContext` only
keeps `identities_alias`.
`EvaluationContext.identity` is `NotRequired`, so accept the full eval
context instead of just the environment. Non-identity JSONPath properties
are resolved via `jsonpath_rfc9535.find_one` against the eval context
rather than special-casing `$.environment.{name,key}`.
`dialect` is now required — no implicit Snowflake default.
Imported by the translator for non-identity property resolution; was resolving transitively via flagsmith-flag-engine.
Carry the JSON shape (`context: EvaluationContext`, `result: EvaluationResult`) through `load_test_cases`, `loaded_cases`, and the parity test signature so mypy can type-check the field accesses instead of letting them resolve to Any.
Loading: switch the test-case loader from `*.json` to both `.json` and `.jsonc` via `json5` (the dataset half-uses `.jsonc` for newer cases — the old glob was silently dropping ~half of them). Backslashes in trait JSON: the SQL literal carrying `PARSE_JSON(...)` needs `\\` for any `\` in the JSON, otherwise Snowflake strips it before the JSON parser sees it. Numeric casts: switch `cast_float`/`cast_number` from `(...)::FLOAT/ NUMBER` to `TRY_TO_DOUBLE/TRY_TO_NUMBER`. A non-numeric variant value (e.g. a string trait used in MODULO) now NULL-propagates through the predicate instead of erroring the whole UNION ALL. Regex literal: Snowflake regex flavour treats `\d` in `'\d'` as the character `d`, not a digit class — needs `\\d`. Pushed regex-pattern escaping into the dialect (`_regex_literal`) and changed the protocol so callers pass raw Python regex strings. Type-aware EQUAL/NOT_EQUAL/IN on traits: mirrors flag_engine's per-type coercion (segment value cast to the trait's runtime type, no match on cast failure). Emitted as `(TYPEOF=... AND ...) OR (...)` with each branch gated — one CASE-WHEN form triggered Snowflake to evaluate all arms eagerly and choke on `(VARCHAR variant)::BOOLEAN`. The bool branch compares lowercase string forms to dodge the same trap. PERCENTAGE_SPLIT bails to FALSE at translate time when the eval context's identity lacks the referenced field — engine returns False when the looked-up value is None, but our row-bound SQL would otherwise hash a present-but-irrelevant column. Acknowledged divergence: two `:semver` cases with prerelease tags (`1.0.0-rc.2 < 1.0.0-rc.3`) — the SQL semver-sort-key is major.minor. patch only. Marked xfail by filename in `XFAIL_CASE_FILENAMES`. Result: 625 passed / 7 skipped (untranslatable operators) / 2 xfailed.
Audit of the 7 previously-skipped cases (`pytest.skip` because `translate_segment` returned None) found all 7 are translator gaps, not genuinely unsupported operators: - 2× MODULO with bad operand (`""`, `"invalid|value"`): now emits `FALSE`. Engine catches the cast error and returns False. - 4× JSONPath that resolves to None or a non-primitive object (`$.leads.to.nowhere`, `$.identity` against operators that don't cast): now pre-evaluated against the eval context via `is_context_in_segment` and collapsed to `TRUE`/`FALSE` (env values and other static jsonpath subjects are constant for every row). - 1× `$.`-prefixed property that doesn't parse as JSONPath: engine treats this as a trait-key lookup, the translator now does too (falls through the JSONPath branch into the trait branch). `pytest.skip` for the untranslatable case dropped — the dataset has no genuinely-unsupported conditions (no RE2-backref/lookahead regexes either). Replaced with an assertion + a comment pointing future unsupported additions at `XFAIL_CASE_FILENAMES` for explicit listing. Result: 632 passed / 2 xfailed (semver prerelease) / 0 skipped.
EQUAL on a trait used to expand to four `TYPEOF`-gated `OR` branches with
casts in every arm — fine for correctness, but ~2× the runtime of the
old string-only form on a 870M-row scan.
Refactored: emit the cheap string compare unconditionally, and only OR
in the typed branches that catch cases the string compare would miss:
- BOOLEAN: variant booleans stringify lowercase (`true`/`false`) so
string compare against `"True"`/`"False"` misses them.
- DECIMAL/DOUBLE: variant floats stringify to scientific/expanded
forms that don't equal the segment value's literal text.
- INTEGER: skipped entirely. Snowflake stringifies INTEGER variants
canonically (`(1)::STRING = '1'`), so the unconditional string
compare already handles them.
Branches are emitted only when the segment value can coerce to that
type at translate time — `value="enterprise"` collapses to a single
`(traits:"k")::STRING = 'enterprise'` (the OLD shape).
NOT_EQUAL keeps `TYPEOF` dispatch — its semantics (returns True only
when cast succeeded *and* values differ) can't be expressed as an OR
of positives without over-matching.
IN collapses to a single `TYPEOF IN ('VARCHAR', 'INTEGER')` gate plus
one string IN compare; bool/float/array traits never match per engine
semantics so they fall outside the gate.
Bool branch compares via the variant's `::STRING` form rather than
`::BOOLEAN` — Snowflake's optimiser eagerly evaluates the BOOLEAN cast
even when the `IS_BOOLEAN` guard would have short-circuited
(`100037: Boolean value 'red' is not recognized`). Float branch uses
`TRY_TO_DOUBLE` for the same reason.
Bench (X-Small, 870M rows, multi-condition segment, cache off):
- typed CASE: ~6.3s
- option 2: ~3.65s
- old string: ~2.8s
Parity: 600 passed + 2 xfailed (semver prerelease).
Narrow theoretical divergence on segment-value formats not in the test
data (e.g. `"001"` against int trait `1` — engine matches via `int()`,
SQL doesn't). Acknowledged trade-off for the perf gain.
Adds pytest-cov, configures branch coverage on `src` and `tests`, and
gates CI via 5monkeys/cobertura-action with `minimum_coverage: 100` —
matching the github-webhook-handler setup.
Audit of uncovered lines turned up dead code: `TranslateContext.jsonpath_expr`
had a fallback path for resolving non-identity JSONPath against the eval
context, but `translate_condition` now routes those through
`_engine_static_verdict` before jsonpath_expr is reached. Trimmed.
Defensive paths marked `# pragma: no cover[/branch]` with a one-line why:
- `_engine_static_verdict` engine-exception fallback
- conftest skip paths (env-config-dependent, missing submodule)
- `_engine_in_values`'s `[`-prefixed-but-not-list branch (unreachable —
valid JSON starting with `[` is always an array)
- parity-result loop branches that fire only when the dataset has an
untranslatable case (none today)
New unit tests cover the testable gaps: unknown operator, MODULO/IN with
bad operands, PERCENTAGE_SPLIT with absent identity fields, identity-bound
JSONPath comparators (CONTAINS/NOT_CONTAINS/NOT_EQUAL/IS_SET/IS_NOT_SET),
trait CONTAINS/NOT_CONTAINS, semver with an unsupported operator, rule
composition (NONE, empty rule, untranslatable nested rule, unknown type),
and the helper-level `_engine_in_values` / `jsonpath_expr` edges.
…returns False
Previously the translator returned None for any condition shape it
couldn't compile to "interesting" SQL — including cases the engine
itself evaluates to a deterministic False (bad operands, missing
properties, non-numeric thresholds). That muddied the contract for
callers: "None means fall back to the engine" was right for some Nones
and wrong for others.
Now:
- Engine-deterministic-False inputs compile to `"FALSE"`. Examples:
numeric comparator with a non-numeric operand (`"100; DROP TABLE …"`,
empty MODULO operand), `EQUAL` with no `value` field, IN with a
non-iterable value, semver-suffixed segment value paired with an
operator the engine treats as plain string compare.
- Internal-misuse contracts use `assert`, not `return None`. Examples:
`jsonpath_expr` called with a non-identity property (the translator
routes those through static eval before reaching here), unknown
segment rule type (Flagsmith's segment validation rejects upstream),
empty rule (same), `_comparison` exiting without matching any op.
- `None` is now reserved for genuinely untranslatable inputs: RE2-unsafe
regex (backref / lookaround), an operator outside
`TRANSLATABLE_OPERATORS`, and PERCENTAGE_SPLIT keyed on a non-identity
JSONPath (TODO: bake the resolved value as a hash literal). Callers
fall back to the engine for these.
`return None` site count: 19 → 8. Of the remaining 8, 2 are internal
signals between helpers (caller turns them into `"FALSE"`), 3 are pure
propagation up the rule/segment recursion, 3 are the genuine
untranslatable cases above.
Test cleanup at the same time: removed 8 unit tests that existed only
to colour branch-coverage lines green (asserting on internal helpers,
asserts that the segment_key auto-injection works, parsing
`[{"a":"b"}]` through `_engine_in_values`, etc.) Updated the tests that
previously asserted None to assert `"((FALSE))"`.
Coverage stays at 100% line + branch. Parity stays at 653 + 2 xfailed.
…c eval
$.identity.* JSONPaths look up against the eval-context identity in the
engine. In the parity-test setup that's the same identity as the row, so
static evaluation gave the right answer; in production, where the SQL
scans many identities per query, statically pre-computing
$.identity.traits.foo against the eval-ctx identity returns the wrong
trait value for every row but the eval-ctx one. Latent bug.
Added _classify_jsonpath to inspect the parsed path structure and route
each shape correctly:
- $.identity.identifier (any syntax) → i.identifier column
- $.identity.key (any syntax) → i.identity_key column
- $.identity.traits.<X> (any syntax) → i.traits:"<X>" per-row
- $.identity (the whole object) → static eval (the engine's
fail-cast on a dict reliably returns False per row, same as static)
- $.identity.<other> → None (no row mapping; caller
falls back to engine — three engine-test-data cases hit this)
- non-identity ($.environment.*, etc.) → static eval (correct)
Bracket-notation aliases now work too: $.identity['identifier'],
$.identity['traits']['foo'], etc. all resolve to the same row references
as their dot-notation counterparts.
PERCENTAGE_SPLIT picks up the same routing — $.identity.traits.<X>
hashes the per-row trait, identity columns hash the per-row column.
Parity: 656 passed / 2 xfailed (semver prerelease). Coverage: 100%
line + branch. Three new unit tests cover the new untranslatable
paths ($.identity.foo, $.identity.traits.*, PERCENTAGE_SPLIT on
unmapped identity field).
Translator was emitting Snowflake-specific SQL — TYPEOF(), IS_BOOLEAN(), IS_DECIMAL(), TRY_TO_DOUBLE(), and ::FLOAT casts — directly from _trait_typed_eq / _trait_typed_in, which broke the dialect-agnostic abstraction. Moved both functions onto SnowflakeDialect as `trait_eq` and `trait_in`, added the corresponding methods to the Dialect protocol, and have the translator delegate. Translator now only knows that traits live behind dialect.trait_path() and that EQUAL/NOT_EQUAL/IN need a type-aware predicate that the dialect emits — no leaking knowledge of how Snowflake stores or discriminates trait types. The remaining VARIANT mentions in translator.py are clearly labelled "Snowflake's implementation" rather than implied invariants. Engine-faithful IN value parsing (`_engine_in_values`) stays in the translator since it mirrors flag_engine's parsing rules and is purely Python-side, dialect-independent. The translator hands the parsed list to dialect.trait_in so the dialect doesn't redo the parsing. Parity: 656 passed / 2 xfailed. Coverage: 100% line + branch.
- Result type is now `JsonpathClassification(kind, trait_key)` — kind is
a Literal of the five classification names; trait_key is set only when
kind is "trait".
- The "invalid jsonpath syntax" case is folded into the trait kind: a
prop that doesn't parse as JSONPath becomes `("trait", prop)`, which
is what the engine does when its jsonpath compile fails. Removes the
None return entirely.
- Translator always classifies — non-`$`-prefixed and empty-string props
flow through the same code path; the `if classification is not None`
guards drop out.
Parity: 656 passed / 2 xfailed. Coverage: 100% line + branch.
Compile a JSONPath only when the property actually starts with `$` —
non-prefixed props are bare trait keys, classified as `("trait", prop)`
without the parse roundtrip.
Splitting the previous "static" classification: $.identity (the whole identity dict) gets its own kind that encodes the row-based answer directly, since every IDENTITIES row IS an identity by construction. Per engine semantics (cf. test_context_value_jsonpath_pointing_to_object note: "paths leading to non-primitive values are treated as not set"): - IS_SET($.identity) → FALSE - IS_NOT_SET($.identity) → TRUE - any scalar comparator → FALSE (engine fail-casts on the dict) The previous "static" classification went through _engine_static_verdict against the eval context, which silently assumed the eval ctx carries an identity. A production caller scanning rows without binding a specific identity would have got the wrong answer. Encoding the row-truth removes that assumption. Also xfail one engine-test-data case that needs trait-first dispatch (engine resolves `$.identity` as a literal trait key first; the dataset includes a trait literally named `$.identity`). Replicating per-row trait fallback in SQL would bloat every JSONPath predicate; defer.
Engine's `_get_context_value_getter` tries the property as a literal trait key first, falling back to JSONPath only if no such trait exists. A row that happens to carry a trait literally named `$.identity` (or `$.environment.name`, etc.) wins over the JSONPath resolution in the engine; we now mirror that in SQL. Every non-trait JSONPath classification is wrapped in `IFF(traits:"<prop>" IS NOT NULL, <trait_predicate>, <fallback>)`. The fallback is the per-classification SQL we already had (column ref for identity columns, static verdict for env paths, etc.); the trait branch goes through the same `_translate_trait_op` path that bare trait keys use. PERCENTAGE_SPLIT skips the wrapper — its hash subject can't be split across an IFF, and trait shadowing on PS is a vanishingly rare shape. Untranslatable identity-bound paths (`$.identity.foo`, `$.identity.traits.*`) still return None — without a fallback SQL the shadow check alone can't be row-correct, so the caller's engine fallback handles the whole condition (including shadowing). Cost: one extra `traits:"<prop>" IS NOT NULL` probe per JSONPath condition per row. Trait-heavy workloads (typical) pay nothing — bare trait keys bypass the wrapper. JSONPath-heavy workloads see roughly +15-30% per condition at X-Small on 870M rows. Removes the trait-shadow xfail (case 139). Now: 658 passed, 2 xfailed (semver-prerelease only). 100% line + branch coverage.
Bench result on the 870M-row table at X-Small:
- trait-only segment (4 conditions): unchanged ~3.7s
- same segment with one wrapped JSONPath: ~7.5s — Snowflake evaluates
both arms of the IFF rather than short-circuiting, so each wrapped
condition roughly doubles in cost.
Engine-faithful trait-first dispatch matters only for an esoteric shape
(trait literally named `$.identity` or similar) — exactly one
engine-test-data case. Pay the perf cost on every JSONPath condition in
every production query for that, no thanks. Reverted.
Case 139 returns to XFAIL_CASE_FILENAMES with a comment explaining the
trade.
Coverage.py records a fall-through branch from the last `case` of a `match` to function exit even when the match is exhaustive over a `Literal` type — it can't see mypy's exhaustiveness check. That showed up as `114->exit` and `544->exit` partials in CI, dropping total coverage to 99% and failing the 100% Cobertura gate. Add `case .+:` to `[tool.coverage.report] partial_branches` so any `case` line is treated as an intentionally-partial branch. Both existing match statements are exhaustive and already verified by mypy; this just stops coverage.py from double-counting against us. beep boop
Renovate's git-submodules manager only opens PRs when the submodule declares a tracked branch in `.gitmodules`. We were pinned a single infra commit ahead of v3.7.0; rolling back to the tag and adding `branch = v3.7.0` matches the Python flag-engine's pattern and lets Renovate raise a PR whenever a new release is tagged on engine-test- data. The infra commit we drop is a pre-commit setup in the upstream repo with no test-data changes — engine parity stays green. `renovate.json` mirrors the Python engine's structure but tightens the policy: all proactive update types are disabled by default, so non-submodule deps only get PRs through Renovate's vulnerability- alert path; the git-submodules manager is the only re-enabled lane, which makes engine-test-data releases the one channel that bumps without manual intervention. beep boop
Mirrors the Python flag-engine: every path owned by the flagsmith-back-end team. beep boop
The wide-form comparison column gets dropped along the way — we don't ship a typed-column shape, so contrasting against it was misleading prose. beep boop
Implements the `Dialect` protocol against ClickHouse — `MergeTree` schema with `Nullable(String)` traits read via `JSONExtract*`, RE2 anchored regex via `match`, hex-chunk parse via `reinterpretAsUInt32(reverse(unhex(...)))`, type dispatch via `JSONType`. Adds a `ClickHouseHarness` for the engine-parity suite (clickhouse- connect HTTP client, per-run Memory table, batched UNION ALL of `count() > 0`). All 599 engine-test-data segments pass; the three xfails match Snowflake's set (translator-level divergences shared by every dialect). beep boop
…ickHouse beep boop
|
Minimum allowed coverage is Generated by 🐒 cobertura-action against f89c656 |
Replaces the `Nullable(String)` traits column + `JSONExtract*` access with `JSON` (CH 24+) + typed-variant subcolumn access. Trait reads become direct columnar scans rather than per-row JSON parses. Wide-String was 14-30x slower than Snowflake at 870M; native JSON closes that to 2.5-4x. Also drops the hex round-trip in `parse_hex_chunk`: `md5_hex` now returns the raw 16-byte digest and `parse_hex_chunk` reads bytes directly via `reinterpretAsUInt32(reverse(substring(...)))`. Engine parity unchanged (599 passed, 3 xfailed — same set as Snowflake). beep boop
The first cut of the JSON-native dialect emitted an OR across five
typed-variant subcolumns per trait condition (`.:String`, `.:Int64`,
`.:UInt64`, `.:Float64`, `.:Bool`). `toString(<sub>)` returns the JSON
value's canonical string form in a single subcolumn read — matches
Snowflake's `v::STRING` fast path — so the typical case collapses to
one comparison.
`trait_eq` (positive) becomes:
- fast path: `toString(<sub>) = <str_lit>`
- bool branch: `<sub>.:Bool = <target>` (covers Python-repr "True" /
"False" coercions)
- float branch: `toFloat64OrNull(toString(<sub>)) = <float_lit>`
(covers floats whose toString integer-trims, e.g. 1.0 → '1')
`trait_in` collapses three typed branches into one `toString(<sub>) IN
(...)` gated on `.:Bool IS NULL AND .:Float64 IS NULL` to exclude bool
and float traits per engine semantics.
`NOT_EQUAL` keeps explicit per-type dispatch (engine wants cast-failure
→ False, which `.:Type IS NOT NULL AND ...` encodes directly).
Measured at 100M on the Cloud trial: simple 2.53s → 2.15s (-15%),
multi 4.61s → 4.06s (-12%), % split unchanged. Modest gain — CH's JSON
engine already skips empty subcolumns cheaply, so the OR-of-typed
shape wasn't 5× worse, just 15%. Worth keeping because the resulting
SQL is also shorter.
Engine parity unchanged (599 passed, 3 xfailed).
beep boop
- Drop a stale `bench` reference (`bench/` was an exploratory artefact, not part of the merged dialect). - In the Snowflake-vs-CH comparison table, replace the under-spec'd "Type discrimination" row to lead with the `toString(<sub>)` fast path (the actual main path through `trait_eq` / `trait_in`) with typed-variant subcolumns as the fallback shape. - Note in the "Hex chunk parse" row that CH reads raw bytes — no hex round-trip — to match what `parse_hex_chunk` actually emits. beep boop
4 tasks
Member
Author
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Closes #7.
Summary
New
ClickHouseDialectimplementing the fullDialectprotocol against ClickHouse's nativeJSONtype (24+, GA in 25.x). Trait reads are typed columnar subcolumn scans rather than per-row JSON parses, mirroring Snowflake VARIANT's cost shape.Engine parity: 599
engine-test-datasegments pass against the harness. Three xfails — same set Snowflake has (two semver-prerelease cases, one$.identity-shaped trait shadowing case).Performance assessment
Benched against the same
engine-test-datashape at 10M / 100M / 870M rows on a ClickHouse Cloud trial. Full numbers in the Notion RFC follow-up and the iteration history is captured in the commit log. Headline at 870M:IN + IS_SET)At 870M we're 4-8× behind Snowflake on trait predicates and ~54× on
% Split. The cost-shape conclusion in the RFC doesn't move — every read still fits under CH Cloud's 60s minimum-billing slot, even at 34s p95 — so this is fine to ship.Translator-level levers to close more of the gap are scoped in the RFC follow-up section and out of scope here.
% Splitis also pending an upstream ClickHouse contribution to vectoriseMD5: ClickHouse/ClickHouse#104887. Independent of this PR.Test plan
feat/clickhouse-dialectagainstinitial-scaffolddocker compose up --detach --wait clickhouse && make testlocally — 599 passed, 3 xfailedmake lint && make typechecklocally — clean