fix(Segment membership PoC): Identity backfill broken due to lack of JSON column support#7584
fix(Segment membership PoC): Identity backfill broken due to lack of JSON column support#7584khvn26 wants to merge 1 commit into
Conversation
Two issues stopped the daily IDENTITIES backfill + per-project refresh from ever populating SegmentMembershipCount on staging: 1. clickhouse-driver 0.2.10 has no column class for the JSON type, so `cursor.executemany(_INSERT_IDENTITIES_SQL, rows)` crashes with `KeyError: 'JSON'` while encoding the sample block returned by ClickHouse Cloud 25.12. Pinning to `Flagsmith/clickhouse-driver` branch `newjson` (PR #503 upstream) which adds JSON insert/select support unblocks the path with no other code changes. Switch back to a versioned release once that PR merges. 2. The staging migration task definition was missing `CLICKHOUSE_URL`, so `CLICKHOUSE_ENABLED` evaluated False during `python manage.py migrate`, the `clickhouse` DB alias never registered, and the `clickhouse/0001_create_identities` migration was silently skipped. The `IDENTITIES` table didn't exist on staging at all until I ran a one-off task last week. Wire the same Secrets Manager value the admin-api task def already uses. Smoke tested locally against staging CH 25.12.1: a Django shell run through `connections["clickhouse"].cursor()` with the real `map_identity_document_to_clickhouse_row` mapper round-trips JSON traits (incl. Decimal/bool coercion) and matches a `traits.<key>` predicate via the same `SELECT ... FROM IDENTITIES FINAL` shape the refresh task uses. beep boop
|
The latest updates on your projects. Learn more about Vercel for GitHub. 3 Skipped Deployments
|
There was a problem hiding this comment.
Code Review
This pull request adds clickhouse-driver as a dependency via a specific git branch and introduces the CLICKHOUSE_URL environment variable to the staging ECS task definition. Feedback indicates that the git-based dependency may necessitate a C compiler in the build environment and could affect performance. Furthermore, the secret ARN for CLICKHOUSE_URL is likely misconfigured, as it lacks the necessary key suffix and is inconsistent with other database URL secrets.
| name = "clickhouse-driver" | ||
| version = "0.2.10" | ||
| source = { registry = "https://pypi.org/simple" } | ||
| source = { git = "https://github.com/Flagsmith/clickhouse-driver?branch=newjson#11a9bee45efec2415c5987734778313ad318af4f" } |
There was a problem hiding this comment.
Switching clickhouse-driver to a git source removes the pre-built wheels from the lock file. This package contains C extensions, so the build environment (e.g., the Docker build stage) will now require a C compiler and Python development headers. If these are missing, the installation will fail. Additionally, this may lead to a performance regression if the package falls back to the pure-Python implementation due to a failed extension build.
There was a problem hiding this comment.
Bad bot; read the PR description before reviewing
| }, | ||
| { | ||
| "name": "CLICKHOUSE_URL", | ||
| "valueFrom": "arn:aws:secretsmanager:eu-west-2:302456015006:secret:clickhouse-url-ns26gC" |
There was a problem hiding this comment.
The valueFrom ARN for CLICKHOUSE_URL is inconsistent with the ANALYTICS_DATABASE_URL configuration and likely incorrect. While ANALYTICS_DATABASE_URL references a specific key within the shared ECS-API-heAdoB secret, this entry points to a different secret (clickhouse-url-ns26gC) and lacks the required :KEY:: syntax for JSON secrets. If clickhouse-url-ns26gC is a standard Key/Value secret, ECS will inject the entire JSON object as a string, causing the application to fail. Given the PR description states this should use the same value as admin-api, it should likely reference the same secret and key as ANALYTICS_DATABASE_URL.
| "valueFrom": "arn:aws:secretsmanager:eu-west-2:302456015006:secret:clickhouse-url-ns26gC" | |
| "valueFrom": "arn:aws:secretsmanager:eu-west-2:302456015006:secret:ECS-API-heAdoB:ANALYTICS_DATABASE_URL::" |
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #7584 +/- ##
=======================================
Coverage 98.50% 98.50%
=======================================
Files 1430 1430
Lines 54254 54254
=======================================
Hits 53444 53444
Misses 810 810 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Thanks for submitting a PR! Please check the boxes below:
docs/if required so people know about the feature.Changes
Contributes to #5663.
The daily
IDENTITIESbackfill + per-project refresh have never populatedSegmentMembershipCounton staging. Two issues are addressed hiere:clickhouse-driver==0.2.10has no column class for ClickHouse Cloud 25.12's first-classJSONtype, sosegment_membership.tasks.backfill_identities_to_clickhousecrashes withKeyError: 'JSON'while encoding the server's sample block.mymarilyn/clickhouse-driver#503 which adds JSON insert + select support. Will switch to a tagged release once that PR merges.ecs-task-definition-migration.jsonlackesCLICKHOUSE_URL— migration ran manually in staging, and the task definition is fixed here.How did you test this code?
Smoke tested the exact task code path against staging ClickHouse 25.12.1 from a local Python shell using the patched driver:
clickhouseDB alias pointing at the stagingCLICKHOUSE_URL(port 9440,?secure=true).map_identity_document_to_clickhouse_rowmapper from synthetic Dynamo-shaped docs carryingstr/Decimal/booltraits.clickhouse/0001_create_identities, rancursor.executemany(_INSERT_IDENTITIES_SQL, rows)throughconnections["clickhouse"].cursor()— the call that was previously failing withKeyError: 'JSON'.SELECT environment_id, count() FROM <table> FINAL WHERE environment_id IN %(envs)s AND (traits.plan = 'enterprise' AND traits.age >= 22) GROUP BY environment_id—traits.<key>subcolumn predicates resolved correctly.traitsreturned as a PythondictwithDecimalcoerced tointandboolpreserved.