[Track] Vortex output gaps: 3 datasets currently parquet-only #2

@mprammer

3 datasets in the catalog ship parquet only: each has convert.vortex = false because vortex-data 0.69 cannot currently round-trip its schema. This issue tracks the affected slugs and the upstream blockers so we can re-enable Vortex outputs as those blockers close.

To list the affected datasets locally, with reasons:

python -m scripts.pipeline.list_datasets --no-vortex --json

Each affected entry carries a convert.vortex_skip_reason string in sources.json.
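For readers without the pipeline checked out, here is a minimal sketch of the same filter. It assumes sources.json is a JSON object keyed by slug with a nested "convert" object; that layout, the sample slugs, and the helper name `vortex_skipped` are hypothetical and may not match the real file or the list_datasets implementation.

```python
import json

# ASSUMPTION: sources.json is a JSON object mapping each slug to an entry
# that holds a nested "convert" object. The real file layout may differ.
# The slugs and reason below are placeholders, not real catalog entries.
SAMPLE = json.loads("""
{
  "example-a": {"convert": {"vortex": true}},
  "example-b": {"convert": {"vortex": false,
                            "vortex_skip_reason": "placeholder reason"}}
}
""")


def vortex_skipped(sources):
    """Return {slug: skip_reason} for entries whose convert.vortex is false."""
    skipped = {}
    for slug, entry in sources.items():
        convert = entry.get("convert", {})
        # Treat a missing flag as enabled; only an explicit false is a skip.
        if convert.get("vortex", True) is False:
            skipped[slug] = convert.get("vortex_skip_reason", "")
    return skipped


if __name__ == "__main__":
    for slug, reason in sorted(vortex_skipped(SAMPLE).items()):
        print(f"{slug}: {reason}")
```

Treating a missing flag as enabled is a design choice of this sketch, not something the pipeline is known to do.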

Failure modes

1. FSST output offsets overflow i32 on large nested-string columns

Upstream: vortex-data/vortex#7833. fsst_compress builds its output buffer with VarBinBuilder<i32> regardless of input size, panicking once cumulative compressed bytes pass i32::MAX.

Affected slug:

  • code-contests: solutions.solution (list<string>) accumulates >2 GB of compressed UTF-8 across the file.
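The arithmetic behind this failure can be sketched in a few lines. This is an illustration of why cumulative byte offsets stop fitting in a signed 32-bit integer, not Vortex's actual code; the helper name `first_overflow` and the chunk sizes are made up for the example.

```python
I32_MAX = 2**31 - 1  # same value as Rust's i32::MAX (2,147,483,647)


def first_overflow(lengths, max_offset=I32_MAX):
    """Return the index of the first value whose end offset exceeds
    max_offset, or None if every end offset fits."""
    end = 0
    for i, n in enumerate(lengths):
        end += n  # end offset of value i = total bytes written so far
        if end > max_offset:
            return i
    return None


if __name__ == "__main__":
    # Three hypothetical 1 GiB values: the second one pushes the running
    # total to 2**31, one past what an i32 offset can hold.
    print(first_overflow([2**30] * 3))
```

No single value has to be large: the limit applies to the running total, which is why a column of many modest strings can still trip it.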

2. VARIANT-column conversion

Upstream: vortex-data/vortex#7717 — "Epic: Variant Type and Array" (proposed dtype + canonical type + zero-copy support for Arrow's Parquet-Variant encoding + a VariantGet expression).

Affected slug:

  • jsonbench-bluesky-100m — DuckDB VARIANT column

3. FixedSizeBinary(16) / UUID

Upstream: vortex-data/vortex#6854 — "Tracking Issue: Uuid Extension Type" (parent epic vortex-data/vortex#7683 — "Epic: Extension Types"). The Vortex Uuid extension type is specced to be backed by FixedSizeList(Primitive(U8), 16), which is the storage path our column needs.

Error: vortex.Array.from_arrow raises pyo3_runtime.PanicException on FSB(16) columns.

Affected slug:

  • wikipedia-structured-contents: event.identifier is a UUID stored as FixedSizeBinary(16)
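To make the storage equivalence concrete, the sketch below shows that a UUID's 16 raw bytes (what a FixedSizeBinary(16) cell holds) and a fixed-size list of 16 unsigned 8-bit values carry the same information. It uses only the Python standard library and does not call Vortex or Arrow; the helper names are hypothetical.

```python
import uuid


def uuid_to_u8_list(value):
    """Split a UUID into its 16 unsigned-byte values (each 0..255)."""
    return list(value.bytes)


def u8_list_to_uuid(values):
    """Rebuild a UUID from exactly 16 unsigned-byte values."""
    if len(values) != 16:
        raise ValueError(f"expected 16 bytes, got {len(values)}")
    return uuid.UUID(bytes=bytes(values))


if __name__ == "__main__":
    original = uuid.UUID("12345678-1234-5678-1234-567812345678")
    as_list = uuid_to_u8_list(original)
    # The conversion is lossless in both directions.
    assert u8_list_to_uuid(as_list) == original
    print(len(as_list))
```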

Re-enabling

When an upstream issue closes (or a Vortex release otherwise resolves the path), for each affected slug:

  1. Bump the vortex-data dependency to a release that includes the fix.
  2. Flip convert.vortex to true in sources.json for the slug and remove its convert.vortex_skip_reason.
  3. Re-run python -m scripts.pipeline.convert <slug> to produce the missing <slug>/vortex/<slug>.vortex (and <slug>/vortex-hydrated/<slug>.vortex if a hydrated parquet exists).
  4. Regenerate derived docs: python -m scripts.pipeline.docs.
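Step 2 above can be sketched as a small pure function. As before, this assumes sources.json is a JSON object keyed by slug with a nested "convert" object; that layout and the name `reenable_vortex` are hypothetical, so adapt it to the real file before use.

```python
import copy


def reenable_vortex(sources, slug):
    """Return a copy of `sources` with convert.vortex set to true for
    `slug` and its convert.vortex_skip_reason removed.

    ASSUMPTION: `sources` maps slug -> entry with a nested "convert"
    object. The real sources.json layout may differ.
    """
    updated = copy.deepcopy(sources)  # leave the caller's data untouched
    convert = updated[slug].setdefault("convert", {})
    convert["vortex"] = True
    convert.pop("vortex_skip_reason", None)  # no error if already absent
    return updated


if __name__ == "__main__":
    before = {
        "example-b": {
            "convert": {"vortex": False, "vortex_skip_reason": "placeholder"}
        }
    }
    after = reenable_vortex(before, "example-b")
    print(after["example-b"]["convert"])
```

An unknown slug raises KeyError rather than silently creating an entry, so a typo in the slug is caught before the file is rewritten.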
