Implement high-performance sorted writing support in IcebergIO #38406
atognolag wants to merge 19 commits into apache:master
Conversation
Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces native support for sorted writes in IcebergIO. By integrating a memory-safe, disk-spilling sorter, the implementation ensures that data written to sorted Iceberg tables is perfectly ordered, which significantly improves performance and reduces resource contention by minimizing concurrent file writers. The changes include a comprehensive byte-encoding protocol for lexicographical sorting and updated write-path logic to handle sorted tables efficiently.
Code Review
This pull request implements row sorting for Iceberg writes by introducing the IcebergRowSorter utility, which leverages BufferedExternalSorter to handle large datasets. The implementation supports complex Iceberg SortOrder configurations by encoding sort keys into byte arrays. Review feedback identified critical logic errors in the null-ordering byte prefix calculation for descending columns, which were also propagated into incorrect test assertions. Additionally, several performance optimizations were suggested to avoid expensive row conversions and redundant lookups within loops, and to reduce GC pressure from frequent buffer allocations. A potential reliability issue was also noted regarding the consumption of non-re-iterable collections when extracting schemas.
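The order-preserving key encoding under discussion can be illustrated with a small, self-contained sketch. This is my own simplified convention, not the PR's exact protocol: a single-byte null prefix (0x00 for nulls-first, 0xFF for nulls-last, 0x01 for non-null values), a sign-bit flip so signed integers compare correctly as unsigned bytes, and bitwise inversion of the value bytes for descending columns.

```java
public class SortKeyEncodingSketch {
    // Encodes one nullable int column into bytes whose unsigned lexicographic
    // order matches the requested sort direction and null placement.
    // Convention (illustrative, not the PR's exact mapping):
    //   null prefix: 0x00 when nulls-first, 0xFF when nulls-last
    //   non-null prefix: 0x01 (strictly between the two null prefixes)
    //   descending: value bytes are bit-inverted (~b); the prefix is untouched
    static byte[] encodeInt(Integer v, boolean descending, boolean nullsFirst) {
        if (v == null) {
            return new byte[] { nullsFirst ? (byte) 0x00 : (byte) 0xFF };
        }
        byte[] out = new byte[5];
        out[0] = 0x01; // non-null marker
        int u = v ^ 0x80000000; // flip the sign bit: signed order -> unsigned byte order
        for (int i = 0; i < 4; i++) {
            byte b = (byte) (u >>> (24 - 8 * i)); // big-endian value bytes
            out[i + 1] = descending ? (byte) ~b : b;
        }
        return out;
    }
}
```

Under `java.util.Arrays.compareUnsigned`, keys encoded this way sort in the requested order with no per-type comparator, which is what lets a generic byte sorter like `BufferedExternalSorter` do the work.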
…racting Schema from Coder
…Order().isSorted() for readability
…e its conditional state
… key encoding performance
@ahmedabu98 mind taking a look? Thanks!
ahmedabu98 left a comment
Left some comments.
I think more thought needs to be put into the architecture of sorted writes. This is a primitive approach that might work well for small writes, but will quickly degrade for medium-to-large writes.
Maybe take a look at how Flink or Spark do it (look for "range" distribution mode). IIRC they do a global sort before passing data to writers. This way, input iterables are already sorted so writers can break them up into files with tight min/max ranges.
```java
if (icebergRecord == null) {
  icebergRecord = IcebergUtils.beamRowToIcebergRecord(icebergSchema, row);
}
Object icebergVal = icebergRecord.getField(colName);
if (icebergVal != null) {
  val = field.transform().apply(icebergVal);
```
This is a pretty heavy way to extract the transformed value; it will make writes slow.
Can we apply the transform to the Beam field object directly? If not, maybe we need a "beam object to iceberg object" conversion. IcebergUtils.copyFieldIntoRecord does something similar, but we may need to refactor it a little to fit this use case.
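A minimal sketch of the caching pattern the quoted snippet is reaching for, with plain maps standing in for Beam `Row` and Iceberg `Record` and a hypothetical `convert()` in place of `IcebergUtils.beamRowToIcebergRecord`: the expensive conversion runs at most once per row, no matter how many sort columns need Iceberg-typed values.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class LazyConversionSketch {
    // Counts how many times the expensive conversion actually runs.
    static int conversions = 0;

    // Hypothetical stand-in for IcebergUtils.beamRowToIcebergRecord(...):
    // an expensive Beam-row-to-Iceberg-record conversion.
    static Map<String, Object> convert(Map<String, Object> beamRow) {
        conversions++;
        return new HashMap<>(beamRow);
    }

    // Extracts the sort-key values for one row, converting lazily and caching
    // the result so the conversion happens at most once per row.
    static List<Object> sortKeyValues(Map<String, Object> beamRow, List<String> sortColumns) {
        Map<String, Object> icebergRecord = null; // cached converted record
        List<Object> values = new ArrayList<>();
        for (String col : sortColumns) {
            if (icebergRecord == null) {
                icebergRecord = convert(beamRow); // convert on first need only
            }
            values.add(icebergRecord.get(col));
        }
        return values;
    }
}
```

A direct field-level conversion, as suggested above, would avoid even this single whole-row conversion for rows whose sort columns are all primitive.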
```java
try {
  for (Row row : rows) {
    byte[] keyBytes = encodeSortKey(row, sortOrder, columnNames, icebergSchema, beamSchema);
    ByteArrayOutputStream baos = new ByteArrayOutputStream();
```
Let's clear and reuse the output stream object instead of creating a new one each iteration.
```java
    org.apache.beam.sdk.schemas.Schema beamSchema)
    throws IOException {
  ByteArrayOutputStream baos = new ByteArrayOutputStream();
```
Same thing here: let's create the ByteArrayOutputStream in the caller function and pass it in. Each call should reset and reuse the same one for efficiency.
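A sketch of that reuse pattern (names hypothetical): allocate the stream once in the caller and `reset()` it per element, so the hot loop stops churning byte buffers.

```java
import java.io.ByteArrayOutputStream;

public class StreamReuseSketch {
    // Encodes every key with one shared ByteArrayOutputStream: reset() clears
    // the contents but keeps the grown internal buffer, avoiding a fresh
    // allocation per element.
    static byte[][] encodeAll(byte[][] keys) {
        ByteArrayOutputStream baos = new ByteArrayOutputStream(); // allocated once, in the caller's scope
        byte[][] encoded = new byte[keys.length][];
        for (int i = 0; i < keys.length; i++) {
            baos.reset(); // clear the previous key, reuse the buffer
            baos.write(keys[i], 0, keys[i].length); // stand-in for the real key encoding
            encoded[i] = baos.toByteArray(); // copies out just this key's bytes
        }
        return encoded;
    }
}
```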
```java
} else {
  writeString(val.toString(), baos, invert);
}
```
Does Iceberg support sorting on complex types (lists, maps, structs) ?
We should either 1) add support for that, or 2) throw an UnsupportedOperationException early on.
I'd rather throw an error than fall back on String because it may lead to unexpected sorting behavior for some types.
It is not supported. It now raises an UnsupportedOperationException.
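A sketch of that fail-fast check, using a hypothetical `TypeName` enum standing in for Beam's `Schema.TypeName`: complex types are rejected up front instead of silently falling back to `toString()`-based ordering.

```java
public class SortTypeCheckSketch {
    // Hypothetical type tags standing in for Beam Schema.TypeName values.
    enum TypeName { STRING, INT64, ARRAY, MAP, ROW }

    // Rejects complex types before any encoding happens, so unsupported sort
    // columns fail loudly rather than sorting in a surprising order.
    static void validateSortableType(String column, TypeName type) {
        switch (type) {
            case ARRAY:
            case MAP:
            case ROW:
                throw new UnsupportedOperationException(
                    "Sorting on complex type " + type + " (column '" + column + "') is not supported");
            default:
                break; // primitive-like types are sortable
        }
    }
}
```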
```java
Iterable<Row> sortedOrUnsortedRows =
    IcebergRowSorter.sortRows(
        element.getValue(), table.sortOrder(), table.schema(), dataSchema);
for (Row row : sortedOrUnsortedRows) {
```
We should only be sorting if the user asked us to.
IcebergRowSorter.sortRows() only sorts if the table schema defines a sort configuration; otherwise it does nothing.
…xpose NONE, HASH, and RANGE modes, optimize stream reuse, and add comprehensive sorting tests
…ively document distribution modes
…n mode with custom sharding function
… auto-sharding overlap limitations
I have updated this PR to address your feedback, Ahmed, and to optimize the sorting implementation. Below is a summary of the changes:

1. Configurable Distribution Modes
2. Performance Optimizations (GC & CPU Overhead Reductions)
3. Sorter Correctness & Safety
4. Verification & Test Coverage
fbad2a0 to acf0e56
…value-level toString conversion
…o String conversions in IcebergUtilsTest.java
…ring conversions in IcebergUtilsTest.java
… in WritePartitionedRowsToFiles.java
Pull Request: High-Performance Sorted Writing Support in Native IcebergIO
Description
This pull request implements robust, high-performance sorted writing support in native IcebergIO (sdks/java/io/iceberg).

When writing to sorted Iceberg tables, unsorted data causes massive performance degradation, high memory overhead, and "file thrashing" due to too many concurrent file writers being kept open on workers. By dynamically pre-sorting incoming PCollection<Row> elements based on the active Iceberg table SortOrder inside the write transform, we produce perfectly ordered Parquet files, optimize worker resources, and reduce the number of concurrent file handles.

Core Technical Implementation
1. Memory-Safe Spill-to-Disk Sorter (IcebergRowSorter.java)

- Uses sdks:java:extensions:sorter (via BufferedExternalSorter) to spill sorted data to disk, preventing out-of-memory (OOM) crashes.

2. Dynamic Unsigned Lexicographical Byte Encoding
Sort keys are encoded as byte[] for lexicographical comparison:

- Descending columns: inverts the bits of each value byte (~byte) to reverse the unsigned byte comparison order naturally.
- Null ordering (NULLS_FIRST & NULLS_LAST): prefix headers are mapped statically to direct standard unsigned comparators correctly: ASC NULLS_FIRST -> 0x00 / ASC NULLS_LAST -> 0xFF; DESC NULLS_FIRST -> 0xFF / DESC NULLS_LAST -> 0x00.
- Byte-stuffing escape (0x00 -> [0x01, 0x01], 0x01 -> [0x01, 0x02]) terminated by a safe 0x00 byte. This prevents column boundary bleeding on composite keys (e.g. "abc"+"def" vs "abcdef"+null).
- Handles ReadableInstant, java.time.Instant, and java.util.Date conversions, preventing runner-specific casting crashes.

3. Shard-Routing and Write-Path Bypasses
- In WriteUngroupedRowsToFiles: if the table has an active SortOrder, it skips direct ungrouped writing and spills elements to the grouped, shuffled path, where they are properly partitioned and pre-sorted before writing.

Verification and Test Coverage
1. Expanded Unit Tests (IcebergRowSorterTest.java)

2. End-to-End Pipeline Integration Tests (IcebergIOWriteTest.java)

- Adds an assertFilesAreInternallySorted verification helper which parses individual Parquet files committed to the table directly using Iceberg scan APIs, ensuring that each written file is perfectly sorted internally, regardless of the runner's sharding factor.
- Covers the NONE, HASH, and HASH_WITH_AUTOSHARDING distribution modes.
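The byte-stuffing escape described above (0x00 -> [0x01, 0x01], 0x01 -> [0x01, 0x02], with a 0x00 column terminator) can be sketched as follows; class and method names are illustrative, not the PR's actual code.

```java
import java.io.ByteArrayOutputStream;

// Sketch of the escape described in the PR: within a column's value bytes,
// 0x00 -> [0x01, 0x01] and 0x01 -> [0x01, 0x02], followed by a 0x00 column
// terminator, so no value byte can collide with the separator and composite
// keys like "abc"+"def" vs "abcdef"+"" stay distinguishable.
public class ByteStuffingSketch {
    static void writeEscaped(byte[] value, ByteArrayOutputStream out) {
        for (byte b : value) {
            if (b == 0x00) { out.write(0x01); out.write(0x01); }
            else if (b == 0x01) { out.write(0x01); out.write(0x02); }
            else { out.write(b); }
        }
        out.write(0x00); // terminator: strictly below any escaped value byte
    }

    // Concatenates several escaped columns into one composite key.
    static byte[] concat(byte[][] columns) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (byte[] c : columns) {
            writeEscaped(c, out);
        }
        return out.toByteArray();
    }
}
```

Without the terminator, the composite keys for ("abc", "def") and ("abcdef", "") would be byte-identical; with it, the shorter first column compares lower at the boundary, which is the "column boundary bleeding" the PR description refers to.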