Add batch DELETE/UPDATE samples for datasets exceeding 3k row limit #698

Open
rmconstantin wants to merge 9 commits into
aws-samples:mainfrom
rmconstantin:batch-operations

@rmconstantin
Contributor

Demonstrates sequential and parallel batch processing patterns for Aurora DSQL with OCC retry logic and recommended connection management. Includes Python (psycopg2), Java (pgJDBC), and Node.js (node-postgres) implementations.
Fixes #693.
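
For readers skimming the thread, the sequential pattern described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the PR's actual code: the function name batch_delete, the id key column, and the subquery-with-LIMIT form are all hypothetical, and conn stands for any DB-API connection (for example one from psycopg2) with autocommit off. OCC retry handling is left out here to keep the loop itself visible.

```python
def batch_delete(conn, table, condition, batch_size=1000):
    """Delete matching rows in batches, committing one transaction per batch.

    Assumptions (not taken from the PR): the table has an `id` key column,
    and `table` / `condition` are trusted strings, never user input.
    Returns the total number of rows deleted.
    """
    sql = (
        f"DELETE FROM {table} WHERE id IN "
        f"(SELECT id FROM {table} WHERE {condition} LIMIT %s)"
    )
    total = 0
    while True:
        with conn.cursor() as cur:
            cur.execute(sql, (batch_size,))
            deleted = cur.rowcount
        # Each batch is its own transaction, so no single transaction
        # touches more rows than batch_size.
        conn.commit()
        total += deleted
        if deleted == 0:
            # Nothing matched any more: all batches are done.
            return total
```

Keeping batch_size below the per-transaction row limit named in the PR title is the point of the loop; the default of 1000 mirrors the --batch-size default shown in the Java usage text.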

By submitting this pull request, I confirm that my contribution is made under
the terms of the MIT-0 license.

Thank you for your contribution!

Contributor

Can you add the __pycache__ path to .gitignore?

while (true) {
try (Connection conn = pool.getConnection()) {
conn.setAutoCommit(false);
String sql = "UPDATE " + table + " SET " + setClause + ", updated_at = NOW()"
Contributor

How does this ensure progress over all of the source items?

* gradle run --args="--endpoint <cluster-endpoint> [--user admin]
* [--batch-size 1000] [--num-workers 4]"
*/
public class Main {
Contributor

Could you add an integ test that runs these batch ops?

);

-- Create an asynchronous index on the category column.
-- Aurora DSQL requires CREATE INDEX ASYNC for tables with existing rows.
Contributor

This applies to all tables, not just those with existing rows. Maybe delete this comment?

Comment thread batch-operations/README.md Outdated
@@ -0,0 +1,52 @@
# Aurora DSQL Batch Operations
Contributor

I think we might be better off organizing these examples under the specific language/driver pairing instead of having them in a top-level directory.

Can we also add integ tests for each example? There should be existing patterns for how to do that in each language.

* @param connection a JDBC connection (autoCommit should be false)
* @param operation the database operation to execute
* @param maxRetries maximum retry attempts (default 3)
* @param baseDelay base delay in seconds for backoff (default 0.1)
Contributor

Nit: can we make baseDelay milliseconds instead?
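
To make the suggestion concrete, here is a rough sketch of a retry helper whose base delay is expressed in milliseconds. It is an illustration only, written in Python rather than the Java under review: the names execute_with_retry and backoff_delay_ms, the jitter factor, and the injectable sleep parameter are hypothetical. The one external fact it relies on is that OCC conflicts surface as SQLSTATE 40001 (serialization failure), which psycopg2 exposes on the exception's pgcode attribute.

```python
import random
import time

OCC_SQLSTATE = "40001"  # serialization failure, reported on OCC conflicts


def backoff_delay_ms(attempt, base_delay_ms=100, jitter=0.25):
    """Delay before retrying after 0-based `attempt`: base * 2^attempt plus jitter."""
    delay = base_delay_ms * (2 ** attempt)
    return delay + random.uniform(0, delay * jitter)


def execute_with_retry(conn, operation, max_retries=3, base_delay_ms=100,
                       sleep=time.sleep):
    """Run operation(conn) and commit, retrying only on OCC conflicts.

    `sleep` is injectable so the helper can be exercised without waiting.
    Any non-OCC error, or an OCC error after max_retries, is re-raised.
    """
    for attempt in range(max_retries + 1):
        try:
            result = operation(conn)
            conn.commit()
            return result
        except Exception as exc:
            conn.rollback()
            is_occ = getattr(exc, "pgcode", None) == OCC_SQLSTATE
            if not is_occ or attempt == max_retries:
                raise
            # The only unit conversion happens here, at the sleep call.
            sleep(backoff_delay_ms(attempt, base_delay_ms) / 1000.0)
```

With the defaults, the default of 100 ms corresponds to the 0.1 second default in the Javadoc above, and integer milliseconds avoid fractional-second arguments at call sites.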

*/
public class Repopulate {

private static final String INSERT_SQL =
Contributor

What's going on with the repopulate fn vs the batch setup script?

@rmconstantin
Contributor Author

Updated the code to address all comments.

  • Batch operations are now in standalone directories under each language (java/batch_operations/, javascript/batch_operations/, python/batch_operations/).
  • baseDelay is now base_delay_ms.
  • Initial table+index setup script comments updated.
  • Got rid of the Repopulate fn.
  • Integ tests added for each language.
  • Added an outer retry to make sure all rows are processed (keep batching until done, and if OCC conflicts persist on a single batch, get a fresh connection and try again).
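
The outer retry described in the last bullet might look roughly like the sketch below. This is an assumption-laden illustration, not the code in the PR: process_all_batches, get_connection, run_batch, and count_remaining are hypothetical names, run_batch is assumed to raise an error carrying SQLSTATE 40001 once its own inner OCC retries are exhausted, and the final count check stands in for the SELECT COUNT(*) post-check mentioned in the commit messages.

```python
def process_all_batches(get_connection, run_batch, count_remaining,
                        max_reconnects=5):
    """Keep running batches until none remain; reconnect on persistent OCC.

    get_connection()      -> a fresh connection (hypothetical pool accessor)
    run_batch(conn)       -> rows affected by one batch; raises when its
                             inner OCC retries are exhausted
    count_remaining(conn) -> number of rows still matching the condition
    """
    total = 0
    reconnects = 0
    conn = get_connection()
    try:
        while True:
            try:
                affected = run_batch(conn)
            except Exception as exc:
                persistent_occ = getattr(exc, "pgcode", None) == "40001"
                if not persistent_occ or reconnects >= max_reconnects:
                    raise
                # OCC conflicts persisted on this batch: swap in a fresh
                # connection and try the same batch again.
                reconnects += 1
                conn.close()
                conn = get_connection()
                continue
            reconnects = 0  # the budget applies per batch
            total += affected
            if affected == 0:
                # Post-check: confirm that nothing matching is left behind.
                remaining = count_remaining(conn)
                if remaining != 0:
                    raise RuntimeError(f"{remaining} rows left unprocessed")
                return total
    finally:
        conn.close()
```

The loop exits only when a batch affects zero rows and the count check agrees, which is one way to answer the earlier question about ensuring progress over all source items.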

Ready for another look.

Contributor

What's this jar file for? Should we be shipping it?

Contributor Author

The gradle-wrapper.jar is the Gradle Wrapper bootstrap JAR. It allows anyone to build the project without having Gradle pre-installed — the wrapper downloads the correct Gradle version automatically. Shipping it in version control is the recommended Gradle convention. The other Java projects in this repo (java/pgjdbc, java/spring_boot) follow the same pattern.

I also updated java/.gitignore to stop ignoring these wrapper files (previously they were gitignored but force-tracked, which was inconsistent).

Contributor

Should this and gradlew be checked in or gitignored?

Contributor Author

Yes, gradlew and gradlew.bat should be checked in. They are the Gradle Wrapper scripts (Unix and Windows respectively) that bootstrap the build — users run ./gradlew build instead of installing Gradle globally. This is the standard Gradle convention and matches java/pgjdbc and java/spring_boot in this repo.

I cleaned up java/.gitignore in the latest commit to remove the **/gradlew, **/gradlew.bat, and **/gradle/ patterns that were incorrectly ignoring these files.

rmconstantin requested a review from Benjscho on May 14, 2026 19:05
ralconst added 9 commits May 14, 2026 12:19
Demonstrates sequential and parallel batch processing patterns for Aurora DSQL
with OCC retry logic and hashtext() partitioning. Includes Python (psycopg2),
Java (pgJDBC), and Node.js (node-postgres) implementations.
- Add SELECT COUNT(*) post-check after each batch loop to verify all
  matching rows were processed (sequential and parallel, all 3 languages)
- Update integration tests to seed data via psql -f batch_test_setup.sql
- Add connect_timeout to Python pool creation for IPv6 fallback
The gradle wrapper (gradle-wrapper.jar, gradlew, gradlew.bat) should be
committed to version control per Gradle convention. This allows anyone to
build the project without pre-installing Gradle. Consistent with existing
java/pgjdbc and java/spring_boot projects in this repo.

Removed **/gradle/, **/gradlew, and **/gradlew.bat from .gitignore.
The .gradle/ (build cache) pattern remains correctly ignored.
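
The hashtext() partitioning named in the first commit message can be illustrated with a small helper that builds a per-worker SQL predicate. This is a sketch under assumptions, not the PR's code: partition_predicate and the id key column are hypothetical, and the double mod() form is chosen here only so that negative hash values still map into the range 0 to num_workers - 1.

```python
def partition_predicate(worker_id, num_workers, key_column="id"):
    """Return a SQL predicate selecting the rows owned by one worker.

    hashtext() can return negative integers, so the inner result is
    shifted by num_workers and reduced again to land in [0, num_workers).
    `key_column` must be a trusted identifier, never user input.
    """
    n = int(num_workers)
    w = int(worker_id)
    if n <= 0 or not 0 <= w < n:
        raise ValueError("worker_id must satisfy 0 <= worker_id < num_workers")
    return f"mod(mod(hashtext({key_column}::text), {n}) + {n}, {n}) = {w}"


def worker_predicates(num_workers, key_column="id"):
    """One predicate per worker; together they cover every row exactly once."""
    return [partition_predicate(w, num_workers, key_column)
            for w in range(num_workers)]
```

Each parallel worker would append its predicate to the batch condition, so workers never compete for the same rows and OCC conflicts between them become less likely. Using mod() rather than the % operator also sidesteps the need to escape % when the query is passed to psycopg2 together with bound parameters.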
@rmconstantin
Contributor Author

Rebased on main — conflicts resolved.

CI has two failures:

  1. deps-review — Transitive npm dependencies from Jest (e.g. color-name, co, ci-info) score below the repo's OpenSSF Scorecard threshold of 3. These are all standard, widely-used packages pulled in by Jest. Is there an allow-list or policy exception for test dependencies?

  2. javascript-node-postgres / create-cluster — Expected failure since cluster creation requires maintainer AWS credentials.

Ready for another look when you get a chance.

Development

Successfully merging this pull request may close these issues.

Add batch DELETE/UPDATE code samples for large datasets

3 participants