feat: Implement Image Duplicate Checker #6675
notsafeforgit wants to merge 19 commits into stashapp:develop from
Conversation
I'm getting build errors when trying to test.
Gykes left a comment
Just an initial static check. Once the build issues are fixed I can do another, more detailed, review.
UI review: The initial page in Settings -> Tools looks okay. The only issue here is that the casing of all the settings is bad, but I know that's more of @DogmaDragon's decision on how they want it, as I've seen a few PRs from them correcting issues like this.

Inside the image checker is where there are issues. It's white and looks nothing like the scene one. Is there a reason it looks so different from the scenes variation? There are also a lot of missing features: search accuracy, search options, things of that nature.

The final issue is that when I click search I get a 400 error. I do have several images in my dev environment, but none have hashes. If the images do not have hashes, that should be handled appropriately, not with a 400 error.

If you are using AI to build this, which I am not against, please make sure you are testing these features before PRing and requesting reviews from people. Both of the major issues I have pointed out could have been caught by starting the app, seeing it fail, and then testing the feature a little. Thank you for the work, and I am 100% for this feature, but I believe it needs more polish and work before I could give it an approval.
To be fair, the UI of the scene checker is not exactly great either, so I'm open to seeing changes if they make sense. But to your point, the color scheme is off, and a lot of features that the scene checker currently has are missing.
Agreed, I'm definitely open to trying new things if there was an intent. This looks more like an oversight rather than a deliberate attempt to refactor the UI into a better UX. If OP wants to modify the UX for this, then I am happy to have it done and potentially port it over to scenes.
Force-pushed from 41ae7cb to 27ab865
All UI issues mentioned should be fixed now; I have attached a screenshot. Polished it up quite a bit, and hopefully it's closer to meeting the standards for Stash now!
This change introduces a new tool to identify duplicate images based on their perceptual hash (phash). It includes:
- Backend implementation for phash distance comparison and grouping.
- GraphQL schema updates and API resolvers.
- Frontend UI for the Image Duplicate Checker tool.
- Unit tests for the image search and duplicate detection logic.
This change unifies the duplicate detection logic by leveraging the shared phash utility. It also enhances the UI with:
- Pagination for large result sets.
- Sorting duplicate groups by total file size.
- A more detailed table view with image thumbnails, paths, and dimensions.
- Consistency with the existing Scene Duplicate Checker tool.
This adds checkboxes to select duplicate images and integrates the existing EditImagesDialog and DeleteImagesDialog, allowing users to resolve duplicates directly from the tool.
…pository
- Removed unused `strconv` import from `pkg/sqlite/image.go`.
- Added the missing `github.com/stashapp/stash/pkg/utils` import to resolve the undefined `utils` reference.
- Fixed the pagination prop in the ImageDuplicateChecker component.
- Formatted modified Go files using gofmt.
- Ran prettier over the UI codebase to resolve the formatting-check CI failure.
- Wrap the FindDuplicateImages query in r.withReadTxn() to ensure a database transaction is in context.
- Use queryFunc instead of queryStruct for fetching multiple hashes, preventing runtime errors.
- Fix an N+1 query issue in duplicate grouping by using qb.FindMany() instead of calling qb.Find() for each duplicate image.
- Revert the searchColumns array to exclude "images.details", which was from another PR, and remove the related failing test.
- Fixes the 400 error in ImageDuplicateChecker.
- Updates UI and frontend types.
- Fixes Tools casing.
This fixes a bug where identical image duplicates were not being detected. The implementation was incorrectly scanning the phash BLOB into a string and then attempting to parse it as a hex string. Since phashes are stored as 64-bit integers, they were being converted to decimal strings. For phashes with the MSB set (negative when treated as int64), the resulting decimal string started with a '-', which caused the hex parser to fail and skip the image entirely. Additionally, even for non-negative phashes, parsing a decimal string as hex yielded incorrect hash values. Scanning directly into the utils.Phash struct (which uses int64) matches how Scene phashes are handled and ensures the hash values are correct.
… detection

This change adds a specialized SQL query to find exact image duplicate matches (distance 0) directly in the database. Previously, the image duplicate checker always used an O(N^2) Go-based comparison loop, which caused indefinite loading and timeouts on libraries with a large number of images. The new SQL fast path reduces the time to find exact duplicates from minutes or hours to milliseconds.
This update provides significant performance improvements for both image and scene duplicate searching:
1. Optimized the core Hamming distance algorithm in pkg/utils/phash.go:
   - Uses native CPU popcount instructions (math/bits) for bit counting.
   - Pre-calculates hash values to eliminate object allocations in the hot loop.
   - Halves the number of comparisons by leveraging the symmetry of the Hamming distance.
   - The loop is now several orders of magnitude faster and allocation-free.
2. Solved the N+1 database query bottleneck:
   - Replaced individual database lookups for each duplicate group with a single batched query for all duplicate IDs.
   - This optimization was applied to both the Image and Scene repositories.
3. Simplified the SQL fast path for exact image matches to remove redundant table joins.
This update provides additional performance improvements specifically targeted at large image libraries (e.g. 300k+ images):
1. Optimized the exact-match SQL query for images:
   - Added filtering for zero/empty fingerprints to avoid massive false-positive groups.
   - Added a LIMIT of 1000 duplicate groups to prevent excessive memory consumption and serialization overhead.
   - Simplified the join structure to ensure better use of the database index.
2. Parallelized the Go comparison loop in pkg/utils/phash.go:
   - Utilizes all available CPU cores to perform Hamming distance calculations.
   - Uses a lock-free design to minimize synchronization overhead.
   - This makes non-zero-distance searches significantly faster on multi-core systems.
…tion

This update resolves major performance regressions when processing large libraries:
1. Optimized FindMany in both the Image and Scene stores to use map-based ID lookups. Previously, this function used slices.Index in a loop, resulting in O(N^2) complexity. On a library with 300k items, this caused the server to hang indefinitely.
2. Refined the exact image duplicate SQL query to match the scene checker's level of optimization. It now joins the files table and orders results by total duplicate file size, ensuring that the most impactful duplicates are shown first.
3. Removed the temporary LIMIT 1000 from the image duplicate query now that the algorithmic bottlenecks have been resolved.
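The map-based lookup in point 1 follows a standard pattern; a sketch with hypothetical names (the real FindMany works against database rows):

```go
package main

import "fmt"

type image struct {
	id    int
	phash uint64
}

// findManyOrdered returns rows in the order the IDs were requested,
// using a map for O(1) position lookups. The previous approach called
// slices.Index once per row, making the whole loop O(N^2).
func findManyOrdered(ids []int, rows []image) []image {
	pos := make(map[int]int, len(ids)) // id -> requested position
	for i, id := range ids {
		pos[id] = i
	}
	out := make([]image, len(ids))
	for _, row := range rows {
		if i, ok := pos[row.id]; ok {
			out[i] = row
		}
	}
	return out
}

func main() {
	rows := []image{{1, 0xA}, {2, 0xB}, {3, 0xC}}
	out := findManyOrdered([]int{3, 1}, rows)
	fmt.Println(out[0].id, out[1].id) // 3 1
}
```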
This fixes a severe performance bottleneck where the image duplicate checker would hang indefinitely or crash the server when finding many duplicates. Previously, the GraphQL query requested the full 'ImageData' fragment for every duplicate found, forcing the backend to resolve and serialize all related entities (galleries, studios, tags, performers) for thousands of images at once. By switching to the 'SlimImageData' fragment (mirroring how the Scene duplicate checker operates), the payload size and resolution time are drastically reduced, allowing the tool to scale correctly.
This fixes an issue where Chrome would become unresponsive and prompt the user to kill the page when a large number of duplicates (e.g. 30,000+ groups) were found:
1. Changed the fetchPolicy on FindDuplicateImages to 'no-cache'. Loading 30k+ complex objects into the Apollo normalized cache blocked the main thread for an extended period. Bypassing the cache for this massive one-off query resolves the blocking.
2. Optimized the sorting algorithm in both the Image and Scene duplicate checkers. Previously, the group size was recalculated by iterating over all nested files inside the sort's comparison function, resulting in millions of unnecessary iterations (O(N log N) with a heavy inner loop). Now, group sizes are precalculated into a map (O(N)) before sorting.
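The precalculation in point 2 is a frontend (TypeScript) change, but the pattern is language-agnostic; a Go sketch with hypothetical types:

```go
package main

import (
	"fmt"
	"sort"
)

type fileInfo struct{ size int64 }

type group struct {
	id    int
	files []fileInfo
}

// sortGroupsBySize precalculates each group's total file size once
// (O(N) over all files) instead of re-summing nested files inside the
// sort comparator, which would run the inner loop O(N log N) times.
func sortGroupsBySize(groups []group) {
	totals := make(map[int]int64, len(groups))
	for _, g := range groups {
		var t int64
		for _, f := range g.files {
			t += f.size
		}
		totals[g.id] = t
	}
	sort.Slice(groups, func(i, j int) bool {
		return totals[groups[i].id] > totals[groups[j].id] // largest first
	})
}

func main() {
	gs := []group{
		{1, []fileInfo{{100}}},
		{2, []fileInfo{{300}, {50}}},
	}
	sortGroupsBySize(gs)
	fmt.Println(gs[0].id) // 2
}
```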
… check

Renamed the dropdown options in the duplicate checkers to be much clearer about their behavior (e.g. 'Keep the largest file'). Also fixed a bug in the Image Duplicate Checker where 'Select highest resolution' would fail or do nothing: checkSameResolution was incorrectly accessing array index [0] on visual_files instead of finding the max resolution across all files, causing it to incorrectly abort the selection.
Force-pushed from f6cbe86 to 9b6a861



This pull request introduces a new tool to identify duplicate images based on their perceptual hash (phash). The implementation is based directly on the one used for video pHashes and leverages a shared utility to ensure consistency and performance across the codebase.
It includes:
- Backend implementation for phash distance comparison and grouping.
- GraphQL schema updates and API resolvers.
- Frontend UI for the Image Duplicate Checker tool.
- Unit tests for the image search and duplicate detection logic.
Example screenshot:
