feat: Implement Image Duplicate Checker #6675

Open
notsafeforgit wants to merge 19 commits into stashapp:develop from notsafeforgit:feat/image-duplicate-checker

Conversation

@notsafeforgit
Contributor

@notsafeforgit notsafeforgit commented Mar 13, 2026

This pull request introduces a new tool to identify duplicate images based on their perceptual hash (phash). The implementation is based directly on the one used for video pHashes and leverages a shared utility to ensure consistency and performance across the codebase.

It includes:

  • Unified backend implementation for phash distance comparison and grouping.
  • GraphQL schema updates and API resolvers.
  • Enhanced frontend UI with pagination, group sorting, and detailed image previews.
  • Unit tests for the image search and duplicate detection logic.

Example screenshot:
download

@Gykes
Collaborator

Gykes commented Mar 13, 2026

I'm getting build errors when trying to test.

gykes@Mac-Air stash % make server-start
cd .local && go run -v -tags " sqlite_stat4 sqlite_math_functions"  -ldflags " -X 'github.com/stashapp/stash/internal/build.buildstamp=2026-03-13 16:05:16' -X 'github.com/stashapp/stash/internal/build.githash=af75c8c1b' -X 'github.com/stashapp/stash/internal/build.version=v0.30.1-122-gaf75c8c1b' -X 'github.com/stashapp/stash/internal/build.officialBuild=false'" ../cmd/stash
github.com/stashapp/stash/pkg/sqlite
# github.com/stashapp/stash/pkg/sqlite
../pkg/sqlite/image.go:10:2: "strconv" imported and not used
../pkg/sqlite/image.go:1106:16: undefined: utils
../pkg/sqlite/image.go:1116:13: undefined: utils
make: *** [server-start] Error 1

Collaborator

@Gykes Gykes left a comment

Just an initial static check. Once the build issues are fixed, I can do another, more detailed review.

@Gykes
Collaborator

Gykes commented Mar 14, 2026

UI review:

The initial page in Settings -> Tools looks okay. The only issue here is that the casing of the settings is inconsistent, but I know that's more of @DogmaDragon's decision on how they want it, as I've seen a few PRs from them correcting issues like this.
Screenshot 2026-03-14 at 11 14 32

Inside the image checker is where there are issues. It's white and looks nothing like the scene one. Is there a reason it looks so different from the scenes variation? There are also a lot of missing features. Search Accuracy, Search options, things of that nature.
Screenshot 2026-03-14 at 11 14 39

The final issue is that when I click search I get a 400 error. I do have several images in my dev environment, but none have hashes. If the images do not have hashes, that case should be handled gracefully rather than returning a 400 error.
Screenshot 2026-03-14 at 11 18 22

If you are using AI to build this, which I am not against, please make sure you are testing these features before PRing and requesting reviews from people. Both of the major issues I have pointed out could have been solved by you by just starting the app, seeing it fail, then testing the feature a little.

Thank you for the work, and I am 100% for this feature, but I believe it needs more polish and work before I could give it an approval.

@sleetx

sleetx commented Mar 14, 2026

Inside the image checker is where there are issues. It's white and looks nothing like the scene one. Is there a reason it looks so different from the scenes variation? There are also a lot of missing features. Search Accuracy, Search options, things of that nature.

To be fair, the UI of the scene checker is not exactly great either, so I'm open to seeing changes if they make sense. But to your point, the color scheme is off and there are a lot of missing features that the scene checker currently has.

@Gykes
Collaborator

Gykes commented Mar 14, 2026

To be fair, the UI of the scene checker is not exactly great either, so I'm open to seeing changes if they make sense. But to your point, the color scheme is off and there are a lot of missing features that the scene checker currently has.

Agreed, I'm definitely open to trying new things if there was an intent. This looks more like an oversight than a deliberate attempt to refactor the UI for a better UX. If OP wants to modify the UX for this, then I am happy to have it done and potentially port it over to scenes.

@notsafeforgit notsafeforgit force-pushed the feat/image-duplicate-checker branch from 41ae7cb to 27ab865 Compare March 16, 2026 11:15
@notsafeforgit
Contributor Author

All UI issues mentioned should be fixed now, I have attached a screenshot. Polished it up quite a bit and hopefully it's closer to meeting the standards for stash now!

@DogmaDragon DogmaDragon removed their request for review March 17, 2026 12:48
@Gykes Gykes added the deferred label (Good feature that can be looked at for a later release.) Mar 17, 2026
notsafeforgit and others added 18 commits March 27, 2026 04:12
This change introduces a new tool to identify duplicate images based on their perceptual hash (phash). It includes:
- Backend implementation for phash distance comparison and grouping.
- GraphQL schema updates and API resolvers.
- Frontend UI for the Image Duplicate Checker tool.
- Unit tests for the image search and duplicate detection logic.
This change unifies the duplicate detection logic by leveraging the shared phash utility. It also enhances the UI with:
- Pagination for large result sets.
- Sorting duplicate groups by total file size.
- A more detailed table view with image thumbnails, paths, and dimensions.
- Consistency with the existing Scene Duplicate Checker tool.
This adds checkboxes to select duplicate images and integrates the existing EditImagesDialog and DeleteImagesDialog, allowing users to resolve duplicates directly from the tool.
…pository

- Removed unused `strconv` import from `pkg/sqlite/image.go`.
- Added missing `github.com/stashapp/stash/pkg/utils` import to resolve the undefined `utils` reference.
- Fixed pagination prop in ImageDuplicateChecker component.
- Formatted modified go files using gofmt.
- Ran prettier over the UI codebase to resolve the formatting check CI failure.
- Wrap FindDuplicateImages query in r.withReadTxn() to ensure a database transaction in context.
- Use queryFunc instead of queryStruct for fetching multiple hashes, preventing runtime errors.
- Fix N+1 query issue in duplicate grouping by using qb.FindMany() instead of qb.Find() for each duplicate image.
- Revert searchColumns array to exclude "images.details" which was from another PR and remove related failing test.
- Fixes 400 error in ImageDuplicateChecker

- Updates UI and frontend types

- Fixes tools casing
This fixes a bug where identical image duplicates were not being detected.

The implementation was incorrectly scanning the phash BLOB into a string and then attempting to parse it as a hex string. Since phashes are stored as 64-bit integers, they were being converted to decimal strings. For phashes with the MSB set (negative when treated as int64), the resulting decimal string started with a '-', which caused the hex parser to fail and skip the image entirely.

Additionally, even for non-negative phashes, parsing a decimal string as hex yielded incorrect hash values.

Scanning directly into the utils.Phash struct (which uses int64) matches how Scene phashes are handled and ensures the hash values are correct.
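The failure mode described in this commit message can be reproduced in isolation. The following is a minimal sketch, not the PR's code: `decimalString` and `parseHexPhash` are hypothetical helpers, and it assumes the scanned string is simply the base-10 rendering of the stored int64, as the message states.

```go
package main

import (
	"fmt"
	"strconv"
)

// decimalString mimics scanning a stored 64-bit integer into a string:
// the value is rendered in base 10 (assumption taken from the commit message).
func decimalString(stored int64) string {
	return strconv.FormatInt(stored, 10)
}

// parseHexPhash mimics the buggy step: it treats the string as hex digits.
func parseHexPhash(s string) (uint64, error) {
	return strconv.ParseUint(s, 16, 64)
}

// describe reports what the buggy path does with one stored value.
func describe(stored int64) string {
	got, err := parseHexPhash(decimalString(stored))
	if err != nil {
		// A leading '-' is not a hex digit, so the image would be skipped.
		return fmt.Sprintf("%d: skipped (%v)", stored, err)
	}
	// Decimal digits are valid hex digits, so the parse "succeeds" with a wrong value.
	return fmt.Sprintf("%d: parsed as %d", stored, got)
}
```

For example, a stored value of 255 becomes the string "255", which parses as hex 0x255 (597), while any negative value fails to parse at all. Scanning into an integer-typed field avoids both cases.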
… detection

This change adds a specialized SQL query to find exact image duplicate matches (distance 0) directly in the database.

Previously, the image duplicate checker always used an O(N^2) Go-based comparison loop, which caused indefinite loading and timeouts on libraries with a large number of images. The new SQL fast path reduces the time to find exact duplicates from minutes/hours to milliseconds.
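The PR's SQL is not shown in this conversation, so here is only the underlying idea expressed in Go: exact matches (distance 0) can be found by bucketing on hash equality in a single pass, which is the work a grouped query with a count filter would do in the database, instead of comparing every pair. The `hashedImage` type and `exactDuplicateGroups` function are hypothetical names for illustration.

```go
package main

import "sort"

// hashedImage is a hypothetical pairing of an image ID with its phash.
type hashedImage struct {
	ID    int
	Phash int64
}

// exactDuplicateGroups buckets image IDs by identical phash in one pass
// and returns only the buckets that contain more than one image.
func exactDuplicateGroups(images []hashedImage) [][]int {
	byHash := make(map[int64][]int)
	for _, img := range images {
		byHash[img.Phash] = append(byHash[img.Phash], img.ID)
	}

	var groups [][]int
	for _, ids := range byHash {
		if len(ids) > 1 {
			sort.Ints(ids)
			groups = append(groups, ids)
		}
	}

	// Map iteration order is unspecified, so order groups by their first ID
	// to keep the output deterministic.
	sort.Slice(groups, func(i, j int) bool { return groups[i][0] < groups[j][0] })
	return groups
}
```

This runs in roughly linear time in the number of images (plus sorting the result), which is why the exact-match case does not need the pairwise loop at all.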
This update provides significant performance improvements for both image and scene duplicate searching:

1. Optimized the core Hamming distance algorithm in pkg/utils/phash.go:
   - Uses native CPU popcount instructions (math/bits) for bit counting.
   - Pre-calculates hash values to eliminate object allocations in the hot loop.
   - Halves the number of comparisons by leveraging the symmetry of the Hamming distance.
   - The loop is now several orders of magnitude faster and allocation-free.

2. Solved the N+1 database query bottleneck:
   - Replaced individual database lookups for each duplicate group with a single batched query for all duplicate IDs.
   - This optimization was applied to both Image and Scene repositories.

3. Simplified the SQL fast path for exact image matches to remove redundant table joins.
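Point 1 above can be sketched as follows. This is an illustration under assumptions, not the contents of pkg/utils/phash.go: the `pair` type and `closePairs` function are hypothetical, and hashes are assumed to be already extracted into a plain `[]uint64` before the loop starts.

```go
package main

import "math/bits"

// hammingDistance counts the differing bits between two 64-bit hashes.
// bits.OnesCount64 compiles to a hardware popcount where the CPU supports it.
func hammingDistance(a, b uint64) int {
	return bits.OnesCount64(a ^ b)
}

// pair holds the indexes of two hashes that are within the distance threshold.
type pair struct{ I, J int }

// closePairs compares each unordered pair exactly once. Because
// hammingDistance(a, b) == hammingDistance(b, a), starting the inner loop
// at i+1 halves the number of comparisons versus a full N x N scan.
func closePairs(hashes []uint64, maxDistance int) []pair {
	var out []pair
	for i := 0; i < len(hashes); i++ {
		for j := i + 1; j < len(hashes); j++ {
			if hammingDistance(hashes[i], hashes[j]) <= maxDistance {
				out = append(out, pair{i, j})
			}
		}
	}
	return out
}
```

The loop is still quadratic in the number of hashes; the changes only reduce the constant factor and the per-comparison cost.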
This update provides additional performance improvements specifically targeted at large image libraries (e.g. 300k+ images):

1. Optimized the exact match SQL query for images:
   - Added filtering for zero/empty fingerprints to avoid massive false-positive groups.
   - Added a LIMIT of 1000 duplicate groups to prevent excessive memory consumption and serialization overhead.
   - Simplified the join structure to ensure better use of the database index.

2. Parallelized the Go comparison loop in pkg/utils/phash.go:
   - Utilizes all available CPU cores to perform Hamming distance calculations.
   - Uses a lock-free design to minimize synchronization overhead.
   - This makes non-zero distance searches significantly faster on multi-core systems.
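One way to parallelize such a loop without a mutex in the hot path is to give each worker its own result slice and merge after all workers finish. The sketch below assumes that design; the commit message does not say how the PR implements it, and `match` and `parallelClosePairs` are hypothetical names.

```go
package main

import (
	"math/bits"
	"runtime"
	"sort"
	"sync"
)

// match holds the indexes of two hashes within the distance threshold.
type match struct{ I, J int }

// parallelClosePairs splits the outer loop across workers. Each worker
// appends only to its own slot in perWorker, so no locking is needed while
// comparing; the WaitGroup is the only synchronization point.
func parallelClosePairs(hashes []uint64, maxDistance int) []match {
	workers := runtime.NumCPU()
	if workers > len(hashes) {
		workers = len(hashes)
	}
	if workers < 1 {
		return nil
	}

	perWorker := make([][]match, workers)
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func(w int) {
			defer wg.Done()
			// Rows are interleaved across workers because early rows have
			// more comparisons than late ones in a triangular loop.
			for i := w; i < len(hashes); i += workers {
				for j := i + 1; j < len(hashes); j++ {
					if bits.OnesCount64(hashes[i]^hashes[j]) <= maxDistance {
						perWorker[w] = append(perWorker[w], match{i, j})
					}
				}
			}
		}(w)
	}
	wg.Wait()

	var out []match
	for _, part := range perWorker {
		out = append(out, part...)
	}
	// Sort so the result does not depend on the number of workers.
	sort.Slice(out, func(a, b int) bool {
		if out[a].I != out[b].I {
			return out[a].I < out[b].I
		}
		return out[a].J < out[b].J
	})
	return out
}
```

The speedup is bounded by the core count, so this complements rather than replaces the exact-match fast path for very large libraries.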
…tion

This update resolves major performance regressions when processing large libraries:

1. Optimized FindMany in both Image and Scene stores to use map-based ID lookups. Previously, this function used slices.Index in a loop, resulting in O(N^2) complexity. On a library with 300k items, this was causing the server to hang indefinitely.

2. Refined the exact image duplicate SQL query to match the scene checker's level of optimization. It now joins the files table and orders results by total duplicate file size, ensuring that the most impactful duplicates are shown first.

3. Removed the temporary LIMIT 1000 from the image duplicate query now that the algorithmic bottlenecks have been resolved.
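The map-based lookup in point 1 might look like the sketch below. The `item` type and `orderByIDs` function are hypothetical, and skipping IDs that were not fetched is an assumption; the real FindMany may treat a missing ID as an error.

```go
package main

// item is a hypothetical stand-in for a fetched image or scene row.
type item struct {
	ID   int
	Path string
}

// orderByIDs returns the fetched items in the order given by ids.
// Building the map costs O(N) and each lookup is O(1), instead of a
// linear search through the slice for every ID.
func orderByIDs(ids []int, fetched []item) []item {
	byID := make(map[int]item, len(fetched))
	for _, it := range fetched {
		byID[it.ID] = it
	}

	out := make([]item, 0, len(ids))
	for _, id := range ids {
		// Assumption: IDs with no fetched row are skipped silently.
		if it, ok := byID[id]; ok {
			out = append(out, it)
		}
	}
	return out
}
```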
This fixes a severe performance bottleneck where the image duplicate checker would hang indefinitely or crash the server when finding many duplicates.

Previously, the GraphQL query requested the full 'ImageData' fragment for every duplicate found, forcing the backend to resolve and serialize all related entities (galleries, studios, tags, performers) for thousands of images at once.

By switching to the 'SlimImageData' fragment (mirroring how the Scene duplicate checker operates), the payload size and resolution time are drastically reduced, allowing the tool to scale correctly.
This fixes an issue where Chrome would become unresponsive and prompt the user to kill the page when a large number of duplicates (e.g. 30,000+ groups) were found.

1. Changed the fetchPolicy on FindDuplicateImages to 'no-cache'. Loading 30k+ complex objects into the Apollo normalized cache blocked the main thread for an extended period. Bypassing the cache for this massive one-off query resolves the blocking.
2. Optimized the sorting algorithm in both Image and Scene duplicate checkers. Previously, the group size was recalculated by iterating over all nested files inside the sort's comparison function, resulting in millions of unnecessary iterations (O(N log N) with a heavy inner loop). Now, group sizes are precalculated into a map (O(N)) before sorting.
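The sorting change in point 2 lives in the frontend, but the idea is language-independent. Here is a sketch in Go with hypothetical types (`dupGroup`, `image`, `file`): sizes are summed once per group into a map, and the comparison function only performs map lookups.

```go
package main

import "sort"

// Hypothetical shapes for a duplicate group and its nested files.
type file struct{ Size int64 }
type image struct{ Files []file }
type dupGroup struct {
	ID     int
	Images []image
}

// totalSize walks every nested file once.
func totalSize(g dupGroup) int64 {
	var sum int64
	for _, img := range g.Images {
		for _, f := range img.Files {
			sum += f.Size
		}
	}
	return sum
}

// sortBySizeDesc precomputes each group's size before sorting, so the
// comparator does not re-walk nested files on every comparison.
func sortBySizeDesc(groups []dupGroup) {
	sizes := make(map[int]int64, len(groups))
	for _, g := range groups {
		sizes[g.ID] = totalSize(g)
	}
	sort.SliceStable(groups, func(i, j int) bool {
		return sizes[groups[i].ID] > sizes[groups[j].ID]
	})
}
```

The map is keyed by group ID rather than slice index because sorting moves elements, which would invalidate index-based lookups.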
… check

Renamed the dropdown options in the duplicate checkers to be much clearer about their behavior (e.g. 'Keep the largest file').
Also fixed a bug in the Image Duplicate Checker where 'select highest resolution' would fail or do nothing because 'checkSameResolution' was incorrectly trying to access array index [0] on visual_files instead of finding the max resolution across all files, causing it to incorrectly abort the selection.
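The resolution fix can be illustrated with a small sketch. The actual component is frontend code and is not shown here; the Go types and function names below are hypothetical, and resolution is assumed to mean width times height.

```go
package main

// Hypothetical shapes mirroring an image with several visual files.
type visualFile struct{ Width, Height int }
type imageEntry struct{ Files []visualFile }

// maxPixels returns the largest width*height across all of an image's
// files, rather than looking only at the first file.
func maxPixels(img imageEntry) int {
	best := 0
	for _, f := range img.Files {
		if p := f.Width * f.Height; p > best {
			best = p
		}
	}
	return best
}

// allSameResolution reports whether every image in the group has the same
// maximum resolution, in which case "highest resolution" cannot pick a winner.
func allSameResolution(images []imageEntry) bool {
	for i := 1; i < len(images); i++ {
		if maxPixels(images[i]) != maxPixels(images[0]) {
			return false
		}
	}
	return true
}

// highestResolutionIndex returns the index of the image with the largest
// maximum resolution, or -1 for an empty group.
func highestResolutionIndex(images []imageEntry) int {
	best, bestPixels := -1, -1
	for i, img := range images {
		if p := maxPixels(img); p > bestPixels {
			best, bestPixels = i, p
		}
	}
	return best
}
```

An image whose first file is a small variant but whose second file is the full-size original is the case where checking only index 0 gives the wrong answer.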
@notsafeforgit notsafeforgit force-pushed the feat/image-duplicate-checker branch from f6cbe86 to 9b6a861 Compare March 27, 2026 11:14
Labels

deferred Good feature that can be looked at for a later release.

4 participants