feat: Implement Image Duplicate Checker #6675
notsafeforgit wants to merge 19 commits into stashapp:develop from
Conversation
I'm getting build errors when trying to test.
Gykes left a comment
Just an initial static check. Once the build issues are fixed I can do another, more detailed, review.
UI review: The initial page in Settings -> Tools looks okay. The only issue here is that the casing of all the settings is bad, but I know that's more of @DogmaDragon's decision on how they want it, as I've seen a few PRs from them correcting issues like this.

Inside the image checker is where there are issues. It's white and looks nothing like the scene one. Is there a reason it looks so different from the scenes variation? There are also a lot of missing features: search accuracy, search options, things of that nature.

The final issue is that when I click search I get a 400 error. I do have several images in my dev environment, but none have hashes. If the images do not have hashes, that should be handled appropriately, not with a 400 error.

If you are using AI to build this, which I am not against, please make sure you are testing these features before PRing and requesting reviews from people. Both of the major issues I have pointed out could have been caught by starting the app, seeing it fail, and then testing the feature a little. Thank you for the work, and I am 100% for this feature, but I believe it needs more polish and work before I could give it an approval.
To be fair, the UI of the scene checker is not exactly great either, so I'm open to seeing changes if they make sense. But to your point, the color scheme is off, and a lot of features that the scene checker currently has are missing.
Agreed, I'm definitely open to trying new things if there was an intent. This looks more like an oversight rather than a deliberate attempt to refactor the UI into a better UX. If OP wants to modify the UX for this, then I am happy to have it done and potentially port it over to scenes.
Force-pushed from 41ae7cb to 27ab865
All UI issues mentioned should be fixed now; I have attached a screenshot. Polished it up quite a bit, and hopefully it's closer to meeting the standards for Stash now!
This change introduces a new tool to identify duplicate images based on their perceptual hash (phash). It includes:
- Backend implementation for phash distance comparison and grouping.
- GraphQL schema updates and API resolvers.
- Frontend UI for the Image Duplicate Checker tool.
- Unit tests for the image search and duplicate detection logic.
This change unifies the duplicate detection logic by leveraging the shared phash utility. It also enhances the UI with:
- Pagination for large result sets.
- Sorting duplicate groups by total file size.
- A more detailed table view with image thumbnails, paths, and dimensions.
- Consistency with the existing Scene Duplicate Checker tool.
This adds checkboxes to select duplicate images and integrates the existing EditImagesDialog and DeleteImagesDialog, allowing users to resolve duplicates directly from the tool.
…pository
- Removed unused `strconv` import from `pkg/sqlite/image.go`.
- Added the missing `github.com/stashapp/stash/pkg/utils` import to resolve the undefined `utils` reference.
- Fixed the pagination prop in the ImageDuplicateChecker component.
- Formatted modified Go files using gofmt.
- Ran prettier over the UI codebase to resolve the formatting-check CI failure.
- Wrap the FindDuplicateImages query in r.withReadTxn() to ensure a database transaction is in context.
- Use queryFunc instead of queryStruct for fetching multiple hashes, preventing runtime errors.
- Fix an N+1 query issue in duplicate grouping by using qb.FindMany() instead of calling qb.Find() for each duplicate image.
- Revert the searchColumns array to exclude "images.details", which was from another PR, and remove the related failing test.
- Fixes the 400 error in ImageDuplicateChecker.
- Updates UI and frontend types.
- Fixes Tools casing.
This fixes a bug where identical image duplicates were not being detected. The implementation was incorrectly scanning the phash BLOB into a string and then attempting to parse it as a hex string. Since phashes are stored as 64-bit integers, they were being converted to decimal strings. For phashes with the MSB set (negative when treated as int64), the resulting decimal string started with a '-', which caused the hex parser to fail and skip the image entirely. Additionally, even for non-negative phashes, parsing a decimal string as hex yielded incorrect hash values. Scanning directly into the utils.Phash struct (which uses int64) matches how Scene phashes are handled and ensures the hash values are correct.
… detection

This change adds a specialized SQL query to find exact image duplicate matches (distance 0) directly in the database. Previously, the image duplicate checker always used an O(N^2) Go-based comparison loop, which caused indefinite loading and timeouts on libraries with a large number of images. The new SQL fast path reduces the time to find exact duplicates from minutes or hours to milliseconds.
This update provides significant performance improvements for both image and scene duplicate searching:
1. Optimized the core Hamming distance algorithm in pkg/utils/phash.go:
   - Uses native CPU popcount instructions (math/bits) for bit counting.
   - Pre-calculates hash values to eliminate object allocations in the hot loop.
   - Halves the number of comparisons by leveraging the symmetry of the Hamming distance.
   - The loop is now several orders of magnitude faster and allocation-free.
2. Solved the N+1 database query bottleneck:
   - Replaced individual database lookups for each duplicate group with a single batched query for all duplicate IDs.
   - This optimization was applied to both the Image and Scene repositories.
3. Simplified the SQL fast path for exact image matches to remove redundant table joins.
This update provides additional performance improvements specifically targeted at large image libraries (e.g. 300k+ images):
1. Optimized the exact-match SQL query for images:
   - Added filtering for zero/empty fingerprints to avoid massive false-positive groups.
   - Added a LIMIT of 1000 duplicate groups to prevent excessive memory consumption and serialization overhead.
   - Simplified the join structure to ensure better use of the database index.
2. Parallelized the Go comparison loop in pkg/utils/phash.go:
   - Utilizes all available CPU cores to perform Hamming distance calculations.
   - Uses a lock-free design to minimize synchronization overhead.
   - This makes non-zero-distance searches significantly faster on multi-core systems.
…tion

This update resolves major performance regressions when processing large libraries:
1. Optimized FindMany in both the Image and Scene stores to use map-based ID lookups. Previously, this function used slices.Index in a loop, resulting in O(N^2) complexity. On a library with 300k items, this caused the server to hang indefinitely.
2. Refined the exact image duplicate SQL query to match the scene checker's level of optimization. It now joins the files table and orders results by total duplicate file size, ensuring that the most impactful duplicates are shown first.
3. Removed the temporary LIMIT 1000 from the image duplicate query now that the algorithmic bottlenecks have been resolved.
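The map-based lookup in point 1 follows a standard pattern; a sketch with hypothetical names (the real FindMany works against database rows):

```go
package main

import "fmt"

type image struct {
	id    int
	phash uint64
}

// findManyOrdered returns rows in the order the IDs were requested,
// using a map for O(1) position lookups. The previous approach called
// slices.Index once per row, making the whole loop O(N^2).
func findManyOrdered(ids []int, rows []image) []image {
	pos := make(map[int]int, len(ids)) // id -> requested position
	for i, id := range ids {
		pos[id] = i
	}
	out := make([]image, len(ids))
	for _, row := range rows {
		if i, ok := pos[row.id]; ok {
			out[i] = row
		}
	}
	return out
}

func main() {
	rows := []image{{1, 0xA}, {2, 0xB}, {3, 0xC}}
	out := findManyOrdered([]int{3, 1}, rows)
	fmt.Println(out[0].id, out[1].id) // 3 1
}
```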
This fixes a severe performance bottleneck where the image duplicate checker would hang indefinitely or crash the server when finding many duplicates. Previously, the GraphQL query requested the full 'ImageData' fragment for every duplicate found, forcing the backend to resolve and serialize all related entities (galleries, studios, tags, performers) for thousands of images at once. By switching to the 'SlimImageData' fragment (mirroring how the Scene duplicate checker operates), the payload size and resolution time are drastically reduced, allowing the tool to scale correctly.
This fixes an issue where Chrome would become unresponsive and prompt the user to kill the page when a large number of duplicates (e.g. 30,000+ groups) were found:
1. Changed the fetchPolicy on FindDuplicateImages to 'no-cache'. Loading 30k+ complex objects into the Apollo normalized cache blocked the main thread for an extended period. Bypassing the cache for this massive one-off query resolves the blocking.
2. Optimized the sorting algorithm in both the Image and Scene duplicate checkers. Previously, the group size was recalculated by iterating over all nested files inside the sort's comparison function, resulting in millions of unnecessary iterations (O(N log N) with a heavy inner loop). Now, group sizes are precalculated into a map (O(N)) before sorting.
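The precalculation in point 2 is a frontend (TypeScript) change, but the pattern is language-agnostic; a Go sketch with hypothetical types:

```go
package main

import (
	"fmt"
	"sort"
)

type fileInfo struct{ size int64 }

type group struct {
	id    int
	files []fileInfo
}

// sortGroupsBySize precalculates each group's total file size once
// (O(N) over all files) instead of re-summing nested files inside the
// sort comparator, which would run the inner loop O(N log N) times.
func sortGroupsBySize(groups []group) {
	totals := make(map[int]int64, len(groups))
	for _, g := range groups {
		var t int64
		for _, f := range g.files {
			t += f.size
		}
		totals[g.id] = t
	}
	sort.Slice(groups, func(i, j int) bool {
		return totals[groups[i].id] > totals[groups[j].id] // largest first
	})
}

func main() {
	gs := []group{
		{1, []fileInfo{{100}}},
		{2, []fileInfo{{300}, {50}}},
	}
	sortGroupsBySize(gs)
	fmt.Println(gs[0].id) // 2
}
```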
… check

Renamed the dropdown options in the duplicate checkers to be much clearer about their behavior (e.g. 'Keep the largest file'). Also fixed a bug in the Image Duplicate Checker where 'Select highest resolution' would fail or do nothing: checkSameResolution was incorrectly accessing array index [0] on visual_files instead of finding the max resolution across all files, causing it to incorrectly abort the selection.
Force-pushed from f6cbe86 to 9b6a861



This pull request introduces a new tool to identify duplicate images based on their perceptual hash (phash). The implementation is based directly on the one used for video pHashes and leverages a shared utility to ensure consistency and performance across the codebase.
It includes:
- Backend implementation for phash distance comparison and grouping.
- GraphQL schema updates and API resolvers.
- Frontend UI for the Image Duplicate Checker tool.
- Unit tests for the image search and duplicate detection logic.
Example screenshot:
