Detecting Duplicate Files with fw ingest
Introduction
By default, when you upload data that already exists in the destination project, the existing files are overwritten, and a new file version is created. However, if you do not wish to overwrite files, then you can use the detect duplicates option. This option finds duplicates within the upload dataset and also checks if files already exist in Flywheel. What Flywheel does with the duplicate data is configurable.
This article explains how to find duplicate files when using the CLI to upload data.
Instruction Steps
Flywheel scans both the source dataset and any data already in the destination project(s). Specifically, Flywheel uses the following criteria to determine duplicates:
- Multiple files have the same destination path (DD01)
- File name already exists in destination project (DD02)
- A single item contains multiple StudyInstanceUIDs (DD03)
- A single session contains multiple StudyInstanceUIDs (DD04)
- Multiple sessions have the same StudyInstanceUID (DD05)
- StudyInstanceUID already exists in a different session (DD06)
- A single acquisition contains multiple SeriesInstanceUIDs (DD07)
- Multiple acquisitions have the same SeriesInstanceUID (DD08)
- A single item contains multiple SeriesInstanceUIDs (DD09)
- SeriesInstanceUID already exists in a different acquisition (DD10)
- SOPInstanceUID occurs multiple times (image UIDs should be unique) (DD11)
Note: When using the ingest template and ingest folder commands, the following criteria are not applied:
- A single session contains multiple StudyInstanceUIDs (DD04)
- A single acquisition contains multiple SeriesInstanceUIDs (DD07)
Each DD## code identifies one rule and can be used to exclude that rule from the check. See Override Detect Duplicate Criteria below for details.
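To make two of the criteria above concrete, here is a minimal sketch of how DD01 (repeated destination path) and DD11 (repeated SOPInstanceUID) could be checked on a small in-memory list. It is an illustration only, not Flywheel's implementation; the function name `find_duplicates` and the record keys `dest_path` and `sop_uid` are hypothetical.

```python
from collections import Counter


def find_duplicates(items):
    """Check a list of records against DD01 and DD11 only.

    Each record is a dict with the hypothetical keys 'dest_path' and
    'sop_uid'. Returns {code: sorted list of offending values}.
    """
    findings = {}

    # DD01: multiple files have the same destination path
    path_counts = Counter(item["dest_path"] for item in items)
    repeated_paths = sorted(p for p, n in path_counts.items() if n > 1)
    if repeated_paths:
        findings["DD01"] = repeated_paths

    # DD11: SOPInstanceUID occurs multiple times
    uid_counts = Counter(item["sop_uid"] for item in items)
    repeated_uids = sorted(u for u, n in uid_counts.items() if n > 1)
    if repeated_uids:
        findings["DD11"] = repeated_uids

    return findings


items = [
    {"dest_path": "subj1/ses1/acq1/a.dcm", "sop_uid": "1.2.3.1"},
    {"dest_path": "subj1/ses1/acq1/a.dcm", "sop_uid": "1.2.3.2"},
    {"dest_path": "subj1/ses1/acq1/b.dcm", "sop_uid": "1.2.3.2"},
]
print(find_duplicates(items))
# prints: {'DD01': ['subj1/ses1/acq1/a.dcm'], 'DD11': ['1.2.3.2']}
```

The remaining criteria follow the same pattern: group records by one attribute and flag any group that holds more than one distinct value of another.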
How to Use --detect-duplicates
This example uses the ingest dicom command; however, the --detect-duplicates option is also available for the ingest folder, ingest template, and ingest project commands.
- Follow these instructions to download and sign in to the Flywheel CLI if you have not already.
- To start, compare the source dataset to a single project. Open your Terminal or Windows Command Prompt, and enter the ingest dicom command with the --detect-duplicates option.
- The Flywheel CLI displays the files found. Any duplicates are listed in the error summary along with the number of files.
- If you enter Yes at the prompt, Flywheel uploads only the new data and does not upload duplicates.
All duplicates are noted in the ingest audit log attached to the project. To review the audit log, navigate to the destination project, and select the Information tab.
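If you launch the CLI from a script, the command can be assembled as an argument list. The sketch below assumes the positional order used elsewhere on this page (source, group ID, project label) and that --detect-duplicates is a flag that takes no value; the helper name `build_ingest_dicom_command` is hypothetical. It only prints the command and does not run it.

```python
import os
import shlex


def build_ingest_dicom_command(source, group, project, detect_duplicates=True):
    """Return the argument list for an `fw ingest dicom` call.

    Assumed order (taken from the override example on this page):
    fw ingest dicom <source> <group> <project> [options]
    """
    args = ["fw", "ingest", "dicom", source, group, project]
    if detect_duplicates:
        args.append("--detect-duplicates")
    return args


# Expand "~" first: a quoted tilde is not expanded by the shell.
source = os.path.expanduser("~/Desktop/001 2.zip")
cmd = build_ingest_dicom_command(source, "doc-test", "Example")

# shlex.join quotes the space in the file name so the line can be pasted as-is.
print(shlex.join(cmd))
```

Passing the list to `subprocess.run(cmd)` would avoid shell quoting entirely; that step is left out here because it requires the CLI to be installed and signed in.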
Optional Arguments
These additional options give you more control over detecting duplicates and what to do with duplicate data.
- --copy-duplicates: Creates a new project and uploads any duplicate data there. The new project has the same name as the destination project with a randomized number appended (for example, _162948576).
- --detect-duplicates-override: Overrides specific criteria for detecting duplicates.
- --detect-duplicates-project: Includes additional projects to scan for duplicates.
Copy Duplicates to a New Project
Instead of skipping duplicates, you may still wish to upload them to Flywheel in a different project for review. The --copy-duplicates option creates a new project specifically for the duplicate data.
- Enter the ingest command with the --copy-duplicates option into Terminal or Windows Command Prompt. The upload status includes an additional step called "preparing sidecar". The new project created for the duplicates is called a sidecar project.
- Flywheel uploads all new data to the original destination project. If any duplicates exist, Flywheel creates a new project using the original project name followed by a randomized number. If the original project is named "Example", the new project would be called something like "Example_1629392734".
The audit logs in both the original project and the sidecar project show which files were copied to the new project.
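The routing described above can be pictured with a short sketch: new files go to the destination project, duplicates go to a sidecar project, and the sidecar appears only when at least one duplicate exists. This is an illustration under stated assumptions, not Flywheel's code. The function name `plan_upload` is hypothetical, and the numeric suffix is passed in as a parameter because this page does not say how Flywheel generates it.

```python
def plan_upload(files, duplicate_files, project_label, suffix):
    """Map each target project label to the files it would receive.

    'suffix' stands in for the randomized number Flywheel appends;
    how that number is chosen is not documented here.
    """
    sidecar_label = f"{project_label}_{suffix}"
    duplicates = set(duplicate_files)
    plan = {project_label: [], sidecar_label: []}
    for name in files:
        target = sidecar_label if name in duplicates else project_label
        plan[target].append(name)
    # The sidecar project is only created if any duplicates exist.
    if not plan[sidecar_label]:
        del plan[sidecar_label]
    return plan


print(plan_upload(["a.dcm", "b.dcm", "c.dcm"], ["b.dcm"], "Example", 1629392734))
# prints: {'Example': ['a.dcm', 'c.dcm'], 'Example_1629392734': ['b.dcm']}
```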
Override Detect Duplicate Criteria
To ignore specific criteria for duplicates, use the --detect-duplicates-override option to select which duplicate detection rules to apply to the upload.
For example, to upload a file in which a single acquisition contains multiple SeriesInstanceUIDs (code DD07), list every code except DD07:
fw ingest dicom ~/Desktop/001\ 2.zip doc-test Example --detect-duplicates-override DD01 DD02 DD03 DD04 DD05 DD06 DD08 DD09 DD10 DD11
Note: Include all criteria you want to use to check for duplicates, separated by spaces. Any code you leave out is not applied.
Choose from the codes below when listing the criteria you would like to apply:
Code | Description |
---|---|
DD01 | Multiple files have the same destination path |
DD02 | File name already exists in destination project |
DD03 | A single item contains multiple StudyInstanceUIDs |
DD04 | A single session contains multiple StudyInstanceUIDs |
DD05 | Multiple sessions have the same StudyInstanceUID |
DD06 | StudyInstanceUID already exists in a different session |
DD07 | A single acquisition contains multiple SeriesInstanceUIDs |
DD08 | Multiple acquisitions have the same SeriesInstanceUID |
DD09 | A single item contains multiple SeriesInstanceUIDs |
DD10 | SeriesInstanceUID already exists in a different acquisition |
DD11 | SOPInstanceUID occurs multiple times (image UIDs should be unique) |
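Because the override option lists the rules to keep rather than the rules to skip, it can be easier to compute the list than to type it. The sketch below derives the argument values from the codes you want to ignore; the helper name `override_codes` is hypothetical, and the code range DD01 to DD11 is taken from the table above.

```python
# The eleven codes from the table above: DD01 ... DD11
ALL_CODES = [f"DD{n:02d}" for n in range(1, 12)]


def override_codes(ignore):
    """Return every code except the ones in 'ignore', in table order."""
    ignore = set(ignore)
    unknown = ignore - set(ALL_CODES)
    if unknown:
        raise ValueError(f"unknown duplicate codes: {sorted(unknown)}")
    return [code for code in ALL_CODES if code not in ignore]


print(" ".join(override_codes(["DD07"])))
# prints: DD01 DD02 DD03 DD04 DD05 DD06 DD08 DD09 DD10 DD11
```

The printed string is what follows --detect-duplicates-override when DD07 is the only rule being skipped.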
Compare with More than One Project
You can also compare your upload dataset to more than one project. To do this, you need the sort string for each project. The sort string follows this pattern: fw://group.id/project.label. Find it by navigating to the project and copying it from the top of the page.
To check against both the destination project (Example) and an additional project (AnxietyStudy), pass the additional project's sort string to the --detect-duplicates-project option.
Flywheel will check both projects for duplicate data. The conflict_path column in the audit log shows which project the duplicate was found in.
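If you assemble sort strings in a script instead of copying them from the project page, the fw://group.id/project.label pattern is simple to build and check. The helpers below are a hypothetical sketch and are not part of the Flywheel CLI or SDK.

```python
PREFIX = "fw://"


def make_sort_string(group_id, project_label):
    """Build a sort string following fw://group.id/project.label."""
    return f"{PREFIX}{group_id}/{project_label}"


def parse_sort_string(sort_string):
    """Split a sort string into (group_id, project_label)."""
    if not sort_string.startswith(PREFIX):
        raise ValueError(f"sort string must start with {PREFIX!r}: {sort_string!r}")
    # Everything before the first "/" is the group ID; the rest is the label.
    group_id, sep, project_label = sort_string[len(PREFIX):].partition("/")
    if not sep or not group_id or not project_label:
        raise ValueError(f"expected fw://group.id/project.label: {sort_string!r}")
    return group_id, project_label


print(make_sort_string("Lab612", "AnxietyStudy"))
# prints: fw://Lab612/AnxietyStudy
print(parse_sort_string("fw://doc-test/Example"))
# prints: ('doc-test', 'Example')
```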
Config File Example
If you use a config file for your CLI commands, the example below incorporates the options on this page.
detect-duplicates: true
copy-duplicates: true
detect-duplicates-project:
- fw://doc-test/Example
- fw://doc-test/Example2
- fw://Lab612/AnxietyStudy
detect-duplicates-override:
- DD01
- DD02
- DD03
- DD04
- DD05
- DD06
- DD08
- DD09
- DD10
- DD11
# The above example will not use DD07: A single acquisition contains multiple SeriesInstanceUIDs
# as part of the detect duplicate criteria
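Before running an ingest, a config like this one can be sanity-checked once it has been loaded into a Python dict. The sketch below is a hypothetical helper, not a Flywheel tool. It only knows the four keys shown on this page (a real config file may contain others), and treating key names as case-sensitive is an assumption.

```python
# Keys documented on this page only; a real config may use more.
KNOWN_KEYS = {
    "detect-duplicates",
    "copy-duplicates",
    "detect-duplicates-project",
    "detect-duplicates-override",
}
VALID_CODES = {f"DD{n:02d}" for n in range(1, 12)}


def check_config(config):
    """Return a list of human-readable problems found in 'config'."""
    problems = []
    for key in config:
        if key not in KNOWN_KEYS:
            problems.append(f"unexpected key: {key}")
    for code in config.get("detect-duplicates-override", []):
        if code not in VALID_CODES:
            problems.append(f"unknown code: {code}")
    for sort_string in config.get("detect-duplicates-project", []):
        group, sep, label = sort_string.removeprefix("fw://").partition("/")
        if not sort_string.startswith("fw://") or not (group and sep and label):
            problems.append(f"malformed sort string: {sort_string}")
    return problems


config = {
    "detect-duplicates": True,
    "detect-duplicates-project": ["fw://doc-test/Example", "Lab612/AnxietyStudy"],
    "Detect-duplicates-override": ["DD01"],  # capital D: flagged as unexpected
}
for problem in check_config(config):
    print(problem)
```

`str.removeprefix` requires Python 3.9 or later.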