Detect Duplicate Files When Uploading
Introduction
By default, when you upload data that already exists in the destination project, the existing files are overwritten and a new file version is created. If you do not want to overwrite files, use the detect duplicates option. This option finds duplicates within the upload dataset itself and checks whether files already exist in Flywheel. You can also configure what happens to the duplicate data.
This article explains how to find duplicate files when using the CLI to upload data.
Instruction Steps
Detecting Flywheel Duplicates
Flywheel scans both the source dataset and any data already in the destination project(s). Specifically, Flywheel uses the following criteria to determine duplicates:
- Multiple files have the same destination path (DD01)
- File name already exists in destination project (DD02)
- A single item contains multiple StudyInstanceUIDs (DD03)
- A single session contains multiple StudyInstanceUIDs (DD04)
- Multiple sessions have the same StudyInstanceUID (DD05)
- StudyInstanceUID already exists in a different session (DD06)
- A single acquisition contains multiple SeriesInstanceUIDs (DD07)
- Multiple acquisitions have the same SeriesInstanceUID (DD08)
- A single item contains multiple SeriesInstanceUIDs (DD09)
- SeriesInstanceUID already exists in a different acquisition (DD10)
- SOPInstanceUID occurs multiple times (image UIDs should be unique) (DD11)
Note: When you use the ingest template and ingest folder commands, the following criteria are not applied: a single session contains multiple StudyInstanceUIDs (DD04) and a single acquisition contains multiple SeriesInstanceUIDs (DD07).
Each DD## code identifies one rule and can be used to exclude that rule from the check. See the Override Detect Duplicate Criteria section below for more information.
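To make two of the criteria above concrete, the following Python sketch flags repeated destination paths (DD01) and repeated SOPInstanceUIDs (DD11) within an upload dataset. It is an illustration only, not Flywheel's implementation: the record layout (`dest_path`, `sop_instance_uid`), the sample values, and the function name `find_in_dataset_duplicates` are assumptions made for this example.

```python
from collections import Counter


def find_in_dataset_duplicates(files):
    """Return {code: sorted list of repeated values} for DD01 and DD11.

    `files` is a list of dicts with the hypothetical keys 'dest_path'
    and 'sop_instance_uid'. This is not a Flywheel data structure.
    """
    path_counts = Counter(f["dest_path"] for f in files)
    uid_counts = Counter(f["sop_instance_uid"] for f in files)
    return {
        # DD01: multiple files have the same destination path
        "DD01": sorted(p for p, n in path_counts.items() if n > 1),
        # DD11: SOPInstanceUID occurs multiple times
        "DD11": sorted(u for u, n in uid_counts.items() if n > 1),
    }


# Made-up sample records, for illustration only
sample = [
    {"dest_path": "sub-01/ses-01/scan/a.dcm", "sop_instance_uid": "1.2.3.1"},
    {"dest_path": "sub-01/ses-01/scan/a.dcm", "sop_instance_uid": "1.2.3.2"},
    {"dest_path": "sub-01/ses-01/scan/b.dcm", "sop_instance_uid": "1.2.3.2"},
]
report = find_in_dataset_duplicates(sample)
print(report)
```

The remaining criteria follow the same pattern of grouping by a key and looking for collisions, with the difference that several of them also compare against data already stored in the destination project.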
How to Use --detect-duplicates
In this example we will use the ingest dicom command; however, the --detect-duplicates option is also available for the ingest folder, ingest template, and ingest project commands.
- Follow these instructions to download and sign in to the Flywheel CLI if you have not already.
- To start, we will compare the source dataset to a single project. Open your Terminal or Windows Command Prompt app, and enter the following command:
For example:
- The Flywheel CLI displays the files found. Any duplicates are listed in the error summary along with the number of files.
- If you enter Yes, Flywheel uploads only the new data and does not upload duplicates. All duplicates are noted in the ingest audit log attached to the project. To review the audit log, navigate to the destination project, and select the Information tab.
These additional options give you more control over detecting duplicates and what to do with duplicate data. See the sections below for more information on how to use them:
- [--copy-duplicates](#section-idm4615850413334432588023980634 "Copy duplicates to a new project"): Creates a new project and uploads any duplicate data there. The new project will have the same name as the destination project with a randomized number added to the end (for example _162948576).
- [--detect-duplicates-override](#section-idm4585546791606432598083580959 "Override detect duplicate criteria"): Used to override specific criteria for detecting duplicates.
- [--detect-duplicates-project](#section-idm460632692739843259802427242 "Compare with more than one project"): Allows you to include additional projects to scan for duplicates.
Copy Duplicates to a New Project
Instead of skipping duplicates, you may still wish to upload them to Flywheel in a different project for review. The --copy-duplicates option will create a new project just for the duplicate data.
- Enter the following command into Terminal or Windows Command Prompt:
For example:
There is also an additional step in the upload status called "preparing sidecar". The new project created for the duplicates is called a sidecar project.
- Flywheel uploads all new data to the original destination project. If any duplicates exist, Flywheel creates a new project using the original project name followed by a randomized number.
If the original project is named "Example", the new project would be called something like "Example_1629392734". The audit logs in both the original and sidecar projects show which files were copied to the new project.
Override Detect Duplicate Criteria
To ignore specific criteria for duplicates, use the --detect-duplicates-override option to select which detect duplicates rules to apply to the upload.
For example, to upload a file that contains multiple SeriesInstanceUIDs (code DD07), list every code except DD07:
fw ingest dicom ~/Desktop/001\ 2.zip doc-test Example --detect-duplicates-override DD01 DD02 DD03 DD04 DD05 DD06 DD08 DD09 DD10 DD11
Note: You must list every criterion you want applied, separated by spaces. Any code you leave out is not used to check for duplicates.
Choose from the codes below when listing the criteria you would like to apply:
Code | Description |
---|---|
DD01 | Multiple files have the same destination path |
DD02 | File name already exists in destination project |
DD03 | A single item contains multiple StudyInstanceUIDs |
DD04 | A single session contains multiple StudyInstanceUIDs |
DD05 | Multiple sessions have the same StudyInstanceUID |
DD06 | StudyInstanceUID already exists in a different session |
DD07 | A single acquisition contains multiple SeriesInstanceUIDs |
DD08 | Multiple acquisitions have the same SeriesInstanceUID |
DD09 | A single item contains multiple SeriesInstanceUIDs |
DD10 | SeriesInstanceUID already exists in a different acquisition |
DD11 | SOPInstanceUID occurs multiple times (image UIDs should be unique) |
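Because the override option takes the rules to keep rather than the rules to skip, it can be easier to generate the list than to type it. The sketch below is a hypothetical helper, not part of the Flywheel CLI: the names `ALL_CODES` and `override_args` are assumptions for this example, and the only things taken from this article are the flag name and the DD01 through DD11 codes.

```python
# All detect-duplicates rule codes listed in the table above: DD01 .. DD11
ALL_CODES = [f"DD{n:02d}" for n in range(1, 12)]


def override_args(skip):
    """Build the override flag followed by every code NOT in `skip`."""
    unknown = set(skip) - set(ALL_CODES)
    if unknown:
        raise ValueError(f"Unknown rule codes: {sorted(unknown)}")
    return ["--detect-duplicates-override"] + [
        code for code in ALL_CODES if code not in skip
    ]


# Skip only DD07 (a single acquisition contains multiple SeriesInstanceUIDs)
args = override_args({"DD07"})
print(" ".join(args))
```

The printed fragment can be appended to an ingest command such as the one shown earlier in this section.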
Compare with More than One Project
You can also compare your upload dataset to more than one project. To do this, you will need the sort string for each project. The sort string follows this pattern: fw://group.id/project.label.
You can also find it by navigating to the project and copying it from the top of the page:
To check against both the destination project (Example) and an additional project (AnxietyStudy):
Flywheel will check both projects for duplicate data. The conflict_path column in the audit log shows which of the projects the duplicate was found in.
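If you assemble the list of projects in a script, the fw://group.id/project.label pattern described above is simple to build and to check. The following is a minimal sketch under that assumption only. The function names `project_path` and `split_project_path` are hypothetical, and the code does not contact Flywheel or confirm that a project exists.

```python
PREFIX = "fw://"


def project_path(group_id, project_label):
    """Format a group ID and project label as fw://group.id/project.label."""
    return f"{PREFIX}{group_id}/{project_label}"


def split_project_path(path):
    """Split fw://group.id/project.label into (group_id, project_label)."""
    if not path.startswith(PREFIX):
        raise ValueError(f"Expected a string starting with {PREFIX!r}: {path!r}")
    # Split on the first slash after the prefix
    group_id, sep, project_label = path[len(PREFIX):].partition("/")
    if not sep or not group_id or not project_label:
        raise ValueError(f"Expected fw://group.id/project.label, got {path!r}")
    return group_id, project_label


print(project_path("Lab612", "AnxietyStudy"))
print(split_project_path("fw://doc-test/Example"))
```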
Config File Examples
If you use a config file for your CLI commands, the example below shows how to format these options.
---
detect-duplicates: true
copy-duplicates: true
detect-duplicates-project:
- fw://doc-test/Example
- fw://doc-test/Example2
- fw://Lab612/AnxietyStudy
detect-duplicates-override:
- DD01
- DD02
- DD03
- DD04
- DD05
- DD06
- DD08
- DD09
- DD10
- DD11
# The above example will not use DD07: A single acquisition contains multiple SeriesInstanceUIDs
# as part of the detect duplicate criteria