Detecting Duplicate Files with fw ingest

Introduction

By default, when you upload data that already exists in the destination project, the existing files are overwritten and a new file version is created. If you do not want to overwrite files, use the --detect-duplicates option. This option finds duplicates within the upload dataset and also checks whether files already exist in Flywheel. What Flywheel does with the duplicate data is configurable.

This article explains how to find duplicate files when using the CLI to upload data.

Instruction Steps

Flywheel scans both the source dataset and any data already in the destination project(s). Specifically, Flywheel uses the following criteria to determine duplicates:

  • Multiple files have the same destination path (DD01)
  • File name already exists in destination project (DD02)
  • A single item contains multiple StudyInstanceUIDs (DD03)
  • A single session contains multiple StudyInstanceUIDs (DD04)
  • Multiple sessions have the same StudyInstanceUID (DD05)
  • StudyInstanceUID already exists in a different session (DD06)
  • A single acquisition contains multiple SeriesInstanceUIDs (DD07)
  • Multiple acquisitions have the same SeriesInstanceUID (DD08)
  • A single item contains multiple SeriesInstanceUIDs (DD09)
  • SeriesInstanceUID already exists in a different acquisition (DD10)
  • SOPInstanceUID occurs multiple times (image UIDs should be unique) (DD11)

Note: When using the ingest template and ingest folder commands, the following criteria are not applied:

  • A single session contains multiple StudyInstanceUIDs (DD04)
  • A single acquisition contains multiple SeriesInstanceUIDs (DD07)

Each DD## code identifies one rule and can be used to ignore that specific rule. See Override Detect Duplicate Criteria below for more information.
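
Several of the criteria above (for example DD01 and DD11) come down to checking whether a value occurs more than once. As a rough illustration of that idea only, and not of how Flywheel implements it, the sketch below lists repeated values from a list. The function name find_repeats and the UIDs are hypothetical.

```shell
# Read values from standard input, one per line, and print each value
# that occurs more than once (each repeated value is printed once).
find_repeats() {
  sort | uniq -d
}

# Hypothetical SOPInstanceUIDs; 1.2.3.4 appears twice.
printf '%s\n' 1.2.3.4 1.2.3.5 1.2.3.4 | find_repeats   # prints 1.2.3.4
```

A check like this can be handy for spotting obvious repeats in a list of identifiers before starting an upload, but the ingest command's own report remains the authority on what counts as a duplicate.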

How to Use --detect-duplicates

This example uses the ingest dicom command. The --detect-duplicates option is also available for the ingest folder, ingest template, and ingest project commands.

  1. Follow these instructions to download and sign in to the Flywheel CLI if you have not already.
  2. To start, we will compare the source dataset to a single project. Open your Terminal or Windows Command Prompt, and enter the following command:

    fw ingest dicom --detect-duplicates [filepath to data] [group id] [project label]
    

    For example:

    fw ingest dicom --detect-duplicates ~/Desktop/001\ 2.zip doc-test Example
    
  3. The Flywheel CLI displays the files found. Any duplicates are listed in the error summary along with the number of files.

  4. If you enter Yes at the prompt, Flywheel uploads only the new data and skips the duplicates.

All duplicates are noted in the ingest audit log attached to the project. To review the audit log, navigate to the destination project, and select the Information tab.

Optional Arguments

These additional options give you more control over detecting duplicates and what to do with duplicate data.

  • --copy-duplicates: Creates a new project and uploads any duplicate data there. The new project has the same name as the destination project with a randomized number appended (for example, _162948576).
  • --detect-duplicates-override: Used to override specific criteria for detecting duplicates.
  • --detect-duplicates-project: Used to include additional projects to scan for duplicates.

Copy Duplicates to a New Project

Instead of skipping duplicates, you may still wish to upload them to Flywheel in a different project for review. The --copy-duplicates option will create a new project specifically for the duplicate data.

  1. Enter the following command into Terminal or Windows Command Prompt:

    fw ingest dicom --copy-duplicates [filepath to data] [group id] [project label]
    

    For example:

    fw ingest dicom --copy-duplicates ~/Documents/StudyData doc-test Example
    

    The upload status shows an additional step called "preparing sidecar". The new project created for the duplicates is called a sidecar project.

  2. Flywheel uploads all new data to the original destination project. If any duplicates exist, Flywheel creates a new project using the original project name followed by a randomized number.

    If the original project is named "Example", the new project would be called something like "Example_1629392734".

The audit logs in both the original project and the sidecar project show which files were copied to the new project.
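
If you need to pick sidecar projects out of a list of project labels, the naming pattern described above (original label, underscore, number) can be matched with a small shell function. This is a heuristic sketch based only on the example name shown in this article; is_sidecar_of is a hypothetical helper, not part of the Flywheel CLI.

```shell
# Succeed (exit status 0) if $2 looks like "<$1>_<digits>",
# for example: is_sidecar_of Example Example_1629392734
is_sidecar_of() {
  suffix=${2#"$1"_}
  # If nothing was stripped, $2 does not start with "<$1>_".
  if [ "$suffix" = "$2" ]; then
    return 1
  fi
  case $suffix in
    '' | *[!0-9]*) return 1 ;;  # empty, or contains a non-digit
    *) return 0 ;;
  esac
}

if is_sidecar_of Example Example_1629392734; then
  echo "looks like a sidecar project"
fi
```

A project that merely happens to end in an underscore and digits would also match, so treat the result as a hint and confirm against the audit log.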

Override Detect Duplicate Criteria

To ignore specific criteria for duplicates, use the --detect-duplicates-override option to select which duplicate detection rules to apply to the upload.

For example, to upload a file that contains multiple SeriesInstanceUIDs (code DD07), list every code except DD07:

fw ingest dicom ~/Desktop/001\ 2.zip doc-test Example --detect-duplicates-override DD01 DD02 DD03 DD04 DD05 DD06 DD08 DD09 DD10 DD11

Note: Include every criterion you want to use to check for duplicates, separated by spaces.

Choose from the codes below when listing the criteria you would like to apply:

Code Description
DD01 Multiple files have the same destination path
DD02 File name already exists in destination project
DD03 A single item contains multiple StudyInstanceUIDs
DD04 A single session contains multiple StudyInstanceUIDs
DD05 Multiple sessions have the same StudyInstanceUID
DD06 StudyInstanceUID already exists in a different session
DD07 A single acquisition contains multiple SeriesInstanceUIDs
DD08 Multiple acquisitions have the same SeriesInstanceUID
DD09 A single item contains multiple SeriesInstanceUIDs
DD10 SeriesInstanceUID already exists in a different acquisition
DD11 SOPInstanceUID occurs multiple times (image UIDs should be unique)
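
Because the override option takes the codes to keep rather than the codes to drop, it is easy to leave one out by accident. The sketch below builds the list from the codes you want to ignore. The helper name codes_except and the variable ALL_CODES are hypothetical; the codes themselves are the ones in the table above.

```shell
# All duplicate-detection codes listed in this article.
ALL_CODES="DD01 DD02 DD03 DD04 DD05 DD06 DD07 DD08 DD09 DD10 DD11"

# Print every code except those given as arguments, separated by spaces.
codes_except() {
  result=""
  for code in $ALL_CODES; do
    skip=0
    for ignored in "$@"; do
      if [ "$code" = "$ignored" ]; then
        skip=1
      fi
    done
    if [ "$skip" -eq 0 ]; then
      result="${result:+$result }$code"
    fi
  done
  printf '%s\n' "$result"
}

# Codes to pass to --detect-duplicates-override when ignoring DD07.
codes_except DD07
```

The printed list can be pasted after --detect-duplicates-override, which reproduces the example command above without typing the ten codes by hand.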

Compare with More than One Project

You can also compare your upload dataset to more than one project. To do this, you need the sort string for each project, which follows this pattern: fw://group.id/project.label.

To find it, navigate to the project and copy it from the top of the page.
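
If you already know the group ID and project label, the pattern above can also be assembled by hand. A minimal sketch, assuming the label contains no characters that need escaping; project_sort_string is a hypothetical helper name:

```shell
# Print the sort string for a group ID ($1) and project label ($2).
project_sort_string() {
  printf 'fw://%s/%s\n' "$1" "$2"
}

project_sort_string psychology2 AnxietyStudy   # prints fw://psychology2/AnxietyStudy
```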

To check against both the destination project (Example) and an additional project (AnxietyStudy):

fw ingest dicom --detect-duplicates-project fw://psychology2/AnxietyStudy [filepath to data] doc-test Example

Flywheel will check both projects for duplicate data. The conflict_path column in the audit log shows which project the duplicate was found in.

Config File Example

If you use a config file for your CLI commands, the example below incorporates the options on this page.

detect-duplicates: true
copy-duplicates: true
detect-duplicates-project:
   - fw://doc-test/Example
   - fw://doc-test/Example2
   - fw://Lab612/AnxietyStudy
detect-duplicates-override:
   - DD01
   - DD02
   - DD03
   - DD04
   - DD05
   - DD06
   - DD08
   - DD09
   - DD10
   - DD11 
# The above example will not use DD07: A single acquisition contains multiple SeriesInstanceUIDs 
# as part of the detect duplicate criteria
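
With a long override list, it can be hard to see at a glance which rule was left out. The sketch below reports the DD codes that are absent from a config file laid out like the example above. It does simple line matching rather than real YAML parsing, and ignored_codes is a hypothetical helper, not a Flywheel command.

```shell
# Print each DD code (DD01-DD11) that is not listed in the config file $1.
# Prints nothing if the file has no detect-duplicates-override key,
# since no override means no rule has been dropped.
ignored_codes() {
  grep -qi '^detect-duplicates-override:' "$1" || return 0
  for code in DD01 DD02 DD03 DD04 DD05 DD06 DD07 DD08 DD09 DD10 DD11; do
    # Match list items such as "   - DD01", allowing surrounding whitespace.
    if ! grep -q "^[[:space:]]*-[[:space:]]*${code}[[:space:]]*\$" "$1"; then
      printf '%s\n' "$code"
    fi
  done
}

# Example (assumes a file named config.yaml in the current directory):
#   ignored_codes config.yaml
```

Run against the example config above, this would be expected to report only DD07.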