Bulk Import - Default Behavior

Overview

By default, Flywheel automatically detects the type of files you are importing and processes them accordingly:

DICOM files - Flywheel reads DICOM headers to determine where to place the data
Non-DICOM files - Flywheel uses the file path structure (folder names) to determine where to place the data

This document explains in detail how Flywheel's automatic detection and processing works.

Customizing Import Behavior

If you need to change the way your data is organized into Flywheel, see Bulk Import - Customizing Import Rules to learn how to override default behavior.

How Default Processing Works

Flywheel considers the following information from your source data to determine how to organize files within Flywheel:

File paths, including:
- File names
- File extensions
- Parent folder structure
DICOM headers (DICOM files only)

At a glance, the logic works as follows:

flowchart LR
    dt1(Detect Attachments)
    dt2(Detect DICOM)
    dt1-->dt2
    map1(Derive destination hierarchy)
    dt2-->map1
    grp1("Package files (ZIP)")
    map1-->grp1
    up1(Upload to Flywheel)
    grp1-->up1

The following sections describe this process in detail.

Note

Because of this processing logic, the way the source data is organized before import greatly affects how the data is imported and organized into Flywheel.

Step 1: Detecting Container Attachments

Flywheel can import files as attachments to Project, Subject, and Session containers (instead of only as Acquisition contents).

To import files as attachments, place the files into the source folder representing the Flywheel container to which the file should be attached.

For example, see Example 1: Container Attachments.

Step 2: Detecting DICOM Files

Flywheel reads the source file path information to determine which files might be DICOM.

For the purposes of determining where to place the data, Flywheel assumes that any file meeting both of the following rules is a DICOM file:

File name has either:
- An extension of *.dcm, *.dicom, or *.ima; OR
- No extension AND either:
  - Solely consists of numerals (e.g., 98201293), OR
  - Matches the format of a DICOM UID (e.g., 1.5.300.0.9230010.3.1.4.3735312059.35976.1651523430.20.99)
File is located in a leaf-level folder in the source data
- I.e., there are no folders next to this file in the source data

Any file that fails either of these criteria (file name format or file location) is not considered a DICOM file and is placed purely according to its parent folder structure.

Non-DICOM file handling is explained in Step 3b: Non-DICOM Files.
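The two detection rules above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not Flywheel's implementation: the function names are hypothetical, extension matching is assumed to be case-insensitive, a "UID format" is assumed to mean dot-separated groups of digits, and leaf-level folders are inferred only from the list of file paths (so empty folders are not seen).

```python
import re
from pathlib import PurePosixPath

# Extensions named in the rules above (case-insensitive matching is an assumption).
DICOM_EXTENSIONS = (".dcm", ".dicom", ".ima")
NUMERIC_NAME = re.compile(r"^\d+$")        # e.g. 98201293
UID_NAME = re.compile(r"^\d+(\.\d+)+$")    # e.g. 1.5.300.0.9230010 (assumed UID shape)


def name_looks_like_dicom(name: str) -> bool:
    """Rule 1: decide from the file name alone; the file is never opened."""
    if name.lower().endswith(DICOM_EXTENSIONS):
        return True
    return bool(NUMERIC_NAME.match(name) or UID_NAME.match(name))


def leaf_folders(paths):
    """Rule 2 helper: folders that hold files but have no subfolders holding files."""
    parents = {PurePosixPath(p).parent for p in paths}
    non_leaf = {ancestor for folder in parents for ancestor in folder.parents}
    return parents - non_leaf


def detect_dicom(paths):
    """Return {path: True/False}; a file must pass both rules to count as DICOM."""
    leaves = leaf_folders(paths)
    return {
        p: name_looks_like_dicom(PurePosixPath(p).name)
        and PurePosixPath(p).parent in leaves
        for p in paths
    }
```

For example, `detect_dicom(["Patient1/Study1/Series1/file1.dcm", "Patient1/98201293"])` marks the first path as DICOM and the second as non-DICOM, because `Patient1/` contains a subfolder and is therefore not leaf-level.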

DICOM Detection Limitations

DICOM file type detection is currently limited in the following ways:

File contents are not opened during type detection; only the file path information is used (names, extensions, etc.)
- DICOMDIR index files are not used
- "Magic Bytes" are not used
ZIP archives are not processed as DICOM (at import time)
- Archives are not extracted during the import process
- The *.dicom.zip extension is not detected as DICOM (at import time)

Limitation: Pre-zipped DICOM

For the purposes of mapping to the Flywheel hierarchy, ZIP files will not be detected as DICOM even if the file name contains *.dicom.zip.

Pre-zipped DICOM files will be handled as non-DICOM files (at import time), and only their source file path information will be used.

This also means DICOM header to Flywheel metadata extraction will not occur at import time for such files.

However, once placed into Flywheel, pre-zipped files whose names end with *.dicom.zip will be classified as DICOM type by Flywheel Core according to Flywheel Core's file type detection rules, thus allowing for DICOM-specific gear rules to be triggered accordingly.

Limitation: Magic Bytes

Flywheel uses a different, stronger mechanism to detect DICOM files for the purposes of de-identification.

In the case of de-identification, Flywheel reads the file content and looks for the File Signature (i.e., "Magic Bytes") indicating if the file is DICOM or not.

The File Signature mechanism is used for de-identification since it is a reliable indicator of file type, and it is critical to ensure all DICOM files are de-identified properly.

However, this mechanism is not used for file type detection in other processes (e.g., determining file placement) for performance and scalability reasons.
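For illustration, a File Signature check can be sketched as below. This shows the standard DICOM file layout (a 128-byte preamble followed by the four bytes `DICM`); the exact check Flywheel performs during de-identification is not described here, and the function name is hypothetical.

```python
DICOM_PREAMBLE_LENGTH = 128   # standard DICOM files begin with a 128-byte preamble
DICOM_PREFIX = b"DICM"        # followed by this 4-byte prefix


def has_dicom_signature(data: bytes) -> bool:
    """Return True if the bytes carry the DICOM prefix right after the preamble."""
    start = DICOM_PREAMBLE_LENGTH
    return data[start:start + len(DICOM_PREFIX)] == DICOM_PREFIX
```

Unlike the name-based rules, this check requires reading at least the first 132 bytes of every file, which is the cost the performance note above refers to.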

For more information about de-identification support with Bulk Import, refer to the Bulk Import De-Identification documentation.

Limitation: DICOMDIR Structure

The DICOMDIR index file is not read or used for DICOM type detection.

Instead, Flywheel makes assumptions about the naming conventions typically used with DICOMDIR structures (e.g., extension-less files with numeric names) to infer if a file may be DICOM.

Step 3: Deriving Destination Hierarchy

Step 3a: DICOM Files

For files detected as DICOM (see Step 2: Detecting DICOM Files), Flywheel opens the files and reads their DICOM header information.

The following sections detail how the DICOM header information is used to determine both:

Flywheel Hierarchy Mappings (Container Labels)
Flywheel Container Metadata

For a complete example, see Example 2: DICOM Files.

Tip

Additional documentation on DICOM header to Flywheel metadata mappings is available in the new (BETA) CLI documentation for the import run command.

DICOM Header to Flywheel Hierarchy Mappings

From DICOM files, Flywheel generates the container labels for the desired destination Flywheel hierarchy according to the following mappings:

Flywheel Container Label	DICOM Header Tag
`subject.label`	`PatientID`
`session.label`	`StudyDescription` fallback to `session.timestamp` fallback to `StudyInstanceUID`
`acquisition.label`	`SeriesNumber - SeriesDescription` fallback to `SeriesNumber - ProtocolName` fallback to `acquisition.timestamp` (formatted as `%Y-%m-%dT%H:%M:%S`) fallback to `SeriesInstanceUID`; the `SeriesNumber - ` prefix is added only if `SeriesNumber` is set
`file.name`	Copied from `acquisition.label`
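The fallback chains in the table above can be read as in the following sketch. It is an interpretation, not Flywheel's code: the header is modeled as a plain dictionary, the function names are hypothetical, the timestamp and UID fallbacks are assumed not to receive the `SeriesNumber` prefix, and the session timestamp is assumed to use the same string format as the acquisition timestamp.

```python
from datetime import datetime
from typing import Optional

TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%S"  # format given for acquisition.timestamp


def session_label(hdr: dict, session_ts: Optional[datetime] = None) -> Optional[str]:
    """StudyDescription, else session.timestamp, else StudyInstanceUID."""
    if hdr.get("StudyDescription"):
        return hdr["StudyDescription"]
    if session_ts is not None:
        # Assumption: same string format as the acquisition timestamp.
        return session_ts.strftime(TIMESTAMP_FORMAT)
    return hdr.get("StudyInstanceUID")


def acquisition_label(hdr: dict, acquisition_ts: Optional[datetime] = None) -> Optional[str]:
    """SeriesDescription, else ProtocolName, else timestamp, else SeriesInstanceUID."""
    number = hdr.get("SeriesNumber")
    # The "SeriesNumber - " prefix is added only when SeriesNumber is set.
    prefix = f"{number} - " if number not in (None, "") else ""
    for tag in ("SeriesDescription", "ProtocolName"):
        if hdr.get(tag):
            return f"{prefix}{hdr[tag]}"
    if acquisition_ts is not None:
        # Assumption: timestamp and UID fallbacks are not prefixed.
        return acquisition_ts.strftime(TIMESTAMP_FORMAT)
    return hdr.get("SeriesInstanceUID")
```

With the headers from Example 2, `acquisition_label({"SeriesNumber": 1, "SeriesDescription": "Chest X-ray"})` yields `1 - Chest X-ray`, which is also used as the file name stem.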

DICOM Header to Flywheel Metadata Mappings

From DICOM files, Flywheel generates additional metadata for the desired destination Flywheel hierarchy according to the following mappings:

Flywheel Container Type	Flywheel Metadata Field	DICOM Header Tag
Subject	`subject.firstname`	Split from `PatientName`
Subject	`subject.lastname`	Split from `PatientName`
Subject	`subject.sex`	`PatientSex`
Session	`session.uid`	`StudyInstanceUID`
Session	`session.age`	`PatientAge` (converted to seconds) fallback to difference between `acquisition.timestamp` and `PatientBirthDate`
Session	`session.weight`	`PatientWeight`
Session	`session.operator`	`OperatorsName`
Session	`session.timestamp`	Combination of `StudyDate` and `StudyTime` fallback to combination of `SeriesDate` and `SeriesTime` fallback to `AcquisitionDateTime` fallback to combination of `AcquisitionDate` and `AcquisitionTime` with respect to `TimezoneOffsetFromUTC`
Acquisition	`acquisition.uid`	`SeriesInstanceUID`
Acquisition	`acquisition.timestamp`	`AcquisitionDateTime` fallback to combination of `AcquisitionDate` and `AcquisitionTime` fallback to combination of `SeriesDate` and `SeriesTime` fallback to combination of `StudyDate` and `StudyTime` with respect to `TimezoneOffsetFromUTC`
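The two timestamp rows differ only in the order in which the tags are tried. The sketch below illustrates that ordering under several assumptions: headers are plain dictionaries of standard DICOM strings (dates as `YYYYMMDD`, times as `HHMMSS` with optional fractional seconds, offsets as `+HHMM` or `-HHMM`), a missing time is treated as midnight, and all names are hypothetical.

```python
import re
from datetime import datetime, timedelta, timezone
from typing import Optional

# Tag groups in the order each table row tries them.
SESSION_ORDER = [("StudyDate", "StudyTime"), ("SeriesDate", "SeriesTime"),
                 ("AcquisitionDateTime",), ("AcquisitionDate", "AcquisitionTime")]
ACQUISITION_ORDER = [("AcquisitionDateTime",), ("AcquisitionDate", "AcquisitionTime"),
                     ("SeriesDate", "SeriesTime"), ("StudyDate", "StudyTime")]


def _parse(date: str, time: str = "") -> datetime:
    # Drop fractional seconds and pad a missing or partial time with zeros.
    hhmmss = time.split(".")[0].ljust(6, "0")
    return datetime.strptime(date + hhmmss, "%Y%m%d%H%M%S")


def _parse_datetime(value: str) -> datetime:
    core = re.split(r"[+-]", value)[0]  # ignore any offset suffix embedded in the value
    return _parse(core[:8], core[8:])


def _utc_offset(hdr: dict) -> Optional[timezone]:
    raw = hdr.get("TimezoneOffsetFromUTC")
    if not raw:
        return None
    sign = -1 if raw[0] == "-" else 1
    return timezone(sign * timedelta(hours=int(raw[1:3]), minutes=int(raw[3:5])))


def resolve_timestamp(hdr: dict, order) -> Optional[datetime]:
    """Return the first timestamp available in `order`, or None if no tag is set."""
    for tags in order:
        if not hdr.get(tags[0]):
            continue
        if len(tags) == 1:
            ts = _parse_datetime(hdr[tags[0]])
        else:
            ts = _parse(hdr[tags[0]], hdr.get(tags[1]) or "")
        return ts.replace(tzinfo=_utc_offset(hdr))
    return None
```

Calling `resolve_timestamp(hdr, SESSION_ORDER)` and `resolve_timestamp(hdr, ACQUISITION_ORDER)` on the same header can return different values, which is why a session and its acquisitions may carry different timestamps.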

Step 3b: Non-DICOM Files

For any file that Flywheel does not detect as DICOM, Flywheel uses the source file path information, including parent folder names, to determine how to organize the files within Flywheel.

In this case, the files are not opened and their content is ignored.

Since the original file path is used to determine where to place non-DICOM files in Flywheel, such non-DICOM files must be organized carefully before importing into Flywheel:

The "root" folder represents a single Project
Each first-level folder (directly inside the Project root) represents a single Subject (i.e., Patient)
Each second-level folder (directly inside the Subject) represents a single Session (i.e., Study)
Each "leaf" (lowest-level) folder (directly inside a Session) represents a single Acquisition
- Each Acquisition is contained in a single "leaf" folder
- Each "leaf" folder contains a single Acquisition

For a complete example, see Example 3: Arbitrary Files (non-DICOM).
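The path-based placement described in this step, together with the attachment rule from Step 1, can be sketched as follows. This is one reading of the rules, not Flywheel's implementation: paths are relative to the Project root, leaf-level folders are inferred from the file list alone, files in a leaf folder that is not exactly three levels deep are treated as unmappable, and the function name is hypothetical.

```python
from pathlib import PurePosixPath

# Container type by folder depth below the Project root (0 = the root itself).
ATTACHMENT_LEVELS = ("project", "subject", "session")


def map_non_dicom(paths):
    """Map each relative file path to (placement, container labels), or None if unmappable."""
    files = [PurePosixPath(p) for p in paths]
    folders = {f.parent for f in files}
    has_subfolder = {ancestor for folder in folders for ancestor in folder.parents}

    placements = {}
    for f in files:
        labels = f.parent.parts  # folder names below the Project root
        if f.parent in has_subfolder:
            # Non-leaf folder: the file is attached to the container the folder represents.
            if len(labels) < len(ATTACHMENT_LEVELS):
                placements[str(f)] = (f"{ATTACHMENT_LEVELS[len(labels)]} attachment", labels)
            else:
                placements[str(f)] = None  # deeper nesting is not covered by this sketch
        elif len(labels) == 3:
            # Leaf folder at Subject/Session/Acquisition depth.
            placements[str(f)] = ("acquisition file", labels)
        else:
            placements[str(f)] = None  # not enough levels to name every container
    return placements
```

Under this reading, a file such as `Patient2/scan-notes-1.txt` in a leaf folder directly under the root maps to `None`, matching the skipped file in Example 4.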

Info

A "leaf-level" folder is any folder that contains only files and not any additional lower-level folders.

Step 4: Grouping & Packaging Files into ZIP Archives

It is strongly recommended that DICOM files are stored in Flywheel Core as ZIP archives (e.g., *.dicom.zip) rather than as loose files (e.g., *.dcm, etc.).

When the source data is loose DICOM files (e.g., *.dcm, etc.), Bulk Import packages the DICOM files into the recommended ZIP archives.

To do this, Bulk Import needs to decide which files to place into which ZIP archive. This process is called "grouping."

By default, Bulk Import groups DICOM files using the following logic:

Non-DICOM files are not packaged into ZIP archives; they are uploaded individually
All DICOM files are grouped into separate ZIP archives by StudyInstanceUID and SeriesInstanceUID
- Groups containing only a single DICOM file are still zipped even though each resulting ZIP archive contains only a single file
DICOM localizer files are separated into their own ZIP archives
- The following DICOM tags are inspected when determining whether a DICOM file is a localizer or not:
  - InstanceNumber
  - ImageOrientationPatient
  - ImagePositionPatient
  - Rows
  - Columns
Each ZIP archive will be named as <acquisition.label>.dicom.zip
- Where acquisition.label is calculated as described in the DICOM Header to Flywheel Hierarchy Mappings section
Each individual DICOM file packaged into a ZIP archive will be renamed as {SOPInstanceUID}.{Modality}.dcm
- This only applies to DICOM files that are zipped. When zipping is disabled, the loose DICOM files will not be renamed
Each ZIP archive will contain a top level folder to avoid polluting the current directory when extracted
- E.g., the contents of 12345.zip are nested inside a folder named 12345 (like 12345.zip/12345/abcdef.MR.dcm), so that when 12345.zip is extracted its contents are neatly organized (like ./12345/abcdef.MR.dcm) and not mixed into the current directory

For complete examples, refer to Example 2 and Example 4.
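As a rough sketch of the grouping and packaging rules in this step, the following Python groups files by `StudyInstanceUID` and `SeriesInstanceUID`, renames members to `{SOPInstanceUID}.{Modality}.dcm`, and nests them under a top-level folder. The input shape (dictionaries with `header` and `data` keys), the function names, and the use of the acquisition label as the top-level folder name are assumptions; localizer separation is omitted.

```python
import io
import zipfile
from collections import defaultdict


def group_by_series(files):
    """Group DICOM files by (StudyInstanceUID, SeriesInstanceUID)."""
    groups = defaultdict(list)
    for f in files:
        hdr = f["header"]
        groups[(hdr["StudyInstanceUID"], hdr["SeriesInstanceUID"])].append(f)
    return dict(groups)


def build_archive(label, files):
    """Return (archive name, ZIP bytes) for one group of DICOM files."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for f in files:
            hdr = f["header"]
            # Members are renamed and nested under a top-level folder
            # (assumed here to be named after the acquisition label).
            member = f"{label}/{hdr['SOPInstanceUID']}.{hdr['Modality']}.dcm"
            archive.writestr(member, f["data"])
    return f"{label}.dicom.zip", buffer.getvalue()
```

A group with a single file still goes through `build_archive`, which mirrors the rule that single-file series are zipped as well.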

Limitation: Split Series

Each DICOM series must be fully contained in a single "leaf-level" input folder.

A "leaf-level" folder is any folder that contains only files and not any additional folders.

Starting with version 20.5, a single source folder can contain multiple DICOM series. However, each DICOM series must still be fully contained within a single source folder.
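Before importing, it can be useful to check the source data for split series. The helper below is a hypothetical pre-import check, not part of Bulk Import; it assumes the source folder and `SeriesInstanceUID` of each DICOM file have already been collected.

```python
from collections import defaultdict


def find_split_series(files):
    """Given (folder, SeriesInstanceUID) pairs, return series found in more than one folder."""
    folders_by_series = defaultdict(set)
    for folder, series_uid in files:
        folders_by_series[series_uid].add(folder)
    return {
        uid: sorted(folders)
        for uid, folders in folders_by_series.items()
        if len(folders) > 1
    }
```

An empty result means every series is contained in a single folder; a folder holding several different series does not appear in the result, consistent with the behavior described for version 20.5 and later.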

Step 5: Uploading Files to Flywheel

Files can be uploaded to the following locations within Flywheel:

Project, Subject, or Session attachments
- Review the section on Container Attachments
Acquisition container contents
- Review the sections on DICOM and Non-DICOM file handling

If any conflicts or duplication scenarios are encountered while uploading the files to Flywheel, the affected files will be quarantined and flagged for manual review.

Refer to the documentation on Bulk Import Conflict Handling for more details.

Examples

Example 1: Container Attachments

Consider the following source data structure:

s3://myDataBucket/
├── objectives-1.csv
├── Patient123/
    ├── consent-form-1.pdf
    ├── Study20220423/
        ├── tech-notes-1.txt
        └── Series1/
    └── Study20230122/
└── Patient456/
    └── Study20221103/

With the default behavior, this source data would be imported into Flywheel as:

fw://ACME Research/  (Project)
├── objectives-1.csv  (Project attachment)
├── Patient123/  (Subject)
    ├── consent-form-1.pdf  (Subject attachment)
    ├── Study20220423/  (Session)
        ├── tech-notes-1.txt  (Session attachment)
        └── Series1/  (Acquisition)
    └── Study20230122/  (Session)
└── Patient456/  (Subject)
    └── Study20221103/  (Session)

Note how the following files are imported based on where they were located in the source folder structure:

objectives-1.csv is imported as an attachment to the Destination Project
consent-form-1.pdf is imported as an attachment to the Subject labeled Patient123 in the Destination Project
tech-notes-1.txt is imported as an attachment to the Session labeled Study20220423 of the Subject labeled Patient123 in the Destination Project

Example 2: DICOM Files

Consider the following source data structure:

s3://myDataBucket/
├── Patient1/
    ├── Study20220423/
        └── Series1/
            ├── file1.dcm
            └── file2.ima
        └── Series2/
            └── file3.dicom
    └── Study20230122/
└── Patient2/
    ├── 9572012
    └── 0012893

Where the DICOM files contain the following headers:

DICOM Header	`file1.dcm`	`file2.ima`	`file3.dicom`	`9572012`	`0012893`
`PatientID`	Subj123	Subj123	Subj123	Subj456	Subj456
`StudyInstanceUID`	1234	1234	1234	5678	5678
`StudyDescription`	Timepoint1	Timepoint1	Timepoint2	Timepoint1	Timepoint1
`SeriesInstanceUID`	9876	9876	7654	3210	3210
`SeriesNumber`	1	1	1	4	4
`SeriesDescription`	Chest X-ray	Chest X-ray	Head CT	PET scan	PET scan
`SOPInstanceUID`	abc123	abc456	def789	ghi012	jkl345

With the default behavior, this source data would be imported into Flywheel as:

fw://ACME Research/ (Project)
├── Subj123 (Subject)
    └── Timepoint1 (Session)
        └── 1 - Chest X-ray (Acquisition)
            └── "1 - Chest X-ray.dicom.zip" (File)
    └── Timepoint2 (Session)
        └── 1 - Head CT (Acquisition)
            └── "1 - Head CT.dicom.zip" (File)
└── Subj456 (Subject)
    └── Timepoint1 (Session)
        └── 4 - PET scan (Acquisition)
            └── "4 - PET scan.dicom.zip" (File)

Where the ZIP file contents are:

"1 - Chest X-ray.dicom.zip"
├── "abc123.XA.dcm" (renamed from "file1.dcm")
└── "abc456.XA.dcm" (renamed from "file2.ima")
"1 - Head CT.dicom.zip"
└── "def789.CT.dcm" (renamed from "file3.dicom")
"4 - PET scan.dicom.zip"
├── "ghi012.PT.dcm" (renamed from "9572012")
└── "jkl345.PT.dcm" (renamed from "0012893")

Note a few nuances:

Header-based mappings: The destination Flywheel hierarchy structure is based entirely on the DICOM headers and not on any of the source file path information
Loose organization: There is no rule on how many parent folders a DICOM series can be contained within. Compare the contents of the Patient1/ and Patient2/ folders from the source data, for example
No split DICOM series: The only restriction on source data organization for DICOM files is that each DICOM series must be fully contained in exactly 1 leaf-level source folder
Single-file Handling: DICOM series containing only 1 file are zipped

Example 3: Arbitrary Files (non-DICOM)

Consider the following source data structure:

s3://myDataBucket/
├── Patient123/
    ├── Study20220423/
        └── Series1/
            ├── formA.pdf
            ├── report09.csv
            └── ...
    └── Study20230122/
└── Patient456/
    └── Study20221103/

With the default behavior, this source data would be imported into Flywheel as:

fw://ACME Research/ (Project)
├── Patient123/ (Subject)
    ├── Study20220423/ (Session)
        └── Series1/ (Acquisition)
            ├── formA.pdf
            ├── report09.csv
            └── ...
    └── Study20230122/ (Session)
└── Patient456/ (Subject)
    └── Study20221103/ (Session)

Note a few nuances:

Container Labels: The source folder names are used as the container labels (e.g., "Patient123" is the Subject label)
No Zipping: The files are not grouped together and are stored in Flywheel individually as-is
Arbitrary Files Types: Any type of file can be imported (not only DICOM). The uploaded file will have its type automatically set in Flywheel according to the matching rules for File Types in Flywheel Core

Example 4: Mixed Types

Consider the following source data structure:

s3://myDataBucket/
├── objectives-1.csv
├── Patient1/
    ├── consent-form-1.pdf
    ├── Study20220423/
        ├── tech-notes-1.txt
        └── Series1/
            ├── file1.dcm
            └── file2.ima
        └── Series2/
            └── file3.dicom
        └── Series3/
            ├── formA.pdf
            ├── report09.csv
            └── ...
    └── Study20230122/
└── Patient2/
    ├── scan-notes-1.txt
    ├── 9572012
    └── 0012893

Where the DICOM files contain the following headers:

DICOM Header	`file1.dcm`	`file2.ima`	`file3.dicom`	`9572012`	`0012893`
`PatientID`	Subj123	Subj123	Subj123	Subj456	Subj456
`StudyInstanceUID`	1234	1234	1234	5678	5678
`StudyDescription`	Timepoint1	Timepoint1	Timepoint2	Timepoint1	Timepoint1
`SeriesInstanceUID`	9876	9876	7654	3210	3210
`SeriesNumber`	1	1	1	4	4
`SeriesDescription`	Chest X-ray	Chest X-ray	Head CT	PET scan	PET scan
`SOPInstanceUID`	abc123	abc456	def789	ghi012	jkl345

With the default behavior, this source data would be imported into Flywheel as:

fw://ACME Research/ (Project)
├── objectives-1.csv  (Project attachment)
├── Subj123/ (Subject)
    ├── consent-form-1.pdf  (Subject attachment)
    ├── Timepoint1/ (Session)
        ├── tech-notes-1.txt  (Session attachment)
        └── 1 - Chest X-ray (Acquisition)
            └── "1 - Chest X-ray.dicom.zip" (File)
    └── Timepoint2/ (Session)
        └── 1 - Head CT (Acquisition)
            └── "1 - Head CT.dicom.zip" (File)
├── Patient1/ (Subject)
    └── Study20220423/ (Session)
        └── Series3/ (Acquisition)
            ├── formA.pdf
            ├── report09.csv
            └── ...
└── Subj456/ (Subject)
    └── Timepoint1/ (Session)
        └── 4 - PET scan/ (Acquisition)
            └── "4 - PET scan.dicom.zip" (File)

Where the ZIP file contents are:

"1 - Chest X-ray.dicom.zip"
├── "abc123.XA.dcm" (renamed from "file1.dcm")
└── "abc456.XA.dcm" (renamed from "file2.ima")
"1 - Head CT.dicom.zip"
└── "def789.CT.dcm" (renamed from "file3.dicom")
"4 - PET scan.dicom.zip"
├── "ghi012.PT.dcm" (renamed from "9572012")
└── "jkl345.PT.dcm" (renamed from "0012893")

Note a few nuances:

Skipped Files: scan-notes-1.txt is skipped and not uploaded to Flywheel, because there is "no matching rule"
- Specifically, this file is not detected as DICOM, and its parent directory structure does not contain enough levels to represent the desired Flywheel hierarchy, so there is not enough information to map it to Flywheel
Additional Containers: There are 3 Subject containers in the destination Flywheel hierarchy even though the source folder structure appears to have only 2 patient folders
- This is because the Patient1/ source folder contained some folders with DICOM files and some folders containing non-DICOM files
- The DICOM files mapped to Subj123/ based on the PatientID header
- The non-DICOM files mapped to Patient1/ based on the file path information
- To combine these containers into just Subj123/, the source folder must be renamed to Subj123/ to match the PatientID header of the contained DICOM files
- Note that this same behavior cascades down to the Session and Acquisition containers as well