Step 2: De-identify Data
Introduction
To avoid exposing PHI, consider how you may need to de-identify your data before uploading it to Flywheel. In Flywheel, data is de-identified using a de-id profile.
Instruction Steps
What is the de-id profile?
A de-id profile is a set of instructions for what to do with file metadata that may include PHI.
The de-id profile options apply de-identification transformations in 2 different ways:
- File settings: At this level, the de-identification settings apply to all DICOM data by default. For example, you can use the remove-private-tags option to remove all non-standard DICOM tags. See the reference guide for more possible file settings.
- Field settings: Use DICOM keywords or tags to de-identify specific fields. These settings take precedence over the file settings, so you can set remove-private-tags to true at the file level but still keep a specific custom tag by applying the keep field transformation to it. See the reference guide for more possible field options.
Note
You can skip this portion of the config file if you are uploading to a location that already de-identifies data using a site, group, or project de-id profile or if you do not wish to de-identify your data.
Add a De-id Profile Section to the Config File
- Below the template section, add a new section for the de-id profile. The following example config file contains both the template and de-id profile sections:

    #####
    # Template Settings
    #####
    template:
      - pattern: "{subject}"
      - pattern: "{session}"
      - pattern: "{acquisition}"
        scan: dicom

    ####
    # De-identification settings
    ####
    name: Profile1
    dicom:
      # date-increment controls how many days to offset each date field
      # where increment-date (shown below) is configured.
      # Positive values result in later dates, negative
      # values result in earlier dates.
      date-increment: -17

      # patient-age-from-birthdate sets the PatientAge DICOM header as a 3-digit
      # value with a suffix indicating units. For example, an age in days would
      # be 091D, and that same age in months would be 003M. By default, if
      # the age fits in days, then days will be used,
      # otherwise if it fits in months, then months
      # will be used, otherwise years will be used.
      patient-age-from-birthdate: true

      # All data elements not defined in the fields section of the profile will be removed.
      # If any field references a nested element in a sequence, the whole sequence element
      # will be kept.
      remove-undefined: true

      # Set patient age units as years. Other options include months (M) and days (D).
      patient-age-units: Y

      # The following are field transformations.
      # remove, replace-with, increment-date, hash, and hashuid can be used with any DICOM
      # field. Replace name with the DICOM field "keyword" defined by the DICOM standard.
      fields:
        # remove deletes the field from the DICOM entirely.
        # If removal is not supported, the field will be blank.
        # PatientID is required by Flywheel (see the warning below), so this
        # example replaces its value with REDACTED instead of removing it.
        - name: PatientID
          replace-with: REDACTED
        # Replace a DICOM field with the value provided.
        # This example replaces "StationName" with "XXXX" in Flywheel.
        - name: StationName
          replace-with: XXXX
        # Offsets the date by the number of days defined in
        # the date-increment setting above, preserving the time
        # and timezone.
        # In this example, StudyDate appears as 17 days earlier.
        - name: StudyDate
          increment-date: true
        # You can refer to fields by their DICOM tag or keyword.
        # Applies a one-way hash to a unique string.
        - name: (0008,0050)
          hash: true
        # Replaces a UID field with a hashed version of that
        # field. The first four nodes (prefix) and last node
        # (suffix) will be preserved, with the middle being
        # replaced by the hashed value.
        - name: ConcatenationUID
          hashuid: true
        # The fields below are listed so that they are not removed as part of the
        # remove-undefined setting above.
        - name: SeriesInstanceUID
        - name: Modality
        - name: SeriesNumber
        - name: ScheduledProcedureStepID
        - name: RequestedProcedureID
        - name: StudyTime
        - name: StudyID
        - name: StudyInstanceUID
        - name: ProtocolName
        - name: AcquisitionDate
          increment-date: true
        - name: AcquisitionDateTime
        - name: AcquisitionTime
        - name: SeriesDate
          increment-date: true
        - name: SeriesTime
- Update the template to fit your dataset by adding or removing fields or updating the transformation options.
Warning
Flywheel requires the following DICOM data elements to sort and label data when uploaded via the CLI. Make sure that your de-id template does not remove these fields. However, you can transform them by incrementing or replacing the value.
The OHIF Viewer
The OHIF viewer requires the following tags:
Keyword | Tag | VR |
SeriesInstanceUID | (0020,000E) | UI |
StudyInstanceUID | (0020,000D) | UI |
Modality | (0008,0060) | CS |
SeriesNumber | (0020,0011) | IS |
ScheduledProcedureStepID | (0040,0009) | SH |
RequestedProcedureID | (0040,1001) | SH |
StudyDate | (0008,0020) | DA |
StudyTime | (0008,0030) | TM |
StudyID | (0020,0010) | SH |
PatientID | (0010,0020) | LO |
Flywheel Hierarchy Field Mappings to DICOM Tags
Keyword | Tag | Flywheel Field | VR |
Study Comments* | (0032,4000)* | Group ID, Project Label, Subject ID | LT |
Study Instance UID | (0020,000D) | Session UID | UI |
Study Description | (0008,1030) | Session Label | LO |
Series Instance UID | (0020,000E) | Acquisition UID | UI |
Protocol Name | (0018,1030) | Acquisition Label | LO |
*You can customize which DICOM field(s) Flywheel uses to capture Routing String values (Group, Project, and Subject).
Flywheel Field Mappings to DICOM Tags
If these values are not present in the DICOM tags, then the Flywheel timestamp will be blank. If one set of DICOM tag values (acquisition or series) is not present, Flywheel tries to use the other.
Keyword | Tag | Flywheel Field | VR |
Acquisition Date | (0008,0022) | Timestamp | DA |
Acquisition DateTime | (0008,002A) | Timestamp | DT |
Acquisition Time | (0008,0032) | Timestamp | TM |
Series Date | (0008,0021) | Timestamp | DA |
Series Time | (0008,0031) | Timestamp | TM |
Click Next to learn how to include or exclude folders and files.
De-identification Profile Settings Reference
This section explains what each de-identification setting does and shows an example of how data is de-identified. The settings are divided into file settings and field settings.
File Settings
These settings are applied to ALL DICOM data by default. They offer broad strokes of de-identification. You can override them for specific tags using field settings.
date-format
(string)
Describes what format Flywheel should expect for dates in the metadata of the file. This enables Flywheel to properly parse the date. Only use this setting if the date fields in your data have a format that differs from the DICOM default, %Y%m%d.
The format interpretation follows the format codes that the 1989 C standard requires.
Example 1
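As a minimal sketch, assume a hypothetical dataset whose date fields are stored as 2015-02-15 rather than in the DICOM default form 20150215. The format string, the increment value, and the choice of StudyDate are illustrative assumptions, not values taken from a real profile:

```yaml
dicom:
  # Assumed input format: dates written as YYYY-MM-DD (hypothetical data)
  date-format: "%Y-%m-%d"
  # Illustrative offset so the date transformation below has an effect
  date-increment: -17
  fields:
    - name: StudyDate
      increment-date: true
```

The date-format setting only tells Flywheel how to parse the string. The increment-date transformation is what actually shifts the value.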
datetime-format (string)
Describes what format Flywheel should expect for datetimes in the metadata of the file. This enables Flywheel to properly parse the datetime. Only use this setting if the datetime fields in your data have a format that differs from the DICOM default, %Y%m%d%H%M%S.%f.
The format interpretation follows the format codes that the 1989 C standard requires.
Example 2
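A comparable sketch for datetimes, assuming hypothetical data that stores values such as 2015-02-15 13:45:00. The format string and the field chosen are assumptions for illustration only:

```yaml
dicom:
  # Assumed input format: YYYY-MM-DD HH:MM:SS (hypothetical data)
  datetime-format: "%Y-%m-%d %H:%M:%S"
  # Illustrative offset applied by increment-datetime below
  date-increment: -17
  fields:
    - name: AcquisitionDateTime
      increment-datetime: true
```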
date-increment (numeric)
Controls how many days to offset each date or datetime field for which the increment-date (true/false) or increment-datetime (true/false) transformation is enabled.
- Positive values result in later dates.
- Negative values result in earlier dates.
- Incrementing by a multiple of 7 keeps the day of the week consistent for shifted dates.
- Incrementing by a non-integer value also modifies the time of a datetime element (for example, 0.5 shifts a datetime by 12 hours).
Example 3
Tag | Original metadata | De-identified metadata |
---|---|---|
StudyDate(0008,0020) | 20150215 | 20150210 |
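The table above does not state the increment used. A shift from 20150215 to 20150210 is five days earlier, so a profile along these lines would be consistent with it. The value -5 is inferred from the table rather than quoted from the original example:

```yaml
dicom:
  # Inferred from the table: 20150215 shifted 5 days earlier gives 20150210
  date-increment: -5
  fields:
    - name: StudyDate
      increment-date: true
```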
jitter-range (numeric)
The range used to offset a value by a random number. The random offset is in [-jitter-range, +jitter-range]. Use jitter-type to make the random number an integer.
jitter-range can also be set at the field level.
The default jitter-range is 2.
Example 4
Tag | Original metadata | De-identified metadata |
---|---|---|
PatientWeight (0010,1030) | 54.43 | 60 |
jitter-type (int/float)
Draws a random number from a uniform distribution set by jitter-range. You can configure the random number to be an integer (int) or a floating-point number (float). jitter-type can also be used at the field level.
Default is float.
Example 5
Tag | Original metadata | De-identified metadata |
---|---|---|
PatientWeight (0010,1030) | 54.43 | 44.43 |
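A sketch of how the jitter settings might be combined. The range of 10 and the int type are illustrative assumptions chosen so that an offset such as the -10 in Example 5 is possible. Because the offset is random, the result differs from run to run:

```yaml
dicom:
  # Illustrative values, not defaults
  jitter-range: 10
  jitter-type: int
  fields:
    - name: PatientWeight
      jitter: true
```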
patient-age-from-birthdate (true/false)
When set to true, this sets the PatientAge DICOM header to a 3-digit value with a suffix indicating units.
For example, an age in days would be 091D, and that same age in months would be 003M. By default, the age is set using a best-fit approach: if the age fits in days, days are used; otherwise, if it fits in months, months are used; otherwise, years are used.
Default is false.
Example 6
Tag | Original metadata | De-identified metadata |
---|---|---|
PatientAge(0010,1010) | 97 | 097Y |
patient-age-units (string)
When set in conjunction with patient-age-from-birthdate, this acts as a preference for which units to use. If the value does not fit into the desired unit, the next level of units is used.
The most common use for this field is to always express the patient age in years. Valid values are D, M, and Y for days, months, and years, respectively.
Example 7
Tag | Original metadata | De-identified metadata |
---|---|---|
PatientAge(0010,1010) | 97 | 097Y |
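The two age settings are meant to be used together. A minimal sketch, using only the setting names described above:

```yaml
dicom:
  # Derive PatientAge from the birth date
  patient-age-from-birthdate: true
  # Prefer years; the next unit is used if the value does not fit
  patient-age-units: Y
```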
remove-private-tags (true/false)
When set to true, private DICOM tags are removed. Private DICOM tags are any tags not included in the standard DICOM data elements.
Default is false.
Example 8
Tag | Original metadata | De-identified metadata |
---|---|---|
MyPrivateTag(1235,0042) | Acme Inc | Blank (this field and value are not added to Flywheel metadata) |
recurse-sequence (true/false)
When set to true, each element of a sequence (VR=SQ) is processed according to the profile, recursively for all nested sequence elements. When used with the remove-undefined: true setting, any sequence or sequence element defined under fields results in the de-id profile being applied to the full sequence.
Default is false (only the top-level field for the sequence is processed by the de-id profile).
Note
When using this option, the fields section of the profile must not define fields that act on elements of sequences or that use regex.
Example 9
When recurse-sequence is set to true, AccessionNumber elements within all DICOM sequences are also de-identified according to the profile. You do not need to call out each tag individually.
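A sketch of a profile matching this description. The hash transformation is an assumption picked for illustration, since the original example does not say which transformation is applied to AccessionNumber:

```yaml
dicom:
  recurse-sequence: true
  fields:
    # Applied to the top-level element and, with recurse-sequence,
    # to AccessionNumber elements nested inside sequences
    - name: AccessionNumber
      hash: true
```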
remove-undefined (true/false)
This is also called a "keeplist". When set to true, all data elements not defined in the fields section of the profile are removed. If any field references a nested element in a sequence, the whole sequence element is kept. Default is false.
Warning
When using this option, pay particular attention to the de-id profile to guarantee that the output DICOM still contains the mandatory data elements according to its Information Object Definition (IOD).
Example 10
    dicom:
      remove-private-tags: true
      remove-undefined: true
      fields:
        - name: PatientID
          replace-with: MY_PATIENT_ID
        - name: StudyInstanceUID
          hashuid: true
        - name: PatientName
          replace-with: REDACTED

Tag | Original metadata | De-identified metadata |
---|---|---|
PatientID | 0345 | MY_PATIENT_ID |
StudyInstanceUID | 1.2.840.113619.6.283.4.983142589.7316.1300473420.841 | 1.2.840.113619.551726.420312.177022.222461.230571.501817.841 |
PatientName | Smith John | REDACTED |
AcquisitionNumber | 1 | Blank (not defined under fields, so it is removed) |
StudyID | 4912 | Blank (not defined under fields, so it is removed) |
replace-with-insert (true/false)
If true, replace-with actions insert the field into the record if it does not already exist and set its value. If false, replace-with does not insert the field if it does not already exist in the record.
Default is true.
Example 11
Tag | Original metadata | De-identified metadata |
---|---|---|
StudyTime(0008,0030) | blank (no data) | blank (no data). If replace-with-insert were not set to false, Flywheel would add REDACTED instead of leaving the field blank. |
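A profile along these lines would be consistent with the table above. It is a sketch assembled from the setting names in this reference, not the original example:

```yaml
dicom:
  # Do not create fields that are missing from the record
  replace-with-insert: false
  fields:
    - name: StudyTime
      replace-with: REDACTED
```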
Fields
The fields portion of the de-id profile allows you to reference a specific DICOM data element and perform transformations on it. Field entries follow this format:

    dicom:
      fields:
        - name: <DICOM data element>
          <field transformation>: <value>
          <field transformation>: <value>

In this section you will find the different ways you can reference a specific DICOM field, tag, or keyword.
How to Reference a DICOM Data Element
This file profile supports 3 ways to reference a DICOM data element: keyword, tag, or dotty-notation.
Note
The data elements in the DICOM File Meta information located in the optional 128 bytes of the DICOM File Preamble can be accessed in the same way as other tags.
Keyword
The keyword string as defined in the public DICOM dictionaries. For example: PatientName, SOPClassUID, and AcquisitionDate.
Tag
When referencing the DICOM tag, you can use any of these notations:
Notation | Format |
---|---|
Tuple | (0010, 0010) |
Hexadecimal | '00100010' (ggggeeee) or '0x00100010' (0xggggeeee) |
Private tag notation | (GGGG, PrivateCreatorName, EE) |
Repeater group notation | Hexadecimal: 50XX####, 0x50XX####, 60XX####, 0x60XX####. Tuple: (50XX,####), (60XX,####) |
Private tag notation: The private tag creator element is used to validate the organization of private tags. The creator value must match, so a reference such as (0009,"NOT_GEMS_IDEN_01", 04) is ignored when it does not. See the official DICOM documentation for more information on private tags. Flywheel relies on predefined private dictionaries, built from pydicom's _private_dict.py and flywheel-metadata, to infer the tag VR.
Repeater group notation: You can apply a field transformation to a range of DICOM elements or groups. This is available only for groups in the ranges (50XX, EEEE) and (60XX, EEEE). See DICOM's documentation for more information on repeating groups.
Dotty-notation
Notation for referencing an element within a DICOM sequence. You can use a mix of keywords and tags.
For example: AnatomicRegionSequence.0.CodeValue, 00082218.0.00080102, AnatomicRegionSequence.0.00080104
In addition, the dotty-notation supports the use of * to reference all indices of the sequence element at once, for example AnatomicRegionSequence.*.CodeValue.
The notation supports referring to data elements at any depth, recursively.
# nested tags (*) meaning 'for all indexes in sequence'
- name: ReferencedImageSequence.*.ReferencedSOPClassUID
# (0008, 1140).*.(0008, 1150)
# nested tag with specific index in sequence
- name: ReferencedPerformedProcedureStepSequence.1.ReferencedSOPInstanceUID
# (0008, 1111).1.(0008, 1155)
Field Transformations
Once you reference a DICOM field, you can apply a field transformation to it. These field transformations override any file settings set above.
hash
(true/false)
Replace the contents of the field with a one-way cryptographic hash in hexadecimal form. Only the first 16 characters of the hash will be used, in order to support short strings.
hashuid
(true/false)
Replaces a UID field with a hashed version of that field. By default, the first four nodes (prefix) and last node (suffix) will be preserved, with the middle being replaced by the hashed value.
Either of those can be applied at the field level or at the global level.
Tag | Original metadata | De-identified metadata |
---|---|---|
AccessionNumber | 1.2.840.113619.6.283.4.983142589.7316.1300473420.841 | 1.2.840.113619.551726.420312.177022.222461.230571.501817.841 |
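A field-level sketch of the two hashing transformations. The pairing of fields with transformations is illustrative:

```yaml
dicom:
  fields:
    # One-way hash; only the first 16 hexadecimal characters are used
    - name: AccessionNumber
      hash: true
    # UID-aware hash; prefix and suffix nodes are preserved by default
    - name: StudyInstanceUID
      hashuid: true
```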
increment-date
(true/false)
Offsets the date by the number of days defined in the date-increment setting, preserving the time and timezone. If the date format does not match the DICOM default, %Y%m%d, tell Flywheel what format to expect by using the date-format setting (the date-format setting helps Flywheel parse the string).
You can apply either of these settings at the global level as well as at the field level.
Warning
You are responsible for setting a date-format which is valid for the file type being processed.
increment-datetime (true/false)
Offsets the datetime by the number of days defined in the date-increment setting of the file profile, preserving the time and timezone. If the format of the datetime fields in your data does not match the DICOM default, %Y%m%d%H%M%S.%f, tell Flywheel what format to expect by using the datetime-format setting (the datetime-format setting helps Flywheel parse the string).
You can apply either of these settings at the global level as well as at the field level.
Warning
You are responsible for setting a datetime-format which is valid for the file type being processed.
jitter (true/false)
Offsets a numeric (integer or float) value by a random value drawn from a uniform distribution centered on 0. By default, Flywheel uses integers and the range [-1,1]. You can change this range with jitter-range. You can also set whether the random value is an integer or a floating-point number with jitter-type. Both can be set at the field level or the global level.
Additional Considerations
Additional DICOM constraints apply to DICOM data elements based on VR:
- If jitter-type is float and the VR is IS, UL, or US, the jittered value is converted to an integer.
- If the VR is UL or US (unsigned long, unsigned short) and the jittered value is less than 0, the value is set to 0.
- If the VR is US and the jittered value is greater than 65535, the value is set to 65535.
keep (true/false)
Used when creating a keeplist with remove-undefined: true, or to override global or file de-identification settings.
Note
If name is the only key defined in a field configuration, Flywheel defaults to keep: true.
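A sketch of keep used to override a file setting. The private tag shown is hypothetical and written in the private tag notation described earlier. Substitute the group, creator name, and element from your own data:

```yaml
dicom:
  # Remove private tags by default
  remove-private-tags: true
  fields:
    # Hypothetical private tag that should survive de-identification
    - name: '(0009, "GEMS_IDEN_01", 04)'
      keep: true
    # A bare name is treated the same as keep: true
    - name: Modality
```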
replace-with (string)
Replaces the contents of the field with the value provided. Be aware of the length of the field being replaced, because some DICOM fields only support a limited number of characters. By default, the field is created in the Flywheel record if it does not exist. This behavior can be reversed by setting replace-with-insert: False at the profile or field level.
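A sketch of the field-level form, assuming that replace-with-insert sits beside replace-with in the same field entry, as the description above suggests:

```yaml
dicom:
  fields:
    - name: StationName
      replace-with: XXXX
      # Leave StationName absent if the record does not already contain it
      replace-with-insert: false
```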