Bulk Import by Reference Only (Reference-in-Place)
To minimize cloud costs, files may be imported into Flywheel by reference only ("reference-in-place import"), avoiding copying.
When this feature is used, files are not copied into Flywheel for storage and are kept in the source location. Instead, Flywheel only reads the source files to extract metadata and stores a reference, thus treating the source location as primary storage.
Availability
Reference-in-place import is available on all three major cloud providers supported by Bulk Import: AWS, Azure, and GCP.
The reference-in-place import feature uses the Beta Import process, so the project_import feature flag must be enabled. This feature flag is disabled by default for customer sites.
Limitations
There are several limitations of reference-in-place import which must be considered before deciding to use this feature.
Warning
If these limitations are not taken into account, the cost savings are likely to be smaller than expected, or the feature may not take effect at all.
Immutability & Durability
The source files must be guaranteed to be immutable and durable. Specifically, this means that the source files must not be modified or deleted by any external system after the import into Flywheel is complete.
If the source files are modified or deleted by an external system after the import by reference is complete, Flywheel will be unaware of the change. This may result in discrepancies where the metadata in Flywheel no longer describes the files correctly. It may also result in "dangling references", where Flywheel still lists a file whose content no longer exists.
Warning
If either immutability or durability cannot be guaranteed, then import by reference is strongly discouraged.
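The two failure modes described above can be illustrated with a minimal sketch. It is not a Flywheel feature or API: it assumes you have recorded a size and checksum for each source file at import time and can obtain the same values from the source location later. The names `FileState`, `classify_reference`, and `audit` are hypothetical.

```python
from typing import Dict, Optional, Tuple

# (size_in_bytes, checksum) for one file. Hypothetical layout, not a Flywheel type.
FileState = Tuple[int, str]


def classify_reference(recorded: FileState, current: Optional[FileState]) -> str:
    """Compare the state recorded at import time with the current source state."""
    if current is None:
        # The file no longer exists in the source location.
        return "dangling"
    if current != recorded:
        # The file still exists but its size or checksum has changed.
        return "modified"
    return "intact"


def audit(recorded: Dict[str, FileState],
          current: Dict[str, FileState]) -> Dict[str, str]:
    """Classify every recorded path against a listing of the source location."""
    return {
        path: classify_reference(state, current.get(path))
        for path, state in recorded.items()
    }
```

A periodic check of this kind only detects a violation after the fact. It does not replace a guarantee of immutability and durability on the source location.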
Pre-packaged
The source files must already be packaged the way they are intended to be stored within Flywheel. Specifically, the import process cannot ZIP files and must leave them as-is.
This is because the process of packaging source files into ZIP archives results in new files being created (the ZIP files) by Flywheel, and so the original source data is no longer relevant.
In this case, there is no added benefit for Flywheel to push the new files (the ZIP files) back to the source location, and Flywheel simply stores the new files in its internal storage location instead.
Info
For this reason, when grouping and archiving (zipping) is required, import by reference will not apply and a standard import will be performed instead.
Pre De-identified
If Flywheel is not approved to handle PHI, then PHI must be removed before files are placed into the source location.
This is because the process of de-identification results in a modification to the source files, but since the source files are assumed to be immutable, new files are created instead.
In this case, there is no added benefit for Flywheel to push the new files (the de-identified files) back to the source location, and Flywheel simply stores the new files in its internal storage location instead.
Info
For this reason, when de-identification is required, import by reference will not apply and a standard import will be performed instead.
Copies for Changes
Any changes that are made to the files within Flywheel after import (e.g., curation via a Gear or otherwise) will result in new files being created, because the source files are assumed to be immutable.
In this case, there is no added benefit for Flywheel to push the new files (the modified files) back to the source location, and Flywheel simply stores the new files in its internal storage location instead.
Info
Although import by reference can still be used, the benefits start to erode as files are modified and local copies are created and stored within Flywheel.
Usage
To perform a standard import (not by reference), there are two steps required:
- Register the source location as an "External Storage" within Flywheel
- Launch an import specifying the new External Storage as the data source
To perform an import by reference, the only difference is in how the external storage location is registered in Flywheel:
- Register the source location as a "Storage Provider" within Flywheel Core
  - This action is not available via the CLI or Web App and instead requires a direct API call to Flywheel Core.
- Register the new Storage Provider as an "External Storage" within Flywheel
  - This action is available via the Beta CLI but not via the Web App.
- Launch an import specifying the new External Storage as the data source
Note that in the case of import by reference, the "External Storage" configuration does not specify a cloud storage location but rather a "Flywheel Storage Provider". The Flywheel Storage Provider configuration references the cloud storage location.
Also note that the actual task of launching the import is exactly the same -- there is no special option that needs to be specified to tell the import process which approach to use. The import process always attempts a reference-in-place import when a set of conditions holds.
For example, if the "External Storage" does not point to a "Storage Provider", then the import process will not attempt to perform an import by reference. Similarly, if the user chooses to group source data into ZIP archives, then the import process will not attempt to perform an import by reference.
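The conditions named on this page can be summarized in a short sketch. It models only the rules stated here (the External Storage is backed by a Storage Provider, no ZIP grouping, no de-identification) and is not Flywheel's actual implementation, which may check more. `ImportRequest` and `uses_reference_in_place` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class ImportRequest:
    """Hypothetical summary of an import; not a Flywheel data structure."""
    storage_backed_by_provider: bool  # External Storage points to a Storage Provider
    zip_grouping: bool                # source data is grouped into ZIP archives
    deidentify: bool                  # de-identification is applied during import


def uses_reference_in_place(req: ImportRequest) -> bool:
    """Return True only when every condition named on this page holds."""
    return (
        req.storage_backed_by_provider
        and not req.zip_grouping
        and not req.deidentify
    )
```

If any one condition fails, the sketch returns False, which corresponds to the standard import described above.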