Derivatives are generated within FrameTree by modular "pipelines". Pipeline
outputs are connected to sink columns (see Columns). Pipeline inputs can draw
data from either source columns or sink columns containing derivatives generated by prerequisite
pipelines. By connecting pipeline inputs to the outputs of other pipelines,
complex processing chains/webs can be created (reminiscent of a makefile),
in which intermediate products will be stored in the dataset for subsequent
analysis.
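For example, a minimal sketch of such a chain using the Python API introduced below (the `segment` and `surface_stats` workflows and the column names are hypothetical):

    # Hypothetical two-stage chain: the second pipeline draws its input
    # from the sink column of the first, so deriving 'surface_stats'
    # will also run 'segmentation' on rows where it hasn't run yet
    frameset.apply(
        name='segmentation',
        workflow=segment,
        inputs=[('in_file', 'T1w')],
        outputs=[('out_file', 'segmentation')])

    frameset.apply(
        name='surface_stats',
        workflow=surface_stats,
        inputs=[('in_file', 'segmentation')],
        outputs=[('out_file', 'surface_stats')])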
FrameTree uses the Pydra dataflow engine under the hood, and Pydra tasks or workflows
can be "applied" to a dataset, where they will be wrapped by a pipeline. Alternatively,
shell commands can be wrapped using the generic frametree.common.shell() task
(see the sketch after the following list). Pipelines can be applied to the dataset
when it is created and then run incrementally as the data are acquired, ensuring
the same parameters are used consistently. Additional management features that
FrameTree pipelines provide are:
- iteration logic over the dataset
- storage and retrieval of data to and from the data store
- conversion between mismatching file formats
- provenance tracking
- consistent parameterisations and software versions
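As a rough sketch of the shell-command route (the `cmd`, `inputs` and `outputs` argument names are assumptions for illustration, not confirmed API; consult the frametree.common.shell documentation):

    from frametree.common import shell

    # Hypothetical: wrap FSL's `bet` brain extraction so it can be
    # applied to a dataset; the argument names here are assumed
    bet = shell(
        cmd='bet <in_file> <out_file>',
        inputs={'in_file': 'medimage/nifti-gz'},
        outputs={'out_file': 'medimage/nifti-gz'})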
To connect a workflow via the CLI, map the inputs and outputs of the Pydra
workflow/task (in_file, peel and out_file in the example below)
to the appropriate columns in the dataset (T1w, T2w and
freesurfer/recon-all respectively).
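A sketch of such a command, following the `frametree apply` syntax used later in this section (the dataset path and pipeline name are placeholders):

    $ frametree apply bids///data/openneuro/ds00014 freesurfer \
          --input T1w in_file \
          --input T2w peel \
          --output freesurfer/recon-all out_file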
If there is a mismatch in datatype (see FileFormats) between the
workflow inputs/outputs and the columns they are connected to, a datatype-conversion
task will be inserted into the pipeline, provided a converter between the two
formats exists.
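For instance (a sketch reusing the add_sink()/apply() calls shown below; the `tractography` workflow is hypothetical, and how FrameTree determines the workflow's own output datatype is elided here):

    # The sink column stores gzipped NIfTI...
    frameset.add_sink(
        name='wm_fod',
        datatype=medimage.NiftiGz)

    # ...but the (hypothetical) workflow produces an MRtrix image, so a
    # MrtrixImage -> NiftiGz conversion task is inserted into the
    # pipeline, provided fileformats defines such a converter
    frameset.apply(
        name='tractography',
        workflow=tractography,
        inputs=[('in_file', 'T1w')],
        outputs=[('out_file', 'wm_fod')])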
If the source can be referenced by its path alone, and the formats of the source
and sink columns match those expected and produced by the workflow, then you
can add the sources and sinks in one step.
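In that case the explicit add_source()/add_sink() calls can be dropped (a sketch, assuming apply() implicitly creates the missing columns when the names and formats line up):

    # Sketch: no prior add_sink() for 'vbm_template'; the column is
    # assumed to be created implicitly because the formats match
    frameset.apply(
        name='vbm_template',
        workflow=vbm_template,
        inputs=[('in_file', 'T1w')],
        outputs=[('out_file', 'vbm_template')],
        row_frequency='constant')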
By default, pipelines will iterate over all "leaf rows" of the data tree (e.g. sessions
for datasets in the Clinical space). However, pipelines can be run
at any row frequency of the dataset (see Axes), e.g. per subject,
per visit, or on the dataset as a whole (to create single templates/statistics).
Pipeline outputs must be connected to sinks of the same row frequency, but
inputs can be drawn from columns of any row frequency. In this case,
inputs from more frequent rows will be provided to the pipeline as a list
sorted by their ID.
For example, when the pipeline in the following code block runs, it will receive
a list of T1w filenames (one per session), run the workflow once over the whole
list, and then sink a single template back to the dataset.
    $ # Add sink column with "constant" row frequency
    $ frametree add-sink bids///data/openneuro/ds00014 vbm_template medimage/nifti-gz \
          --row-frequency constant
    $ # NB: we don't need to add the T1w source as it is auto-detected when using BIDS
    $ # Connect pipeline to a "constant" row-frequency sink column. Needs to be
    $ # of `constant` row_frequency itself or FrameTree will raise an error
    $ frametree apply bids///data/openneuro/ds00014 vbm_template \
          --input T1w in_file \
          --output vbm_template out_file \
          --row-frequency constant
Alternatively via the Python API:
    from myworkflows import vbm_template
    from fileformats import common, medimage
    from frametree.common import Clinical
    from frametree.core.frameset import FrameSet  # import path assumed

    frameset = FrameSet.load('bids///data/openneuro/ds00014')

    # Add sink column with "constant" row frequency
    frameset.add_sink(
        name='vbm_template',
        datatype=medimage.NiftiGz,
        row_frequency='constant')

    # NB: we don't need to add the T1w source as it is automatically detected
    # when using BIDS

    # Connect pipeline to a "constant" row-frequency sink column. Needs to be
    # of `constant` row_frequency itself or FrameTree will raise an error
    frameset.apply(
        name='vbm_template',
        workflow=vbm_template,
        inputs=[('in_file', 'T1w')],
        outputs=[('out_file', 'vbm_template')],
        row_frequency='constant')
After workflows and/or analysis classes have been connected to a dataset, derivatives can be
generated using FrameSet.derive(), passing the names of the sink columns to
derive. This method checks the data store to see whether the
source data are present and executes the pipelines over all rows of the dataset
with available source data. If pipeline inputs are sink columns to be derived
by prerequisite pipelines, then the prerequisite pipelines will be prepended
to the execution stack.
    frameset = FrameSet.load('/data/openneuro/ds00014@test')
    frameset.derive('fast/gm', cache_dir='/work/temp-dir')

    # Print the URI of the generated derivative
    print(frameset['fast/gm']['sub11'].uri)
By default, Pydra uses the "concurrent-futures" ('cf') plugin, which
splits workflows over multiple processes. You can specify which plugin to use, and
thereby how the workflow is executed, via the pydra_plugin option, and pass
options through to it with pydra_option.
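For example (a sketch; whether pydra_plugin is accepted as a keyword argument to FrameSet.derive(), as assumed here, or only as a CLI option should be checked against the reference docs):

    # 'serial' is a standard Pydra plugin that runs tasks sequentially;
    # passing it via `pydra_plugin` here is an assumption
    frameset.derive('fast/gm', pydra_plugin='serial')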
Provenance metadata is saved alongside derivatives in the data store. The
metadata includes:

- MD5 checksums of all pipeline inputs and outputs
- the full workflow graph, with connections between, and parameterisations of, Pydra tasks
- container image tags for tasks that ran inside containers
- Python dependencies and versions used
How these provenance metadata are stored depends on the type of data store,
but often they will be stored in a JSON file. For example, a provenance JSON file
might look like:
{"store":{"class":"<frametree.xnat.api:Xnat>","server":"https://central.xnat.org"},"dataset":{"id":"MYPROJECT","name":"passed-dwi-qc","exclude":['015','101']"id_composition":{"subject":"(?P<group>TEST|CONT)(?P<member>\d+3)"}},"pipelines":[{"name":"anatomically_constrained_tractography","inputs":{// MD5 Checksums for all files in the file group. "." refers to the// "primary file" in the file group."T1w_reg_dwi":{"datatype":"<fileformats.medimage.data:NiftiGzX>","checksums":{".":"4838470888DBBEADEAD91089DD4DFC55","json":"7500099D8BE29EF9057D6DE5D515DFFE"}},"T2w_reg_dwi":{"datatype":"<fileformats.medimage.data:NiftiGzX>","checksums":{".":"4838470888DBBEADEAD91089DD4DFC55","json":"5625E881E32AE6415E7E9AF9AEC59FD6"}},"dwi_fod":{"datatype":"<fileformats.medimage.data:MrtrixImage>","checksums":{".":"92EF19B942DD019BF8D32A2CE2A3652F"}}},"outputs":{"wm_tracks":{"task":"tckgen","field":"out_file","datatype":"<fileformats.medimage.data:MrtrixTrack>","checksums":{".":"D30073044A7B1239EFF753C85BC1C5B3"}}}"workflow":{"name":"workflow","class":"<pydra.engine.core:Workflow>","tasks":{"5ttgen":{"class":"<pydra.tasks.mrtrix3.preprocess:FiveTissueTypes>","package":"pydra-mrtrix","version":"0.1.1","inputs":{"in_file":{"field":"T1w_reg_dwi"}"t2":{"field":"T1w_reg_dwi"}"sgm_amyg_hipp":true},"container":{"type":"docker","image":"mrtrix3/mrtrix3:3.0.3"}},"tckgen":{"class":"<pydra.tasks.mrtrix3.tractography:TrackGen>","package":"pydra-mrtrix","version":"0.1.1","inputs":{"in_file":{"field":"dwi_fod"},"act":{"task":"5ttgen","field":"out_file"},"select":100000000,},"container":{"type":"docker","image":"mrtrix3/mrtrix3:3.0.3"}},},},"execution":{"machine":"hpc.myuni.edu","processor":"intel9999","python-packages":{"pydra-mrtrix3":"0.1.0","fileformats-medimage":"0.8.1","frametree-xnat":"0.5.0"}},},],}
Before derivatives are generated, the provenance metadata of prerequisite
derivatives (i.e. the inputs of the pipeline and of its prerequisite pipelines, etc.)
are checked for any alterations to the configuration of
the pipelines that generated them. If alterations are found, any affected rows will not be
processed, and a warning will be generated by default. To override this behaviour
and reprocess the derivatives, set the reprocess flag when calling
FrameSet.derive().
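For example, reusing the derive() call from earlier:

    # Regenerate 'fast/gm' even though the stored provenance no longer
    # matches the current pipeline configuration
    frameset.derive('fast/gm', reprocess=True)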