Processing
==========

All data processing in Arcana is performed by the Pydra_ dataflow engine, which
provides a declarative domain-specific language (DSL) for creating workflows
in Python and enables the execution of workflows to be split over multiple
cores or high-performance clusters (e.g. SLURM and SGE).

Processed derivatives are computed by "pipelines", modular Pydra_ workflows
that connect dataset columns (see :ref:`data_columns`). Pipeline outputs are
always connected to sink columns, whereas pipeline inputs can draw data from either
source columns or sink columns containing derivatives generated by prerequisite
pipelines.

By connecting pipeline inputs to the outputs of other pipelines,
complex processing chains/webs can be created (reminiscent of a makefile),
in which intermediate products will be stored in the dataset for subsequent
analysis. Alternatively, :class:`.Analysis` classes can be used to implement
processing chains/webs independently of a specific dataset so that they can be applied
to new studies in a reproducible way. If required, :class:`.Analysis`
classes can be customised for particular use cases via a combination of new
parameterisations and overriding selected methods in subclasses (see :ref:`design_analyses`).

.. note::

  While a general-purpose dataflow engine, Pydra_ was developed in the neuroimaging
  field and so the existing task interfaces wrap common neuroimaging tools. Therefore,
  you will need to create your own Pydra_ wrappers for tools used in other
  domains.

.. _applying_workflows:

Pydra workflows
---------------

`Pydra workflows`_, or individual `Pydra tasks`_, can be applied to a dataset in
order to operate on the data within it. Workflows are connected from source or sink
columns to sink columns. During the application process, :class:`.Pipeline` objects
are created to wrap the workflow and prepend and append additional tasks to

* iterate over the relevant rows in the dataset
* manage storage and retrieval of data to and from the data store
* convert between mismatching file formats
* write provenance metadata
* check saved provenance metadata to ensure prerequisite derivatives were generated with equivalent parameterisations and software versions (and potentially reprocess them if not)

To connect a workflow via the CLI, map the inputs and outputs of the Pydra_
workflow/task (``in_file``, ``peel`` and ``out_file`` in the example below)
to the appropriate columns in the dataset (``T1w``, ``T2w`` and
``freesurfer/recon-all`` respectively)

.. code-block:: console

    $ arcana dataset add-source 'myuni-xnat//myproject:training' T1w \
      medimage/dicom --path '.*mprage.*' --regex

    $ arcana dataset add-source 'myuni-xnat//myproject:training' T2w \
      medimage/dicom --path '.*t2spc.*' --regex

    $ arcana dataset add-sink 'myuni-xnat//myproject:training' freesurfer/recon-all \
      application/zip

    $ arcana apply pipeline 'myuni-xnat//myproject:training' freesurfer \
      pydra.tasks.freesurfer:Freesurfer \
      --input T1w in_file medimage/niftiGz \
      --input T2w peel medimage/niftiGz \
      --output freesurfer/recon-all out_file generic/directory \
      --parameter param1 10 \
      --parameter param2 20

If there is a mismatch in datatype (see :ref:`data_formats`) between the
workflow inputs/outputs and the columns they are connected to, a datatype conversion
task will be inserted into the pipeline if a converter method between the two
formats exists (see :ref:`file_formats`).
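
The decision of whether a conversion step is needed can be pictured with the
following minimal sketch. It is not Arcana's implementation: the ``CONVERTERS``
registry, the ``plan_input`` function and the converter names are hypothetical,
and formats are represented as plain strings purely for illustration.

.. code-block:: python

    # Hypothetical registry mapping (stored format, required format) pairs to
    # the name of a converter task. None of these names come from Arcana.
    CONVERTERS = {
        ("medimage/dicom", "medimage/niftiGz"): "dicom_to_nifti_gz",
        ("application/zip", "generic/directory"): "unzip",
        ("generic/directory", "application/zip"): "zip",
    }


    def plan_input(column_format, required_format):
        """Return the steps needed to feed a column into a workflow input."""
        if column_format == required_format:
            return ["load"]  # formats match, no conversion needed
        try:
            converter = CONVERTERS[(column_format, required_format)]
        except KeyError:
            raise ValueError(
                f"No converter from {column_format!r} to {required_format!r}"
            ) from None
        return ["load", converter]


    # A DICOM column feeding a NIfTI input gets a conversion step inserted
    print(plan_input("medimage/dicom", "medimage/niftiGz"))
    # Matching formats are passed straight through
    print(plan_input("medimage/niftiGz", "medimage/niftiGz"))

If no converter is registered for the pair, the sketch raises an error rather
than silently passing mismatched data to the workflow.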

To add a workflow to a dataset via the API use the :meth:`.Dataset.apply_pipeline` method

.. code-block:: python

    from pydra.tasks.freesurfer import Freesurfer
    from arcana.core.data.set import Dataset
    from arcana.data.types import common, medimage

    dataset = Dataset.load('myuni-xnat//myproject:training')

    dataset.add_source('T1w', datatype=medimage.Dicom, path='.*mprage.*',
                       is_regex=True)
    dataset.add_source('T2w', datatype=medimage.Dicom, path='.*t2spc.*',
                       is_regex=True)

    dataset.add_sink('freesurfer/recon-all', common.Directory)

    dataset.apply_pipeline(
        workflow=Freesurfer(
            name='freesurfer',
            param1=10.0,
            param2=20.0),
        inputs=[('T1w', 'in_file', medimage.NiftiGz),
                ('T2w', 'peel', medimage.NiftiGz)],
        outputs=[('freesurfer/recon-all', 'out_file', common.Directory)])

    dataset.save()

If the source can be referenced by its path alone and the formats of the source
and sink columns match those expected and produced by the workflow, then you
can add the sources and sinks all in one step

.. code-block:: console

    $ arcana apply pipeline '/data/enigma/alzheimers:test' segmentation \
      pydra.tasks.fsl.preprocess.fast:FAST \
      --source T1w in_file medimage:NiftiGz \
      --sink fast/gm gm medimage:NiftiGz \
      --parameter method a-method


By default, pipelines will iterate over all "leaf rows" of the data tree (e.g. ``session``
for datasets in the :class:`.Clinical` space). However, pipelines can be run
at any row frequency of the dataset (see :ref:`data_spaces`), e.g. per subject,
per timepoint, or on the dataset as a whole (to create single templates/statistics).

Pipeline outputs must be connected to sinks of the same row frequency. However,
inputs can be drawn from columns of any row frequency. In this case,
inputs from more frequent rows will be provided to the pipeline as a list
sorted by their ID.
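
As a rough illustration of this behaviour (plain Python rather than Arcana's
API), the sketch below groups per-session items by subject and sorts each group
by session ID, which is the shape of input a per-subject pipeline would
receive. The IDs, file names and the ``group_by_subject`` helper are made up
for the example.

.. code-block:: python

    from collections import defaultdict

    # Hypothetical (subject ID, session ID, file name) entries at "session"
    # row frequency, deliberately listed out of order
    session_rows = [
        ("sub02", "ses02", "sub02_ses02_T1w.nii.gz"),
        ("sub01", "ses02", "sub01_ses02_T1w.nii.gz"),
        ("sub01", "ses01", "sub01_ses01_T1w.nii.gz"),
        ("sub02", "ses01", "sub02_ses01_T1w.nii.gz"),
    ]


    def group_by_subject(rows):
        """Collect session-level items into one list per subject.

        Each list is sorted by session ID.
        """
        grouped = defaultdict(list)
        for subject_id, session_id, fname in rows:
            grouped[subject_id].append((session_id, fname))
        return {
            subject_id: [fname for _, fname in sorted(items)]
            for subject_id, items in sorted(grouped.items())
        }


    per_subject = group_by_subject(session_rows)
    # Each subject-level row receives a list of its session-level files
    print(per_subject["sub01"])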

For example, when the pipeline in the following code-block runs, it will receive
a list of T1w filenames, run the workflow once, and then sink a single template
back to the dataset.


.. code-block:: python

    from myworkflows import vbm_template
    from arcana.core.data.set import Dataset
    from arcana.data.types import common, medimage
    from arcana.medimage.data import Clinical

    dataset = Dataset.load('bids///data/openneuro/ds00014')

    # Add sink column with "dataset" row frequency
    dataset.add_sink(
        name='vbm_template',
        datatype=medimage.NiftiGz,
        row_frequency='dataset')

    # NB: we don't need to add the T1w source as it is automatically detected
    #     when using BIDS

    # Connect pipeline to a sink column with "dataset" row frequency. The
    # pipeline needs the same row frequency or Arcana will raise an error
    dataset.apply_pipeline(
        name='vbm_template',
        workflow=vbm_template(),
        inputs=[('T1w', 'in_file')],
        outputs=[('vbm_template', 'out_file')],
        row_frequency='dataset')


.. _analysis_classes:

Analysis classes
----------------

:class:`.Analysis` classes are used to implement pipeline chains/webs that
can be applied to types of datasets in a reproducible manner. The syntax used is
an extension of the attrs_ package (see `https://www.attrs.org/en/stable/extending.html
<https://www.attrs.org/en/stable/extending.html>`_). In this syntax, member
attributes are either free parameters or placeholders for columns in the
dataset the analysis is applied to. Decorated "pipeline builder" methods
construct the pipelines to perform the analysis.

The following toy example has two column placeholders, ``recorded_datafile``
and ``recorded_metadata``, to be linked to source data (*Lines 17 & 18*), and
three column placeholders, ``preprocessed``, ``derived_image`` and
``summary_metric`` (*Lines 19-21*), that can be derived by pipelines created by
one of the two implemented pipeline builder methods, ``preprocess_pipeline``
(*Line 32*) and ``create_image_pipeline`` (*Line 63*).

The :func:`arcana.core.mark.analysis` decorator is used to specify an
analysis class (*Line 10*), taking the dataset space that the class operates on
as an argument. By default, class attributes are assumed to be
column placeholders of :func:`arcana.core.mark.column` type (*Lines 17-21*).
Class attributes can also be free parameters of the analysis by using
:func:`arcana.core.mark.parameter` instead (*Lines 25 & 26*).

The :func:`arcana.core.mark.pipeline` decorator specifies pipeline builder
methods, and takes the columns the pipeline outputs are connected to as arguments
(*Lines 31 & 61*). For more details on the design of analysis classes, see
:ref:`design_analyses`.

..  code-block:: python
    :linenos:

    import pydra
    from some.example.pydra.tasks import Preprocess, ExtractFromJson, MakeImage
    from arcana.core.mark import analysis, pipeline, parameter
    from arcana.example.data import ExampleDataSpace
    from fileformats.application import Zip
    from fileformats.generic import Directory
    from fileformats.application import Json
    from fileformats.image import Png, Gif

    @analysis(ExampleDataSpace)
    class ExampleAnalysis():

        # Define the columns for the dataset along with their formats.
        # The `column` decorator can be used to specify additional options but
        # is not required by default. The data formats specify the datatype
        # that the column data will be stored in
        recorded_datafile: Zip  # Not derived by a pipeline, should be linked to existing dataset column
        recorded_metadata: Json  # "     "     "     "
        preprocessed: Zip  # Derived by 'preprocess_pipeline' pipeline
        derived_image: Png  # Derived by 'create_image_pipeline' pipeline
        summary_metric: float  # Derived by 'create_image_pipeline' pipeline

        # Define analysis-wide parameters that can be used in multiple
        # pipelines/tasks
        contrast: float = parameter(default=0.5)
        kernel_fwhms: list[float] = parameter(default=[0.5, 0.3, 0.1])

        # Define a "pipeline builder method" to generate the 'preprocessed'
        # derivative. Arcana automagically maps column names to arguments of the
        # builder methods.
        @pipeline(preprocessed)
        def preprocess_pipeline(
                self,
                wf: pydra.Workflow,
                recorded_datafile: Directory,  # Automatic conversion from stored Zip format before pipeline is run
                recorded_metadata):  # Datatype is the same as in the class definition so can be omitted

            # A simple task to extract the "temperature" field from a JSON
            # metadata
            wf.add(
                ExtractFromJson(
                    name='extract_metadata',
                    in_file=recorded_metadata,
                    field='temperature'))

            # Add tasks to the pipeline using Pydra workflow syntax
            wf.add(
                Preprocess(
                    name='preprocess',
                    in_file=recorded_datafile,
                    temperature=wf.extract_metadata.lzout.out_field))

            # Map the output of the pipeline to the "preprocessed" column specified
            # in the @pipeline decorator
            return wf.preprocess.lzout.out_file

        # The 'create_image' pipeline derives two columns 'derived_image' (in GIF format) and
        # 'summary_metric'. Since the output format of derived image created by the pipeline ('Gif')
        # differs from that specified for the column ('Png'), an automatic conversion
        # step will be added by Arcana before the image is stored.
        @pipeline((derived_image, Gif),
                  summary_metric)
        def create_image_pipeline(
                self,
                wf,
                preprocessed: Directory,  # Automatic conversion from stored Zip format before pipeline is run
                contrast: float):  # Parameters are also automagically mapped to method args

            # Add a task that creates an image from the preprocessed data, using
            # the 'contrast' parameter
            wf.add(
                MakeImage(
                    name="create_image",
                    in_file=preprocessed,
                    contrast=contrast))

            return wf.create_image.lzout.out_file, wf.create_image.lzout.summary

To apply an analysis via the command-line use the ``--column`` flag to connect
column specs in the class with existing columns in the dataset.

.. code-block:: console

  $ arcana apply analysis '/data/a-dataset' example:ExampleAnalysis \
    --column recorded_datafile datafile \
    --column recorded_metadata metadata \
    --parameter contrast 0.75

Analyses are applied to datasets using the Python API with the :meth:`.Dataset.apply`
method. :meth:`.Dataset.apply` takes an :class:`.Analysis` object that is instantiated
with the names of columns in the dataset to link placeholders to and any
parameters.

.. code-block:: python

  from arcana.core.data.set import Dataset
  from fileformats.application import Yaml, Zip
  from arcana.examples import ExampleAnalysis

  dataset = Dataset.load('/data/a-dataset')

  dataset.add_source(
      name='datafile',
      path='a-long-arbitrary-name',
      datatype=Zip)

  dataset.add_source(
      name='metadata',
      path='another-long-arbitrary-name',
      datatype=Yaml)  # The format the data is stored in; it will be converted automatically

  dataset.apply(
      ExampleAnalysis(
          recorded_datafile='datafile',
          recorded_metadata='metadata',
          contrast=0.75))

.. _derivatives:

Generating derivatives
----------------------

After workflows and/or analysis classes have been connected to a dataset, derivatives can be
generated using :meth:`.Dataset.derive` or alternatively :meth:`.DataColumn.derive`
for single columns. These methods check the data store to see whether the
source data is present and execute the pipelines over all rows of the dataset
with available source data. If pipeline inputs are sink columns to be derived
by prerequisite pipelines, then the prerequisite pipelines will be prepended
onto the execution stack.
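
The ordering of prerequisite pipelines can be sketched as a depth-first
traversal of the column dependencies. The code below is only an illustration in
plain Python: the ``PIPELINES`` mapping and the ``execution_stack`` function
are hypothetical (the column names are borrowed from the toy analysis class
above), and Arcana's own implementation may differ.

.. code-block:: python

    # Hypothetical mapping from each sink column to the pipeline that derives
    # it and the columns that pipeline takes as inputs. Source columns do not
    # appear as keys because no pipeline derives them.
    PIPELINES = {
        "preprocessed": (
            "preprocess_pipeline",
            ["recorded_datafile", "recorded_metadata"],
        ),
        "derived_image": ("create_image_pipeline", ["preprocessed"]),
        "summary_metric": ("create_image_pipeline", ["preprocessed"]),
    }


    def execution_stack(column, stack=None):
        """Return pipeline names in the order needed to derive ``column``."""
        if stack is None:
            stack = []
        if column not in PIPELINES:
            return stack  # a source column, nothing to run
        name, inputs = PIPELINES[column]
        for input_column in inputs:
            execution_stack(input_column, stack)  # prerequisites go first
        if name not in stack:
            stack.append(name)  # run each pipeline at most once
        return stack


    print(execution_stack("derived_image"))

The sketch assumes the dependencies contain no cycles; a real implementation
would also need to detect circular dependencies and skip rows whose
derivatives already exist.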

To generate derivatives via the CLI

.. code-block:: console

  $ arcana derive column 'myuni-xnat//myproject:training' freesurfer/recon-all

To generate derivatives via the API

.. code-block:: python

  dataset = Dataset.load('/data/openneuro/ds00014:test')

  dataset.derive('fast/gm', cache_dir='/work/temp-dir')

  # Print the URI of the generated derivative
  print(dataset['fast/gm']['sub11'].uri)

By default Pydra_ uses the "concurrent-futures" (``'cf'``) plugin, which
splits workflows over multiple processes. You can specify which plugin to use, and
thereby how the workflow is executed, via the ``pydra_plugin`` option, and pass
options to it with ``pydra_option``.


.. code-block:: console

  $ arcana derive column 'myuni-xnat//myproject:training' freesurfer/recon-all \
    --plugin slurm --pydra-option poll_delay 5 --pydra-option max_jobs 10


To list the derivatives that can be derived from a dataset after workflows
have been applied you can use the ``menu`` command

.. code-block:: console

  $ arcana derive menu '/data/a-dataset'

  Derivatives
  -----------
  recorded_datafile (zip)
  recorded_metadata (json)
  preprocessed (zip)
  derived_image (png)
  summary_metric (float)

  Parameters
  ----------
  contrast (float) default=0.5
  kernel_fwhms (list[float]) default=[0.5, 0.3, 0.1]

For large analysis classes with many column specs this list could become
overwhelming, so when designing an analysis class it is good practice to set the
"salience" of columns and parameters (see :ref:`column_param_specs`). The menu
can then be filtered to show only the more salient columns (the default is to
only show "supplementary" and above).
Parameters can similarly be filtered by their salience (see :class:`.ParameterSalience`),
by default only showing parameters "check" and above.
For example, the following menu call will show all columns and parameters with
salience >= 'qa' and 'recommended', respectively.

.. code-block:: console

  $ arcana derive menu '/data/another-dataset' --columns qa --parameters recommended
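
Threshold filtering of this kind can be sketched with an ordered enumeration.
The example below is a simplified illustration, not Arcana's salience classes:
the ``Salience`` enum, its numeric values and the salience assigned to each
column are assumptions made for the example (only the relative order of "qa"
and "supplementary" follows from the text above).

.. code-block:: python

    from enum import IntEnum


    class Salience(IntEnum):
        """Hypothetical salience levels; the numeric values are illustrative."""

        debug = 20
        qa = 40
        supplementary = 60
        primary = 100


    # Hypothetical salience assigned to each column of the toy analysis
    columns = {
        "preprocessed": Salience.debug,
        "derived_image": Salience.primary,
        "summary_metric": Salience.qa,
    }


    def filter_by_salience(specs, threshold):
        """Keep the names whose salience is at or above the threshold."""
        return [name for name, sal in specs.items() if sal >= threshold]


    # Lowering the threshold reveals more columns
    print(filter_by_salience(columns, Salience.qa))
    print(filter_by_salience(columns, Salience.supplementary))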

The ``salience_threshold`` argument can also be used to filter out derivatives
from the data store when applying an analysis to a dataset. This
allows the user to control how much derivative data are saved to
avoid filling up (potentially expensive) storage. The following call will only
attempt to store data columns with "qa" or greater salience in XNAT, keeping the
rest only in the local cache.

.. code-block:: console

  $ arcana apply analysis 'my-unis-xnat//MYPROJECT:test' example:ExampleAnalysis \
    --link recorded_datafile datafile \
    --link recorded_metadata metadata \
    --parameter contrast 0.75 \
    --salience-threshold qa


Provenance
----------

Provenance metadata is saved alongside derivatives in the data store. The
metadata includes:

* MD5 Checksums of all pipeline inputs and outputs
* Full workflow graph with connections between, and parameterisations of, Pydra tasks
* Container image tags for tasks that ran inside containers
* Python dependencies and versions used.
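
As a minimal sketch of the first item, file checksums can be computed with
Python's standard ``hashlib`` module and collected into a JSON-serialisable
record. The ``md5_checksum`` and ``provenance_entry`` helpers and the record
layout are assumptions made for illustration, not Arcana's own functions.

.. code-block:: python

    import hashlib
    import json
    import tempfile
    from pathlib import Path


    def md5_checksum(path, chunk_size=8192):
        """Compute the MD5 checksum of a file, reading it in chunks."""
        digest = hashlib.md5()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                digest.update(chunk)
        return digest.hexdigest().upper()


    def provenance_entry(paths):
        """Build a checksum record for a group of files, keyed by file name."""
        return {"checksums": {Path(p).name: md5_checksum(p) for p in paths}}


    # Demonstrate on a throwaway file
    with tempfile.TemporaryDirectory() as tmp_dir:
        fpath = Path(tmp_dir) / "out_file.txt"
        fpath.write_bytes(b"hello")
        entry = provenance_entry([fpath])
        print(json.dumps(entry, indent=2))

Reading in chunks keeps memory use constant, which matters for large imaging
files.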

How these provenance metadata are stored will depend on the type of data store,
but often they will be stored in a JSON file. For example, a provenance JSON file
would look like

.. code-block:: javascript

  {
    "store": {
      "class": "<arcana.medimage.data.xnat.api:Xnat>",
      "server": "https://central.xnat.org"
    },
    "dataset": {
      "id": "MYPROJECT",
      "name": "passed-dwi-qc",
      "exclude": ["015", "101"],
      "id_composition": {
        "subject": "(?P<group>TEST|CONT)(?P<member>\d+3)"
      }
    },
    "pipelines": [
      {
        "name": "anatomically_constrained_tractography",
        "inputs": {
          // MD5 Checksums for all files in the file group. "." refers to the
          // "primary file" in the file group.
          "T1w_reg_dwi": {
            "datatype": "<fileformats.medimage.data:NiftiGzX>",
            "checksums": {
              ".": "4838470888DBBEADEAD91089DD4DFC55",
              "json": "7500099D8BE29EF9057D6DE5D515DFFE"
            }
          },
          "T2w_reg_dwi": {
            "datatype": "<fileformats.medimage.data:NiftiGzX>",
            "checksums": {
              ".": "4838470888DBBEADEAD91089DD4DFC55",
              "json": "5625E881E32AE6415E7E9AF9AEC59FD6"
            }
          },
          "dwi_fod": {
            "datatype": "<fileformats.medimage.data:MrtrixImage>",
            "checksums": {
              ".": "92EF19B942DD019BF8D32A2CE2A3652F"
            }
          }
        },
        "outputs": {
          "wm_tracks": {
            "task": "tckgen",
            "field": "out_file",
            "datatype": "<fileformats.medimage.data:MrtrixTrack>",
            "checksums": {
              ".": "D30073044A7B1239EFF753C85BC1C5B3"
            }
          }
        },
        "workflow": {
          "name": "workflow",
          "class": "<pydra.engine.core:Workflow>",
          "tasks": {
            "5ttgen": {
              "class": "<pydra.tasks.mrtrix3.preprocess:FiveTissueTypes>",
              "package": "pydra-mrtrix",
              "version": "0.1.1",
              "inputs": {
                "in_file": {
                  "field": "T1w_reg_dwi"
                },
                "t2": {
                  "field": "T2w_reg_dwi"
                },
                "sgm_amyg_hipp": true
              },
              "container": {
                "type": "docker",
                "image": "mrtrix3/mrtrix3:3.0.3"
              }
            },
            "tckgen": {
              "class": "<pydra.tasks.mrtrix3.tractography:TrackGen>",
              "package": "pydra-mrtrix",
              "version": "0.1.1",
              "inputs": {
                "in_file": {
                  "field": "dwi_fod"
                },
                "act": {
                  "task": "5ttgen",
                  "field": "out_file"
                },
                "select": 100000000,
              },
              "container": {
                "type": "docker",
                "image": "mrtrix3/mrtrix3:3.0.3"
              }
            },
          },
        },
        "execution": {
          "machine": "hpc.myuni.edu",
          "processor": "intel9999",
          "python-packages": {
            "pydra-mrtrix3": "0.1.0",
            "arcana-medimage": "0.1.0"
          }
        },
      },
    ],
  }


Before derivatives are generated, provenance metadata of prerequisite
derivatives (i.e. inputs of the pipeline, inputs of prerequisite pipelines, and so on)
are checked to see if there have been any alterations to the configuration of
the pipelines that generated them. If so, any affected rows will not be
processed, and a warning will be generated by default. To override this behaviour
and reprocess the derivatives, set the ``reprocess`` flag when calling
:meth:`.Dataset.derive`

.. code-block:: python

  dataset.derive('fast/gm', reprocess=True)

or

.. code-block:: console

  $ arcana derive column 'myuni-xnat//myproject:training' freesurfer/recon-all --reprocess


To ignore differences between pipeline configurations you can use the :meth:`.Dataset.ignore_diff`
method

.. code-block:: python

  dataset.ignore_diff('freesurfer_pipeline', ('freesurfer_task', 'num_iterations', 3))

or via the CLI

.. code-block:: console

  $ arcana derive ignore-diff 'myuni-xnat//myproject:training' freesurfer --param freesurfer_task num_iterations 3
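
The effect of ignoring a difference can be pictured with the following sketch,
which compares stored and current task parameterisations and drops any
mismatch that has been explicitly accepted. The ``find_diffs`` and
``remaining_diffs`` functions are hypothetical, and reading the third element
of the ignore tuple as the stored value to accept is an assumption rather than
documented behaviour.

.. code-block:: python

    def find_diffs(stored, current):
        """Return {(task, param): (stored value, current value)} mismatches."""
        diffs = {}
        for task in sorted(set(stored) | set(current)):
            old_params = stored.get(task, {})
            new_params = current.get(task, {})
            for param in sorted(set(old_params) | set(new_params)):
                old, new = old_params.get(param), new_params.get(param)
                if old != new:
                    diffs[(task, param)] = (old, new)
        return diffs


    def remaining_diffs(stored, current, ignored=()):
        """Drop mismatches listed in ``ignored``.

        ``ignored`` holds (task, param, accepted stored value) tuples
        (assumed semantics).
        """
        diffs = find_diffs(stored, current)
        for task, param, accepted in ignored:
            old_new = diffs.get((task, param))
            if old_new is not None and old_new[0] == accepted:
                del diffs[(task, param)]
        return diffs


    stored = {"freesurfer_task": {"num_iterations": 3, "param1": 10.0}}
    current = {"freesurfer_task": {"num_iterations": 5, "param1": 10.0}}

    # Without an ignore entry the mismatch would block processing
    print(remaining_diffs(stored, current))
    # With the mismatch accepted, nothing remains to block processing
    print(remaining_diffs(
        stored, current, ignored=[("freesurfer_task", "num_iterations", 3)]))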


.. _Pydra: http://pydra.readthedocs.io
.. _`Pydra workflows`: https://pydra.readthedocs.io/en/latest/components.html#workflows
.. _`Pydra tasks`: https://pydra.readthedocs.io/en/latest/components.html#function-tasks
.. _attrs: https://www.attrs.org/en/stable/
.. _dataclasses: https://docs.python.org/3/library/dataclasses.html