.. _adding_formats:

New formats and spaces
======================

Arcana was initially developed for medical-imaging analysis. Therefore, with
the notable exceptions of the generic data spaces and file-formats defined in
:mod:`arcana.core.standard`, the
majority of file-formats and data spaces are specific to medical imaging.
However, new formats and data spaces used in other fields can be implemented as
required with just a few lines of code.

.. _file_formats:

File formats
------------

File formats are specified using the FileFormats_ package. Please refer to its
documentation for how to add new file formats.


Data spaces
-----------

New data spaces (see :ref:`data_spaces`) are defined by extending the
:class:`.DataSpace` abstract base class. :class:`.DataSpace` subclasses are
`enums <https://docs.python.org/3/library/enum.html>`_ with binary string
values of consistent length (i.e. all of length 2 or all of length 3, etc.).
The length of the binary string defines the rank of the data space,
i.e. the maximum depth of a data tree within the space. The enum must contain
a member for every possible value of the bit string (e.g. for 2 dimensions, there
must be members corresponding to the values 0b00, 0b01, 0b10 and 0b11).
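
The completeness requirement can be checked mechanically. The following is a
minimal sketch that uses a plain :class:`enum.Enum` as a stand-in for a
:class:`.DataSpace` subclass so that it runs without Arcana installed. The
``missing_values`` helper and the ``Incomplete`` enum are hypothetical names
used for illustration only and are not part of the Arcana API.

.. code-block:: python

    import enum


    def missing_values(space, rank):
        """Return the bit patterns of length ``rank`` not named in ``space``."""
        named = {member.value for member in space}
        # A space of rank N must name every value from 0 to 2**N - 1
        return sorted(set(range(2**rank)) - named)


    class Incomplete(enum.Enum):
        # Hypothetical 2-D space that forgets to name the 0b00 value
        a = 0b01
        b = 0b10
        ab = 0b11


    # Only the dataset-wide value (0b00) is missing
    assert missing_values(Incomplete, rank=2) == [0b00]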

For example, in imaging studies scanning sessions are typically organised
by analysis group (e.g. test and control), membership within the group (i.e.
matched subject ID) and time-point for longitudinal studies. In this case, we can
visualise the imaging sessions arranged in a 3-D grid along the `group`, `member`, and
`timepoint` axes. Note that datasets that only contain one group or
time-point can still be represented in this space; they are simply singleton along
the corresponding axis.

All axes should be included as members of a DataSpace subclass
enum with orthogonal binary vector values, e.g.::

    member = 0b001
    group = 0b010
    timepoint = 0b100

The axis that is most often non-singleton should be given the smallest bit
as this will be assumed to be the default when there is only one layer in the
data tree, e.g. imaging datasets will not always have different groups or
time-points but will always have different members (which are equivalent to
subjects when there is only one group).

The "leaf rows" of a data tree, imaging sessions in this example, will be the
bitwise-or of the axis vectors, i.e. an imaging session
is uniquely defined by its member, group and timepoint ID::

    session = 0b111
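
As a quick check of the arithmetic, the snippet below combines the three axis
values with a bitwise OR to obtain the leaf value. It uses plain integers
rather than a :class:`.DataSpace` subclass, so it is only an illustration of
the bit manipulation and not of how Arcana computes it internally.

.. code-block:: python

    from functools import reduce
    from operator import or_

    member = 0b001
    group = 0b010
    timepoint = 0b100

    # OR-ing the orthogonal axis values sets every bit, giving the leaf value
    session = reduce(or_, [member, group, timepoint])
    assert session == 0b111

    # Orthogonal axes share no bits, so AND-ing them always gives zero
    assert member & group & timepoint == 0b000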

In addition to the data items stored in leaf rows, some data, particularly
derivatives, may be stored in the dataset along a particular dimension, at
a lower "row_frequency" than 'per session'. For example, brain templates are
sometimes calculated 'per group'. Additionally, data
can be stored in aggregated rows that span a plane
of the grid. These frequencies should also be added to the enum, i.e. all
combinations of the base dimensions must be included and given intuitive
names if possible::

    subject = 0b011  # uniquely identified subject within the dataset
    batch = 0b110  # separate group + timepoint combinations
    matchedpoint = 0b101  # matched members and time-points aggregated across groups

Finally, for items that are singular across the whole dataset there should
also be a dataset-wide member with a value of 0::

    dataset = 0b000
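
Putting the fragments above together, the complete set of eight members can be
sketched as a single enum. A plain :class:`enum.Enum` is used here as a
stand-in for :class:`.DataSpace` so that the example is self-contained. The
class name ``Imaging`` and the ``spanned_axes`` helper are hypothetical and
serve only to show how each row frequency relates to the axes.

.. code-block:: python

    import enum


    class Imaging(enum.Enum):
        # Axes of the space
        member = 0b001
        group = 0b010
        timepoint = 0b100

        # Combinations of two axes
        subject = 0b011
        batch = 0b110
        matchedpoint = 0b101

        # Leaf and root frequencies
        session = 0b111
        dataset = 0b000


    AXES = (Imaging.member, Imaging.group, Imaging.timepoint)


    def spanned_axes(frequency):
        """Return the names of the axes whose bit is set in ``frequency``."""
        return [axis.name for axis in AXES if frequency.value & axis.value]


    assert spanned_axes(Imaging.subject) == ["member", "group"]
    assert spanned_axes(Imaging.dataset) == []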

For example, if you wanted to analyse daily recordings from various
weather stations, you could define a 2-dimensional "Weather" data space with
axes for the date and weather station of the recordings, using the following code:

.. _weather_example:

.. code-block:: python

    from arcana.core.data.space import DataSpace

    class Weather(DataSpace):

        # Define the axes of the dataspace
        timepoint = 0b01
        station = 0b10

        # Name the leaf and root frequencies of the data space
        recording = 0b11
        dataset = 0b00

.. note::

    All possible values of the *N*-bit binary string (2\ :sup:`N` in total)
    need to be named within the enum.

.. _Pydra: http://pydra.readthedocs.io
.. _FileFormats: https://arcanaframework.github.io/fileformats