Application Programming Interface

The core of Arcana’s framework is located under the arcana.core sub-package, which contains all the domain-independent logic. Domain-specific extensions for alternative data stores, data spaces and data types should be placed in arcana.data.stores, arcana.data.spaces and arcana.data.types, respectively.

Warning

Under construction

Data Model

Core

class arcana.core.data.store.DataStore

Abstract base class for all data store adapters. A data store can be an external data management system, e.g. XNAT, OpenNeuro or Datalad, or just a defined structure for laying out data within a file system, e.g. BIDS.

For a data management system or data structure to be compatible with Arcana, it must meet a number of criteria. A store is assumed to:

  • contain multiple projects/datasets addressable by unique IDs.

  • organise data within each project/dataset in trees

  • store arbitrary numbers of data “items” (e.g. “file-sets” and fields) within each tree node (including non-leaf nodes) addressable by unique “paths” relative to the node.

  • allow derivative data to be stored in separate namespaces for different analyses on the same data.
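These assumptions can be pictured with a toy in-memory layout. The sketch below is purely illustrative and is not part of the DataStore interface:

# Illustrative toy model of the layout a store is assumed to expose;
# these classes are NOT part of the actual DataStore API.
import typing as ty
from dataclasses import dataclass, field

@dataclass
class ToyTreeNode:
    # arbitrary data "items" addressable by unique paths relative to the node
    items: ty.Dict[str, object] = field(default_factory=dict)
    # child nodes of the tree, addressable by ID
    children: ty.Dict[str, "ToyTreeNode"] = field(default_factory=dict)

@dataclass
class ToyStore:
    # multiple projects/datasets addressable by unique IDs
    projects: ty.Dict[str, ToyTreeNode] = field(default_factory=dict)
    # derivatives kept in separate (project ID, analysis namespace) spaces
    derivatives: ty.Dict[ty.Tuple[str, str], ToyTreeNode] = field(default_factory=dict)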

class arcana.core.data.set.Dataset(*, id, store: DataStore, space: Type[DataSpace], id_patterns=NOTHING, hierarchy, metadata=NOTHING, include=NOTHING, exclude=NOTHING, name: str = '', columns=NOTHING, pipelines=NOTHING, analyses=NOTHING)

A representation of a “dataset”, the complete collection of data (file-sets and fields) to be used in an analysis.

Parameters:
  • id (str) – The dataset ID/path that uniquely identifies the dataset within the store in which it is stored (e.g. a file-system directory path or project ID)

  • store (DataStore) – The store in which the dataset is stored. The local file system can be used by providing a MockRemote repo.

  • space (DataSpace) – The space of the dataset. See https://arcana.readthedocs.io/en/latest/data_model.html#spaces for a description

  • id_patterns (dict[str, str]) – Patterns for inferring IDs of rows not explicitly present in the hierarchy of the data tree. See DataStore.infer_ids() for syntax

  • hierarchy (list[str]) –

    The data frequencies that are explicitly present in the data tree. For example, if a file-system-based dataset (i.e. a directory) has a two-layer hierarchy of sub-directories, with the first layer of sub-directories labelled by unique subject IDs and the second layer labelled by study time-point, then the hierarchy would be

    ['subject', 'timepoint']

    Alternatively, in some stores (e.g. XNAT) the second layer in the hierarchy may be labelled by a session ID that is unique across the project, in which case the layer dimensions would instead be

    ['subject', 'session']

    In such cases, if there are multiple timepoints, the timepoint ID of the session will need to be extracted using the id_patterns argument.

    Alternatively, the hierarchy could be organised such that the tree first splits on longitudinal time-points, then a second directory layer labelled by member ID, with the final layer containing sessions of matched members labelled by their groups (e.g. test & control):

    ['timepoint', 'member', 'group']

    Note that the combination of layers in the hierarchy must span the space defined in the DataSpace enum, i.e. the “bitwise or” of the layer values of the hierarchy must be 1 across all bits (e.g. 'session': 0b111). A construction sketch using such a hierarchy follows this parameter list.

  • metadata (dict or DatasetMetadata) – Generic metadata associated with the dataset, e.g. authors, funding sources, etc.

  • include (list[tuple[DataSpace, str or ty.List[str]]]) – The IDs to be included in the dataset per row_frequency. E.g. can be used to limit the subject IDs in a project to the sub-set that passed QC. If a row_frequency is omitted or its value is None, then all available IDs will be included

  • exclude (list[tuple[DataSpace, str or ty.List[str]]]) – The IDs to be excluded from the dataset per row_frequency. E.g. can be used to exclude specific subjects that failed QC. If a row_frequency is omitted or its value is None, then no IDs will be excluded for that frequency

  • name (str) – The name under which the dataset is saved in the store

  • columns (list[tuple[str, DataSource or DataSink]]) – The sources and sinks to be initially added to the dataset (columns are explicitly added when workflows are applied to the dataset).

  • pipelines (dict[str, pydra.Workflow]) – Pipelines that have been applied to the dataset to generate data for its sink columns

  • access_args (ty.Dict[str, Any]) – Repository-specific arguments used to control the way the dataset is accessed
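For illustration, a dataset with the ['subject', 'timepoint'] hierarchy discussed above might be constructed as follows. This is a minimal sketch: my_store stands in for any DataStore implementation and Clinical for a DataSpace enum with subject and timepoint axes (see the DataSpace example further below).

from arcana.core.data.set import Dataset

dataset = Dataset(
    id="/data/my-project",               # FS directory path or project ID
    store=my_store,                      # any DataStore implementation
    space=Clinical,                      # DataSpace enum spanning the dataset
    hierarchy=["subject", "timepoint"],  # layers explicitly present in the tree
    include=[(Clinical.subject, ["01", "02"])],  # e.g. restrict to QC-passed subjects
)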

add_sink(name: str, datatype: type, row_frequency: str | None = None, overwrite: bool = False, **kwargs) → DataSink

Specify a data sink in the dataset, which can then be referenced when connecting workflow outputs.

Parameters:
  • name (str) – The name used to reference the dataset “column” for the sink

  • datatype (type) – The file-format (for file-sets) or datatype (for fields) in which the sink will be stored within the dataset

  • path (str, optional) – A particular path for the sink within the dataset; defaults to the column name within the dataset derivatives directory of the store

  • row_frequency (str, optional) – The row_frequency of the sink within the dataset, by default the leaf frequency of the data tree

  • overwrite (bool) – Whether to overwrite an existing sink
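For example, a sink for pipeline derivatives might be added as follows; NiftiGz here is an illustrative file-format class, not a fixed name:

# Hedged sketch: `NiftiGz` stands in for whichever file-format class applies.
dataset.add_sink(
    name="brain_mask",        # the column name referenced by workflows
    datatype=NiftiGz,         # format the derivatives will be stored in
    row_frequency="session",  # one derivative per session row
)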

add_source(name: str, datatype: type, path: str | None = None, row_frequency: str | None = None, overwrite: bool = False, **kwargs) → DataSource

Specify a data source in the dataset, which can then be referenced when connecting workflow inputs.

Parameters:
  • name (str) – The name used to reference the dataset “column” for the source

  • datatype (type) – The file-format (for file-sets) or datatype (for fields) in which the source is stored within the dataset

  • path (str, default name) – The location of the source within the dataset

  • row_frequency (DataSpace, default self.leaf_freq) – The row_frequency of the source within the dataset

  • overwrite (bool) – Whether to overwrite existing columns

  • **kwargs (ty.Dict[str, Any]) – Additional kwargs to pass to DataSource.__init__
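A matching source might be added as follows; Dicom is an illustrative file-format class, and is_regex is assumed to be one of the kwargs forwarded to DataSource.__init__:

# Hedged sketch: `Dicom` stands in for whichever file-format class applies.
dataset.add_source(
    name="t1w",             # the column name referenced by workflows
    datatype=Dicom,         # format the source is stored in
    path=r".*t1_mprage.*",  # pattern matched against file-set names
    is_regex=True,          # assumed to be forwarded to DataSource.__init__
    row_frequency="session",
)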

class arcana.core.data.space.DataSpace

Base class for all “data space” enums. DataSpace enums specify the relationships between rows of a dataset.

For example, in imaging studies, scanning sessions are typically organised by analysis group (e.g. test & control), membership within the group (i.e. matched subjects) and time-points (for longitudinal studies). We can visualise the rows arranged in a 3-D grid along the group, member, and timepoint dimensions. Note that datasets that only contain one group or time-point can still be represented in the same space; they are just of depth=1 along those dimensions.

All dimensions should be included as members of a DataSpace subclass enum with orthogonal binary vector values, e.g.

member = 0b001
group = 0b010
timepoint = 0b100

In this space, an imaging session row is uniquely defined by its member, group and timepoint IDs. The most commonly present dimension should be given the least significant bit (e.g. imaging datasets will not always have different groups or time-points, but will always have different members, which are equivalent to subjects when there is only one group).

In addition to the data items stored in the data rows for each session, some items only vary along a particular dimension of the grid. The “row_frequency” of these items can be specified using the “basis” members (i.e. member, group, timepoint), in contrast to the session row_frequency, which is the combination of all three:

session = 0b111

Additionally, some data is stored in aggregated rows that span a plane of the grid. These frequencies should also be added to the enum (all combinations of the basis frequencies must be included) and given intuitive names if possible, e.g.

subject = 0b011 - a uniquely identified subject within the dataset
batch = 0b110 - a separate group + timepoint combination
matchedpoint = 0b101 - matched members and time-points aggregated across groups

Finally, for items that are singular across the whole dataset there should also be a dataset-wide member with value=0:

dataset = 0b000
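Putting these members together, an illustrative space covering every frequency described above could be defined as:

from arcana.core.data.space import DataSpace

class Clinical(DataSpace):  # illustrative name; define members to suit your domain
    # basis dimensions with orthogonal bits
    member = 0b001
    group = 0b010
    timepoint = 0b100
    # planes aggregating across one dimension
    subject = 0b011       # member + group
    batch = 0b110         # group + timepoint
    matchedpoint = 0b101  # member + timepoint
    # leaf rows and the dataset-wide singleton
    session = 0b111
    dataset = 0b000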

class arcana.core.data.row.DataRow(*, ids: ty.Dict[DataSpace, str], dataset: Dataset, frequency: str, tree_path: ty.List[str] = None, uri: ty.Optional[str] = None, metadata: ty.Optional[dict] = None)

A “row” in a dataset “frame” where file-sets and fields can be placed, e.g. a session or subject.

Parameters:
  • ids (Dict[DataSpace, str]) – The ids for the frequency of the row and all “parent” frequencies within the tree

  • dataset (Dataset) – A reference to the root of the data tree

  • frequency (str) – The frequency of the row

  • tree_path (list[str], optional) – the path to the row within the data tree. None if the row doesn’t sit within the original tree (e.g. timepoints within a subject>session hierarchy)

  • uri (str, optional) – a URI for the row, can be set and used by the data store implementation if appropriate, by default None

class arcana.core.data.column.DataSource(*, name: str, datatype: type, row_frequency: DataSpace, path: ty.Optional[str] = None, dataset: Dataset = None, quality_threshold=None, order=None, required_metadata: ty.Dict[str, ty.Any] = None, is_regex=False)

Specifies the criteria by which an item is selected from a data row to be a data source.

Parameters:
  • name (str) – the name of the column

  • datatype (type) – the data type of items in the column

  • row_frequency (DataSpace) – the frequency of the “rows” (data nodes) within the dataset tree, e.g. for the Clinical data space the row frequency can be per ‘session’, ‘subject’, ‘timepoint’, ‘group’, ‘dataset’, etc.

  • dataset (Dataset) – the dataset the column belongs to

  • path (str) – A regex name_path to match the fileset names with. Must match one and only one fileset per <row_frequency>. If None, the name is used instead.

  • quality_threshold (DataQuality) – The minimum acceptable quality of data items to be considered; items below this threshold will be considered missing

  • order (int | None) – Used to distinguish multiple filesets that match the name_path in the same session: the order of the fileset within the session (0-indexed). This is based on the scan ID, but is more robust to small changes to the IDs within the session, for example when there are two scans of the same type taken before and after a task.

  • required_metadata (dict[str, ty.Any]) – Required metadata, which can be used to distinguish multiple items that match all other criteria. The provided dictionary contains metadata values that must match the stored required_metadata exactly.

  • is_regex (bool) – Flags whether the name_path is a regular expression or not
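A stand-alone source with several selection criteria might look like the following sketch; Dicom and Clinical are illustrative names, not fixed parts of the API:

from arcana.core.data.column import DataSource

source = DataSource(
    name="t1w",
    datatype=Dicom,                          # illustrative file-format class
    row_frequency=Clinical.session,          # illustrative DataSpace member
    path=r".*mprage.*",
    is_regex=True,                           # interpret `path` as a regex
    order=0,                                 # first matching file-set (0-indexed)
    required_metadata={"EchoTime": 0.0035},  # must match stored metadata exactly
)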

class arcana.core.data.column.DataSink(*, name: str, datatype: type, row_frequency: DataSpace, dataset: Dataset = None, path=NOTHING, salience=ColumnSalience.supplementary, pipeline_name: str = None)

A specification for a file set within an analysis to be derived from a processing pipeline.

Parameters:
  • name (str) – the name of the column

  • datatype (type) – the data type of items in the column

  • row_frequency (DataSpace) – the frequency of the “rows” (data nodes) within the dataset tree, e.g. for the Clinical data space the row frequency can be per ‘session’, ‘subject’, ‘timepoint’, ‘group’, ‘dataset’, etc.

  • dataset (Dataset) – the dataset the column belongs to

  • path (str) – The path at which the sink will be stored within the dataset; defaults to the column name within the dataset derivatives directory of the store

  • salience (Salience) – The salience of the specified file-set, i.e. whether it would typically be of interest for publication outputs, is just a temporary file in a workflow, or lies at a stage in between

  • pipeline_name (str) – The name of the workflow applied to the dataset to generate the data for the sink
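A corresponding sink specification might look like this sketch, with NiftiGz and Clinical again standing in as illustrative names; salience is left at its documented default:

from arcana.core.data.column import DataSink

sink = DataSink(
    name="brain_mask",
    datatype=NiftiGz,                  # illustrative file-format class
    row_frequency=Clinical.session,    # illustrative DataSpace member
    # salience defaults to ColumnSalience.supplementary
    pipeline_name="brain_extraction",  # workflow that generates the data
)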

Stores

Processing

class arcana.core.analysis.pipeline.Pipeline(name: str, row_frequency: DataSpace, workflow: Workflow, inputs, outputs, converter_args=NOTHING, dataset: Dataset = None)

A thin wrapper around a Pydra workflow to link it to sources and sinks within a dataset

Parameters:
  • name (str) – the name of the pipeline, used to differentiate it from others

  • row_frequency (DataSpace, optional) – The row_frequency of the pipeline, i.e. the row_frequency of the derivatives within the dataset, e.g. per-session, per-subject, etc., by default None

  • workflow (Workflow) – The pydra workflow that performs the actual analysis

  • inputs (Sequence[ty.Union[str, ty.Tuple[str, type]]]) – List of column names (i.e. either data sources or sinks) to be connected to the inputs of the pipeline. If the pipeline requires an input to be in a different datatype from the source, it can be specified as a tuple (NAME, FORMAT)

  • outputs (Sequence[ty.Union[str, ty.Tuple[str, type]]]) – List of sink names to be connected to the outputs of the pipeline. If an output is produced in a specific datatype, it can be provided as a tuple (NAME, FORMAT)

  • converter_args (dict[str, dict]) – keyword arguments passed on to the converter to control how the conversion is performed.

  • dataset (Dataset) – the dataset the pipeline has been applied to
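Tying these pieces together, a pipeline linking the example columns above to a pydra workflow might be constructed as follows; wf, NiftiGz and Clinical are assumptions carried over from the earlier sketches:

from arcana.core.analysis.pipeline import Pipeline

pipeline = Pipeline(
    name="brain_extraction",
    row_frequency=Clinical.session,  # derivatives generated per session
    workflow=wf,                     # an existing pydra Workflow
    inputs=[("t1w", NiftiGz)],       # convert the source to NiftiGz if needed
    outputs=["brain_mask"],          # sink column that stores the results
    dataset=dataset,
)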

Enums