Public API

Available Backends

class frametree.common.FileSystem(name: str = 'file_system')[source]

A Repository class for data stored hierarchically within sub-directories of a file-system directory. The depth of the tree and the layers of the data tree that the sub-directories correspond to are defined by the hierarchy argument.
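
For example, a minimal usage sketch (the directory path and hierarchy values are hypothetical, assuming a tree of subject sub-directories that each contain visit sub-directories):

    from frametree.common import FileSystem, Clinical

    store = FileSystem()
    # Wrap an existing directory tree as a frameset (path is a placeholder)
    frameset = store.define_frameset(
        "/data/my-project",
        axes=Clinical,
        hierarchy=["subject", "visit"],
    )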

class frametree.bids.Bids(json_edits: list = NOTHING, name: str = 'bids')[source]

Repository for working with data stored on the file-system in BIDS format

Parameters:

json_edits (list[tuple[str, str]], optional) -- Specifications for edits to apply to JSON files as they are written to the store, enabling manual correction of metadata fields. Each item is a tuple of the form (FILE_PATH, EDIT_STR), where FILE_PATH is a path expression that selects the files to edit and EDIT_STR is a jq filter used to modify the JSON document.
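
For example, a hedged sketch of pre-configuring edits (the path expression and jq filter are purely illustrative):

    from frametree.bids import Bids

    store = Bids(
        json_edits=[
            # For JSON side-cars matching the path expression, set the
            # "IntendedFor" field with a jq filter (values are illustrative)
            ("fmap/.*", '.IntendedFor = "func/sub-01_task-rest_bold.nii.gz"'),
        ],
    )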

class frametree.xnat.Xnat(server: str, cache_dir: str | Path, name: str | None = None, user: str = None, password: str = None, race_condition_delay: int = 5, verify_ssl: bool = True)[source]

Access class for XNAT data repositories

Parameters:
  • server (str (URI)) -- URI of XNAT server to connect to

  • project_id (str) -- The ID of the project in the XNAT repository

  • cache_dir (str (name_path)) -- Path to local directory to cache remote data in

  • user (str) -- Username with which to connect to XNAT

  • password (str) -- Password used to connect to the XNAT repository

  • race_condition_delay (int) -- The number of seconds to wait before re-checking whether another process attempting to download the same fileset has finished downloading it to the cache
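
A minimal connection sketch (the server URI, credentials and cache path are placeholders):

    from frametree.xnat import Xnat

    store = Xnat(
        server="https://xnat.example.com",  # placeholder URI
        cache_dir="/tmp/xnat-cache",        # local directory for downloaded data
        user="myuser",                      # placeholder credentials
        password="mypassword",
    )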

class frametree.xnat.XnatViaCS(name: ty.Optional[str] = None, race_condition_delay: int = 5, verify_ssl: bool = True, row_frequency: Axes = Clinical.session, row_id: str = None, input_mount=PosixPath('/input'), output_mount=PosixPath('/output'), server: str = NOTHING, user: str = NOTHING, password: str = NOTHING, cache_dir=PosixPath('/cache'))[source]

Access class for XNAT repositories via the XNAT container service plugin. The container service exposes the underlying file system, so imaging data can be accessed directly (for performance) and outputs can be uploaded back to the repository.

Parameters:
  • server (str (URI)) -- URI of XNAT server to connect to

  • project_id (str) -- The ID of the project in the XNAT repository

  • cache_dir (str (name_path)) -- Path to local directory to cache remote data in

  • user (str) -- Username with which to connect to XNAT

  • password (str) -- Password used to connect to the XNAT repository

  • check_md5 (bool) -- Whether to check the MD5 digest of cached files before using. This checks for updates on the server since the file was cached

  • race_cond_delay (int) -- The number of seconds to wait before re-checking whether another process attempting to download the same fileset has finished downloading it to the cache

Available Axes

class frametree.common.Samples(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)[source]

The most basic data space, with only a single dimension

class frametree.common.Clinical(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)[source]

An enum used to specify the hierarchy of data trees and the "frequencies" of items within datasets typical of medical-imaging research, i.e. subjects split into groups and scanned at different visits (in longitudinal studies).

Markers

class frametree.core.salience.ColumnSalience(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)[source]

An enum that holds the salience level options that can be used when specifying data columns. Salience indicates whether the data is best stored in the data store or whether it can simply be stored on the local file-system and discarded after it has been used. This choice is ultimately made by the user, by defining a salience threshold for a store.

The salience is also used when providing information on what sinks are available to avoid cluttering help menus

primary = (100, 'Primary input data, typically reconstructed by the instrument that collects them')
raw = (90, "Raw data from the scanner that haven't been reconstructed and are only typically used in advanced analyses")
publication = (80, 'Results that would typically be used as main outputs in publications')
supplementary = (60, 'Derivatives that would typically only be provided in supplementary material')
qa = (40, 'Derivatives that would typically be only kept for quality assurance of analysis workflows')
debug = (20, 'Derivatives that would typically only need to be checked when debugging analysis workflows')
temp = (0, 'Data only temporarily stored to pass between pipelines, e.g. that operate on different row frequencies')
classmethod default()[source]
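
As a hedged illustration, a salience level is typically attached to a column when it is added to a frameset. The salience keyword reaching the column via **kwargs is an assumption, and the column name and datatype are hypothetical:

    from fileformats.image import Png
    from frametree.core.salience import ColumnSalience

    # Assuming ``frameset`` is an existing FrameSet
    frameset.add_sink(
        "motion_plot",
        datatype=Png,
        salience=ColumnSalience.qa,  # assumed keyword; kept for QA purposes only
    )
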
class frametree.core.salience.ParameterSalience(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)[source]

An enum that holds the salience levels options that can be used when specifying class parameters. Salience is used to indicate whether the parameter should show up by default when listing the available parameters of an Analysis class in a menu.

debug = (0, 'typically only needed to be altered for debugging')
recommended = (20, 'recommended to keep defaults')
dependent = (40, 'best value can be dependent on the context of the analysis, but the default should work for most cases')
check = (60, 'default value should be checked for validity for particular use case')
arbitrary = (80, 'a default is provided, but it is not clear which value is best')
required = (100, 'No sensible default value, should be provided')
classmethod default()[source]

class frametree.core.salience.CheckSalience(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)[source]

An enum that holds the potential values for signifying how critical a check is to run.

debug = (0, 'typically only used to debug alterations to the pipeline')
potential = (20, 'check can be run but not typically necessary')
prudent = (40, 'it is prudent to run the check on the results, but it can be skipped if required')
recommended = (60, 'recommended to run the check, as the pipeline fails 1-2% of the time')
required = (100, 'Pipeline will often fail, checking the results is required')
classmethod default()[source]

class frametree.core.salience.CheckStatus(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)[source]

An enum that holds the potential values that signify how likely it is that a pipeline has failed

failed = (0, 'the pipeline has failed')
probable_fail = (25, 'probable that the pipeline has failed')
unclear = (50, 'cannot ascertain whether the pipeline has failed or not')
probable_pass = (75, 'probable that the pipeline has run successfully')
passed = (100, 'the pipeline has run successfully')
classmethod default()[source]

class frametree.core.quality.DataQuality(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)[source]

The quality of a data item. Can be manually specified or set by automatic quality control methods

usable = 100
noisy = 75
questionable = 50
artefactual = 25
unusable = 0
classmethod default()[source]

Core

class frametree.core.store.Store[source]

Abstract base class for all data store adapters. A data store can be an external data management system, e.g. XNAT, OpenNeuro or Datalad, or just a defined structure for how data is laid out within a file-system directory, e.g. BIDS.

For a data management system/data structure to be compatible with FrameTree, it must meet a number of criteria. In FrameTree, a store is assumed to

  • contain multiple projects/datasets addressable by unique IDs.

  • organise data within each project/dataset in trees

  • store arbitrary numbers of data "items" (e.g. "file-sets" and fields) within each tree node (including non-leaf nodes) addressable by unique "paths" relative to the node.

  • allow derivative data to be stored within separate namespaces for different analyses on the same data

create_dataset(id: str, leaves: ty.Iterable[ty.Tuple[str, ...]], hierarchy: ty.List[str], axes: type, name: ty.Optional[str] = None, id_patterns: ty.Optional[ty.Dict[str, str]] = None, **kwargs: ty.Any) FrameSet[source]

Creates a new dataset with new rows to store data in

Parameters:
  • id (str) -- ID of the dataset

  • leaves (list[tuple[str, ...]]) -- the list of IDs for the leaf rows of the tree, one tuple per leaf, with an ID for each level of the tree

  • name (str, optional) -- name of the dataset; if provided, the dataset definition will be saved. To save the dataset with the default name, pass an empty string.

  • hierarchy (list[str], optional) -- hierarchy of the dataset tree

  • axes (type, optional) -- the axes of the dataset

  • id_patterns (dict[str, str]) -- Patterns for inferring IDs of rows not explicitly present in the hierarchy of the data tree. See Store.infer_ids() for syntax

Returns:

the newly created dataset

Return type:

FrameSet
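
For example, a sketch of creating a small two-layer dataset (all IDs are illustrative):

    from frametree.common import Clinical

    frameset = store.create_dataset(
        id="my-project",
        leaves=[
            ("sub01", "visit1"), ("sub01", "visit2"),
            ("sub02", "visit1"), ("sub02", "visit2"),
        ],
        hierarchy=["subject", "visit"],
        axes=Clinical,
    )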

define_frameset(id: str, axes: ty.Optional[ty.Type[Axes]] = None, hierarchy: ty.Optional[ty.List[ty.Union[str, Axes]]] = None, id_patterns: ty.Optional[ty.Dict[str, str]] = None, **kwargs: ty.Any) FrameSet[source]

Creates a FrameTree dataset definition for existing data in the data store.

Parameters:
  • id (str) -- The ID (or file-system path) of the project (or directory) within the store

  • axes (Axes) -- The data axes of the frametree

  • hierarchy (ty.List[str]) -- The hierarchy of the frametree

  • id_patterns (dict[str, str], optional) -- Patterns used to infer row IDs not explicitly within the hierarchy of the data tree, e.g. groups and visits in an XNAT project with subject>session hierarchy

  • **kwargs -- Keyword args passed on to the FrameSet init method

Returns:

the newly defined dataset

Return type:

FrameSet
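
For example, a hedged sketch of defining a frameset over an existing XNAT project with a subject>session hierarchy (the id_patterns value is a placeholder; see Store.infer_ids() for the actual pattern syntax):

    from frametree.common import Clinical

    frameset = xnat_store.define_frameset(
        "MYPROJECT",                   # illustrative project ID
        axes=Clinical,
        hierarchy=["subject", "session"],
        id_patterns={"visit": "..."},  # placeholder pattern to extract visit IDs
    )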

import_dataset(id: str, dataset: FrameSet, column_names: ty.Optional[ty.List[ty.Union[str, ty.Tuple[str, type]]]] = None, hierarchy: ty.Optional[ty.List[str]] = None, id_patterns: ty.Optional[ty.Dict[str, str]] = None, use_original_paths: bool = False, **kwargs: ty.Any) None[source]

Import a dataset from another store, transferring metadata and columns defined on the original dataset

Parameters:
  • id (str) -- the ID of the dataset within this store

  • dataset (FrameSet) -- the dataset to import

  • column_names (list[str or tuple[str, type]], optional) -- list of columns to be included in the imported dataset. Each item is either the name of a column to import or a tuple of the column name and the datatype to import it as. If the datatype isn't provided, the store's DEFAULT_DATATYPE attribute is used if present; otherwise the original datatype is used. By default all columns are imported

  • hierarchy (list[str], optional) -- the hierarchy of the imported dataset, by default either the default hierarchy of the target store if applicable or the hierarchy of the original dataset

  • id_patterns (dict[str, str]) -- Patterns for inferring IDs of rows not explicitly present in the hierarchy of the data tree. See Store.infer_ids() for syntax

  • use_original_paths (bool, optional) -- use the original paths in the source store instead of renaming the imported entries to match their column names

  • **kwargs -- keyword arguments passed through to the create_data_tree method
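
A hedged sketch of importing a dataset into another store (the stores, IDs and column names are hypothetical, and NiftiGzX is assumed to be available from the fileformats package):

    from fileformats.medimage import NiftiGzX  # assumed datatype

    # Assuming ``bids_store`` is e.g. a Bids instance and ``frameset`` was
    # loaded from another store
    bids_store.import_dataset(
        id="/data/imported-bids",
        dataset=frameset,
        column_names=[("t1w", NiftiGzX), ("bold", NiftiGzX)],
    )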

classmethod load(name: str, config_path: Path | None = None, **kwargs: Any) Store[source]

Loads a Store that has been saved in the configuration file. If no entry is saved under that name, it searches for Store sub-classes with aliases matching the name and checks whether they can be initialised without any parameters.

Parameters:
  • name (str) -- Name that the store was saved under

  • config_path (Path, optional) -- path to the config file, defaults to ~/.frametree/stores.yaml

  • **kwargs -- keyword args passed to the store, overriding values stored in the entry

Returns:

The data store retrieved from the stores.yaml file

Return type:

Store

Raises:

FrameTreeNameError -- If the name is not found in the saved stores

classmethod remove(name: str, config_path: Path | None = None) None[source]

Removes the entry saved under 'name' in the config file

Parameters:

name -- Name of the configuration to remove

save(name: str | None = None, config_path: Path | None = None) None[source]

Saves the configuration of a Store in 'stores.yaml'

Parameters:
  • name -- The name under which to save the data store

  • config_path (Path, optional) -- the path to save the config file to, defaults to ~/.frametree/stores.yaml
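
For example, a save/load round trip (the nickname is arbitrary):

    from frametree.core.store import Store

    store.save("my-xnat")  # writes an entry to ~/.frametree/stores.yaml
    same_store = Store.load("my-xnat")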

class frametree.core.frameset.FrameSet(id, store: Store = NOTHING, axes: Type[Axes] = NOTHING, id_patterns=NOTHING, hierarchy: List[str | Axes] = NOTHING, metadata=NOTHING, include=NOTHING, exclude=NOTHING, name: str = '', columns=NOTHING, pipelines=NOTHING)[source]

A representation of a "dataset", the complete collection of data (file-sets and fields) to be used in an analysis.

Parameters:
  • id (str) -- The dataset id/path that uniquely identifies the dataset within the store it is stored (e.g. FS directory path or project ID)

  • store (Store) -- The store the dataset is stored in. Can be the local file system, by providing a FileSystem repo.

  • axes (Axes) -- The space of the dataset. See https://frametree.readthedocs.io/en/latest/data_model.html#spaces for a description

  • id_patterns (dict[str, str]) -- Patterns for inferring IDs of rows not explicitly present in the hierarchy of the data tree. See Store.infer_ids() for syntax

  • hierarchy (list[str]) --

    The categorical variables that are explicitly present in the data tree. For example, if a file-system dataset (i.e. a directory) has a two-layer hierarchy of sub-directories, with the first layer labelled by unique subject ID and the second layer labelled by visit, then the hierarchy would be

    ['subject', 'visit']

    Alternatively, in some stores (e.g. XNAT) the second layer in the hierarchy may be named with a session ID that is unique across the project, in which case the layer dimensions would instead be

    ['subject', 'session']

    In such cases, if there are multiple visits, the visit ID of the session will need to be extracted using the id_patterns argument.

    Alternatively, the hierarchy could be organised such that the tree first splits on longitudinal time-points, then a second directory layer labelled by member ID, with the final layer containing sessions of matched members labelled by their groups (e.g. test & control):

    ['visit', 'member', 'group']

    Note that the combination of layers in the hierarchy must span the space defined in the Axes enum, i.e. the "bitwise or" of the layer values of the hierarchy must be 1 across all bits (e.g. 'session': 0b111).

  • metadata (dict or Metadata) -- Generic metadata associated with the dataset, e.g. authors, funding sources, etc...

  • include (list[tuple[Axes, str or ty.List[str]]]) -- The IDs to be included in the dataset per row_frequency. E.g. can be used to limit the subject IDs in a project to the sub-set that passed QC. If a row_frequency is omitted, or its value is None, then all available IDs will be used

  • exclude (list[tuple[Axes, str or ty.List[str]]]) -- The IDs to be excluded from the dataset per row_frequency. E.g. can be used to exclude specific subjects that failed QC. If a row_frequency is omitted, or its value is None, then no IDs will be excluded

  • name (str) -- The name under which the dataset is saved in the store

  • columns (list[tuple[str, SourceColumn or SinkColumn]]) -- The sources and sinks to be initially added to the dataset (columns are explicitly added when workflows are applied to the dataset).

  • pipelines (dict[str, pydra.Workflow]) -- Pipelines that have been applied to the dataset to generate sinks

  • access_args (ty.Dict[str, Any]) -- Repository specific args used to control the way the dataset is accessed

__getitem__(name: str) DataColumn[source]

Return all data items across the dataset for a given source or sink

Parameters:

name (str) -- Name of the column to return

Returns:

the column object

Return type:

DataColumn

add_sink(name: str, datatype: type, row_frequency: str | None = None, overwrite: bool = False, **kwargs: Any) SinkColumn[source]

Specify a data sink in the dataset, which can then be referenced when connecting workflow outputs.

Parameters:
  • name (str) -- The name used to reference the dataset "column" for the sink

  • datatype (type) -- The file-format (for file-sets) or datatype (for fields) that the sink will be stored as within the dataset

  • path (str, optional) -- Specify a particular path for the sink within the dataset; defaults to the column name within the dataset derivatives directory of the store

  • row_frequency (str, optional) -- The row_frequency of the sink within the dataset, by default the leaf frequency of the data tree

  • overwrite (bool) -- Whether to overwrite an existing sink

add_source(name: str, datatype: type, path: str | None = None, row_frequency: str | None = None, overwrite: bool = False, **kwargs: Any) SourceColumn[source]

Specify a data source in the dataset, which can then be referenced when connecting workflow inputs.

Parameters:
  • name (str) -- The name used to reference the dataset "column" for the source

  • datatype (type) -- The file-format (for file-sets) or datatype (for fields) that the source is stored as within the dataset

  • path (str, default name) -- The location of the source within the dataset

  • row_frequency (Axes, default self.leaf_freq) -- The row_frequency of the source within the dataset

  • overwrite (bool) -- Whether to overwrite existing columns

  • **kwargs (ty.Dict[str, Any]) -- Additional kwargs to pass to SourceColumn.__init__
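
For example, a sketch adding a source and a matching sink to a frameset (column names, the path pattern and datatypes are illustrative; DicomSeries and NiftiGz are assumed to come from the fileformats package):

    from fileformats.medimage import DicomSeries, NiftiGz  # assumed datatypes

    frameset.add_source(
        "t1w",
        datatype=DicomSeries,
        path="t1_mprage.*",  # illustrative pattern locating the scan in the store
    )
    frameset.add_sink(
        "t1w_brain",
        datatype=NiftiGz,
    )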

apply(name: str, workflow: Workflow, inputs: ty.List[ty.Union['PipelineField', ty.Tuple[str, str, type], ty.Tuple[str, str]]], outputs: ty.List[ty.Union['PipelineField', ty.Tuple[str, str, type], ty.Tuple[str, str]]], row_frequency: ty.Union[Axes, str, None] = None, overwrite: bool = False, converter_args: ty.Optional[ty.Dict[str, ty.Any]] = None) Pipeline[source]

Connect a Pydra workflow as a pipeline of the dataset

Parameters:
  • name (str) -- name of the pipeline

  • workflow (pydra.Workflow) -- pydra workflow to connect to the dataset as a pipeline

  • inputs (list[frametree.core.pipeline.Input or tuple[str, str, type] or tuple[str, str]]) -- List of inputs to the pipeline (see frametree.core.pipeline.Pipeline.PipelineInput)

  • outputs (list[frametree.core.pipeline.Output or tuple[str, str, type] or tuple[str, str]]) -- List of outputs of the pipeline (see frametree.core.pipeline.Pipeline.PipelineOutput)

  • row_frequency (str, optional) -- the frequency of the data rows the pipeline will be executed over, i.e. will it be run once per-session, per-subject or per whole dataset, by default the highest row frequency (e.g. per-session for Clinical)

  • overwrite (bool, optional) -- overwrite connections to previously connected sinks, by default False

  • converter_args (dict[str, dict]) -- keyword arguments passed on to the converter to control how the conversion is performed.

Returns:

the pipeline added to the dataset

Return type:

Pipeline

Raises:

FrameTreeUsageError -- if overwrite is False and the pipeline would overwrite a previously connected sink
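
A hedged sketch of connecting a workflow (``smooth_wf`` is assumed to be a pydra workflow with ``in_image``/``out_image`` fields; the three-element tuple order of column name, workflow field and datatype is an assumption, see frametree.core.pipeline for the exact form):

    from fileformats.medimage import NiftiGz  # assumed datatype

    pipeline = frameset.apply(
        name="smoothing",
        workflow=smooth_wf,
        inputs=[("t1w", "in_image", NiftiGz)],
        outputs=[("t1w_smoothed", "out_image", NiftiGz)],
    )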

derive(*sink_names: str, ids: Iterable[str] | None = None, cache_dir: Path = None, **kwargs: Any) None[source]

Generate derivatives from the workflows

Parameters:
  • *sink_names (Iterable[str]) -- Names of the columns corresponding to the items to derive

  • ids (Iterable[str]) -- The IDs of the data rows in each column to derive

  • cache_dir (Path, optional) -- the directory in which intermediate working files are cached during the derivation

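For example (assuming the "t1w_smoothed" sink from the apply() sketch above is connected to a pipeline):

    frameset.derive("t1w_smoothed")                 # derive for all rows
    frameset.derive("t1w_smoothed", ids=["sub01"])  # restrict to row IDs (illustrative)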

install_license(name: str, source_file: Plain) None[source]

Store project-specific license in dataset

Parameters:
  • name (str) -- name of the license to install

  • source_file (Plain) -- the license file to install

classmethod load(id: str, store: Store | None = None, name: str | None = '', default_if_missing: bool = False, **kwargs: Any) FrameSet[source]

Loads a dataset from a store/ID/name string, as used in the CLI

Parameters:
  • id (str) -- either the ID of a dataset if store keyword arg is provided or a "dataset ID string" in the format <store-nickname>//<dataset-id>[@<dataset-name>]

  • store (Store, optional) -- the store to load the dataset from. If not provided, the ID is interpreted as a dataset ID string

  • name (str, optional) -- the name of the dataset within the project/directory (e.g. 'test', 'training'). Used to specify a subset of data rows to work with, within a greater project

  • default_if_missing (bool, optional) -- If True, then a new dataset is created if the dataset is not found in the store

  • **kwargs -- keyword arguments passed on to the data store load method

Returns:

the loaded dataset

Return type:

FrameSet
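
For example, loading via a dataset ID string (the store nickname, project ID and dataset name are placeholders):

    from frametree.core.frameset import FrameSet

    frameset = FrameSet.load("my-xnat//MYPROJECT@training")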

row(frequency: Axes | str | None = None, id: str | Tuple[str, ...] = NOTHING, **id_kwargs: Any) DataRow[source]

Returns the row associated with the given frequency and ID(s)

Parameters:
  • frequency (Axes or str) -- The frequency of the row

  • id (str or Tuple[str], optional) -- The ID of the row to return

  • **id_kwargs (Dict[str, str]) -- Alternatively to providing id, the IDs corresponding to the row to return can be passed as keyword arguments

Returns:

The selected data row

Return type:

DataRow

Raises:
  • FrameTreeUsageError -- Raised when attempting to use IDs with the frequency associated with the root row

  • FrameTreeNameError -- If there is no row corresponding to the given ids
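
For example, a hedged sketch of addressing a session row by its axis IDs (keyword names matching the axes of the frameset is an assumption; all IDs are illustrative):

    row = frameset.row(frequency="session", subject="sub01", visit="visit1")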

rows(frequency: str | None = None, ids: Collection[str] | None = None) List[DataRow][source]

Return all the rows in the dataset for a given frequency

Parameters:
  • frequency (Axes, optional) -- The "frequency" of the rows, e.g. per-session, per-subject, defaults to leaf rows

  • ids (Sequence[str or Tuple[str]]) -- The IDs of the rows to return

Returns:

The sequence of data rows within the dataset

Return type:

Sequence[DataRow]

class frametree.core.axes.Axes(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)[source]

Base class for all "data axes" enums. Axes specify the categorical variables along which grids of data points are laid out.

For example, in imaging studies, scanning sessions are typically organised by analysis group (e.g. test & control), membership within the group (i.e. control-matched subjects) and time-points (for longitudinal studies). We can visualise the rows arranged in a 3-D frameset along the group, member, and visit dimensions. Note that datasets that only contain one group or time-point can still be represented in the same space; they are just of depth 1 along those dimensions.

All dimensions should be included as members of an Axes subclass enum with orthogonal binary vector values, e.g.

member = 0b001
group = 0b010
visit = 0b100

In this space, an imaging session row is uniquely defined by its member, group and visit IDs. The most commonly present dimension should be given the least significant bit (e.g. imaging datasets will not always have different groups or time-points, but will always have different members, which are equivalent to subjects when there is only one group).

In addition to the data items stored in the data rows for each session, some items only vary along a particular dimension of the frameset. The "row_frequency" of these rows can be specified using the "basis" members (i.e. member, group, visit) in contrast to the session row_frequency, which is the combination of all three

session = 0b111

Additionally, some data is stored in rows that aggregate across a plane of the frameset. These frequencies should also be added to the enum (all combinations of the basis frequencies must be included) and given intuitive names where possible, e.g.

subject = 0b011 - a uniquely identified subject within the dataset
groupedvisit = 0b110 - separate group + visit combinations
matchedvisit = 0b101 - matched members and time-points aggregated across groups

Finally, for items that are singular across the whole dataset there should also be a dataset-wide member with value=0:

dataset = 0b000
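
Putting the above together, a sketch of a custom Axes subclass mirroring the example values described here:

    from frametree.core.axes import Axes

    class MyStudy(Axes):
        # dataset-wide singleton data
        dataset = 0b000
        # orthogonal basis dimensions
        member = 0b001
        group = 0b010
        visit = 0b100
        # aggregate planes across the basis dimensions
        subject = 0b011
        matchedvisit = 0b101
        groupedvisit = 0b110
        # leaf rows: the combination of all three dimensions
        session = 0b111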

is_parent(child, if_match=False)[source]

Checks whether the current frequency is a "parent" of the other data frequency, i.e. whether all the basis frequencies of self appear in the "child".

Parameters:
  • child (Axes) -- The data frequency to check parent/child relationship with

  • if_match (bool) -- Treat matching frequencies as "parents" of each other

Returns:

True if self is parent of child

Return type:

bool

span()[source]

Returns the basis dimensions in the data tree that the given enum-member projects into.

For example, in Clinical data trees, the frequencies can be decomposed into the following basis dimensions:

dataset -> []
group -> [group]
member -> [member]
visit -> [visit]
subject -> [group, member]
groupedvisit -> [visit, group]
matchedvisit -> [visit, member]
session -> [visit, group, member]

classmethod union(freqs: Sequence[Enum])[source]

Returns the union between data frequency values
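
For example, with the basis values shown in the Axes docstring above (and assuming union returns the enum member for the combined value):

    from frametree.common import Clinical

    # group (0b010) | member (0b001) -> subject (0b011)
    assert Clinical.union([Clinical.group, Clinical.member]) == Clinical.subject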

class frametree.core.row.DataRow(*, ids: ty.Dict[Axes, str], frameset: FrameSet, frequency: str, tree_path: ty.List[str] = None, uri: ty.Optional[str] = None, metadata: ty.Optional[dict] = None)[source]

A "row" in a dataset "frame" where file-sets and fields can be placed, e.g. a session or subject.

Parameters:
  • ids (Dict[Axes, str]) -- The ids for the frequency of the row and all "parent" frequencies within the tree

  • frameset (FrameSet) -- A reference to the root of the data tree

  • frequency (str) -- The frequency of the row

  • tree_path (list[str], optional) -- the path to the row within the data tree. None if the row doesn't sit within the original tree (e.g. visits within a subject>session hierarchy)

  • uri (str, optional) -- a URI for the row, can be set and used by the data store implementation if appropriate, by default None

__getitem__(column_name: str) DataType[source]

Gets the item for the current row

Parameters:

column_name (str) -- Name of a selected column in the dataset

Returns:

The item in the current row belonging to the named column

Return type:

DataType
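
For example, a sketch of retrieving an item from a row (the column name and row IDs are illustrative):

    row = frameset.row(subject="sub01", visit="visit1")
    t1w_item = row["t1w"]  # the entry in this row for the "t1w" column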