Public API
Available Backends
- class frametree.common.FileSystem(name: str = 'file_system')
A Repository class for data stored hierarchically within sub-directories of a file-system directory. The depth of the tree and which layers of the data tree the sub-directories correspond to are defined by the hierarchy argument.
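For example, a frameset can be defined over an existing directory tree (a minimal sketch; the path and hierarchy below are hypothetical):

from frametree.common import Clinical, FileSystem

# First-level sub-directories are subjects, second-level are sessions
store = FileSystem()
frameset = store.define_frameset(
    "/data/my-project",
    axes=Clinical,
    hierarchy=["subject", "session"],
)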
- class frametree.bids.Bids(json_edits: list = NOTHING, name: str = 'bids')
Repository for working with data stored on the file-system in BIDS format
- Parameters:
json_edits (list[tuple[str, str]], optional) -- Specifications for editing JSON files as they are written to the store, enabling manual modification of fields to correct metadata. List of tuples of the form: FILE_PATH - a path expression to select the files; EDIT_STR - a jq filter used to modify the JSON document.
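A sketch of the json_edits argument (the path expression and jq filter below are illustrative only):

from frametree.bids import Bids

# Set the RepetitionTime field in the JSON side-cars of matching BOLD images
store = Bids(
    json_edits=[
        ("func/.*_bold", ".RepetitionTime = 2.0"),
    ],
)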
- class frametree.xnat.Xnat(server: str, cache_dir: str | Path, name: str | None = None, user: str = None, password: str = None, race_condition_delay: int = 5, verify_ssl: bool = True)
Access class for XNAT data repositories
- Parameters:
server (str (URI)) -- URI of the XNAT server to connect to
cache_dir (str (name_path)) -- Path to a local directory in which to cache remote data
user (str) -- Username with which to connect to XNAT
password (str) -- Password with which to connect to the XNAT repository
race_condition_delay (int) -- The number of seconds to wait before checking whether a download of the same fileset started by another process has completed
verify_ssl (bool) -- Whether to verify the SSL certificate of the server
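A minimal connection sketch (the server URL and credentials are placeholders; in practice a store is typically saved once and retrieved with Store.load()):

from frametree.xnat import Xnat

store = Xnat(
    server="https://xnat.example.org",
    cache_dir="/tmp/xnat-cache",
    user="myuser",
    password="mypassword",
)
frameset = store.define_frameset("MY_PROJECT")  # ID of the XNAT project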
- class frametree.xnat.XnatViaCS(name: ty.Optional[str] = None, race_condition_delay: int = 5, verify_ssl: bool = True, row_frequency: Axes = Clinical.session, row_id: str = None, input_mount=PosixPath('/input'), output_mount=PosixPath('/output'), server: str = NOTHING, user: str = NOTHING, password: str = NOTHING, cache_dir=PosixPath('/cache'))
Access class for XNAT repositories via the XNAT container service plugin. The container service exposes the underlying file system, allowing imaging data to be accessed directly (for performance) and pipeline outputs to be written back to the archive via the output mount.
- Parameters:
server (str (URI)) -- URI of the XNAT server to connect to
cache_dir (str (name_path)) -- Path to a local directory in which to cache remote data
user (str) -- Username with which to connect to XNAT
password (str) -- Password with which to connect to the XNAT repository
check_md5 (bool) -- Whether to check the MD5 digest of cached files before using them. This checks for updates on the server since the file was cached
race_condition_delay (int) -- The number of seconds to wait before checking whether a download of the same fileset started by another process has completed
Available Axes
- class frametree.common.Samples(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)
The most basic data space, with only one dimension
- class frametree.common.Clinical(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)
An enum used to specify the hierarchy of data trees and the "frequencies" of items within datasets typical of medical-imaging research, i.e. subjects split into groups and scanned at different visits (in longitudinal studies).
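A sketch of how Clinical frequencies relate to one another (the bit-vector conventions are described under Axes below):

from frametree.common import Clinical

leaf = Clinical.session  # combination of all three basis dimensions
basis = leaf.span()      # decomposes a frequency into its basis dimensions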
Markers
- class frametree.core.salience.ColumnSalience(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)
An enum that holds the salience level options that can be used when specifying data columns. Salience is used to indicate whether it is best to store the data in the data store, or whether it can simply be kept in the local file-system and discarded after it has been used. This choice is ultimately made by the user, by defining a salience threshold for a store.
The salience is also used when providing information on which sinks are available, to avoid cluttering help menus
- primary = (100, 'Primary input data, typically reconstructed by the instrument that collects them')
- raw = (90, "Raw data from the scanner that haven't been reconstructed and are only typically used in advanced analyses")
- publication = (80, 'Results that would typically be used as main outputs in publications')
- supplementary = (60, 'Derivatives that would typically only be provided in supplementary material')
- qa = (40, 'Derivatives that would typically be only kept for quality assurance of analysis workflows')
- debug = (20, 'Derivatives that would typically only need to be checked when debugging analysis workflows')
- temp = (0, 'Data only temporarily stored to pass between pipelines, e.g. that operate on different row frequencies')
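Each member's value is a (level, description) tuple, so the numeric level and help text can be read back directly, e.g.:

from frametree.core.salience import ColumnSalience

level, description = ColumnSalience.publication.value
assert level == 80  # between 'supplementary' (60) and 'raw' (90)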
- class frametree.core.salience.ParameterSalience(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)
An enum that holds the salience level options that can be used when specifying class parameters. Salience is used to indicate whether the parameter should show up by default when listing the available parameters of an Analysis class in a menu.
- debug = (0, 'typically only needed to be altered for debugging')
- recommended = (20, 'recommended to keep defaults')
- dependent = (40, 'best value can be dependent on the context of the analysis, but the default should work for most cases')
- check = (60, 'default value should be checked for validity for particular use case')
- arbitrary = (80, 'a default is provided, but it is not clear which value is best')
- required = (100, 'No sensible default value, should be provided')
- class frametree.core.salience.CheckSalience(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)
An enum that holds the potential values for signifying how critical a check is to run.
- debug = (0, 'typically only used to debug alterations to the pipeline')
- potential = (20, 'check can be run but not typically necessary')
- prudent = (40, 'it is prudent to run the check the results but you can skip if required')
- recommended = (60, 'recommended to run the check as pipeline fails 1~2% of the time')
- strongly_recommended = (80, 'strongly recommended to run the check as pipeline fails 5~10% of the time')
- required = (100, 'Pipeline will often fail, checking the results is required')
- class frametree.core.salience.CheckStatus(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)
An enum that holds the potential values that signify how likely it is that a pipeline has failed
- failed = (0, 'the pipeline has failed')
- probable_fail = (25, 'probable that the pipeline has failed')
- unclear = (50, 'cannot ascertain whether the pipeline has failed or not')
- probable_pass = (75, 'probable that the pipeline has run successfully')
- passed = (100, 'the pipeline has run successfully')
- class frametree.core.quality.DataQuality(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)
The quality of a data item. Can be manually specified or set by automatic quality control methods
- usable = 100
- noisy = 75
- questionable = 50
- artefactual = 25
- unusable = 0
Core
- class frametree.core.store.Store
Abstract base class for all data store adapters. A data store can be an external data management system, e.g. XNAT, OpenNeuro or Datalad, or simply a defined way of laying out data within a file-system directory, e.g. BIDS.
For a data management system/data structure to be compatible with FrameTree, it must meet a number of criteria. In FrameTree, a store is assumed to:
- contain multiple projects/datasets addressable by unique IDs
- organise data within each project/dataset in trees
- store arbitrary numbers of data "items" (e.g. file-sets and fields) within each tree node (including non-leaf nodes), addressable by unique "paths" relative to the node
- allow derivative data to be stored within separate namespaces for different analyses on the same data
- create_dataset(id: str, leaves: ty.Iterable[ty.Tuple[str, ...]], hierarchy: ty.List[str], axes: type, name: ty.Optional[str] = None, id_patterns: ty.Optional[ty.Dict[str, str]] = None, **kwargs: ty.Any) FrameSet
Creates a new dataset with new rows to store data in
- Parameters:
id (str) -- ID of the dataset
leaves (list[tuple[str, ...]]) -- the IDs of the leaf rows, each given as a tuple with one ID per level of the tree
name (str, optional) -- name of the dataset; if provided, the dataset definition will be saved. To save the dataset with the default name, pass an empty string.
hierarchy (list[str], optional) -- hierarchy of the dataset tree
axes (type, optional) -- the axes of the dataset
id_patterns (dict[str, str]) -- Patterns for inferring IDs of rows not explicitly present in the hierarchy of the data tree. See Store.infer_ids() for syntax
- Returns:
the newly created dataset
- Return type:
FrameSet
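A sketch of creating a small test dataset (assuming store is a Store instance such as FileSystem; IDs are hypothetical):

from frametree.common import Clinical

frameset = store.create_dataset(
    id="my-dataset",
    leaves=[
        ("sub01", "visit1"),  # one tuple per leaf row, one ID per layer
        ("sub01", "visit2"),
        ("sub02", "visit1"),
    ],
    hierarchy=["subject", "visit"],
    axes=Clinical,
)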
- define_frameset(id: str, axes: ty.Optional[ty.Type[Axes]] = None, hierarchy: ty.Optional[ty.List[ty.Union[str, Axes]]] = None, id_patterns: ty.Optional[ty.Dict[str, str]] = None, **kwargs: ty.Any) FrameSet
Creates a FrameTree dataset definition for existing data in the data store.
- Parameters:
id (str) -- The ID (or file-system path) of the project (or directory) within the store
axes (Axes) -- The data axes of the frametree
hierarchy (ty.List[str]) -- The hierarchy of the frametree
id_patterns (dict[str, str], optional) -- Patterns used to infer row IDs not explicitly within the hierarchy of the data tree, e.g. groups and visits in an XNAT project with a subject>session hierarchy
**kwargs -- Keyword args passed on to the FrameSet init method
- Returns:
the newly defined dataset
- Return type:
FrameSet
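For example, to define a frameset over an existing XNAT project whose session labels encode the visit (the id_patterns expression here is purely illustrative; see Store.infer_ids() for the actual syntax):

from frametree.common import Clinical

frameset = store.define_frameset(
    "MY_PROJECT",
    axes=Clinical,
    hierarchy=["subject", "session"],
    id_patterns={"visit": r"session:id:.*_(?P<visit>\w+)"},  # hypothetical pattern
)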
- import_dataset(id: str, dataset: FrameSet, column_names: ty.Optional[ty.List[ty.Union[str, ty.Tuple[str, type]]]] = None, hierarchy: ty.Optional[ty.List[str]] = None, id_patterns: ty.Optional[ty.Dict[str, str]] = None, use_original_paths: bool = False, **kwargs: ty.Any) None
Import a dataset from another store, transferring metadata and columns defined on the original dataset
- Parameters:
id (str) -- the ID of the dataset within this store
dataset (FrameSet) -- the dataset to import
column_names (list[str or tuple[str, type]], optional) -- list of columns to be included in the imported dataset. Items of the list are either the name of a column to import, or a tuple of the name and the datatype to import it as. If the datatype isn't provided and the store has a DEFAULT_DATATYPE attribute, that is used; otherwise the original datatype is used. By default all columns are imported
hierarchy (list[str], optional) -- the hierarchy of the imported dataset; by default either the default hierarchy of the target store (if applicable) or the hierarchy of the original dataset
id_patterns (dict[str, str]) -- Patterns for inferring IDs of rows not explicitly present in the hierarchy of the data tree. See Store.infer_ids() for syntax
use_original_paths (bool, optional) -- use the original paths in the source store instead of renaming the imported entries to match their column names
**kwargs -- keyword arguments passed through to the create_data_tree method
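A sketch of importing selected columns of an existing frameset into this store (the names and datatype are hypothetical):

from fileformats.medimage import NiftiGz  # assumed datatype class

store.import_dataset(
    "imported-dataset",
    dataset=frameset,
    column_names=["T1w", ("bold", NiftiGz)],  # convert 'bold' to NIfTI-gz on import
)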
- classmethod load(name: str, config_path: Path | None = None, **kwargs: Any) Store
Loads a Store that has been saved in the configuration file. If no entry is saved under that name, it searches for Store sub-classes with aliases matching name and checks whether they can be initialised without any parameters.
- Parameters:
name (str) -- Name that the store was saved under
config_path (Path, optional) -- path to the config file, defaults to ~/.frametree/stores.yaml
**kwargs -- keyword args passed to the store, overriding values stored in the entry
- Returns:
The data store retrieved from the stores.yaml file
- Return type:
Store
- Raises:
FrameTreeNameError -- If the name is not found in the saved stores
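For example (assuming a store was previously saved under the nickname "my-xnat"):

from frametree.core.store import Store

xnat_store = Store.load("my-xnat")    # entry in ~/.frametree/stores.yaml
fs_store = Store.load("file_system")  # no saved entry needed; matches the FileSystem alias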
- class frametree.core.frameset.FrameSet(id, store: Store = NOTHING, axes: Type[Axes] = NOTHING, id_patterns=NOTHING, hierarchy: List[str | Axes] = NOTHING, metadata=NOTHING, include=NOTHING, exclude=NOTHING, name: str = '', columns=NOTHING, pipelines=NOTHING)
A representation of a "dataset", the complete collection of data (file-sets and fields) to be used in an analysis.
- Parameters:
id (str) -- The dataset id/path that uniquely identifies the dataset within the store it is stored in (e.g. FS directory path or project ID)
store (Repository) -- The store the dataset is stored in. Can be the local file system by providing a MockRemote repo.
axes (Axes) -- The space of the dataset. See https://frametree.readthedocs.io/en/latest/data_model.html#spaces for a description
id_patterns (dict[str, str]) -- Patterns for inferring IDs of rows not explicitly present in the hierarchy of the data tree. See Store.infer_ids() for syntax
hierarchy (list[str or Axes]) -- The categorical variables that are explicitly present in the data tree. For example, if a MockRemote dataset (i.e. directory) has a two-layer hierarchy of sub-directories, the first layer of sub-directories labelled by unique subject ID and the second layer labelled by study time-point, then the hierarchy would be
['subject', 'visit']
Alternatively, in some stores (e.g. XNAT) the second layer in the hierarchy may be named with a session ID that is unique across the project, in which case the layer dimensions would instead be
['subject', 'session']
In such cases, if there are multiple visits, the visit ID of the session will need to be extracted using the id_patterns argument.
Alternatively, the hierarchy could be organised such that the tree first splits on longitudinal time-points, then on member ID, with the final layer containing the sessions of matched members labelled by their group (e.g. test & control):
['visit', 'member', 'group']
Note that the combination of layers in the hierarchy must span the space defined in the Axes enum, i.e. the "bitwise or" of the layer values of the hierarchy must be 1 across all bits (e.g. 'session': 0b111).
metadata (dict or Metadata) -- Generic metadata associated with the dataset, e.g. authors, funding sources, etc.
include (list[tuple[Axes, str or ty.List[str]]]) -- The IDs to be included in the dataset per row_frequency, e.g. can be used to limit the subject IDs in a project to the subset that passed QC. If a row_frequency is omitted or its value is None, then all available IDs will be used
exclude (list[tuple[Axes, str or ty.List[str]]]) -- The IDs to be excluded from the dataset per row_frequency, e.g. can be used to exclude specific subjects that failed QC. If a row_frequency is omitted or its value is None, then all available IDs will be used
name (str) -- The name under which the dataset is saved in the store
columns (list[tuple[str, SourceColumn or SinkColumn]]) -- The sources and sinks to be initially added to the dataset (columns are explicitly added when workflows are applied to the dataset).
pipelines (dict[str, pydra.Workflow]) -- Pipelines that have been applied to the dataset to generate its sinks
access_args (ty.Dict[str, Any]) -- Repository-specific args used to control the way the dataset is accessed
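A sketch of restricting a frameset to a QC-passed subset of subjects (the include argument is passed through to the FrameSet init; IDs are hypothetical):

from frametree.common import Clinical, FileSystem

frameset = FileSystem().define_frameset(
    "/data/my-project",
    axes=Clinical,
    hierarchy=["subject", "session"],
    include=[(Clinical.subject, ["sub01", "sub02", "sub05"])],
)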
- __getitem__(name: str) DataColumn
Return all data items across the dataset for a given source or sink
- Parameters:
name (str) -- Name of the column to return
- Returns:
the column object
- Return type:
DataColumn
- add_sink(name: str, datatype: type, row_frequency: str | None = None, overwrite: bool = False, **kwargs: Any) SinkColumn
Specify a data sink in the dataset, which can then be referenced when connecting workflow outputs.
- Parameters:
name (str) -- The name used to reference the dataset "column" for the sink
datatype (type) -- The file-format (for file-sets) or datatype (for fields) that the sink will be stored in within the dataset
path (str, optional) -- Specify a particular path for the sink within the dataset; defaults to the column name within the dataset derivatives directory of the store
row_frequency (str, optional) -- The row_frequency of the sink within the dataset, by default the leaf frequency of the data tree
overwrite (bool) -- Whether to overwrite an existing sink
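A sketch (the column name and datatype are hypothetical):

from fileformats.medimage import NiftiGz  # assumed datatype class

frameset.add_sink("brain_mask", NiftiGz)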
- add_source(name: str, datatype: type, path: str | None = None, row_frequency: str | None = None, overwrite: bool = False, **kwargs: Any) SourceColumn
Specify a data source in the dataset, which can then be referenced when connecting workflow inputs.
- Parameters:
name (str) -- The name used to reference the dataset "column" for the source
datatype (type) -- The file-format (for file-sets) or datatype (for fields) that the source is stored in within the dataset
path (str, default name) -- The location of the source within the dataset
row_frequency (Axes, default self.leaf_freq) -- The row_frequency of the source within the dataset
overwrite (bool) -- Whether to overwrite existing columns
**kwargs (ty.Dict[str, Any]) -- Additional kwargs to pass to SourceColumn.__init__
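A sketch of adding a source and then retrieving its column via __getitem__ (the names and path are hypothetical):

from fileformats.medimage import NiftiGz  # assumed datatype class

frameset.add_source("T1w", NiftiGz, path="anat/T1w")
t1w_column = frameset["T1w"]  # the SourceColumn across all rows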
- apply(name: str, workflow: Workflow, inputs: ty.List[ty.Union['PipelineField', ty.Tuple[str, str, type], ty.Tuple[str, str]]], outputs: ty.List[ty.Union['PipelineField', ty.Tuple[str, str, type], ty.Tuple[str, str]]], row_frequency: ty.Union[Axes, str, None] = None, overwrite: bool = False, converter_args: ty.Optional[ty.Dict[str, ty.Any]] = None) Pipeline
Connect a Pydra workflow as a pipeline of the dataset
- Parameters:
name (str) -- name of the pipeline
workflow (pydra.Workflow) -- pydra workflow to connect to the dataset as a pipeline
inputs (list[frametree.core.pipeline.Input or tuple[str, str, type] or tuple[str, str]]) -- List of inputs to the pipeline (see frametree.core.pipeline.Pipeline.PipelineInput)
outputs (list[frametree.core.pipeline.Output or tuple[str, str, type] or tuple[str, str]]) -- List of outputs of the pipeline (see frametree.core.pipeline.Pipeline.PipelineOutput)
row_frequency (str, optional) -- the frequency of the data rows the pipeline will be executed over, i.e. whether it is run once per session, per subject or per whole dataset; by default the highest row frequency (e.g. per session for Clinical)
overwrite (bool, optional) -- overwrite connections to previously connected sinks, by default False
converter_args (dict[str, dict]) -- keyword arguments passed on to the converter to control how the conversion is performed.
- Returns:
the pipeline added to the dataset
- Return type:
Pipeline
- Raises:
FrameTreeUsageError -- if overwrite is False and the pipeline clashes with one that has already been applied
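A sketch of connecting a workflow (the workflow object and names are hypothetical, and the (column, workflow-field, datatype) tuple ordering shown is an assumption):

from fileformats.medimage import NiftiGz  # assumed datatype class

frameset.apply(
    "brain_extraction",
    workflow=bet_workflow,  # a pydra.Workflow built elsewhere
    inputs=[("T1w", "in_file", NiftiGz)],
    outputs=[("brain_mask", "out_file", NiftiGz)],
)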
- derive(*sink_names: str, ids: Iterable[str] | None = None, cache_dir: Path = None, **kwargs: Any) None
Generate derivatives for the given sink columns from the pipelines connected to them
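For example (a sketch; the sink name and row IDs are hypothetical):

frameset.derive("brain_mask", ids=["sub01", "sub02"])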
- install_license(name: str, source_file: Plain) None
Store a project-specific license in the dataset
- Parameters:
name (str) -- name of the license to install
source_file (PlainText) -- the license file to install
- classmethod load(id: str, store: Store | None = None, name: str | None = '', default_if_missing: bool = False, **kwargs: Any) FrameSet
Loads a dataset from a store/ID/name string, as used in the CLI
- Parameters:
id (str) -- either the ID of a dataset, if the store keyword arg is provided, or a "dataset ID string" in the format <store-nickname>//<dataset-id>[@<dataset-name>]
store (Store, optional) -- the store to load the dataset from. If not provided, the given ID is interpreted as an ID string
name (str, optional) -- the name of the dataset within the project/directory (e.g. 'test', 'training'). Used to specify a subset of data rows to work with within a greater project
default_if_missing (bool, optional) -- If True, then a new dataset is created if the dataset is not found in the store
**kwargs -- keyword arguments passed to the data store load
- Returns:
the loaded dataset
- Return type:
FrameSet
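For example, using a dataset ID string with a saved store nickname (the names are hypothetical):

from frametree.core.frameset import FrameSet

frameset = FrameSet.load("my-xnat//MY_PROJECT@training")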
- row(frequency: Axes | str | None = None, id: str | Tuple[str, ...] = NOTHING, **id_kwargs: Any) DataRow
Returns the row associated with the given frequency and IDs
- Parameters:
frequency (Axes or str, optional) -- the frequency of the row to return
id (str or tuple[str, ...], optional) -- the ID (or tuple of IDs) of the row
**id_kwargs -- the IDs can alternatively be given as keyword args mapping axis names to IDs
- Returns:
The selected data row
- Return type:
DataRow
- Raises:
FrameTreeUsageError -- Raised when attempting to use IDs with the frequency associated with the root row
FrameTreeNameError -- If there is no row corresponding to the given IDs
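A sketch of selecting a single session row (keyword IDs per the **id_kwargs form above; the IDs are hypothetical):

row = frameset.row(frequency="session", subject="sub01", visit="visit1")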
- class frametree.core.axes.Axes(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)
Base class for all "data axes" enums. Axes specify the categorical variables along which grids of data points are laid out.
For example, in imaging studies, scanning sessions are typically organised by analysis group (e.g. test & control), membership within the group (i.e. control-matched subjects) and time-point (for longitudinal studies). We can visualise the rows arranged in a 3-D frameset along the group, member, and visit dimensions. Note that datasets that only contain one group or time-point can still be represented in the same space, just with depth=1 along those dimensions.
All dimensions should be included as members of an Axes subclass enum with orthogonal binary vector values, e.g.
member = 0b001
group = 0b010
visit = 0b100
In this space, an imaging session row is uniquely defined by its member, group and visit IDs. The most commonly present dimension should be given the least significant bit (e.g. imaging datasets will not always have different groups or time-points, but will always have different members, equivalent to subjects when there is only one group).
In addition to the data items stored in the data rows for each session, some items only vary along a particular dimension of the frameset. The "row_frequency" of these rows can be specified using the "basis" members (i.e. member, group, visit), in contrast to the session row_frequency, which is the combination of all three:
session = 0b111
Additionally, some data is stored in aggregated rows that span a plane of the frameset. These frequencies should also be added to the enum (all combinations of the basis frequencies must be included) and given intuitive names if possible, e.g.
subject = 0b011 (uniquely identified subject within the dataset)
groupedvisit = 0b110 (separate group+visit combinations)
matchedvisit = 0b101 (matched members and time-points aggregated across groups)
Finally, for items that are singular across the whole dataset there should also be a dataset-wide member with value=0:
dataset = 0b000
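A sketch of a custom two-dimensional axes enum that follows these conventions:

from frametree.core.axes import Axes

class GroupMember(Axes):
    dataset = 0b00  # singular across the whole dataset
    member = 0b01   # most common dimension gets the least significant bit
    group = 0b10
    subject = 0b11  # combination of group and member (the leaf rows)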
- is_parent(child, if_match=False)
Checks whether the current frequency is a "parent" of the other data frequency, i.e. all the basis frequencies of self appear in the "child".
- span()
Returns the basis dimensions in the data tree that the given enum-member projects into.
For example in Clinical data trees, the following frequencies can be decomposed into the following basis dims:
dataset -> []
group -> [group]
member -> [member]
visit -> [visit]
subject -> [group, member]
groupedvisit -> [visit, group]
matchedvisit -> [visit, member]
session -> [visit, group, member]
- class frametree.core.row.DataRow(*, ids: ty.Dict[Axes, str], frameset: FrameSet, frequency: str, tree_path: ty.List[str] = None, uri: ty.Optional[str] = None, metadata: ty.Optional[dict] = None)
A "row" in a dataset "frame" where file-sets and fields can be placed, e.g. a session or subject.
- Parameters:
ids (Dict[Axes, str]) -- The IDs for the frequency of the row and all "parent" frequencies within the tree
frameset (FrameSet) -- A reference to the root of the data tree
frequency (str) -- The frequency of the row
tree_path (list[str], optional) -- the path to the row within the data tree. None if the row doesn't sit within the original tree (e.g. visits within a subject>session hierarchy)
uri (str, optional) -- a URI for the row, which can be set and used by the data store implementation if appropriate, by default None
metadata (dict, optional) -- generic metadata associated with the row