Public API
Available Backends
- class frametree.common.FileSystem(name: str = 'file_system')
A Repository class for data stored hierarchically within sub-directories of a file-system directory. The depth of the tree and which layers of the data tree the sub-directories correspond to are defined by the hierarchy argument.
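For example, a frameset can be defined over an existing directory tree (a minimal sketch; the path and hierarchy below are hypothetical):

from frametree.common import Clinical, FileSystem

# First-level sub-directories are subjects, second-level are sessions
store = FileSystem()
frameset = store.define_frameset(
    "/data/my-project",
    axes=Clinical,
    hierarchy=["subject", "session"],
)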
- class frametree.bids.Bids(json_edits: list = NOTHING, name: str = 'bids')
Repository for working with data stored on the file-system in BIDS format
- Parameters:
json_edits (list[tuple[str, str]], optional) -- Specifications for editing JSON files as they are written to the store, enabling manual modification of fields to correct metadata. List of tuples of the form: FILE_PATH - a path expression to select the files; EDIT_STR - a jq filter used to modify the JSON document.
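A sketch of the json_edits argument (the path expression and jq filter below are illustrative only):

from frametree.bids import Bids

# Set the RepetitionTime field in the JSON side-cars of matching BOLD images
store = Bids(
    json_edits=[
        ("func/.*_bold", ".RepetitionTime = 2.0"),
    ],
)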
- class frametree.xnat.Xnat(server: str, cache_dir: str | Path, name: str | None = None, user: str = None, password: str = None, race_condition_delay: int = 5, verify_ssl: bool = True)
Access class for XNAT data repositories
- Parameters:
server (str (URI)) -- URI of the XNAT server to connect to
cache_dir (str (name_path)) -- Path to a local directory in which to cache remote data
user (str) -- Username with which to connect to XNAT
password (str) -- Password with which to connect to the XNAT repository
race_condition_delay (int) -- The number of seconds to wait before checking whether a download of the same fileset started by another process has completed
verify_ssl (bool) -- Whether to verify the SSL certificate of the server
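A minimal connection sketch (the server URL and credentials are placeholders; in practice a store is typically saved once and retrieved with Store.load()):

from frametree.xnat import Xnat

store = Xnat(
    server="https://xnat.example.org",
    cache_dir="/tmp/xnat-cache",
    user="myuser",
    password="mypassword",
)
frameset = store.define_frameset("MY_PROJECT")  # ID of the XNAT project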
- class frametree.xnat.XnatViaCS(name: ty.Optional[str] = None, race_condition_delay: int = 5, verify_ssl: bool = True, row_frequency: Axes = Clinical.session, row_id: str = None, input_mount=PosixPath('/input'), output_mount=PosixPath('/output'), server: str = NOTHING, user: str = NOTHING, password: str = NOTHING, cache_dir=PosixPath('/cache'))
Access class for XNAT repositories via the XNAT container service plugin. The container service exposes the underlying file system, allowing imaging data to be accessed directly (for performance) and pipeline outputs to be written back to the archive via the output mount.
- Parameters:
server (str (URI)) -- URI of the XNAT server to connect to
cache_dir (str (name_path)) -- Path to a local directory in which to cache remote data
user (str) -- Username with which to connect to XNAT
password (str) -- Password with which to connect to the XNAT repository
check_md5 (bool) -- Whether to check the MD5 digest of cached files before using them. This checks for updates on the server since the file was cached
race_condition_delay (int) -- The number of seconds to wait before checking whether a download of the same fileset started by another process has completed
Available Axes
- class frametree.common.Samples(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)
The most basic data space, with only one dimension
- class frametree.common.Clinical(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)
An enum used to specify the hierarchy of data trees and the "frequencies" of items within datasets typical of medical-imaging research, i.e. subjects split into groups and scanned at different visits (in longitudinal studies).
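A sketch of how Clinical frequencies relate to one another (the bit-vector conventions are described under Axes below):

from frametree.common import Clinical

leaf = Clinical.session  # combination of all three basis dimensions
basis = leaf.span()      # decomposes a frequency into its basis dimensions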
Markers
- class frametree.core.salience.ColumnSalience(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)
An enum that holds the salience level options that can be used when specifying data columns. Salience is used to indicate whether it is best to store the data in the data store, or whether it can simply be kept in the local file-system and discarded after it has been used. This choice is ultimately made by the user, by defining a salience threshold for a store.
The salience is also used when providing information on which sinks are available, to avoid cluttering help menus
- primary = (100, 'Primary input data, typically reconstructed by the instrument that collects them')
- raw = (90, "Raw data from the scanner that haven't been reconstructed and are only typically used in advanced analyses")
- publication = (80, 'Results that would typically be used as main outputs in publications')
- supplementary = (60, 'Derivatives that would typically only be provided in supplementary material')
- qa = (40, 'Derivatives that would typically be only kept for quality assurance of analysis workflows')
- debug = (20, 'Derivatives that would typically only need to be checked when debugging analysis workflows')
- temp = (0, 'Data only temporarily stored to pass between pipelines, e.g. that operate on different row frequencies')
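Each member's value is a (level, description) tuple, so the numeric level and help text can be read back directly, e.g.:

from frametree.core.salience import ColumnSalience

level, description = ColumnSalience.publication.value
assert level == 80  # between 'supplementary' (60) and 'raw' (90)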
- class frametree.core.salience.ParameterSalience(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)
An enum that holds the salience level options that can be used when specifying class parameters. Salience is used to indicate whether the parameter should show up by default when listing the available parameters of an Analysis class in a menu.
- debug = (0, 'typically only needed to be altered for debugging')
- recommended = (20, 'recommended to keep defaults')
- dependent = (40, 'best value can be dependent on the context of the analysis, but the default should work for most cases')
- check = (60, 'default value should be checked for validity for particular use case')
- arbitrary = (80, 'a default is provided, but it is not clear which value is best')
- required = (100, 'No sensible default value, should be provided')
- class frametree.core.salience.CheckSalience(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)
An enum that holds the potential values for signifying how critical a check is to run.
- debug = (0, 'typically only used to debug alterations to the pipeline')
- potential = (20, 'check can be run but not typically necessary')
- prudent = (40, 'it is prudent to run the check the results but you can skip if required')
- recommended = (60, 'recommended to run the check as pipeline fails 1~2% of the time')
- strongly_recommended = (80, 'strongly recommended to run the check as pipeline fails 5~10% of the time')
- required = (100, 'Pipeline will often fail, checking the results is required')
- class frametree.core.salience.CheckStatus(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)
An enum that holds the potential values that signify how likely it is that a pipeline has failed
- failed = (0, 'the pipeline has failed')
- probable_fail = (25, 'probable that the pipeline has failed')
- unclear = (50, 'cannot ascertain whether the pipeline has failed or not')
- probable_pass = (75, 'probable that the pipeline has run successfully')
- passed = (100, 'the pipeline has run successfully')
- class frametree.core.quality.DataQuality(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)
The quality of a data item. Can be manually specified or set by automatic quality control methods
- usable = 100
- noisy = 75
- questionable = 50
- artefactual = 25
- unusable = 0
Core
- class frametree.core.store.Store
Abstract base class for all data store adapters. A data store can be an external data management system, e.g. XNAT, OpenNeuro or Datalad, or simply a defined way of laying out data within a file-system directory, e.g. BIDS.
For a data management system/data structure to be compatible with FrameTree, it must meet a number of criteria. In FrameTree, a store is assumed to:
- contain multiple projects/datasets addressable by unique IDs
- organise data within each project/dataset in trees
- store arbitrary numbers of data "items" (e.g. file-sets and fields) within each tree node (including non-leaf nodes), addressable by unique "paths" relative to the node
- allow derivative data to be stored within separate namespaces for different analyses on the same data
- create_dataset(id: str, leaves: ty.Iterable[ty.Tuple[str, ...]], hierarchy: ty.List[str], axes: type, name: ty.Optional[str] = None, id_patterns: ty.Optional[ty.Dict[str, str]] = None, **kwargs: ty.Any) FrameSet
Creates a new dataset with new rows to store data in
- Parameters:
id (str) -- ID of the dataset
leaves (list[tuple[str, ...]]) -- the IDs of the leaf rows, each given as a tuple with one ID per level of the tree
name (str, optional) -- name of the dataset; if provided, the dataset definition will be saved. To save the dataset with the default name, pass an empty string.
hierarchy (list[str], optional) -- hierarchy of the dataset tree
axes (type, optional) -- the axes of the dataset
id_patterns (dict[str, str]) -- Patterns for inferring IDs of rows not explicitly present in the hierarchy of the data tree. See Store.infer_ids() for syntax
- Returns:
the newly created dataset
- Return type:
FrameSet
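A sketch of creating a small test dataset (assuming store is a Store instance such as FileSystem; IDs are hypothetical):

from frametree.common import Clinical

frameset = store.create_dataset(
    id="my-dataset",
    leaves=[
        ("sub01", "visit1"),  # one tuple per leaf row, one ID per layer
        ("sub01", "visit2"),
        ("sub02", "visit1"),
    ],
    hierarchy=["subject", "visit"],
    axes=Clinical,
)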
- define_frameset(id: str, axes: ty.Optional[ty.Type[Axes]] = None, hierarchy: ty.Optional[ty.List[ty.Union[str, Axes]]] = None, id_patterns: ty.Optional[ty.Dict[str, str]] = None, **kwargs: ty.Any) FrameSet
Creates a FrameTree dataset definition for existing data in the data store.
- Parameters:
id (str) -- The ID (or file-system path) of the project (or directory) within the store
axes (Axes) -- The data axes of the frametree
hierarchy (ty.List[str]) -- The hierarchy of the frametree
id_patterns (dict[str, str], optional) -- Patterns used to infer row IDs not explicitly within the hierarchy of the data tree, e.g. groups and visits in an XNAT project with a subject>session hierarchy
**kwargs -- Keyword args passed on to the FrameSet init method
- Returns:
the newly defined dataset
- Return type:
FrameSet
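For example, to define a frameset over an existing XNAT project whose session labels encode the visit (the id_patterns expression here is purely illustrative; see Store.infer_ids() for the actual syntax):

from frametree.common import Clinical

frameset = store.define_frameset(
    "MY_PROJECT",
    axes=Clinical,
    hierarchy=["subject", "session"],
    id_patterns={"visit": r"session:id:.*_(?P<visit>\w+)"},  # hypothetical pattern
)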
- import_dataset(id: str, dataset: FrameSet, column_names: ty.Optional[ty.List[ty.Union[str, ty.Tuple[str, type]]]] = None, hierarchy: ty.Optional[ty.List[str]] = None, id_patterns: ty.Optional[ty.Dict[str, str]] = None, use_original_paths: bool = False, **kwargs: ty.Any) None
Import a dataset from another store, transferring metadata and columns defined on the original dataset
- Parameters:
id (str) -- the ID of the dataset within this store
dataset (FrameSet) -- the dataset to import
column_names (list[str or tuple[str, type]], optional) -- list of columns to be included in the imported dataset. Items of the list are either the name of a column to import, or a tuple of the name and the datatype to import it as. If the datatype isn't provided and the store has a DEFAULT_DATATYPE attribute, that is used; otherwise the original datatype is used. By default all columns are imported
hierarchy (list[str], optional) -- the hierarchy of the imported dataset; by default either the default hierarchy of the target store (if applicable) or the hierarchy of the original dataset
id_patterns (dict[str, str]) -- Patterns for inferring IDs of rows not explicitly present in the hierarchy of the data tree. See Store.infer_ids() for syntax
use_original_paths (bool, optional) -- use the original paths in the source store instead of renaming the imported entries to match their column names
**kwargs -- keyword arguments passed through to the create_data_tree method
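A sketch of importing selected columns of an existing frameset into this store (the names and datatype are hypothetical):

from fileformats.medimage import NiftiGz  # assumed datatype class

store.import_dataset(
    "imported-dataset",
    dataset=frameset,
    column_names=["T1w", ("bold", NiftiGz)],  # convert 'bold' to NIfTI-gz on import
)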
- classmethod load(name: str, config_path: Path | None = None, **kwargs: Any) Store
Loads a Store that has been saved in the configuration file. If no entry is saved under that name, it searches for Store sub-classes with aliases matching name and checks whether they can be initialised without any parameters.
- Parameters:
name (str) -- Name that the store was saved under
config_path (Path, optional) -- path to the config file, defaults to ~/.frametree/stores.yaml
**kwargs -- keyword args passed to the store, overriding values stored in the entry
- Returns:
The data store retrieved from the stores.yaml file
- Return type:
Store
- Raises:
FrameTreeNameError -- If the name is not found in the saved stores
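For example (assuming a store was previously saved under the nickname "my-xnat"):

from frametree.core.store import Store

xnat_store = Store.load("my-xnat")    # entry in ~/.frametree/stores.yaml
fs_store = Store.load("file_system")  # no saved entry needed; matches the FileSystem alias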
- class frametree.core.frameset.FrameSet(id, store: Store = NOTHING, axes: Type[Axes] = NOTHING, id_patterns=NOTHING, hierarchy: List[str | Axes] = NOTHING, metadata=NOTHING, include=NOTHING, exclude=NOTHING, name: str = '', columns=NOTHING, pipelines=NOTHING)
A representation of a "dataset", the complete collection of data (file-sets and fields) to be used in an analysis.
- Parameters:
id (str) -- The dataset id/path that uniquely identifies the dataset within the store it is stored in (e.g. FS directory path or project ID)
store (Repository) -- The store the dataset is stored in. Can be the local file system by providing a MockRemote repo.
axes (Axes) -- The space of the dataset. See https://frametree.readthedocs.io/en/latest/data_model.html#spaces for a description
id_patterns (dict[str, str]) -- Patterns for inferring IDs of rows not explicitly present in the hierarchy of the data tree. See Store.infer_ids() for syntax
hierarchy (list[str or Axes]) -- The categorical variables that are explicitly present in the data tree. For example, if a MockRemote dataset (i.e. directory) has a two-layer hierarchy of sub-directories, the first layer of sub-directories labelled by unique subject ID and the second layer labelled by study time-point, then the hierarchy would be
['subject', 'visit']
Alternatively, in some stores (e.g. XNAT) the second layer in the hierarchy may be named with a session ID that is unique across the project, in which case the layer dimensions would instead be
['subject', 'session']
In such cases, if there are multiple visits, the visit ID of the session will need to be extracted using the id_patterns argument.
Alternatively, the hierarchy could be organised such that the tree first splits on longitudinal time-points, then on member ID, with the final layer containing the sessions of matched members labelled by their group (e.g. test & control):
['visit', 'member', 'group']
Note that the combination of layers in the hierarchy must span the space defined in the Axes enum, i.e. the "bitwise or" of the layer values of the hierarchy must be 1 across all bits (e.g. 'session': 0b111).
metadata (dict or Metadata) -- Generic metadata associated with the dataset, e.g. authors, funding sources, etc.
include (list[tuple[Axes, str or ty.List[str]]]) -- The IDs to be included in the dataset per row_frequency, e.g. can be used to limit the subject IDs in a project to the subset that passed QC. If a row_frequency is omitted or its value is None, then all available IDs will be used
exclude (list[tuple[Axes, str or ty.List[str]]]) -- The IDs to be excluded from the dataset per row_frequency, e.g. can be used to exclude specific subjects that failed QC. If a row_frequency is omitted or its value is None, then all available IDs will be used
name (str) -- The name under which the dataset is saved in the store
columns (list[tuple[str, SourceColumn or SinkColumn]]) -- The sources and sinks to be initially added to the dataset (columns are explicitly added when workflows are applied to the dataset).
pipelines (dict[str, pydra.Workflow]) -- Pipelines that have been applied to the dataset to generate its sinks
access_args (ty.Dict[str, Any]) -- Repository-specific args used to control the way the dataset is accessed
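A sketch of restricting a frameset to a QC-passed subset of subjects (the include argument is passed through to the FrameSet init; IDs are hypothetical):

from frametree.common import Clinical, FileSystem

frameset = FileSystem().define_frameset(
    "/data/my-project",
    axes=Clinical,
    hierarchy=["subject", "session"],
    include=[(Clinical.subject, ["sub01", "sub02", "sub05"])],
)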
- __getitem__(name: str) DataColumn
Return all data items across the dataset for a given source or sink
- Parameters:
name (str) -- Name of the column to return
- Returns:
the column object
- Return type:
DataColumn
- add_sink(name: str, datatype: type, row_frequency: str | None = None, overwrite: bool = False, **kwargs: Any) SinkColumn
Specify a data sink in the dataset, which can then be referenced when connecting workflow outputs.
- Parameters:
name (str) -- The name used to reference the dataset "column" for the sink
datatype (type) -- The file-format (for file-sets) or datatype (for fields) that the sink will be stored in within the dataset
path (str, optional) -- Specify a particular path for the sink within the dataset; defaults to the column name within the dataset derivatives directory of the store
row_frequency (str, optional) -- The row_frequency of the sink within the dataset, by default the leaf frequency of the data tree
overwrite (bool) -- Whether to overwrite an existing sink
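A sketch (the column name and datatype are hypothetical):

from fileformats.medimage import NiftiGz  # assumed datatype class

frameset.add_sink("brain_mask", NiftiGz)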
- add_source(name: str, datatype: type, path: str | None = None, row_frequency: str | None = None, overwrite: bool = False, **kwargs: Any) SourceColumn
Specify a data source in the dataset, which can then be referenced when connecting workflow inputs.
- Parameters:
name (str) -- The name used to reference the dataset "column" for the source
datatype (type) -- The file-format (for file-sets) or datatype (for fields) that the source is stored in within the dataset
path (str, default name) -- The location of the source within the dataset
row_frequency (Axes, default self.leaf_freq) -- The row_frequency of the source within the dataset
overwrite (bool) -- Whether to overwrite existing columns
**kwargs (ty.Dict[str, Any]) -- Additional kwargs to pass to SourceColumn.__init__
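A sketch of adding a source and then retrieving its column via __getitem__ (the names and path are hypothetical):

from fileformats.medimage import NiftiGz  # assumed datatype class

frameset.add_source("T1w", NiftiGz, path="anat/T1w")
t1w_column = frameset["T1w"]  # the SourceColumn across all rows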
- apply(name: str, workflow: Workflow, inputs: ty.List[ty.Union['PipelineField', ty.Tuple[str, str, type], ty.Tuple[str, str]]], outputs: ty.List[ty.Union['PipelineField', ty.Tuple[str, str, type], ty.Tuple[str, str]]], row_frequency: ty.Union[Axes, str, None] = None, overwrite: bool = False, converter_args: ty.Optional[ty.Dict[str, ty.Any]] = None) Pipeline
Connect a Pydra workflow as a pipeline of the dataset
- Parameters:
name (str) -- name of the pipeline
workflow (pydra.Workflow) -- pydra workflow to connect to the dataset as a pipeline
inputs (list[frametree.core.pipeline.Input or tuple[str, str, type] or tuple[str, str]]) -- List of inputs to the pipeline (see frametree.core.pipeline.Pipeline.PipelineInput)
outputs (list[frametree.core.pipeline.Output or tuple[str, str, type] or tuple[str, str]]) -- List of outputs of the pipeline (see frametree.core.pipeline.Pipeline.PipelineOutput)
row_frequency (str, optional) -- the frequency of the data rows the pipeline will be executed over, i.e. whether it is run once per session, per subject or per whole dataset; by default the highest row frequency (e.g. per session for Clinical)
overwrite (bool, optional) -- overwrite connections to previously connected sinks, by default False
converter_args (dict[str, dict]) -- keyword arguments passed on to the converter to control how the conversion is performed.
- Returns:
the pipeline added to the dataset
- Return type:
Pipeline
- Raises:
FrameTreeUsageError -- if overwrite is False and the pipeline clashes with one that has already been applied
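A sketch of connecting a workflow (the workflow object and names are hypothetical, and the (column, workflow-field, datatype) tuple ordering shown is an assumption):

from fileformats.medimage import NiftiGz  # assumed datatype class

frameset.apply(
    "brain_extraction",
    workflow=bet_workflow,  # a pydra.Workflow built elsewhere
    inputs=[("T1w", "in_file", NiftiGz)],
    outputs=[("brain_mask", "out_file", NiftiGz)],
)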
- derive(*sink_names: str, ids: Iterable[str] | None = None, cache_dir: Path = None, **kwargs: Any) None
Generate derivatives for the given sink columns from the pipelines connected to them
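For example (a sketch; the sink name and row IDs are hypothetical):

frameset.derive("brain_mask", ids=["sub01", "sub02"])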
- install_license(name: str, source_file: Plain) None
Store a project-specific license in the dataset
- Parameters:
name (str) -- name of the license to install
source_file (PlainText) -- the license file to install
- classmethod load(id: str, store: Store | None = None, name: str | None = '', default_if_missing: bool = False, **kwargs: Any) FrameSet
Loads a dataset from a store/ID/name string, as used in the CLI
- Parameters:
id (str) -- either the ID of a dataset, if the store keyword arg is provided, or a "dataset ID string" in the format <store-nickname>//<dataset-id>[@<dataset-name>]
store (Store, optional) -- the store to load the dataset from. If not provided, the given ID is interpreted as an ID string
name (str, optional) -- the name of the dataset within the project/directory (e.g. 'test', 'training'). Used to specify a subset of data rows to work with within a greater project
default_if_missing (bool, optional) -- If True, then a new dataset is created if the dataset is not found in the store
**kwargs -- keyword arguments passed to the data store load
- Returns:
the loaded dataset
- Return type:
FrameSet
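For example, using a dataset ID string with a saved store nickname (the names are hypothetical):

from frametree.core.frameset import FrameSet

frameset = FrameSet.load("my-xnat//MY_PROJECT@training")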
- row(frequency: Axes | str | None = None, id: str | Tuple[str, ...] = NOTHING, **id_kwargs: Any) DataRow
Returns the row associated with the given frequency and IDs
- Parameters:
frequency (Axes or str, optional) -- the frequency of the row to return
id (str or tuple[str, ...], optional) -- the ID (or tuple of IDs) of the row
**id_kwargs -- the IDs can alternatively be given as keyword args mapping axis names to IDs
- Returns:
The selected data row
- Return type:
DataRow
- Raises:
FrameTreeUsageError -- Raised when attempting to use IDs with the frequency associated with the root row
FrameTreeNameError -- If there is no row corresponding to the given IDs
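A sketch of selecting a single session row (keyword IDs per the **id_kwargs form above; the IDs are hypothetical):

row = frameset.row(frequency="session", subject="sub01", visit="visit1")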
- class frametree.core.axes.Axes(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)
Base class for all "data axes" enums. Axes specify the categorical variables along which grids of data points are laid out.
For example, in imaging studies, scanning sessions are typically organised by analysis group (e.g. test & control), membership within the group (i.e. control-matched subjects) and time-point (for longitudinal studies). We can visualise the rows arranged in a 3-D frameset along the group, member, and visit dimensions. Note that datasets that only contain one group or time-point can still be represented in the same space, just with depth=1 along those dimensions.
All dimensions should be included as members of an Axes subclass enum with orthogonal binary vector values, e.g.
member = 0b001
group = 0b010
visit = 0b100
In this space, an imaging session row is uniquely defined by its member, group and visit IDs. The most commonly present dimension should be given the least significant bit (e.g. imaging datasets will not always have different groups or time-points, but will always have different members, equivalent to subjects when there is only one group).
In addition to the data items stored in the data rows for each session, some items only vary along a particular dimension of the frameset. The "row_frequency" of these rows can be specified using the "basis" members (i.e. member, group, visit), in contrast to the session row_frequency, which is the combination of all three:
session = 0b111
Additionally, some data is stored in aggregated rows that span a plane of the frameset. These frequencies should also be added to the enum (all combinations of the basis frequencies must be included) and given intuitive names if possible, e.g.
subject = 0b011 (uniquely identified subject within the dataset)
groupedvisit = 0b110 (separate group+visit combinations)
matchedvisit = 0b101 (matched members and time-points aggregated across groups)
Finally, for items that are singular across the whole dataset there should also be a dataset-wide member with value=0:
dataset = 0b000
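A sketch of a custom two-dimensional axes enum that follows these conventions:

from frametree.core.axes import Axes

class GroupMember(Axes):
    dataset = 0b00  # singular across the whole dataset
    member = 0b01   # most common dimension gets the least significant bit
    group = 0b10
    subject = 0b11  # combination of group and member (the leaf rows)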
- is_parent(child, if_match=False)
Checks whether the current frequency is a "parent" of the other data frequency, i.e. all the basis frequencies of self appear in the "child".
- span()
Returns the basis dimensions in the data tree that the given enum-member projects into.
For example in Clinical data trees, the following frequencies can be decomposed into the following basis dims:
dataset -> []
group -> [group]
member -> [member]
visit -> [visit]
subject -> [group, member]
groupedvisit -> [visit, group]
matchedvisit -> [visit, member]
session -> [visit, group, member]
- class frametree.core.row.DataRow(*, ids: ty.Dict[Axes, str], frameset: FrameSet, frequency: str, tree_path: ty.List[str] = None, uri: ty.Optional[str] = None, metadata: ty.Optional[dict] = None)
A "row" in a dataset "frame" where file-sets and fields can be placed, e.g. a session or subject.
- Parameters:
ids (Dict[Axes, str]) -- The IDs for the frequency of the row and all "parent" frequencies within the tree
frameset (FrameSet) -- A reference to the root of the data tree
frequency (str) -- The frequency of the row
tree_path (list[str], optional) -- the path to the row within the data tree. None if the row doesn't sit within the original tree (e.g. visits within a subject>session hierarchy)
uri (str, optional) -- a URI for the row, which can be set and used by the data store implementation if appropriate, by default None
metadata (dict, optional) -- generic metadata associated with the row