Developer guide

Contributions to the project and its extensions are more than welcome in various forms. Please see the contribution guide for details. If you contribute code, documentation or bug reports to the repository, please add your name and affiliation to the Zenodo file.

Dev install

To install a development version of frametree, clone the GitHub repository https://github.com/ArcanaFramework/frametree and install it as an editable package with pip, using the dev install option:

$ pip3 install -e /path/to/local/frametree/repo[dev]

Extensions

The core FrameTree code base is implemented in the frametree.core module. Extensions that implement data store connectors and analyses are distributed as separate packages (e.g. frametree-xnat, frametree-bids). Use the extension template on GitHub as a starting point. Note that all Store and Axes subclasses should be imported into the extension package root (e.g. frametree.xnat.__init__.py) so they can be found by CLI commands. Additional CLI commands specific to a particular backend should be implemented as click commands under the frametree.core.cli.ext group and also imported into the subpackage root, as sketched below.
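
For orientation, the following sketch shows what such an extension layout might look like. The package, module and command names are hypothetical, and the exact import location of the ext group should be checked against the core package.

# Hypothetical extension package root, e.g. frametree/mystore/__init__.py
# Importing the Store and Axes subclasses here lets the CLI discover them
from .store import MyStore   # hypothetical Store subclass
from .axes import MyAxes     # hypothetical Axes subclass
from .cli import my_command  # backend-specific CLI command (defined below)

# Hypothetical CLI module, e.g. frametree/mystore/cli.py
import click
from frametree.core.cli import ext  # the "ext" command group (import path assumed)

@ext.command(name="my-command")
@click.argument("dataset_id")
def my_command(dataset_id: str) -> None:
    """Backend-specific command (sketch only)."""
    click.echo(f"Would run my-command on dataset {dataset_id}")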

Alternative Backends

Alternative storage systems can be implemented by writing a new subclass of Store. If you would like help writing a new storage backend, please create an issue for it in the GitHub Issue Tracker.

In addition to the base Store class, which lays out the interface to be implemented by all backend implementations, two partial implementations, LocalStore and RemoteStore, are provided as starting points for alternative backends. These partial implementations expose slightly more specific abstract methods to implement and handle some of the functionality common to local and remote stores, respectively.

Local stores

The LocalStore partial implementation is for data stores that map specific data structures onto directory trees on the local file system (even if these are mounted from network drives), such as the basic FileSystem or Bids stores. Implementations of the following abstract methods are required to create a local store (a skeleton sketch follows the method listing).

class frametree.core.store.LocalStore(name: str)[source]

A repository class for data stored hierarchically within sub-directories of a file-system directory. The depth of the tree, and which layers of the data tree the sub-directories correspond to, are defined by the hierarchy argument.

Parameters:

base_dir (str) -- Path to the base directory of the "store", i.e. datasets are arranged by name as sub-directories of the base dir.

abstract create_data_tree(id: str, leaves: List[Tuple[str, ...]], hierarchy: List[str], axes: Type[Axes], **kwargs: Any) None

Creates a new empty dataset within the store. Used in test routines and when importing/exporting datasets between stores

Parameters:
  • id (str) -- ID for the newly created dataset

  • leaves (list[tuple[str, ...]]) -- list of IDs for each leaf node to be added to the dataset. The IDs for each leaf should be a tuple with an ID for each level in the tree's hierarchy, e.g. for a hierarchy of [subject, visit] -> [("SUBJ01", "TIMEPOINT01"), ("SUBJ01", "TIMEPOINT02"), ....]

  • hierarchy (ty.List[str]) -- the hierarchy of the dataset to be created

  • axes (type(Axes)) -- the data axes of the dataset

  • id_patterns (dict[str, str]) -- Patterns for inferring IDs of rows not explicitly present in the hierarchy of the data tree. See Store.infer_ids() for syntax

  • **kwargs -- implementing methods should take wildcard kwargs to allow compatibility with future arguments that might be added

abstract field_uri(path: str, datatype: type, row: DataRow) str[source]

Returns the "uri" (e.g. file-system path relative to root dir) of a field entry at the given path relative to the given row

Parameters:
  • path (str) -- path to the entry relative to the row

  • datatype (type) -- the datatype of the entry

  • row (DataRow) -- the row of the entry

Returns:

uri -- the "uri" to the field entry relative to the data store

Return type:

str

abstract fileset_uri(path: str, datatype: type, row: DataRow) str[source]

Returns the "uri" (e.g. file-system path relative to root dir) of a file-set entry at the given path relative to the given row

Parameters:
  • path (str) -- path to the entry relative to the row

  • datatype (type) -- the datatype of the entry

  • row (DataRow) -- the row of the entry

Returns:

uri -- the "uri" to the file-set entry relative to the data store

Return type:

str

abstract get_field(entry: DataEntry, datatype: type) Field[Any, Any][source]

Retrieves a field from a data entry

Parameters:
  • entry (DataEntry) -- the entry to retrieve the field from

  • datatype (type) -- the datatype of the field

Returns:

the retrieved field

Return type:

Field

abstract get_field_provenance(entry: DataEntry) Dict[str, Any] | None[source]

Retrieves provenance associated with a field data entry

Parameters:

entry (DataEntry) -- the entry of the field to retrieve the provenance for

Returns:

the retrieved provenance

Return type:

ty.Dict[str, ty.Any] or None

abstract get_fileset(entry: DataEntry, datatype: type) FileSet[source]

Retrieves a file-set from a data entry

Parameters:
  • entry (DataEntry) -- the entry to retrieve the file-set from

  • datatype (type) -- the type of the file-set

Returns:

the retrieved file-set

Return type:

FileSet

abstract get_fileset_provenance(entry: DataEntry) Dict[str, Any] | None[source]

Retrieves provenance associated with a file-set data entry

Parameters:

entry (DataEntry) -- the entry of the file-set to retrieve the provenance for

Returns:

the retrieved provenance

Return type:

ty.Dict[str, ty.Any] or None

abstract populate_row(row: DataRow) None

Populate a row with all data entries found in the corresponding node in the data store (e.g. files within a directory, scans within an XNAT session) using the DataRow.add_entry method. Within a node/row there are assumed to be two types of entries, "primary" entries (e.g. acquired scans) common to all analyses performed on the dataset and "derivative" entries corresponding to intermediate outputs of previously performed analyses. These types should be stored in separate namespaces so there is no chance of a derivative overriding a primary data item.

The name of the dataset/analysis a derivative was generated by is appended to a base path, delimited by "@", e.g. "brain_mask@my_analysis". The dataset name is left blank by default, in which case "@" is simply appended to the derivative path, i.e. "brain_mask@".

Parameters:

row (DataRow) -- The row to populate with entries

abstract populate_tree(tree: DataTree) None

Populates the nodes of the data tree with those found in the dataset using the DataTree.add_leaf method for every "leaf" node of the dataset tree.

The order in which the tree leaves are added is important and should be consistent between reads, because it is used to give default values to the IDs of data space axes not explicitly in the hierarchy of the tree.

Parameters:

tree (DataTree) -- The tree to populate with nodes

abstract put_field(field: Field[Any, Any], entry: DataEntry) None[source]

Stores a field into a data entry

Parameters:
  • field (Field) -- the field to store

  • entry (DataEntry) -- the entry to store the field in

abstract put_field_provenance(provenance: Dict[str, Any], entry: DataEntry) None[source]

Puts provenance associated with a field data entry into the store

Parameters:
  • provenance (dict[str, ty.Any]) -- the provenance to store

  • entry (DataEntry) -- the entry to associate the provenance with

abstract put_fileset(fileset: FileSet, entry: DataEntry) FileSet[source]

Stores a file-set into a data entry

Parameters:
  • fileset (FileSet) -- the file-set to store

  • entry (DataEntry) -- the entry to store the file-set in

Returns:

the file-set within the store

Return type:

FileSet

abstract put_fileset_provenance(provenance: Dict[str, Any], entry: DataEntry) None[source]

Puts provenance associated with a file-set data entry into the store

Parameters:
  • provenance (dict[str, ty.Any]) -- the provenance to store

  • entry (DataEntry) -- the entry to associate the provenance with
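
Putting the above together, a minimal skeleton of a LocalStore subclass might look like the following sketch. The class name, directory layout, and attribute names such as row.id, entry.uri and self.base_dir are assumptions made for illustration; only a few methods are fleshed out and the rest remain stubs.

import json
from pathlib import Path

from frametree.core.store import LocalStore  # import path as in the reference above


class JsonDirStore(LocalStore):
    """Hypothetical local store: file-sets live under <base_dir>/<row-id>/
    and fields in a single JSON file per row (sketch only)."""

    # Addressing: a "uri" here is just a path relative to the dataset root
    def fileset_uri(self, path, datatype, row):
        return f"{row.id}/{path}"

    def field_uri(self, path, datatype, row):
        return f"{row.id}/__fields__.json::{path}"

    # Reading: look the field value up in the row's JSON file
    def get_field(self, entry, datatype):
        fields_file, _, key = entry.uri.partition("::")
        with open(Path(self.base_dir) / fields_file) as f:
            return datatype(json.load(f)[key])

    # Remaining abstract methods from the listing above still to be implemented:
    # create_data_tree, populate_tree, populate_row, get_fileset,
    # get_field_provenance, get_fileset_provenance, put_field, put_fileset,
    # put_field_provenance, put_fileset_provenance
    def get_fileset(self, entry, datatype):
        raise NotImplementedError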

Remote stores

The RemoteStore partial implementation is for managed informatics platforms such as XNAT and Flywheel. It has a slightly different set of abstract methods that need to be implemented, such as connect and disconnect, which handle logging in and out of the platform (a skeleton sketch follows the method listing).

class frametree.core.store.RemoteStore(server: str, cache_dir: str | Path, name: str | None = None, user: str = None, password: str = None, race_condition_delay: int = 5)[source]

Access class for remote data repositories (e.g. XNAT)

Parameters:
  • server (str (URI)) -- URI of the remote server to connect to

  • cache_dir (Path) -- Path to local directory to cache remote data in

  • name (str, optional) -- the name of the store as it is saved in the store config file, by default None

  • user (str, optional) -- Username with which to connect to the remote server, by default None

  • password (str, optional) -- Password with which to connect to the remote server, by default None

  • race_condition_delay (int) -- The number of seconds to wait before re-checking whether a download of the same fileset to the cache by another process has completed

abstract calculate_checksums(fileset: FileSet) Dict[str, str][source]

Calculates the checksum digests associated with the files in the file-set. These checksums should be generated with the same hashing algorithm as used by the remote store (e.g. MD5, SHA256)

Parameters:

fileset (FileSet) -- the local file-set to calculate the checksums for

Returns:

checksums -- the checksums calculated from the local file-set

Return type:

dict[str, str]

abstract connect() Any

If a connection session is required by the store, manage it here

Returns:

session -- a session object that will be stored in the connection manager and accessible at Store.connection

Return type:

Any

abstract create_data_tree(id: str, leaves: List[Tuple[str, ...]], hierarchy: List[str], axes: Type[Axes], **kwargs: Any) None

Creates a new empty dataset within the store. Used in test routines and when importing/exporting datasets between stores

Parameters:
  • id (str) -- ID for the newly created dataset

  • leaves (list[tuple[str, ...]]) -- list of IDs for each leaf node to be added to the dataset. The IDs for each leaf should be a tuple with an ID for each level in the tree's hierarchy, e.g. for a hierarchy of [subject, visit] -> [("SUBJ01", "TIMEPOINT01"), ("SUBJ01", "TIMEPOINT02"), ....]

  • hierarchy (ty.List[str]) -- the hierarchy of the dataset to be created

  • axes (type(Axes)) -- the data axes of the dataset

  • id_patterns (dict[str, str]) -- Patterns for inferring IDs of rows not explicitly present in the hierarchy of the data tree. See Store.infer_ids() for syntax

  • **kwargs -- implementing methods should take wildcard kwargs to allow compatibility with future arguments that might be added

abstract create_field_entry(path: str, datatype: type, row: DataRow) DataEntry[source]

Creates a new resource entry to store a field

Parameters:
  • path (str) -- the path to the entry relative to the row

  • datatype (type) -- the datatype of the entry

  • row (DataRow) -- the row of the data entry

Returns:

entry -- the created entry for the field

Return type:

DataEntry

abstract create_fileset_entry(path: str, datatype: type, row: DataRow) DataEntry[source]

Creates a new resource entry to store a fileset

Parameters:
  • path (str) -- the path to the entry relative to the row

  • datatype (type) -- the datatype of the entry

  • row (DataRow) -- the row of the data entry

Returns:

entry -- the created entry for the file-set

Return type:

DataEntry

abstract disconnect(session: Any) None

If a connection session is required by the store, close it gracefully here

Parameters:

session (Any) -- the session object returned by connect to be closed gracefully

abstract download_files(entry: DataEntry, download_dir: Path) Path[source]

Download the files associated with the given entry in the data store, using download_dir as a temporary storage location (monitored by downloads in sibling processes to detect whether download activity has stalled), and return the path to a directory containing only the downloaded files

Parameters:
  • entry (DataEntry) -- entry in the data store to download the files/directories from

  • download_dir (Path) -- temporary storage location for the downloaded files and/or compressed archives. Monitored by sibling processes to detect if download activity has stalled.

Returns:

output_dir -- a directory containing the downloaded files/directories and nothing else

Return type:

Path

abstract download_value(entry: DataEntry) float | int | str | List[float] | List[int] | List[str][source]

Extract and return the value of the field from the store

Parameters:

entry (DataEntry) -- The data entry to retrieve the value from

Returns:

value -- The value of the Field

Return type:

float or int or str or ty.List[float] or ty.List[int] or ty.List[str]

abstract get_checksums(uri: str) Dict[str, str][source]

Downloads the checksum digests associated with the files in the file-set. These are saved with the downloaded files in the cache and used to check whether the files have been updated on the server

Parameters:

uri (str) -- uri of the data item to download the checksums for

Returns:

checksums -- the checksums downloaded from the store

Return type:

dict[str, str]

abstract get_provenance(entry: DataEntry) ty.Dict[str, ty.Any]

Retrieves provenance information for a given data item in the store

Parameters:

entry (DataEntry) -- The item to retrieve the provenance data for

Returns:

provenance -- The provenance data stored in the repository for the data item. None if no provenance data has been stored

Return type:

ty.Dict[str, Any] or None

abstract load_frameset_definition(dataset_id: str, name: str) Dict[str, Any]

Load definition of a dataset saved within the store

Parameters:
  • dataset_id (str) -- The ID (e.g. file-system path, XNAT project ID) of the project

  • name (str) -- Name for the dataset definition to distinguish it from other definitions for the same directory/project

Returns:

definition -- A dictionary representation of the FrameSet that was saved in the data store

Return type:

ty.Dict[str, Any]

abstract populate_row(row: DataRow) None

Populate a row with all data entries found in the corresponding node in the data store (e.g. files within a directory, scans within an XNAT session) using the DataRow.add_entry method. Within a node/row there are assumed to be two types of entries, "primary" entries (e.g. acquired scans) common to all analyses performed on the dataset and "derivative" entries corresponding to intermediate outputs of previously performed analyses. These types should be stored in separate namespaces so there is no chance of a derivative overriding a primary data item.

The name of the dataset/analysis a derivative was generated by is appended to a base path, delimited by "@", e.g. "brain_mask@my_analysis". The dataset name is left blank by default, in which case "@" is simply appended to the derivative path, i.e. "brain_mask@".

Parameters:

row (DataRow) -- The row to populate with entries

abstract populate_tree(tree: DataTree) None

Populates the nodes of the data tree with those found in the dataset using the DataTree.add_leaf method for every "leaf" node of the dataset tree.

The order in which the tree leaves are added is important and should be consistent between reads, because it is used to give default values to the IDs of data space axes not explicitly in the hierarchy of the tree.

Parameters:

tree (DataTree) -- The tree to populate with nodes

put_checksums(uri: str, fileset: FileSet) Dict[str, str][source]

Uploads the checksum digests associated with the files in the file-set to the repository. Can be left to raise NotImplementedError if the repository calculates its own checksums internally on upload.

Parameters:
  • uri (str) -- uri of the data item to upload the checksums of

  • fileset (FileSet) -- the fileset to calculate and upload the checksums for

Returns:

checksums -- the calculated checksums

Return type:

dict[str, str]

abstract put_provenance(provenance: ty.Dict[str, ty.Any], entry: DataEntry) None

Stores provenance information for a given data item in the store

Parameters:
  • entry (DataEntry) -- The item to store the provenance data for

  • provenance (ty.Dict[str, Any]) -- The provenance data to store

abstract save_frameset_definition(dataset_id: str, definition: Dict[str, Any], name: str) None

Save definition of dataset within the store

Parameters:
  • dataset_id (str) -- The ID/path of the dataset within the store

  • definition (ty.Dict[str, Any]) -- A dictionary representation of the FrameSet to be saved. The dictionary is in a format ready to be dumped to file as JSON or YAML.

  • name (str) -- Name for the dataset definition to distinguish it from other definitions for the same directory/project

abstract upload_files(input_dir: Path, entry: DataEntry)[source]

Upload all files contained within input_dir to the specified entry in the data store

Parameters:
  • input_dir (Path) -- directory containing the files/directories to be uploaded

  • entry (DataEntry) -- the entry in the data store to upload the files to

abstract upload_value(value: float | int | str | List[float] | List[int] | List[str], entry: DataEntry)[source]

Store the value for a field in the remote repository

Parameters:
  • value (float or int or str or ty.List[float] or ty.List[int] or ty.List[str]) -- the value to store in the entry

  • entry (DataEntry) -- the entry to store the value in
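
As a complement to the method listing above, the following is a minimal sketch of the connection handling for a RemoteStore subclass. The HTTP client, the class name, and attribute names such as self.user and self.password are assumptions made for illustration.

import requests  # hypothetical choice of client; any platform SDK could be used

from frametree.core.store import RemoteStore  # import path as in the reference above


class MyPlatform(RemoteStore):
    """Hypothetical remote store that talks to a REST API (sketch only)."""

    def connect(self):
        # Open an authenticated session. Whatever is returned here is kept by
        # the connection manager and accessible at Store.connection
        session = requests.Session()
        session.auth = (self.user, self.password)
        return session

    def disconnect(self, session):
        # Close the session opened by connect()
        session.close()

    def download_files(self, entry, download_dir):
        # Fetch the files for the entry into download_dir and return a
        # directory containing only the downloaded files
        raise NotImplementedError

    def upload_files(self, input_dir, entry):
        # Upload everything under input_dir to the entry in the remote store
        raise NotImplementedError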

New axes

FrameTree was initially developed for medical-imaging analysis. Therefore, if you are planning to use it in other domains, you may need to add support for domain-specific file formats and "data axes". File formats are specified using the FileFormats package; please refer to its documentation on how to add new file formats.

New data axes (see Axes) are defined by extending the Axes abstract base class. Axes subclasses are enums with binary values of a consistent bit-length (i.e. all of length 2, or all of length 3, etc.). The length of the binary string defines the number of axes. The enum must contain members for each permutation of the bit string (e.g. for 2 dimensions, there must be members corresponding to the values 0b00, 0b01, 0b10 and 0b11).

For example, in imaging studies, scanning sessions are typically organised by analysis group (e.g. test & control), membership within the group (i.e. matched subject ID) and time-points for longitudinal studies. In this case, we can visualise the imaging sessions arranged in a 3-D grid along the group, member, and visit bases. Note that datasets containing only one group or time-point can still be represented by these axes; they are simply singletons along the corresponding axis.

All bases should be included as members of an Axes subclass enum with orthogonal binary vector values, e.g.:

member = 0b001
group = 0b010
visit = 0b100

The axis that is most often non-singleton should be given the smallest bit, as it will be assumed to be the default when there is only one layer in the data tree; e.g. imaging datasets will not always have different groups or time-points, but will always have different members (which are equivalent to subjects when there is only one group).

The "leaf rows" of a data tree, imaging sessions in this example, will be the bitwise-and of the dimension vectors, i.e. an imaging session is uniquely defined by its member, group and visit ID.:

session = 0b111

In addition to the data items stored in leaf rows, some data, particularly derivatives, may be stored in the dataset along a particular dimension, at a lower "row_frequency" than 'per session'. For example, brain templates are sometimes calculated 'per group'. Additionally, data can also be stored in aggregated rows that span a plane of the frameset. These frequencies should also be added to the enum, i.e. all permutations of the base dimensions must be included and given intuitive names where possible:

subject = 0b011 - uniquely identified subject within the dataset
groupedvisit = 0b110 - separate group + visit combinations
matchedvisit = 0b101 - matched members and visits aggregated across groups

Finally, for items that are singular across the whole dataset, there should also be a dataset-wide member with a value of 0:

constant = 0b000
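
Collecting the values above into a single enum, a sketch of the 3-D imaging axes described in this example could look like the following (the class name is hypothetical and the actual axes class shipped with FrameTree may differ):

from frametree.core.axes import Axes

class Imaging3D(Axes):

    # Singleton row spanning the whole dataset
    constant = 0b000

    # Primary axes
    member = 0b001
    group = 0b010
    visit = 0b100

    # Planes formed by combining two axes
    subject = 0b011       # group + member
    matchedvisit = 0b101  # member + visit, aggregated across groups
    groupedvisit = 0b110  # group + visit combinations

    # Leaf rows, uniquely identified by member, group and visit
    session = 0b111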

For example, if you wanted to analyse daily recordings from various weather stations, you could define a 2-dimensional "Weather" data space with axes for the date and weather station of the recordings, with the following code:

from frametree.core.axes import Axes

class Weather(Axes):

    # Define the axes of the dataspace
    visit = 0b01
    station = 0b10

    # Name the leaf and root frequencies of the data space
    recording = 0b11
    constant = 0b00

Note

All permutations of N-D binary strings need to be named within the enum.