Developer guide¶
Contributions to the project and extensions are more than welcome in various forms. Please see the contribution guide for details. If you contribute code, documentation or bug reports to the repository, please add your name and affiliation to the Zenodo file.
Dev install¶
To install a development version of frametree, clone the GitHub repository https://github.com/ArcanaFramework/frametree and install it as an editable package with pip, using the dev install option:
$ pip3 install -e /path/to/local/frametree/repo[dev]
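For example, assuming the repository is cloned into the current working directory (the quotes stop the shell from glob-expanding the brackets):
$ git clone https://github.com/ArcanaFramework/frametree.git
$ pip3 install -e './frametree[dev]'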
Extensions¶
The core FrameTree code base is implemented in the frametree.core module. Extensions which implement data store connectors and analyses are installed in separate packages (e.g. frametree-xnat, frametree-bids). Use the extension template on GitHub as a starting point. Note that all Store and Axes subclasses should be imported into the extension package root (e.g. frametree.xnat.__init__.py) so they can be found by CLI commands. Additional CLI commands specific to a particular backend should be implemented as click commands under the frametree.core.cli.ext group and also imported into the subpackage root, as sketched below.
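As a rough sketch, an extension might be laid out as follows. All names other than the frametree.core.cli.ext module path are hypothetical, and the name of the click group object within that module is an assumption, so check the extension template for the exact pattern:
# frametree/mystore/__init__.py -- hypothetical extension package root
from .store import MyStore  # Store subclass, re-exported so CLI commands can find it
from .axes import MyAxes    # Axes subclass, likewise
from . import cli           # importing the module registers the commands below

# frametree/mystore/cli.py -- backend-specific CLI commands
import click

from frametree.core.cli.ext import ext  # assumed name of the click group object

@ext.command(name="my-backend-info")
@click.argument("dataset_id")
def my_backend_info(dataset_id: str) -> None:
    """Hypothetical command that inspects a dataset on the custom backend."""
    click.echo(f"Inspecting {dataset_id} on MyStore...")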
Alternative Backends¶
Alternative storage systems can be implemented by writing a new subclass of Store. If you would like help writing a new storage backend, please create an issue for it in the GitHub Issue Tracker.
In addition to the base Store class, which lays out the interface to be implemented by all backend implementations, two partial implementations, LocalStore and RemoteStore, are provided as starting points for alternative backend implementations. These partial implementations have slightly more specific abstract methods to implement and handle some of the functionality common to local and remote stores.
Local stores¶
The LocalStore partial implementation is for data stores that map specific data structures stored in directory trees on the local file-system (even if they are mounted from network drives), such as the basic FileSystem or the Bids stores. Implementations of the following abstract methods are required to create a local store; a minimal subclass skeleton is sketched after the method list.
- class frametree.core.store.LocalStore(name: str)[source]
A Repository class for data stored hierarchically within sub-directories of a file-system directory. The depth of the tree, and which layers of the data tree the sub-directories correspond to, are defined by the hierarchy argument.
- Parameters:
base_dir (str) -- Path to the base directory of the "store", i.e. datasets are arranged by name as sub-directories of the base dir.
- abstract create_data_tree(id: str, leaves: List[Tuple[str, ...]], hierarchy: List[str], axes: Type[Axes], **kwargs: Any) None
Creates a new empty dataset within the store. Used in test routines and when importing/exporting datasets between stores
- Parameters:
id (str) -- ID for the newly created dataset
leaves (list[tuple[str, ...]]) -- list of IDs for each leaf node to be added to the dataset. The IDs for each leaf should be a tuple with an ID for each level in the tree's hierarchy, e.g. for a hierarchy of [subject, visit] -> [("SUBJ01", "TIMEPOINT01"), ("SUBJ01", "TIMEPOINT02"), ....]
hierarchy (ty.List[str]) -- the hierarchy of the dataset to be created
id_patterns (dict[str, str]) -- Patterns for inferring IDs of rows not explicitly present in the hierarchy of the data tree. See Store.infer_ids() for syntax
**kwargs -- implementing methods should take wildcard kwargs to allow compatibility with future arguments that might be added
- abstract field_uri(path: str, datatype: type, row: DataRow) str [source]
Returns the "uri" (e.g. file-system path relative to root dir) of a field entry at the given path relative to the given row
- abstract fileset_uri(path: str, datatype: type, row: DataRow) str [source]
Returns the "uri" (e.g. file-system path relative to root dir) of a file-set entry at the given path relative to the given row
- abstract get_field(entry: DataEntry, datatype: type) Field[Any, Any] [source]
Retrieves a field from a data entry
- Parameters:
entry (DataEntry) -- the entry to retrieve the file-set from
datatype (type) -- the type of the field
- Returns:
the retrieved field
- Return type:
Field
- abstract get_field_provenance(entry: DataEntry) Dict[str, Any] | None [source]
Retrieves provenance associated with a field data entry
- Parameters:
entry (DataEntry) -- the entry of the field to retrieve the provenance for
- Returns:
the retrieved provenance
- Return type:
ty.Dict[str, ty.Any] or None
- abstract get_fileset(entry: DataEntry, datatype: type) FileSet [source]
Retrieves a file-set from a data entry
- Parameters:
entry (DataEntry) -- the entry to retrieve the file-set from
datatype (type) -- the type of the file-set
- Returns:
the retrieved file-set
- Return type:
FileSet
- abstract get_fileset_provenance(entry: DataEntry) Dict[str, Any] | None [source]
Retrieves provenance associated with a file-set data entry
- Parameters:
entry (DataEntry) -- the entry of the file-set to retrieve the provenance for
- Returns:
the retrieved provenance
- Return type:
ty.Dict[str, ty.Any] or None
- abstract populate_row(row: DataRow) None
Populate a row with all data entries found in the corresponding node in the data store (e.g. files within a directory, scans within an XNAT session) using the
DataRow.add_entry
method. Within a node/row there are assumed to be two types of entries: "primary" entries (e.g. acquired scans) common to all analyses performed on the dataset, and "derivative" entries corresponding to intermediate outputs of previously performed analyses. These types should be stored in separate namespaces so there is no chance of a derivative overriding a primary data item. The name of the dataset/analysis a derivative was generated by is appended to a base path, delimited by "@", e.g. "brain_mask@my_analysis". The dataset name is left blank by default, in which case "@" is just appended to the derivative path, i.e. "brain_mask@".
- Parameters:
row (DataRow) -- The row to populate with entries
- abstract populate_tree(tree: DataTree) None
Populates the nodes of the data tree with those found in the dataset using the
DataTree.add_leaf
method for every "leaf" node of the dataset tree.The order that the tree leaves are added is important and should be consistent between reads, because it is used to give default values to the ID's of data space axes not explicitly in the hierarchy of the tree.
- Parameters:
tree (DataTree) -- The tree to populate with nodes
- abstract put_field(field: Field[Any, Any], entry: DataEntry) None [source]
Stores a field into a data entry
- Parameters:
field (Field) -- the field to store
entry (DataEntry) -- the entry to store the field in
- abstract put_field_provenance(provenance: Dict[str, Any], entry: DataEntry) None [source]
Puts provenance associated with a field data entry into the store
- abstract put_fileset(fileset: FileSet, entry: DataEntry) FileSet [source]
Stores a file-set into a data entry
- Parameters:
fileset (FileSet) -- the file-set to store
entry (DataEntry) -- the entry to store the file-set in
- Returns:
the file-set within the store
- Return type:
FileSet
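As a starting point, a minimal subclass skeleton might look like the following. The class name and comments are hypothetical placeholders that mirror the abstract methods documented above; a real backend may need further overrides, so treat this as a sketch rather than a complete implementation:
from frametree.core.store import LocalStore


class SketchStore(LocalStore):
    """Hypothetical local backend; every method body is a placeholder."""

    def create_data_tree(self, id, leaves, hierarchy, axes, **kwargs):
        ...  # create one sub-directory per leaf tuple under the dataset root

    def populate_tree(self, tree):
        ...  # walk the directory tree, calling tree.add_leaf() for every leaf

    def populate_row(self, row):
        ...  # scan the row's directory, calling row.add_entry() for each item

    def fileset_uri(self, path, datatype, row):
        ...  # map (row, path) to a file-system path relative to the root dir

    def field_uri(self, path, datatype, row):
        ...  # e.g. a key in a fields file inside the row's directory

    def get_fileset(self, entry, datatype):
        ...  # e.g. instantiate the datatype from the files at the entry's URI

    def put_fileset(self, fileset, entry):
        ...  # copy the files to the entry's path and return the stored copy

    def get_field(self, entry, datatype):
        ...  # read the stored value and cast it to the requested datatype

    def put_field(self, field, entry):
        ...  # serialise the field's value at the entry's path

    def get_fileset_provenance(self, entry):
        ...  # load the provenance dict saved alongside the file-set, if any

    def get_field_provenance(self, entry):
        ...  # load the provenance dict saved alongside the field, if any

    def put_field_provenance(self, provenance, entry):
        ...  # save the provenance dict alongside the field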
Remote stores¶
The RemoteStore partial implementation is for managed informatics platforms such as XNAT and Flywheel. It has a slightly different set of abstract methods that need to be implemented, such as connect and disconnect, which handle logging in to and out of the platform. A partial sketch of a subclass follows the method list below.
- class frametree.core.store.RemoteStore(server: str, cache_dir: str | Path, name: str | None = None, user: str = None, password: str = None, race_condition_delay: int = 5)[source]
Access class for remote data repositories such as XNAT
- Parameters:
server (str (URI)) -- URI of the remote server to connect to
cache_dir (Path) -- Path to a local directory in which to cache remote data
name (str, optional) -- the name of the store as it is saved in the store config file, by default None
user (str, optional) -- username with which to connect to the server, by default None
password (str, optional) -- password with which to connect to the server, by default None
race_condition_delay (int) -- the length of time to wait before re-checking whether a sibling process that is downloading the same fileset to the cache has completed
- abstract calculate_checksums(fileset: FileSet) Dict[str, str] [source]
Calculates the checksum digests associated with the files in the file-set. These checksums should be generated with the same hashing algorithm used by the remote store (e.g. MD5, SHA256)
- abstract connect() Any
If a connection session to the store is required, open it here and return the session object
- Returns:
session -- a session object that will be stored in the connection manager and accessible at Store.connection
- Return type:
Any
- abstract create_data_tree(id: str, leaves: List[Tuple[str, ...]], hierarchy: List[str], axes: Type[Axes], **kwargs: Any) None
Creates a new empty dataset within the store. Used in test routines and when importing/exporting datasets between stores
- Parameters:
id (str) -- ID for the newly created dataset
leaves (list[tuple[str, ...]]) -- list of IDs for each leaf node to be added to the dataset. The IDs for each leaf should be a tuple with an ID for each level in the tree's hierarchy, e.g. for a hierarchy of [subject, visit] -> [("SUBJ01", "TIMEPOINT01"), ("SUBJ01", "TIMEPOINT02"), ....]
hierarchy (ty.List[str]) -- the hierarchy of the dataset to be created
id_patterns (dict[str, str]) -- Patterns for inferring IDs of rows not explicitly present in the hierarchy of the data tree. See Store.infer_ids() for syntax
**kwargs -- implementing methods should take wildcard kwargs to allow compatibility with future arguments that might be added
- abstract create_field_entry(path: str, datatype: type, row: DataRow) DataEntry [source]
Creates a new resource entry to store a field
- abstract create_fileset_entry(path: str, datatype: type, row: DataRow) DataEntry [source]
Creates a new resource entry to store a fileset
- abstract disconnect(session: Any) None
If a connection session to the store is required, close it gracefully here
- Parameters:
session (Any) -- the session object returned by connect to be closed gracefully
- abstract download_files(entry: DataEntry, download_dir: Path) Path [source]
Download the files associated with the given entry in the data store, using download_dir as a temporary storage location (monitored by downloads in sibling processes to detect whether download activity has stalled), and return the path to a directory containing only the downloaded files
- Parameters:
entry (DataEntry) -- entry in the data store to download the files/directories from
download_dir (Path) -- temporary storage location for the downloaded files and/or compressed archives. Monitored by sibling processes to detect if download activity has stalled.
- Returns:
output_dir -- a directory containing the downloaded files/directories and nothing else
- Return type:
Path
- abstract download_value(entry: DataEntry) float | int | str | List[float] | List[int] | List[str] [source]
Extract and return the value of the field from the store
- abstract get_checksums(uri: str) Dict[str, str] [source]
Downloads the checksum digests associated with the files in the file-set. These are saved with the downloaded files in the cache and used to check if the files have been updated on the server
- Parameters:
uri (str) -- uri of the data item to download the checksums for
- abstract get_provenance(entry: DataEntry) ty.Dict[str, ty.Any]
Retrieves provenance information for a given data item from the store
- Parameters:
entry (DataEntry) -- The item to retrieve the provenance data for
- Returns:
provenance -- The provenance data stored in the repository for the data item. None if no provenance data has been stored
- Return type:
ty.Dict[str, Any] or None
- abstract load_frameset_definition(dataset_id: str, name: str) Dict[str, Any]
Load definition of a dataset saved within the store
- Parameters:
dataset_id (str) -- The ID/path of the dataset within the store
name (str) -- Name of the dataset definition, to distinguish it from other definitions for the same directory/project
- Returns:
definition -- A dictionary containing the FrameSet definition that was saved in the data store
- Return type:
ty.Dict[str, Any]
- abstract populate_row(row: DataRow) None
Populate a row with all data entries found in the corresponding node in the data store (e.g. files within a directory, scans within an XNAT session) using the
DataRow.add_entry
method. Within a node/row there are assumed to be two types of entries: "primary" entries (e.g. acquired scans) common to all analyses performed on the dataset, and "derivative" entries corresponding to intermediate outputs of previously performed analyses. These types should be stored in separate namespaces so there is no chance of a derivative overriding a primary data item. The name of the dataset/analysis a derivative was generated by is appended to a base path, delimited by "@", e.g. "brain_mask@my_analysis". The dataset name is left blank by default, in which case "@" is just appended to the derivative path, i.e. "brain_mask@".
- Parameters:
row (DataRow) -- The row to populate with entries
- abstract populate_tree(tree: DataTree) None
Populates the nodes of the data tree with those found in the dataset using the
DataTree.add_leaf
method for every "leaf" node of the dataset tree.The order that the tree leaves are added is important and should be consistent between reads, because it is used to give default values to the ID's of data space axes not explicitly in the hierarchy of the tree.
- Parameters:
tree (DataTree) -- The tree to populate with nodes
- put_checksums(uri: str, fileset: FileSet) Dict[str, str] [source]
Uploads the checksum digests associated with the files in the file-set to the repository. Can be left to raise NotImplementedError if the repository calculates its own checksums internally on upload.
- abstract put_provenance(provenance: ty.Dict[str, ty.Any], entry: DataEntry) None
Stores provenance information for a given data item in the store
- Parameters:
entry (DataEntry) -- The item to store the provenance data for
provenance (ty.Dict[str, Any]) -- The provenance data to store
- abstract save_frameset_definition(dataset_id: str, definition: Dict[str, Any], name: str) None
Save definition of dataset within the store
- Parameters:
dataset_id (str) -- The ID/path of the dataset within the store
definition (ty.Dict[str, Any]) -- A dictionary containing the FrameSet definition to be saved. The dictionary is in a format ready to be dumped to file as JSON or YAML.
name (str) -- Name for the dataset definition to distinguish it from other definitions for the same directory/project
- abstract upload_files(input_dir: Path, entry: DataEntry)[source]
Upload all files contained within input_dir to the specified entry in the data store
- Parameters:
input_dir (Path) -- directory containing the files/directories to be uploaded
entry (DataEntry) -- the entry in the data store to upload the files to
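To illustrate the connection-management side, a partial subclass sketch follows. The use of requests.Session is purely illustrative (any client object can be returned from connect), and the assumption that the constructor's user and password arguments are available as attributes is unverified:
import requests  # illustrative HTTP client only; any session object works

from frametree.core.store import RemoteStore


class SketchRemote(RemoteStore):
    """Hypothetical REST-backed store; only a few overrides are shown."""

    def connect(self) -> requests.Session:
        # Open the session; FrameTree keeps the returned object in the
        # connection manager, accessible at Store.connection
        session = requests.Session()
        session.auth = (self.user, self.password)  # attrs assumed from __init__
        return session

    def disconnect(self, session: requests.Session) -> None:
        session.close()  # log out / release the connection gracefully

    def download_files(self, entry, download_dir):
        ...  # fetch the entry's files into download_dir; return the final dir

    def upload_files(self, input_dir, entry):
        ...  # push everything under input_dir to the entry's URI

    # The remaining abstract methods (calculate_checksums, get_checksums,
    # create_field_entry, create_fileset_entry, download_value, the
    # provenance and frameset-definition methods, populate_row and
    # populate_tree) are omitted from this sketch.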
New axes¶
FrameTree was initially developed for medical-imaging analysis, so if you are planning to use it for alternative domains you may need to add support for domain-specific file formats and "data axes". File formats are specified using the FileFormats package; please refer to its documentation on how to add new file formats.
New data axes (see Axes) are defined by extending the Axes abstract base class. Axes subclasses should be enums with binary string values of consistent length (i.e. all of length 2, or all of length 3, etc.). The length of the binary string defines the number of axes. The enum must contain members for each permutation of the bit string (e.g. for 2 dimensions, there must be members corresponding to the values 0b00, 0b01, 0b10 and 0b11).
For example, in imaging studies, scanning sessions are typically organised by analysis group (e.g. test & control), membership within the group (i.e. matched subject ID) and time-points for longitudinal studies. In this case, we can visualise the imaging sessions arranged in a 3-D grid along the group, member, and visit bases. Note that datasets that only contain one group or time-point can still be represented by these axes; they are just singleton along the corresponding axis.
All bases should be included as members of an Axes subclass enum with orthogonal binary vector values, e.g.:
member = 0b001
group = 0b010
visit = 0b100
The axis that is most often non-singleton should be given the smallest bit, as it will be assumed to be the default when there is only one layer in the data tree; e.g. imaging datasets will not always have different groups or time-points, but will always have different members (which are equivalent to subjects when there is only one group).
The "leaf rows" of a data tree, imaging sessions in this example, will be the bitwise-and of the dimension vectors, i.e. an imaging session is uniquely defined by its member, group and visit ID.:
session = 0b111
In addition to the data items stored in leaf rows, some data, particularly derivatives, may be stored in the dataset along a particular dimension, at a lower "row_frequency" than 'per session'. For example, brain templates are sometimes calculated 'per group'. Data can also be stored in aggregated rows that span a plane of the frameset. These frequencies should also be added to the enum, i.e. all permutations of the base dimensions must be included, and given intuitive names where possible:
subject = 0b011 - a uniquely identified subject within the dataset
groupedvisit = 0b110 - separate group + visit combinations
matchedvisit = 0b101 - matched members and visits aggregated across groups
Finally, for items that are singular across the whole dataset there should also be a dataset-wide member with value=0:
constant = 0b000
For example, if you wanted to analyse daily recordings from various weather stations, you could define a 2-dimensional "Weather" data space with axes for the date and weather station of the recordings, with the following code:
from frametree.core.axes import Axes

class Weather(Axes):

    # Define the axes of the dataspace
    visit = 0b01
    station = 0b10

    # Name the leaf and root frequencies of the data space
    recording = 0b11
    constant = 0b00
Note
All permutations of N-D binary strings need to be named within the enum.
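Since combined frequencies are bitwise ORs of the axis values, the relationships can be sanity-checked with plain integer arithmetic. This assumes nothing beyond the enum members defined above carrying the integer values shown:
# The leaf frequency covers all axes, so it is the union of their bits
assert Weather.recording.value == Weather.visit.value | Weather.station.value

# The dataset-wide frequency carries no axis bits at all
assert Weather.constant.value == 0b00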