Frame sets

"Frame sets" are defined on datasets consisting of a collection of experiments classified by several categorical variables. For example, medical imaging sessions in a clinical trial could be categorised by study group, subject and longitudinal timepoint, or weather readings in a meteorological analysis by date-time and weather station. In frame sets, such experiments (i.e. imaging sessions or weather readings) are mapped onto the rows of virtual frames, one for each combination of categorical variables. The different measurements in each experiment (e.g. T1-weighted MRI, fMRI, atmospheric pressure, humidity), metadata (e.g. subject year of birth, weather station altitude) and derived metrics (e.g. average grey matter thickness, dew point) form the columns of the frames in the set.

In the case of clinical imaging research studies/trials, imaging sessions are classified by the subject who was scanned and, if applicable, the longitudinal timepoint. The subjects themselves are often classified by different study groups (e.g. test or control group). Therefore, we factor imaging session classifications into

group - study group (e.g. 'test' or 'control')
member - ID relative to group
- can be arbitrary or used to signify control-matched pairs
- e.g. the '03' in 'TEST03' & 'CONT03' pair of control-matched subject IDs
visit - longitudinal timepoint

Alternatively, for a meteorological analysis, data could be categorised by

datetime - the time of the reading
location - the location of the weather station

These "axes", and combinations thereof, form the "row frequencies" of the frames in the frameset, with one frame in the set per row frequency. Different types of acquisitions or metrics form the columns of the data frames. For example, the "per-session" frequency in a clinical imaging dataset is a combination of the three bases (group, member and visit), and the corresponding frame has columns for the different imaging acquisitions, e.g. 'T1-weighted MRI' or 'functional MRI'. In contrast, socioeconomic status and genetic data are constant for each subject, so they correspond to columns in the "per-subject" (group + member) data frame.
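The relationship between axes, row frequencies and frames can be sketched in plain Python. This is only an illustration of the idea, not FrameTree code: the `sessions` records and the `rows_for` helper are hypothetical names introduced here.

```python
from collections import defaultdict

# Hypothetical session records, each classified along the three clinical axes
sessions = [
    {"group": "test", "member": "01", "visit": "1"},
    {"group": "test", "member": "01", "visit": "2"},
    {"group": "control", "member": "01", "visit": "1"},
    {"group": "control", "member": "01", "visit": "2"},
]


def rows_for(frequency, records):
    """Group records into rows keyed by the axes that make up `frequency`."""
    rows = defaultdict(list)
    for record in records:
        key = tuple(record[axis] for axis in frequency)
        rows[key].append(record)
    return dict(rows)


# One frame per row frequency, each with a different number of rows
per_session = rows_for(("group", "member", "visit"), sessions)
per_subject = rows_for(("group", "member"), sessions)
per_visit = rows_for(("visit",), sessions)

print(len(per_session), len(per_subject), len(per_visit))  # 4 2 2
```

Each distinct key becomes one row of the corresponding frame, so the same four sessions yield four per-session rows, two per-subject rows and two per-visit rows.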

Defining framesets

Frameset definitions are stored within YAML files inside the dataset to be analysed. This allows analyses to be performed iteratively across different computing resources. In the FrameTree CLI, framesets are referenced by addresses of the form

<store-name>//<dataset-id>[@<frameset-name>]

where <store-name> is the nickname of the store as saved by 'frametree store add' (see Stores), and <dataset-id> is an identifier that specifies the dataset within the data store, e.g.

the file-system path to the data directory for file-system (and BIDS) stores
the project ID for XNAT stores
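As a rough illustration of how such an address breaks down into its components, the sketch below splits one into store name, dataset ID and optional frameset name. `parse_address` and `FramesetAddress` are hypothetical helpers written for this page, not part of the FrameTree API. Treating an address without `//` as a bare dataset ID is an assumption, based on the file-system examples in this guide that omit the store name.

```python
from typing import NamedTuple, Optional


class FramesetAddress(NamedTuple):
    store: Optional[str]  # None when no store name is given (assumption)
    dataset_id: str
    name: str  # empty string when no frameset name is given


def parse_address(address: str) -> FramesetAddress:
    """Split '<store-name>//<dataset-id>[@<frameset-name>]' into its parts."""
    store = None
    rest = address
    if "//" in address:
        store, rest = address.split("//", 1)
    # Everything after the last '@' is taken to be the frameset name
    dataset_id, sep, name = rest.rpartition("@")
    if not sep:
        dataset_id, name = rest, ""
    return FramesetAddress(store or None, dataset_id, name)


print(parse_address("xnat-central//MYXNATPROJECT@training"))
```

A dataset ID that itself contains `@` or `//` would need extra handling that this sketch does not attempt.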

For example, a project called "MYXNATPROJECT" stored in the xnat-central store can be defined with

$ # Create a reference to the Central XNAT instance and save it in the user home dir
$ frametree store add \
  xnat-central \
  --server https://central.xnat.org \
  --user $XNAT_USER \
  --password $XNAT_PASS

$ # Create the frameset definition and save it into the 'MYXNATPROJECT' XNAT project
$ frametree define xnat-central//MYXNATPROJECT

Alternatively via the Python API:

import os
from frametree.xnat import Xnat

# Create a store entry
xnat_ct = Xnat(
    server="https://central.xnat.org",
    user=os.environ["XNAT_USER"],
    password=os.environ["XNAT_PASS"]
)

# Save the store entry to your user profile under the name "xnat-central"
xnat_ct.save("xnat-central")

# Create the frameset definition
frameset = xnat_ct.define_frameset(id='MYXNATPROJECT')

# Save the frameset definition in the XNAT project
frameset.save()

<frameset-name> is an optional component (empty string by default), which allows multiple frame sets to be defined on the same dataset. This allows different exclusion criteria and pipeline parameterisations to be used for different analyses on the same dataset (see Subsets and Pipelines).

Axes

The virtual mapping from data trees to frames can be visualised by mapping the acquired data points onto a multi-dimensional grid, where the categorical variables used to distinguish the data points form the axes of the space. In this grid, the rows of the eventual data frames correspond to points, lines, planes, etc., depending on their row frequency.

Note

The frameset of a particular dataset can have a single point along any given dimension (e.g. one study group or visit) and still exist in the data space. Therefore, when creating data spaces it is better to be inclusive of potential categories to make them more general. In these cases some row frequencies are equivalent, e.g. member is equivalent to subject if there is only one study group.

This visualisation can be useful because in addition to data frames corresponding to row frequencies that explicitly appear in the hierarchy of the data tree, derived metrics can exist along any orientation of the grid.

These axes are defined in FrameTree by Axes enums. For clinical research/trials, the medimage.Clinical axes are defined as follows

Bases

group - study group, e.g. test or control
member - matched subject groups (e.g. age-matched test/control pair)
visit - visit number (e.g. longitudinal timepoint)

Combinations

session (member + group + visit) - imaging session
subject (member + group) - subject
groupedvisit (group + visit) - metadata/metrics for each study group at each visit
matchedvisit (member + visit) - metadata/metrics for each matched subject group at each visit
constant () - metadata/metrics that are constant across the analysis

See the Developer guide for help on designing custom Axes for different domains/analyses.
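To make the relationship between bases and combinations concrete, here is a minimal sketch using the standard-library `enum.Flag`, in which each base is one bit and each combination is the union of its bases. The class name `ClinicalSketch` and the helper `bases_of` are hypothetical. This is not the actual definition of the Clinical axes in FrameTree, which may be implemented differently.

```python
from enum import Flag


class ClinicalSketch(Flag):
    """Illustrative stand-in for an axes enum (not FrameTree's own class)."""

    # Bases: one bit per categorical variable
    member = 0b001
    group = 0b010
    visit = 0b100

    # Combinations: unions of the bases they span
    session = member | group | visit
    subject = member | group
    groupedvisit = group | visit
    matchedvisit = member | visit
    constant = 0b000


BASES = (ClinicalSketch.group, ClinicalSketch.member, ClinicalSketch.visit)


def bases_of(frequency):
    """Return the names of the base axes spanned by a row frequency."""
    return [base.name for base in BASES if base in frequency]


print(bases_of(ClinicalSketch.session))       # ['group', 'member', 'visit']
print(bases_of(ClinicalSketch.groupedvisit))  # ['group', 'visit']
print(bases_of(ClinicalSketch.constant))      # []
```

Encoding frequencies as bit flags makes it cheap to test whether one frequency spans another, which is the kind of question that comes up when deciding which frame a derived metric belongs to.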

Branch hierarchy

When defining a frameset on a data tree, the "hierarchy" in which the categorical variables appear in the branches of the tree (e.g. groups > subjects > sessions) needs to be specified. Consider the following example dataset, in which imaging sessions are sorted by subject, then by longitudinal visit

my-dataset
├── subject1
│   ├── visit1
│   │   ├── t1w_mprage
│   │   ├── t2w_space
│   │   └── bold_rest
│   └── visit2
│       ├── t1w_mprage
│       ├── t2w_space
│       └── bold_rest
├── subject2
│   ├── visit1
│   │   ├── t1w_mprage
│   │   ├── t2w_space
│   │   └── bold_rest
│   └── visit2
│       ├── t1w_mprage
│       ├── t2w_space
│       └── bold_rest
└── subject3
    ├── visit1
    │   ├── t1w_mprage
    │   ├── t2w_space
    │   └── bold_rest
    └── visit2
        ├── t1w_mprage
        ├── t2w_space
        └── bold_rest

The leaves of the tree contain data from specific "imaging session" data points, as designated by the combination of one of the three subject IDs and one of the two visit IDs. Data items at the session level of the hierarchy will be mapped onto a data frame, where each session data point corresponds to a row and the columns correspond to different acquisition methods or derived metrics (e.g. T1-weighted MRI scan, subject's YOB, presence of genetic marker, atmospheric pressure, rainfall, annual rainfall, altitude, etc.).

While the majority of data items are stored in the leaves of the tree, data can exist for any branch. For example, an analysis may use genomics data, which is constant for each subject and therefore sits at the subject level of the tree, in special SUBJECT branches

my-dataset
├── subject1
│   ├── SUBJECT
│   │   └── genomics.dat
│   ├── visit1
│   │   ├── t1w_mprage
│   │   ├── t2w_space
│   │   └── bold_rest
│   └── visit2
│       ├── t1w_mprage
│       ├── t2w_space
│       └── bold_rest
├── subject2
│   ├── SUBJECT
│   │   └── genomics.dat
│   ├── visit1
│   │   ├── t1w_mprage
│   │   ├── t2w_space
│   │   └── bold_rest
│   └── visit2
│       ├── t1w_mprage
│       ├── t2w_space
│       └── bold_rest
└── subject3
    ├── SUBJECT
    │   └── genomics.dat
    ├── visit1
    │   ├── t1w_mprage
    │   ├── t2w_space
    │   └── bold_rest
    └── visit2
        ├── t1w_mprage
        ├── t2w_space
        └── bold_rest

In this case, the genomics data is in the "per-subject" data frame, in which each row corresponds to a subject instead of a session.
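The split between the per-session and per-subject frames can be sketched with a nested dictionary standing in for the directory tree. The `tree` literal and the `to_frames` helper are hypothetical and only mirror the example layout above. They do not use or reproduce FrameTree's own loading code.

```python
SCANS = ["t1w_mprage", "t2w_space", "bold_rest"]

# Nested dict standing in for the example tree: subject > visit > data items
tree = {
    f"subject{i}": {
        "SUBJECT": ["genomics"],
        "visit1": list(SCANS),
        "visit2": list(SCANS),
    }
    for i in (1, 2, 3)
}


def to_frames(tree):
    """Map branches onto frame rows: {row id: set of column names}."""
    session_frame = {}
    subject_frame = {}
    for subject_id, branches in tree.items():
        for branch, items in branches.items():
            if branch == "SUBJECT":
                # Items in the special branch are constant for the subject
                subject_frame[subject_id] = set(items)
            else:
                # Every other branch is a visit, i.e. one session row
                session_frame[(subject_id, branch)] = set(items)
    return session_frame, subject_frame


session_frame, subject_frame = to_frames(tree)
print(len(session_frame), len(subject_frame))  # 6 3
```

The six session rows each carry the three scan columns, while the three subject rows carry only the genomics column.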

Depending on the hierarchy of the data tree, data belonging to the base frequencies may not have a corresponding branch to be stored in. In these cases, new branches are created off the root of the tree to hold the derivatives. For example, average trial performance calculated at each visit, and the age difference between matched-control pairs, would need to be stored in new sub-branches for visits and members, respectively.

my-dataset
├── VISIT
│   ├── visit1
│   │   └── avg_trial_performance
│   └── visit2
│       └── avg_trial_performance
├── MEMBER
│   ├── member1
│   │   └── age_diff
│   └── member2
│       └── age_diff
├── group1
│   ├── member1
│   │   ├── visit1
│   │   │   ├── t1w_mprage
│   │   │   ├── t2w_space
│   │   │   └── bold_rest
│   │   └── visit2
│   │       ├── t1w_mprage
│   │       ├── t2w_space
│   │       └── bold_rest
│   └── member2
│       ├── visit1
│       │   ├── t1w_mprage
│       │   ├── t2w_space
│       │   └── bold_rest
│       └── visit2
│           ├── t1w_mprage
│           ├── t2w_space
│           └── bold_rest
└── group2
    ├── member1
    │   ├── visit1
    │   │   ├── t1w_mprage
    │   │   ├── t2w_space
    │   │   └── bold_rest
    │   └── visit2
    │       ├── t1w_mprage
    │       ├── t2w_space
    │       └── bold_rest
    └── member2
        ├── visit1
        │   ├── t1w_mprage
        │   ├── t2w_space
        │   └── bold_rest
        └── visit2
            ├── t1w_mprage
            ├── t2w_space
            └── bold_rest

If they are not present in the data tree, other row frequencies (i.e. combinations of axes) are stored in new branches under the dataset root, in the same manner as the base axes

my-dataset
├── BATCH
│   ├── group1_visit1
│   │   └── avg_connectivity
│   ├── group1_visit2
│   │   └── avg_connectivity
│   ├── group2_visit1
│   │   └── avg_connectivity
│   └── group2_visit2
│       └── avg_connectivity
├── MATCHEDPOINT
│   ├── member1_visit1
│   │   └── comparative_trial_performance
│   ├── member1_visit2
│   │   └── comparative_trial_performance
│   ├── member2_visit1
│   │   └── comparative_trial_performance
│   └── member2_visit2
│       └── comparative_trial_performance
├── group1
│   ├── member1
│   │   ├── visit1
│   │   │   ├── t1w_mprage
│   │   │   ├── t2w_space
│   │   │   └── bold_rest
│   │   └── visit2
│   │       ├── t1w_mprage
│   │       ├── t2w_space
│   │       └── bold_rest
│   └── member2
│       ├── visit1
│       │   ├── t1w_mprage
│       │   ├── t2w_space
│       │   └── bold_rest
│       └── visit2
│           ├── t1w_mprage
│           ├── t2w_space
│           └── bold_rest
└── group2
    ├── member1
    │   ├── visit1
    │   │   ├── t1w_mprage
    │   │   ├── t2w_space
    │   │   └── bold_rest
    │   └── visit2
    │       ├── t1w_mprage
    │       ├── t2w_space
    │       └── bold_rest
    └── member2
        ├── visit1
        │   ├── t1w_mprage
        │   ├── t2w_space
        │   └── bold_rest
        └── visit2
            ├── t1w_mprage
            ├── t2w_space
            └── bold_rest
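The rule illustrated by the trees above can be approximated as follows: a row frequency whose axes coincide with the first layers of the hierarchy is stored in the existing branches, and any other frequency is stored in a new upper-case branch off the dataset root. The `branch_path` function below is a hypothetical sketch of that rule. It assumes one axis per hierarchy layer and does not claim to reproduce FrameTree's actual layout logic.

```python
def branch_path(hierarchy, name, ids):
    """Return the branch (as a list of path parts) holding a row's data.

    hierarchy -- axes in the order they appear in the tree layers
    name      -- name of the row frequency, e.g. 'visit' or 'batch'
    ids       -- mapping from each axis in the frequency to the row's ID
    """
    axes = set(ids)
    for depth in range(len(hierarchy) + 1):
        if set(hierarchy[:depth]) == axes:
            # The frequency matches an existing level of the tree
            return [ids[layer] for layer in hierarchy[:depth]]
    # Otherwise store the row in a dedicated branch off the dataset root
    ordered = [ids[layer] for layer in hierarchy if layer in ids]
    return [name.upper(), "_".join(ordered)]


hierarchy = ["group", "member", "visit"]
print(branch_path(hierarchy, "visit", {"visit": "visit2"}))
# ['VISIT', 'visit2']
print(branch_path(hierarchy, "batch", {"group": "group1", "visit": "visit2"}))
# ['BATCH', 'group1_visit2']
```

For a full session, the same function returns the existing path through the tree, e.g. `['group1', 'member2', 'visit1']`.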

For stores that support datasets with arbitrary tree structures (i.e. FileSystem), the "axes" (Axes) and the hierarchy of layers in the data tree need to be provided when defining the frameset.

$ frametree define '/data/imaging/my-project' group session --axes common/clinical

Alternatively via the Python API:

from frametree.common import Clinical, FileSystem

fs_frameset = FileSystem().define_frameset(
    id='/data/imaging/my-project',
    # Define the hierarchy of the dataset in which imaging session
    # sub-directories are separated into directories via their study group
    # (i.e. test & control)
    axes=Clinical,
    hierarchy=['group', 'session'])

For datasets where the fundamental hierarchy of the storage system is fixed (e.g. XNAT) you don't need to provide the axes or hierarchy. However, you may need to specify how to infer the values of an axis by decomposing the label of a branch according to a naming convention, e.g. "CONTROL01" -> group="CONTROL" and member="01". This inference is specified via a regular expression (Python syntax) passed to the id-inference argument of the frameset definition. For example, consider an XNAT project with the following structure and a naming convention where the subject ID is composed of the group and member ID, <GROUPID><MEMBERID>, and the session ID is composed of the subject ID and visit, <SUBJECTID>_MR<VISITID>

MY_XNAT_PROJECT
├── TEST01
│   └── TEST01_MR01
│       ├── t1w_mprage
│       └── t2w_space
├── TEST02
│   └── TEST02_MR01
│       ├── t1w_mprage
│       └── t2w_space
├── CONT01
│   └── CONT01_MR01
│       ├── t1w_mprage
│       └── t2w_space
└── CONT02
    └── CONT02_MR01
        ├── t1w_mprage
        └── t2w_space

IDs for group, member and visit can be inferred from the subject and session IDs, by providing the frequency of the ID to decompose and a regular-expression (in Python syntax) to decompose it with. The regular expression should contain named groups that correspond to row frequencies of the IDs to be inferred, e.g.

$ frametree define 'xnat-central//MYXNATPROJECT' \
  --id-inference group 'subject:([A-Z]+)\d+' \
  --id-inference member 'subject:[A-Z]+(\d+)' \
  --id-inference visit 'session:[A-Z0-9]+_MR(\d+)'
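The decomposition itself can be checked with Python's `re` module before it is committed to a frameset definition. The sketch below applies named-group patterns to the subject and session IDs of the example project. `infer_ids` and the two pattern constants are hypothetical names, and the code only demonstrates the regular expressions, not how FrameTree applies them internally.

```python
import re

# <GROUPID><MEMBERID>, e.g. "CONT02"
SUBJECT_PATTERN = re.compile(r"(?P<group>[A-Z]+)(?P<member>\d+)")
# <SUBJECTID>_MR<VISITID>, e.g. "CONT02_MR01"
SESSION_PATTERN = re.compile(r"[A-Z0-9]+_MR(?P<visit>\d+)")


def infer_ids(subject_id, session_id):
    """Decompose subject and session labels into group, member and visit."""
    subject_match = SUBJECT_PATTERN.fullmatch(subject_id)
    session_match = SESSION_PATTERN.fullmatch(session_id)
    if subject_match is None or session_match is None:
        raise ValueError(
            f"IDs {subject_id!r}/{session_id!r} do not follow the convention"
        )
    return {**subject_match.groupdict(), **session_match.groupdict()}


print(infer_ids("CONT02", "CONT02_MR01"))
# {'group': 'CONT', 'member': '02', 'visit': '01'}
```

Using `fullmatch` rather than `search` means labels that only partly follow the convention are rejected instead of being silently mis-parsed.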

Subsets

By default all data points within the dataset are included in the frameset. However, often there are data points that need to be removed from a given analysis due to missing or corrupted data. Such data points need to be removed in a way that the remaining data points still lie on a rectangular grid within the data axes (see Axes), so that derivatives computed over a given axis or axes are drawn from a comparable number of data points.
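The "rectangular grid" requirement can be stated precisely: the remaining data points must equal the full Cartesian product of the IDs that remain along each axis. A minimal check, using a hypothetical `is_rectangular` helper that is not part of FrameTree, might look like this:

```python
from itertools import product


def is_rectangular(points):
    """True if `points` (tuples of IDs, one per axis) fill a complete grid."""
    points = set(points)
    if not points:
        return True
    n_axes = len(next(iter(points)))
    # Collect the IDs that occur along each axis
    axes = [sorted({point[i] for point in points}) for i in range(n_axes)]
    return points == set(product(*axes))


members = ["01", "02", "03"]
visits = ["1", "2"]
full = set(product(members, visits))

# Dropping a single session leaves a hole in the grid
print(is_rectangular(full - {("03", "2")}))  # False
# Dropping every session of member 03 keeps the grid rectangular
print(is_rectangular({p for p in full if p[0] != "03"}))  # True
```

This is why exclusions are expressed per axis (e.g. whole members or whole visits) rather than per individual data point.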

The --exclude option is used to specify the data points to exclude from a dataset.

$ frametree define '/data/imaging/my-project@manually_qcd' \
  subject session \
  --axes common/clinical \
  --exclude member 03,11,27

The --include option is the inverse of --exclude and can be more convenient when you only want to select a small sample or split the dataset into sections. --include can be used in conjunction with --exclude but not for the same frequencies.

$ frametree define '/data/imaging/my-project@manually_qcd' \
  subject session \
  --axes common/clinical \
  --exclude member 03,11,27 \
  --include visit 1,2
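The combined effect of the two options can be sketched as a filter over data points. The `select` function below is a hypothetical illustration of the semantics described above, including the restriction that an axis cannot be both included and excluded. It is not FrameTree's implementation.

```python
def select(points, include=None, exclude=None):
    """Filter data points (dicts of axis -> ID) by per-axis ID sets."""
    include = include or {}
    exclude = exclude or {}
    overlap = set(include) & set(exclude)
    if overlap:
        raise ValueError(f"Cannot both include and exclude: {sorted(overlap)}")
    kept = []
    for point in points:
        if any(point[axis] not in ids for axis, ids in include.items()):
            continue  # not in one of the included ID sets
        if any(point[axis] in ids for axis, ids in exclude.items()):
            continue  # explicitly excluded
        kept.append(point)
    return kept


points = [
    {"member": member, "visit": visit}
    for member in ("01", "02", "03", "04")
    for visit in ("1", "2", "3")
]
subset = select(points, include={"visit": {"1", "2"}}, exclude={"member": {"03"}})
print(len(subset))  # 6
```

Because both criteria act on whole axes, the selected points (three members by two visits) still form a rectangular grid.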

You can also pass a range of IDs, <start>:<finish>, like you would in Python slicing (i.e. the finish value is not included). This can be used to partition a dataset into separate framesets for machine learning training and testing, e.g. to partition a dataset with 100 members/subjects into subjects 1-80 for training and subjects 81-100 for testing you would use

$ # Partition the dataset into training and test framesets
$ frametree define '/data/imaging/my-project@training' \
  group subject \
  --axes common/clinical \
  --include member 1:81
$ frametree define '/data/imaging/my-project@test' \
  group subject \
  --axes common/clinical \
  --include member 81:101

Alternatively, via Python API:

from frametree.xnat import Xnat

# Load existing store spec
xnat_store = Xnat.load('xnat-central')

# Partition dataset into training and test
training = xnat_store.define_frameset(
    id='MYXNATPROJECT', include={'member': range(1, 81)}
)
test = xnat_store.define_frameset(
    id='MYXNATPROJECT', include={'member': range(81, 101)}
)

# Save to the dataset
training.save("training")
test.save("test")
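Since the ranges follow Python slicing conventions, with the finish value excluded, it is easy to introduce an off-by-one overlap or gap between partitions. The sketch below converts `<start>:<finish>` strings into `range` objects and checks that the training and test partitions are disjoint and together cover all 100 members. `parse_id_range` is a hypothetical helper, not a FrameTree function.

```python
def parse_id_range(spec):
    """Convert a '<start>:<finish>' string into a half-open range."""
    start, finish = spec.split(":")
    return range(int(start), int(finish))


training_ids = parse_id_range("1:81")   # members 1-80
test_ids = parse_id_range("81:101")     # members 81-100

# The partitions should not share members and should cover members 1-100
assert set(training_ids).isdisjoint(test_ids)
assert set(training_ids) | set(test_ids) == set(range(1, 101))
print(len(training_ids), len(test_ids))  # 80 20
```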