File handling

In addition to validating format types, FileFormats is able to detect related files, hash them and move them around the file-system as a single object.

Paths

Once a file object is initialised, you can access the properties of the format class, which for single-file formats is typically just the file-system path, fspath.

>>> from fileformats.image import Jpeg
>>> jpeg_file = Jpeg("/path/to/image.jpg")
>>> jpeg_file.fspath
"/path/to/image.jpg"
>>> str(jpeg_file)  # returns the file path
"/path/to/image.jpg"

However, file formats that consist of multiple files (common in scientific data) will typically define separate required properties for each file. For example, the Analyze neuroimaging format stores the image in a file with the extension ".img" and the metadata in a separate header file with the extension ".hdr".

>>> from fileformats.medimage import Analyze
>>> analyze_file = Analyze(["/path/to/neuroimage.hdr", "/path/to/neuroimage.img"])
>>> analyze_file.fspath
"/path/to/neuroimage.img"
>>> analyze_file.header
"/path/to/neuroimage.hdr"
>>> str(analyze_file)  # returns the path to the primary file
"/path/to/neuroimage.img"

To access all file-system paths in a format object, use the fspaths attribute defined on fileformats.core.base.FileSet, the base class of all file formats.

>>> analyze_file.fspaths
{"/path/to/neuroimage.hdr", "/path/to/neuroimage.img"}

In the case of file formats with "adjacent" files that share the same file-name stem, i.e. the same file path and name minus the file extension (such as Analyze), you only need to provide the primary path, and the header will be automatically detected and added to the file-set.

>>> from fileformats.medimage import Analyze
>>> analyze_file = Analyze("/path/to/neuroimage.img")
>>> repr(analyze_file)
Analyze("/path/to/neuroimage.hdr", "/path/to/neuroimage.img")

This is very useful when reading the output path of a workflow where only the primary path is returned and the associated files also need to be saved to an output directory.
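The idea of adjacent-file detection can be illustrated with the standard library alone. The following is a minimal sketch, not FileFormats' own implementation; find_adjacent is a hypothetical helper that looks for a file with the same stem as the primary path but a different extension.

```python
import tempfile
from pathlib import Path


def find_adjacent(primary, ext):
    """Return the path sharing `primary`'s stem with extension `ext`, or None.

    Hypothetical helper for illustration; it only swaps the last suffix.
    """
    candidate = Path(primary).with_suffix(ext)
    return candidate if candidate.exists() else None


with tempfile.TemporaryDirectory() as tmp:
    img = Path(tmp) / "neuroimage.img"
    hdr = Path(tmp) / "neuroimage.hdr"
    img.touch()
    hdr.touch()

    found = find_adjacent(img, ".hdr")      # the header sits next to the image
    found_name = found.name
    missing = find_adjacent(img, ".json")   # no such side-car exists
```

Here found_name is "neuroimage.hdr" and missing is None, since only the header file was created next to the image.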

When a format object is passed to a workflow/task input, the transformation of the format object to a path-like string is handled implicitly through the implementation of the __str__ and __fspath__ magic methods. This means that format objects can be used in place of the path objects themselves, e.g.

import subprocess
from fileformats.text import Plain
text_file = Plain("/path/to/text-file.txt")

with open(text_file) as f:
    contents = f.read()

subprocess.run(f"cp {text_file} /path/to/destination", shell=True)

Note that only the "primary" path, as returned by the fspath property, is rendered.
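The mechanism that makes this work is Python's standard os.PathLike protocol. The following is a minimal sketch of that protocol, not the FileFormats source; the PathBacked class is hypothetical and stands in for a format object with a single primary path.

```python
import os
import tempfile


class PathBacked:
    """Hypothetical stand-in for a format object with one primary path."""

    def __init__(self, fspath):
        self.fspath = str(fspath)

    def __fspath__(self):
        # Called by os.fspath(), open(), pathlib.Path() and similar APIs
        return self.fspath

    def __str__(self):
        # Used by str() and f-strings
        return self.fspath


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "text-file.txt")
    with open(path, "w") as f:
        f.write("hello")

    text_file = PathBacked(path)
    with open(text_file) as f:  # open() accepts any os.PathLike object
        contents = f.read()
    command = f"cp {text_file} /path/to/destination"  # rendered via __str__
```

Any API that calls os.fspath() on its argument will accept such an object, which is why a format object can usually be handed straight to code that expects a path.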

Copy

To copy all files/directories in a format you can use the FileSet.copy() method

>>> repr(analyze_file.copy(dest_dir="/path/to/destination"))
Analyze("/path/to/destination/neuroimage.hdr", "/path/to/destination/neuroimage.img")

By default, the source filenames will be used in the destination directory. To specify a new file stem, pass the new_stem argument

>>> repr(analyze_file.copy(dest_dir="/path/to/destination", new_stem="t1w"))
Analyze("/path/to/destination/t1w.hdr", "/path/to/destination/t1w.img")

For formats that define a file extension, this will be used to determine which part of the filename is considered stem, and which is extension. This is useful when dealing with double-barrel extensions such as ".nii.gz"

>>> from fileformats.medimage import NiftiGz
>>> niftigz = NiftiGz(["/path/to/image.nii.gz"])
>>> repr(niftigz.copy(dest_dir="/path/to/destination", new_stem="t1w"))
NiftiGz("/path/to/destination/t1w.nii.gz")

However, if you are working with generic base classes such as FileSet, FsObject and File, what is extension and what is stem is not defined and needs to be specified by a FileSet.ExtensionDecomposition enum passed to the extension_decomposition argument

>>> from fileformats.core import FileSet
>>> from fileformats.generic import File
>>> a_file = File(["/path/to/image.nii.gz"])
>>> repr(a_file.copy(
...     dest_dir="/path/to/destination",
...     new_stem="t1w",
...     extension_decomposition=FileSet.ExtensionDecomposition.single)
... )
File("/path/to/destination/t1w.gz")
>>> repr(a_file.copy(
...     dest_dir="/path/to/destination",
...     new_stem="t1w",
...     extension_decomposition="multiple")
... )
File("/path/to/destination/t1w.nii.gz")
>>> repr(a_file.copy(
...     dest_dir="/path/to/destination",
...     new_stem="t1w",
...     extension_decomposition=FileSet.ExtensionDecomposition.none)
... )
File("/path/to/destination/t1w")

Warning

If extension_decomposition == "multiple" and there are '.' characters in the filename, they will be treated as if they are part of the extension even if they aren't intended to be.
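The three decompositions can be pictured with plain string handling. This is a minimal sketch under the assumption that "single" splits at the last '.', "multiple" splits at the first '.', and "none" treats the whole name as the stem; split_name and rename are hypothetical helpers, not part of FileFormats.

```python
def split_name(filename, decomposition):
    """Split a filename into (stem, extension) for a given decomposition.

    Hypothetical helper; `decomposition` is "single", "multiple" or "none".
    """
    if decomposition not in ("single", "multiple", "none"):
        raise ValueError(f"unknown decomposition: {decomposition!r}")
    if decomposition == "none" or "." not in filename:
        return filename, ""
    if decomposition == "single":
        stem, _, ext = filename.rpartition(".")  # split at the last '.'
    else:
        stem, _, ext = filename.partition(".")   # split at the first '.'
    return stem, "." + ext


def rename(filename, new_stem, decomposition):
    """Replace the stem of `filename`, keeping whatever counts as extension."""
    _, ext = split_name(filename, decomposition)
    return new_stem + ext


single = rename("image.nii.gz", "t1w", "single")      # "t1w.gz"
multiple = rename("image.nii.gz", "t1w", "multiple")  # "t1w.nii.gz"
none = rename("image.nii.gz", "t1w", "none")          # "t1w"
pitfall = split_name("sub.01.nii.gz", "multiple")     # ("sub", ".01.nii.gz")
```

The last call shows why the warning matters: in this sketch, "multiple" takes everything after the first '.' as the extension, so a '.' inside the intended stem is swallowed by it.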

Additional files within a fileset that aren't required for the format can be trimmed using the trim argument

>>> niftigz = NiftiGz(["/path/to/t1w.nii.gz", "/path/to/t1w.json"])
>>> repr(niftigz)
NiftiGz("/path/to/t1w.nii.gz", "/path/to/t1w.json")
>>> trimmed_niftigz = niftigz.copy("/new/destination", trim=True)
>>> repr(trimmed_niftigz)
NiftiGz("/new/destination/t1w.nii.gz")

The other (self-explanatory) arguments that can be provided to copy are make_dirs and overwrite.

Copy-mode

The copy method also supports creating links (both soft and hard) instead of copying the file by passing a value from the FileSet.CopyMode enum to the mode argument.

>>> from fileformats.core import FileSet
>>> new_analyze = analyze_file.copy(
...    dest_dir="/path/to/destination", mode=FileSet.CopyMode.hardlink
... )
>>> new_analyze.fspaths
{"/path/to/destination/neuroimage.hdr", "/path/to/destination/neuroimage.img"}

For some applications you might prefer to create a link instead of duplicating the original files. Depending on the mounts/drives that the source files and destination directory sit on, this might not be possible, either because the file-system does not support links or because the source and destination are on different physical drives (which rules out hardlinks). To handle these cases, the mode flag can be set to a combination of link and copy modes,

new_analyze = analyze_file.copy(dest_dir="/path/to/destination", mode="link_or_copy")

in which case the copy method will attempt to create a symlink, then, if that fails, a hardlink, and failing that, fall back to a copy. The supported modes can also be specified manually by passing a FileSet.CopyMode flag to the supported_modes argument, which will be used to mask the requested mode. Note that automatically detected unsupported modes will be masked out of supported_modes before it is applied.

new_analyze = analyze_file.copy(
    dest_dir="/path/to/destination",
    mode=user_requested,
    supported_modes=FileSet.CopyMode.hardlink_or_copy
)
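The fallback order described above (symlink, then hardlink, then copy) can be sketched with the standard library. This is an illustration of the general pattern only, not the FileFormats implementation; link_or_copy is a hypothetical helper, and it does not model the supported_modes mask.

```python
import os
import shutil
import tempfile


def link_or_copy(src, dest):
    """Try a symlink, then a hardlink, then a full copy.

    Hypothetical helper; returns the name of the mode that succeeded.
    """
    if os.path.lexists(dest):
        raise FileExistsError(dest)
    for name, make_link in (("symlink", os.symlink), ("hardlink", os.link)):
        try:
            make_link(src, dest)
        except (OSError, NotImplementedError):
            # e.g. links unsupported by the file-system, missing privileges,
            # or a hardlink across different devices
            continue
        return name
    shutil.copy2(src, dest)
    return "copy"


with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "neuroimage.img")
    with open(src, "wb") as f:
        f.write(b"data")

    dest = os.path.join(tmp, "linked.img")
    mode_used = link_or_copy(src, dest)
    with open(dest, "rb") as f:
        dest_contents = f.read()
```

Which mode is reported in mode_used depends on the platform and file-system, but dest_contents matches the source in every case.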

Copy-collation

There is no requirement that the files of a multi-file format (e.g. one with a separate header) are "adjacent" to each other, i.e. in the same directory with the same file stem.

>>> from fileformats.medimage import NiftiX
>>> niftix = NiftiX(["/a/path/to/a/t1w.nii", "/an/unrelated/path/t1-weighted.json"])

However, some commands expect side-car and header files to be "adjacent" to the primary file, i.e. in the same directory as the primary with the same file stem. To support this use case, FileSet.copy() can be passed a collation argument, which takes a FileSet.Collation enum value.

>>> new_niftix = niftix.copy(
...    dest_dir="/path/to/destination", collation=FileSet.Collation.adjacent
... )
>>> new_niftix.fspaths
{"/path/to/destination/t1w.nii", "/path/to/destination/t1w.json"}

To control what the files are collated as, the new_stem argument can be passed to the copy() method.

>>> new_niftix = niftix.copy(
...    dest_dir="/path/to/destination", new_stem="t1-weighted"
... )
>>> new_niftix.fspaths
{"/path/to/destination/t1-weighted.nii", "/path/to/destination/t1-weighted.json"}
>>> new_analyze = analyze_file.copy(dest_dir="/path/to/destination")
>>> new_analyze.fspaths
{"/path/to/destination/neuroimage.hdr", "/path/to/destination/neuroimage.img"}

If the files just need to be in the same directory, but not necessarily adjacent, the collation argument can be set to FileSet.Collation.siblings

>>> new_niftix = niftix.copy(dest_dir="/path/to/destination", collation="siblings")
>>> new_niftix.fspaths
{"/path/to/destination/t1w.nii", "/path/to/destination/t1-weighted.json"}
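The difference between the two collation settings comes down to how destination file names are derived. The sketch below only computes names and touches no files; collate is a hypothetical helper, it treats the first path as the primary, and it uses only the last suffix as the extension, which is a simplification.

```python
from pathlib import PurePosixPath


def collate(fspaths, dest_dir, collation, new_stem=None):
    """Compute destination paths for a primary file and its companions.

    Hypothetical helper: "siblings" keeps each file name, while "adjacent"
    gives every file the same stem (the primary's, unless new_stem is set).
    """
    paths = [PurePosixPath(p) for p in fspaths]
    dest = PurePosixPath(dest_dir)
    if collation == "siblings":
        return [str(dest / p.name) for p in paths]
    if collation == "adjacent":
        stem = new_stem if new_stem is not None else paths[0].stem
        return [str(dest / (stem + p.suffix)) for p in paths]
    raise ValueError(f"unknown collation: {collation!r}")


sources = ["/a/path/to/a/t1w.nii", "/an/unrelated/path/t1-weighted.json"]
adjacent = collate(sources, "/path/to/destination", "adjacent")
renamed = collate(sources, "/path/to/destination", "adjacent", new_stem="t1-weighted")
siblings = collate(sources, "/path/to/destination", "siblings")
```

With these inputs, adjacent ends in t1w.nii and t1w.json, renamed in t1-weighted.nii and t1-weighted.json, and siblings keeps the original names t1w.nii and t1-weighted.json.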

The collation setting is also used to decide whether files need to be copied or linked to a new location. For example, if the files are already adjacent, they can simply be left where they are by setting mode to the FileSet.CopyMode.any flag, which encompasses the FileSet.CopyMode.leave mode.

>>> new_niftix = niftix.copy(
...    dest_dir="/path/to/destination",
...    collation=FileSet.Collation.adjacent,
...    mode=FileSet.CopyMode.any
... )

The behaviour of this call is a little complex and is determined by the file paths in the niftix FileSet and the locations of the source and destination directories. If the file paths are already adjacent in the source directory, they will be left where they are. If they are not adjacent, they will be symlinked into the destination directory. If the mount/drive that directory is on doesn't support symlinks, they will be hardlinked instead, and if the destination directory is also on a different physical drive, the copy method will fall back to a full copy.

Moving

The FileSet.move() method can be used to move files to a new location. It has the same signature as FileSet.copy(), with the exception of the mode and supported_modes arguments, which are not relevant when moving files.

>>> new_analyze = analyze_file.move(
...    dest_dir="/path/to/destination", new_stem="t1-weighted"
... )
>>> new_analyze.fspaths
{"/path/to/destination/t1-weighted.hdr", "/path/to/destination/t1-weighted.img"}

Hashing

When working with files, particularly in workflows, it is often useful to be able to hash the contents of the files in the set to check for changes or successful transfers.

There are two methods for doing this conveniently in FileFormats:

  1. The FileSet.hash() method will hash the contents of all files in the set and return a hash value.

  2. The FileSet.hash_files() method will hash the contents of all files in the set and return a dictionary of hashes keyed by the file path.
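The two flavours of hashing (one digest per file versus one digest for the whole set) can be sketched with hashlib. This is an illustration under stated assumptions, not the FileFormats implementation: the function names below are hypothetical stand-alone helpers, and the choice of SHA-256, chunked reads and combining digests in file-name order are all assumptions made for the example.

```python
import hashlib
import tempfile
from pathlib import Path


def hash_files(fspaths, algorithm="sha256", chunk_len=8192):
    """Return a dict mapping each path to the hex digest of its contents."""
    digests = {}
    for path in sorted(fspaths):
        hasher = hashlib.new(algorithm)
        with open(path, "rb") as f:
            # Read in chunks so large files are not loaded into memory at once
            for chunk in iter(lambda: f.read(chunk_len), b""):
                hasher.update(chunk)
        digests[str(path)] = hasher.hexdigest()
    return digests


def hash_set(fspaths, algorithm="sha256"):
    """Combine the per-file digests, in file-name order, into one digest."""
    per_file = hash_files(fspaths, algorithm)
    hasher = hashlib.new(algorithm)
    for path in sorted(per_file, key=lambda p: Path(p).name):
        hasher.update(per_file[path].encode())
    return hasher.hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    src_dir = Path(tmp) / "src"
    dst_dir = Path(tmp) / "dst"
    src_dir.mkdir()
    dst_dir.mkdir()
    for name, data in [("t1w.hdr", b"header"), ("t1w.img", b"image")]:
        (src_dir / name).write_bytes(data)
        (dst_dir / name).write_bytes(data)  # simulate a successful transfer

    per_file = hash_files(src_dir.iterdir())
    transfer_ok = hash_set(src_dir.iterdir()) == hash_set(dst_dir.iterdir())
```

Because the combined digest in this sketch depends only on file contents and their order by name, transfer_ok is True when the copied files are byte-for-byte identical to the originals.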