Data Workflows#

FileFormats is primarily designed for the typing of data workflows to ensure that data is transferred between workflow nodes in compatible formats. See Pydra and Fastr for examples of compatible workflow engines.

Base features#

Base features are available for all types within FileFormats and its extension packages using the base install. They are designed to be broadly available for a large range of types and very light-weight in terms of external dependencies.

Validation#

In the basic case, FileFormats can be used for checking the format of files and directories against known types. Typically, there are two layers of checks. The first layer is performed on the file-system paths alone, e.g. by checking the file extension:

from fileformats.image import Jpeg

jpeg_file = Jpeg("/path/to/image.jpg")  # PASSES
jpeg_file = Jpeg("/path/to/image.png")  # FAILS!

The second layer of checks typically requires reading the file and peeking at its contents for magic numbers and the like:

fspath = "/path/to/fake-image.jpg"

with open(fspath, "w") as f:
    f.write("this is not a valid JPEG file")

jpeg_file = Jpeg(fspath)  # FAILS!
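
The two layers can be pictured with a standard-library-only sketch. This is not how FileFormats implements its checks; the helper names (check_path, check_contents, looks_like_jpeg) and the accepted extension set are assumptions made for illustration. The only external fact relied on is that JPEG files begin with the bytes FF D8 FF.

```python
import tempfile
from pathlib import Path

JPEG_EXTENSIONS = {".jpg", ".jpeg"}  # assumed set of accepted extensions
JPEG_MAGIC = b"\xff\xd8\xff"  # JPEG files start with these bytes


def check_path(fspath: Path) -> bool:
    """Layer 1 (hypothetical helper): inspect the file-system path only."""
    return fspath.suffix.lower() in JPEG_EXTENSIONS


def check_contents(fspath: Path) -> bool:
    """Layer 2 (hypothetical helper): compare the leading bytes with the magic number."""
    with open(fspath, "rb") as f:
        return f.read(len(JPEG_MAGIC)) == JPEG_MAGIC


def looks_like_jpeg(fspath: Path) -> bool:
    # The cheap path check runs first; the file is only opened if it passes
    return check_path(fspath) and check_contents(fspath)


with tempfile.TemporaryDirectory() as tmp:
    # Right extension, wrong contents: passes layer 1, fails layer 2
    fake = Path(tmp) / "fake-image.jpg"
    fake.write_text("this is not a valid JPEG file")
    # Right extension and a JPEG-like leading byte sequence
    real = Path(tmp) / "image.jpg"
    real.write_bytes(JPEG_MAGIC + b"\xe0" + b"\x00" * 16)
    fake_result = looks_like_jpeg(fake)
    real_result = looks_like_jpeg(real)
```

Ordering the checks this way means files with the wrong extension are rejected without any file I/O.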

Directories are classified by the contents of the files within them, via the content_types class attribute, e.g.

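from fileformats.core.mixin import WithMagicNumber
from fileformats.generic import File, Directory

class Dicom(WithMagicNumber, File):
    # The "DICM" marker follows a 128-byte preamble
    magic_number = b"DICM"
    magic_number_offset = 128

class DicomDir(Directory):
    # A directory matches if it contains at least one Dicom file
    content_types = (Dicom,)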

Note that only one file within the directory needs to match the specified content type for the directory to be considered a match; additional files are ignored. For example, the DicomDir type would be considered valid on the following directory structure despite the presence of the .DS_Store directory and the catalog.xml file.

dicom-directory
├── .DS_Store
│   ├── deleted-file1.txt
│   ├── deleted-file2.txt
│   └── ...
├── 1.dcm
├── 2.dcm
├── 3.dcm
├── ...
├── 1024.dcm
└── catalog.xml
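
The "at least one match, everything else ignored" rule can be sketched with the standard library alone. The functions is_dicom and is_dicom_dir are hypothetical stand-ins, not FileFormats code; they reuse the magic number and offset from the Dicom class definition.

```python
import tempfile
from pathlib import Path

DICOM_MAGIC = b"DICM"
DICOM_MAGIC_OFFSET = 128


def is_dicom(fspath: Path) -> bool:
    """Hypothetical content check: look for the magic number at the expected offset."""
    if not fspath.is_file():
        return False
    with open(fspath, "rb") as f:
        f.seek(DICOM_MAGIC_OFFSET)
        return f.read(len(DICOM_MAGIC)) == DICOM_MAGIC


def is_dicom_dir(dirpath: Path) -> bool:
    # One matching entry is enough; non-matching entries are simply ignored
    return dirpath.is_dir() and any(is_dicom(p) for p in dirpath.iterdir())


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / ".DS_Store").mkdir()
    (root / "catalog.xml").write_text("<catalog/>")
    without_dicom = is_dicom_dir(root)  # no matching file yet
    (root / "1.dcm").write_bytes(b"\x00" * 128 + b"DICM" + b"\x00" * 8)
    with_dicom = is_dicom_dir(root)  # one match among unrelated entries
```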

In addition to statically defining Directory formats such as the DicomDir example above, dynamic directory types can be created on the fly by providing the content types in square brackets to the DirectoryContaining class, e.g.

from fileformats.generic import DirectoryContaining
from fileformats.image import Png
from fileformats.text import Csv

def my_task(image_dir: DirectoryContaining[Png]) -> Csv:
    ...  # task implementation

Path handling#

Once a file object is instantiated, you can access the "required properties" of the format class, which for single-file formats is typically just the file-system path, fspath.

>>> from fileformats.image import Jpeg
>>> jpeg_file = Jpeg("/path/to/image.jpg")
>>> jpeg_file.fspath
"/path/to/image.jpg"

However, file formats that consist of multiple files (common in scientific data) will typically define separate required properties for each file. For example, the Analyze neuroimaging format stores the image in a file with the extension ".img" and the metadata in a separate header file with the extension ".hdr".

>>> from fileformats.medimage import Analyze
>>> analyze_file = Analyze(["/path/to/neuroimage.hdr", "/path/to/neuroimage.img"])
>>> analyze_file.fspath
"/path/to/neuroimage.img"
>>> analyze_file.header
"/path/to/neuroimage.hdr"

To access all file-system paths in a format object, use the fspaths attribute, which is defined on fileformats.core.base.FileSet, the base class of all file formats:

>>> analyze_file.fspaths
{"/path/to/neuroimage.hdr", "/path/to/neuroimage.img"}

In the case of file formats with "adjacent" files that share the same file-name stem, i.e. the same file path and name minus the file extension (such as Analyze), you only need to provide the primary path; the header will be automatically detected and added to the file-set:

>>> from fileformats.medimage import Analyze
>>> analyze_file = Analyze("/path/to/neuroimage.img")
>>> analyze_file.fspaths
{"/path/to/neuroimage.hdr", "/path/to/neuroimage.img"}
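
The idea behind adjacent-file detection can be illustrated with pathlib. This is a sketch under assumptions rather than the library's own lookup logic; collect_adjacent and its extra_exts parameter are hypothetical names.

```python
import tempfile
from pathlib import Path


def collect_adjacent(primary: Path, extra_exts=(".hdr",)) -> set:
    """Hypothetical helper: gather existing files that share the primary file's stem."""
    found = {primary}
    for ext in extra_exts:
        # Same directory and stem, different extension
        candidate = primary.with_suffix(ext)
        if candidate.exists():
            found.add(candidate)
    return found


with tempfile.TemporaryDirectory() as tmp:
    img = Path(tmp) / "neuroimage.img"
    hdr = Path(tmp) / "neuroimage.hdr"
    img.touch()
    hdr.touch()
    # Only the primary path is supplied; the header is found next to it
    names = sorted(p.name for p in collect_adjacent(img))
```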

This is very useful when reading the output path of a workflow where only the primary path is returned and the associated files also need to be saved to an output directory. To copy all files/directories in a format object, you can use the FileSet.copy() method:

>>> analyze_file_copy = analyze_file.copy(dest_dir="/path/to/destination", stem="new-stem")
>>> analyze_file_copy.fspaths
{"/path/to/destination/new-stem.hdr", "/path/to/destination/new-stem.img"}
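
Conceptually, copying a file-set under a new stem amounts to copying every path and swapping the stem while keeping each extension. The sketch below shows that idea with shutil; copy_with_stem is a hypothetical helper and does not reproduce the behaviour or options of FileSet.copy().

```python
import shutil
import tempfile
from pathlib import Path


def copy_with_stem(fspaths, dest_dir: Path, stem: str) -> set:
    """Hypothetical helper: copy each path into dest_dir under a new stem."""
    dest_dir.mkdir(parents=True, exist_ok=True)
    copied = set()
    for src in fspaths:
        # Path.suffix is only the final extension (".gz" for "x.nii.gz")
        dest = dest_dir / (stem + src.suffix)
        shutil.copyfile(src, dest)
        copied.add(dest)
    return copied


with tempfile.TemporaryDirectory() as tmp:
    src_dir = Path(tmp) / "source"
    src_dir.mkdir()
    originals = {src_dir / "neuroimage.hdr", src_dir / "neuroimage.img"}
    for p in originals:
        p.write_bytes(b"\x00")
    copies = copy_with_stem(originals, Path(tmp) / "destination", "new-stem")
    copied_names = sorted(p.name for p in copies)
    all_exist = all(p.exists() for p in copies)
```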

Going in the other direction from a format class to a workflow/task input, the transformation of the format object to a path-like string is handled implicitly through the implementation of the __str__ and __fspath__ magic methods. This means that format objects can be used in place of the path objects themselves, e.g.

import subprocess
from fileformats.text import Plain
text_file = Plain("/path/to/text-file.txt")

with open(text_file) as f:
    contents = f.read()

subprocess.run(f"cp {text_file} /path/to/destination", shell=True)

Note that only the "primary" path, as returned by the fspath property, is rendered.
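
The mechanism relied on here is Python's os.PathLike protocol: any object that defines __fspath__ is accepted by open() and os.fspath(), and __str__ controls how the object is rendered in f-strings. The class below is a minimal stand-in written for illustration (PathBackedFile is a hypothetical name), not the FileFormats implementation.

```python
import os
import tempfile
from pathlib import Path


class PathBackedFile:
    """Hypothetical stand-in for a format object with a single primary path."""

    def __init__(self, fspath):
        self.fspath = Path(fspath)

    def __fspath__(self) -> str:
        # Used by open(), os.fspath() and other path-accepting functions
        return str(self.fspath)

    def __str__(self) -> str:
        # Used when the object is interpolated into a string
        return str(self.fspath)


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "text-file.txt"
    target.write_text("hello")
    text_file = PathBackedFile(target)
    with open(text_file) as f:  # works because of __fspath__
        contents = f.read()
    rendered = f"cp {text_file} /path/to/destination"  # works because of __str__
    same_path = os.fspath(text_file) == str(target)
```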

Extras#

In addition to the basic features of validation and path handling, it is possible to implement methods to interact with the data of file format objects via "extras hooks". Such features are added to selected format classes as needed (pull requests welcome 😊, see Developer Guide), so they are by no means comprehensive and are very much provided "as-is".

Since these features typically rely on a range of external libraries, their dependencies are kept separate and are only installed if the [extended] install option is used (i.e. python3 -m pip install fileformats[extended]).

Metadata#

If there has been an extras overload registered for the read_metadata method, then metadata associated with the fileset can be accessed via the metadata property, e.g.

>>> dicom.metadata["SeriesDescription"]
"localizer"

Loading/saving data#

Several classes in the base fileformats package implement load and save methods. An advantage of implementing them in the format class is that objects instantiated from them can then be duck-typed in calling functions/methods. For example, the Yaml and Json formats (which both inherit from the DataSerialization type) implement the load method, which returns a dictionary:

from fileformats.application import DataSerialization

def read_json_or_yaml_to_dict(serialized: DataSerialization):
    return serialized.load()
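
The duck-typing pattern can be tried without FileFormats installed. In the sketch below, JsonFile and IniFile are hypothetical stand-in classes; INI is used in place of YAML only because the standard library has no YAML parser. The calling function depends solely on the presence of a load() method.

```python
import configparser
import json
import tempfile
from pathlib import Path


class JsonFile:
    """Hypothetical stand-in for a JSON format class."""

    def __init__(self, fspath):
        self.fspath = Path(fspath)

    def load(self) -> dict:
        with open(self.fspath) as f:
            return json.load(f)


class IniFile:
    """Hypothetical stand-in for a second serialization format."""

    def __init__(self, fspath):
        self.fspath = Path(fspath)

    def load(self) -> dict:
        parser = configparser.ConfigParser()
        parser.read(self.fspath)
        return {name: dict(parser[name]) for name in parser.sections()}


def read_to_dict(serialized) -> dict:
    # Works for any object exposing load(), regardless of its concrete class
    return serialized.load()


with tempfile.TemporaryDirectory() as tmp:
    json_path = Path(tmp) / "data.json"
    json_path.write_text('{"a": 1}')
    ini_path = Path(tmp) / "data.ini"
    ini_path.write_text("[section]\nkey = value\n")
    from_json = read_to_dict(JsonFile(json_path))
    from_ini = read_to_dict(IniFile(ini_path))
```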

Also, when providing the WithSeparateHeader and WithSideCars mixin classes will

Conversion#

Several conversion methods are available between equivalent file formats in the standard classes. For example, archive types such as Zip can be converted to and from generic files/directories using the convert classmethod of the target format:

from fileformats.application import Zip
from fileformats.generic import Directory

zip_file = Zip.convert(Directory("/path/to/a/directory"))
extracted = Directory.convert(zip_file)
copied = extracted.copy(dest_dir="/path/to/output")
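
For readers who want to see what such a directory-to-archive round trip involves at the file level, here is a standard-library analogue using shutil. It is an illustration only and says nothing about how the FileFormats converters are implemented; all paths and file names are made up for the example.

```python
import shutil
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    src = root / "a-directory"
    src.mkdir()
    (src / "data.txt").write_text("payload")

    # Directory -> zip archive; make_archive appends the ".zip" extension
    archive = Path(shutil.make_archive(str(root / "archived"), "zip", root_dir=src))

    # Zip archive -> directory
    extracted = root / "extracted"
    shutil.unpack_archive(str(archive), str(extracted))

    archive_suffix = archive.suffix
    round_tripped = (extracted / "data.txt").read_text()
```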

The converters are implemented in the Pydra dataflow framework, and can be linked into wider Pydra workflows by accessing the underlying converter task with the get_converter classmethod:

import pydra
from pydra.tasks.mypackage import MyTask
from fileformats.image import Gif, Png

wf = pydra.Workflow(name="a_workflow", input_spec=["in_gif"])
wf.add(
    Png.get_converter(Gif, name="gif2png", in_file=wf.lzin.in_gif)
)
wf.add(
    MyTask(
        name="my_task",
        in_file=wf.gif2png.lzout.out_file,
    )
)
...