Developer Guide
FileFormats has been designed so that file formats specified by standard features, such as a file extension and magic number, can be implemented in a few lines, while still being flexible enough to handle any weird and wacky file formats used in obscure domains.
Extension packages
Format classes not covered by IANA Media Types should be implemented in separate FileFormats extension packages. New extension packages can be conveniently created from the FileFormats extension template, https://github.com/ArcanaFramework/fileformats-medimage, which includes CI/CD workflows.
Extension packages add a new unique format namespace under the fileformats namespace package. For example, the FileFormats Medimage Extension implements a range of file formats used in medical imaging research under the fileformats.medimage namespace.
When designing an extension package, try to keep any external dependencies to an absolute minimum (ideally just attrs). If you would like to add extra functionality to your format classes in addition to the core detection/validation, please use an "extra" hook and implement the method in an "extras" package (i.e. fileformats-<yournamespace>-extras); see the extras template for further instructions.
Basic formats
In the simplest case of a file format identified by its extension alone, you only need to inherit from the fileformats.generic.File class and set the ext attribute, e.g.
from fileformats.generic import File


class MyFileFormat(File):
    ext = ".my"
Likewise, if the format you are defining is a directory containing one or more files of given types, you can just inherit from the fileformats.generic.Directory class and set the content_types attribute:
from fileformats.generic import Directory
from fileformats.text import Markdown, Html


class MyDirFormat(Directory):  # inherit from Directory, not File
    content_types = (Markdown, Html)
Standard mixins
If the format is a binary file with a magic number (an identifying byte string at the start of the file), you can use the fileformats.core.mixin.WithMagicNumber mixin. For files with magic numbers you will also need to set the binary attribute to True.
from fileformats.generic import File
from fileformats.core.mixin import WithMagicNumber


class MyBinaryFormat1(WithMagicNumber, File):
    ext = ".myb1"
    binary = True
    magic_number = "98E3F12200AA"  # Unicode strings are interpreted as hex


class MyBinaryFormat2(WithMagicNumber, File):
    ext = ".myb2"
    binary = True
    magic_number = b"MYB2"  # Byte strings are not converted
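The difference between the two ways of writing a magic number can be illustrated without FileFormats installed. The following is a minimal standard-library sketch, not the WithMagicNumber implementation; the helper names normalise_magic and starts_with_magic are hypothetical and exist only for this illustration.

```python
import tempfile
from pathlib import Path


def normalise_magic(magic):
    """Return the magic number as bytes; str values are treated as hex digits."""
    return bytes.fromhex(magic) if isinstance(magic, str) else bytes(magic)


def starts_with_magic(path, magic, offset=0):
    """Check whether the file at `path` contains `magic` at byte `offset`."""
    expected = normalise_magic(magic)
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(len(expected)) == expected


# Write a small throw-away file to try the check on
tmp_dir = Path(tempfile.mkdtemp())
sample = tmp_dir / "sample.myb2"
sample.write_bytes(b"MYB2" + b"\x00" * 8)

print(starts_with_magic(sample, b"MYB2"))         # byte string, used as-is
print(starts_with_magic(sample, "4D594232"))      # the same four bytes as hex
print(starts_with_magic(sample, "98E3F12200AA"))  # a different magic number
```

The first two calls describe the same four bytes, so both match the sample file, whereas the third does not.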
Formats that contain metadata in a separate header file can be defined using the WithSeparateHeader mixin.
from fileformats.generic import File
from fileformats.core.mixin import WithSeparateHeader


class MyHeaderFormat(File):
    ext = ".hdr"

    def load(self):
        # Parse "key:value" lines into a dictionary of metadata
        return dict(ln.split(":") for ln in self.contents.splitlines())


class MyFormatWithHeader(WithSeparateHeader, File):
    ext = ".myh"
    header_type = MyHeaderFormat
The header file can be accessed from an instantiated file object via the header property. If the header format implements the load method, then it is assumed to return a dictionary containing metadata for the file-set.
>>> my_file = MyFormatWithHeader("/path/to/a/file.myh")
>>> my_file.header
MyHeaderFormat(fspaths={"/path/to/a/file.hdr"})
>>> my_file.metadata["experiment-id"] # load experiment ID from header file
'0001'
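To make the header mechanics concrete, here is a standard-library sketch of locating a header next to a primary file and parsing it into a metadata dictionary. It is an illustration under assumptions rather than the WithSeparateHeader implementation: the helpers header_path_for and load_header are hypothetical, and the header is assumed to share the primary file's stem.

```python
import tempfile
from pathlib import Path


def header_path_for(primary_path, header_ext=".hdr"):
    """Hypothetical helper: assume the header shares the primary file's stem."""
    return Path(primary_path).with_suffix(header_ext)


def load_header(header_path):
    """Parse 'key:value' lines into a dict of metadata."""
    contents = Path(header_path).read_text()
    # Split on the first colon only, so values may contain colons themselves
    pairs = (ln.split(":", 1) for ln in contents.splitlines() if ln.strip())
    return {key.strip(): value.strip() for key, value in pairs}


# Create a throw-away primary/header pair to try it on
tmp_dir = Path(tempfile.mkdtemp())
primary = tmp_dir / "file.myh"
primary.write_bytes(b"\x00\x01")
header_path_for(primary).write_text("experiment-id: 0001\nstarted: 09:30\n")

metadata = load_header(header_path_for(primary))
print(metadata["experiment-id"])
print(metadata["started"])
```

Splitting on the first colon only is a deliberate choice in this sketch: a plain split(":") produces more than two items for a line such as "started: 09:30", which dict() cannot consume as a key/value pair.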
Similar to WithSeparateHeader, WithSideCars can be used to define a format that contains some metadata within the main file and additional metadata in a separate "side-car" file. It is used in the same way as WithSeparateHeader; however, the type of the primary file, which reads the metadata from the binary file with read_metadata, must also be defined in primary_type.
Warning
Mixin classes in the fileformats.core.mixin package must come first in the method resolution order of the type's bases, so that they can override methods in FileSet if need be.
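The effect of base-class ordering can be seen with plain Python classes. The sketch below uses toy stand-ins (ToyFileSet, ToyFile and WithPrefix are hypothetical names, not FileFormats classes) to show that a mixin only overrides a method when it is listed before the class that defines it.

```python
class ToyFileSet:
    """Stand-in for a base class that provides a default method."""

    def describe(self):
        return "generic file-set"


class ToyFile(ToyFileSet):
    pass


class WithPrefix:
    """Stand-in mixin that overrides describe() and delegates to the next class."""

    def describe(self):
        return "mixin: " + super().describe()


class MixinFirst(WithPrefix, ToyFile):  # mixin precedes ToyFile in the MRO
    pass


class MixinLast(ToyFile, WithPrefix):  # ToyFileSet.describe is found first
    pass


print([c.__name__ for c in MixinFirst.__mro__])
print(MixinFirst().describe())  # the mixin's override is used
print([c.__name__ for c in MixinLast.__mro__])
print(MixinLast().describe())   # the mixin's override is never reached
```

In MixinLast the lookup reaches ToyFileSet.describe before WithPrefix.describe, so the mixin silently has no effect, which is the situation the warning above guards against.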
Custom format patterns
While the standard mixin classes should cover 90% of all formats, in the wild-west of scientific data formats you might need to write custom validators using the @fileformats.core.mark.required and @fileformats.core.mark.check decorators.
Take, for example, the GIS shapefile structure: it is a file-set consisting of 3 to 6 files differentiated by their extensions. To implement this class we use the required decorator. We inherit from the WithAdjacentFiles mixin so that neighbouring files (i.e. files with the same stem but a different extension) are included when the class is instantiated with just the primary ".shp" file.
from fileformats.generic import File
from fileformats.application import Xml
from fileformats.core.mixin import WithAdjacentFiles
from fileformats.core import mark


class GisShapeIndex(File):
    "the file that indexes the geometry."
    ext = ".shx"


class GisShapeFeatures(File):
    "the file that stores feature attributes in a tabular format"
    ext = ".dbf"


class WellKnownText(File):
    """the file that contains information on projection format including the
    coordinate system and projection information. It is a plain text file
    describing the projection using well-known text (WKT) format."""
    ext = ".prj"


class GisShapeSpatialIndexN(File):
    "the files that are a spatial index of the features."
    ext = ".shn"


class GisShapeSpatialIndexB(File):
    "the files that are a spatial index of the features."
    ext = ".shb"


class GisShapeGeoSpatialMetadata(Xml):
    "the file that is the geospatial metadata in XML format"
    ext = ".shp.xml"
class GisShape(WithAdjacentFiles, File):
    ext = ".shp"  # the main file that will be mapped to fspath

    @mark.required
    @property
    def index_file(self):
        return GisShapeIndex(self.select_by_ext(GisShapeIndex))

    @mark.required
    @property
    def features_file(self):
        return GisShapeFeatures(self.select_by_ext(GisShapeFeatures))

    @mark.required
    @property
    def project_file(self):
        return WellKnownText(self.select_by_ext(WellKnownText), allow_none=True)

    @mark.required
    @property
    def spatial_index_n_file(self):
        return GisShapeSpatialIndexN(
            self.select_by_ext(GisShapeSpatialIndexN), allow_none=True
        )

    @mark.required
    @property
    def spatial_index_b_file(self):  # renamed so it no longer shadows the "n" index
        return GisShapeSpatialIndexB(
            self.select_by_ext(GisShapeSpatialIndexB), allow_none=True
        )

    @mark.required
    @property
    def geospatial_metadata_file(self):
        return GisShapeGeoSpatialMetadata(
            self.select_by_ext(GisShapeGeoSpatialMetadata), allow_none=True
        )
Marking the properties as required means that they must be able to return a value without raising a FormatMismatchError for the class to be instantiated. Required properties that appear in the fspaths attribute of the object are considered to be "required paths", and are copied alongside the main path by the copy_to method.
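The idea of gathering neighbouring files and then picking out required and optional members can be pictured with pathlib alone. This is a rough sketch under assumptions, not how WithAdjacentFiles or select_by_ext are implemented; adjacent_files and pick_by_ext are hypothetical helpers written for this illustration.

```python
import tempfile
from pathlib import Path


def adjacent_files(primary):
    """Files in the primary's directory named '<stem>.<anything>', primary included."""
    primary = Path(primary)
    prefix = primary.stem + "."
    return sorted(
        (p for p in primary.parent.iterdir()
         if p.is_file() and p.name.startswith(prefix)),
        key=lambda p: p.name,
    )


def pick_by_ext(paths, stem, ext, allow_none=False):
    """Return the path named exactly stem + ext, or None if that is allowed."""
    for p in paths:
        if p.name == stem + ext:
            return p
    if allow_none:
        return None
    raise FileNotFoundError(f"No '{ext}' file found alongside '{stem}'")


# Build a throw-away shapefile-like file-set, plus one unrelated file
tmp_dir = Path(tempfile.mkdtemp())
for name in ["roads.shp", "roads.shx", "roads.dbf", "roads.shp.xml", "rivers.shp"]:
    (tmp_dir / name).touch()

neighbours = adjacent_files(tmp_dir / "roads.shp")
print([p.name for p in neighbours])
print(pick_by_ext(neighbours, "roads", ".shx").name)              # required, present
print(pick_by_ext(neighbours, "roads", ".prj", allow_none=True))  # optional, absent
```

A missing optional member yields None, whereas a missing required member raises an error, mirroring the distinction drawn by allow_none in the GisShape properties.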
After the required properties have been validated, deeper checks can be performed using the check decorator. Take the fileformats.image.Tiff class:
class Tiff(RasterImage):
    ext = ".tiff"
    iana_mime = "image/tiff"
    magic_number_le = "49492A00"
    magic_number_be = "4D4D002A"

    @mark.check
    def endianness(self):
        # Read as many bytes as the hex-encoded magic number represents
        read_magic = self.read_contents(len(self.magic_number_le) // 2)
        if read_magic == bytes.fromhex(self.magic_number_le):
            endianness = "little"
        elif read_magic == bytes.fromhex(self.magic_number_be):
            endianness = "big"
        else:
            read_magic_str = bytes.hex(read_magic)
            raise FormatMismatchError(
                f"Magic number of file '{read_magic_str}' doesn't match either the "
                f"little-endian '{self.magic_number_le}' or big-endian "
                f"'{self.magic_number_be}'"
            )
        return endianness
The Tiff format class needs to check two different magic numbers, one for big-endian files and another for little-endian files. Therefore we can't just use the WithMagicNumber mixin and have to roll our own check: the endianness method is decorated with fileformats.core.mark.check.
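The same two-magic-number decision can be tried on raw bytes with the standard library only. This is a sketch for experimentation, not the library's Tiff class; the tiff_endianness function and the TIFF_MAGIC table are hypothetical names, and ValueError stands in for FormatMismatchError.

```python
# Map each accepted magic number to the byte order it implies
TIFF_MAGIC = {
    bytes.fromhex("49492A00"): "little",  # b"II*\x00"
    bytes.fromhex("4D4D002A"): "big",     # b"MM\x00*"
}


def tiff_endianness(data):
    """Return 'little' or 'big' based on the first four bytes of `data`."""
    magic = bytes(data[:4])
    try:
        return TIFF_MAGIC[magic]
    except KeyError:
        raise ValueError(
            f"Magic number '{magic.hex()}' matches neither TIFF byte order"
        ) from None


print(tiff_endianness(b"II*\x00" + b"\x08\x00\x00\x00"))
print(tiff_endianness(b"MM\x00*" + b"\x00\x00\x00\x08"))
```

Any other leading bytes, for example those of a PNG file, raise the error instead of returning a value.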
Converters
Converters between two equivalent formats are defined using Pydra dataflow engine tasks. There are two types of Pydra tasks: function tasks, which are Python functions decorated by @pydra.mark.task, and shell-command tasks, which wrap command-line tools in Python classes. To register a Pydra task as a converter between two file formats, it needs to be decorated with the @fileformats.core.mark.converter decorator. Note that converters that rely on any additional dependencies should not be implemented in your extension package, but rather in a sister "extras" package named fileformats-<yournamespace>-extras; see the extras template for further instructions.
Pydra uses type annotations to define the inputs and outputs of tasks. If there is an input to the task named in_file, and either a single anonymous output or an output named out_file, and both are format classes, then no arguments need to be passed to the converter decorator and the conversion source and target formats are determined automatically. For example,
from pathlib import Path
import tempfile
import pydra.mark
import fileformats.core.mark
from .mypackage import MyFormat, MyOtherFormat


@fileformats.core.mark.converter
@pydra.mark.task
def convert_my_format(in_file: MyFormat, conversion_argument: int = 2) -> MyOtherFormat:
    data = in_file.load()
    output_path = Path(tempfile.mkdtemp()) / ("out" + MyOtherFormat.ext)
    # ... do conversion ...
    return MyOtherFormat.save_new(output_path, data)
defines a converter between MyFormat and MyOtherFormat, with the converter argument conversion_argument.
The @converter decorator registers the task in a class attribute of the target class, so the converters will only be available if the module containing them has been imported. Converter arguments can be passed as keyword arguments to the get_converter and convert methods if required.
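The registration pattern described above (source and target read from the annotations, converter stored on the target class, extra arguments passed as keywords) can be mimicked in a few lines of plain Python. This is a toy model, not the FileFormats or Pydra implementation: ToyFormat, ToyOtherFormat, the converters attribute and this converter decorator are all hypothetical.

```python
import typing


class Format:
    """Toy base class; every subclass gets its own converter table."""

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        cls.converters = {}  # maps source format class -> converter function

    def __init__(self, data):
        self.data = data

    @classmethod
    def convert(cls, source, **kwargs):
        # Walk the source's MRO so converters registered for a base class apply
        for klass in type(source).__mro__:
            if klass in cls.converters:
                return cls.converters[klass](source, **kwargs)
        raise LookupError(
            f"No converter from {type(source).__name__} to {cls.__name__}"
        )


def converter(func):
    """Register `func` on its return type, keyed by the type of `in_file`."""
    hints = typing.get_type_hints(func)
    hints["return"].converters[hints["in_file"]] = func
    return func


class ToyFormat(Format):
    pass


class ToyOtherFormat(Format):
    pass


@converter
def convert_toy(in_file: ToyFormat, conversion_argument: int = 2) -> ToyOtherFormat:
    return ToyOtherFormat(in_file.data * conversion_argument)


print(ToyOtherFormat.convert(ToyFormat("ab")).data)
print(ToyOtherFormat.convert(ToyFormat("ab"), conversion_argument=3).data)
```

Because registration happens when the decorator runs, the table stays empty until the module defining convert_toy has been imported, which is the behaviour noted for real converters.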
Sometimes the source and target formats cannot be automatically determined from the task signature, and need to be provided as arguments to the @converter decorator instead. For example, the converter between raster images uses the imageio package to do a generic conversion between all image types:
from pathlib import Path
import tempfile
import typing as ty  # needed for the ty.Optional annotation below
import pydra.mark
import pydra.engine.specs
from fileformats.core import mark
from .raster import RasterImage, Bitmap, Gif, Jpeg, Png, Tiff


@mark.converter(target_format=Bitmap, output_format=Bitmap)
@mark.converter(target_format=Gif, output_format=Gif)
@mark.converter(target_format=Jpeg, output_format=Jpeg)
@mark.converter(target_format=Png, output_format=Png)
@mark.converter(target_format=Tiff, output_format=Tiff)
@pydra.mark.task
@pydra.mark.annotate({"return": {"out_file": RasterImage}})
def convert_image(in_file: RasterImage, output_format: type, out_dir: ty.Optional[Path] = None):
    data_array = in_file.load()
    if out_dir is None:
        out_dir = Path(tempfile.mkdtemp())
    output_path = out_dir / (in_file.fspath.stem + output_format.ext)
    return output_format.save_new(output_path, data_array)
In this case, because we can write the converter in a generic way that allows us to convert between any image types supported by imageio, we use the RasterImage base class for the input and output formats, and explicitly set the target_format for each of the supported output formats. We also pass output_format as a keyword argument from the converter decorator to specify the format we want to convert to.
Note that while the source_format can be a base class of the format to be converted, the target_format can't be, since the subclass may have specific characteristics not captured by the transformation to the base class. However, you can attempt to "cast" a base class to a sub-class simply by providing the base-class object as an input to the sub-class, since it will iterate over the paths in the base-class object and attempt to validate them.
>>> sub_format = SubFormat(BaseFormat.convert(another_format))
Shell commands are marked as converters in the same way as function tasks, and existing ShellCommandTask classes can be registered by calling the converter decorator on the ShellCommandTask class directly. If required, you can also map the input and output files to in_file and out_file via the converter decorator for any converter task, and set appropriate input fields:
from fileformats.core.mark import converter
from fileformats.yourpackage import YourFormat, YourOtherFormat
from pydra.tasks.thirdparty import ThirdPartyShellCmd

# The task's own field names are assumed here to be passed as strings
converter(
    source_format=YourFormat,
    target_format=YourOtherFormat,
    in_file="your_file",
    out_file="other_file",
    compression="y",
)(ThirdPartyShellCmd)
If you need to map any of the converter arguments or perform more complex logic, it is also possible to decorate a generic function that returns an instantiated Pydra task, such as in the mrconvert converter in the fileformats-medimage package:
@mark.converter(source_format=MedicalImage, target_format=Analyze, out_ext=Analyze.ext)
@mark.converter(
    source_format=MedicalImage, target_format=MrtrixImage, out_ext=MrtrixImage.ext
)
@mark.converter(
    source_format=MedicalImage,
    target_format=MrtrixImageHeader,
    out_ext=MrtrixImageHeader.ext,
)
def mrconvert(name, out_ext: str):
    """Initiate an MRConvert task with the output file extension set

    Parameters
    ----------
    name : str
        name of the converter task
    out_ext : str
        extension of the output file, used by MRConvert to determine the desired format

    Returns
    -------
    pydra.ShellCommandTask
        the converter task
    """
    return pydra_mrtrix3_utils.MRConvert(name=name, out_file="out" + out_ext)
Since converter tasks rely on Pydra, which should be added as an "extended" dependency, they are not loaded by default. However, if there is a package at fileformats.<namespace>.converters, an attempt will be made to import it when get_converter is called on a format in that namespace, and a warning will be issued if the import fails.
Note
If the converters aren't imported successfully, then you will receive a FormatConversionError saying there are no converters between FormatA and FormatB.
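The "try to import, warn on failure" behaviour can be sketched with importlib and warnings from the standard library. This is only an illustration of the pattern, not the code FileFormats runs inside get_converter; try_import_converters is a hypothetical helper and no_such_namespace is a deliberately non-existent namespace.

```python
import importlib
import warnings


def try_import_converters(namespace):
    """Attempt to import fileformats.<namespace>.converters, warning on failure."""
    module_name = f"fileformats.{namespace}.converters"
    try:
        return importlib.import_module(module_name)
    except ImportError as e:
        # Warn rather than fail, so that detection/validation keeps working
        warnings.warn(f"Could not import converters from '{module_name}': {e}")
        return None


# Capture the warning so it can be inspected instead of printed to stderr
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    module = try_import_converters("no_such_namespace")

print(module)
print(len(caught))
```

With this pattern, a missing optional dependency surfaces as a warning at lookup time, and the later "no converters" error is the only hard failure.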