Extensions¶
FileFormats has been designed so that file formats specified by standard features, such as a file extension and magic number, can be implemented in a few lines, while still being flexible enough to handle any weird and wacky file formats used in obscure domains.
Format classes not covered by IANA Media Types should be implemented in a separate FileFormats extension package. New extension packages, including CI/CD workflows, can be conveniently created from the FileFormats extension template, https://github.com/ArcanaFramework/fileformats-medimage.
Extension packages add a new unique format namespace under the fileformats namespace package. For example, the FileFormats Medimage Extension implements a range of file formats used in medical imaging research under the fileformats.medimage namespace.
Extension packages shouldn't have any external dependencies (i.e. except the base fileformats package). Additional functionality that requires external dependencies should be implemented in an "extras" package (see Extras).
Basic formats¶
In the simplest case of a file format identified by its extension alone, you only need to inherit from the fileformats.generic.File class and set the ext attribute, e.g.
from fileformats.generic import File

class MyFileFormat(File):
    ext = ".my"
Likewise, if the format you are defining is a directory containing one or more files of a given type, you can just inherit from the fileformats.generic.Directory class and set the content_types attribute
from fileformats.generic import Directory
from fileformats.text import Markdown, Html

class MyDirFormat(Directory):
    content_types = (Markdown, Html)
Standard mixins¶
If the format is a binary file with a magic number (an identifying byte string at the start of the file), you can use the fileformats.core.mixin.WithMagicNumber mixin. For files with magic numbers you will also need to set the binary attribute to True.
from fileformats.generic import File
from fileformats.core.mixin import WithMagicNumber

class MyBinaryFormat1(WithMagicNumber, File):
    ext = ".myb1"
    binary = True
    magic_number = "98E3F12200AA"  # Unicode strings are interpreted as hex

class MyBinaryFormat2(WithMagicNumber, File):
    ext = ".myb2"
    binary = True
    magic_number = b"MYB2"  # Byte strings are not converted
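To make the two spellings of magic_number concrete, here is a minimal standard-library sketch of the kind of comparison involved. It is an illustration only, not the WithMagicNumber implementation, and the helper name starts_with_magic is hypothetical.

```python
import tempfile
from pathlib import Path


def starts_with_magic(path, magic_number):
    """Return True if the file at `path` begins with `magic_number`.

    Follows the convention described above: a `str` is treated as a string
    of hexadecimal digits, while `bytes` are compared as-is.
    (Hypothetical helper, not part of fileformats.)
    """
    if isinstance(magic_number, str):
        magic_number = bytes.fromhex(magic_number)
    with open(path, "rb") as f:
        return f.read(len(magic_number)) == magic_number


# Write two throwaway files to exercise both spellings of the magic number
tmp_dir = Path(tempfile.mkdtemp())
hex_file = tmp_dir / "sample.myb1"
hex_file.write_bytes(bytes.fromhex("98E3F12200AA") + b"payload")
raw_file = tmp_dir / "sample.myb2"
raw_file.write_bytes(b"MYB2" + b"payload")

hex_matches = starts_with_magic(hex_file, "98E3F12200AA")
raw_matches = starts_with_magic(raw_file, b"MYB2")
mismatch = starts_with_magic(raw_file, "98E3F12200AA")
```

Only as many bytes as the magic number is long are read, so the check stays cheap even for large binary files.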
Formats that store metadata in a separate header file can be defined using the WithSeparateHeader mixin.
from fileformats.generic import File
from fileformats.core.mixin import WithSeparateHeader

class MyHeaderFormat(File):
    ext = ".hdr"

    def load(self):
        return dict(ln.split(":") for ln in self.raw_contents.splitlines())

class MyFormatWithHeader(WithSeparateHeader, File):
    ext = ".myh"
    header_type = MyHeaderFormat
The header file can be accessed from an instantiated file object via the header property. If the header format implements the load method, then it is assumed to return a dictionary containing metadata for the file-set.
>>> my_file = MyFormatWithHeader("/path/to/a/file.myh")
>>> my_file.header
MyHeaderFormat(fspaths={"/path/to/a/file.hdr"})
>>> my_file.metadata["experiment-id"] # load experiment ID from header file
'0001'
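The load method shown above splits each line on every ":", so it assumes that neither keys nor values contain a colon and it does not strip surrounding whitespace. If your header values may contain colons (timestamps, for example), a slightly more defensive parser could look like the sketch below. The function name parse_header and the sample header text are hypothetical and only illustrate the idea.

```python
def parse_header(text):
    """Parse 'key: value' lines into a dictionary of metadata.

    Splits on the first ':' only, so values may themselves contain colons.
    (Hypothetical helper for illustration, not part of fileformats.)
    """
    metadata = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        key, sep, value = line.partition(":")
        if not sep:
            raise ValueError(f"Header line has no ':' separator: {line!r}")
        metadata[key.strip()] = value.strip()
    return metadata


# Made-up header contents, with a value that contains colons
header_text = "experiment-id: 0001\nacquired: 2023-01-01T10:30:00\n"
metadata = parse_header(header_text)
```

A parser like this could be called from load on the text read from the header file.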
Similar to WithSeparateHeader, WithSideCars can be used to define a format that contains some metadata within the main file and additional metadata in a separate "side-car" file. It is used in the same way as WithSeparateHeader, except that the type of the primary file, which reads the metadata from the binary file with read_metadata(), must also be defined in primary_type.
Warning
Mixin classes in the fileformats.core.mixin package must come first in the method resolution order of the type's bases, so that they can override methods in FileSet if need be.
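The reason for this ordering is plain Python behaviour rather than anything specific to FileFormats. The following sketch uses stand-in classes (FileSet and WithExtra here are toy classes defined in the example, not the real ones) to show that a mixin only overrides a base-class method when it is listed before that base.

```python
class FileSet:
    """Toy stand-in for a base class (not the real fileformats FileSet)."""

    def describe(self):
        return "base"


class WithExtra:
    """Toy mixin that overrides a method and delegates to the next class."""

    def describe(self):
        return "mixin+" + super().describe()


class MixinFirst(WithExtra, FileSet):
    pass


class MixinLast(FileSet, WithExtra):
    pass


# With the mixin first, its override is found before the base implementation
mixin_first = MixinFirst().describe()
# With the mixin last, the base implementation wins and the override is skipped
mixin_last = MixinLast().describe()
mro_names = [cls.__name__ for cls in MixinFirst.__mro__]
```

Inspecting `__mro__` on your own format class is a quick way to confirm that the mixins precede the primary type.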
Mixin classes¶
- class fileformats.core.mixin.WithMagicNumber[source]¶
Bases:
object
Mixin class for Files with magic numbers at the start of their contents.
- class fileformats.core.mixin.WithMagicVersion[source]¶
Bases:
object
Mixin class for Files with version numbers embedded within "magic numbers" at the start of their contents.
- class fileformats.core.mixin.WithAdjacentFiles[source]¶
Bases:
object
If only the main fspath is provided to the __init__ of the class, this mixin automatically includes any "adjacent files", i.e. any files with the same stem but different extensions.
Note that WithAdjacentFiles must come before the primary type in the method-resolution order of the class so it can override the '_additional_paths' method, e.g.
class MyFileFormatWithSeparateHeader(WithSeparateHeader, MyFileFormat):
    header_type = MyHeaderType
- class fileformats.core.mixin.WithSeparateHeader[source]¶
Bases:
WithAdjacentFiles
Mixin class for Files with metadata stored in separate header files (typically with the same file stem but differing extension)
Note that WithSeparateHeader must come before the primary type in the method-resolution order of the class so it can override the '__attrs_post_init__' method, e.g.
class MyFileFormatWithSeparateHeader(WithSeparateHeader, MyFileFormat):
    header_type = MyHeaderType
- class fileformats.core.mixin.WithSideCars[source]¶
Bases:
WithAdjacentFiles
Mixin class for Files with a "side-car" file that augments the inline metadata (typically with the same file stem but differing extension).
Note that WithSideCars must come before the primary type in the method-resolution order of the class so it can override the '__attrs_post_init__' and 'read_metadata' methods, e.g.
class MyFileFormatWithSideCars(WithSideCars, MyFileFormat):
    primary_type = MyFileFormat
    side_car_types = (MySideCarType,)
- class fileformats.core.mixin.WithClassifiers[source]¶
Bases:
object
Mixin class for adding the ability to qualify the format class to designate the type of information stored within the format, e.g. DirectoryOf[Png, Gif] for a directory containing PNG and GIF files, Zip[DataFile] for a zipped data file, Array[Integer] for an array containing integers, or DicomDir[T1w, Brain] for a T1-weighted MRI scan of the brain in DICOM format.
class MyFormatWithClassifiers(WithClassifiers, BinaryFile):
    ext = ".myf"

def my_func(file: MyFormatWithClassifiers[Integer]):
    ...
A unique class will be returned (i.e. multiple calls with the same arguments will return the same class).
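The "same arguments, same class" guarantee can be pictured with Python's `__class_getitem__` hook and a cache. The sketch below is a hypothetical stand-in written for illustration. It is not how WithClassifiers is implemented, and the names WithCachedClassifiers, Array, Integer and Decimal are toy classes defined in the example.

```python
class WithCachedClassifiers:
    """Hypothetical stand-in for a classifier mixin (not the fileformats one)."""

    classifiers = ()
    _classified_cache = {}

    def __class_getitem__(cls, item):
        # Normalise Array[Integer] and Array[Integer, Decimal] to a tuple
        classifiers = item if isinstance(item, tuple) else (item,)
        key = (cls, classifiers)
        cache = WithCachedClassifiers._classified_cache
        if key not in cache:
            joined = ", ".join(c.__name__ for c in classifiers)
            name = f"{cls.__name__}[{joined}]"
            # Create the qualified subclass once and reuse it afterwards
            cache[key] = type(name, (cls,), {"classifiers": classifiers})
        return cache[key]


class Integer:
    pass


class Decimal:
    pass


class Array(WithCachedClassifiers):
    pass


array_of_ints = Array[Integer]
same_class = Array[Integer] is array_of_ints
different_class = Array[Decimal] is not array_of_ints
```

Because the qualified class is cached, it can be compared by identity and used reliably in type annotations such as the my_func signature above.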
Custom format patterns¶
While the standard mixin classes should cover the large majority of standard formats, in the wild west of scientific data formats you are likely to need to design custom validators for your format. This is done by adding a property to the class using the fileformats.core.validated_property decorator. Validated properties should check the validity of an aspect of the file, and raise a FormatMismatchError if the file does not match the expected pattern.
To detect the presence of associated files, you can use the select_by_ext method of the file object, which selects a single file from a list of file paths that matches the given extension, raising a FormatMismatchError if either no files or multiple files are found.
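The "exactly one match" behaviour described here can be sketched with the standard library alone. The function below is a hypothetical illustration: it takes an extension string and an explicit list of paths, whereas the real select_by_ext is a method on the file object, and the FormatMismatchError class is a local stand-in for the library's exception.

```python
from pathlib import Path


class FormatMismatchError(Exception):
    """Local stand-in for the library's exception of the same name."""


def select_single_by_ext(fspaths, ext):
    """Return the one path in `fspaths` whose name ends with `ext`.

    Raises FormatMismatchError if no path, or more than one path, matches.
    (Hypothetical helper for illustration only.)
    """
    matches = [Path(p) for p in fspaths if Path(p).name.endswith(ext)]
    if not matches:
        raise FormatMismatchError(f"No file with extension '{ext}' found")
    if len(matches) > 1:
        names = ", ".join(str(m) for m in matches)
        raise FormatMismatchError(
            f"Multiple files with extension '{ext}' found: {names}"
        )
    return matches[0]


paths = ["/data/roads.shp", "/data/roads.shx", "/data/roads.dbf"]
index_path = select_single_by_ext(paths, ".shx")

# A missing extension is reported as a format mismatch
try:
    select_single_by_ext(paths, ".prj")
except FormatMismatchError:
    missing_raises = True
else:
    missing_raises = False

# So is an ambiguous one
try:
    select_single_by_ext(paths + ["/other/roads.shx"], ".shx")
except FormatMismatchError:
    ambiguous_raises = True
else:
    ambiguous_raises = False
```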
Take for example the GIS shapefile structure, which is a file-set consisting of 3 to 6 files differentiated by their extensions. To implement this class we use the @validated_property decorator. We inherit from the WithAdjacentFiles mixin so that neighbouring files (i.e. files with the same stem but different extension) are included when the class is instantiated with just the primary ".shp" file.
from fileformats.generic import File
from fileformats.application import Xml
from fileformats.core.mixin import WithAdjacentFiles
from fileformats.core import validated_property

class GisShapeIndex(File):
    "the file that indexes the geometry."
    ext = ".shx"

class GisShapeFeatures(File):
    "the file that stores feature attributes in a tabular format"
    ext = ".dbf"

class WellKnownText(File):
    """the file that contains information on projection format including the
    coordinate system and projection information. It is a plain text file
    describing the projection using well-known text (WKT) format."""
    ext = ".prj"

class GisShapeSpatialIndexN(File):
    "the files that are a spatial index of the features."
    ext = ".shn"

class GisShapeSpatialIndexB(File):
    "the files that are a spatial index of the features."
    ext = ".shb"

class GisShapeGeoSpatialMetadata(Xml):
    "the file that is the geospatial metadata in XML format"
    ext = ".shp.xml"

class GisShape(WithAdjacentFiles, File):
    ext = ".shp"  # the main file that will be mapped to fspath

    @validated_property
    def index_file(self):
        return GisShapeIndex(self.select_by_ext(GisShapeIndex))

    @validated_property
    def features_file(self):
        return GisShapeFeatures(self.select_by_ext(GisShapeFeatures))

    @validated_property
    def project_file(self):
        return WellKnownText(self.select_by_ext(WellKnownText), allow_none=True)

    @validated_property
    def spatial_index_n_file(self):
        return GisShapeSpatialIndexN(
            self.select_by_ext(GisShapeSpatialIndexN), allow_none=True
        )

    @validated_property
    def spatial_index_b_file(self):
        return GisShapeSpatialIndexB(
            self.select_by_ext(GisShapeSpatialIndexB), allow_none=True
        )

    @validated_property
    def geospatial_metadata_file(self):
        return GisShapeGeoSpatialMetadata(
            self.select_by_ext(GisShapeGeoSpatialMetadata), allow_none=True
        )
Properties that appear in the fspaths attribute of the object are considered to be "required paths", and are copied alongside the main path by the copy_to method even when the trim argument is set to True.
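To picture what "adjacent files" means for a class such as GisShape above, the following standard-library sketch collects the neighbours of a primary file. It is only an illustration of the idea, not the WithAdjacentFiles implementation. The helper name adjacent_files is hypothetical, and it assumes that the stem is everything before the first "." in the file name.

```python
import tempfile
from pathlib import Path


def adjacent_files(primary):
    """Return sibling files that share the primary file's stem.

    Assumption: the stem is the part of the name before the first '.',
    so 'roads.shp.xml' is treated as adjacent to 'roads.shp'.
    (Hypothetical helper for illustration only.)
    """
    primary = Path(primary)
    stem = primary.name.split(".")[0]
    return sorted(
        p
        for p in primary.parent.iterdir()
        if p.is_file() and p != primary and p.name.split(".")[0] == stem
    )


# Build a throwaway directory holding one shapefile set plus an unrelated file
tmp_dir = Path(tempfile.mkdtemp())
for name in ["roads.shp", "roads.shx", "roads.dbf", "roads.shp.xml", "rivers.shp"]:
    (tmp_dir / name).touch()

neighbours = [p.name for p in adjacent_files(tmp_dir / "roads.shp")]
```

Here "rivers.shp" is left out because its stem differs, while the index, feature and metadata files for "roads" are all picked up.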
After the required properties have been validated, deeper checks can be performed by using the check decorator. Take the fileformats.image.Tiff class:
class Tiff(RasterImage):
    ext = ".tiff"
    iana_mime = "image/tiff"
    magic_number_le = "49492A00"
    magic_number_be = "4D4D002A"

    @property
    def endianness(self):
        read_magic = self.read_contents(len(self.magic_number_le) // 2)
        if read_magic == bytes.fromhex(self.magic_number_le):
            endianness = "little"
        elif read_magic == bytes.fromhex(self.magic_number_be):
            endianness = "big"
        else:
            read_magic_str = bytes.hex(read_magic)
            raise FormatMismatchError(
                f"Magic number of file '{read_magic_str}' doesn't match either the "
                f"little-endian '{self.magic_number_le}' and big-endian "
                f"'{self.magic_number_be}'"
            )
        return endianness
The Tiff format class needs to check two different magic numbers, one for big-endian files and another one for little-endian files. Therefore we can't just use the WithMagicNumber mixin and have to roll our own.
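The same two-way check can be tried outside the library on raw bytes. The sketch below reuses the two hex strings from the Tiff class above. The function name tiff_endianness is hypothetical, and it raises a plain ValueError where the library raises FormatMismatchError.

```python
TIFF_MAGIC_LE = bytes.fromhex("49492A00")  # b"II*\x00", little-endian
TIFF_MAGIC_BE = bytes.fromhex("4D4D002A")  # b"MM\x00*", big-endian


def tiff_endianness(header):
    """Return 'little' or 'big' based on the first four bytes of a TIFF file.

    (Hypothetical helper for illustration, not part of fileformats.)
    """
    if header.startswith(TIFF_MAGIC_LE):
        return "little"
    if header.startswith(TIFF_MAGIC_BE):
        return "big"
    raise ValueError(
        f"Magic number '{header[:4].hex()}' matches neither the little-endian "
        f"nor the big-endian TIFF magic number"
    )


little = tiff_endianness(b"II*\x00" + b"\x08\x00\x00\x00")
big = tiff_endianness(b"MM\x00*" + b"\x00\x00\x00\x08")

# Bytes from some other format (here, the start of a PNG file) are rejected
try:
    tiff_endianness(b"\x89PNG\r\n\x1a\n")
except ValueError:
    rejected = True
else:
    rejected = False
```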