Format Identification#
In addition to simply importing and using a format class in your Python code, format classes can be loaded from their MIME type or detected from file-system objects.
Detecting formats#
While not a primary design goal of the FileFormats library, it is
possible to detect the formats that match a given set of files using the find_matching
function. Note that it isn't always possible to uniquely identify a single format, since
several formats may match the same files, for example non-descript binary file formats
that use the ".dat" extension.
>>> from fileformats.core import find_matching
>>> find_matching("/path/to/word.doc")
[<class 'fileformats.application.Msword'>]
Note that the installation of additional sub-packages may cause detection code to
break if your code doesn't account for the potential of new formats being added with
overlapping cases, where both the new and the existing format match a given file set.
If you are only interested in formats covered in the main fileformats package then you
should use the standard_only flag.
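One way to guard against ambiguous matches is to reduce the returned list to a single class explicitly and fail loudly otherwise. The helper below is a minimal sketch, not part of FileFormats: pick_single_format and its preferred argument are hypothetical names, and the stand-in classes only mimic a list such as the one find_matching returns.

```python
from typing import Sequence


def pick_single_format(matches: Sequence[type], preferred: Sequence[type] = ()) -> type:
    """Reduce a list of candidate format classes to one, or raise (hypothetical helper)."""
    if not matches:
        raise LookupError("no format matched the given files")
    if len(matches) == 1:
        return matches[0]
    # Several formats matched: fall back to an explicit order of preference
    for fmt in preferred:
        if fmt in matches:
            return fmt
    names = ", ".join(m.__name__ for m in matches)
    raise LookupError(f"ambiguous match, candidates are: {names}")


# Stand-in classes used purely for illustration
class GenericDat: ...
class VendorDat: ...

chosen = pick_single_format([GenericDat, VendorDat], preferred=[VendorDat])
print(chosen.__name__)  # VendorDat
```

In real code the first argument would be the list returned by find_matching, optionally restricted with the standard_only flag mentioned above (assumed here to be a keyword argument of find_matching).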
MIME Types#
Namespaces in the fileformats package are largely named after MIME type registries
as defined by the Internet Assigned Numbers Authority (IANA).
The difference is that there is no "application" registry, which acts as a
bit of a catch-all in the MIME-type specification. Instead, types that
fall under the "application" registry are grouped by the types of data that they
store, e.g. fileformats.application for (typically compressed) archives such as
zip, bzip and gzip, fileformats.application for PDFs and Word documents, and
fileformats.application for JSON, YAML and XML.
Format classes can be converted to and from MIME type strings using the to_mime
and from_mime functions. If the iana_mime attribute is present in the type class,
it should correspond to a MIME type formally recognised by IANA, e.g.
from fileformats.core import to_mime, from_mime
from fileformats.application import MswordX
Loaded = from_mime("application/vnd.openxmlformats-officedocument.wordprocessingml.document")
assert Loaded is MswordX
assert Loaded.mime_type == "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
If the format class doesn't define an iana_mime attribute (i.e. in the actual class,
not including iana_mime attributes defined in base classes), it will be assigned an informal
MIME-type of "application/x-<transformed-class-name>", where transformed-class-name
is the name of the format class converted from "PascalCase" to "kebab-case", with
single underscores converted to "." and a double underscore converted to "+" (there
should be only one), e.g.
>>> Nifti__Gzip_Json.mime_type
"application/x-nifti+gzip.json"
Note that if there are two file-formats with the same class name in different sub-packages
then the iana_mime attribute will need to be set on at least one of them, otherwise an
error will be raised when they are loaded from a MIME type.
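The class-name transformation described above can be sketched in a few lines. This is an illustrative re-implementation of the stated rule, not the library's own code, and informal_mime is a hypothetical name.

```python
import re


def informal_mime(class_name: str) -> str:
    """Derive an informal MIME type from a format class name (illustrative only)."""

    def kebab(word: str) -> str:
        # Insert "-" before every capital letter except the first, then lowercase
        return re.sub(r"(?<!^)(?=[A-Z])", "-", word).lower()

    # A double underscore becomes "+", each single underscore becomes "."
    plus_parts = class_name.split("__")
    subtype = "+".join(
        ".".join(kebab(word) for word in part.split("_")) for part in plus_parts
    )
    return f"application/x-{subtype}"


print(informal_mime("Nifti__Gzip_Json"))  # application/x-nifti+gzip.json
print(informal_mime("DicomSeries"))       # application/x-dicom-series
```

Because only the class name enters the result, two classes with the same name in different sub-packages produce the same string, which is why iana_mime has to be set on at least one of them.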
Warning
Note that the installation of additional sub-packages may cause detection code to break if your code doesn't account for the potential of new formats being added with the same class name. Therefore, you may prefer to use "MIME-like" type strings (see below) unless IANA compliance is required.
MIME-like types#
To avoid the issue with format classes in separate namespaces mapping onto the same
IANA-style MIME type, as well as to improve the readability of the MIME string (i.e. not
drowning in a sea of "application/x-*" types), it can be preferable in some use cases
not to worry about closely matching the MIME-type specification for non-standard formats
and just use the FileFormats namespace in place of the generic "application/x-" prefix.
This is accessed via the mime_like class-property.
>>> from fileformats.datascience import Hdf5
>>> from fileformats.medimage import Nifti1
>>> Hdf5.mime_like
"datascience/hdf5"
>>> Nifti1.mime_like
"medimage/nifti1"
The from_mime function will resolve both official-style MIME types and MIME-like
types, so it is possible to round-trip through either.
from fileformats.core import to_mime, from_mime
from fileformats.medimage import DicomSeries
# Using official-style MIME string
mime_type = DicomSeries.mime_type
assert mime_type == "application/x-dicom-series"
assert from_mime(mime_type) is DicomSeries
# Using MIME-like string
mimelike_type = DicomSeries.mime_like
assert mimelike_type == "medimage/dicom-series"
assert from_mime(mimelike_type) is DicomSeries
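To make the shape of a MIME-like string concrete, the sketch below splits one into its namespace and a PascalCase class name. It is a hypothetical helper, not how from_mime is implemented, and it deliberately ignores the "+" and "." qualifiers used for names containing underscores.

```python
def split_mime_like(mime_like: str) -> tuple[str, str]:
    """Split "namespace/kebab-name" into (namespace, PascalCaseName). Illustrative only."""
    namespace, sep, name = mime_like.partition("/")
    if not sep or not namespace or not name:
        raise ValueError(f"expected '<namespace>/<name>', got {mime_like!r}")
    # Reverse the kebab-case conversion: capitalise each "-"-separated word and join
    class_name = "".join(word.capitalize() for word in name.split("-"))
    return namespace, class_name


print(split_mime_like("medimage/dicom-series"))  # ('medimage', 'DicomSeries')
print(split_mime_like("datascience/hdf5"))       # ('datascience', 'Hdf5')
```

Keeping the namespace in the string is what lets same-named classes in different sub-packages stay distinguishable.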