Detection
FileFormats is designed to detect whether a set of files matches a given format specification. This can be used either to validate file types in workflows or to identify the format in which user input files have been provided.
Validation
In the basic case, FileFormats can be used to check the format of files and directories against known types. Typically this involves checking the file extension and, if applicable, the magic number:
from fileformats.image import Jpeg
jpeg_file = Jpeg("/path/to/image.jpg") # PASSES
Jpeg("/path/to/image.png") # FAILS!
fake_fspath = "/path/to/fake-image.jpg"
with open(fake_fspath, "w") as f:
    f.write("this is not a valid JPEG file")
Jpeg(fake_fspath) # FAILS!
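The two checks described above (extension first, then magic number) can be sketched with the standard library alone. This illustrates the idea and is not FileFormats' own implementation: `looks_like_jpeg` and `JPEG_EXTENSIONS` are hypothetical names, and the three bytes `FF D8 FF` are assumed as the signature because they open a typical JPEG stream.

```python
from pathlib import Path

JPEG_EXTENSIONS = {".jpg", ".jpeg"}  # assumed set of accepted extensions
JPEG_MAGIC = b"\xff\xd8\xff"         # bytes that open a typical JPEG stream


def looks_like_jpeg(path):
    """Return True if `path` has a JPEG extension and starts with the JPEG magic bytes."""
    path = Path(path)
    # Cheap check first: reject on the extension without touching the file
    if path.suffix.lower() not in JPEG_EXTENSIONS:
        return False
    if not path.is_file():
        return False
    # Content check: compare the leading bytes against the signature
    with open(path, "rb") as f:
        return f.read(len(JPEG_MAGIC)) == JPEG_MAGIC
```

A text file renamed to `fake-image.jpg` passes the extension check but fails the content check, which is the same situation as the third case in the example above.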
To check whether a format matches without attempting to initialise the object, use the FileSet.matches() method:
if Jpeg.matches("/path/to/image.jpg"):
    ...
Formats that consist of directories with specific nested file formats within them can be defined by subclassing TypedDirectory and setting the content_types class attribute, e.g.
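from fileformats.core.mixin import WithMagicNumber
from fileformats.generic import File, TypedDirectory

class Dicom(WithMagicNumber, File):
    magic_number = b"DICM"  # signature bytes identifying the format
    magic_number_offset = 128  # the signature follows a 128-byte preamble

class DicomDir(TypedDirectory):
    content_types = (Dicom,)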
Note that only one file within the directory needs to match the specified content type for the directory to be considered a match; additional files are ignored. For example, the DicomDir type would be considered valid on the following directory structure despite the presence of the .DS_Store directory and the catalog.xml file:
dicom-directory
├── .DS_Store
│   ├── deleted-file1.txt
│   ├── deleted-file2.txt
│   └── ...
├── 1.dcm
├── 2.dcm
├── 3.dcm
├── ...
├── 1024.dcm
└── catalog.xml
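The "at least one match, everything else ignored" rule can be sketched with the standard library. This is a hedged illustration, not the library's code: `is_dicom` and `is_dicom_dir` are hypothetical helpers, and only the direct entries of the directory are inspected, which is an assumption of this sketch.

```python
import os

DICOM_MAGIC = b"DICM"
DICOM_MAGIC_OFFSET = 128  # the signature sits after a 128-byte preamble


def is_dicom(path):
    """Return True if `path` is a regular file with b"DICM" at byte offset 128."""
    if not os.path.isfile(path):
        return False
    with open(path, "rb") as f:
        f.seek(DICOM_MAGIC_OFFSET)
        # Reading past the end of a short file returns fewer bytes, so this is False
        return f.read(len(DICOM_MAGIC)) == DICOM_MAGIC


def is_dicom_dir(dirpath):
    """Return True if at least one direct entry of `dirpath` is a DICOM file.

    Entries that do not match (sub-directories, XML catalogues, ...) are ignored.
    """
    if not os.path.isdir(dirpath):
        return False
    return any(is_dicom(os.path.join(dirpath, name)) for name in os.listdir(dirpath))
```

With the directory layout shown above, `is_dicom_dir("dicom-directory")` would accept the directory as soon as one of the `.dcm` files passes the signature check, regardless of `.DS_Store` and `catalog.xml`.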
The file-sets contained within the directory can be accessed via the contents attribute:
dicom_dir = DicomDir("dicom-directory")
for dicom_file in dicom_dir.contents:
    assert isinstance(dicom_file, Dicom)
For directories that may also contain optional content, an entry in the content_types attribute can be marked as optional by combining it with None, e.g. Xml | None. The contents attribute will then include files of the optional types in addition to the required types:
from fileformats.application import Xml

class CatalogedDicomDir(TypedDirectory):
    content_types = (Dicom, Xml | None)

dicom_dir = CatalogedDicomDir("dicom-directory")
for content in dicom_dir.contents:
    assert isinstance(content, (Dicom, Xml))
In addition to statically defining TypedDirectory formats such as the DicomDir example above, dynamic directory types can be created on the fly by providing the content types as "classifier" arguments to the DirectoryOf[] class (see Classifiers), e.g.
from fileformats.generic import DirectoryOf
from fileformats.image import Png
from fileformats.text import Csv

def my_task(image_dir: DirectoryOf[Png]) -> Csv:
    ...  # task implementation
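To give a feel for how a subscripted class such as DirectoryOf[Png] can produce a new type on demand, the following sketch uses Python's `__class_getitem__` hook. It is only an illustration of the general mechanism under stated assumptions: `DirectoryOfSketch`, the `_cache` dictionary and the placeholder `Png` and `Csv` classes are hypothetical and do not reflect how FileFormats implements classifiers.

```python
class DirectoryOfSketch:
    """Hypothetical stand-in for a directory type parameterised by content types."""

    content_types = ()
    _cache = {}  # maps a tuple of content types to the subclass generated for it

    def __class_getitem__(cls, item):
        # Normalise DirectoryOfSketch[A] and DirectoryOfSketch[A, B] to a tuple
        types = item if isinstance(item, tuple) else (item,)
        if types not in cls._cache:
            name = "DirectoryOf_" + "_".join(t.__name__ for t in types)
            # Build a subclass on the fly whose content_types are the classifiers
            cls._cache[types] = type(name, (cls,), {"content_types": types})
        return cls._cache[types]


class Png:  # placeholder format class for the sketch
    pass


class Csv:  # placeholder format class for the sketch
    pass


PngDir = DirectoryOfSketch[Png]
```

Caching the generated subclass means that repeated subscripts return the same class object, so two annotations written as `DirectoryOfSketch[Png]` compare as the same type.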
Identification
The find_matching function can be used to list the formats that match a given file:
>>> from fileformats.core import find_matching
>>> find_matching(["/path/to/word.doc"])
[<class 'fileformats.application.Msword'>]
Warning
The installation of extension packages may cause detection code to break if one of the newly added formats also matches the file and your code doesn't handle this case. If you are only interested in formats covered in the main fileformats package, then you should use the standard_only flag.
For loosely defined formats without many constraints, find_matching may return multiple formats that are not plausible for the given use case. In that case, the candidates argument can be passed to restrict the possible formats that can be returned:
>>> from fileformats.datascience import MatFile, RData, Hdf5
>>> find_matching(["/path/to/text/matrix/file.mat"])
[fileformats.datascience.data.TextMatrix]
>>> find_matching(["/path/to/matlab/file.mat"])
[fileformats.datascience.data.TextMatrix, fileformats.datascience.data.MatFile]
>>> find_matching(["/path/to/matlab/file.mat"], candidates=[MatFile, RData, Hdf5])
[fileformats.datascience.data.MatFile]
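The effect of restricting the candidates can be sketched with a small registry of predicates. Everything here is hypothetical and independent of FileFormats: the registry names, the helper functions, and in particular the assumption that the "text-matrix" check looks only at the extension, which is what makes it loose enough to also accept a binary MATLAB file.

```python
HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"       # leading bytes of an HDF5 file
MATLAB_V5_PREFIX = b"MATLAB 5.0 MAT-file"  # start of the MATLAB v5 text header


def _head(path, n=32):
    """Read the first `n` bytes of the file at `path`."""
    with open(path, "rb") as f:
        return f.read(n)


def is_text_matrix(path):
    # Deliberately loose: extension only (an assumption made for this sketch)
    return path.lower().endswith(".mat")


def is_matlab_v5(path):
    return path.lower().endswith(".mat") and _head(path).startswith(MATLAB_V5_PREFIX)


def is_hdf5(path):
    return _head(path).startswith(HDF5_SIGNATURE)


REGISTRY = {
    "text-matrix": is_text_matrix,
    "matlab-v5": is_matlab_v5,
    "hdf5": is_hdf5,
}


def find_matching_sketch(path, candidates=None):
    """Return the registry names whose predicate accepts `path`.

    If `candidates` is given, only those names are tried.
    """
    names = REGISTRY if candidates is None else [n for n in candidates if n in REGISTRY]
    return [name for name in names if REGISTRY[name](path)]
```

Called without candidates on a binary MATLAB file, the sketch reports both the loose and the strict format; passing `candidates=["matlab-v5", "hdf5"]` removes the implausible one, mirroring the behaviour shown in the session above.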
from_paths can be used to return an initialised object instead of a list of matching formats. However, since you need to be confident that there is only one possible format, it is advisable to also provide a list of candidate formats:
>>> from fileformats.core import from_paths
>>> repr(from_paths(["/path/to/matlab/file.mat"], candidates=[MatFile, RData, Hdf5]))
fileformats.datascience.data.MatFile({"/path/to/matlab/file.mat"})
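The need for exactly one matching format can be made explicit in calling code. The helper below is a hypothetical sketch of such a guard, operating on a plain list of match results; it does not describe how from_paths itself reacts to zero or several matches.

```python
def select_unique(matches, path):
    """Return the single entry of `matches`, or raise if there are none or several."""
    if len(matches) == 1:
        return matches[0]
    if not matches:
        raise ValueError(f"no known format matches {path!r}")
    raise ValueError(
        f"ambiguous formats for {path!r}: {matches}; pass a narrower candidate list"
    )
```

A guard of this kind fails loudly when a newly installed extension package introduces a second matching format, rather than silently picking one of them.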