Detection ========= *FileFormats* has been designed to detect whether a set of files matches a given format specification. This can be used either be in the form of validating file types in workflows or identifying the format in which user input files have been provided. Validation ---------- In the basic case, *FileFormats* can be used for checking the format of files and directories against known types. Typically this will involve checking the file extension and magic number if applicable .. code-block:: python from fileformats.image import Jpeg jpeg_file = Jpeg("/path/to/image.jpg") # PASSES Jpeg("/path/to/image.png") # FAILS! fake_fspath = "/path/to/fake-image.jpg" with open(fake_fspath, "w") as f: f.write("this is not a valid JPEG file") Jpeg(fake_fspath) # FAILS! To check whether a format matches without attempting to initialise the object use the :meth:`FileSet.matches()` method .. code-block:: python if Jpeg.matches("/path/to/image.jpg"): ... Formats that consists of directories with specific nested file formats within them can be defined using the ``TypedDirectory`` with ``content_types`` class attribute, e.g. .. code-block:: python from fileformats.generic import File, Directory class Dicom(WithMagicNumber, File): magic_number = b"DICM" magic_number_offset = 128 class DicomDir(TypedDirectory): content_types = (Dicom,) Note that only one file within the directory needs to match the specified content type for it to be considered a match and additional files will be ignored. For example, the ``Dicom`` type would be considered valid on the following directory structure despite the presence of the ``.DS_Store`` directory and the ``catalog.xml`` file. .. code-block:: dicom-directory ├── .DS_Store │ ├── deleted-file1.txt │ ├── deleted-file2.txt │ └── ... ├── 1.dcm ├── 2.dcm ├── 3.dcm ├── ... ├── 1024.dcm └── catalog.xml The file-sets contained within the directory can be accessed via the ``contents`` attribute .. code-block:: python dicom_dir = DicomDir("dicom-directory") for dicom_file in dicom_dir.contents: assert isinstance(dicom_file, Dicom) For types with optional content types, the ``content_types`` attribute can be set to an "optional", i.e. ``Xml | None``, and the ``contents`` attribute will include these optional types in addition to the required types .. code-block:: python class CatalogedDicomDir(TypedDirectory): content_types = (Dicom, Xml | None) dicom_dir = DicomDir("dicom-directory") for dicom_file in dicom_dir.contents: assert isinstance(dicom_file, (Dicom, Xml)) In addition to statically defining `TypedDirectory` formats such as the Dicom example above, dynamic directory types can be created on the fly by providing the content types as "classifier" arguments to the `DirectoryOf[]` class (see :ref:`Classifiers`), e.g. .. code-block:: python from fileformats.generic import Directory from fileformats.image import Png from fileformats.text import Csv def my_task(image_dir: DirectoryOf[Png]) -> Csv: ... task implementation ... Identification -------------- The ``find_matching`` function can be used to list the formats that match a given file .. code-block:: >>> from fileformats.core import find_matching >>> find_matching(["/path/to/word.doc"]) [] .. warning:: The installation of extension packages may cause detection code to break if one of the newly added formats also matches the file and your code doesn't handle this case. If you are only interested in formats covered in the main fileformats package then you should use the ``standard_only`` flag For loosely formats without many constraints, ``find_matching`` may return multiple formats that are not plausible for the given use case, in which case the ``candidates`` argument can be passed to restrict the possible formats that can be returned .. code-block:: >>> from fileformats.datascience import MatFile, RData, Hdf5 >>> find_matching(["/path/to/text/matrix/file.mat"]) [fileformats.datascience.data.TextMatrix] >>> find_matching(["/path/to/matlab/file.mat"]) [fileformats.datascience.data.TextMatrix, fileformats.datascience.data.MatFile] >>> find_matching(["/path/to/matlab/file.mat"], candidates=[MatFile, RData, Hdf5]) [fileformats.datascience.data.MatFile] ``from_paths`` can be used to return an initialised object instead of a list of matching files, however, since you need to be confident that there is only than one possible format it is advisable to also provide a list of candidate formats .. code-block:: >>> from fileformats.core import from_paths >>> repr(from_paths(["/path/to/matlab/file.mat"], candidates=[MatFile, RData, Hdf5])) fileformats.datascience.data.MatFile({"/path/to/matlab/file.mat"})