Detect excel .xlsx file mimetype via PHP

开发者 (Developer), https://www.devze.com, 2023-04-01 03:10, source: web
I can't detect the MIME type of an .xlsx Excel file via PHP because it's a ZIP archive.

file utility

file file.xlsx
file.xlsx: Zip archive data, at least v2.0 to extract

PECL fileinfo

$finfo = finfo_open(FILEINFO_MIME_TYPE);
finfo_file($finfo, "file.xlsx");
application/zip

How can I validate it? Unpack it and inspect the structure? But what if it's an archive bomb?


Overview

PHP's fileinfo functions use libmagic. When libmagic reports the MIME type as "application/zip" instead of "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet", it is because its detection rules expect the files inside the ZIP archive to appear in a certain order.

This causes a problem when uploading files to services that enforce a match between file extension and MIME type. For example, MediaWiki-based wikis (written in PHP) block certain XLSX files from being uploaded because they are detected as ZIP files.

The fix is to reorder the files written to the ZIP archive so that libmagic can detect the MIME type properly.

Analyzing files

For this example, we will analyze two XLSX files: one created using openpyxl and one created using Excel.

The file list can be viewed using unzip:

$ unzip -l Openpyxl.xlsx
Archive:  Openpyxl.xlsx
  Length      Date    Time    Name
---------  ---------- -----   ----
      177  2019-12-21 04:34   docProps/app.xml
      452  2019-12-21 04:34   docProps/core.xml
    10140  2019-12-21 04:34   xl/theme/theme1.xml
    22445  2019-12-21 04:34   xl/worksheets/sheet1.xml
      586  2019-12-21 04:34   xl/tables/table1.xml
      238  2019-12-21 04:34   xl/worksheets/_rels/sheet1.xml.rels
      951  2019-12-21 04:34   xl/styles.xml
      534  2019-12-21 04:34   _rels/.rels
      552  2019-12-21 04:34   xl/workbook.xml
      507  2019-12-21 04:34   xl/_rels/workbook.xml.rels
     1112  2019-12-21 04:34   [Content_Types].xml
---------                     -------
    37694                     11 files

$ unzip -l Excel.xlsx
Archive:  Excel.xlsx
  Length      Date    Time    Name
---------  ---------- -----   ----
     1476  1980-01-01 00:00   [Content_Types].xml
      732  1980-01-01 00:00   _rels/.rels
      831  1980-01-01 00:00   xl/_rels/workbook.xml.rels
     1159  1980-01-01 00:00   xl/workbook.xml
      239  1980-01-01 00:00   xl/sharedStrings.xml
      293  1980-01-01 00:00   xl/worksheets/_rels/sheet1.xml.rels
     6796  1980-01-01 00:00   xl/theme/theme1.xml
     1540  1980-01-01 00:00   xl/styles.xml
     1119  1980-01-01 00:00   xl/worksheets/sheet1.xml
    39574  1980-01-01 00:00   docProps/thumbnail.wmf
      785  1980-01-01 00:00   docProps/app.xml
      169  1980-01-01 00:00   xl/calcChain.xml
      513  1980-01-01 00:00   xl/tables/table1.xml
      601  1980-01-01 00:00   docProps/core.xml
---------                     -------
    55827                     14 files

Notice that the file order is different.
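The same listing can be produced from Python with the standard zipfile module, which returns members in the order the archive's central directory records them (for archives written sequentially this normally matches the physical order). A minimal sketch; the helper name member_order and the in-memory sample archive are illustrative assumptions, not part of the original answer:

```python
from io import BytesIO
from zipfile import ZipFile


def member_order(source):
    """Return member names in the order the archive lists them."""
    with ZipFile(source) as zip_file:
        return zip_file.namelist()


# Build a tiny sample archive in memory so the sketch runs without any file.
sample = BytesIO()
with ZipFile(sample, "w") as zip_file:
    for name in ("docProps/app.xml", "xl/workbook.xml", "[Content_Types].xml"):
        zip_file.writestr(name, "<x/>")

sample.seek(0)
print(member_order(sample))
```

Passing "Openpyxl.xlsx" or "Excel.xlsx" instead of the in-memory buffer should reproduce the two orders shown above.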

The MIME types can be viewed using PHP:

<?php
echo mime_content_type('Openpyxl.xlsx') . "<br/>\n";
echo mime_content_type('Excel.xlsx');

or using python-magic:

pip install python-magic

on Windows:

pip install python-magic-bin==0.4.14

Code:

import magic
mime = magic.Magic(mime=True)
print(mime.from_file("Openpyxl.xlsx"))
print(mime.from_file("Excel.xlsx"))

Output:

application/zip
application/vnd.openxmlformats-officedocument.spreadsheetml.sheet

Solution

@adrilo has investigated this problem and has developed a solution.

Hey @garak,

After pulling my hair out for a few hours, I finally figured out why the mime type is wrong. It turns out the order in which the XML files gets added to the final ZIP file (an XLSX file being a ZIP file with the xlsx extension) matters for the heuristics used to detect types.

Currently, files are added in this order:

[Content_Types].xml
_rels/.rels
docProps/app.xml
docProps/core.xml
xl/_rels/workbook.xml.rels
xl/sharedStrings.xml
xl/styles.xml
xl/workbook.xml
xl/worksheets/sheet1.xml

The problem comes from inserting the "docProps" related files. It seems like the heuristic is to look at the first few bytes and check if it finds Content_Types and xl. By having the "docProps" files inserted in between, the first xl occurrence must happen outside of the first bytes the algorithm looks at and therefore concludes it's a simple zip file.

I'll try to fix this nicely

  • https://github.com/box/spout/issues/149#issuecomment-162049588
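To see why position matters: a ZIP file begins with the local file header of its first member, and that header stores the member's name as plain bytes starting at offset 30. A sniffer that only reads the start of the file therefore sees the first few member names and little else. A minimal sketch of reading that first name (the function name first_member_name is hypothetical, and this does not reproduce libmagic's actual rules):

```python
import struct
from io import BytesIO
from zipfile import ZipFile


def first_member_name(data):
    """Read the file name stored in the first local file header of a ZIP."""
    if data[:4] != b"PK\x03\x04":
        raise ValueError("data does not start with a ZIP local file header")
    # Offsets 26 and 28 hold the name length and extra-field length (little-endian).
    name_length, _extra_length = struct.unpack_from("<HH", data, 26)
    return data[30:30 + name_length].decode("utf-8", "replace")


# Sample archive built in memory; the member contents are placeholders.
buffer = BytesIO()
with ZipFile(buffer, "w") as zip_file:
    zip_file.writestr("[Content_Types].xml", "<Types/>")
    zip_file.writestr("xl/workbook.xml", "<workbook/>")

print(first_member_name(buffer.getvalue()))
```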

Fixes #149

Heuristics to detect proper mime type for XLSX files expect to see certain files at the beginning of the XLSX archive. The order in which the XML files are added therefore matters. Specifically, "[Content_Types].xml" should be added first, followed by the files located in the "xl" folder (at least 1 file).

  • https://github.com/box/spout/pull/152

According to Spout's FileSystemHelper.php:

In order to have the file's mime type detected properly, files need to be added to the zip file in a particular order. "[Content_Types].xml" then at least 2 files located in "xl" folder should be zipped first.

  • https://github.com/box/spout/blob/master/src/Spout/Writer/XLSX/Helper/FileSystemHelper.php#L382

The solution is to add the files "[Content_Types].xml", "xl/workbook.xml", and "xl/styles.xml" in that order and then the remaining files.
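Before rewriting a file, it can help to check whether its member order is likely to be a problem. The sketch below is only a rough approximation: the function name looks_detectable and the window of four entries are assumptions chosen to agree with the listings in this post, not libmagic's documented rule.

```python
CONTENT_TYPES = "[Content_Types].xml"


def looks_detectable(names, window=4):
    """Heuristic: [Content_Types].xml first, then an xl/ member within `window` entries."""
    if not names or names[0] != CONTENT_TYPES:
        return False
    return any(name.startswith("xl/") for name in names[1:window])


# Leading entries taken from the listings and the quoted comment above.
openpyxl_order = ["docProps/app.xml", "docProps/core.xml", "xl/theme/theme1.xml"]
excel_order = ["[Content_Types].xml", "_rels/.rels", "xl/_rels/workbook.xml.rels"]
fixed_order = ["[Content_Types].xml", "xl/workbook.xml", "xl/styles.xml"]

for label, names in [("openpyxl", openpyxl_order),
                     ("excel", excel_order),
                     ("fixed", fixed_order)]:
    print(label, looks_detectable(names))
```

In practice the list would come from ZipFile.namelist() on the uploaded file rather than from hard-coded samples.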

Code

This Python script rewrites an XLSX file in place so that its archive members are in the proper order.

#!/usr/bin/env python

from io import BytesIO
from zipfile import ZipFile, ZIP_DEFLATED

XL_FOLDER_NAME = "xl"

CONTENT_TYPES_XML_FILE_NAME = "[Content_Types].xml"
WORKBOOK_XML_FILE_NAME = "workbook.xml"
STYLES_XML_FILE_NAME = "styles.xml"

FIRST_NAMES = [
    CONTENT_TYPES_XML_FILE_NAME,
    f"{XL_FOLDER_NAME}/{WORKBOOK_XML_FILE_NAME}",
    f"{XL_FOLDER_NAME}/{STYLES_XML_FILE_NAME}"
]


def fix_workbook_mime_type(file_path):
    """Rewrite the XLSX at file_path in place with the key members first."""
    buffer = BytesIO()

    with ZipFile(file_path) as zip_file:
        names = zip_file.namelist()

        # Put the members the detection looks for first, skipping any that are absent.
        first_names = [name for name in FIRST_NAMES if name in names]
        remaining_names = [name for name in names if name not in FIRST_NAMES]
        ordered_names = first_names + remaining_names

        # Build the reordered archive in memory so the source file can be overwritten.
        with ZipFile(buffer, "w", ZIP_DEFLATED, allowZip64=True) as buffer_zip_file:
            for name in ordered_names:
                buffer_zip_file.writestr(name, zip_file.read(name))

    with open(file_path, "wb") as file:
        file.write(buffer.getvalue())


def main():
    fix_workbook_mime_type("File.xlsx")


if __name__ == "__main__":
    main()


I know this works for zip files, but I'm not too sure about xlsx files. It's worth a try:

To list the files in a zip archive:

$zip = new ZipArchive;
$res = $zip->open('test.zip');
if ($res === TRUE) {
    for ($i=0; $i<$zip->numFiles; $i++) {
        print_r($zip->statIndex($i));
    }
    $zip->close();
} else {
    echo 'failed, code:' . $res;
}

This will print all the files like this:

Array
(
    [name] => file.png
    [index] => 2
    [crc] => -485783131
    [size] => 1486337
    [mtime] => 1311209860
    [comp_size] => 1484832
    [comp_method] => 8
)

As you can see, it gives the size and the comp_size for each file in the archive. If it is an archive bomb, the ratio between these two numbers will be astronomical. You could simply set a limit, in megabytes, on the maximum decompressed size; if a file exceeds it, skip that file and return an error message to the user, otherwise proceed with the extraction. See the manual for more information.
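The same size check can be sketched in Python, where ZipInfo.file_size and ZipInfo.compress_size correspond to size and comp_size above. The limits and the names is_suspicious and make_zip are assumptions for illustration. Note that these sizes are read from the archive's own headers, so a real extractor should also cap the number of bytes it actually reads.

```python
from io import BytesIO
from zipfile import ZipFile, ZIP_DEFLATED

MAX_TOTAL_BYTES = 50 * 1024 * 1024  # assumed limit: 50 MiB uncompressed in total
MAX_RATIO = 100                     # assumed limit: 100:1 per member


def is_suspicious(source, max_total=MAX_TOTAL_BYTES, max_ratio=MAX_RATIO):
    """Flag archives whose declared sizes exceed the limits, without extracting."""
    total = 0
    with ZipFile(source) as zip_file:
        for info in zip_file.infolist():
            total += info.file_size
            # max(..., 1) avoids division by zero for empty members.
            ratio = info.file_size / max(info.compress_size, 1)
            if total > max_total or ratio > max_ratio:
                return True
    return False


def make_zip(payload):
    """Build a one-member archive in memory for the demonstration."""
    buffer = BytesIO()
    with ZipFile(buffer, "w", ZIP_DEFLATED) as zip_file:
        zip_file.writestr("data.bin", payload)
    buffer.seek(0)
    return buffer


print(is_suspicious(make_zip(b"hello world")))     # small, barely compressible
print(is_suspicious(make_zip(b"\0" * 5_000_000)))  # highly compressible payload
```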


Here is a wrapper that will properly identify Microsoft Office 2007 documents. It's trivial and straightforward to use, edit, and extend with more file extensions/MIME types.

function get_mimetype($filepath) {
    if(!preg_match('/\.[^\/\\\\]+$/',$filepath)) {
        return finfo_file(finfo_open(FILEINFO_MIME_TYPE), $filepath);
    }
    switch(strtolower(preg_replace('/^.*\./','',$filepath))) {
        // START MS Office 2007 Docs
        case 'docx':
            return 'application/vnd.openxmlformats-officedocument.wordprocessingml.document';
        case 'docm':
            return 'application/vnd.ms-word.document.macroEnabled.12';
        case 'dotx':
            return 'application/vnd.openxmlformats-officedocument.wordprocessingml.template';
        case 'dotm':
            return 'application/vnd.ms-word.template.macroEnabled.12';
        case 'xlsx':
            return 'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet';
        case 'xlsm':
            return 'application/vnd.ms-excel.sheet.macroEnabled.12';
        case 'xltx':
            return 'application/vnd.openxmlformats-officedocument.spreadsheetml.template';
        case 'xltm':
            return 'application/vnd.ms-excel.template.macroEnabled.12';
        case 'xlsb':
            return 'application/vnd.ms-excel.sheet.binary.macroEnabled.12';
        case 'xlam':
            return 'application/vnd.ms-excel.addin.macroEnabled.12';
        case 'pptx':
            return 'application/vnd.openxmlformats-officedocument.presentationml.presentation';
        case 'pptm':
            return 'application/vnd.ms-powerpoint.presentation.macroEnabled.12';
        case 'ppsx':
            return 'application/vnd.openxmlformats-officedocument.presentationml.slideshow';
        case 'ppsm':
            return 'application/vnd.ms-powerpoint.slideshow.macroEnabled.12';
        case 'potx':
            return 'application/vnd.openxmlformats-officedocument.presentationml.template';
        case 'potm':
            return 'application/vnd.ms-powerpoint.template.macroEnabled.12';
        case 'ppam':
            return 'application/vnd.ms-powerpoint.addin.macroEnabled.12';
        case 'sldx':
            return 'application/vnd.openxmlformats-officedocument.presentationml.slide';
        case 'sldm':
            return 'application/vnd.ms-powerpoint.slide.macroEnabled.12';
        case 'one':
            return 'application/msonenote';
        case 'onetoc2':
            return 'application/msonenote';
        case 'onetmp':
            return 'application/msonenote';
        case 'onepkg':
            return 'application/msonenote';
        case 'thmx':
            return 'application/vnd.ms-officetheme';
            //END MS Office 2007 Docs

    }
    return finfo_file(finfo_open(FILEINFO_MIME_TYPE), $filepath);
}
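The wrapper above decides by file extension alone. Where the content itself needs checking, one option is to read the package's "[Content_Types].xml" manifest and look for the workbook content type that XLSX files declare. A minimal Python sketch under stated assumptions: the function name declares_xlsx_workbook and the 1 MiB read cap are hypothetical, and the plain substring search is a crude stand-in for real XML parsing.

```python
from io import BytesIO
from zipfile import BadZipFile, ZipFile

XLSX_MAIN_TYPE = (
    b"application/vnd.openxmlformats-officedocument.spreadsheetml.sheet.main+xml"
)
MAX_MANIFEST_BYTES = 1024 * 1024  # assumed cap so a huge manifest is never fully read


def declares_xlsx_workbook(source):
    """Return True if the package manifest declares an XLSX workbook part."""
    try:
        with ZipFile(source) as zip_file:
            with zip_file.open("[Content_Types].xml") as member:
                manifest = member.read(MAX_MANIFEST_BYTES)
    except (BadZipFile, KeyError):
        # Not a ZIP at all, or a ZIP without the OOXML manifest.
        return False
    return XLSX_MAIN_TYPE in manifest


# Sample package built in memory; only the manifest matters for this check.
package = BytesIO()
with ZipFile(package, "w") as zip_file:
    zip_file.writestr(
        "[Content_Types].xml",
        b'<Types><Override PartName="/xl/workbook.xml" ContentType="'
        + XLSX_MAIN_TYPE + b'"/></Types>',
    )
package.seek(0)
print(declares_xlsx_workbook(package))
```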