[okfn-labs] File type detection in python-magic, MS Office files

Marian Steinbach marian at sendung.de
Tue Apr 16 12:55:19 UTC 2013


Hi everybody!

I am trying to guess the correct MIME type and file extension for binary
files scraped from a server, in Python. Since this must have come up in
many projects already, I'm curious whether there is a robust solution.

The python-magic module uses libmagic in the background to guess file
types. (This means that results may vary from platform to platform.)

Currently I am developing on Mac OS and I have these results for the six
most common MS Office formats (which are, besides PDF, the most important
ones for me).

.XLS: application/vnd.ms-excel (okay)

.XLSX: application/vnd.ms-excel
  expected: application/vnd.openxmlformats-officedocument.spreadsheetml.sheet

.PPT: application/msword
  expected: application/vnd.ms-powerpoint

.PPTX: application/vnd.ms-powerpoint
  expected: application/vnd.openxmlformats-officedocument.presentationml.presentation

.DOC: application/msword (okay)

.DOCX: application/msword
  expected: application/vnd.openxmlformats-officedocument.wordprocessingml.document
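
One possible workaround for the three OOXML formats (.DOCX, .XLSX, .PPTX), independent of libmagic: these files are ZIP packages that contain a [Content_Types].xml entry plus parts under word/, xl/ or ppt/. The sketch below is only a heuristic built on that layout, using the standard library. The names guess_ooxml_mime and OOXML_TYPES are made up for illustration, and it does nothing for the legacy .DOC/.XLS/.PPT formats.

```python
import io
import zipfile

# Top-level folder of the main document part, mapped to the expected MIME type.
OOXML_TYPES = {
    "word/": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    "xl/": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    "ppt/": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
}


def guess_ooxml_mime(data):
    """Return the OOXML MIME type for raw file bytes, or None if unrecognised."""
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            names = archive.namelist()
    except zipfile.BadZipFile:
        return None  # not a ZIP container at all
    if "[Content_Types].xml" not in names:
        return None  # a ZIP file, but not an OOXML package
    for prefix, mime in OOXML_TYPES.items():
        if any(name.startswith(prefix) for name in names):
            return mime
    return None
```

A scraper could call guess_ooxml_mime(raw_bytes) first and fall back to the libmagic result whenever it returns None.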

python-magic's Magic class accepts a file path to an alternative magic file.
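
A minimal sketch of that option, assuming the python-magic package and the libmagic shared library are both installed. The helper name make_detector is hypothetical, and the import guard is there only so the snippet fails with a clear message when the library is missing.

```python
try:
    import magic  # python-magic; requires the libmagic shared library
except ImportError:
    magic = None


def make_detector(magic_file=None):
    """Return a function that maps a file path to a MIME type string.

    magic_file is an optional path to an alternative magic file; with None,
    libmagic uses its default database.
    """
    if magic is None:
        raise RuntimeError("python-magic and libmagic must be installed")
    detector = magic.Magic(mime=True, magic_file=magic_file)
    return detector.from_file
```

Usage would look like detect = make_detector("/path/to/custom.magic") followed by detect("some_file.docx"), where both paths are placeholders.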

Does anybody here have experience with creating a magic file that
python-magic digests, especially one that helps properly recognize the
office formats?

Is it possible to use one magic file across different platforms? I've had
bad luck with a magic and a magic.mgc file copied from Ubuntu. Both produce
error messages.

Thanks!

Marian