Convert doc or docx files to txt in python

If you have many doc or docx files that you need to convert to txt (and maybe to HTML as well) and then, for example, load into a database, you can use the following commands and libraries.

In most cases we use tika, but textract is also an interesting library.

import textract
# process() returns bytes, so decode to get a str
text = textract.process("path/to/file.extension").decode("utf-8")
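For the many-files case from the introduction, that call can be wrapped in a loop over a folder. This is only a sketch: the helper names `textract_to_str` and `convert_folder`, the UTF-8 decoding, and the choice to name each output after the input file's stem are assumptions, not part of textract.

```python
from pathlib import Path


def textract_to_str(path):
    """Extract text with textract; process() returns bytes, so decode it."""
    import textract  # imported lazily so convert_folder works without it

    return textract.process(str(path)).decode("utf-8")


def convert_folder(src_dir, dst_dir, extract=textract_to_str):
    """Write one .txt file into dst_dir for every regular file in src_dir.

    `extract` is any callable that takes a path and returns a str.
    Returns the list of written .txt paths.
    """
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    written = []
    for src in sorted(Path(src_dir).iterdir()):
        if not src.is_file():
            continue  # skip sub-folders
        target = dst / (src.stem + ".txt")
        target.write_text(extract(src), encoding="utf-8")
        written.append(target)
    return written
```

Calling `convert_folder("documents", "text")` (both folder names are hypothetical) would write `text/report.txt` for `documents/report.docx`. Two inputs with the same stem, such as `report.doc` and `report.docx`, would overwrite each other, so adjust the naming if that can happen.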

.docx via python-docx2txt

If you only need .docx, you can use this package directly, without textract.

https://github.com/ankushshah89/python-docx2txt

sudo pip3 install docx2txt

A minimal script looks like this:

import docx2txt

# The optional second argument is a directory where embedded images are extracted.
text = docx2txt.process("job today.docx", "text/")

print(text)

Run it to test:

python3 convert2text.py
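The introduction mentions loading the converted text into a database. As a rough sketch of that step, the following stores extracted text in SQLite using the standard-library `sqlite3` module. The table name `documents`, its columns, and the helper `store_text` are made up for illustration. The `text` argument is assumed to be the string returned by `docx2txt.process`.

```python
import sqlite3


def store_text(conn, filename, text):
    """Insert or update the extracted text for one file."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS documents ("
        "filename TEXT PRIMARY KEY, body TEXT NOT NULL)"
    )
    # INSERT OR REPLACE keeps a single row per filename on repeated runs.
    conn.execute(
        "INSERT OR REPLACE INTO documents (filename, body) VALUES (?, ?)",
        (filename, text),
    )
    conn.commit()
```

With `conn = sqlite3.connect("documents.db")` (a hypothetical file name), calling `store_text(conn, "job today.docx", text)` saves the result of the script above. Running it again for the same file replaces the old row instead of adding a duplicate.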

For reference, textract supports the following file types:

.csv via python builtins
.doc via antiword
.docx via python-docx2txt
.eml via python builtins
.epub via ebooklib
.gif via tesseract-ocr
.jpg and .jpeg via tesseract-ocr
.json via python builtins
.html and .htm via beautifulsoup4
.mp3 via sox, SpeechRecognition, and pocketsphinx
.msg via msg-extractor
.odt via python builtins
.ogg via sox, SpeechRecognition, and pocketsphinx
.pdf via pdftotext (default) or pdfminer.six
.png via tesseract-ocr
.pptx via python-pptx
.ps via ps2text
.rtf via unrtf
.tiff and .tif via tesseract-ocr
.txt via python builtins
.wav via SpeechRecognition and pocketsphinx
.xlsx via xlrd
.xls via xlrd

For more advanced cases, when you have to support many file formats such as PDF, DOC, etc., tika is the best tool I have seen so far.

https://github.com/chrismattmann/tika-python

It runs a small server written in Java, so when I only convert docx files it is a bit of an overkill.
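For comparison, a minimal tika-python sketch could look like the following. It assumes the `tika` package and a Java runtime are installed. The helper names `clean_content` and `tika_to_str` are mine, not part of the library. `parser.from_file` returns a dict whose `"content"` entry can be `None` when no text was found, which is why the result is normalised first.

```python
def clean_content(content):
    """Turn Tika's content value (a str or None) into a stripped str."""
    return (content or "").strip()


def tika_to_str(path):
    """Extract plain text from a file via tika-python."""
    from tika import parser  # lazy import: only needed when actually called

    # The first call starts the local Tika server, so it can be slow.
    parsed = parser.from_file(str(path))  # dict with "metadata" and "content"
    return clean_content(parsed.get("content"))
```

`tika_to_str("job today.docx")` would then return the same kind of string as the docx2txt script, at the cost of the Java server start-up.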
