When you have many documents that you need to convert to plain text (and maybe to HTML) and then, for example, store in a database, you can use the following commands and libraries.
In most cases we use tika, but there is also another interesting library, textract.
import textract
text = textract.process("path/to/file.extension")
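To cover the "many files into a database" part, here is a minimal sketch that loops over files and stores the extracted text in SQLite. The function name `store_texts`, the `documents` table, and the default `documents.db` file are my own assumptions for illustration; the extractor is passed in as a function so the same loop works with any of the libraries in this post.

```python
import sqlite3
from typing import Callable, Iterable


def store_texts(paths: Iterable[str], extract: Callable[[str], str],
                db_path: str = "documents.db") -> int:
    """Extract text from each file and save it in a SQLite table.

    `extract` is any function that maps a file path to a string.
    Returns the number of files processed.
    """
    conn = sqlite3.connect(db_path)
    try:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS documents ("
            "path TEXT PRIMARY KEY, content TEXT)"
        )
        count = 0
        for path in paths:
            text = extract(str(path))
            # INSERT OR REPLACE lets the script be re-run on the same files.
            conn.execute(
                "INSERT OR REPLACE INTO documents (path, content) VALUES (?, ?)",
                (str(path), text),
            )
            count += 1
        conn.commit()
        return count
    finally:
        conn.close()
```

With textract installed, the extractor could be something like `lambda p: textract.process(p).decode("utf-8")`, because `textract.process` returns bytes rather than a string.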
.docx via python-docx2txt
textract handles .docx files through python-docx2txt, so if .docx is all you need, you can use that package directly without textract:
https://github.com/ankushshah89/python-docx2txt
sudo pip3 install docx2txt
A minimal script looks like this:
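import docx2txt
# The second argument is the directory where embedded images are extracted.
text = docx2txt.process("job today.docx", "text/")
# Drop into the debugger to inspect `text`; remove this line in real use.
import pdb;pdb.set_trace()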
Save it as convert2text.py and run it to test:
python3 convert2text.py
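If you are curious why such a small package is enough for .docx: a .docx file is a zip archive, and the body text lives in `word/document.xml`. The sketch below reads it with only the standard library. It is a simplified illustration, not docx2txt's actual code, and the function name `docx_to_text` is my own; it ignores headers, footers, tabs, and images.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the elements inside word/document.xml.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_to_text(path: str) -> str:
    """Return the body text of a .docx file, one paragraph per line."""
    with zipfile.ZipFile(path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # A paragraph is split into runs; each w:t element holds a piece of text.
        pieces = [node.text or "" for node in para.iter(W_NS + "t")]
        paragraphs.append("".join(pieces))
    return "\n".join(paragraphs)
```

For real documents I would still reach for docx2txt, which also covers the parts this sketch skips.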
.csv via python builtins
.doc via antiword
.docx via python-docx2txt
.eml via python builtins
.epub via ebooklib
.gif via tesseract-ocr
.jpg and .jpeg via tesseract-ocr
.json via python builtins
.html and .htm via beautifulsoup4
.mp3 via sox, SpeechRecognition, and pocketsphinx
.msg via msg-extractor
.odt via python builtins
.ogg via sox, SpeechRecognition, and pocketsphinx
.pdf via pdftotext (default) or pdfminer.six
.png via tesseract-ocr
.pptx via python-pptx
.ps via ps2text
.rtf via unrtf
.tiff and .tif via tesseract-ocr
.txt via python builtins
.wav via SpeechRecognition and pocketsphinx
.xlsx via xlrd
.xls via xlrd
For more advanced cases, when you have to support a lot of file formats such as PDF, DOC, etc., tika is the best tool I have seen so far.
It runs a small server written in Java, so when I only convert .docx files it is a bit of overkill.
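For completeness, a minimal sketch of the tika route, assuming the `tika` Python package (`pip3 install tika`) and a Java runtime are available. The helper names `content_from_parsed` and `tika_to_text` are my own; `parser.from_file` is the call from the tika package, which returns a dict with "metadata" and "content" keys.

```python
def content_from_parsed(parsed: dict) -> str:
    """Pull the text out of a tika result; "content" can be None for empty files."""
    return (parsed.get("content") or "").strip()


def tika_to_text(path: str) -> str:
    # Imported here so this module loads even where tika is not installed.
    from tika import parser

    # from_file() talks to the local Tika server, starting it if needed,
    # and returns a dict with "metadata" and "content" keys.
    return content_from_parsed(parser.from_file(path))
```

The first call is slow because the Java server has to start, which is the overhead I mean when only .docx files are involved.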