Extract text from PDF files
Adobe’s Portable Document Format (PDF) has become very popular in recent years and is the number one format for easy document exchange. It comes with great features such as embeddable images and multimedia, but it also has some rather unpleasant properties. The so-called security features amount to a simple Digital Rights Management (DRM) system that lets PDF authors restrict how a file is used: they can allow or deny actions such as printing, commenting, or copying content.
Even though this is a good idea in some situations, most of the time it’s just annoying: collecting ideas for a seminar paper or a thesis, for instance, is almost impossible without being able to copy and paste paragraphs from the PDF.
Fortunately, Linux offers a simple tool for this problem: pdftotext. This command-line tool extracts all text from the PDF file and saves it to a given text file.
Installation
The tool is part of the package poppler-utils and can be installed via your favorite package manager, e.g. apt-get:
apt-get install poppler-utils
Extract text from PDF files
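Extraction is just as simple. The man page gives the syntax: pdftotext [options] <PDF> [<text-file>]. If the text file is omitted, pdftotext writes the output next to the PDF, using the same name with a .txt extension.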
pdftotext PDF-file-with-copy-and-paste-restriction.pdf
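The [options] part of the syntax is useful when only some pages are needed. The following is a minimal sketch, not something taken from the man page: the helper name extract_pages is made up for illustration, while -f, -l, -layout and the "-" output argument are regular pdftotext options.

```shell
# Print pages FIRST..LAST of a PDF to standard output, keeping the layout.
# extract_pages is a hypothetical helper name, not a poppler-utils command.
extract_pages() {
    first=$1
    last=$2
    pdf=$3
    # -f / -l select the first and last page to convert,
    # -layout tries to preserve the physical layout of the page,
    # and "-" as the text file sends the result to stdout.
    pdftotext -f "$first" -l "$last" -layout "$pdf" -
}
```

For example, extract_pages 2 5 thesis-source.pdf | less would page through the text of pages 2 to 5 without creating a file (thesis-source.pdf is a placeholder name).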
In case you’d like to do this for every PDF file in a folder (including its subfolders), simply run:
find . -name '*.pdf' -exec pdftotext "{}" \;
After executing the command, there will be a *.txt file for each PDF file in the folder and its subfolders, containing the plain text of the corresponding PDF file.
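If the command is run repeatedly on a growing collection, it may be convenient to skip PDFs that already have a text file. The following is a rough sketch under a few assumptions: the function names txt_path_for and convert_tree are invented for this example, file names contain no newlines, and all PDFs use a lowercase .pdf extension.

```shell
# Sketch: convert every PDF below a directory, skipping ones already converted.
# txt_path_for and convert_tree are illustrative names, not part of poppler-utils.

# Map "dir/name.pdf" to "dir/name.txt", the name pdftotext picks by default.
txt_path_for() {
    printf '%s\n' "${1%.pdf}.txt"
}

# Convert each PDF under the given directory (default: current directory).
convert_tree() {
    find "${1:-.}" -type f -name '*.pdf' | while IFS= read -r pdf; do
        txt=$(txt_path_for "$pdf")
        # Skip PDFs that already have a text file next to them.
        if [ -e "$txt" ]; then
            continue
        fi
        # Report failures on stderr but keep going with the remaining files.
        pdftotext "$pdf" "$txt" || printf 'failed: %s\n' "$pdf" >&2
    done
}
```

Calling convert_tree ~/papers would then only convert files added since the last run (~/papers is a placeholder path). Deleting a .txt file forces its PDF to be converted again.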