Extract text from PDF files

Aug 09 / 2009
Linux, Office, PDF

Adobe’s Portable Document Format (PDF) has become very popular in recent years and is the number one format for easy document exchange. It comes with great features such as embeddable images and multimedia, but it also has some rather unpleasant properties. The so-called Security Features amount to a simple Digital Rights Management (DRM) system that lets PDF authors restrict how a file is used: they can allow or deny actions such as printing, commenting, or copying content.

Even though this is a good idea in some situations, most of the time it’s just annoying: collecting ideas for a seminar paper or a thesis, for instance, is almost impossible without being able to copy and paste certain paragraphs from the PDF.

Fortunately, Linux offers a simple tool for this problem: pdftotext. This command-line tool extracts all text from the PDF file and saves it to a given text file.

Installation

The tool is part of the poppler-utils package and can be installed with your favorite package manager, e.g. apt-get (run it as root or via sudo):

apt-get install poppler-utils
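
Once the package is installed, the basic call takes the PDF as its first argument and the target text file as its second. The following is only a minimal sketch: the helper names txt_name and extract_text and the file name thesis-source.pdf are made up for illustration and are not part of poppler-utils.

```shell
#!/bin/sh
# Hypothetical helper: derive the output name by swapping
# a trailing .pdf for .txt (paper.pdf -> paper.txt).
txt_name() {
    printf '%s\n' "${1%.pdf}.txt"
}

# Hypothetical wrapper: extract the text of one PDF into a
# text file with the same base name, using the
# "pdftotext PDF-file text-file" form of the command.
extract_text() {
    pdftotext "$1" "$(txt_name "$1")"
}

# Example call (assumes thesis-source.pdf exists in the current directory):
# extract_text thesis-source.pdf
```

Wrapping the call in a small function is just a convenience for converting several files in a row; running pdftotext directly with two file names does the same thing.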
