My name is Philipp C. Heckel and I write about nerdy things.

Extract text from PDF files





Adobe’s Portable Document Format (PDF) has become very popular over the last few years and is the number one format for easy document exchange. It comes with great features such as embeddable images and multimedia, but it also has some rather unpleasant properties. The so-called security features represent a simple Digital Rights Management (DRM) system and allow PDF authors to restrict how a file is used. Using the DRM system, authors can allow or deny actions such as printing a file, commenting, or copying content.

Even though this is a good idea in some situations, most of the time it’s just annoying: collecting ideas for seminar papers or a thesis, for instance, is almost impossible without being able to copy and paste certain paragraphs from the PDF.

Fortunately, Linux can solve this problem with a simple tool called pdftotext. This command line tool extracts all text from the PDF file and saves it to a given text file.

Installation

The tool is part of the poppler-utils package and can be installed with your favorite package manager, e.g. apt-get.
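The original command did not survive on this page, so here is a minimal sketch of the installation step. It assumes a Debian or Ubuntu system with apt-get and sudo rights; the function name install_pdftotext is made up for this example.

```shell
# Install poppler-utils unless pdftotext is already available.
# Assumption: Debian/Ubuntu with apt-get; other distributions use
# their own package manager, but the package is usually called the same.
install_pdftotext() {
    if command -v pdftotext >/dev/null 2>&1; then
        echo "pdftotext is already installed"
    else
        sudo apt-get install poppler-utils
    fi
}
```

Running install_pdftotext once should be enough. If you don't need the check, the single line sudo apt-get install poppler-utils does the same job.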

Extract text from PDF files

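This is also pretty simple, and the man page gives the synopsis: pdftotext [options] <PDF> [<text-file>]. If the text file is omitted, pdftotext writes the output to a file named like the PDF, but with a .txt extension.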
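To make the synopsis a bit more concrete, here is a small sketch that extracts only a page range. The options -layout (keep the physical layout as well as possible), -f (first page) and -l (last page) are regular pdftotext options; the function name extract_pages and the file names are only placeholders for this example.

```shell
# Extract pages <first>..<last> of a PDF into a text file.
# If no output name is given, reuse the PDF name with a .txt extension.
extract_pages() {
    first="$1"
    last="$2"
    pdf="$3"
    out="${4:-${pdf%.pdf}.txt}"
    pdftotext -layout -f "$first" -l "$last" "$pdf" "$out"
}
```

For instance, extract_pages 2 5 thesis.pdf would write pages 2 to 5 of thesis.pdf to thesis.txt. For a whole document, a plain pdftotext thesis.pdf is all you need.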

In case you’d like to do this for every PDF file in a folder, including its subfolders, simply call pdftotext once for each file it contains.
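The original snippet is missing from this page, so the following is only one possible way to do it, using find. It relies on pdftotext choosing the output name itself when no text file is given; the function name convert_all_pdfs is made up for this example.

```shell
# Convert every *.pdf below the given folder (default: current folder).
# find walks the folder tree recursively and runs pdftotext once per file;
# without a second argument, pdftotext writes <name>.txt next to <name>.pdf.
convert_all_pdfs() {
    dir="${1:-.}"
    find "$dir" -type f -name '*.pdf' -exec pdftotext {} \;
}
```

Called without arguments, convert_all_pdfs starts in the current folder. Be aware that existing .txt files with the same names would be overwritten.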

After executing the command, there will be a .txt file for each PDF file in the folder, containing the plain text of the corresponding PDF file.
