Intelligence is simply talking to many people

Convert pdf to text (multi-step process)

Uses qpdf, imagemagick, ocrmypdf, and pdftotext.
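Before starting, it can help to confirm that all four tools are on the PATH. The following is a minimal sketch; `missing_tools` is a hypothetical helper name, and it assumes ImageMagick is installed under its classic `convert` command name (newer releases also ship a `magick` command).

```shell
# missing_tools is a hypothetical helper: it prints the name of every
# command in its argument list that cannot be found on the PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Empty output means everything the steps below need is installed.
missing_tools qpdf convert ocrmypdf pdftotext
```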


0 Rename pdf to orig.pdf

1 Extract the chapter's pages from the pdf

  • qpdf orig.pdf --pages . 100-110 -- just-the-chapter.pdf

2 Create vr.pdf (VeryReadable)

  • convert -density 288 just-the-chapter.pdf output-%02d.jpg
  • convert output*.jpg -level 25% final-%02d.jpg
  • convert final*.jpg vr.pdf

3 Create text layer on vr.pdf as a new pdf

  • ocrmypdf -l spa vr.pdf vr-layer-spa.pdf

4 Convert that new pdf to .txt

  • pdftotext -layout vr-layer-spa.pdf chapter-spa.txt

5 Delete everything except the new txt file and the new layered pdf

  • rm output* && rm final* && rm vr.pdf && rm just-the-chapter.pdf

All at once:

qpdf orig.pdf --pages . 318-419 -- just-the-chapters.pdf && convert -density 288 just-the-chapters.pdf output-%03d.jpg && convert output*.jpg -level 25% final-%03d.jpg && convert final*.jpg vr.pdf && ocrmypdf -l spa vr.pdf bookonech9-10.pdf && pdftotext -layout bookonech9-10.pdf bookonech9-10.txt && rm output* && rm final* && rm vr.pdf && rm just-the-chapters.pdf
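The same chain can also be wrapped in a small shell function so that the source file, page range, and output name become arguments. This is only a sketch under a few assumptions: the name `pdf_chapter_to_text`, its argument order, and the scratch directory are illustrative choices and not part of any of the tools, and the page images are numbered with three digits so that ranges longer than 100 pages stay in order when the glob is expanded.

```shell
# Hypothetical wrapper around the steps above.
# Usage: pdf_chapter_to_text SOURCE.pdf PAGE-RANGE OUTPUT-BASENAME [OCR-LANG]
# The body runs in a subshell, so its variables do not leak to the caller.
pdf_chapter_to_text() (
    src=$1
    pages=$2
    out=$3
    ocr_lang=${4:-spa}   # defaults to Spanish, as in the steps above

    [ -f "$src" ] || { echo "no such file: $src" >&2; exit 1; }

    # Intermediate files go into a scratch directory, so cleanup is one rm.
    work=$(mktemp -d) || exit 1

    qpdf "$src" --pages . "$pages" -- "$work/chapter.pdf" &&
    convert -density 288 "$work/chapter.pdf" "$work/page-%03d.jpg" &&
    convert "$work"/page-*.jpg -level 25% "$work/final-%03d.jpg" &&
    convert "$work"/final-*.jpg "$work/vr.pdf" &&
    ocrmypdf -l "$ocr_lang" "$work/vr.pdf" "$out.pdf" &&
    pdftotext -layout "$out.pdf" "$out.txt"
    status=$?

    # Remove the scratch files whether or not the chain succeeded.
    rm -rf "$work"
    exit "$status"
)
```

With the function defined, the example above would become: pdf_chapter_to_text orig.pdf 318-419 bookonech9-10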
