Uses qpdf, ImageMagick (the convert command), ocrmypdf, and pdftotext.
0 Rename pdf to orig.pdf
1 Extract the chapter's pages from the pdf
- qpdf orig.pdf --pages . 100-110 -- just-the-chapter.pdf
2 Create vr.pdf (VeryReadable)
- convert -density 288 just-the-chapter.pdf output-%02d.jpg
- convert output*.jpg -level 25% final-%02d.jpg
- convert final*.jpg vr.pdf
3 Create text layer on vr.pdf as a new pdf
- ocrmypdf -l spa vr.pdf vr-layer-spa.pdf
4 Convert that new pdf to .txt
- pdftotext -layout vr-layer-spa.pdf chapter-spa.txt
5 Delete everything except the new txt file and the new layered pdf
- rm output* && rm final* && rm vr.pdf && rm just-the-chapter.pdf
All at once:
qpdf orig.pdf --pages . 318-419 -- just-the-chapters.pdf && convert -density 288 just-the-chapters.pdf output-%03d.jpg && convert output*.jpg -level 25% final-%03d.jpg && convert final*.jpg vr.pdf && ocrmypdf -l spa vr.pdf bookonech9-10.pdf && pdftotext -layout bookonech9-10.pdf bookonech9-10.txt && rm output* && rm final* && rm vr.pdf && rm just-the-chapters.pdf
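The same sequence can also be kept as a small script, so the page range, OCR language, and output name are arguments instead of being edited into the one-liner each time. This is only a sketch: the names ocr_chapter, run, and DRY_RUN are made up for illustration, and it assumes all four tools are on PATH. It pads the image numbers to three digits so that the shell glob still lists the pages in order when a range runs past 100 pages.

```shell
#!/usr/bin/env bash
# Sketch only: wraps the steps above in one function.
# Assumes qpdf, ImageMagick (convert), ocrmypdf and pdftotext are on PATH.
# ocr_chapter, run and DRY_RUN are hypothetical names, not part of any tool.
set -eu

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Usage: ocr_chapter INPUT.pdf PAGE-RANGE OCR-LANG OUTPUT-BASENAME
ocr_chapter() {
  local input=$1 pages=$2 lang=$3 out=$4
  # 1. Extract the chapter's pages.
  run qpdf "$input" --pages . "$pages" -- just-the-chapter.pdf
  # 2. Rasterize, adjust levels, and rebuild a readable pdf.
  #    %03d keeps the glob order correct past 100 pages.
  run convert -density 288 just-the-chapter.pdf output-%03d.jpg
  run convert output-*.jpg -level 25% final-%03d.jpg
  run convert final-*.jpg vr.pdf
  # 3. Add the OCR text layer.
  run ocrmypdf -l "$lang" vr.pdf "$out.pdf"
  # 4. Dump the text layer to a .txt file.
  run pdftotext -layout "$out.pdf" "$out.txt"
  # 5. Remove the intermediate files.
  run rm -f output-*.jpg final-*.jpg vr.pdf just-the-chapter.pdf
}

# Run only when four arguments are given, so the file can also be sourced.
if [ "$#" -eq 4 ]; then
  ocr_chapter "$@"
fi
```

Saved under a name of your choice (say ocr-chapter.sh, also hypothetical), it would be called as: bash ocr-chapter.sh orig.pdf 100-110 spa chapter-spa. Setting DRY_RUN=1 in front of that prints the commands without running them, which is a cheap way to check the page range first.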