Splitting a large PDF file: need a trick to recognize non-legible text

This question is probably more about PDF than Python, but there doesn’t seem to be a PDF subreddit.

I’ve got a 6207-page PDF document that consists of a large number of concatenated PDF files, and I want to split them up.

The problem is that the text I extract from some of the pages isn’t legible, maybe due to embedded fonts?

Other pages are simply images of text.

I could OCR the whole thing, but that’s too simple and also very time-consuming. Ideally, I’d like to test whether a page’s extracted text is legible, so my program can choose whether to just extract the text or to OCR the page.
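To make the idea concrete, here is a rough sketch of the kind of test I have in mind. It only looks at the string that comes out of text extraction; getting that string (for example with a PDF library's per-page text extraction) and the OCR step itself are left out. Everything here is an assumption for illustration: the function names `legibility_score` and `needs_ocr`, the small word list, the thresholds, and the premise that the documents are in English.

```python
import re

# Assumption: the documents are English, so a legible page should contain
# a fair share of very common English words.
COMMON_WORDS = {
    "the", "and", "of", "to", "in", "a", "is", "that", "for", "on",
    "with", "as", "at", "by", "this", "be", "are", "or", "from", "it",
}

# Punctuation that is normal in ordinary prose.
ORDINARY_PUNCT = ".,;:!?'\"()-/%&"


def legibility_score(text: str) -> float:
    """Return a rough score between 0 and 1 for how much text looks like prose."""
    stripped = text.strip()
    if not stripped:
        return 0.0

    # Share of characters that are letters, digits, whitespace or plain punctuation.
    ordinary = sum(
        ch.isalnum() or ch.isspace() or ch in ORDINARY_PUNCT for ch in stripped
    )
    char_ratio = ordinary / len(stripped)

    # Share of alphabetic tokens that are very common English words.
    words = re.findall(r"[A-Za-z]+", stripped.lower())
    if not words:
        return 0.0
    common_ratio = sum(w in COMMON_WORDS for w in words) / len(words)

    # Hypothetical scaling: treat 20% common words as "fully legible".
    return char_ratio * min(1.0, common_ratio / 0.2)


def needs_ocr(text: str, threshold: float = 0.5, min_chars: int = 20) -> bool:
    """Decide whether a page should be sent to OCR instead of using its text."""
    # Pages that are just images of text usually yield little or no extracted text.
    if len(text.strip()) < min_chars:
        return True
    return legibility_score(text) < threshold
```

The main program would call `needs_ocr(extracted_text)` once per page and only OCR the pages where it returns `True`. Both thresholds are guesses and would need tuning on a sample of pages that are known to be good or bad.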

Any ideas, reddit?

submitted by /u/spirito_santo
