This question is probably more about .pdf than python, but there doesn’t seem to be a pdf subreddit.
I’ve got a 6207-page .pdf document that consists of a large number of concatenated .pdf files, and I want to split them up.
The problem is, some of the pages aren’t legible when I extract the text, maybe due to embedded fonts?
Other pages are simply images of text.
I could OCR the whole thing, but that’s too simple, and it’s also very time consuming. Ideally I’d like to test whether a page is legible, so my program can choose between just extracting the text and OCRing the page.
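To make the idea concrete, here is a rough sketch of the kind of test I mean. It only looks at the string that comes out of text extraction and guesses whether it reads like real text. The function names (`looks_legible`, `choose_method`), the thresholds, and the "a word contains a vowel" rule are all assumptions made up for illustration and tuned for English, so treat this as a starting point rather than a proven method.

```python
import re
import string

# Characters that ordinary English prose is expected to contain (an assumption;
# extend this set for other languages or symbol-heavy documents).
_EXPECTED = set(string.ascii_letters + string.digits + string.punctuation
                + " \t\r\n" + "’‘“”–")

# Some extractors emit "(cid:123)" placeholders for glyphs they cannot map.
_CID = re.compile(r"\(cid:\d+\)")

# A "word-like" token: letters containing at least one vowel, or a number.
_WORDLIKE = re.compile(r"(?=[A-Za-z'’-]*[aeiouyAEIOUY])[A-Za-z'’-]+|\d[\d.,:/-]*")


def looks_legible(text, min_chars=50, min_char_ratio=0.9, min_word_ratio=0.6):
    """Guess whether extracted page text is readable (hypothetical heuristic)."""
    # Count each unmapped-glyph placeholder as a single bad character.
    stripped = _CID.sub("\ufffd", text).strip()
    if len(stripped) < min_chars:
        return False  # empty or nearly empty: probably an image-only page

    # Share of characters that belong to the expected alphabet.
    char_ratio = sum(ch in _EXPECTED for ch in stripped) / len(stripped)

    # Share of tokens that look like words or numbers once punctuation is trimmed.
    words = stripped.split()
    trim = string.punctuation + "’‘“”"
    wordlike = sum(bool(_WORDLIKE.fullmatch(w.strip(trim))) for w in words)
    word_ratio = wordlike / len(words)

    return char_ratio >= min_char_ratio and word_ratio >= min_word_ratio


def choose_method(text):
    """Return "extract" if the text looks usable, otherwise "ocr"."""
    return "extract" if looks_legible(text) else "ocr"
```

The `text` argument would be whatever the extraction step returns for one page, for example `page.extract_text()` if pypdf is used. A check like this will miss pages where the font mapping is wrong but the output still happens to look word-like, so spot-checking a sample of pages by hand is probably still needed.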
Any ideas, reddit?
submitted by /u/spirito_santo
r/learnpython