Hi everyone,
I’m working on a project where I need to extract text from a PDF file using PyMuPDF. Here’s the basic breakdown of some of my code:
    import pymupdf

    pdf_path = "Trial.pdf"
    doc = pymupdf.open(pdf_path)
    # Concatenate the plain text of every page into one string
    text_from_pdf = ''.join(page.get_text() for page in doc)
    doc.close()
    print(repr(text_from_pdf))
This works fine to extract the text, but the issue is that PyMuPDF seems to only use \n for all breaks, both for line breaks within paragraphs and for full paragraph breaks with actual white space between them.
Is there any way to use PyMuPDF to split the text into the actual original paragraphs? I know this is possible with other libraries like pymupdf4llm and others, but I prefer to stick with PyMuPDF because it’s more widely used and pretty well maintained as far as I can tell.
Does anyone know if there’s a way to achieve this with PyMuPDF, or perhaps a clever workaround to identify and split paragraphs after the text is extracted?
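One direction that might work, as a rough sketch rather than a confirmed solution: `page.get_text("blocks")` returns tuples of the form `(x0, y0, x1, y1, text, block_no, block_type)`, and the text blocks often (though not always) line up with paragraphs. The helper names `blocks_to_paragraphs` and `extract_paragraphs` below are made up for illustration, and the assumption that one block equals one paragraph depends on the PDF's layout.

```python
def blocks_to_paragraphs(blocks):
    """Turn block tuples (x0, y0, x1, y1, text, block_no, block_type)
    into a list of paragraph strings.

    Assumption: each text block corresponds to one paragraph, which
    depends on how the PDF was laid out.
    """
    paragraphs = []
    for block in blocks:
        text, block_type = block[4], block[6]
        if block_type != 0:  # 0 = text block, 1 = image block
            continue
        # Collapse the line breaks inside the block (and any repeated
        # whitespace) into single spaces.
        paragraph = " ".join(text.split())
        if paragraph:
            paragraphs.append(paragraph)
    return paragraphs


def extract_paragraphs(pdf_path):
    """Collect paragraph candidates from every page of a PDF."""
    # Imported here so the helper above can be tried on plain tuples
    # without PyMuPDF installed.
    import pymupdf

    paragraphs = []
    with pymupdf.open(pdf_path) as doc:
        for page in doc:
            paragraphs.extend(blocks_to_paragraphs(page.get_text("blocks")))
    return paragraphs
```

Calling `extract_paragraphs("Trial.pdf")` would then give a list with one string per block instead of one long string. Words hyphenated across lines would still come out split (with a space after the hyphen), so that case would need extra handling.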
Thanks in advance!
submitted by /u/AnterosNL in r/learnpython