Hey Everyone,
I'm new to Gen AI and working on my second project, a healthcare app that provides financial advice to patients. I need to train the model on data from different insurance policies that define the prices for different procedures. The data is in tabular format inside PDFs. Every PDF has a different table structure and different columns, and most PDFs contain a single table that continues across several pages. I have tried unstructured, camelot, llamaparse, pymupdf4llm, and img2table to preprocess the files. Some of them worked, but the tables lost their semantics once converted to markdown, which showed up when querying.
I got the best PDF-to-markdown results with pymupdf4llm and llamaparse, but I need guidance on how to proceed from there. With markdown, it's difficult to retrieve data from dynamic tables (ones that continue onto the following pages), because the continuation pages have no headers. I'd be very grateful if someone could help me out and point me in the right direction. How should I proceed with chunking? Or is there a better way to preprocess the data?
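To make the headerless-continuation problem concrete, here is a minimal sketch of one possible direction, not a tested solution: carry the last seen header forward onto continuation fragments, then emit one self-describing chunk per row. It assumes each page's markdown is available as a separate string and that continuation fragments have no header or separator line at all. Some converters promote the first data row of a continuation page to a header, and that case is not handled here. The function names, the sample rows, and `policy_a.pdf` are all made up for illustration.

```python
from typing import Dict, List, Optional


def parse_row(line: str) -> List[str]:
    """Split one markdown table line into trimmed cell values."""
    return [cell.strip() for cell in line.strip().strip("|").split("|")]


def is_separator(line: str) -> bool:
    """True for a markdown header separator row such as |---|:--:|."""
    cells = parse_row(line)
    return all(cell and "-" in cell and set(cell) <= set("-: ") for cell in cells)


def rows_with_headers(pages: List[str]) -> List[Dict[str, str]]:
    """Attach column names to every row, reusing the last header seen.

    Assumption: a fragment starts a new table only when its second
    table line is a separator row; otherwise it continues the previous one.
    """
    header: Optional[List[str]] = None
    records: List[Dict[str, str]] = []
    for page in pages:
        lines = [ln for ln in page.splitlines() if ln.strip().startswith("|")]
        if len(lines) >= 2 and is_separator(lines[1]):
            header = parse_row(lines[0])
            lines = lines[2:]
        if header is None:
            continue  # no header seen yet, so the rows cannot be labeled
        for line in lines:
            cells = parse_row(line)
            # Skip stray separators, repeated headers, and rows that do not fit.
            if is_separator(line) or cells == header or len(cells) != len(header):
                continue
            records.append(dict(zip(header, cells)))
    return records


def row_to_chunk(record: Dict[str, str], source: str) -> str:
    """Render one row as a standalone text chunk that carries its column names."""
    fields = "; ".join(f"{name}: {value}" for name, value in record.items())
    return f"[{source}] {fields}"


# Hypothetical two-page table: the second page has no header row.
page_1 = "| Procedure | Code | Price |\n|---|---|---|\n| X-ray | P001 | 120 |"
page_2 = "| MRI scan | P002 | 900 |\n| Blood test | P003 | 35 |"

records = rows_with_headers([page_1, page_2])
chunks = [row_to_chunk(r, "policy_a.pdf") for r in records]
print(chunks[1])  # [policy_a.pdf] Procedure: MRI scan; Code: P002; Price: 900
```

Chunking per row like this means every chunk keeps its column names, so a row from page 5 can be retrieved on its own. Since the data is prices per procedure, loading `records` into a DataFrame or a SQL table and querying it directly may also be worth comparing against embedding-based retrieval.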
submitted by /u/idcmuch1805 to r/learnpython