Abstract:
This study explores automated extraction and analysis of tabular data from research papers to streamline researchers’
workflows. It integrates Generative AI and Optical Character Recognition within an end-to-end pipeline applied to over
1,200 open-access Artificial Intelligence and Machine Learning papers. PDF files were first converted into high-resolution
images, after which tables were detected using a fine-tuned YOLOv8 model. Text from the detected tables was extracted
using Tesseract OCR, and performance-related data was filtered and analyzed using Retrieval-Augmented Generation
methods. The analysis identified top-performing models, such as BERT and CODEX, and widely used datasets including
SQuAD and GSM8K, enabling automated meta-analysis. The results demonstrate the scalability and effectiveness of
combining computer vision and NLP for high-quality data extraction. Models such as Llama3-8B and
deepseek-r1:8b-0528-qwen3-q8_0 provided domain-specific insights. The study also suggests that table detection can be
improved, without relying on keyword searches, by leveraging advanced AI, NLP, and ensemble learning techniques.
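
As a rough illustration of the pipeline outlined in this abstract, the following Python sketch chains the stages together: PDF pages rendered to images, table regions detected with a YOLOv8 checkpoint, text read with Tesseract, and performance-related tables tallied by model or dataset name. It is only a sketch under stated assumptions, not the study's implementation. The function names, the `weights_path` argument, the 300 dpi setting, and the metric keyword list are all hypothetical, and a simple keyword heuristic stands in for the Retrieval-Augmented Generation filtering, which is not reproduced here. The extraction step assumes the `pdf2image`, `ultralytics`, and `pytesseract` packages are installed.

```python
import re
from collections import Counter

# Hypothetical metric keywords; the study's actual filtering criteria are not given here.
METRIC_PATTERN = re.compile(
    r"\b(accuracy|acc|f1|bleu|rouge|precision|recall)\b", re.IGNORECASE
)
NUMBER_PATTERN = re.compile(r"\d+(\.\d+)?")


def is_performance_table(ocr_text: str) -> bool:
    """Heuristic stand-in: the text names a metric and contains at least one number."""
    return bool(METRIC_PATTERN.search(ocr_text)) and bool(NUMBER_PATTERN.search(ocr_text))


def count_mentions(tables: list[str], names: list[str]) -> Counter:
    """Count how many performance-related tables mention each name (whole word, any case)."""
    counts: Counter = Counter()
    for text in tables:
        if not is_performance_table(text):
            continue
        for name in names:
            if re.search(rf"\b{re.escape(name)}\b", text, re.IGNORECASE):
                counts[name] += 1
    return counts


def extract_table_texts(pdf_path: str, weights_path: str, dpi: int = 300) -> list[str]:
    """PDF -> page images -> YOLOv8 table boxes -> Tesseract text per cropped table."""
    # Third-party imports are kept local so the helpers above run without them.
    from pdf2image import convert_from_path
    from ultralytics import YOLO
    import pytesseract

    model = YOLO(weights_path)  # weights_path: hypothetical fine-tuned checkpoint
    texts = []
    for page in convert_from_path(pdf_path, dpi=dpi):
        result = model(page)[0]  # one Results object per input image
        for x1, y1, x2, y2 in result.boxes.xyxy.tolist():
            crop = page.crop((int(x1), int(y1), int(x2), int(y2)))
            texts.append(pytesseract.image_to_string(crop))
    return texts
```

In use, the output of `extract_table_texts` for each paper would be passed to `count_mentions` together with a list of candidate model or dataset names, and the resulting counts summed across papers to rank them.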