DocParseNet: Advanced Semantic Segmentation and OCR Embeddings for Efficient Scanned Document Annotation
Published in ICML: Efficient Systems for Foundation Models (ES-FoMo II), 2024
Recommended citation: Mohammadshirazi, A., Nosratifiroozsalari, A., Zhou, Z., Kulshrestha, D., & Ramnath, R. (2024). DocParseNet: Advanced Semantic Segmentation and OCR Embeddings for Efficient Scanned Document Annotation. arXiv preprint arXiv:2406.17591. https://arxiv.org/pdf/2406.17591
Summary
DocParseNet is a multi-modal deep learning model for efficient annotation of scanned documents. It integrates visual features from semantic segmentation with text extracted by Optical Character Recognition (OCR). The architecture has three parts: a modified UNet for image processing, DistilBERT for text embedding, and a fusion module that combines the two modalities with multi-head attention, so that each parsing decision draws on both layout and textual context. With only 2.8 million parameters, DocParseNet balances accuracy, computational efficiency, and scalability, which makes it well suited to corporate document analysis and similar applications.
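To make the fusion step concrete, the following is a minimal sketch of multi-head scaled dot-product cross-attention in plain Python, not the authors' implementation. It assumes that visual tokens act as queries and OCR text tokens act as keys and values, that both share one embedding size, and that a residual connection is applied; the summary specifies none of these. The learned query, key, value, and output projections of a full attention layer are omitted (treated as identity), and the names `fuse`, `split_heads`, and `softmax` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def split_heads(vector, num_heads):
    """Split one embedding into num_heads equal-width slices."""
    width = len(vector) // num_heads
    return [vector[h * width:(h + 1) * width] for h in range(num_heads)]


def fuse(visual_tokens, text_tokens, num_heads):
    """Let every visual token attend over all text tokens, head by head.

    Assumption: learned projections are omitted (identity), so the text
    slices serve as both keys and values.
    """
    dim = len(visual_tokens[0])
    if dim % num_heads != 0:
        raise ValueError("embedding size must be divisible by num_heads")
    scale = math.sqrt(dim // num_heads)
    text_heads = [split_heads(t, num_heads) for t in text_tokens]

    fused = []
    for visual in visual_tokens:
        output = []
        for h, query in enumerate(split_heads(visual, num_heads)):
            keys = [heads[h] for heads in text_heads]
            # Scaled dot product between this head's query and each key.
            scores = [sum(q * k for q, k in zip(query, key)) / scale
                      for key in keys]
            weights = softmax(scores)
            # Attention-weighted sum of the text slices for this head.
            mixed = [sum(w * key[i] for w, key in zip(weights, keys))
                     for i in range(len(query))]
            output.extend(mixed)  # concatenate the heads
        # Assumed residual connection: keep the original visual signal.
        fused.append([v + o for v, o in zip(visual, output)])
    return fused


# Toy inputs: 2 visual tokens and 3 text tokens, embedding size 4.
visual = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0]]
text = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]]
fused = fuse(visual, text, num_heads=2)
```

The result has one fused vector per visual token, each the same width as the input embedding. In a trained model the projections would be learned, and the visual and text embeddings would come from the segmentation network and DistilBERT rather than from hand-written lists.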
Authors
Ahmad Mohammadshirazi, Ali Nosratifiroozsalari, Mengxi Zhou, Dheeraj Kulshrestha, Rajiv Ramnath