DLaVA: Document Language and Vision Assistant for Answer Localization with Enhanced Interpretability and Trustworthiness

Submitted to CVPR, 2024; available as an arXiv preprint.

Recommended citation: Mohammadshirazi, A., Neogi, P. P. G., Lim, S.-N., & Ramnath, R. (2024). DLaVA: Document Language and Vision Assistant for Answer Localization with Enhanced Interpretability and Trustworthiness. arXiv preprint arXiv:2412.00151. https://arxiv.org/pdf/2412.00151

Summary

DLaVA (Document Language and Vision Assistant) enhances Document Visual Question Answering (VQA) by integrating answer localization directly into multimodal large language models. Unlike conventional methods that return free-form text alone, DLaVA supports both OCR-dependent and OCR-free pipelines and grounds each answer in the document image itself. By combining textual and spatial information, it returns precise answers together with annotated bounding boxes marking where each answer appears on the page, which improves interpretability, fosters user trust, and reduces the risk of AI hallucinations. DLaVA achieves state-of-the-art performance on benchmark Document VQA datasets while remaining a streamlined, transparent, and efficient bridge between visual content and linguistic queries.
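One concrete way to picture the answer-localization interface: each prediction pairs an answer string with a bounding box in document-image coordinates, which can be rendered back onto the page for the user to verify. The sketch below is illustrative only, not the paper's code; the file names, the `prediction` dictionary, and the `annotate_answer` helper are hypothetical, and the `(x0, y0, x1, y1)` pixel-coordinate box format is an assumption.

```python
from PIL import Image, ImageDraw


def annotate_answer(image_path, answer, bbox, out_path="annotated.png"):
    """Draw a labeled bounding box around the region that supports `answer`.

    `bbox` is assumed to be (x0, y0, x1, y1) in pixel coordinates
    of the document image.
    """
    image = Image.open(image_path).convert("RGB")
    draw = ImageDraw.Draw(image)
    # Outline the region the model localized as evidence for the answer.
    draw.rectangle(bbox, outline="red", width=3)
    # Label the box with the answer text, just above its top edge.
    draw.text((bbox[0], max(bbox[1] - 14, 0)), answer, fill="red")
    image.save(out_path)
    return out_path


# Hypothetical model output for the question "What is the invoice total?"
prediction = {"answer": "$1,250.00", "bbox": (412, 880, 568, 910)}
annotate_answer("invoice.png", prediction["answer"], prediction["bbox"])
```

Returning the box alongside the text is what lets a reader check the answer against the source document rather than trusting the model's output blindly.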

Authors

Ahmad Mohammadshirazi, Pinaki Prasad Guha Neogi, Ser-Nam Lim, Rajiv Ramnath

Download the paper here: https://arxiv.org/pdf/2412.00151