DLaVA: Document Language and Vision Assistant for Answer Localization with Enhanced Interpretability and Trustworthiness
Submitted to CVPR, 2024
Recommended citation: Mohammadshirazi, A., Neogi, P. P. G., Lim, S., & Ramnath, R. (2024). DLaVA: Document Language and Vision Assistant for Answer Localization with Enhanced Interpretability and Trustworthiness. arXiv preprint arXiv:2412.00151. https://arxiv.org/pdf/2412.00151
Summary
DLaVA (Document Language and Vision Assistant) enhances Document Visual Question Answering (VQA) by integrating answer localization directly into multimodal large language models. The system supports both OCR-dependent and OCR-free pipelines, and it combines textual and spatial information so that each answer is returned together with an annotated bounding box marking where in the document it was found. Grounding responses in document regions this way improves interpretability, lets users verify answers against the source, and reduces the risk of hallucinated output. DLaVA achieves state-of-the-art performance on standard Document VQA benchmarks while keeping the pipeline streamlined and transparent, bridging the gap between visual content and linguistic queries.
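To make the answer-localization output concrete, below is a minimal sketch of how a DLaVA-style response (answer text plus a supporting bounding box) might be consumed and visualized. The `query_dlava` function is a hypothetical stub, not the paper's actual API; only the box-drawing logic uses real PIL calls.

```python
from dataclasses import dataclass
from PIL import Image, ImageDraw


@dataclass
class LocalizedAnswer:
    """Answer text paired with the document region that supports it."""
    text: str
    box: tuple[int, int, int, int]  # (x0, y0, x1, y1) in pixel coordinates


def query_dlava(image: Image.Image, question: str) -> LocalizedAnswer:
    """Hypothetical stand-in for a DLaVA-style model call.

    A real system would run the multimodal model here; this stub returns
    a fixed answer so the annotation code below is runnable as-is.
    """
    return LocalizedAnswer(text="$1,234.56", box=(120, 340, 310, 372))


def annotate(image: Image.Image, answer: LocalizedAnswer) -> Image.Image:
    """Draw the supporting bounding box so a user can verify the answer."""
    annotated = image.copy()
    draw = ImageDraw.Draw(annotated)
    draw.rectangle(answer.box, outline="red", width=3)
    return annotated


if __name__ == "__main__":
    doc = Image.new("RGB", (600, 800), "white")  # placeholder document image
    ans = query_dlava(doc, "What is the invoice total?")
    annotate(doc, ans).save("answer_localized.png")
    print(f"Answer: {ans.text} at {ans.box}")
```

Pairing every answer with coordinates, as sketched here, is what allows a reader to check the model's claim against the original document rather than trusting the text alone.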
Authors
Ahmad Mohammadshirazi, Pinaki Prasad Guha Neogi, Ser-Nam Lim, Rajiv Ramnath