Vintern-1B: An Efficient Multimodal Large Language Model for Vietnamese

Khang T. Doan, Bao G. Huynh, Dung T. Hoang, Thuc D. Pham, Nhat H. Pham, Quan T. M. Nguyen, Bang Q. Vo, Suong N. Hoang

2024-08-23

Summary

This paper introduces Vintern-1B, a new multimodal language model specifically designed for Vietnamese tasks, which can handle both text and images effectively.

What's the problem?

Many existing language models perform poorly on Vietnamese or struggle to process multiple types of data, such as images and text together. This limits their usefulness in applications like document analysis and question answering.

What's the solution?

The authors developed Vintern-1B, a model with 1 billion parameters that combines language processing with visual understanding. It was fine-tuned on over 3 million image-question-answer pairs, allowing it to perform well on tasks such as optical character recognition (OCR) and answering questions in Vietnamese. The model is also small enough to run efficiently on personal devices.

Why it matters?

This research is significant because it improves the tools available for processing the Vietnamese language and handling multimodal data. By making advanced AI technology accessible for Vietnamese speakers, it can enhance applications in education, business, and everyday communication.

Abstract

In this report, we introduce Vintern-1B, a reliable 1-billion-parameter multimodal large language model (MLLM) for Vietnamese language tasks. By integrating the Qwen2-0.5B-Instruct language model with the InternViT-300M-448px visual model, Vintern-1B is optimized for a range of applications, including optical character recognition (OCR), document extraction, and general question answering in Vietnamese contexts. The model is fine-tuned on an extensive dataset of over 3 million image-question-answer pairs, achieving robust performance and reliable results across multiple Vietnamese language benchmarks such as OpenViVQA and ViTextVQA. Vintern-1B is small enough to fit easily into various on-device applications. Additionally, we have open-sourced several Vietnamese visual question answering (VQA) datasets for text and diagrams, created with Gemini 1.5 Flash. Our models are available at: https://huggingface.co/5CD-AI/Vintern-1B-v2.