
VARCO-VISION: Expanding Frontiers in Korean Vision-Language Models

Jeongho Ju, Daeyoung Kim, SunYoung Park, Youngjune Kim

2024-12-05


Summary

This paper introduces VARCO-VISION, an open-source vision-language model that understands and generates content in both Korean and English, strengthening bilingual image-text capabilities in AI.

What's the problem?

Creating AI models that can understand and generate text about images in multiple languages is challenging. Many existing models struggle with bilingual tasks, especially for Korean, leading to poor performance in real-world applications. Additionally, training these models often requires extensive resources, and naive fine-tuning can erode the knowledge of the pretrained backbone model.

What's the solution?

VARCO-VISION addresses these challenges through a step-by-step training strategy that lets it learn both linguistic and visual information without losing the knowledge of its base model. It supports capabilities such as grounding (linking text to specific regions of an image), referring (identifying the object a phrase describes), and optical character recognition (OCR). The authors also release five new Korean evaluation benchmarks to measure how well the model performs in realistic scenarios; a quick-start sketch for trying the model follows below.
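Because the checkpoint is public on the Hugging Face Hub, one way to experiment with it is through the transformers library. The sketch below is illustrative only: it assumes the checkpoint is compatible with the generic AutoProcessor/AutoModelForVision2Seq interface and a simple `<image>` prompt format, neither of which is confirmed by the paper; the image URL is a placeholder, and the model card should be consulted for the exact model class and prompt template.

```python
# Minimal sketch: loading VARCO-VISION-14B from the Hugging Face Hub.
# Assumption: the checkpoint works with the generic vision-to-sequence
# auto classes; the actual class and prompt format may differ (see the
# model card at https://huggingface.co/NCSOFT/VARCO-VISION-14B).
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

model_id = "NCSOFT/VARCO-VISION-14B"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# Placeholder image URL; any RGB image works here.
image = Image.open(
    requests.get("https://example.com/sample.jpg", stream=True).raw
)

# Bilingual prompt: the model is trained to answer in Korean or English.
prompt = "<image>\nWhat is shown in this image?"
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=64)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```

Swapping the prompt for a Korean question (e.g. asking what text appears in the image) exercises the bilingual and OCR abilities described above.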

Why it matters?

This research is significant because it expands the capabilities of AI in understanding and generating bilingual content, particularly for Korean and English. By providing a powerful tool for researchers and developers, VARCO-VISION opens up new opportunities for applications in education, translation, and content creation, ultimately helping to bridge language barriers in technology.

Abstract

In this paper, we introduce an open-source Korean-English vision-language model (VLM), VARCO-VISION. We incorporate a step-by-step training strategy that allows a model to learn both linguistic and visual information while preserving the backbone model's knowledge. Our model demonstrates outstanding performance in diverse settings requiring bilingual image-text understanding and generation abilities compared to models of similar size. VARCO-VISION is also capable of grounding, referring, and OCR, expanding its usage and potential applications for real-world scenarios. In addition to the model, we release five Korean evaluation datasets, including four closed-set benchmarks and one open-set benchmark. We anticipate that our milestone will broaden the opportunities for AI researchers aiming to train VLMs. VARCO-VISION is available at https://huggingface.co/NCSOFT/VARCO-VISION-14B.