Shakti-VLMs: Scalable Vision-Language Models for Enterprise AI
Syed Abdul Gaffar Shakhadri, Kruthika KR, Kartik Basavaraj Angadi
2025-02-26
Summary
This paper introduces Shakti VLM, a new family of AI models that can understand both images and text, designed to be more efficient and practical for businesses to use.
What's the problem?
Current AI models that work with both images and text (called vision-language models, or VLMs) need huge amounts of training data to work well, which makes them expensive and difficult for many companies to use.
What's the solution?
The researchers created Shakti VLM models that use clever design choices to work just as well with less data. They changed how the AI pays attention to information, how it normalizes data as it processes it, and how it understands the position of things in images. They also used a three-stage training process to make the models learn more efficiently.
Why does it matter?
This matters because it makes powerful AI that can understand both images and text more accessible to businesses. Companies can use these advanced tools without needing as much data or computing power, which could help them automate tasks, understand documents better, and solve complex problems involving both visual and text information more easily and cheaply.
Abstract
We introduce Shakti VLM, a family of vision-language models at 1B and 4B parameter capacities designed to address data efficiency challenges in multimodal learning. While recent VLMs achieve strong performance through extensive training data, Shakti models leverage architectural innovations to attain competitive results with fewer tokens. Key advancements include QK-Normalization for attention stability, hybrid normalization techniques, and enhanced positional encoding. A three-stage training strategy further optimizes learning efficiency. Evaluations show that Shakti-VLM-1B and Shakti-VLM-4B excel in document understanding, visual reasoning, OCR extraction, and general multimodal reasoning. Our results highlight that high performance can be achieved through model design and training strategy rather than sheer data volume, making Shakti an efficient solution for enterprise-scale multimodal tasks.
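The QK-Normalization mentioned in the abstract can be illustrated roughly as follows. This is a minimal single-head NumPy sketch assuming the common L2-normalization variant (normalize queries and keys before the dot product so attention logits stay bounded); the paper's exact formulation, learnable scales, and multi-head details are not specified here.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def qk_norm_attention(q, k, v, scale=10.0, eps=1e-6):
    # QK-Normalization (illustrative): L2-normalize queries and keys
    # along the head dimension so each dot product is a cosine
    # similarity in [-1, 1]. This bounds the attention logits
    # regardless of activation magnitude, stabilizing training.
    q = q / (np.linalg.norm(q, axis=-1, keepdims=True) + eps)
    k = k / (np.linalg.norm(k, axis=-1, keepdims=True) + eps)
    logits = scale * (q @ k.T)           # bounded in [-scale, scale]
    weights = softmax(logits, axis=-1)   # rows sum to 1
    return weights @ v

rng = np.random.default_rng(0)
q, k, v = rng.normal(size=(3, 5, 8))     # 5 tokens, head dim 8
out = qk_norm_attention(q, k, v)
```

Without the normalization, logits grow with the magnitude of q and k, which can push the softmax into saturated, hard-to-train regimes; bounding them is the stability benefit the abstract alludes to.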