Fully Open Source Moxin-7B Technical Report

Pu Zhao, Xuan Shen, Zhenglun Kong, Yixin Shen, Sung-En Chang, Timothy Rupprecht, Lei Lu, Enfu Nan, Changdi Yang, Yumei He, Xingchen Xu, Yu Huang, Wei Wang, Yue Chen, Yong He, Yanzhi Wang

2024-12-11

Summary

This paper introduces Moxin-7B, a fully open-source large language model (LLM) designed to improve transparency and accessibility in AI development.

What's the problem?

While many powerful language models exist, such as GPT-4, they are often proprietary, meaning their inner workings and training data are not available to the public. This lack of transparency raises concerns about how these models behave, whether they are safe, and whether their results can be reproduced for research purposes. Additionally, some nominally open-source models withhold key components or use restrictive licenses, which can hinder further innovation.

What's the solution?

The authors introduce Moxin-7B, which is developed according to the Model Openness Framework (MOF). This framework requires full transparency, so the release includes everything from the training code and datasets to intermediate and final model checkpoints. Although Moxin-7B was trained on less data than many comparable models, it still performs well across a range of tasks, achieving strong evaluation scores without requiring excessive data.
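The idea behind the MOF's ranked classification can be illustrated with a small sketch. Note that this is not the official MOF tooling, and the exact component lists per class are simplified assumptions for illustration; the real framework defines its own detailed artifact checklist. The sketch below treats the classes as cumulative: each higher level requires everything from the levels below it plus additional artifacts.

```python
# Illustrative sketch only (NOT the official MOF tooling): classify a model
# release by which groups of artifacts it makes publicly available.
# The component names and groupings here are simplified assumptions.

REQUIRED = {
    # Lowest class: the model itself is usable.
    "open model": {"final_checkpoints", "model_card", "license"},
    # Middle class: the code around the model is also released.
    "open tooling": {"training_code", "inference_code", "evaluation_code"},
    # Highest class: the full scientific record is released.
    "open science": {"training_data", "intermediate_checkpoints", "research_paper"},
}

LEVELS = ("open model", "open tooling", "open science")


def mof_class(released_components):
    """Return the highest (cumulative) openness class a release satisfies."""
    released = set(released_components)
    achieved = "not open"
    for level in LEVELS:
        if REQUIRED[level] <= released:  # all artifacts for this level present
            achieved = level
        else:
            break  # classes are cumulative, so stop at the first gap
    return achieved


# A release like Moxin-7B's, with every artifact group published,
# would reach the top class under this toy scheme:
full_release = set().union(*REQUIRED.values())
print(mof_class(full_release))  # -> "open science"

# A weights-only release stops at the lowest class:
print(mof_class({"final_checkpoints", "model_card", "license"}))  # -> "open model"
```

The cumulative loop reflects the paper's framing that "open science" is the highest classification level, reached only by also releasing training data and intermediate checkpoints on top of code and weights.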

Why it matters?

This research is important because it sets a new standard for open-source AI models. By ensuring full transparency and providing all necessary resources for researchers and developers, Moxin-7B encourages collaboration and innovation in the AI community. This model can be used for various applications while promoting responsible AI practices, making advanced technology more accessible to everyone.

Abstract

Recently, Large Language Models (LLMs) have undergone a significant transformation, marked by a rapid rise in both their popularity and capabilities. Leading this evolution are proprietary LLMs like GPT-4 and GPT-o1, which have captured widespread attention in the AI community due to their remarkable performance and versatility. Simultaneously, open-source LLMs, such as LLaMA and Mistral, have made great contributions to the ever-increasing popularity of LLMs due to the ease of customizing and deploying them across diverse applications. Although open-source LLMs present unprecedented opportunities for innovation and research, the commercialization of LLMs has raised concerns about transparency, reproducibility, and safety. Many open-source LLMs fail to meet fundamental transparency requirements by withholding essential components like training code and data, and some use restrictive licenses whilst claiming to be "open-source," which may hinder further innovations on LLMs. To mitigate this issue, we introduce Moxin 7B, a fully open-source LLM developed in accordance with the Model Openness Framework (MOF), a ranked classification system that evaluates AI models based on model completeness and openness, adhering to principles of open science, open source, open data, and open access. Our model achieves the highest MOF classification level of "open science" through the comprehensive release of pre-training code and configurations, training and fine-tuning datasets, and intermediate and final checkpoints. Experiments show that our model achieves superior performance in zero-shot evaluation compared with popular 7B models and performs competitively in few-shot evaluation.