Phi-2 is a Transformer-based model trained on 1.4 trillion tokens drawn from multiple passes over a mixture of synthetic and web datasets for natural language processing and coding. Training took 14 days on 96 A100 GPUs. Although Phi-2 did not undergo alignment through reinforcement learning from human feedback (RLHF) or instruction fine-tuning, it exhibited better behavior with respect to toxicity and bias than other models that did go through alignment. This is consistent with the team's previous work on Phi-1.5, which focused on tailored data-curation techniques.
Phi-2's performance on academic benchmarks surpasses that of larger models such as Mistral and Llama-2. Despite having only 2.7 billion parameters, Phi-2 achieves comparable or better results in multi-step reasoning, coding, math, commonsense reasoning, language understanding, and more. It even matches or outperforms the recently announced Google Gemini Nano 2 model. The team acknowledges the challenges of model evaluation, including the potential for public benchmarks to leak into training data, and emphasizes the importance of testing language models on concrete use cases.
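For readers who want to follow that advice and try the model on their own use cases, a minimal sketch using the Hugging Face transformers library might look like the following. The model ID microsoft/phi-2, the prompt, and the generation settings are illustrative assumptions rather than details from the announcement; depending on your transformers version, loading may also require trust_remote_code=True.

```python
# Minimal sketch: run Phi-2 on a small coding prompt to test a concrete use case.
# Assumes torch and transformers are installed and the model is available on the
# Hugging Face Hub as "microsoft/phi-2".
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/phi-2"  # assumed Hub identifier
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float32,  # use torch.float16 on a GPU with enough memory
)

# An illustrative coding prompt, in the spirit of the coding benchmarks reported.
prompt = 'def fibonacci(n):\n    """Return the n-th Fibonacci number."""\n'
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=64)

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```

Running a handful of such prompts from your own domain is a more direct signal of fitness than benchmark scores alone, which is precisely the point the team makes about evaluation.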