GPT-4o System Card
OpenAI, Aaron Hurst, Adam Lerer, Adam P. Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, Aleksander Mądry, Alex Baker-Whitcomb, Alex Beutel, Alex Borzunov, Alex Carney, Alex Chow, Alex Kirillov, Alex Nichol, Alex Paino, Alex Renzin, Alex Tachard Passos
2024-10-29

Summary
This system card presents GPT-4o, a multimodal AI model that can process and generate text, audio, and images, making it capable of understanding and responding to many types of input quickly and efficiently.
What's the problem?
While AI models have made great strides in understanding language and generating content, they often struggle with processing different types of information simultaneously (like text, audio, and images) and can be slow or expensive to use. Additionally, ensuring that these models are safe and reliable is a significant concern.
What's the solution?
GPT-4o addresses these issues by being an 'omni model' that accepts multiple input types and generates outputs in various formats. It processes everything through a single neural network, allowing for faster response times—similar to human conversation speed—and improved performance in understanding audio and visual data. The model has been evaluated for its capabilities and limitations, with a focus on ensuring safety and ethical use. The system card itself outlines the model's strengths, limitations, and the safety measures put in place.
Why it matters?
This research is important because it demonstrates advancements in AI technology that allow for more natural interactions with machines. By improving how AI models handle different types of information and ensuring they operate safely, GPT-4o could transform how we use AI in everyday life, from customer service to education and beyond.
Abstract
GPT-4o is an autoregressive omni model that accepts as input any combination of text, audio, image, and video, and generates any combination of text, audio, and image outputs. It's trained end-to-end across text, vision, and audio, meaning all inputs and outputs are processed by the same neural network. GPT-4o can respond to audio inputs in as little as 232 milliseconds, with an average of 320 milliseconds, which is similar to human response time in conversation. It matches GPT-4 Turbo performance on text in English and code, with significant improvement on text in non-English languages, while also being much faster and 50% cheaper in the API. GPT-4o is especially better at vision and audio understanding compared to existing models. In line with our commitment to building AI safely and consistent with our voluntary commitments to the White House, we are sharing the GPT-4o System Card, which includes our Preparedness Framework evaluations. In this System Card, we provide a detailed look at GPT-4o's capabilities, limitations, and safety evaluations across multiple categories, focusing on speech-to-speech while also evaluating text and image capabilities, and measures we've implemented to ensure the model is safe and aligned. We also include third-party assessments on dangerous capabilities, as well as discussion of potential societal impacts of GPT-4o's text and vision capabilities.
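To make the multimodal API usage mentioned above concrete, the sketch below shows how a combined text-and-image request to GPT-4o might look through the OpenAI Python SDK. This is an illustration only, not part of the system card: the prompt and image URL are placeholders, and the exact request options may differ from those used in the evaluations described here.

```python
# Illustrative sketch: sending one request that mixes text and image input
# to GPT-4o via the OpenAI Python SDK (v1.x). Prompt and image URL are
# placeholders, not taken from the system card.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe what is happening in this image."},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/street-scene.jpg"},
                },
            ],
        }
    ],
)

# The text portion of the model's reply.
print(response.choices[0].message.content)
```

Because all modalities are handled by a single network, the same chat endpoint accepts mixed content parts in one message rather than routing each modality through a separate model.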