GPT-4o (“omni”) is finally here!
GPT-4o (“o” for “omni”) is finally here! It marks a significant advance in natural human-computer interaction: it accepts any combination of text, audio, and image inputs and generates any combination of text, audio, and image outputs. In this post we cover GPT-4o’s performance and speed, its enhanced vision and audio understanding, its unified end-to-end training across modalities, the safety and preparedness measures behind it (including evaluations under our Preparedness Framework and extensive external red teaming), and our plans for releasing its full range of modalities.
Performance and Speed
GPT-4o can respond to audio inputs in as little as 232 milliseconds, with an average response time of 320 milliseconds, closely matching human response times in conversation. It matches GPT-4 Turbo in performance on English text and code, improves significantly on non-English text, and is faster and 50% cheaper in the API.
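For developers, a minimal sketch of a text request to GPT-4o through the API, using the official openai Python client; the prompt is illustrative, and an OPENAI_API_KEY environment variable is assumed:

```python
# Minimal sketch of a text request to GPT-4o via the official openai
# Python client. Assumes OPENAI_API_KEY is set; the prompt is illustrative.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "user", "content": "Summarize GPT-4o in one sentence."},
    ],
)
print(response.choices[0].message.content)
```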
Enhanced Capabilities
GPT-4o excels in vision and audio understanding compared to existing models. Previously, Voice Mode in ChatGPT operated with average latencies of 2.8 seconds (GPT-3.5) and 5.4 seconds (GPT-4) because it relied on a pipeline of three separate models: one transcribes audio to text, GPT-3.5 or GPT-4 takes text in and outputs text, and a third converts that text back to audio. Because the main model only ever saw text, information was lost at each hop, such as tone, multiple speakers, and background noise.
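To make the information loss concrete, here is an illustrative sketch of that legacy pipeline; all function names are hypothetical stand-ins, not real APIs:

```python
# Illustrative sketch of the legacy Voice Mode pipeline (all names hypothetical).
# Each hop carries plain text only, so tone, speaker identity, and background
# sounds are discarded before the language model ever sees the input.

def transcribe(audio: bytes) -> str:
    """Stand-in for a speech-to-text model: audio in, plain text out."""
    return "transcribed user speech"  # tone, speakers, background noise are gone

def chat(text: str) -> str:
    """Stand-in for GPT-3.5/GPT-4: text in, text out."""
    return f"reply to: {text}"

def synthesize(text: str) -> bytes:
    """Stand-in for a text-to-speech model: text in, audio out."""
    return text.encode()

def voice_mode_legacy(audio_in: bytes) -> bytes:
    text_in = transcribe(audio_in)   # lossy step 1: audio reduced to words
    text_out = chat(text_in)         # the model reasons over text alone
    return synthesize(text_out)      # lossy step 2: expressiveness re-invented
```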
Unified Model Training
With GPT-4o, a single model is trained end-to-end across text, vision, and audio, so all inputs and outputs are processed by the same neural network. This integrated approach lets the model retain multimodal signals, such as tone and background sounds, that the previous pipeline discarded.
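Because all modalities flow through one model, a single API request can mix them. A minimal sketch of a combined text-and-image request, again using the official openai Python client (the image URL is a placeholder):

```python
# Minimal sketch of a mixed text-and-image request to GPT-4o via the
# official openai Python client. The image URL is a placeholder.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is shown in this image?"},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/photo.jpg"},
                },
            ],
        }
    ],
)
print(response.choices[0].message.content)
```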
Safety and Preparedness
Safety is integral to GPT-4o, achieved through filtering of training data and post-training refinement of the model’s behavior. New safety systems provide guardrails on voice outputs. Evaluations according to our Preparedness Framework and our voluntary commitments show that GPT-4o does not exceed Medium risk in any of the cybersecurity, CBRN (chemical, biological, radiological, and nuclear), persuasion, and model autonomy categories.
External Evaluation and Continuous Improvement
GPT-4o has undergone extensive external red teaming with over 70 experts in domains such as social psychology, bias and fairness, and misinformation, to identify risks introduced or amplified by the newly added modalities. These learnings informed safety interventions that improve the safety of interacting with the model.
Future Plans and Modalities Release
Currently, we are releasing text and image inputs and text outputs publicly. Over the coming months, we will enhance the technical infrastructure, usability, and safety for the other modalities. Initially, audio outputs will be limited to preset voices and adhere to existing safety policies. More details will be shared in the forthcoming system card.