GPT-4o (“omni”) is finally here!
GPT-4o (“o” for “omni”) is finally here! It marks a significant advance in natural human-computer interaction: it accepts any combination of text, audio, and image inputs and generates any combination of text, audio, and image outputs. In this post we cover GPT-4o’s performance and speed, its enhanced vision and audio understanding, its unified end-to-end training across modalities, the safety and preparedness measures behind it (including evaluations under our Preparedness Framework and extensive external red teaming), and our plans for releasing its full range of modalities.
Performance and Speed
GPT-4o can respond to audio inputs in as little as 232 milliseconds, with an average response time of 320 milliseconds, closely matching human response times in conversation. It matches GPT-4 Turbo in performance on English text and code, improves significantly on non-English text, and is faster and 50% cheaper in the API.
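For developers, a minimal sketch of a text request to GPT-4o through the API, using the official openai Python client; the prompt is illustrative, and an OPENAI_API_KEY environment variable is assumed:

```python
# Minimal sketch of a text request to GPT-4o via the official openai
# Python client. Assumes OPENAI_API_KEY is set; the prompt is illustrative.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "user", "content": "Summarize GPT-4o in one sentence."},
    ],
)
print(response.choices[0].message.content)
```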
Enhanced Capabilities
GPT-4o excels in vision and audio understanding compared to existing models. Previously, Voice Mode in ChatGPT operated with average latencies of 2.8 seconds (GPT-3.5) and 5.4 seconds (GPT-4) because it relied on a pipeline of three separate models: one transcribes audio to text, GPT-3.5 or GPT-4 takes text in and outputs text, and a third converts that text back to audio. Because the main model only ever saw text, information was lost at each hop, such as tone, multiple speakers, and background noise.
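To make the information loss concrete, here is an illustrative sketch of that legacy pipeline; all function names are hypothetical stand-ins, not real APIs:

```python
# Illustrative sketch of the legacy Voice Mode pipeline (all names hypothetical).
# Each hop carries plain text only, so tone, speaker identity, and background
# sounds are discarded before the language model ever sees the input.

def transcribe(audio: bytes) -> str:
    """Stand-in for a speech-to-text model: audio in, plain text out."""
    return "transcribed user speech"  # tone, speakers, background noise are gone

def chat(text: str) -> str:
    """Stand-in for GPT-3.5/GPT-4: text in, text out."""
    return f"reply to: {text}"

def synthesize(text: str) -> bytes:
    """Stand-in for a text-to-speech model: text in, audio out."""
    return text.encode()

def voice_mode_legacy(audio_in: bytes) -> bytes:
    text_in = transcribe(audio_in)   # lossy step 1: audio reduced to words
    text_out = chat(text_in)         # the model reasons over text alone
    return synthesize(text_out)      # lossy step 2: expressiveness re-invented
```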
Unified Model Training
With GPT-4o, a single model is trained end-to-end across text, vision, and audio, so all inputs and outputs are processed by the same neural network. This integrated approach lets the model retain multimodal signals, such as tone and background sounds, that the previous pipeline discarded.
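Because all modalities flow through one model, a single API request can mix them. A minimal sketch of a combined text-and-image request, again using the official openai Python client (the image URL is a placeholder):

```python
# Minimal sketch of a mixed text-and-image request to GPT-4o via the
# official openai Python client. The image URL is a placeholder.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is shown in this image?"},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/photo.jpg"},
                },
            ],
        }
    ],
)
print(response.choices[0].message.content)
```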
Safety and Preparedness
Safety is integral to GPT-4o, achieved through filtering of training data and post-training refinement of the model’s behavior. New safety systems provide guardrails on voice outputs. Evaluations according to our Preparedness Framework and our voluntary commitments show that GPT-4o does not exceed Medium risk in any of the cybersecurity, CBRN (chemical, biological, radiological, and nuclear), persuasion, and model autonomy categories.
External Evaluation and Continuous Improvement
GPT-4o has undergone extensive external red teaming with over 70 experts in domains such as social psychology, bias and fairness, and misinformation, to identify risks introduced or amplified by the newly added modalities. These learnings informed safety interventions that improve the safety of interacting with the model.
Future Plans and Modalities Release
Currently, we are releasing text and image inputs and text outputs publicly. Over the coming months, we will enhance the technical infrastructure, usability, and safety for the other modalities. Initially, audio outputs will be limited to preset voices and adhere to existing safety policies. More details will be shared in the forthcoming system card.