GPT-4o Redefines AI Conversations, Processing Text, Audio, and Visuals in Record Time

Today, the artificial intelligence giant OpenAI launched its latest flagship large language model, GPT-4o (GPT-4 Omni). It combines the work previously done by three separate models into a single model designed to deliver faster, more precise output.

Previously, Voice Mode in ChatGPT relied on a pipeline of three models: a simple model transcribed the voice prompt into text, GPT-3.5 or GPT-4 then used that text to generate a text response, and another simple model converted the response back into speech.

This process consumed a lot of energy and, because it passed information through three separate models, lost detail along the way. As a result, GPT-4 and GPT-3.5 could not recognize tone, multiple speakers, or background noise, and could not respond with laughter, singing, or expressed emotion.
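For context, that pipeline can be sketched as three separate calls. The internal models OpenAI actually chained together are not public, so whisper-1 and tts-1 below are stand-ins used purely for illustration:

```python
# Illustrative sketch of the old three-model Voice Mode pipeline.
# whisper-1 and tts-1 are assumed stand-ins for the unnamed internal models.
from openai import OpenAI

client = OpenAI()

# 1. A speech-to-text model transcribes the voice prompt; tone, multiple
#    speakers, and background sound are lost at this stage.
with open("voice_prompt.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
    )

# 2. The text-only language model (GPT-3.5 or GPT-4) answers the transcript.
reply = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": transcript.text}],
)

# 3. A text-to-speech model reads the answer back; it cannot laugh, sing,
#    or convey emotion because that information never reached it as text.
speech = client.audio.speech.create(
    model="tts-1",
    voice="alloy",
    input=reply.choices[0].message.content,
)
speech.write_to_file("voice_reply.mp3")
```

Each hand-off in this chain discards whatever the next model cannot represent as plain text, which is exactly the information loss described above.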


To address this, OpenAI trained GPT-4o end-to-end across text, vision, and audio, producing a single model that handles input and output without depending on other models. The company continues to work on improving its capabilities.

GPT-4o can reason across audio, vision, and text in real time and handles more than 50 languages with greatly improved speed and quality. Where Voice Mode averaged response times of 2.8 seconds with GPT-3.5 and 5.4 seconds with GPT-4, GPT-4o can respond to audio input in as little as 232 milliseconds (roughly a quarter of a second), close to human conversational response times.

Another notable feature is its ability to accept combinations of text, audio, and images as input, and generate any combination of text, audio, and images as output, enhancing human-computer interaction.
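As a rough sketch of what that looks like through the API, the Chat Completions endpoint accepts mixed text and image content for gpt-4o; audio input and output were initially limited to ChatGPT's Voice Mode rather than this endpoint, and the image URL below is a placeholder:

```python
# Minimal sketch of a mixed text + image prompt to gpt-4o.
# The image URL is a placeholder; audio in/out is not shown because it was
# not exposed through this endpoint at launch.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is happening in this picture?"},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/photo.jpg"},
                },
            ],
        }
    ],
)

print(response.choices[0].message.content)
```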

Previously, when OpenAI released a new model for ChatGPT, it typically placed it behind a paywall. This time, however, GPT-4o will be available to all users for free (with usage limits), while paid users get message limits up to five times higher.

Source