ChatGPT gains voice conversations and image recognition in major update

OpenAI rolls out multimodal capabilities to its chatbot, letting subscribers speak queries and upload photos for AI analysis, moving closer to a conversational assistant like the one in Her.

OpenAI today announced a major upgrade to ChatGPT that lets users speak queries aloud and receive spoken responses, and upload or snap photos for the chatbot to analyze. The update, rolling out to paid subscribers today, adds five synthetic voices named Juniper, Ember, Sky, Cove, and Breeze, alongside image recognition that can identify objects, books, plants, and more. The move marks a step toward making ChatGPT a more natural, conversational assistant akin to the AI in the film Her.

The voice and vision features work by converting input to text using speech or image recognition before generating a response. In early tests, ChatGPT correctly identified a Japanese maple tree and a compostable fork, but refused to identify people in photos. Voice responses had noticeable latency in prerelease versions. The update only supports English for now and is limited to the $20-per-month subscription version of ChatGPT.

Some researchers see the update as a milestone. UC Berkeley’s Trevor Darrell noted that multimodal models should outperform text-only ones. But privacy concerns are already bubbling: OpenAI says it does not currently collect voice data for training, and users can opt out of image data use, but turning off chat history disables voice. The chatbot also retains existing guardrails, refusing to answer questions about building bombs or age-inappropriate dates.

The upgrade positions ChatGPT to compete directly with Apple’s Siri or Amazon’s Alexa.

The record

The room reacts, as it happened

Trevor Darrell

A UC Berkeley professor and Prompt AI cofounder said multimodal models are expected to outperform single-modality ones, noting that a language-only model will only learn language.

Jim Glass

An MIT speech technology professor said speech is the easiest way to generate language and that many academic groups are testing voice interfaces with large language models.

Sandhini Agarwal

An OpenAI AI policy researcher said users can opt out of training with their data, but WIRED found that turning off chat history disabled voice capabilities.

One year later

Multimodal ChatGPT sparked a wave of copycat features across rivals, but voice latency and privacy controls remained pain points. Later updates improved speed and expanded language support, while the ban on person identification stayed consistent across subsequent versions.
