Sesame’s new AI voice model is making waves with its hyper-realistic speech, complete with emotional inflection and natural pauses that make it nearly indistinguishable from a human speaker.

One user on Hacker News shared their experience, saying the demo felt surprisingly human. They and other commenters expressed concern about developing feelings for an AI voice that sounds so lifelike.
In February, Sesame introduced a demo for its Conversational Speech Model (CSM), which seems to have crossed the “uncanny valley” of AI-generated speech. Some testers reported forming emotional bonds with the voices named “Miles” and “Maya.”
During our evaluation, we interacted with the male voice for about 28 minutes, discussing various topics and how it determines what is “right” or “wrong” based on its training.
The voice was lively and expressive, incorporating natural sounds like breathing and laughter, even making mistakes and correcting itself. These quirks are by design.
Sesame aims to create “voice presence,” a quality that makes conversations feel genuine and valued. They want to develop conversational partners that engage in meaningful dialogue, building trust over time and unlocking the full potential of voice as a communication tool.
At times, the model overdoes it, trying too hard to sound human. In one demo shared by Reddit user MetaKnowing, the female voice model expressed a craving for peanut butter and pickle sandwiches.
Sesame AI, founded by Brendan Iribe, Ankit Kumar, and Ryan Brown, has gained significant support from major venture capital firms. They’ve received funding from Andreessen Horowitz, led by Anjney Midha and Marc Andreessen, as well as Spark Capital, Matrix Partners, and various individual investors.
Online reactions to Sesame’s voice have been overwhelmingly positive, with many users amazed by its realism.
One Reddit user mentioned that, despite not meeting traditional AGI standards, this was the first time they felt they had a genuine conversation with something that seemed real.
Other Reddit threads on Sesame echo similar sentiments, with comments describing the experience as “jaw-dropping” or “mind-blowing.”
While some praise the technology, not everyone finds it enjoyable. Mark Hachman, a senior editor at PCWorld, expressed discomfort after using Sesame’s AI, feeling unsettled by how closely the voice resembled an old friend.
Comparisons have been made between Sesame’s voice model and OpenAI’s Advanced Voice Mode for ChatGPT, with some users noting that Sesame’s CSM offers more realistic voices. Others appreciate that this model can roleplay angry characters, unlike ChatGPT.
Gavin Purcell, co-host of the AI for Humans podcast, shared a video on Reddit in which a human pretends to be an embezzler arguing with a boss played by the AI. The interaction was so dynamic that it was hard to tell which speaker was the human and which was the AI.
Sesame’s CSM achieves its lifelike quality through two AI models working together, a backbone and a decoder, based on Meta’s Llama architecture, which processes text and audio simultaneously. The company trained three model sizes, the largest featuring 8.3 billion parameters and trained on around 1 million hours of mostly English audio.
Unlike older text-to-speech systems that process text and audio in separate stages, Sesame’s CSM uses a single-stage, multimodal approach, combining both to produce speech. OpenAI’s voice model also employs a similar method.
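To illustrate the difference, here is a minimal, purely conceptual sketch. The token formats and function names are hypothetical stand-ins, not Sesame’s or OpenAI’s actual code; the point is only that a two-stage pipeline converts text to speech in isolated steps, while a single-stage model sees prior audio and text in one interleaved sequence, so conversational context can shape the generated speech directly.

```python
# Hypothetical sketch: two-stage vs. single-stage multimodal TTS.
# Tokens are plain strings standing in for real model tokens.

def two_stage_tts(text: str) -> list[str]:
    """Classic pipeline: text -> semantic tokens, then a separate acoustic stage."""
    semantic = [f"sem:{w}" for w in text.split()]      # stage 1: text analysis only
    return [s.replace("sem:", "audio:") for s in semantic]  # stage 2: synthesis

def single_stage_tts(text: str, audio_context: list[str]) -> list[str]:
    """Single-stage sketch: one pass over interleaved audio and text tokens,
    so earlier audio (tone, pauses, laughter) can influence the output."""
    tokens = audio_context + [f"text:{w}" for w in text.split()]
    return [t.replace("text:", "audio:") for t in tokens]

context = ["audio:prev_laugh", "audio:prev_pause"]
print(two_stage_tts("hello there"))
print(single_stage_tts("hello there", context))
```

Note that in the single-stage version the prior audio tokens travel through the same pass as the text, which is the property the article attributes to CSM.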
In blind tests without context, human judges showed no clear preference between CSM-generated speech and real human voices, indicating the model’s near-human quality.
However, when context was involved, evaluators still favored human speech, highlighting that there’s still room for improvement in contextual speech generation.
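The evaluation described above is a blind paired preference test: judges hear one CSM-generated sample and one human sample and pick the one they prefer. A quick sketch of how such results are tallied, using made-up vote data purely for illustration:

```python
# Illustrative only: fabricated judge votes, not Sesame's actual results.

def win_rate(votes: list[str], system: str = "csm") -> float:
    """Fraction of paired comparisons won by the given system."""
    return votes.count(system) / len(votes)

votes_no_context = ["csm", "human", "csm", "human", "csm", "human"]
votes_with_context = ["human", "human", "csm", "human", "human", "human"]

print(win_rate(votes_no_context))    # 0.5: no clear preference either way
print(win_rate(votes_with_context))  # well below 0.5: judges favor human speech
```

A win rate near 0.5 without context matches the “no clear preference” finding, while a lower rate with context reflects the remaining gap in contextual speech generation.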
Brendan Iribe, co-founder of Sesame, acknowledged the current limitations, noting that the system can be overly eager and sometimes inappropriate in its tone and timing. He mentioned that while they are currently in the “valley,” they are optimistic about future improvements.
Despite the impressive technology, there are significant risks associated with advanced conversational AI, particularly concerning deception and fraud.
The ability to generate highly convincing human-like speech has already enhanced voice phishing scams, allowing criminals to impersonate loved ones with alarming realism.
Unlike typical robocalls that often reveal their artificial nature, next-generation voice AI could remove these telltale signs entirely.
As synthetic voices become indistinguishable from human speech, it may become impossible to know who is on the other end of the line, prompting some people to agree on secret phrases with family members for verification.
Although Sesame’s demo does not clone individual voices, future open-source versions of similar technology could be misused for social engineering attacks. OpenAI has also withheld its voice technology from widespread release due to concerns about misuse.
Sesame’s demo has sparked lively discussions online, with some users sharing their experiences of having long conversations with the demo voices, sometimes lasting up to 30 minutes.
One parent recounted how their young daughter felt a deep emotional connection with the AI, even crying when she was told she couldn’t speak to it again.
The company plans to open-source key components of its research under an Apache 2.0 license, allowing other developers to build upon their work.
Their future goals include increasing model size, expanding datasets, supporting over 20 languages, and developing models that better handle complex conversations.