Building the Next Generation of Conversational AI

14 Mar 2025 • 101 min • EN

In this episode of AI + a16z, Sesame Cofounder and CTO Ankit Kumar joins a16z general partner Anjney Midha for a deep dive into the research and engineering behind their voice technology. They discuss the technical challenges of real-time speech generation, the trade-offs in balancing personality with efficiency, and why the team is open-sourcing key components of their model. Ankit breaks down the complexities of multimodal AI, full-duplex conversation modeling, and the computational optimizations that enable low-latency interactions.

They also explore the evolution of natural language as a user interface and its potential to redefine human-computer interaction. Plus, we take audience questions on everything from scaling laws in speech synthesis to the role of in-context learning in making AI voices more expressive.

Key Takeaways:
How Sesame AI achieves natural voice interactions through real-time speech generation.
The impact of open-sourcing their speech model and what it means for AI research.
The role of full-duplex modeling in improving AI responsiveness.
How computational efficiency and system latency shape AI conversation quality.
The growing role of natural language as a user interface in AI-driven experiences.

For anyone interested in AI and voice technology, this episode offers an in-depth look at the latest advancements pushing the boundaries of human-computer interaction.

Learn more:
The Maya + Miles demo
Crossing the uncanny valley of conversational voice
Sesame CSM 1B model

Follow everybody on X:
Ankit Kumar
Anjney Midha

Check out everything a16z is doing with artificial intelligence here, including articles, projects, and more podcasts.

From "AI + a16z"
