PersonaPlex defines conversational behavior with two inputs: a voice prompt that captures vocal characteristics, speaking style, and prosody, and a text prompt that describes the role, background, and conversation context. The two prompts are processed jointly to produce a coherent persona. The model builds on the Moshi architecture, with 7 billion parameters and a dual-stream configuration in which listening and speaking occur concurrently, enabling natural conversational dynamics.
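The dual-stream design is easiest to picture as a loop in which every step both consumes a user audio frame and emits an agent frame. The sketch below is purely illustrative and is not the PersonaPlex API: `PersonaPrompt`, `DualStreamSession`, and the byte-string frame format are hypothetical stand-ins for the Moshi-style audio streams.

```python
from dataclasses import dataclass, field

@dataclass
class PersonaPrompt:
    voice_prompt: bytes  # reference audio capturing timbre, style, prosody
    text_prompt: str     # role, background, and conversation context

@dataclass
class DualStreamSession:
    persona: PersonaPrompt
    heard: list = field(default_factory=list)

    def step(self, user_frame: bytes) -> bytes:
        # Full-duplex tick: consume the user's audio frame and emit an
        # agent frame in the same step, so listening and speaking overlap
        # instead of alternating in rigid turns.
        self.heard.append(user_frame)
        return self._decode_frame()

    def _decode_frame(self) -> bytes:
        # Stand-in for the Moshi-style audio LM decode; emits silence here.
        return b"\x00" * len(self.heard[-1])

session = DualStreamSession(
    PersonaPrompt(
        voice_prompt=b"<reference audio>",
        text_prompt="You are a patient billing-support agent.",
    )
)
for user_frame in (b"\x01" * 160, b"\x02" * 160):
    agent_frame = session.step(user_frame)  # produced while still listening
```

Because an output frame is generated on every input frame, interruption handling falls out of the loop structure itself: the model can cut its own utterance short as soon as incoming frames indicate the user has started speaking.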
PersonaPlex was trained on a blend of real and synthetic conversations: 7,303 real conversations from the Fisher English corpus and 39,322 synthetic assistant-role conversations. The model generalizes to text prompts well outside its training distribution and maintains a persona consistent with the text prompt throughout extended interactions. It outperforms other conversational AI agents on conversational dynamics, response and interruption latency, and task adherence in both question-answering assistant and customer service roles.
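For a concrete sense of the data blend, the snippet below tallies the two sources named above; the dictionary keys and layout are hypothetical, and only the counts come from the text. The split works out to roughly 16% real and 84% synthetic conversations.

```python
# Illustrative tally of the training blend described above.
TRAINING_MIX = {
    "fisher_english_real": 7_303,         # real telephone conversations
    "synthetic_assistant_roles": 39_322,  # synthetic assistant-role dialogues
}
total = sum(TRAINING_MIX.values())
for source, count in TRAINING_MIX.items():
    print(f"{source}: {count} ({count / total:.1%})")
```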


