Interaction Models: A Scalable Approach to Human-AI Collaboration
Today, we’re announcing a research preview of interaction models: models that handle interaction natively rather than through external scaffolding. We think interactivity should scale alongside intelligence; the way we work with AI should not be treated as an afterthought. Interaction models let people collaborate with AI the way we naturally collaborate with each other: they continuously take in audio, video, and text, and think, respond, and act in real time. We train an interaction model from scratch. To ensure real-time responsiveness, we adopt a multi-stream, micro-turn design. Our research preview demonstrates qualitatively new interaction capabilities, as well as state-of-the-art combined performance in intelligence and responsiveness.

The collaboration bottleneck

AI labs often treat the ability for AI to work autonomously as the model’s most important capability. [Kwa, T., West, B., Becker, J., et al. Measuring AI Ability to Complete Long Tasks. METR, 2025.] As a result, today’s models and interfaces aren’t optimized for humans to remain in the loop. [A recent frontier model card states: “Importantly, we find that when used in an interactive, synchronous, ‘hands-on-keyboard’ pattern, the benefits of the model were less clear. When used in this fashion, some users perceived [our model] as too slow and did not realize as much value. Autonomous, long-running agent harnesses better elicited the model’s coding capabilities.”] Autonomous interfaces are valuable, but in most real work, users can’t fully specify their requirements upfront and walk away. Good results benefit from a collaborative process where the human stays in the loop, clarifying and giving feedback along the way. However, humans increasingly get pushed out not because the work doesn’t need them, but because the interface has no room for them.
Instead, people are most effective when they can collaborate with AI the same way we do with other people: messaging, talking, listening, seeing, showing, and interjecting as needed, with the model doing the same. [Communication gets better with: (a) Copresence: people can interact with what others are interacting with; (b) Contemporality: people receive information as it’s produced by others with instant feedback; (c) Simultaneity: people receive and produce information at the same time. Clark H. and Brennan S., “Grounding in Communication,” in Perspectives on Socially Shared Cognition, 1991. The evanescence of orality for its participatory (cf. objectively distanced) nature; today’s computers and mediums of knowledge work have similar interactive properties. Ong, W. J., Orality and Literacy: The Technologizing of the Word, 1982.]

To resolve this, we need to move beyond the current turn-based interface for models. Today’s models experience reality in a single thread. [We are referring to commercial general-purpose frontier models; there are smaller-scale or specialized models like Moshi, PersonaPlex, Nemotron VoiceChat, or GPT-Realtime-Translate.] Until the user finishes typing or speaking, the model waits with no perception of what the user is doing or how the user is doing it. While the model is generating, its perception freezes: it receives no new information until it finishes or is interrupted. This creates a narrow channel for human-AI collaboration that limits how much of a person’s knowledge, intent, and judgement can reach the model, and how much of the model’s work can be understood. [“Metis, with the premium it places on practical knowledge, experience, and stochastic reasoning…is the mode of reasoning most appropriate to complex material and social tasks where the uncertainties are so daunting that we must trust our (experienced) intuition and feel our way.” Scott, J. C., “Métis,” in Seeing Like a State: How Certain Schemes to Improve the Human Condition Have Failed, 1998. “A little reflection will show that there is…a body of very important but unorganized knowledge…: the knowledge of the particular circumstances of time and place.” Hayek, F. A., “The Use of Knowledge in Society,” The American Economic Review, 1945.] Picture trying to resolve a crucial disagreement over email rather than in person.
At Thinking Machines, we believe we can solve this bandwidth bottleneck by making AI interactive in real time across any modality. This enables AI interfaces to meet humans where they are, rather than forcing humans to contort themselves to AI interfaces. Most existing AI models bolt on interactivity with a harness: stitching components together to emulate interruptions, multimodality, or concurrency.
[Most real-time commercial speech systems use voice-activity-detection components to detect turn boundaries.] However, “the bitter lesson” [Sutton R. The Bitter Lesson, 2019.] suggests that these hand-crafted systems will be outpaced by the advance of general capabilities. For interactivity to scale with intelligence, it must be part of the model itself. With this approach, scaling a model makes it both smarter and a better collaborator.

Capabilities

Making interactivity part of the model unlocks a variety of capabilities that would otherwise need to be implemented in the harness.
Seamless dialog management. The model implicitly tracks whether the speaker is thinking, yielding, self-correcting, or inviting a response. There is no separate dialog management component.
Verbal and visual interjections. The model jumps in as needed depending on the context, not only when the user finishes speaking.
Simultaneous speech. The user and the model can speak concurrently (e.g. live translation).
Time-awareness. The model has a direct sense of elapsed time.
Simultaneous tool calls, search, and generative UI. While speaking and listening to the user, the model can concurrently search, browse the web, or generate UI, weaving results back into the conversation as needed.
In a longer real session, all of this happens continuously, creating an experience that feels more like collaborating and less like prompting.
Our approach
[Figure: Time-aligned, micro-turn-based interaction. Interaction is grounded in time, with continuous input and output streams split into micro-turns.] Turn-based models see an alternating token sequence. Time-aware interaction models see a continuous stream of micro-turns, so silence, overlap, and interruption remain part of the model's context.
An interaction model is in constant two-way exchange with the user, perceiving and responding at the same time. Some domains take such interactivity as a given: the physical world demands that robotics and autonomous vehicles operate in real time. Audio full-duplex models [Moshi, PersonaPlex, Nemotron VoiceChat, Seeduplex] are another example where interaction is bidirectional and continuous. Applying the same principle, we set out to build an interaction model native to this regime: one that perceives and responds in the same continuous loop, across audio, video, and text.
The result is a system architected around two ideas: a time-aware interaction model that maintains real-time presence, and an asynchronous background model that handles sustained reasoning, tool use, and longer-horizon work.

System overview

The interaction model is in constant exchange with the user. When a task requires deeper reasoning than can be produced instantaneously, the interaction model delegates to a background model that runs asynchronously. [This approach builds upon prior work like Qwen-omni, KAME, MoshiRAG.] The interaction model remains present throughout (answering follow-ups, taking new input, holding the thread) and integrates background results into the conversation as they arrive.
[Figure: system diagram with the labels real-time, user, interaction model, background model, context, response, tool calls, browsing, etc.] The user continuously interacts with the interaction model, while the background model performs asynchronous tasks. Both systems share their context.
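The delegation pattern described above can be sketched with Python's `asyncio`. This is only an illustration of the control flow, not the system's implementation: the function names, the `deep:` prefix used to trigger delegation, the sleep durations, and the list used as shared context are all hypothetical stand-ins.

```python
import asyncio


async def background_model(task, shared_context):
    """Hypothetical stand-in for the asynchronous background model."""
    await asyncio.sleep(0.05)  # simulates slow reasoning or tool use
    shared_context.append(("background", f"result for {task!r}"))


async def interaction_model(user_events, shared_context):
    """Hypothetical stand-in for the interaction model.

    It keeps handling user input while delegated work runs in the background.
    """
    pending = []
    for event in user_events:
        shared_context.append(("user", event))
        if event.startswith("deep:"):  # assumed trigger, for illustration only
            # Delegate without awaiting, so the loop stays responsive.
            job = asyncio.create_task(background_model(event, shared_context))
            pending.append(job)
            shared_context.append(("interaction", "working on it"))
        else:
            shared_context.append(("interaction", f"reply to {event!r}"))
        await asyncio.sleep(0.01)  # one simulated step of the real-time loop
    # Background results land in the shared context as they arrive.
    await asyncio.gather(*pending)


context = []
events = ["hello", "deep: plan trip", "follow-up"]
asyncio.run(interaction_model(events, context))
for speaker, text in context:
    print(speaker, "|", text)
```

In this toy run the follow-up is answered before the background result is appended, which mirrors the idea that the interaction model holds the thread while longer work completes.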
This split lets the user benefit from both responsiveness and the full extent of intelligence: the planning, tool use, and agentic workflows of reasoning models at the response latency of non-thinking ones. Note that both the background and interaction models are intelligent: on its own, the interaction model is also competitive on both interactive and intelligence benchmarks.

The interaction model

Our starting point is continuous audio and video, modalities that are inherently real-time. Text can wait, but a live conversation cannot. By designing around the hardest case first, we arrive at an architecture that is natively multimodal, time-aware, and capable of handling concurrent input and output streams across all modalities. Several design choices make this possible.

Time-aligned micro-turns. The interaction model works with micro-turns, continuously interleaving the processing of 200ms worth of input with the generation of 200ms worth of output. Rather than consuming a complete user turn and generating a complete response, the model treats both input and output tokens as streams. Working with 200ms chunks of these streams enables near-real-time concurrency of multiple input and output modalities.
[Figure: human perception (parallel streams input 0 to input 4 and output 0 to output 3) compared with the model token sequence.] Human perception preserves concurrent input and output streams, while the model receives a single interleaved token sequence.
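The interleaving described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the 200ms chunk duration comes from the text, but the function name, the string placeholders for chunks, and the choice to place each input chunk before the output chunk of the same micro-turn are hypothetical. The actual tokenization and ordering are not specified here.

```python
from itertools import zip_longest

MICRO_TURN_MS = 200  # chunk duration stated in the text


def interleave_micro_turns(inputs, outputs):
    """Merge two time-aligned chunk streams into one sequence.

    Chunk i of each stream covers [i * 200 ms, (i + 1) * 200 ms).
    Assumption: within a micro-turn, the input chunk precedes the output chunk.
    None marks a stream that has no chunk yet; silence would be an explicit chunk.
    """
    sequence = []
    for index, (inp, out) in enumerate(zip_longest(inputs, outputs)):
        start_ms = index * MICRO_TURN_MS
        if inp is not None:
            sequence.append((start_ms, "input", inp))
        if out is not None:
            sequence.append((start_ms, "output", out))
    return sequence


inputs = ["input 0", "input 1", "input 2", "input 3", "input 4"]
outputs = ["output 0", "output 1", "output 2", "output 3"]
sequence = interleave_micro_turns(inputs, outputs)
for start_ms, stream, chunk in sequence:
    print(f"{start_ms:>4} ms  {stream:<6}  {chunk}")
```

Because every entry carries its start time, elapsed time is recoverable from position alone (micro-turn index times 200ms), which is one simple way to picture how silence and overlap stay visible in a single sequence.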
With this design, there are no artificial turn boundaries that the model must adhere to. In contrast, most existing real-time systems require a harness that predicts turn boundaries in order for turn-based models to feel real-time and responsive. [Moshi, PersonaPlex, and Nemotron VoiceChat are examples of full-duplex systems that do not use harnesses to detect turns. They are smaller-scale models focused on latency rather than intelligence benchmarks.] This harness is made of components like voice-activity detection (VAD) that are meaningfully less intelligent than the model itself. This precludes a variety of interaction modes like proactive interjections (“interrupt when I say something wrong”) or reactions to visual cues (“tell me when I’ve written a bug in my code”). Moreover, without turn boundaries, the model can do things like speak while listening (“translate from Spanish to English live”) or watching (“live-commentate this sports game”). Thus, all of these interaction modes that require special harnesses today become special cases of what the model can do, and they improve in quality as we scale up model size and training data.

Encoder-free early fusion. Rather than processing audio and video through large, standalone encoders, we opt for a system with minimal pre-processing. Many omnimodal models require training a separate encoder (e.g. Whisper-like) or decoder (e.g. TTS model-like). We instead take in audio signals as dMel (Bai et al. 2024) and transform them via a lightweight embedding layer. Images are split into 40x40 patches, which are encoded by an hMLP (Touvron et al. 2022). For the audio decoder we use a flow head (Lipman et al. 2022). All components are co-trained from scratch together with the transformer.
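As a small illustration of the image path, the sketch below splits an image into 40x40 patches in plain Python. It covers only the patch-splitting step; the hMLP that encodes each patch is not shown. The function name, the non-overlapping tiling, the single-channel input, and the requirement that both sides be multiples of 40 are assumptions made for the example, since the text does not specify them.

```python
PATCH = 40  # patch side length stated in the text


def split_into_patches(image, patch=PATCH):
    """Split an H x W image (a list of rows) into non-overlapping tiles.

    Assumption: H and W are multiples of `patch`; how other sizes are
    handled is not specified. Each tile is flattened in row-major order.
    """
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image sides must be multiples of the patch size")
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            tile = [
                image[top + dy][left + dx]
                for dy in range(patch)
                for dx in range(patch)
            ]
            patches.append(tile)
    return patches


# Synthetic 80 x 120 single-channel image; each pixel value encodes its position.
image = [[row * 120 + col for col in range(120)] for row in range(80)]
patches = split_into_patches(image)
print(len(patches), len(patches[0]))
```

For the 80 x 120 example this yields six patches of 1,600 values each, and each flattened patch would then be the input to whatever per-patch encoder is used.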