An innocent-looking YouTube video or podcast could become a security problem if it contains commands that human ears cannot detect. The method, which its researchers call AudioHijack, targets large audio-language models and reached average success rates of 79% to 96% across the settings tested.
We are becoming increasingly comfortable giving AI assistants and agents access to larger parts of our digital lives. These systems no longer merely answer questions. They can handle files, launch web searches, write emails, control applications and, in some environments, carry out actions on behalf of the user. That is convenient, but it also opens a new attack surface. If an AI agent can process audio, then the relevant question is not only what we type or say to it, but also what hidden instructions it might hear from a video, a podcast or background sound.
AudioHijack, presented by researchers from Zhejiang University and Nanyang Technological University, targets exactly that problem. The method allows an attacker to embed a signal into an apparently harmless audio clip, in a way that does not sound like an instruction to a human listener but can be interpreted as a command by a large audio-language model. The signal could be part of a podcast, a YouTube video or another audio source, while the user hears nothing unusual. The researchers describe this not as a simple speech-recognition trick, but as auditory prompt injection: the same basic attack logic as hidden textual instructions designed to steer a model, except here the command is buried in sound rather than between written lines.
A Signal Trained in Half an Hour Can Work Across Different Contexts
Meng Chen, the lead author quoted by IEEE Spectrum, says one of the most disturbing aspects of the technique is that it does not need to be rebuilt for every situation. Chen put it this way: “It takes just half an hour to train this signal, and then, because this signal is context-agnostic, you can use it to attack the target model whenever you want, no matter what the user says.” That matters because the attack is not necessarily tied to the user’s specific prompt. A hidden audio command can attempt to influence the model’s behavior inside another otherwise normal audio environment.
The researchers tested the method against 13 large audio-language models and measured successful hijacking across six categories of misbehavior. According to the paper, average success rates ranged from 79% to 96% across different settings, while the audio remained highly imperceptible to users. The experiments were not limited to open models either: real-world tests also showed that commercial voice AI systems linked to Microsoft Azure and Mistral AI could be induced to perform unauthorized actions.
The paper describes the technical approach as context-agnostic and imperceptible auditory prompt injection. AudioHijack generates adversarial audio that steers the model’s attention toward the hidden instruction. The researchers also use a convolutional blending method that modulates the perturbation into something resembling natural reverberation. In plain terms, the attacker is not simply hiding a clearly audible spoken command. The signal is blended into the sound so that it carries meaning for the model while appearing to the human listener as harmless audio texture, if it is noticed at all.
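To make the reverberation idea concrete, the following minimal sketch shows only the generic signal-processing step: convolving a sound with a decaying impulse response adds delayed, quieter copies of the sound to itself, which the ear tends to hear as room echo rather than as new content. It is not the researchers' method and contains no adversarial optimization. The function names, the toy impulse response and all parameter values are assumptions made for illustration.

```python
import math
import random


def toy_impulse_response(num_taps, decay_taps, seed=0):
    """Exponentially decaying noise: a common toy stand-in for a room response.

    Hypothetical helper for illustration; real room responses are measured.
    """
    rng = random.Random(seed)
    taps = [rng.gauss(0.0, 1.0) * math.exp(-k / decay_taps) for k in range(num_taps)]
    taps[0] = 1.0  # direct path: the unaltered sound arrives first
    norm = math.sqrt(sum(h * h for h in taps))
    return [h / norm for h in taps]  # scale to unit energy


def convolve_same_length(signal, impulse_response):
    """Each output sample is a weighted sum of the current and earlier input samples."""
    out = [0.0] * len(signal)
    for n in range(len(signal)):
        acc = 0.0
        for k in range(min(n + 1, len(impulse_response))):
            acc += impulse_response[k] * signal[n - k]
        out[n] = acc
    return out


SAMPLE_RATE = 8000  # assumed sample rate in Hz, kept small so the loop stays fast
dry = [math.sin(2 * math.pi * 440.0 * n / SAMPLE_RATE) for n in range(800)]
ir = toy_impulse_response(num_taps=400, decay_taps=80.0)
wet = convolve_same_length(dry, ir)  # the tone with a synthetic echo tail
```

The point of the sketch is perceptual: a change expressed as a convolution kernel sounds like a property of the room, not like an added voice, which is consistent with the paper's description of the perturbation resembling natural reverberation.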
The Defenses Are Not Convincing Yet
The researchers tested several obvious defensive ideas, but the results were not reassuring. One method asked the model to anticipate and avoid this type of attack, yet it stopped only 7% of the attacks. Another approach tried to make the system plan its next steps and avoid deviating from the original instruction, but that reached only a 28% defense success rate. That is far too low for a scenario in which an AI agent may already have real permissions and access to sensitive tools.
The problem is not simply that a model might mishear something. The larger issue is that AI agents are increasingly able to act for users. If such a system has access to private documents, email, banking data, corporate files or internal systems, a hidden audio instruction might not merely produce a strange answer. It could lead to data exposure or unauthorized action. In that scenario, a background TikTok clip, podcast or YouTube video is not just noise. It can become a command channel.
In a response to IEEE Spectrum, Microsoft thanked the researchers for their work and said this kind of study helps improve model resilience. The company also stressed that, in real-world applications, models are often placed behind additional developer-controlled safety layers rather than exposed on their own. That distinction matters, but it does not make the discovery harmless. As more voice-driven and multimodal AI agents enter daily use, it becomes increasingly important that systems do not treat every audio pattern as an instruction with equal authority. The lesson is simple for now: if an AI agent has real permissions, the microphone and audio input are not just convenience features. They are also security risks.
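One way to picture a developer-controlled safety layer is a permission check that sits outside the model and weighs where a request came from. The sketch below is an assumption for illustration, not a defense evaluated in the paper and not any vendor's API: it presumes the host application can label each turn with its least trusted input source, and the tool names, `Source` labels and `authorize` function are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Source(Enum):
    """Least trusted input present in the turn (hypothetical labels)."""
    USER_TYPED = "user_typed"
    AMBIENT_AUDIO = "ambient_audio"  # video, podcast or background sound


class Decision(Enum):
    ALLOW = "allow"
    CONFIRM = "confirm"  # ask the user before acting
    DENY = "deny"


# Assumed examples of tools that touch private data or act for the user.
SENSITIVE_TOOLS = frozenset({"send_email", "read_file", "transfer_funds"})


@dataclass(frozen=True)
class ToolRequest:
    tool: str
    source: Source


def authorize(request):
    """Decide outside the model, so a hijacked model cannot overrule the policy."""
    if request.source is Source.AMBIENT_AUDIO:
        # Audio the user did not author never gets to trigger sensitive tools.
        if request.tool in SENSITIVE_TOOLS:
            return Decision.DENY
        return Decision.CONFIRM
    if request.tool in SENSITIVE_TOOLS:
        return Decision.CONFIRM
    return Decision.ALLOW


print(authorize(ToolRequest("send_email", Source.AMBIENT_AUDIO)).value)
print(authorize(ToolRequest("web_search", Source.USER_TYPED)).value)
```

A gate like this does not detect the hidden signal at all. It only limits what a request can do when untrusted audio was part of the input, and how well that holds up against AudioHijack is not something the source reports.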
Source: 3DJuegos, IEEE Spectrum, arXiv



