AdvPulse: Universal, Synchronization-free, and Targeted Audio Adversarial Attacks via Subsecond Perturbations


Static-Speech Scenario

  • Existing audio adversarial attacks focus on scenarios where the adversary has prior knowledge of the entire speech input and can therefore generate an adversarial example by aligning and mixing the audio input with the corresponding adversarial perturbation.

Streaming-speech Scenario

  • In this work we consider a more practical and challenging attack scenario where the intelligent audio system takes streaming audio inputs (e.g., live human speech) and the adversary can deceive the system by playing adversarial perturbations simultaneously.
  • This change in attack behavior brings great challenges that prevent existing adversarial perturbation generation methods from being applied directly. In practice, (1) the adversary cannot anticipate what the victim will say, so they cannot rely on prior knowledge of the speech signal to guide perturbation generation; and (2) the adversary cannot control when the victim will speak, so synchronization between the adversarial perturbation and the speech cannot be guaranteed.

Attack Design

  • To address these challenges, in this paper we propose AdvPulse, a systematic approach to generating subsecond audio adversarial perturbations that can alter the recognition results of streaming audio inputs in a targeted and synchronization-free manner.
  • To circumvent the constraints on speech content and timing, we exploit a penalty-based universal adversarial perturbation generation algorithm and incorporate varying time delays into the optimization process. We further tailor the adversarial perturbation to environmental sounds to make it inconspicuous to humans. Additionally, by modeling the sources of distortion that occur during physical playback, we generate more robust audio adversarial perturbations that remain effective even under over-the-air propagation.
  • We conduct experiments on two representative types of intelligent audio systems (i.e., speaker recognition and speech command recognition) in various realistic environments. The results show that our attack achieves an average success rate of over 89.6% in indoor environments and 76.0% in inside-vehicle scenarios, even in the presence of loud engine and road noise.
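The penalty-based, synchronization-free objective described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `target_loss_fn` stands in for the targeted loss of the recognition model, and the penalty weight `c` is a hypothetical choice. The key idea is that the expected loss is averaged over both different utterances (universality) and random injection offsets (synchronization-freeness), with an L2 penalty keeping the perturbation small:

```python
import numpy as np

def inject(x, delta, offset):
    """Additively inject a subsecond perturbation delta into audio x at a sample offset."""
    y = x.copy()
    y[offset:offset + len(delta)] += delta
    return y

def advpulse_objective(delta, utterances, target_loss_fn, c=0.1, seed=0):
    """Penalty-based universal objective: average the targeted loss over many
    utterances AND random injection offsets, then add an L2 penalty that keeps
    the perturbation small (and thus less conspicuous)."""
    rng = np.random.default_rng(seed)
    losses = []
    for x in utterances:
        # Random offset models the adversary's lack of control over timing.
        offset = int(rng.integers(0, len(x) - len(delta) + 1))
        losses.append(target_loss_fn(inject(x, delta, offset)))
    return float(np.mean(losses) + c * np.dot(delta, delta))
```

In the actual attack, `target_loss_fn` would be the differentiable loss of the speaker or command recognition model with respect to the adversary's target label, and `delta` would be updated by gradient descent on this expected objective; environmental-sound mimicking adds a further term penalizing deviation from the chosen sound template.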

Audio Sample

The generated adversarial perturbations have the following two important properties:
  • Universal: the adversarial perturbations are expected to work on any speech input from that class.
  • Synchronization-free: the adversarial perturbations can remain effective regardless of the injection timing.

Attacking Speaker Recognition System

1. Original speaker: spk-2, target speaker: spk-3 (without environmental sound mimicking)

spk-2 Audio 1
spk-2 Audio 2
Adversarial Perturbation
spk-3 Audio

2. Original speaker: spk-7, target speaker: spk-9 (with environmental sound mimicking)

spk-7 Audio 1
spk-7 Audio 2
Adversarial Perturbation
Environmental Sound Template
spk-9 Audio

Attacking Speech Command Recognition System

1. Original command: "left", target command: "right" (without environmental sound mimicking)

"left" Audio 1
"left" Audio 2
Adversarial Perturbation
"right" Audio

2. Original command: "stop", target command: "go" (with environmental sound mimicking)

"stop" Audio 1
"stop" Audio 2
Adversarial Perturbation
Environmental Sound Template
"go" Audio

Attack Demo

Live Speech Attack Scenario Demo

By injecting a short adversarial perturbation, the adversary can cause the speech command uttered by the victim to be misrecognized as any target command. Moreover, the adversarial perturbation can be disguised as a situational environmental sound (e.g., bird chirping, a phone notification) to make the attack more unnoticeable.




  • For more information, please refer to our paper (in Proceedings of ACM CCS'20).

Paper Presentation