Can we listen to human poses?
Given only acoustic signals that carry no high-level information, such as voices or the sounds of scenes and actions, how much can we infer about human behavior?
Existing methods suffer from privacy issues because they rely on signals that include human speech or the sounds of specific actions. In contrast, we explore whether low-level acoustic signals alone can provide enough clues to estimate 3D human poses, using active acoustic sensing with a single pair of microphones and loudspeakers.
This is a challenging task because sound diffracts far more than other sensing signals and therefore obscures the shapes of objects in a scene. Accordingly, we introduce a framework that encodes multichannel audio features into 3D human poses. To capture the subtle sound changes that reveal detailed pose information, we explicitly extract phase features from the acoustic signals alongside conventional spectrum features and feed both into our human pose estimation network.
We also show that reflected and diffracted sounds are easily influenced by differences in subjects' physiques (e.g., height and muscularity), which degrades prediction accuracy. We reduce these gaps by using a subject discriminator to improve accuracy. Our experiments show that, using only low-dimensional acoustic information, our method outperforms baseline methods.
Overview of our framework. We encode multichannel audio features (spectrum + phase) into 3D human poses using a CNN-based architecture with adversarial subject-invariant learning.
We use a minimal setup, a single ambisonic microphone paired with a loudspeaker, to actively sense the environment using time-stretched pulse (TSP) signals.
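As a rough illustration of the active-sensing idea, the sketch below probes a simulated room with a linear sine sweep (a simple stand-in for the actual TSP signal; the sampling rate, sweep band, and propagation delay are all illustrative assumptions) and recovers the delay by frequency-domain deconvolution:

```python
import numpy as np

fs = 48_000          # sampling rate (Hz), assumed
dur = 1.0            # sweep duration (s), assumed
t = np.arange(int(fs * dur)) / fs

# Linear sine sweep emitted from the loudspeaker, standing in for
# the time-stretched pulse (TSP) used in the paper.
f0, f1 = 20.0, 20_000.0
sweep = np.sin(2 * np.pi * (f0 * t + (f1 - f0) * t**2 / (2 * dur)))

# Simulate a recording: the room delays and attenuates the probe signal.
delay = 120                                   # samples of propagation delay
recorded = np.zeros(len(sweep) + delay)
recorded[delay:] = 0.6 * sweep

# Deconvolve in the frequency domain to recover the impulse response;
# the peak position reveals the time of arrival.
n = len(recorded)
H = np.fft.rfft(recorded) / (np.fft.rfft(sweep, n) + 1e-12)
ir = np.fft.irfft(H, n)
print(int(np.argmax(np.abs(ir))))             # → 120
```

Changes in the body's position alter this impulse response, which is what the downstream features summarize.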
We explicitly extract phase features representing the time difference of arrival (TDOA), capturing the subtle shifts that arise when the human body occludes the acoustic signals.
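A minimal numpy sketch of the intuition behind the phase features (not the paper's actual feature pipeline): a two-channel signal with a synthetic 8-sample inter-channel lag, whose inter-channel phase difference encodes the TDOA.

```python
import numpy as np

fs = 48_000
rng = np.random.default_rng(0)
sig = rng.standard_normal(fs // 2)            # broadband probe, assumed

# Two microphone channels: channel 1 lags channel 0 by 8 samples,
# mimicking a path lengthened by the body occluding the sound.
lag = 8
ch0 = sig
ch1 = np.concatenate([np.zeros(lag), sig[:-lag]])

# Phase feature: inter-channel phase difference of the spectra.
# Its slope over frequency encodes the time difference of arrival (TDOA).
n = len(ch0)
phase_diff = np.angle(np.fft.rfft(ch1)) - np.angle(np.fft.rfft(ch0))

# Equivalent time-domain check via circular cross-correlation.
xcorr = np.fft.irfft(np.fft.rfft(ch1) * np.conj(np.fft.rfft(ch0)), n)
print(int(np.argmax(xcorr)))                  # → 8
```

Spectrum magnitudes alone discard this timing information, which is why the phase is extracted explicitly.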
Our 1D convolutional neural network learns non-linear mappings from multichannel audio features to 3D body joint locations.
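The shape of this mapping can be sketched as below; the channel counts, kernel size, and joint count are illustrative assumptions, and the real network is deeper and trained rather than randomly initialized.

```python
import numpy as np

rng = np.random.default_rng(0)
n_ch, n_frames, n_joints = 8, 64, 21           # assumed feature channels / time frames / joints

feats = rng.standard_normal((n_ch, n_frames))  # stacked spectrum + phase features

def conv1d(x, w):
    """Valid 1D convolution along the time axis."""
    out_ch, in_ch, k = w.shape
    t_out = x.shape[1] - k + 1
    y = np.zeros((out_ch, t_out))
    for o in range(out_ch):
        for t in range(t_out):
            y[o, t] = np.sum(w[o] * x[:, t:t + k])
    return y

w1 = rng.standard_normal((16, n_ch, 5)) * 0.1
h = np.maximum(conv1d(feats, w1), 0.0)         # conv + ReLU
w2 = rng.standard_normal((n_joints * 3, h.size)) * 0.01
pose = (w2 @ h.ravel()).reshape(n_joints, 3)   # x, y, z per joint
print(pose.shape)                              # → (21, 3)
```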
We apply adversarial learning with a subject discriminator to obtain physique-invariant features, reducing overfitting to individual subjects' physical characteristics.
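One common way to realize such adversarial training is a gradient-reversal layer between the feature encoder and the subject discriminator; whether this exact mechanism (rather than, say, alternating optimization) is used here is an assumption. A minimal numpy sketch of the reversal operation:

```python
import numpy as np

class GradReverse:
    """Identity in the forward pass; negates (and scales) the gradient in
    the backward pass, so the encoder is pushed to produce features the
    subject discriminator cannot classify - i.e. physique-invariant ones."""
    def __init__(self, lam=1.0):
        self.lam = lam  # reversal strength, a tunable hyperparameter

    def forward(self, x):
        return x

    def backward(self, grad):
        return -self.lam * grad

grl = GradReverse(lam=0.5)
x = np.array([1.0, -2.0, 3.0])
g = np.array([0.2, 0.4, -0.6])
print(grl.forward(x))     # features pass through unchanged
print(grl.backward(g))    # → [-0.1 -0.2  0.3]
```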
We record data in two environments: (i) an anechoic room with minimal reverberation and (ii) a classroom with significant ambient noise, to evaluate robustness.
Visualization of the extracted acoustic features. Phase features capture subtle time-of-arrival differences caused by human body reflections and diffractions.
Qualitative comparison of predicted 3D poses. Our method accurately captures human poses from acoustic signals alone.
Additional qualitative results demonstrating the effectiveness of our acoustic-based pose estimation across different actions.
Pose estimation in a completely dark room. The top row shows RGB images where almost nothing is visible. Despite the absence of any visual information, our acoustic-based method (bottom, blue) successfully estimates 3D human poses that closely match the ground truth (middle, red). This demonstrates a key advantage of acoustic sensing: it works regardless of lighting conditions.
The intermediate outputs of three subjects. Samples from three different subjects (two males, one female) were reduced to two dimensions. Without any subject discriminator (left) and with Ld (middle), large differences among subjects are visible. These differences are successfully removed with our proposed Lstd (right).
We are the first to tackle 3D human pose estimation given only low-level acoustic signals, without any high-level semantic information such as speech or action sounds.
We propose a CNN-based framework with explicit phase feature extraction and adversarial subject-invariant learning for robust acoustic-based pose estimation.
We create AcousticPose3D, a new dataset recorded in both anechoic and noisy environments with synchronized motion capture data, enabling future research in acoustic human sensing.
We release AcousticPose3D, a dataset recorded in both anechoic and noisy environments with synchronized acoustic signals and motion capture data. The dataset consists of the following three components:
Raw acoustic signals captured by microphones in anechoic and classroom environments.
3D human motion data captured by a motion capture system, synchronized with the acoustic signals.
Joint annotations and metadata for training and evaluating pose estimation models.
Due to a collaborative research agreement with our research partner, the portion of the dataset captured in the anechoic chamber cannot be made publicly available.
@inproceedings{shibata2023listening,
  title={Listening Human Behavior: 3D Human Pose Estimation with Acoustic Signals},
  author={Shibata, Yuto and Kawashima, Yutaka and Isogawa, Mariko and Irie, Go and Kimura, Akisato and Aoki, Yoshimitsu},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2023}
}