Can we listen to human poses?
Given only acoustic signals that carry no high-level information, such as voices or the sounds of scenes and actions, how much can we infer about human behavior?
Existing methods suffer from privacy issues because they rely on signals that include human speech or the sounds of specific actions. In contrast, we explore whether low-level acoustic signals alone can provide enough clues to estimate 3D human poses, using active acoustic sensing with a single pair of microphones and loudspeakers.
This is a challenging task because sound diffracts far more than other sensing signals and therefore obscures the shapes of objects in a scene. Accordingly, we introduce a framework that encodes multichannel audio features into 3D human poses. To capture the subtle sound changes that reveal detailed pose information, we explicitly extract phase features from the acoustic signals alongside conventional spectrum features and feed both into our human pose estimation network.
We also show that reflected and diffracted sounds are easily influenced by differences in subjects' physiques (e.g., height and muscularity), which degrades prediction accuracy. We reduce these gaps by using a subject discriminator to improve accuracy. Our experiments show that, using only low-dimensional acoustic information, our method outperforms baseline methods.
Overview of our framework. We encode multichannel audio features (spectrum + phase) into 3D human poses using a CNN-based architecture with adversarial subject-invariant learning.
We use a minimal setup, a single ambisonic microphone paired with a loudspeaker, to actively sense the environment using time-stretched pulse (TSP) signals.
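As a rough illustration of the active-sensing idea, the sketch below probes a simulated room with a linear sine sweep (a simple stand-in for the actual TSP signal; the sampling rate, sweep band, and propagation delay are all illustrative assumptions) and recovers the delay by frequency-domain deconvolution:

```python
import numpy as np

fs = 48_000          # sampling rate (Hz), assumed
dur = 1.0            # sweep duration (s), assumed
t = np.arange(int(fs * dur)) / fs

# Linear sine sweep emitted from the loudspeaker, standing in for
# the time-stretched pulse (TSP) used in the paper.
f0, f1 = 20.0, 20_000.0
sweep = np.sin(2 * np.pi * (f0 * t + (f1 - f0) * t**2 / (2 * dur)))

# Simulate a recording: the room delays and attenuates the probe signal.
delay = 120                                   # samples of propagation delay
recorded = np.zeros(len(sweep) + delay)
recorded[delay:] = 0.6 * sweep

# Deconvolve in the frequency domain to recover the impulse response;
# the peak position reveals the time of arrival.
n = len(recorded)
H = np.fft.rfft(recorded) / (np.fft.rfft(sweep, n) + 1e-12)
ir = np.fft.irfft(H, n)
print(int(np.argmax(np.abs(ir))))             # → 120
```

Changes in the body's position alter this impulse response, which is what the downstream features summarize.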
We explicitly extract phase features representing the time difference of arrival (TDOA), capturing the subtle shifts that arise when the human body occludes the acoustic signals.
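A minimal numpy sketch of the intuition behind the phase features (not the paper's actual feature pipeline): a two-channel signal with a synthetic 8-sample inter-channel lag, whose inter-channel phase difference encodes the TDOA.

```python
import numpy as np

fs = 48_000
rng = np.random.default_rng(0)
sig = rng.standard_normal(fs // 2)            # broadband probe, assumed

# Two microphone channels: channel 1 lags channel 0 by 8 samples,
# mimicking a path lengthened by the body occluding the sound.
lag = 8
ch0 = sig
ch1 = np.concatenate([np.zeros(lag), sig[:-lag]])

# Phase feature: inter-channel phase difference of the spectra.
# Its slope over frequency encodes the time difference of arrival (TDOA).
n = len(ch0)
phase_diff = np.angle(np.fft.rfft(ch1)) - np.angle(np.fft.rfft(ch0))

# Equivalent time-domain check via circular cross-correlation.
xcorr = np.fft.irfft(np.fft.rfft(ch1) * np.conj(np.fft.rfft(ch0)), n)
print(int(np.argmax(xcorr)))                  # → 8
```

Spectrum magnitudes alone discard this timing information, which is why the phase is extracted explicitly.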
Our 1D convolutional neural network learns non-linear mappings from multichannel audio features to 3D body joint locations.
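The shape of this mapping can be sketched as below; the channel counts, kernel size, and joint count are illustrative assumptions, and the real network is deeper and trained rather than randomly initialized.

```python
import numpy as np

rng = np.random.default_rng(0)
n_ch, n_frames, n_joints = 8, 64, 21           # assumed feature channels / time frames / joints

feats = rng.standard_normal((n_ch, n_frames))  # stacked spectrum + phase features

def conv1d(x, w):
    """Valid 1D convolution along the time axis."""
    out_ch, in_ch, k = w.shape
    t_out = x.shape[1] - k + 1
    y = np.zeros((out_ch, t_out))
    for o in range(out_ch):
        for t in range(t_out):
            y[o, t] = np.sum(w[o] * x[:, t:t + k])
    return y

w1 = rng.standard_normal((16, n_ch, 5)) * 0.1
h = np.maximum(conv1d(feats, w1), 0.0)         # conv + ReLU
w2 = rng.standard_normal((n_joints * 3, h.size)) * 0.01
pose = (w2 @ h.ravel()).reshape(n_joints, 3)   # x, y, z per joint
print(pose.shape)                              # → (21, 3)
```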
We apply adversarial learning with a subject discriminator to obtain physique-invariant features, reducing overfitting to individual subjects' physical characteristics.
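One common way to realize such adversarial training is a gradient-reversal layer between the feature encoder and the subject discriminator; whether this exact mechanism (rather than, say, alternating optimization) is used here is an assumption. A minimal numpy sketch of the reversal operation:

```python
import numpy as np

class GradReverse:
    """Identity in the forward pass; negates (and scales) the gradient in
    the backward pass, so the encoder is pushed to produce features the
    subject discriminator cannot classify - i.e. physique-invariant ones."""
    def __init__(self, lam=1.0):
        self.lam = lam  # reversal strength, a tunable hyperparameter

    def forward(self, x):
        return x

    def backward(self, grad):
        return -self.lam * grad

grl = GradReverse(lam=0.5)
x = np.array([1.0, -2.0, 3.0])
g = np.array([0.2, 0.4, -0.6])
print(grl.forward(x))     # features pass through unchanged
print(grl.backward(g))    # → [-0.1 -0.2  0.3]
```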
We record data in two environments: (i) an anechoic room with minimal reverberation and (ii) a classroom with significant ambient noise, to evaluate robustness.
Visualization of the extracted acoustic features. Phase features capture subtle time-of-arrival differences caused by human body reflections and diffractions.
Qualitative comparison of predicted 3D poses. Our method accurately captures human poses from acoustic signals alone.
Additional qualitative results demonstrating the effectiveness of our acoustic-based pose estimation across different actions.
Pose estimation in a completely dark room. The top row shows RGB images where almost nothing is visible. Despite the absence of any visual information, our acoustic-based method (bottom, blue) successfully estimates 3D human poses that closely match the ground truth (middle, red). This demonstrates a key advantage of acoustic sensing: it works regardless of lighting conditions.
The intermediate outputs of three subjects. Samples from three different subjects (two males, one female) were reduced to two dimensions. Without any subject discriminator (left) and with Ld (middle), large differences among subjects are visible. These differences are successfully removed with our proposed Lstd (right).
We are the first to tackle 3D human pose estimation given only low-level acoustic signals, without any high-level semantic information such as speech or action sounds.
We propose a CNN-based framework with explicit phase feature extraction and adversarial subject-invariant learning for robust acoustic-based pose estimation.
We create AcousticPose3D, a new dataset recorded in both anechoic and noisy environments with synchronized motion capture data, enabling future research in acoustic human sensing.
We release AcousticPose3D, a dataset recorded in both anechoic and noisy environments with synchronized acoustic signals and motion capture data. The dataset consists of the following three components:
Raw acoustic signals captured by microphones in anechoic and classroom environments.
3D human motion data captured by a motion capture system, synchronized with the acoustic signals.
Joint annotations and metadata for training and evaluating pose estimation models.
Due to a collaborative research agreement with our research partner, the portion of the dataset captured in the anechoic chamber cannot be made publicly available.
@inproceedings{shibata2023listening,
  title={Listening Human Behavior: 3D Human Pose Estimation with Acoustic Signals},
  author={Shibata, Yuto and Kawashima, Yutaka and Isogawa, Mariko and Irie, Go and Kimura, Akisato and Aoki, Yoshimitsu},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2023}
}