Paper Abstract and Keywords
Presentation
2023-02-28 15:55
Self-Supervised Learning With Spatial Audio-Visual Recording for Sound Event Localization and Detection Yoto Fujita (Kyoto Univ.), Yoshiaki Bando (AIST), Keisuke Imoto (Doshisha Univ./AIST), Masaki Onishi (AIST), Kazuyoshi Yoshii (Kyoto Univ.) EA2022-89 SIP2022-133 SP2022-53
Abstract
This paper describes an unsupervised pre-training method for sound event localization and detection (SELD) on multi-channel acoustic signals. The prevailing approach to SELD tasks involves training deep neural networks (DNNs) through supervised learning to estimate the activation, class, and direction of sound events. However, creating training data requires a significant amount of effort, which limits the achievable estimation accuracy and generalization performance. To address these issues, this study proposes a method of pre-training DNNs using spatially-informed virtual reality (VR) content that is publicly available on the internet. Using the VR content's equirectangular 360-degree images and first-order ambisonics (FOA) signals, the DNN can be pre-trained in an unsupervised manner. In such content, the activation, class, and direction of sound events are assumed to correspond to the temporal changes, appearance, and positions of sound sources in the images. Contrastive learning is therefore performed so that the audio embedding for each direction and the local visual embedding for the corresponding direction are drawn close together when they come from the same content and direction (positive pairs) and pushed apart otherwise (negative pairs), yielding a latent feature space that captures event class and direction across both sound and images. A DNN for SELD was then constructed by attaching an output layer to the pre-trained audio feature extractor and fine-tuned with a small amount of labeled training data. The effectiveness of the proposed method was evaluated by pre-training the audio feature extractor on 100 hours of spatially-informed video and acoustic signals and transferring it to the SELD dataset STARSS22.
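The contrastive objective described in the abstract can be illustrated with a minimal sketch in PyTorch, assuming an InfoNCE-style symmetric loss as used in common audio-visual contrastive learning; the function name, temperature value, and batch construction below are illustrative assumptions, not the authors' implementation.

# Minimal sketch (assumed InfoNCE-style loss, not the authors' code).
# Each row i of audio_emb and visual_emb corresponds to one
# (content, direction) pair: row i of both tensors is a positive pair,
# and every other pairing serves as a negative.
import torch
import torch.nn.functional as F

def audio_visual_contrastive_loss(audio_emb: torch.Tensor,
                                  visual_emb: torch.Tensor,
                                  temperature: float = 0.07) -> torch.Tensor:
    # L2-normalize so the dot products below are cosine similarities.
    a = F.normalize(audio_emb, dim=-1)
    v = F.normalize(visual_emb, dim=-1)
    logits = a @ v.t() / temperature              # (N, N) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)
    # Symmetric cross-entropy: audio-to-visual and visual-to-audio.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

Under this setup, minimizing the loss pulls matching audio and visual embeddings together and pushes mismatched content/direction pairs apart, which is the property the abstract relies on for learning a shared latent space of event class and direction.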
Keyword
Sound event localization and detection / Audio-visual learning / Self-supervised learning / Contrastive learning
Reference Info.
IEICE Tech. Rep., vol. 122, no. 389, SP2022-53, pp. 78-82, Feb. 2023. |
Paper # |
SP2022-53 |
Date of Issue |
2023-02-21 (EA, SIP, SP) |
ISSN |
Online edition: ISSN 2432-6380 |
Copyright and reproduction |
All rights are reserved and no part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopy, recording, or any information storage and retrieval system, without permission in writing from the publisher. Notwithstanding, instructors are permitted to photocopy isolated articles for noncommercial classroom use without fee. (License No.: 10GA0019/12GB0052/13GB0056/17GB0034/18GB0034)
|