IEICE Technical Committee Submission System
Conference Paper Information

Paper Abstract and Keywords
Presentation 2023-02-28 15:55
Self-Supervised Learning With Spatial Audio-Visual Recording for Sound Event Localization and Detection
Yoto Fujita (Kyoto Univ.), Yoshiaki Bando (AIST), Keisuke Imoto (Doshisha Univ./AIST), Masaki Onishi (AIST), Kazuyoshi Yoshii (Kyoto Univ.) EA2022-89 SIP2022-133 SP2022-53
Abstract (in Japanese) (See Japanese page) 
(in English) This paper describes an unsupervised pre-training method for sound event localization and detection (SELD) on multi-channel acoustic signals. The prevailing approach to SELD is to train deep neural networks (DNNs) with supervised learning to estimate the activation, class, and direction of sound events. Creating such training data, however, requires significant effort, which limits improvements in estimation accuracy and generalization performance. To address these issues, this study proposes pre-training DNNs on spatially-informed virtual reality (VR) content that is publicly available on the internet. Using the equirectangular 360-degree images and first-order ambisonics (FOA) signals of such content, a DNN can be pre-trained in an unsupervised manner. In this content, the activation, class, and direction of sound events are assumed to correspond to the temporal changes, appearance, and positions of the sound sources in the images. A latent feature space that captures event class and direction in both audio and images can thus be obtained by contrastive learning, which draws the audio embedding for each direction and the local visual embedding for the corresponding direction close together when they come from the same content and direction (positive pair) and pushes them apart otherwise (negative pair). A SELD network was then constructed by attaching an output layer to the pre-trained audio feature extractor and fine-tuned with a small amount of labeled training data. The effectiveness of the proposed method was evaluated by pre-training on 100 hours of spatially-informed video and acoustic signals and transferring the audio feature extractor to the STARSS22 SELD dataset.
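The direction-wise contrastive objective described above is, in essence, an InfoNCE-style loss between paired audio and visual embeddings. The following is a minimal, hypothetical PyTorch sketch of such a loss; the function name, tensor shapes, and temperature value are illustrative assumptions, not the authors' implementation.

    import torch
    import torch.nn.functional as F

    def direction_wise_contrastive_loss(audio_emb, visual_emb, temperature=0.1):
        # audio_emb, visual_emb: (N, D) tensors, N = batch * n_directions.
        # Row i of each tensor is assumed to come from the same content and
        # the same direction (positive pair); all other pairings are negatives.
        a = F.normalize(audio_emb, dim=-1)    # unit-norm audio embeddings
        v = F.normalize(visual_emb, dim=-1)   # unit-norm visual embeddings
        logits = a @ v.t() / temperature      # pairwise cosine similarities
        targets = torch.arange(a.size(0), device=a.device)
        # Symmetric InfoNCE: each audio row should match its visual column,
        # and each visual row should match its audio column.
        return 0.5 * (F.cross_entropy(logits, targets)
                      + F.cross_entropy(logits.t(), targets))

In this sketch, each row of audio_emb would come from the FOA-based audio encoder for one direction, and the matching row of visual_emb from the local visual encoder for the corresponding direction of the equirectangular image.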
Keyword (in Japanese) (See Japanese page) 
(in English) Sound event localization and detection / Audio-visual learning / Self-supervised learning / Contrastive learning
Reference Info. IEICE Tech. Rep., vol. 122, no. 389, SP2022-53, pp. 78-82, Feb. 2023.
Paper # SP2022-53 
Date of Issue 2023-02-21 (EA, SIP, SP) 
ISSN Online edition: ISSN 2432-6380
Copyright and reproduction
All rights are reserved and no part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopy, recording, or any information storage and retrieval system, without permission in writing from the publisher. Notwithstanding, instructors are permitted to photocopy isolated articles for noncommercial classroom use without fee. (License No.: 10GA0019/12GB0052/13GB0056/17GB0034/18GB0034)
Download PDF EA2022-89 SIP2022-133 SP2022-53

Conference Information
Committee SP IPSJ-SLP EA SIP  
Conference Date 2023-02-28 - 2023-03-01 
Place (in Japanese) (See Japanese page) 
Place (in English)  
Topics (in Japanese) (See Japanese page) 
Topics (in English)  
Paper Information
Registration To SP 
Conference Code 2023-02-SP-SLP-EA-SIP 
Language Japanese 
Title (in Japanese) (See Japanese page) 
Sub Title (in Japanese) (See Japanese page) 
Title (in English) Self-Supervised Learning With Spatial Audio-Visual Recording for Sound Event Localization and Detection 
Sub Title (in English)  
Keyword(1) Sound event localization and detection  
Keyword(2) Audio-visual learning  
Keyword(3) Self-supervised learning  
Keyword(4) Contrastive learning  
1st Author's Name Yoto Fujita  
1st Author's Affiliation Kyoto University (Kyoto Univ.)
2nd Author's Name Yoshiaki Bando  
2nd Author's Affiliation National Institute of Advanced Industrial Science and Technology (AIST)
3rd Author's Name Keisuke Imoto  
3rd Author's Affiliation Doshisha University/National Institute of Advanced Industrial Science and Technology (Doshisha Univ./AIST)
4th Author's Name Masaki Onishi
4th Author's Affiliation National Institute of Advanced Industrial Science and Technology (AIST)
5th Author's Name Kazuyoshi Yoshii
5th Author's Affiliation Kyoto University (Kyoto Univ.)
Speaker Author-1 
Date Time 2023-02-28 15:55:00 
Presentation Time 20 minutes 
Registration for SP 
Paper # EA2022-89, SIP2022-133, SP2022-53 
Volume (vol) vol.122 
Number (no) no.387(EA), no.388(SIP), no.389(SP) 
Page pp.78-82 
#Pages 5
Date of Issue 2023-02-21 (EA, SIP, SP) 




The Institute of Electronics, Information and Communication Engineers (IEICE), Japan