Paper Abstract and Keywords |
Presentation |
2023-03-14 16:55
Automatic Dataset Collection and Formatting Techniques for Machine Learning Systems
Takako Kawaguchi, Toshiyuki Kurabayashi, Haruto Tanno (former NTT)
SS2022-57 |
Abstract |
(in Japanese) |
(See Japanese page) |
(in English) |
In recent years, as machine learning models have grown larger, the amount of data required for training has grown with them. As a result, more and more users collect training data from the abundant information resources on the web. However, web pages come in a variety of screen configurations, and the location of the data varies greatly from one configuration to another. Consequently, extracting similar information from multiple web pages with different screen configurations at once requires setting the extraction location and providing examples for each screen configuration. In this paper, we propose a technique that extracts similar data regardless of differences in the screen configurations of web pages. With the proposed model, users can retrieve the desired data from multiple web pages with different screen configurations simply by indicating the data to be extracted and a few example web pages on which that data resides. This is possible because the model takes into account the similarity of the text strings extracted in the input-output examples and the partial locations where the data exists, which limits the impact of differences in the screen configuration of the web page as a whole. We also conducted evaluation experiments on the proposed model and showed that, given the example web page, it can extract the targets from multiple web pages with different screen configurations. |
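The idea summarized in the abstract, matching on text similarity plus the partial (local) location of the data instead of the whole-page layout, can be illustrated with a small sketch. This is not the authors' model, which the abstract does not specify. It is a simplified stand-in under stated assumptions: the "partial location" is approximated by the last two tags enclosing a text node, and "text similarity" by comparing coarse character shapes with `difflib`. The names `TextNodeCollector`, `text_shape`, `learn`, `extract`, the threshold of 0.8, and the two sample pages are all hypothetical.

```python
import re
from difflib import SequenceMatcher
from html.parser import HTMLParser

VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class TextNodeCollector(HTMLParser):
    """Collect (tag path, text) pairs for every non-empty text node."""

    def __init__(self):
        super().__init__()
        self.stack = []  # open tags from the root down to the current node
        self.nodes = []  # (tuple of enclosing tags, stripped text)

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; tolerates unclosed tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.nodes.append((tuple(self.stack), text))


def collect(html):
    parser = TextNodeCollector()
    parser.feed(html)
    parser.close()
    return parser.nodes


def text_shape(text):
    """Abstract a string into a coarse shape: digit runs -> 9, letter runs -> A."""
    shape = re.sub(r"\d+", "9", text)
    return re.sub(r"[^\W\d_]+", "A", shape)


def learn(examples, k=2):
    """Learn local tag suffixes and text shapes from (html, target_text) examples."""
    suffixes, shapes = set(), []
    for html, target in examples:
        for path, text in collect(html):
            if text == target:
                suffixes.add(path[-k:])  # only the last k tags: the partial location
                shapes.append(text_shape(text))
    return suffixes, shapes


def extract(html, model, k=2, threshold=0.8):
    """Return text nodes whose local location and text shape resemble the examples."""
    suffixes, shapes = model
    hits = []
    for path, text in collect(html):
        if path[-k:] not in suffixes:
            continue
        shape = text_shape(text)
        score = max((SequenceMatcher(None, shape, s).ratio() for s in shapes), default=0.0)
        if score >= threshold:
            hits.append(text)
    return hits


# Two hypothetical pages with different overall layouts (div-based vs. table-based).
page_a = "<html><body><div><h1>Blue Mug</h1><p>Price: <b>$12.50</b></p></div></body></html>"
page_b = ("<html><body><table><tr><td>Red Lamp</td><td><p>Now <b>$8.00</b> only</p></td></tr></table>"
          "<footer><p>Call <b>555-0100</b></p></footer></body></html>")

model = learn([(page_a, "$12.50")])
print(extract(page_b, model))
```

In this sketch the phone number in the footer shares the same local location (`p > b`) as the price, so it is the text-shape comparison that filters it out, while the differing outer structure of the two pages plays no role in the match.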
Keyword |
(in Japanese) |
(See Japanese page) |
(in English) |
Machine Learning Engineering / Machine Learning / Web Scraping / Crawling / Data Collection / Regular Expressions / Data Extraction / Information Extraction |
Reference Info. |
IEICE Tech. Rep., vol. 122, no. 432, SS2022-57, pp. 61-66, March 2023. |
Paper # |
SS2022-57 |
Date of Issue |
2023-03-07 (SS) |
ISSN |
Online edition: ISSN 2432-6380 |
Copyright and reproduction |
All rights are reserved and no part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopy, recording, or any information storage and retrieval system, without permission in writing from the publisher. Notwithstanding, instructors are permitted to photocopy isolated articles for noncommercial classroom use without fee. (License No.: 10GA0019/12GB0052/13GB0056/17GB0034/18GB0034) |
Conference Information |
Committee |
SS |
Conference Date |
2023-03-14 - 2023-03-15 |
Place (in Japanese) |
(See Japanese page) |
Place (in English) |
|
Topics (in Japanese) |
(See Japanese page) |
Topics (in English) |
|
Paper Information |
Registration To |
SS |
Conference Code |
2023-03-SS |
Language |
Japanese |
Title (in Japanese) |
(See Japanese page) |
Sub Title (in Japanese) |
(See Japanese page) |
Title (in English) |
Automatic Dataset Collection and Formatting Techniques for Machine Learning Systems |
Sub Title (in English) |
|
Keyword(1) |
Machine Learning Engineering |
Keyword(2) |
Machine Learning |
Keyword(3) |
Web Scraping |
Keyword(4) |
Crawling |
Keyword(5) |
Data Collection |
Keyword(6) |
Regular Expressions |
Keyword(7) |
Data Extraction |
Keyword(8) |
Information Extraction |
1st Author's Name |
Takako Kawaguchi |
1st Author's Affiliation |
Nippon Telegraph and Telephone Corporation (former NTT) |
2nd Author's Name |
Toshiyuki Kurabayashi |
2nd Author's Affiliation |
Nippon Telegraph and Telephone Corporation (former NTT) |
3rd Author's Name |
Haruto Tanno |
3rd Author's Affiliation |
Nippon Telegraph and Telephone Corporation (former NTT) |
Speaker |
Author-1 |
Date Time |
2023-03-14 16:55:00 |
Presentation Time |
25 minutes |
Registration for |
SS |
Paper # |
SS2022-57 |
Volume (vol) |
vol.122 |
Number (no) |
no.432 |
Page |
pp.61-66 |
#Pages |
6 |
Date of Issue |
2023-03-07 (SS) |
|