Paper Abstract and Keywords |
Presentation |
2004-10-19 13:30
Information Extraction from Web Pages Using a Tree Edit Distance Measure
Tetsuji Kuboyama (Univ. of Tokyo), Tetsuhiro Miyahara (Hiroshima City Univ.) |
Abstract |
(in Japanese) |
(See Japanese page) |
(in English) |
Recent research on extracting information from Web pages has mainly focused on semi-automatic and automatic approaches to generating Web wrappers. This paper aims to establish a structure-based approach to generating Web wrappers: it finds a common structured pattern in semistructured data, such as HTML and XML documents, through approximate tree matching based on a tree edit distance measure. The common structured pattern is generated by finding similarities among the parse trees of Web pages and merging these trees through tree alignment. Each node of the pattern tree is weighted according to its frequency of occurrence in the tree. We present a method for generating Web wrappers from manually edited Web pages that contain a number of HTML grammatical mistakes as well as redundant or missing fragments. |
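The abstract does not state which edit-distance algorithm or cost model the paper uses. As a rough illustration only, the sketch below computes an ordered tree edit distance with unit costs for node insertion, deletion, and relabeling, using the standard memoized recurrence on forests. The names `node`, `forest_distance`, and `tree_edit_distance`, the tuple representation of parse trees, and the unit costs are all assumptions made for this example, not the authors' method.

```python
from functools import lru_cache

# Hypothetical representation: a tree is (label, children),
# where children is a tuple of trees. Tuples are hashable,
# so forests can be used directly as memoization keys.
def node(label, *children):
    return (label, tuple(children))

@lru_cache(maxsize=None)
def forest_distance(f, g):
    """Unit-cost edit distance between two ordered forests (tuples of trees)."""
    if not f and not g:
        return 0
    if not f:
        # Only insertions remain: insert the rightmost root of g.
        _, kids = g[-1]
        return forest_distance(f, g[:-1] + kids) + 1
    if not g:
        # Only deletions remain: delete the rightmost root of f.
        _, kids = f[-1]
        return forest_distance(f[:-1] + kids, g) + 1
    (lv, kv), (lw, kw) = f[-1], g[-1]
    # Delete the rightmost root of f; its children move up into the forest.
    delete = forest_distance(f[:-1] + kv, g) + 1
    # Insert the rightmost root of g.
    insert = forest_distance(f, g[:-1] + kw) + 1
    # Match the two rightmost roots (relabel if labels differ), then
    # compare their child forests and the remaining forests separately.
    match = (forest_distance(kv, kw)
             + forest_distance(f[:-1], g[:-1])
             + (1 if lv != lw else 0))
    return min(delete, insert, match)

def tree_edit_distance(t1, t2):
    return forest_distance((t1,), (t2,))

# Two toy "parse trees" that differ by one extra <b> element.
page_a = node("ul", node("li", node("a")), node("li", node("a")))
page_b = node("ul", node("li", node("a")), node("li", node("b", node("a"))))
print(tree_edit_distance(page_a, page_b))  # → 1
```

A small distance between two parse trees suggests that the pages share a template, which is the kind of similarity the abstract relies on before merging trees by alignment. This memoized recursion is meant for small illustrative trees; it is not tuned for large documents.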
Keyword |
(in Japanese) |
(See Japanese page) |
(in English) |
semi-structured data / information extraction / tree edit distance |
Reference Info. |
IEICE Tech. Rep., vol. 104, no. 345, DE2004-117, pp. 19-24, Oct. 2004. |
Paper # |
DE2004-117 |
Date of Issue |
2004-10-12 (DE, DC) |
ISSN |
Print edition: ISSN 0913-5685 |
Conference Information |
Committee |
DE DC |
Conference Date |
2004-10-18 - 2004-10-19 |
Place (in Japanese) |
(See Japanese page) |
Place (in English) |
Tokyo Institute of Technology |
Topics (in Japanese) |
(See Japanese page) |
Topics (in English) |
Data Engineering, Dependability, etc. |
Paper Information |
Registration To |
DE |
Conference Code |
2004-10-DE-DC |
Language |
English (Japanese title is available) |
Title (in Japanese) |
(See Japanese page) |
Sub Title (in Japanese) |
(See Japanese page) |
Title (in English) |
Information Extraction from Web Pages Using a Tree Edit Distance Measure |
Keyword(1) |
semi-structured data |
Keyword(2) |
information extraction |
Keyword(3) |
tree edit distance |
1st Author's Name |
Tetsuji Kuboyama |
1st Author's Affiliation |
The University of Tokyo (Univ. of Tokyo) |
2nd Author's Name |
Tetsuhiro Miyahara |
2nd Author's Affiliation |
Hiroshima City University (Hiroshima City Univ.) |
Speaker |
Author-1 |
Date Time |
2004-10-19 13:30:00 |
Presentation Time |
30 minutes |
Registration for |
DE |
Paper # |
DE2004-117, DC2004-32 |
Volume (vol) |
vol.104 |
Number (no) |
no.345(DE), no.347(DC) |
Page |
pp.19-24 |
#Pages |
6 |