Paper Abstract and Keywords |
Presentation |
2006-07-28 09:30
Auto-Detection of Encoding in Legacy Documents of Indic Script
-- Non-Standardized Legacy Encodings in Khmer Fonts -- Toshiya Suzuki (Hiroshima Univ.), Dai Sato (Tohoku Univ.) |
Abstract |
(in Japanese) |
(See Japanese page) |
(in English) |
Although Unicode text layout systems have been introduced into modern text-processing software, legacy character encodings are still widely used for
Indic scripts in South and Southeast Asia so that text can be handled on systems that lack intelligent text layout functionality.
Some scripts have de jure or de facto legacy standards, but there are scripts whose encodings were not standardized before ISO 10646. If a document uses fonts with unstandardized encodings, extracting text data from the document is usually difficult.
In this paper, we take a concrete example of such a script: the Khmer script. It is supported by the Unicode standard but not fully supported by applications, and no legacy encodings were standardized for it.
We investigate the encodings in freely distributed Khmer TrueType fonts and propose an algorithm to identify which encoding a font uses. With this algorithm, text extraction from documents that use legacy TrueType fonts can be automated. |
Keyword |
(in Japanese) |
(See Japanese page) |
(in English) |
Indic script / Khmer script / TrueType / font / legacy encoding / Unicode / character encoding / auto detection |
Reference Info. |
IEICE Tech. Rep., vol. 106, no. 195, OIS2006-10, pp. 1-8, July 2006. |
Paper # |
OIS2006-10 |
Date of Issue |
2006-07-21 (OIS) |
ISSN |
Print edition: ISSN 0913-5685 |
Conference Information |
Committee |
LOIS IPSJ-DC |
Conference Date |
2006-07-28 - 2006-07-28 |
Place (in Japanese) |
(See Japanese page) |
Place (in English) |
Faculty of Engineering, Yamagata University |
Topics (in Japanese) |
(See Japanese page) |
Paper Information |
Registration To |
IPSJ-DC |
Conference Code |
2006-07-OIS-IPSJ-DD |
Language |
Japanese |
Title (in Japanese) |
(See Japanese page) |
Sub Title (in Japanese) |
(See Japanese page) |
Title (in English) |
Auto-Detection of Encoding in Legacy Documents of Indic Script |
Sub Title (in English) |
Non-Standardized Legacy Encodings in Khmer Fonts |
Keyword(1) |
Indic script |
Keyword(2) |
Khmer script |
Keyword(3) |
TrueType |
Keyword(4) |
font |
Keyword(5) |
legacy encoding |
Keyword(6) |
Unicode |
Keyword(7) |
character encoding |
Keyword(8) |
auto detection |
1st Author's Name |
Toshiya Suzuki |
1st Author's Affiliation |
Hiroshima University (Hiroshima Univ.) |
2nd Author's Name |
Dai Sato |
2nd Author's Affiliation |
Tohoku University (Tohoku Univ.) |
Speaker |
Author-1 |
Date Time |
2006-07-28 09:30:00 |
Presentation Time |
25 minutes |
Registration for |
IPSJ-DC |
Paper # |
OIS2006-10 |
Volume (vol) |
vol.106 |
Number (no) |
no.195 |
Page |
pp.1-8 |
#Pages |
8 |