IEICE Technical Committee Submission System
Conference Paper's Information

Paper Abstract and Keywords
Presentation 2006-07-28 09:30
Auto-Detection of Encoding in Legacy Documents of Indic Script -- Non-Standardized Legacy Encodings in Khmer Fonts --
Toshiya Suzuki (Hiroshima Univ.), Dai Sato (Tohoku Univ.)
Abstract (in Japanese) (See Japanese page) 
(in English) Although Unicode text layout systems have been introduced into modern text processing software, legacy character encodings are still widely used for Indic scripts in South and Southeast Asia in order to work with systems that lack intelligent text layout functionality.
De jure or de facto legacy standards exist for some scripts, but there are scripts whose encodings were not standardized before ISO 10646. If a document uses fonts with non-standardized encodings, extracting text data from the document is usually difficult.
In this paper, we take a concrete example of such a script: the Khmer script. It is supported by the Unicode standard but not fully supported by applications, and no legacy encodings were standardized for it.
We investigate the encodings in freely distributed Khmer TrueType fonts and propose an algorithm to identify which encoding is used in a font. With this algorithm, text extraction from documents that include legacy TrueType fonts can be automated.
Keyword (in Japanese) (See Japanese page) 
(in English) Indic script / Khmer script / TrueType / font / legacy encoding / Unicode / character encoding / auto detection  
Reference Info. IEICE Tech. Rep., vol. 106, pp. 1-8, July 2006.
Paper #  
Date of Issue 2006-07-21 (OIS) 
ISSN Print edition: ISSN 0913-5685

Conference Information
Committee LOIS IPSJ-DC  
Conference Date 2006-07-28 - 2006-07-28 
Place (in Japanese) (See Japanese page) 
Place (in English) Faculty of Engineering, Yamagata University 
Topics (in Japanese) (See Japanese page) 
Topics (in English)  
Paper Information
Registration To IPSJ-DC 
Conference Code 2006-07-OIS-IPSJ-DD 
Language Japanese 
Title (in Japanese) (See Japanese page) 
Sub Title (in Japanese) (See Japanese page) 
Title (in English) Auto-Detection of Encoding in Legacy Documents of Indic Script 
Sub Title (in English) Non-Standardized Legacy Encodings in Khmer Fonts 
Keyword(1) Indic script  
Keyword(2) Khmer script  
Keyword(3) TrueType  
Keyword(4) font  
Keyword(5) legacy encoding  
Keyword(6) Unicode  
Keyword(7) character encoding  
Keyword(8) auto detection  
1st Author's Name Toshiya Suzuki  
1st Author's Affiliation Hiroshima University (Hiroshima Univ.)
2nd Author's Name Dai Sato  
2nd Author's Affiliation Tohoku University (Tohoku Univ.)
Speaker Author-1 
Date Time 2006-07-28 09:30:00 
Presentation Time 25 minutes 
Registration for IPSJ-DC 
Paper # OIS2006-10 
Volume (vol) vol.106 
Number (no) no.195 
Page pp.1-8 
#Pages
Date of Issue 2006-07-21 (OIS) 


The Institute of Electronics, Information and Communication Engineers (IEICE), Japan