##################################################################### jcode.pl: Perl library for Japanese character code conversion Copyright (c) 1995-2000 Kazumasa Utashiro Internet Initiative Japan Inc. 3-13 Kanda Nishiki-cho, Chiyoda-ku, Tokyo 101-0054, Japan Copyright (c) 1992,1993,1994 Kazumasa Utashiro Software Research Associates, Inc. Use and redistribution for ANY PURPOSE are granted as long as all copyright notices are retained. Redistribution with modification is allowed provided that you make your modified version obviously distinguishable from the original one. THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES ARE DISCLAIMED. Original version was developed under the name of srekcah@sra.co.jp February 1992 and it was called kconv.pl at the beginning. This address was a pen name for group of individuals and it is no longer valid. The latest version is available here: ftp://ftp.iij.ad.jp/pub/IIJ/dist/utashiro/perl/ ;; $rcsid = q$Id: jcode.pl,v 2.13 2000/09/29 16:10:05 utashiro Exp $; ##################################################################### PERL4 INTERFACE: &jcode'getcode(*line) Return 'jis', 'sjis', 'euc' or undef according to Japanese character code in $line. Return 'binary' if the data has non-character code. When evaluated in array context, it returns a list contains two items. First value is the number of characters which matched to the expected code, and second value is the code name. It is useful if and only if the number is not 0 and the code is undef; that case means it couldn't tell 'euc' or 'sjis' because the evaluation score was exactly same. This interface is too tricky, though. Code detection between euc and sjis is very difficult or sometimes impossible or even lead to wrong result when it includes JIS X0201 KANA characters. So JIS X0201 KANA is ignored for automatic code detection. &jcode'convert(*line, $ocode [, $icode [, $option]]) Convert the contents of $line to the specified Japanese code given in the second argument $ocode. $ocode can be any of "jis", "sjis" or "euc", or use "noconv" when you don't want the code conversion. Input code is recognized automatically from the line itself when $icode is not supplied (JIS X0201 KANA is ignored in code detection. See the above descripton of &getcode). $icode also can be specified, but xxx2yyy routine is more efficient when both codes are known. It returns the code of input string in scalar context, and a list of pointer of convert subroutine and the input code in array context. Japanese character code JIS X0201, X0208, X0212 and ASCII code are supported. X0212 characters can not be represented in SJIS and they will be replased by "geta" character when converted to SJIS. See next paragraph for $option parameter. &jcode'xxx2yyy(*line [, $option]) Convert the Japanese code from xxx to yyy. String xxx and yyy are any convination from "jis", "euc" or "sjis". They return *approximate* number of converted bytes. So return value 0 means the line was not converted at all. Optional parameter $option is used to specify optional conversion method. String "z" is for JIS X0201 KANA to X0208 KANA, and "h" is for reverse. $jcode'convf{'xxx', 'yyy'} The value of this associative array is pointer to the subroutine jcode'xxx2yyy(). &jcode'to($ocode, $line [, $icode [, $option]]) &jcode'jis($line [, $icode [, $option]]) &jcode'euc($line [, $icode [, $option]]) &jcode'sjis($line [, $icode [, $option]]) These functions are prepared for easy use of call/return-by-value interface. You can use these funcitons in s///e operation or any other place for convenience. &jcode'jis_inout($in, $out) Set or inquire JIS start and end sequences. Default is "ESC-$-B" and "ESC-(-B". If you supplied only one character, "ESC-$" or "ESC-(" is prepended for each character respectively. Acutually "ESC-(-B" is not a sequence to end JIS code but a sequence to start ASCII code set. So `in' and `out' are somewhat misleading. &jcode'get_inout($string) Get JIS start and end sequences from $string. &jcode'cache() &jcode'nocache() &jcode'flush() Usually, converted character is cached in memory to avoid same calculations have to be done many times. To disable this caching, call &jcode'nocache(). It can be revived by &jcode'cache() and cache is flushed by calling &jcode'flush(). &cache() and &nocache() functions return previous caching state. --------------------------------------------------------------- &jcode'h2z_xxx(*line) JIS X0201 KANA (so-called Hankaku-KANA) to X0208 KANA (Zenkaku-KANA) code conversion routine. String xxx is any of "jis", "sjis" and "euc". From the difficulty of recognizing code set from 1-byte KATAKANA string, automatic code recognition is not supported. &jcode'z2h_xxx(*line) X0208 to X0201 KANA code conversion routine. String xxx is any of "jis", "sjis" and "euc". $jcode'z2hf{'xxx'} $jcode'h2zf{'xxx'} These are pointer to the corresponding function just as $jcode'convf. --------------------------------------------------------------- &jcode'tr(*line, $from, $to [, $option]) &jcode'tr emulates tr operator for 2 byte code. Only 'd' is interpreted as an option. Range operator like `A-Z' for 2 byte code is partially supported. Code must be JIS or EUC, and first byte have to be same on first and last character. CAUTION: Handling range operator is a kind of trick and it is not perfect. So if you need to transfer `-' character, please be sure to put it at the beginning or the end of $from and $to strings. &jcode'trans($line, $from, $to [, $option) Same as &jcode'tr but accept string and return string after translation. --------------------------------------------------------------- &jcode'init() Initialize the variables used in this package. You don't have to call this when using jocde.pl by `do' or `require' interface. Call it first if you embedded the jcode.pl at the end of your script. ##################################################################### PERL5 INTERFACE: Current jcode.pl is written in Perl 4 but it is possible to use from Perl 5 using `references'. Fully perl5 capable version is future issue. Since lexical variable is not a subject of typeglob, *string style call doesn't work if the variable is declared as `my'. Same thing happens to special variable $_ if the perl is compiled to use thread capability. So using reference is generally recommented to avoid the mysterious error. jcode::getcode(\$line) jcode::convert(\$line, $ocode [, $icode [, $option]]) jcode::xxx2yyy(\$line [, $option]) &{$jcode::convf{'xxx', 'yyy'}}(\$line) jcode::to($ocode, $line [, $icode [, $option]]) jcode::jis($line [, $icode [, $option]]) jcode::euc($line [, $icode [, $option]]) jcode::sjis($line [, $icode [, $option]]) jcode::jis_inout($in, $out) jcode::get_inout($string) jcode::cache() jcode::nocache() jcode::flush() jcode::h2z_xxx(\$line) jcode::z2h_xxx(\$line) &{$jcode::z2hf{'xxx'}}(\$line) &{$jcode::h2zf{'xxx'}}(\$line) jcode::tr(\$line, $from, $to [, $option]) jcode::trans($line, $from, $to [, $option) jcode::init() ##################################################################### SAMPLES Convert any Kanji code to JIS and print each line with code name. while (defined($s = <>)) { $code = &jcode'convert(*s, 'jis'); print $code, "\t", $s; } Convert all lines to JIS according to the first recognized line. while (defined($s = <>)) { print, next unless $s =~ /[\033\200-\377]/; (*f, $icode) = &jcode'convert(*s, 'jis'); print; defined(&f) || next; while (<>) { &f(*s); print; } last; } The safest way of JIS conversion. while (defined($s = <>)) { ($matched, $icode) = &jcode'getcode(*s); if (@buf == 0 && $matched == 0) { print $s; next; } push(@buf, $s); next unless $icode; while (defined($s = shift(@buf))) { &jcode'convert(*s, 'jis', $icode); print $s; } while (defined($s = <>)) { &jcode'convert(*s, 'jis', $icode); print $s; } last; } print @buf if @buf;