My things in UTF8

01 Jun 2005


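I am switching the locale of my Linux system from zh_CN.GB2312 to zh_CN.UTF-8, and I hope everything goes smoothly.

My system is Debian/Linux, sid, kernel 2.6.

Basic configuration

My bash settings are unchanged. .xsession is a symlink to .xinitrc, and .xinitrc contains the following: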

# .xinitrc
source $HOME/.bash_profile

LANG=zh_CN.UTF-8
LC_MESSAGES=en_US
LC_TIME=en_US
export LANG LC_MESSAGES LC_TIME

export GDK_USE_XFT=1
export GTK_IM_MODULE='scim'
export XMODIFIERS="@im=SCIM"
scim -d

xscreensaver -no-splash &
esd &

icewmbg &
icewmtray &
exec icewm

File names

File names can be converted with convmv:

convmv -f gb2312 -t utf8 -r --notest *

-r recurses into all subdirectories. Without --notest, convmv only reports what would need converting and does not actually rename anything.
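To illustrate what such a rename involves, here is a minimal Python 3 sketch. It is not how convmv is implemented; the names `plan_renames` and `convert_dir` are hypothetical, and the "already valid UTF-8 means leave it alone" check is a simplifying assumption that can misjudge rare byte sequences.

```python
import os


def plan_renames(names, from_enc="gb2312", to_enc="utf-8"):
    """Return (old, new) byte-name pairs for the names that need converting."""
    plan = []
    for name in names:
        try:
            name.decode(to_enc)  # already valid in the target encoding
            continue
        except UnicodeDecodeError:
            pass
        try:
            new = name.decode(from_enc).encode(to_enc)
        except UnicodeDecodeError:
            continue  # not valid in the source encoding either: skip it
        plan.append((name, new))
    return plan


def convert_dir(top, dry_run=True):
    """Walk `top` with byte paths; only rename when dry_run is False."""
    # topdown=False handles the children before their parent directory.
    for root, dirs, files in os.walk(os.fsencode(top), topdown=False):
        for old, new in plan_renames(dirs + files):
            src = os.path.join(root, old)
            dst = os.path.join(root, new)
            print(src, "->", dst)
            if not dry_run:
                os.rename(src, dst)
```

Calling `convert_dir(".")` only prints the planned renames, which mirrors the default test mode described above; pass `dry_run=False` to actually rename.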

File contents

You can use iconv, but I prefer Emacs. Open the file in Emacs and type
```
C-x <return> f utf-8 <return> C-x C-s
```
and you are done.
Or use vim (thanks to fishy9@SMTH):
```
:set fenc=utf8
:w
```
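For converting many files in one go, the same re-encoding can be scripted. The following is a minimal Python 3 sketch, not a replacement for iconv; the function name `convert_file` is hypothetical, and it assumes the file really is GB2312 and fits in memory.

```python
def convert_file(path, from_enc="gb2312", to_enc="utf-8"):
    """Re-encode the file at `path` in place from from_enc to to_enc."""
    with open(path, "rb") as f:
        raw = f.read()
    # Decode before touching the file: if the content is not valid
    # from_enc, this raises UnicodeDecodeError and nothing is overwritten.
    text = raw.decode(from_enc)
    with open(path, "wb") as f:
        f.write(text.encode(to_enc))
```

Running it twice on the same file will normally raise an error the second time, because UTF-8 Chinese text is usually not valid GB2312; keep a backup if you are unsure which encoding a file uses.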

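xterm

In theory uxterm, rxvt-unicode-ml, mlterm, and gnome-terminal should all support UTF-8, but uxterm and rxvt always seem to have inexplicable problems, so I use mlterm (I do not like gnome-terminal's paste behavior).

Input method

Emacs

Just add the following two lines: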

(setq current-language-environment "Chinese-GB")
(prefer-coding-system 'utf-8)

Now if you run

C-h v coding-category-list

you will see that coding-category-list is

(coding-category-utf-8
coding-category-iso-8-2 coding-category-big5 coding-category-iso-7-else
coding-category-iso-8-1 coding-category-utf-16-be
coding-category-utf-16-le coding-category-iso-7-tight
coding-category-iso-7 coding-category-iso-8-else
coding-category-emacs-mule coding-category-raw-text
coding-category-sjis coding-category-ccl coding-category-binary)

Here coding-category-iso-8-2 is chinese-iso-8bit, that is, gb2312.

LaTeX

CJK

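cjk-latex supports UTF8, but it needs the cyberb font (which Debian does not provide), so I had to disguise ttf-arphic-gbsn00lp as a substitute.

First install tetex-bin, cjk-latex, and ttf-arphic-gbsn00lp, then run the unisong script.

unisong consists of two files, Makefile and c70song.fd.

Makefile: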

all: build
build:
	ln -sf /usr/share/fonts/truetype/arphic/gbsn00lp.ttf unisong.ttf
	ttf2tfm unisong.ttf -w unisong@Unicode@ > unisong.log
	rm -f unisong.map
	# one map line per generated subfont
	for i in *.tfm; do \
		base=`basename $$i .tfm`; \
		echo "$${base} <$${base}.enc" >> unisong.map; \
	done

install:
	install -d /usr/share/texmf/fonts/tfm/arphic/unisong/
	cp *.tfm /usr/share/texmf/fonts/tfm/arphic/unisong/
	install -d /usr/share/texmf/dvips/arphic/
	cp *.enc /usr/share/texmf/dvips/arphic/
	cp unisong.map /etc/texmf/dvips/
	cp unisong.map /usr/share/texmf/dvips/omega/
	cp unisong.ttf /usr/share/texmf/fonts/truetype/arphic/
	grep '^unisong@Unicode@ unisong.ttf' /etc/ttf2pk/ttfonts.map || echo "unisong@Unicode@ unisong.ttf" >> /etc/ttf2pk/ttfonts.map
	cp -f c70song.fd /usr/share/texmf/tex/latex/CJK/UTF8/
	grep "^map +unisong.map" /etc/texmf/pdftex/pdftex.cfg || echo "map +unisong.map" >> /etc/texmf/pdftex/pdftex.cfg
	mktexlsr
clean:
	rm -f unisong*.tfm
	rm -f unisong*.enc
	rm -f unisong.map
	rm -f unisong.ttf
	rm -f unisong.log
.PHONY: all build install clean

c70song.fd (take the c70song.fd that ships with cjk-latex and change cyberb to unisong):

% This is the file c70song.fd of the CJK package
%   for using Asian logographs (Chinese/Japanese/Korean) with LaTeX2e
%
% created by Werner Lemberg
%
% Version 4.5.1 (17-Jun-2002)

\def\fileversion{4.5.1}
\def\filedate{2002/06/17}
\ProvidesFile{c70song.fd}[\filedate\space\fileversion]


% character set: Unicode U+0080 - U+FFFD
% font encoding: Unicode

\DeclareFontFamily{C70}{song}{\hyphenchar \font\m@ne}

\DeclareFontShape{C70}{song}{m}{n}{<-> CJK * unisong}{}
\DeclareFontShape{C70}{song}{bx}{n}{<-> CJKb * unisong}{\CJKbold}

\endinput

OK, now run make; su; make install and that is all.

Then convert your LaTeX document to UTF8 and change \begin{CJK*}{GB}{song} to \begin{CJK*}{UTF8}{song}.
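Both steps can be done in one pass over a document. The following is a minimal Python 3 sketch under the assumption that the document is GB2312-encoded and uses exactly the \begin{CJK*}{GB}{song} form mentioned above; the function name `gb_tex_to_utf8` is hypothetical.

```python
def gb_tex_to_utf8(raw, from_enc="gb2312"):
    """Take the bytes of a GB2312 .tex file and return UTF-8 bytes
    with the CJK* environment switched from GB to UTF8."""
    text = raw.decode(from_enc)
    # Only the exact form from the text is rewritten; other families
    # or the unstarred CJK environment are left untouched.
    text = text.replace(r"\begin{CJK*}{GB}{song}",
                        r"\begin{CJK*}{UTF8}{song}")
    return text.encode("utf-8")
```

A caller would read the file in binary mode, pass the bytes through `gb_tex_to_utf8`, and write the result back out.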

non-CJK

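If you want to type Å directly in an English document instead of typing \AA, latex-ucs can do that for you.

After installing latex-ucs, add the following to the preamble of your tex file: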

\usepackage{ucs}
\usepackage[utf8]{inputenc}

You can then type Unicode characters directly in the body.

If you instead add the following to the preamble

\usepackage[C10,T1]{fontenc}
\usepackage[cjkgb]{ucs}
\usepackage[utf8]{inputenc}

then typing Chinese works as well, but you lose the other benefits that CJK provides.

FTP

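Many FTP servers use GB2312-encoded file names. Possible solutions:

Use a Mozilla-family browser to access the FTP server; it lets you set the encoding yourself.
Use lftp, and add the following to $HOME/.lftp/rc or /etc/lftp.conf (thanks to yabozj AT zju.edu.cn):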
```
set ftp:charset "gbk"
set file:charset "UTF-8"
```
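The garbling itself is easy to reproduce and undo. The sketch below is in Python 3 and assumes, purely for illustration, a client that decoded a GBK file name as Latin-1; the function name `fix_name` is hypothetical and has nothing to do with lftp's internals.

```python
def fix_name(garbled, wrong_enc="latin-1", real_enc="gbk"):
    """Recover a file name that was decoded with the wrong charset."""
    # Latin-1 maps every byte to exactly one character, so encoding
    # back to Latin-1 restores the original bytes without loss.
    return garbled.encode(wrong_enc).decode(real_enc)


# What a GBK name looks like after being misread as Latin-1.
garbled = "音乐.mp3".encode("gbk").decode("latin-1")
print(fix_name(garbled))
```

The lftp settings above avoid the problem at the source by telling the client which charset the server uses, so no repair step is needed.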

MP3 ID3 tag

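xmms shows a pile of garbled text, apparently because of the ID3 tag encoding, so I had no choice but to write another script. Note that this script depends on pyid3lib, which you can download from http://pyid3lib.sourceforge.net/. On Debian you can install it directly with apt-get install python-id3lib.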

#!/usr/bin/python
# mp3iconv.py: convert ID3 tag text from GB2312 to UTF-8.
# Python 2 script (it relies on str.decode); requires pyid3lib.
import sys
import pyid3lib


def texticonv(text, fcode, tcode):
    # Return (True, converted_text) if text needed converting,
    # otherwise (False, None).
    try:
        text.decode(tcode)  # already valid in the target encoding
    except UnicodeDecodeError:
        try:
            newtext = text.decode(fcode)
        except UnicodeDecodeError:
            return False, None  # not valid in the source encoding either
        return True, newtext.encode(tcode)
    return False, None


def mp3iconv(fname, fcode='gb2312', tcode='utf8'):
    tag = pyid3lib.tag(fname)
    needupdate = False
    for key in ['artist', 'title', 'album']:
        try:
            text = getattr(tag, key)
        except AttributeError:
            continue  # this field is not set in the tag
        r, newtext = texticonv(text, fcode, tcode)
        if r:
            setattr(tag, key, newtext)
            needupdate = True
    if needupdate:
        tag.update()  # write the tag back only if something changed


def main():
    assert len(sys.argv) > 1
    for x in sys.argv[1:]:
        mp3iconv(x)


if __name__ == '__main__':
    main()


LI Daobing's blog
