<Japanese version of this document>

Setup of w3m with Multiple character Encoding Extension


Start up flow

The original w3m (denoted by w3m below, and assume that the executable file has the same name) reads its configuration options from

in this order. Later specifications override previous ones.

The w3m with the multiple character encoding extension (denoted by w3mmee below; the executable file is assumed to have the same name) needs to know the range of encoding schemes considered for automatic detection, the encodings your terminal accepts, how encodings and character sets are converted, the messages localized for your language, and so on. Hence its startup flow is somewhat complicated.

First, w3mmee examines the value of the environment variable ``W3MLANG'' (or ``LANG'' if ``W3MLANG'' is unset). It converts the letters in the value to lower case and interprets the value as having the form:

<language code>+"_"(under score)+<country code>+"."(period)+<encoding>
For instance, if ``W3MLANG'' has value ``ja_JP.UTF-8'', w3mmee will get

From these components, w3mmee composes file names:

  1. $LIB_DIR/w3mconfig
  2. $LIB_DIR/w3mconfig.<language code>
  3. $LIB_DIR/w3mconfig_<country code>
  4. $LIB_DIR/w3mconfig.<encoding>
  5. $LIB_DIR/w3mconfig.<language code>_<country code>
  6. $LIB_DIR/w3mconfig.<language code>.<encoding>
  7. $LIB_DIR/w3mconfig_<country code>.<encoding>
  8. $LIB_DIR/w3mconfig.<language code>_<country code>.<encoding>
and reads configuration options from these files in this order.
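
The composition of these file names can be sketched as follows. This is only an illustration in Python, not w3mmee's actual code: the function name candidate_files and the directory used in the usage note are hypothetical, and the sketch assumes that all three components are present in the value (how w3mmee treats a missing component is not described here).

```python
def candidate_files(lang_value, lib_dir, base="w3mconfig"):
    """List the eight candidate file names in the order they are read."""
    value = lang_value.lower()  # the value is lowercased first
    locale, _, encoding = value.partition(".")
    language, _, country = locale.partition("_")
    suffixes = [
        "",
        f".{language}",
        f"_{country}",
        f".{encoding}",
        f".{language}_{country}",
        f".{language}.{encoding}",
        f"_{country}.{encoding}",
        f".{language}_{country}.{encoding}",
    ]
    return [f"{lib_dir}/{base}{suffix}" for suffix in suffixes]
```

For instance, candidate_files("ja_JP.UTF-8", "/usr/lib/w3mmee") ends with "/usr/lib/w3mmee/w3mconfig.ja_jp.utf-8". Passing base="w3mmessages" gives the corresponding list of message files.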

Next, it reads the expositions (descriptions) of the options displayed in the option setup panel from the files:

  1. $LIB_DIR/w3mmessages
  2. $LIB_DIR/w3mmessages.<language code>
  3. $LIB_DIR/w3mmessages_<country code>
  4. $LIB_DIR/w3mmessages.<encoding>
  5. $LIB_DIR/w3mmessages.<language code>_<country code>
  6. $LIB_DIR/w3mmessages.<language code>.<encoding>
  7. $LIB_DIR/w3mmessages_<country code>.<encoding>
  8. $LIB_DIR/w3mmessages.<language code>_<country code>.<encoding>
in this order. Only the lines of the form:
<option name>+"="(equal sign)+<exposition>
are recognized as definitions of expositions. Spaces at the beginning and end of lines, and before and after equal signs, are removed.
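
A minimal sketch of this line format, in Python for illustration only. The function name parse_message_line is hypothetical, and the sketch assumes that the first equal sign on a line is the separator. Note that str.strip() removes all surrounding whitespace, which is slightly more than the spaces mentioned above.

```python
def parse_message_line(line):
    """Return (option name, exposition), or None if the line is not a definition."""
    if "=" not in line:
        return None  # lines without an equal sign are not recognized
    name, _, exposition = line.partition("=")  # split at the first "="
    return name.strip(), exposition.strip()
```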

Finally,

$HOME/.w3mmee/config

the per-user configuration file, in the same format as $LIB_DIR/w3mconfig*, and

$HOME/.w3mmee/messages

the per-user message setup file, in the same format as $LIB_DIR/w3mmessages*,

are read and evaluated in the same manner as $LIB_DIR/w3mconfig* and $LIB_DIR/w3mmessages*, respectively.


Mapping between locale codeset names and MIME charset names

This section applies only if you configured w3mmee to use gettext().

When the return value of the gettext() function contains non-US-ASCII characters, those characters must be converted to the internal encoding. gettext() determines the encoding of its output from the codeset name of the current locale, while w3mmee uses MIME charset names. Unfortunately, the codeset name and the MIME charset name for the same encoding scheme generally differ, so w3mmee needs a mapping table between them.

Although such a table is already built into w3mmee, it may well be insufficient in your environment. In that case you can give w3mmee additional correspondences in the files

  1. $LIB_DIR/locale2mime
  2. $HOME/.w3mmee/locale2mime
each line of which must be of the form
<MIME charset name>+"="(equal sign)+<lang. spec>[+","(comma)+...]
where you may add optional spaces around "=" and ",". <lang. spec> must be a string of the form
<language code>+"_"+<country code>+"."+<codeset name>
where any (but not all) of <language code>, "_"+<country code>, or "."+<codeset name> may be omitted.
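
The line format above can be sketched as follows. This Python fragment is an illustration only: the names LANG_SPEC and parse_locale2mime_line are hypothetical, the codeset names in the usage note are made-up examples, and the way malformed lines are rejected is an assumption rather than w3mmee's documented behavior.

```python
import re

# <language code>, "_"+<country code>, and "."+<codeset name> are each optional.
LANG_SPEC = re.compile(r"^([^_.]+)?(?:_([^.]+))?(?:\.(.+))?$")

def parse_locale2mime_line(line):
    """Return (MIME charset, [(language, country, codeset), ...])."""
    charset, sep, specs = line.partition("=")
    if not sep:
        raise ValueError("missing '='")
    parsed = []
    for spec in specs.split(","):
        match = LANG_SPEC.match(spec.strip())  # optional spaces around "," and "="
        if match is None or not any(match.groups()):
            raise ValueError(f"bad lang. spec: {spec!r}")  # all parts omitted
        parsed.append(match.groups())
    return charset.strip(), parsed
```

For instance, the line "EUC-JP = ja_JP.eucJP, .ujis" would yield the charset "EUC-JP" with the specs ("ja", "JP", "eucJP") and (None, None, "ujis").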


New options concerning character encoding

The following is a list of the new configuration options concerning character encoding that are added by the multiple character encoding extension.

mylang <string>

Specifies your language. Currently, the value of this option is used only to restrict the range of encoding schemes considered for autodetection.

For example, assume that you have specified

mylang cjk
and try to read a document with no charset specification. Then w3mmee tries to find the encoding scheme among

You can also specify a comma-separated list of names of character encoding schemes. In this case, those encoding schemes are used as the candidates for autodetection.

mylang_charset <string>

Specifies the encoding scheme assumed for a document whose encoding scheme w3mmee fails to autodetect.

tty_charset <string>

Specifies the encoding scheme of terminal I/O.

tty_initial_charset <string>

Using this option is deprecated. Please use tty_initial_input_charset and tty_initial_output_charset instead.

tty_initial_input_charset <string>

When an ISO 2022 conforming encoding scheme is specified with tty_charset, the initial state of the intermediate buffers of that encoding for the input stream from the tty can be changed to that of the encoding scheme specified with this option.

tty_initial_output_charset <string>

When an ISO 2022 conforming encoding scheme is specified with tty_charset, the initial state of the intermediate buffers of that encoding for the output stream to the tty can be changed to that of the encoding scheme specified with this option.

tty_input_converters <string>

Specifies the conversions of encoding scheme and character set applied to terminal input.

Please use this option only if you completely understand the behavior of the support library used by the multiple character encoding extension.

tty_output_converters <string>

Specifies the conversions of encoding scheme and character set applied to terminal output.

Please use this option only if you completely understand the behavior of the support library used by the multiple character encoding extension.

tty_fallback_converters <string>

If the terminal cannot display a character and no replacement string is specified for that character, the conversions specified by this option are applied to it.

Please use this option only if you completely understand the behavior of the support library used by the multiple character encoding extension.

input_charset <string>

Specifies the encoding scheme of a document which contains no charset specification, and makes w3mmee skip autodetection of the encoding scheme.

input_converters <string>

Specifies the conversion of encoding scheme and character set applied to characters input from the network or a local file.

Please use this option only if you completely understand the behavior of the support library used by the multiple character encoding extension.

output_charset <string>

When a document contains no charset specification and w3mmee fails to autodetect the encoding scheme of the document, w3mmee assumes that the encoding scheme of the document is the one specified by this option.

If the document contains a form requiring text input, the arguments are passed to the action of the form after conversion to that encoding. Currently this is the only case affected by this option.

output_converters <string>

Specifies the conversion of encoding scheme and character set applied to characters output to the network or a local file.

Please use this option only if you completely understand the behavior of the support library used by the multiple character encoding extension.

process_charset <string>

Specifies the encoding schemes for strings which may be passed to a local process, such as arguments for the bookmark registration program.

<string> must be a space-separated list of charset specifications of the following form:

<sep1>+<regular expression for process name>+<sep2>+<charset>
or
<charset>
Each space-separated token is treated as the first form if its first character is non-alphanumeric. Otherwise it is regarded as the second form. In the first form, if <sep1> is "(", "{", "[", "<", or "^", <sep2> must be ")", "}", "]", ">", or "$", respectively. Otherwise <sep2> must equal <sep1>. <sep1> and <sep2> are treated as part of the regular expression only if they are "^" and "$", respectively. The second form is an abbreviation of
"^.*$"+<charset>

Given a process name, the regular expressions are matched against the name in order. The charset corresponding to the first expression that matches is adopted.
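
The token syntax and the lookup can be sketched as follows. This is a rough Python illustration under several assumptions, not w3mmee's implementation: the function names are hypothetical, Python's re module stands in for whatever regular expression library w3mmee uses, the last occurrence of <sep2> in a token is taken as the end of the expression, and an unanchored search is used for matching.

```python
import re

CLOSERS = {"(": ")", "{": "}", "[": "]", "<": ">", "^": "$"}

def parse_process_charset(value):
    """Turn the option value into an ordered list of (regex, charset) rules."""
    rules = []
    for token in value.split():
        if token[0].isalnum():                 # second form: bare <charset>
            rules.append((re.compile("^.*$"), token))
            continue
        sep1 = token[0]
        sep2 = CLOSERS.get(sep1, sep1)         # otherwise <sep2> equals <sep1>
        end = token.rfind(sep2)                # assumption: last <sep2> closes the regex
        if end <= 0:
            raise ValueError(f"missing closing separator in {token!r}")
        pattern = token[1:end]
        if sep1 == "^":                        # only "^" and "$" belong to the regex
            pattern = "^" + pattern + "$"
        rules.append((re.compile(pattern), token[end + 1:]))
    return rules

def charset_for_process(name, rules):
    """Return the charset of the first rule whose expression matches the name."""
    for regex, charset in rules:
        if regex.search(name):
            return charset
    return None
```

With the illustrative value "^w3mbookmark$euc-jp /helper/iso-2022-jp utf-8", the process name "w3mbookmark" gets "euc-jp", a name containing "helper" gets "iso-2022-jp", and anything else falls through to "utf-8".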

tty_character_conversion <character range> <replacement string>

Specifies characters which your terminal can't handle. Instead of any character in the range, w3mmee outputs to the terminal the first matching item in the list:

  1. the character itself if no string specification or if the string is "NULL" (without quotes, case sensitive),
  2. the string specified by this option unless it is "REJECT" (without quotes, case sensitive),
  3. the string specified by the option tty_character_replacement,
  4. the character "?" (question mark).

If options of this type appear more than once and one range includes another, the more specific one is adopted. If the ranges merely overlap, only the overlapping part is overwritten by the later specification.
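
The four-step choice above can be read as the following Python sketch. It is an interpretation for illustration only: the function and parameter names are hypothetical, and the character is assumed to already lie in the range of some tty_character_conversion option.

```python
def terminal_output(char, replacement=None, default_replacement=None):
    """Pick what is written to the terminal for a character in the range.

    replacement: the <replacement string> of tty_character_conversion,
                 or None when no string was given.
    default_replacement: the value of tty_character_replacement, if set.
    """
    if replacement is None or replacement == "NULL":
        return char                    # 1. the character itself
    if replacement != "REJECT":
        return replacement             # 2. the string given with this option
    if default_replacement is not None:
        return default_replacement     # 3. tty_character_replacement
    return "?"                         # 4. question mark
```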

tty_character_replacement <string>

Specifies the default replacement string for characters which your terminal can't handle.

view_buf <string>

Specifies a format string for the messages that represent documents in buffers when mouse support is disabled (including the case where mouse support was disabled at configure time).

view_buf_with_mouse <string>

Specifies a format string for the messages that represent documents in buffers when mouse support is enabled.

omitted <string>

Specifies the replacement string used when the middle part of a long URI is omitted.

ul_marks <string>

Specifies a comma-separated list of strings that lead the items of a <ul> construct.

ul_type_disc <string>

Specifies the string that leads the items of a <ul> whose type attribute is "disc".

ul_type_circle <string>

Specifies the string that leads the items of a <ul> whose type attribute is "circle".

ul_type_square <string>

Specifies the string that leads the items of a <ul> whose type attribute is "square".

small_img_alt <string>

Specifies the replacement string for small images.

hr_rule <string>

Specifies the string used to draw <hr>.

menu_frame <string>

Specifies a comma-separated list of menu frame components, starting with the left-top corner and proceeding left to right, top to bottom.

rule <string>

Specifies a comma separated list of table borders in the order:

  1. center,
  2. left edge,
  3. top,
  4. left-top corner,
  5. right edge,
  6. vertical bar,
  7. right-top corner,
  8. bottom,
  9. left-bottom corner,
  10. horizontal bar,
  11. right-bottom corner.
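
To make the ordering concrete, the following Python sketch splits such a list into named components. The names RULE_COMPONENTS and parse_rule are hypothetical, the ASCII value in the usage note is only an example, and requiring exactly eleven items is an assumption of this sketch.

```python
RULE_COMPONENTS = [
    "center", "left edge", "top", "left-top corner", "right edge",
    "vertical bar", "right-top corner", "bottom", "left-bottom corner",
    "horizontal bar", "right-bottom corner",
]

def parse_rule(value):
    """Map each border component name to its string, in the documented order."""
    parts = value.split(",")
    if len(parts) != len(RULE_COMPONENTS):
        raise ValueError("expected 11 comma-separated components")
    return dict(zip(RULE_COMPONENTS, parts))
```

For instance, parse_rule("+,|,-,+,|,|,+,-,+,-,+") assigns "|" to the vertical bar and "-" to the horizontal bar.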

rule_bold <string>

Specifies a comma separated list of table bold face borders in the order:

  1. center,
  2. left edge,
  3. top,
  4. left-top corner,
  5. right edge,
  6. vertical bar,
  7. right-top corner,
  8. bottom,
  9. left-bottom corner,
  10. horizontal bar,
  11. right-bottom corner.

message_about_config_save <string>

The option setup panel has an additional item for choosing whether the new setup will be saved to $HOME/.w3mmee/config. This option specifies the exposition of that item.

charset_cname <string>

Specifies the canonical name for non-standard charset names, in the form

<canonical name>+"="(equal sign)+<comma separated list of charset names>
No space is allowed around the equal sign or the commas. Charset names are case insensitive.

For example, to treat a page containing charset specification ``charset=SHIFT-JIS'' as if its charset is ``Shift_JIS'', please add the line

charset_cname shift_jis=shift-jis
to your config file.

If there are two options of this type defining the same canonical name, the latter overrides the former.
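
A small Python sketch of how such definitions could be combined into a lookup table. It is illustrative only: the function names are hypothetical, and names are folded to lower case here simply because charset names are case insensitive.

```python
def build_alias_table(option_values):
    """Build {non-standard name: canonical name} from charset_cname values."""
    by_canonical = {}
    for value in option_values:
        canonical, _, names = value.partition("=")
        # A later definition of the same canonical name replaces the earlier one.
        by_canonical[canonical.lower()] = [n.lower() for n in names.split(",") if n]
    return {name: canonical
            for canonical, names in by_canonical.items()
            for name in names}

def canonical_charset(name, table):
    """Return the canonical name, or the (lowercased) name itself if unknown."""
    key = name.lower()
    return table.get(key, key)
```

With the single value "shift_jis=shift-jis", canonical_charset("SHIFT-JIS", table) gives "shift_jis".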

unicode_width <string>

Specifies the name of a character width table. Recognized names are as follows (names are case insensitive).

xterm
The same as that in xterm-147. Newer versions of xterm may have a different one.
EastAsianWidth_AmbiguousToNarrow, eaw_a2n
Conforms to UAX #11; characters marked as ``Ambiguous'' are assigned width 1.
EastAsianWidth_AmbiguousToWide, eaw_a2w
Conforms to UAX #11; characters marked as ``Ambiguous'' are assigned width 2.


New miscellaneous options

The following is a list of new configuration options not concerning character encoding. They are listed in this document because the original w3m does not recognize them, for various reasons (my patch was rejected, or out of laziness I have not yet ported the related code to the original w3m).

accept_encoding <encoding name> <media type> <argv[0]> <path to command>

Binds the value <encoding name> of the HTTP header field "content-encoding", the MIME type <media type>, and a filter program that decodes contents encoded with the method identified by <encoding name>. For this option to work, you also need to bind <media type> to a file name extension by adding a line

<media type> <the extension>
to the file $HOME/.mime.types.

If options of this type appear more than once with the same encoding name, the last specification is adopted.

language_extension <string>

Specifies a comma separated list of file extensions which stand for content languages.

If a file has multiple extensions, the extensions listed in this option are skipped when w3mmee determines the content type of the file.
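
One way to picture the skipping is the Python sketch below. It is an interpretation, not w3mmee's code: the function name is hypothetical, extensions are assumed to be examined from right to left, and the list items are accepted with or without a leading dot because the exact spelling expected by the option is not stated here.

```python
import os

def content_type_extension(path, language_extension):
    """Return the extension used for content type detection, or None."""
    skipped = {item.strip().lstrip(".") for item in language_extension.split(",")}
    extensions = os.path.basename(path).split(".")[1:]
    for ext in reversed(extensions):   # rightmost extension first
        if ext not in skipped:
            return ext
    return None
```

For instance, with the list "ja,en" the file index.html.ja is typed by its "html" extension.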

search_across_lines <boolean value>

Specifies whether regular expression search across multiple lines is enabled or not.

concurrent <number>

Specifies the maximum number of processes used to load documents.

concurrent_per_server <number>

Specifies the maximum number of processes used to load documents from each server.

follow_redirection <number>

Specify how many redirections should be followed.

request_header <string>

Specifies an optional HTTP request header to be added. The headers

Host, Pragma, Cache-Control, Content-Length

are always assigned values generated by w3mmee, and your specifications for them are ignored. The headers

User-Agent, Accept, Accept-Encoding, Accept-Language

are assigned values generated by w3mmee unless you explicitly specify them. The headers

Content-Type, Referer

are assigned the values you specify only if there is no other appropriate value. The headers

Cookie, Cookie2

are assigned the values you specify only if cookie support in w3mmee is disabled by a compile option, by a command line option, or by a configuration option. Otherwise w3mmee decides their values.

If options of this type appear more than once with the same header name, the last specification is adopted.
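
The precedence rules above can be summarized by the following Python sketch. It is illustrative only: the function and parameter names are hypothetical, None stands for "no value available", and passing any other header through unchanged is an assumption based on the option adding optional headers.

```python
ALWAYS_GENERATED = {"host", "pragma", "cache-control", "content-length"}
USER_OVERRIDES = {"user-agent", "accept", "accept-encoding", "accept-language"}
USER_AS_FALLBACK = {"content-type", "referer"}
COOKIE_HEADERS = {"cookie", "cookie2"}

def effective_header(name, user_value, generated_value, cookies_enabled=True):
    """Choose between the request_header value and the value w3mmee generates."""
    key = name.lower()  # HTTP header names are case insensitive
    if key in ALWAYS_GENERATED:
        return generated_value
    if key in USER_OVERRIDES:
        return user_value if user_value is not None else generated_value
    if key in USER_AS_FALLBACK:
        return generated_value if generated_value is not None else user_value
    if key in COOKIE_HEADERS:
        return generated_value if cookies_enabled else user_value
    return user_value  # any other header is simply added
```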

http_version <string>

Specifies the HTTP version used for each request. Acceptable values are "1.1" and "1.0" (without the double quotation marks). Any other value is silently ignored, and the version is set to "1.1".

anchor_num_style <string>

Specifies the style for referring to anchors in the formatted dump of a document. It is passed to the sprintf function together with the number (starting with 1) in the list of all links in the document. So it must contain one and only one sprintf conversion specification "%d".

img_num_style <string>

Specifies the style for referring to images in the formatted dump of a document. It is passed to the sprintf function together with the number (starting with 1) in the list of all links in the document. So it must contain one and only one sprintf conversion specification "%d".

label_withinpage_style <string>

Specifies the style of the optional line number and column information for links to labels within the same document, in the formatted dump of a document. It is passed to the sprintf function together with the line number and the column (both starting with 1). So it must contain exactly two sprintf conversion specifications "%d".
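
The requirement on the "%d" conversions can be illustrated with Python's % operator, which follows sprintf conventions for "%d". The function name format_reference and the style strings in the usage note are hypothetical examples, and rejecting every other conversion is an assumption of this sketch.

```python
def format_reference(style, *numbers):
    """Format numbers with a sprintf-like style, checking the "%d" count."""
    bare = style.replace("%%", "")  # "%%" is a literal percent sign
    if bare.count("%") != len(numbers) or bare.count("%d") != len(numbers):
        raise ValueError(f'style needs exactly {len(numbers)} "%d" conversion(s)')
    return style % numbers
```

For instance, format_reference("[%d]", 3) gives "[3]" for an anchor style, and format_reference("(line %d, col %d)", 12, 5) gives "(line 12, col 5)" for a label style.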

link_num_url <string>

When link references are generated in the formatted output of a document, <string> is used as the URL of the document.

scroll_amount <number>

When a cursor movement command moves the cursor outside the current view, the view scrolls by <number> lines or columns.

mailcap_entry <string>

Specifies a mailcap entry of maximal priority, which is intended for changing an external viewer temporarily.

Options of this type can appear more than once.

browsecap_entry <string>

Specifies a browsecap entry of maximal priority, which is intended for changing an external browser temporarily.

Options of this type can appear more than once.

wrap_line <boolean>

Specify whether to wrap a line wider than screen width or not.

line_truncated <character>

Specify the indicator of truncated lines.

line_continued <character>

Specify the indicator of continued lines.

preload_image <boolean>

Specifies whether inline images are loaded before they are actually displayed.

img_valign <position>

Specifies the default vertical alignment of inline images. <position> must be one of D (stands for "default"), T (stands for "top"), M (stands for "middle"), or B (stands for "bottom"). D is almost the same as B, but differs somewhat for small images.

table_valign <position>

Specifies the default vertical alignment in tables. <position> must be one of T (stands for "top"), M (stands for "middle"), or B (stands for "bottom").

when_redirected <behaviour>

Specifies the behaviour when an HTTP request with a method other than GET or HEAD is redirected with HTTP response code 301 or 302. <behaviour> must be one of

0
always follows the redirection with the original request method,
1
always follows the redirection with the GET method,
2
always ignores the redirection,
3
asks at run time.
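
The four values can be read as the following Python sketch. It is an illustration only: the function name and the ask callback are hypothetical, the callback is assumed to return 0, 1, or 2, and requests outside the scope of this option are assumed to be followed unchanged.

```python
def method_after_redirect(method, status, when_redirected, ask=None):
    """Return the method for the redirected request, or None to ignore it."""
    if method in ("GET", "HEAD") or status not in (301, 302):
        return method                            # outside the scope of this option
    if when_redirected == 3:
        when_redirected = ask(method, status)    # hypothetical run-time query
    if when_redirected == 0:
        return method                            # keep the original method
    if when_redirected == 1:
        return "GET"
    if when_redirected == 2:
        return None                              # ignore the redirection
    raise ValueError("when_redirected must be 0, 1, 2, or 3")
```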

frame_color <color>

Specify color of frame borders.

auto_pixel_per_char <boolean>

Specifies whether or not the number of pixels per character can be auto-detected.

auto_pixel_per_line <boolean>

Specifies whether or not the number of pixels per line can be auto-detected.

try_extensions <string>

Specifies a comma-separated list of file extensions. When w3mmee fails to open a local file, it appends each of the extensions to the name of the file and retries with the new name.

You can specify "*" (asterisk without quotes) as an item in the list, which is expanded to the comma separated list of all the file extensions bound to content encoding methods (".Z,.bz2,.gz" by default, see accept_encoding option).
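
The retry list, including the "*" expansion, can be sketched as follows in Python. The function and parameter names are hypothetical, and the default for encoding_extensions simply repeats the default list quoted above.

```python
def candidate_names(path, try_extensions, encoding_extensions=(".Z", ".bz2", ".gz")):
    """List the file names tried, in order, after opening path has failed."""
    names = []
    for item in try_extensions.split(","):
        # "*" stands for all extensions bound to content encoding methods.
        extensions = encoding_extensions if item == "*" else (item,)
        names.extend(path + ext for ext in extensions)
    return names
```

For instance, candidate_names("doc.html", ".ja,*") yields doc.html.ja, doc.html.Z, doc.html.bz2, and doc.html.gz, in that order.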

edit_remote_source <boolean>

Specify whether or not you want to edit cached sources of remote pages.

remove_traling_spaces <boolean>

Specify whether or not trailing spaces of each formatted line should be removed.


Enhancement of string expansion in mailcap entry

w3mmee recognizes the following additional %-escapes when expanding strings in mailcap entries.

%h

The host part of URL.

%p

The port part of URL.

%u

The whole URL.

%{<test>?<yes>:<no>}

First, %<test> is tested to see whether it expands to something. Please notice that "%" is prepended to <test>. If it expands to anything, including the empty string, <yes> is processed. Otherwise <no> is processed. If <yes> is omitted, it is treated as if <test> were copied to that place. If <no> is omitted and the expansion of <test> fails, the whole escape is replaced with the empty string.


browsecap -- External browser capability file

w3mmee includes a mechanism to automatically determine the external browser invoked on a URL, based on the scheme part of the URL. The bindings between external browsers and schemes are given by "browsecap" files. w3mmee tries to scan two files

  1. $LIB_DIR/browsecap
  2. $HOME/.w3mmee/browsecap
and builds a binding table in the same manner as for "mailcap" files.

The file format is also the same as that of "mailcap" files. The only exception is that the first field of each entry must be of the form

<scheme>+"/"(slash)+<method>
where the currently supported values of <method> are "post", "get", and "download". The <method> part may be "*" (asterisk), which is treated as a usual wildcard. If the <method> part is "post", the arguments which would be passed to a CGI program are passed to the matched external browser on its standard input.

If the relevant URL contains a query string and the query string includes a component of the form <word>=<value>, an escape sequence of the form %{<word>} expands to <value>. In addition, the escape sequence %? expands to the whole query string (excluding the leading question mark).
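
The two query-related escapes can be illustrated with the Python sketch below. It handles only %{<word>} and %?, and it rests on assumptions not stated in this document: the function name is hypothetical, components are taken to be separated by "&", values are not percent-decoded, and an escape naming an unknown <word> is left untouched.

```python
import re
from urllib.parse import urlsplit

def expand_query_escapes(template, url):
    """Expand %{<word>} and %? in template using the query string of url."""
    query = urlsplit(url).query          # text after the first "?"
    values = {}
    for component in query.split("&"):
        word, sep, value = component.partition("=")
        if sep:
            values[word] = value

    def substitute(match):
        if match.group(0) == "%?":
            return query
        return values.get(match.group(1), match.group(0))

    return re.sub(r"%\?|%\{(\w+)\}", substitute, template)
```

For instance, with the URL http://example.org/cgi?id=42&lang=ja, the template "viewer --id=%{id} --raw=%?" expands to "viewer --id=42 --raw=id=42&lang=ja".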

The browsecap facility is also used to determine the editor used to edit the source file of a buffer, the formatted image of a buffer, the value of an input control of text type of a form element, or the contents of a textarea control of a form element. An entry is adopted for this purpose if its first field matches "x-w3m-edit/buffer", "x-w3m-edit/screen", "x-w3m-edit/inputtext", or "x-w3m-edit/textarea", respectively.

The parser of mailcap and browsecap entries in w3mmee recognizes the new flags "x-w3m-internal", "x-w3m-cgioutput", "x-w3m-match=<regexp>", and "x-w3m-nc-match=<regexp>".

If the flag "x-w3m-internal" is set in an entry, the entry is restricted to internal use, such as determining the editor process described above. I recommend setting this flag in entries for such editors.

If the flag "x-w3m-cgioutput" is set, the program determined by the entry is treated as if it were a CGI program; that is, various environment variables are set before calling the program, and the lines before the first empty line in the output of the program are parsed as an HTTP response header.

The flags "x-w3m-match=<regexp>" and "x-w3m-nc-match=<regexp>" are recognized only in browsecap. They are mutually exclusive, and if both are set for one entry, the latter one is adopted. If one of them is set, <regexp> is matched against the whole URL (case-insensitively for "x-w3m-nc-match=<regexp>"), and the entry is adopted only when the match succeeds. When "test=..." is also set, the results are ANDed to determine whether or not to adopt the entry.


Character ranges

The first argument of tty_accept_character or of tty_reject_character must be of the following form. For Unicode characters,

"U+"+<hexadecimal notation of Unicode>
or
"U+"+<hexadecimal notation of Unicode of starting character in the range>+ "-"+<hexadecimal notation of Unicode of ending character in the range>
For non-Unicode characters,
"I+"+<internal representation of character>
or
"I+"+<internal representation of starting character in the range>+ "-"+<internal representation of ending character in the range>

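As an illustration of the Unicode forms, the following Python sketch parses a single character or a range. The function name parse_unicode_range is hypothetical, and the "I+" forms are deliberately not handled.

```python
def parse_unicode_range(spec):
    """Return (first, last) code points for "U+XXXX" or "U+XXXX-YYYY"."""
    if not spec.startswith("U+"):
        raise ValueError("only the Unicode form is handled in this sketch")
    start, sep, end = spec[2:].partition("-")
    first = int(start, 16)
    last = int(end, 16) if sep else first   # a single character is a one-element range
    return first, last
```

For instance, parse_unicode_range("U+3000-303F") gives (0x3000, 0x303F).
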
The ``internal representation'' of a non-Unicode character is computed as follows. First, determine an integer S according to the ISO 2022 classification of the character set:

Then, for a 94, 96, or 94x94 set, let F be the final octet of the designating sequence in the ISO 2022 encoding. For a 94 set which needs the additional intermediate octet 2/1 in its designating sequence, further add 0x40 to F. For a non-ISO 2022 character set, the support library assigns each character set an integer that identifies the set. We adopt that integer as F.

Finally, order all the code points representable in the character set, and number them with C, starting with 0, in that order.

The hexadecimal notations of S, F, and C, joined with ``+'' (plus sign), compose the ``internal representation''.

F and C are optional, and their default values are