How It Works
This document describes the algorithm used by Apache-RUS
to determine the encoding in which a document is sent to the client. In a sense,
it repeats the description of the configuration directives,
but here the directives appear in the order in which the server applies them, and their detailed
explanation is omitted.
The document describes the latest versions
(PL20 and PL21). The configuration of older versions (PL16 and earlier), including directive names,
is different and is described in a separate document.
Points where the behavior of old and new versions differs substantially are noted explicitly.
Hereafter, the terms charset and encoding are used as synonyms.
Preliminary Notes
The main purpose of the recoding module is to perform correct conversion from the
"on-disk charset" (the storage encoding) to the "client's charset"
(the transfer encoding) when the document is given to the client and to perform the reverse
conversion when information is received from the clients (submitted forms, etc.).
All possible ways of such a conversion should be described in the server configuration
by the directives CharsetDecl
(the existence of the code table is declared to the server) and
CharsetRecodeTable
(conversion from one encoding to another is described). All encodings and recodings available
should be described only in the configuration of the server/virtual server.
Declaring CharsetDecl and CharsetRecodeTable in
.htaccess/<Directory> is forbidden for an obvious reason: the server would have to
reinitialize the recoding tables each time the directory is accessed, which would mean
a lot of superfluous work. All other Charset... directives may be
specified wherever desired.
The storage encoding (the one used for storing files on disk) should be specified (maybe
separately for each directory) by the CharsetSourceEnc
directive, which describes all files in a directory, or by the
CharsetByExtension directive. The latter has
a higher priority.
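The precedence between the two directives can be sketched as follows. This is a minimal Python illustration: storage_charset and its parameters are hypothetical, and matching on the final filename extension is an assumption about how CharsetByExtension compares names.

```python
import os


def storage_charset(path, by_extension, source_enc=None):
    """Return the on-disk charset for a file, or None if nothing applies."""
    extension = os.path.splitext(path)[1].lower()
    if extension in by_extension:
        # A per-extension rule has the higher priority.
        return by_extension[extension]
    # Otherwise fall back to the directory-wide setting.
    return source_enc


by_extension = {".koi": "koi8-r"}
news = storage_charset("/docs/news.koi", by_extension, "windows-1251")
index = storage_charset("/docs/index.html", by_extension, "windows-1251")
```

Here news resolves to the per-extension charset and index to the directory-wide one.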
Determination of the Client's Encoding by Accept-Charset:/Accept
If the HTTP headers include the Accept-Charset: SomeCharset or Accept: text/x-cyrillic-SomeCharset
header and at least one of the requested charsets is known to the server (that is, described in the
CharsetDecl or
CharsetAlias directive), the server will send the document
in accordance with the charset requested. If the server knows several charsets among the requested ones,
the one with the highest priority will be selected. If several charsets in the request share the
highest priority, the one listed first in the
CharsetPriority directive will be chosen.
If this directive is absent, the choice among these highest-priority charsets is undefined.
If the Accept-Charset (Accept) header specifies only charsets that are unknown to the server and does not
include the wildcard (*), the server behavior depends on the
CharsetErrReject flag. If this flag is set to On, the client
will receive an error message; if it is set to Off, the server will try to determine the client's charset
using other parameters.
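The selection rules above can be sketched in Python. This is a hedged illustration, not the module's code: all names are hypothetical, "priority" in the request is assumed to mean the q-value of each Accept-Charset item, and a request that names only unknown charsets plus the wildcard is assumed to fall through to the other methods.

```python
class CharsetRejected(Exception):
    """Stands in for the error message sent when rejection is enabled."""


def parse_accept_charset(header):
    """Return a list of (charset, q) pairs in header order."""
    items = []
    for part in header.split(","):
        fields = [f.strip() for f in part.split(";")]
        name = fields[0].lower()
        if not name:
            continue
        q = 1.0  # default quality when no q= parameter is given
        for field in fields[1:]:
            if field.lower().startswith("q="):
                try:
                    q = float(field[2:])
                except ValueError:
                    q = 0.0
        items.append((name, q))
    return items


def select_by_accept_charset(header, known, priority, err_reject=False):
    """known maps names and aliases to canonical charset names."""
    requested = parse_accept_charset(header)
    candidates = [(known[n], q) for n, q in requested if n in known and q > 0]
    if not candidates:
        has_wildcard = any(n == "*" for n, _ in requested)
        if requested and not has_wildcard and err_reject:
            raise CharsetRejected(header)
        return None  # let the other methods decide
    best_q = max(q for _, q in candidates)
    best = [name for name, q in candidates if q == best_q]
    for name in priority:  # tie-break by the configured order
        if name in best:
            return name
    return best[0]  # no configured order: the choice is arbitrary


known = {"koi8-r": "koi8-r", "koi8": "koi8-r", "windows-1251": "windows-1251"}
priority = ["windows-1251", "koi8-r"]
tie = select_by_accept_charset("koi8;q=0.8, windows-1251;q=0.8, utf-8",
                               known, priority)
```

In the example the two known charsets tie on quality, so the order in the priority list decides.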
Determination of the client's encoding from the Accept-Charset header cannot
be disabled completely, because that would violate the HTTP standard. However, there are some
particular cases in which Accept-Charset should be ignored. For example,
Netscape Communicator 4.x in its default configuration sends the
"Accept-Charset: iso-8859-1,*,utf-8" header; accordingly, if you have declared the
charset iso-8859-1, a user with NC 4.x will always receive iso-8859-1. To
disable determination of the encoding by Accept-Charset in such specific cases, you may use
the CharsetBrokenAccept directive.
Determination of the Client's Encoding by Other Parameters
If the Accept-Charset/Accept headers are absent from the request or the server cannot select the encoding
according to them, it will try to determine the user's encoding by the following four parameters:
- By the port number (supported since ver. PL20.2). If the TCP port addressed coincides with one of
those described by the CharsetByPort directive,
the encoding specified by this directive will be selected.
- By the server hostname. If the hostname of the server (virtual server) starts with a charset name
(CharsetDecl) or alias
(CharsetAlias), this encoding will be selected as the client's one.
- By the URL prefix. If the URL starts with /charset-name/path/to/file.html or
/~user/charset-name/path/to/file.html, this charset will be selected.
- By the type of the user's software (the HTTP header User-Agent).
If the User-Agent header contains a substring described in the
CharsetAgent directive, the corresponding
encoding will be chosen. If there are several matching substrings, the longest one is used.
If there are several matching substrings of equal length, the choice among them is
undefined.
The order in which these methods work is specified by the
CharsetSelectionOrder directive
(in versions prior to PL16, you could only "reverse" the order of
DirPrefix/UserAgent and cancel charset determination by the hostname prefix or
directory prefix).
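The four methods and their configurable order can be sketched as a chain that stops at the first method to yield a charset. This is a minimal Python illustration under stated assumptions: every name is hypothetical, the order labels are not the directive's real keywords, and matching uses the default (non-strict) prefix comparison.

```python
def match_name_prefix(text, names):
    """Return the charset whose name or alias is a prefix of text, if any."""
    text = text.lower()
    # Try longer names first so that "koi8-r" is checked before "koi".
    for name in sorted(names, key=len, reverse=True):
        if text.startswith(name):
            return names[name]
    return None


def by_port(req, cfg):
    return cfg["by_port"].get(req["port"])


def by_hostname(req, cfg):
    return match_name_prefix(req["host"], cfg["names"])


def by_dir_prefix(req, cfg):
    segments = [s for s in req["path"].split("/") if s]
    if segments and segments[0].startswith("~"):
        segments = segments[1:]  # skip the /~user/ part
    if not segments:
        return None
    return match_name_prefix(segments[0], cfg["names"])


def by_user_agent(req, cfg):
    agent = req.get("user_agent", "")
    matches = [sub for sub in cfg["agents"] if sub in agent]
    if not matches:
        return None
    return cfg["agents"][max(matches, key=len)]  # longest substring wins


METHODS = {"port": by_port, "hostname": by_hostname,
           "dirprefix": by_dir_prefix, "useragent": by_user_agent}


def select_charset(req, cfg):
    """Apply the methods in the configured order; first result wins."""
    for label in cfg["order"]:
        charset = METHODS[label](req, cfg)
        if charset is not None:
            return charset
    return None


cfg = {
    "names": {"koi8-r": "koi8-r", "koi": "koi8-r",
              "windows-1251": "windows-1251", "win": "windows-1251"},
    "by_port": {8101: "koi8-r"},
    "agents": {"MSIE": "windows-1251", "Lynx": "koi8-r"},
    "order": ["port", "hostname", "dirprefix", "useragent"],
}
req = {"port": 80, "host": "win.example.org",
       "path": "/koi/index.html", "user_agent": "Lynx/2.8"}
```

With this sample request the hostname and the directory prefix disagree, so the result depends on which of the two comes first in the order list.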
The required degree of matching between the server hostname/filename prefix and the
name/alias of some encoding may be controlled by the
CharsetStrictURIMatch directive.
In the Off mode (the default one), the server selects the encoding by the hostname/directory
if the beginning of the server/directory name matches the name/alias of some charset.
In the On mode, the checking is more rigorous: the charset name should coincide with the full
name of the server or its host part (for selection by the hostname) and, accordingly, with the
full name of the directory (for selection by the directory name).
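The difference between the two modes can be sketched as a pair of predicates. This Python illustration uses hypothetical function names, and it assumes that the "host part" of a server name is its first dot-separated label.

```python
def host_matches(hostname, charset_name, strict=False):
    """Check a server hostname against a charset name or alias."""
    hostname = hostname.lower()
    name = charset_name.lower()
    if not strict:
        # Off mode: a matching beginning is enough.
        return hostname.startswith(name)
    # On mode: the full name or its host part must coincide.
    host_part = hostname.split(".", 1)[0]
    return name == hostname or name == host_part


def dir_matches(directory, charset_name, strict=False):
    """Check a directory name against a charset name or alias."""
    directory = directory.lower()
    name = charset_name.lower()
    if not strict:
        return directory.startswith(name)
    return directory == name


loose = host_matches("koi8-r-mirror.example.org", "koi8-r")
strict = host_matches("koi8-r-mirror.example.org", "koi8-r", strict=True)
```

The sample hostname matches in the Off mode but not in the On mode, because its host part is longer than the charset name.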
If the Server Failed
If the server failed to determine the client's encoding, the document will be given to the client
in the encoding determined by the
CharsetDefault directive.
If CharsetDefault is not specified, the charset mentioned as the first one in the
CharsetPriority directive will be used.
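The fallback rule amounts to a two-step lookup, sketched here in Python. The function name and parameters are hypothetical, and returning None when neither directive is set is an assumption, since the text does not cover that case.

```python
def fallback_charset(charset_default, charset_priority):
    """Pick the charset to use when no other method produced one."""
    if charset_default:
        return charset_default
    if charset_priority:
        # The first charset listed in the priority order.
        return charset_priority[0]
    return None  # neither directive is set


chosen = fallback_charset(None, ["koi8-r", "windows-1251"])
```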
SSI
Since all directives (except for CharsetDecl and CharsetRecodeTable)
may be specified wherever desired, documents may be stored in any mixture of encodings.
Some complications may be caused by Server Side Includes. The rule is simple: a file (even one included
via SSI) follows the rules of the directory in which it is physically located.
The HTTP header Content-Type: text/html; charset=...
The server provides the substring "; charset=CharsetName" in the Content-Type: header
depending on the
CharsetMatchLanguage directive.
If it is set to On, charset=... is provided if the following three conditions are simultaneously
satisfied:
- The client's browser is not a Bad Agent.
- The MultiViews option (support of multilingual representation) is set to On.
- The document language described by the AddLanguage directive is the same
as the language of the charset, which is described by the
CharsetDecl directive.
If the CharsetMatchLanguage option is set to Off, then charset=... is provided for
all documents.
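The decision described in this section can be sketched as a single function. This is a minimal Python illustration; the function and all parameter names are hypothetical and only restate the conditions listed above.

```python
def content_type(mime, charset, *, match_language, bad_agent,
                 multiviews, doc_language, charset_language):
    """Build the Content-Type value, with or without the charset part."""
    if match_language:
        # On: all three conditions must hold at the same time.
        add_charset = (not bad_agent
                       and multiviews
                       and doc_language == charset_language)
    else:
        # Off: the charset is provided for all documents.
        add_charset = True
    if add_charset:
        return f"{mime}; charset={charset}"
    return mime


header = content_type("text/html", "koi8-r", match_language=True,
                      bad_agent=False, multiviews=True,
                      doc_language="ru", charset_language="ru")
```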