The following scenario illustrates a typical client-server interaction using HTML and the web:

Source: Publish Dynamic Applications on the Web by David R. McClanahan, Databased Web Advisor, April 1997
Scenario 1
The Web browser requests an HTML file from the server, and displays the HTML file with entry fields created by the FORM element.
Questions
- If the server has different language versions of the same HTML file, how does the server know which language version to send to the browser?
- How does the Web browser know the code page used to encode the HTML file?
Answers
When the Web browser requests a specific HTML file from the server via the HTTP 1.1 protocol (note that HTTP 1.0 would not work here), the browser can tell the server, during the initial content negotiation phase, which language(s) and code page(s) it can accept.
Example
Client request:
GET /foo.html HTTP/1.1
Accept-Language: zh, en;q=0.5
Accept-Charset: big5, x-euc-tw;q=0.5
Accept: */*
Server response:
HTTP/1.1 200 OK
Content-Type: text/html; charset=big5
Content-Language: zh
Content-Length: 1042
... data ...
The Accept-Language header tells the server which language(s) are acceptable to the browser, and the Accept-Charset header tells the server which code page(s) are acceptable to the browser. In the above example, the Web browser asks the server to send it the HTML file called foo.html in Chinese (although English is acceptable as an alternative). The HTML file should be encoded in the Big5 code page, although the EUC-TW code page is also acceptable as an alternative. If the server cannot satisfy the browser's specified language(s) or code page(s), it should send an error response with the 406 (Not Acceptable) status code, though sending an unacceptable response is also allowed. If no Accept-Language header is present, any language is acceptable. If no Accept-Charset header is present, any code page is acceptable.
Note that zh is used to denote the Chinese language, but does not specify whether it is Simplified Chinese or Traditional Chinese. This means the user is comfortable with any form of Chinese. The user could have set the browser's preferred language to zh-CN (for Simplified Chinese) or zh-TW (for Traditional Chinese).
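The language selection described above can be sketched in Java. This is only an illustration, not the logic of any particular Web server: the class LanguageNegotiation and its methods are hypothetical names, the q-value parsing is simplified, and a language range such as zh is assumed to match any available tag that begins with "zh-".

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; not part of any server API.
class LanguageNegotiation {

    /** Returns the language ranges of an Accept-Language value, highest q first. */
    static List<String> preferences(String header) {
        List<String> values = new ArrayList<String>();
        List<Double> weights = new ArrayList<Double>();
        for (String item : header.split(",")) {
            String[] parts = item.trim().split(";");
            String value = parts[0].trim().toLowerCase();
            double q = 1.0;                       // default weight when no q is given
            for (int i = 1; i < parts.length; i++) {
                String p = parts[i].trim();
                if (p.startsWith("q=")) q = Double.parseDouble(p.substring(2));
            }
            if (value.isEmpty() || q <= 0.0) continue;   // q=0 means "not acceptable"
            // Stable insertion that keeps the list ordered by descending q.
            int pos = values.size();
            while (pos > 0 && weights.get(pos - 1) < q) pos--;
            values.add(pos, value);
            weights.add(pos, q);
        }
        return values;
    }

    /** Picks the first available tag matching the client's preferences, or null. */
    static String choose(String header, List<String> available) {
        for (String range : preferences(header)) {
            for (String tag : available) {
                String t = tag.toLowerCase();
                if (range.equals("*") || t.equals(range) || t.startsWith(range + "-")) {
                    return tag;
                }
            }
        }
        return null;   // the caller could answer 406 or send a default version
    }
}
```

With the header from the example, choose("zh, en;q=0.5", ...) prefers a zh-TW version over an en version when both are available.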
To see Accept-Language in action, set your browser so that English is your first preference. (In Netscape Navigator v7, you can do this via the Edit-->Preferences-->Navigator-->Languages interface.) Then clear the browser's memory and disk caches, and go to http://www.alis.com. You'll see their home page in English. Now set your browser so that French is your first preference, clear the caches again, and reload the Alis home page. You'll now see their home page in French!
In the first case, Navigator is sending to www.alis.com the HTTP header:
Accept-Language: en,fr
and in the second case, Navigator is sending:
Accept-Language: fr,en
Navigator v7.0 running on Windows XP generates:
Accept-Charset: UTF-8,*
by default, but you can change this default via:
Edit-->Preferences-->Navigator-->Languages-->Default Character Encoding.
Internet Explorer v6.0 running on Windows XP does not generate any Accept-Charset header, even when it is configured to use HTTP 1.1.
The server typically would store different language versions of the same HTML file, but would not store the same HTML file encoded in different code pages multiple times. See HTML Documents Coded Character Sets Guidelines for a list of recommended code pages used to encode the HTML file, depending on the language.
If the Accept-Charset header specifies code page X but the HTML file is encoded in code page Y, the server should convert the HTML file from Y to X on the fly before sending it to the requesting browser.
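A minimal sketch of such an on-the-fly conversion in Java follows. The class name Transcoder is hypothetical, and the example converts from ISO-8859-1 to UTF-8 only because both are guaranteed to exist on every Java platform. The same call would be used for other code pages if the runtime supports them.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; a real server would stream the file rather than hold it in memory.
class Transcoder {

    /** Decodes the bytes from code page "from" and re-encodes them in code page "to". */
    static byte[] transcode(byte[] input, Charset from, Charset to) {
        // Characters that do not exist in the target code page are replaced
        // by that charset's replacement byte (usually '?').
        return new String(input, from).getBytes(to);
    }

    public static void main(String[] args) {
        byte[] latin1 = { (byte) 0xA3 };   // pound sterling sign in ISO-8859-1
        byte[] utf8 = transcode(latin1, StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8);
        for (byte b : utf8) {
            System.out.printf("%02X ", b & 0xFF);   // the two-byte UTF-8 form, C2 A3
        }
        System.out.println();
    }
}
```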
According to the HTML 4 specification, Web browsers (user agents in HTML terminology) must NEVER assume any default encoding code page, and should use the following sources (in decreasing priority order) to determine the encoding code page of the HTML file:
- The browser already knows the code page used to encode the HTML file.
- If the HTTP 1.1 (RFC 2616) protocol is used, the server reports the encoding code page of the HTML file in the charset parameter of the Content-Type header.
Example
Server response:
HTTP/1.1 200 OK
Content-Type: text/html; charset=big5
Content-Language: zh
Content-Length: 1042
... data ...
Here is a direct quote in Section 3.4.1 of RFC 2616:
HTTP/1.1 recipients MUST respect the charset label provided by the sender; and those user agents that have a provision to "guess" a charset MUST use the charset from the content-type field if they support that charset, rather than the recipient's preference, when initially displaying a document.
How do you make the server send out the appropriate charset information? See www.w3.org/International/O-HTTP-charset.html for some answers.
- The content author can specify the charset inside the HTML file via the META element.
<!DOCTYPE HTML PUBLIC
"-//W3C//DTD HTML 3.2 Final//EN">
<HTML>
<HEAD>
<META http-equiv="Content-Type"
content="text/html; charset=big5">
<TITLE>...</TITLE>
:
</HTML>
The META declaration must only be used when the character encoding is organized such that US-ASCII characters stand for themselves at least until the META element is parsed.
Also, if the server needs to convert the HTML file into a different code page prior to sending it to the browser, the server should also update the META element's charset value accordingly.
- The charset attribute on an HTML element that designates an external resource, such as:
<A href="/zh/tw/foo.html" charset="big5">...</A>
- Heuristic algorithms, such as those used to detect the various Japanese encodings.
- User definable. The browser respects whatever the user has selected in the Options: Document Encoding menu.
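The first steps of this priority order can be sketched in Java as follows. The class CharsetSniffer is a hypothetical illustration, not browser code: it covers only the HTTP Content-Type header, the META element, and a user-selected fallback, and it assumes the start of the file has already been read as ASCII-compatible text.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper covering only three of the sources listed above.
class CharsetSniffer {

    private static final Pattern CHARSET = Pattern.compile(
            "charset\\s*=\\s*[\"']?([A-Za-z0-9._:-]+)", Pattern.CASE_INSENSITIVE);

    private static final Pattern META = Pattern.compile(
            "<meta[^>]*http-equiv\\s*=\\s*[\"']?content-type[^>]*>", Pattern.CASE_INSENSITIVE);

    /** Extracts the charset parameter from a Content-Type value, or returns null. */
    static String fromContentType(String contentType) {
        if (contentType == null) return null;
        Matcher m = CHARSET.matcher(contentType);
        return m.find() ? m.group(1) : null;
    }

    /** Applies the priority order: HTTP header, then META element, then user setting. */
    static String detect(String httpContentType, String htmlPrefix, String userSetting) {
        String charset = fromContentType(httpContentType);
        if (charset != null) return charset;

        Matcher meta = META.matcher(htmlPrefix);
        if (meta.find()) {
            charset = fromContentType(meta.group());
            if (charset != null) return charset;
        }
        return userSetting;
    }
}
```

For example, detect("text/html; charset=big5", html, "iso-8859-1") returns "big5" regardless of what the file's META element says, which mirrors the rule that the HTTP header has the higher priority.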
XHTML v1.0 compatible Web browsers issue:
Traditionally, the character encoding of an HTML document is specified either by the Web server via the charset parameter of the HTTP Content-Type header, or via a meta element in the document itself. In an XML document, the character encoding is specified in the XML declaration (e.g., <?xml version="1.0" encoding="gb2312"?>). To present documents with specific character encodings portably, the best approach is to ensure that the Web server provides the correct headers. If this is not possible, an XHTML document that wants to set its character encoding explicitly must include both an XML declaration with an encoding declaration and a meta http-equiv statement (e.g., <meta http-equiv="Content-type" content="text/html; charset=gb2312" />). In XHTML-conforming user agents, the value of the encoding declaration in the XML declaration takes precedence.
For dynamic forms generated by the server (such as from a CGI script or Server-Side Include function), the server can tell the browser what code page(s) it can accept via the Accept-Charset attribute of the FORM element.
Example
<FORM Accept-Charset="big5, x-euc-tw" Type= ...>
In the above example, the server tells the browser that it accepts data encoded in either Big5 or EUC-TW (one or the other for a given submission, not a mixture). The browser should then ensure that all user data it sends to the server is encoded in the Big5 or EUC-TW code page.
Note that each HTML file is limited to a single code page, so make sure the code page used supports all the characters in the file. Unicode (UTF-8) is a good choice if you have multilingual data in the HTML file. Even if the HTML file contains only one language and script, UTF-8 is still the encoding of choice because all major browsers support it. Otherwise an American user may, for example, need to download and install Chinese Big5 support when browsing a Web page encoded in Big5.
Scenario 2
The user enters some data into the HTML FORM entry fields and clicks the Submit button. The browser formats the input data into a message and sends it to the Web server via either the HTTP GET or POST method.
Using The GET or POST (ENCTYPE="application/x-www-form-urlencoded") method
With the GET method, the browser appends the user data to the URL of the application (as specified in the ACTION attribute). With the POST method and ENCTYPE="application/x-www-form-urlencoded" (the default), the browser sends the same encoded string in the body of the request instead. In both cases the data uses a restricted subset of the 7-bit US-ASCII code page. Since HTML 4 and RFC 1738 allow only ASCII characters in URLs, browsers encode non-ASCII characters using escape sequences %HH...%HH, where each HH is a pair of hexadecimal digits.
Example
The two double-byte Kanji characters that denote "Japan" will appear in the URL-encoded string as %93%FA %96%7B when encoded using IBM PC code page 932.
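The escaping itself is mechanical once the bytes are known, as the following Java sketch suggests. The class PercentEncoder is a hypothetical name, and the set of characters left unescaped is a simplification of what browsers do. The code page 932 bytes are hard-coded from the example above so that the sketch does not depend on the Shift JIS converter being installed.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper that escapes already-encoded bytes for a URL-encoded form.
class PercentEncoder {

    static String encode(byte[] data) {
        StringBuilder sb = new StringBuilder();
        for (byte b : data) {
            int v = b & 0xFF;
            boolean safe = (v >= 'A' && v <= 'Z') || (v >= 'a' && v <= 'z')
                    || (v >= '0' && v <= '9')
                    || v == '-' || v == '_' || v == '.' || v == '*';
            if (safe) {
                sb.append((char) v);              // unreserved ASCII stays as is
            } else if (v == ' ') {
                sb.append('+');                   // space becomes '+'
            } else {
                sb.append('%').append(String.format("%02X", v));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "Japan" in code page 932, bytes taken from the example in the text.
        byte[] cp932 = { (byte) 0x93, (byte) 0xFA, (byte) 0x96, (byte) 0x7B };
        System.out.println(encode(cp932));

        // The same two characters (U+65E5, U+672C) encoded in UTF-8 instead.
        System.out.println(encode("\u65e5\u672c".getBytes(StandardCharsets.UTF_8)));
    }
}
```

The two printed strings differ even though they stand for the same two characters, which is exactly why the server needs to know which code page the browser used.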
Question
How does the browser let the server's application know what code page was used to encode the data, especially the %HH code points?
Answer
When an HTML file with a charset defined (via the META element) contains a FORM, the FORM data is submitted in that charset. However, the input text fields (<INPUT TYPE=TEXT> and <TEXTAREA>) are handled by the native platform, and the result can be confusing. Consider the following scenario:
On a system running ISO-8859-1, use the browser to open an HTML file with a FORM. The META element in the file specifies the charset to be ISO-8859-2. The file contains the code point X'A3', which the browser properly displays as the Latin capital letter L with stroke. The user enters the code point X'A3' via an input method editor (IME) into the input field, but now the pound sterling symbol appears on the screen. How is it that the same code point X'A3' appears as the Latin capital letter L with stroke outside the input fields and as the pound sterling symbol inside an input field?
The reason for this behavior is that the operating system is responsible for the input and output of the input text fields, and the browser is responsible for the rest of the HTML file. Thus the operating system interprets the X'A3' code point using ISO-8859-1 (which gives the pound sterling symbol), while the browser interprets the X'A3' code point using ISO-8859-2 (which gives the capital L with stroke). Just before sending the data to the server, however, the browser in this case converts the input to ISO-8859-2.
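The two readings of X'A3' can be reproduced with a few lines of Java. This is only a demonstration under the assumption that the Java runtime provides an ISO-8859-2 converter (ISO-8859-1 is always present). The class name CodePointA3 is hypothetical.

```java
import java.nio.charset.Charset;

// Hypothetical demonstration class.
class CodePointA3 {

    /** Decodes a single byte under the named code page. */
    static String decode(int byteValue, String charsetName) {
        return new String(new byte[] { (byte) byteValue }, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        String os = decode(0xA3, "ISO-8859-1");        // pound sterling sign, U+00A3
        String browser = decode(0xA3, "ISO-8859-2");   // capital L with stroke, U+0141
        System.out.println("ISO-8859-1: U+" + Integer.toHexString(os.charAt(0)));
        System.out.println("ISO-8859-2: U+" + Integer.toHexString(browser.charAt(0)));
    }
}
```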
Unfortunately, there is no structured or standardized mechanism to communicate the code page information. IBM engineers researched this problem by examining the HTML 3.2, HTML 4.0, and HTTP 1.1 specifications, plus the Microsoft and Netscape Web sites. They specifically looked for information on how the code page encoding of FORM data sent via the GET method is supposed to be handled in HTML 3.2 and HTML 4.0. They also investigated the Netscape suggestion of coding a hidden field to specify the encoding code page for the GET method. The engineers did not find any definitive statement as to how the browser is supposed to handle the encoding of FORM data sent back to the server. By performing some experiments and reading between the lines, they concluded that both Microsoft IE and Netscape Navigator encode the FORM data using the same character encoding specified in the META element of the HTML file.
After further experimentation, the browser was found to encode FORM data using the currently active character encoding setting. For example, if the HTML file has a META tag that states it is encoded in ISO 8859-1, the FORM data will also be sent encoded in ISO 8859-1. If the user changes the browser's encoding to, say, ISO 8859-2 while viewing the HTML file, the FORM data will then be sent in ISO 8859-2. You cannot know in advance in what language or script the user will enter the FORM data: a Chinese customer may, for example, enter his name in Chinese into the Name field, while a French customer may enter her name in French into the same field. Using UTF-8 as the HTML file encoding is therefore a good choice, because it preserves the integrity of all the major scripts of the world.
- Scenario A:
- The English HTML file is sent to a browser on a Greek Windows 9x system running Windows ANSI code page 1253. There is no META element with the charset information present in the HTML file.
- The English HTML file is presented correctly (because code page 1253 contains all the English characters).
- The user enters Greek data (encoded in 1253) into the form fields. The browser will send the user data to the server encoded in 1253. Character integrity is preserved.
- Scenario B:
- The English HTML file is sent to a browser on a Greek Windows 9x system running Windows ANSI code page 1253. The META element in the HTML file specifies charset=iso-8859-1.
- The browser will use ISO/IEC 8859-1 to display the HTML file. The file will be presented correctly.
- The user enters Greek data (encoded in 1253) into the form fields. The browser will convert the user data from 1253 to 8859-1 prior to sending them to the server, with potential loss in character integrity since some of the characters in 1253 have no equivalent in 8859-1.
- Scenario C:
- The English HTML file is sent to a browser on a Greek Windows 9x system running Windows ANSI code page 1253. The META element in the HTML file specifies charset=utf-8.
- The browser will use UTF-8 to display the HTML file. The file will be presented correctly.
- The user enters Greek data (encoded in 1253) into the form fields. The browser will convert the user data from 1253 to UTF-8 prior to sending them to the server. Character integrity is preserved (since UTF-8 supports all the characters in 1253).
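The difference between Scenario B and Scenario C can be illustrated with a small Java sketch. The class RoundTrip is a hypothetical name. It starts from Unicode strings rather than from code page 1253 bytes, which is an assumption made so that only converters guaranteed to exist in every Java runtime are needed.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demonstration class.
class RoundTrip {

    /** True if every character of the text exists in the target code page. */
    static boolean survives(String text, Charset target) {
        return target.newEncoder().canEncode(text);
    }

    /** Encodes and decodes again; unmappable characters come back as '?'. */
    static String roundTrip(String text, Charset target) {
        return new String(text.getBytes(target), target);
    }

    public static void main(String[] args) {
        String greek = "\u03b1\u03b2\u03b3";   // alpha, beta, gamma
        System.out.println(survives(greek, StandardCharsets.ISO_8859_1));   // Scenario B: data is lost
        System.out.println(survives(greek, StandardCharsets.UTF_8));        // Scenario C: data is kept
        System.out.println(roundTrip(greek, StandardCharsets.ISO_8859_1));  // three question marks
    }
}
```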
The server processes the data by calling the appropriate application, and sends the MIME results back to the browser. Since the server's active code page may be (and often is) different from the code page used to encode the browser's submitted data, some servers automatically convert the submitted data to the server's code page before giving it to the invoked server application. If the data is in the form of application/x-www-form-urlencoded, then the invoked application must decode the data in order to retrieve the original name-value pairs, using the following steps:
- Convert the data back into the Web browser's code page.
- Decode the data: search for the "&" character (which acts as the name-value pair delimiter) and the "=" character (which separates a name from its value), and convert each "+" character back into a space.
- Convert the decoded name-value pairs into the server's code page and process the data.
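In Java the decoding steps collapse somewhat, because strings are Unicode internally and no separate server code page conversion is needed. The following sketch assumes the application somehow knows the browser's code page. The class FormDecoder is a hypothetical name, and java.net.URLDecoder does the %HH and "+" handling.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for application/x-www-form-urlencoded data.
class FormDecoder {

    /** Splits the data into name-value pairs and decodes each part with the given code page. */
    static Map<String, String> parse(String data, String browserCharset) {
        Map<String, String> pairs = new LinkedHashMap<String, String>();
        try {
            // Split on the delimiters first, so that an escaped "&" or "="
            // inside a value is not mistaken for a delimiter.
            for (String pair : data.split("&")) {
                if (pair.isEmpty()) continue;
                int eq = pair.indexOf('=');
                String rawName = eq < 0 ? pair : pair.substring(0, eq);
                String rawValue = eq < 0 ? "" : pair.substring(eq + 1);
                pairs.put(URLDecoder.decode(rawName, browserCharset),
                          URLDecoder.decode(rawValue, browserCharset));
            }
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown code page: " + browserCharset, e);
        }
        return pairs;
    }
}
```

For example, parse("name=%CE%B1+%CE%B2&x=1", "UTF-8") yields a name value of Greek alpha, a space, and Greek beta. Passing the wrong code page does not fail; it silently produces the wrong characters.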
See CGI Form data processing on Host environment for details and some sample C code.
In MS IE 5.0+ under Internet Options-->Advanced, there is a line item called Always send URLs as UTF-8 that is checked by default. If the HTML file header also contains the META element specifying the charset as UTF-8, or no charset is defined, then the browser uses UTF-8 to encode the data. Remember that whether or not the charset is received from the server, the browser always uses the current encoding setting (View->Character Coding) to send the FORM data back to the server.
How IBM WebSphere Application Server (WAS) solves the problem
WAS contains a file called bootstrap.properties located in the AppServer/properties directory, whose content is initialized by the customer webmaster. During the initialization of the servlet engine, it configures/bootstraps itself using the information in this file.
For URL-encoded FORM data, the Java servlets need to decode the data and then convert them to Unicode. The Java Servlet Development Kit (Servlet Specification Version 2.2) from Sun always assumes ISO/IEC 8859-1 is used to encode the FORM data (see Java Servlet Development Kit "bug" below for more information). WAS uses the following algorithm to detect the encoding code page:
- Check the default.client.encoding entry in the bootstrap.properties file. If the entry exists, then its value denotes the encoding code page. Otherwise,
- Check the Accept-Charset in the HTTP protocol. If it exists, then its value denotes the encoding code page. Otherwise,
- Check the Accept-Language in the HTTP protocol. If it exists, then the default code page for the Accept-Language denotes the encoding code page. (WAS has an internal table that maps each language to the most popular PC code page.) Otherwise,
- The encoding code page is assumed to be the file.encoding value returned by the JVM.
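The fallback chain above can be sketched in Java as follows. This is not WAS code: the class ClientEncodingResolver, its method names, and the three entries in the language table are assumptions made for illustration, and the sketch simply takes the first value listed in each header without weighing q-values.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of the fallback order; not the WAS implementation.
class ClientEncodingResolver {

    // Stand-in for the internal language-to-code-page table; entries are assumptions.
    private static final Map<String, String> LANGUAGE_DEFAULTS = new HashMap<String, String>();
    static {
        LANGUAGE_DEFAULTS.put("en", "ISO-8859-1");
        LANGUAGE_DEFAULTS.put("ja", "Shift_JIS");
        LANGUAGE_DEFAULTS.put("zh-tw", "Big5");
    }

    static String resolve(String configured, String acceptCharset,
                          String acceptLanguage, String jvmDefault) {
        if (present(configured)) return configured.trim();           // default.client.encoding
        if (present(acceptCharset)) return firstToken(acceptCharset); // Accept-Charset header
        if (present(acceptLanguage)) {                                 // Accept-Language header
            String mapped = LANGUAGE_DEFAULTS.get(firstToken(acceptLanguage).toLowerCase());
            if (mapped != null) return mapped;
        }
        return jvmDefault;                                             // file.encoding of the JVM
    }

    private static boolean present(String s) {
        return s != null && s.trim().length() > 0;
    }

    /** Returns the first listed value of a header, without any ";q=..." parameter. */
    private static String firstToken(String header) {
        return header.split(",")[0].split(";")[0].trim();
    }
}
```

A caller would typically pass System.getProperty("file.encoding") as the last argument.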
Using The POST (ENCTYPE="multipart/form-data") method
The POST method with ENCTYPE="multipart/form-data" is preferred because the value part of each name-value pair is encapsulated in the body part of a multipart MIME body, and sent as an HTTP 1.1 entity (see section 7 of RFC1867). Each body part can (and should) be labeled with an appropriate Content-Type, including a charset parameter that specifies the character encoding scheme. Every character in the HTML Document Character Set (which is ISO/IEC 10646) can be represented using this method.
Example
Content-Type: multipart/form-data; boundary=AaB03x

--AaB03x
Content-Disposition: form-data; name="surname"
Content-Type: text/plain; charset=iso-8859-1

Cheng
--AaB03x
Content-Disposition: form-data; name="given-name"
Content-Type: text/plain; charset=iso-8859-1

Alexis
--AaB03x--
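To make the labeling of each body part concrete, here is a hedged Java sketch that assembles such a message body. The class MultipartSketch is a hypothetical name, and the sketch handles only simple text fields (no file uploads, no escaping of field names). It returns the body as a string. A real client would then convert that string to bytes using the same charset it names in each part.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper; text fields only.
class MultipartSketch {

    static String body(Map<String, String> fields, String boundary, String charset) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> field : fields.entrySet()) {
            sb.append("--").append(boundary).append("\r\n");
            sb.append("Content-Disposition: form-data; name=\"")
              .append(field.getKey()).append("\"\r\n");
            // Label the part with its own charset, as the text recommends.
            sb.append("Content-Type: text/plain; charset=").append(charset).append("\r\n");
            sb.append("\r\n");                       // blank line ends the part headers
            sb.append(field.getValue()).append("\r\n");
        }
        sb.append("--").append(boundary).append("--\r\n");   // closing delimiter
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, String> fields = new LinkedHashMap<String, String>();
        fields.put("surname", "Cheng");
        fields.put("given-name", "Alexis");
        System.out.print(body(fields, "AaB03x", "utf-8"));
    }
}
```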
Note: Netscape Navigator v7 and Microsoft IE v6 do use the charset specified in the META element of the HTML file and HTTP 1.0 to POST the FORM data to the server, but don't generate the charset parameter in the MIME header.
Recommendations
Use the POST method with ENCTYPE="multipart/form-data" to send user-entered data from the client to the server, even though Netscape Navigator v4.7 and Microsoft IE v5 currently use HTTP 1.0 and do not specify the charset parameter in the MIME header. The browsers may change their behavior in a future release.
If your HTML file does not contain any FORM for the user to input data, then follow the recommendations in the HTML Documents Coded Character Sets Guidelines to encode the HTML file.
If your HTML file contains one or more FORMs for the user to input data, then always use UTF-8 to encode the FORM data in order to prevent any data corruption. Include the following in the HTML file:
:
<META http-equiv="Content-Type"
content="text/html; charset=utf-8">
:
<FORM method="POST"
enctype="multipart/form-data" ...>
:
At the server end, your application can examine the incoming UTF-8 FORM data and take appropriate action. Given that our current translation centers typically return the translated HTML files encoded in non-UTF-8 code pages, you'll need to convert the translated HTML files to UTF-8 yourself.
For quite some time, it has been difficult to configure a server to send out appropriate charset headers. However, this situation has been improved recently. For example, see www.apache.org/docs/mod/mod_mime.html#addcharset for the AddCharset directive in Apache 1.3.10 and later, or www.w3.org/Jigsaw/RelNotes.html#2.1.1 for equivalent support in Jigsaw.
New submit method in XForms
XForms is an XML application that represents the next generation of forms for the Web. It provides a new "post" submission method that sends the data as application/xml. This format expresses the instance data as XML, which is straightforward to process with off-the-shelf XML processing tools. In addition, this format can submit binary content. The encoding of the submitted data is declared in the XML declaration, for example, <?xml version="1.0" encoding="gb2312"?>.
Java Servlet Development Kit (JSDK) "bug"
Note: The bug described in this section occurs in Java Servlet Specification Version 2.2 and lower. The JSDK that implements Servlet Specification Version 2.3 or above adds a new method, setCharacterEncoding(...), to the javax.servlet.ServletRequest interface to address this problem.
In the October 1998 issue of The VisualAge Magazine, an article entitled Writing internationalized servlets with VisualAge for Java describes a code page problem with the Java Servlet Development Kit (JSDK) v2. The following scenario illustrates the problem:
A Japanese browser sends the following URL-encoded form data to a Java servlet on a Web server:
http://...?abc=%90%A2
where X'90A2' is the double-byte code point of a Japanese Kanji character in Shift JIS. The servlet calls the Java method HttpServletRequest.getParameter(...), which returns a String object with the values X'0090' and X'00A2' instead of the Unicode equivalent X'4E16'. The reason is that the servlet has no way to know which code page was used to encode the incoming data, so it assumes the data (X'90A2' in this case) is in Latin 1 (ISO 8859-1). The Unicode equivalent of a Latin 1 character is simply the 8859-1 code point prefixed with X'00'.
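A commonly used workaround, sketched below, is to undo the Latin 1 assumption by turning the string back into its original bytes and decoding them again with the real code page. The class ParameterRepair is a hypothetical name, and the sketch assumes the application knows (or has guessed) the code page the browser used. It works because ISO 8859-1 maps every byte value to exactly one character, so no information is lost in the wrong decoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for repairing a parameter that was decoded as Latin 1.
class ParameterRepair {

    static String redecode(String misdecoded, Charset actual) {
        // Step 1: recover the original bytes (each char X'00HH' becomes byte X'HH').
        byte[] original = misdecoded.getBytes(StandardCharsets.ISO_8859_1);
        // Step 2: decode those bytes with the code page the browser really used.
        return new String(original, actual);
    }

    public static void main(String[] args) {
        String fromServlet = "\u0090\u00a2";   // what getParameter returned in the scenario
        if (Charset.isSupported("Shift_JIS")) {
            String repaired = redecode(fromServlet, Charset.forName("Shift_JIS"));
            System.out.println(Integer.toHexString(repaired.charAt(0)));
        }
    }
}
```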
The following experiment demonstrates the problem.
Software configuration
- [US English] Windows NT 4.0 with service pack 3
- JDK 1.1.7
- JSDK 2.0
- Environment variables:
- PATH=...d:\jsdk\bin;d:\jdk1.1.7\bin;...
- CLASSPATH=.;d:\jsdk\lib\jsdk.jar
Procedures:
- Run the servletrunner.exe included in JSDK 2.0.
- Open a browser with the URL http://cycheng:8080/servlet/TestServlet?abc=%90%A2
where X'90A2' is the double-byte code point of a Japanese Kanji character in Shift JIS, and TestServlet.java is:
import javax.servlet.*;
import javax.servlet.http.*;
import java.io.*;
import java.util.*;

public class TestServlet extends HttpServlet
{
    public void doGet( HttpServletRequest req,
                       HttpServletResponse res )
        throws ServletException, IOException
    {
        Enumeration params;
        String name, value;

        res.setContentType( "text/html" );
        PrintWriter pw = new PrintWriter( res.getOutputStream() );
        pw.println( "<HTML><HEAD>" );
        pw.println( "<TITLE>Test Servlet</TITLE>" );
        // Declare the page as Shift JIS so the browser displays it accordingly.
        pw.println( "<META http-equiv=\"Content-Type\" "
                  + "content=\"text/html; charset=Shift_JIS\">" );
        pw.println( "</HEAD><BODY>" );
        pw.println( "<H1>Test Servlet</H1>" );
        pw.println( "<P>" );

        params = req.getParameterNames();
        while ( params.hasMoreElements() )
        {
            name = (String)params.nextElement();
            value = req.getParameter( name );
            if ( value == null )
                value = "";

            // Dump the first two characters as numbers and as characters.
            // A correctly decoded Kanji would be a single character.
            if ( value.length() >= 2 )
            {
                char ca[] = new char[2];
                value.getChars( 0, 2, ca, 0 );
                pw.println( "(int)value = " + (int)ca[0] + " " +
                            (int)ca[1] + "<P>" );
                pw.println( "(char)value = " + ca[0] + " " + ca[1] + "<P>" );
            }

            // Report which of the three possible interpretations was received.
            if ( value.startsWith( "\u4e16" ) )
                pw.println( "Parameter value is: X'4E16' [Unicode]" );
            else if ( value.startsWith( "\u0090\u00a2" ) )
                pw.println( "Parameter value is: X'0090' and X'00A2' [???]" );
            else if ( value.startsWith( "\u90a2" ) )
                pw.println( "Parameter value is: X'90A2' [Shift JIS]" );
            else
                pw.println( "No match!" );

            pw.println( "<P>Parameter name is: <EM>" + name + "</EM>" );
            pw.println( "<BR>Parameter value is: <EM>" + value + "</EM>" );
        }

        pw.println( "</BODY></HTML>" );
        pw.flush();
        pw.close();
    } // end of doGet()
}
Browser output
Test Servlet
(int)value = 144 162
(char)value = ? ¢
Parameter value is: X'0090' and X'00A2' [???]
Parameter name is: abc
Parameter value is: ?¢