Unicode and IBM WebSphere

On the Development and Deployment of Unicode Based Multilingual Web Applications in IBM WebSphere Application Server

Kentaro Noji
Globalization Center of Competency
Yamato Software Laboratory
IBM Japan, Ltd.

Debasish Banerjee
WebSphere Development
IBM Rochester
IBM Corporation

Abstract. With the advent and popularity of Internet-based e-commerce products, the need to develop multilingual Unicode-based applications is becoming increasingly important. The IBM WebSphere application server is very well suited for the development and deployment of multilingual Unicode-based applications, both traditional and Web-based. The globalization mechanism embedded in the Web container of the WebSphere application server allows one to develop internationalized Servlets and JSPs to serve documents in any language and code set of choice, including Unicode-based multilingual documents. The Web container provides unique features for code set customization and fine-tuning. A system administrator can map language names to code sets of choice, including Unicode, and the IANA code set names of Asian ideographic languages can be fine-tuned to correspond to the Java™ Development Kit (JDK) converters of choice. The present paper describes some important technical considerations behind the development and deployment of multilingual Unicode-based Java™ 2 Enterprise Edition (J2EE) compliant Web applications. WebSphere's unique globalization mechanism, including the code set customization, is also explained with accompanying examples of a Servlet and a JSP for serving multilingual Unicode-based documents. The ongoing and future internationalization work in WebSphere application server is also highlighted.

1 Introduction

The IBM WebSphere Application Server, Version 4.0, provides a Java™ 2 Enterprise Edition (J2EE) 1.2 [7] compliant environment for the development and deployment of enterprise applications covering a wide variety of back-ends and front-ends. Ideally, all the business and presentation logic should use Unicode [11] for uniform and unrestricted processing and representation of characters from any language in the world. Indeed, all the Java™ based server-side business components deployed in WebSphere internally use Unicode, and Unicode is the process code set of Java. Unfortunately, not all the back-ends (databases, transaction processing monitors, etc.) and front-ends (application client GUIs, browsers, etc.) use Unicode, so they may not have Unicode handling or presentation capabilities. To interface with legacy applications, WebSphere application components may also have to use native code sets.
Internet-based eCommerce applications are becoming increasingly popular, and IBM WebSphere, Version 4.0, offers a powerful environment for hosting such applications. The users of an eCommerce application can be located in any country and can potentially use any code set, including Unicode, for communicating with the server-side business logic. Clearly, a globalized server-side Web application should provide support for multiple code sets, and it should be able to receive and send data in any selected code set including Unicode. IBM WebSphere's Web container provides a unique customizable and fine-tunable code set selection mechanism for hosting Servlets and JSPs, the two J2EE server-side Web components.

The present paper describes the motivation and actual implementation behind this code set selection mechanism, along with appropriate examples. Section 2 illustrates a general globalized eCommerce environment. Section 3 describes the code set selection mechanism embedded inside IBM WebSphere's Web container. Section 4 contains examples illustrating the code set selection mechanism. Section 5 mentions the future globalization intentions of IBM WebSphere, and finally Section 6 presents our conclusions. A few configuration files and configuration procedures appear in the Appendices.

2 A Globalized eCommerce Environment

Figure 1 illustrates a typical large eCommerce deployment scenario, which may have clients and servers situated in various geographically distinct locations. A Web browser can access any Web server application program, and a server-side Web application should be able to communicate with any browser client located anywhere in the world. IBM WebSphere Application Server can naturally assume the role of servers like A, B, C or D.

Figure 1. A large eCommerce deployment scenario

Servers A and C serve multilingual Web content to the requesting Web clients, while servers B and D only participate in intra-server communications, and can process and serve multilingual content to other servers. To communicate effectively and reliably in a multilingual environment, a receiver should know the code set of the incoming request. If all the server-side components are written in Java, the intra-server communication will take place in Unicode, and no special consideration is needed for code set determination. But for a server like A or C that communicates with clients, it is strictly necessary to determine the input and output code sets associated with requests and responses.

3 Ascertaining Code Sets in IBM WebSphere

Servlets and JSPs usually communicate with the clients using the HTTP protocol [2]. This section describes the way by which the IBM Web container (Version 4.0) attempts to determine the input and output code sets associated with HTTP-based communications between browser clients and Servlets or JSPs.

3.1 Code set of an HTTP Request

HTTP input data can be encoded in any valid IANA [3] code set. Inside a Servlet or a JSP, the HTTP input data is usually obtained by invoking the getParameter() family of methods available in the javax.servlet.ServletRequest interface. The entire request body can also be obtained using the java.io.BufferedReader object returned by the javax.servlet.ServletRequest.getReader() method. All the above methods return data encoded in the UCS-2 (Java's internal process code set) variant of Unicode, and the Web container has to convert the input HTTP data to UCS-2.
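The conversion to Java's internal representation can be illustrated outside any Web container. The following is a minimal sketch, not WebSphere code: the class name RequestDecodingSketch, the decode() helper and the sample character are all illustrative assumptions. It decodes the same request bytes with two different converters and shows that only the converter matching the sender's code set recovers the original character.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical illustration; not part of the Servlet API or of WebSphere. */
final class RequestDecodingSketch {

    /** Converts raw request bytes to a Java String using the given converter. */
    static String decode(byte[] raw, Charset converter) {
        return new String(raw, converter);
    }

    public static void main(String[] args) {
        // U+00E9 (e with acute accent) sent by a client in UTF-8 is the two bytes 0xC3 0xA9.
        byte[] raw = "\u00e9".getBytes(StandardCharsets.UTF_8);

        String matching = decode(raw, StandardCharsets.UTF_8);
        String mismatched = decode(raw, StandardCharsets.ISO_8859_1);

        // The matching converter yields one character, the mismatched one yields two.
        System.out.println(matching.length());    // prints 1
        System.out.println(mismatched.length());  // prints 2
    }
}
```

With the mismatched converter each byte is interpreted as a separate Latin-1 character, so the Servlet would see U+00C3 U+00A9 instead of U+00E9.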
To perform a proper conversion, the Web container has to know the encoding of the input HTTP request so that it can invoke an appropriate JDK converter for conversion to UCS-2. Theoretically speaking, an HTTP request may have a Content-Type header optionally containing a charset attribute. For example, an HTTP client can transmit the header

    Content-Type: text/html; charset=ISO-8859-2

along with a GET request. The Web container can then easily convert the ISO-8859-2 encoded data to UCS-2. Unfortunately, like all the other HTTP headers, this Content-Type header is also optional, and the presence of the charset component in a Content-Type header is optional too. In fact, neither Netscape nor Microsoft Internet Explorer, the two most popular browsers, transmits Content-Type HTTP headers containing any charset attribute.

The question naturally arises: In the absence of any explicit code set information in the HTTP request, how can a Web container perform an appropriate UCS-2 conversion? Web containers available in the market have followed various ad-hoc strategies to arrive at a value of the input code set, though some of them are arguably wrong. Some of the strategies that we have seen or have heard of are:

- If available, use the value of the Accept-Charset HTTP header as the value of the input encoding. This approach is incorrect: Accept-Charset is not intended to specify the encoding of the input request.
- Use the default JDK converter for conversion to UCS-2. This approach assumes the input code set to be identical to that of the file.encoding system property of the Web container's Java™ Virtual Machine (JVM), and it may not work in multilingual environments. It may also create trouble in EBCDIC environments (System/390).
- Always use the ISO-8859-1 → UCS-2 converter.
  Obviously, this approach may not work for non-Latin1 clients.

3.2 Deciding on the Input Code Set

If the input request does not explicitly specify the code set value using the Content-Type HTTP header, there is no simple, definitive way to arrive at a value of the input encoding. A Web container can only apply heuristic strategies to arrive at a reasonable value of the input code set using indirect avenues. The following sketches the heuristic strategy followed by the IBM Web container. The strategy is divided into four sequential steps. If the Web container decides on the input code set at a particular step, the succeeding steps are skipped.

Step 3.2.1 If the Content-Type HTTP header is present and contains the charset attribute, the value of the charset attribute is the input code set.

Step 3.2.2 Try to determine the input code set from the locale associated with the HTTP request. The locale of the javax.servlet.http.HttpServletRequest object may be determined from the Accept-Language HTTP header [2, 6, 7]. The input locale is mapped to a code set using encoding.properties, an IBM WebSphere-provided properties file for mapping locales to IANA charsets. Figure 2 illustrates a sample mapping. Appendix A shows a typical encoding.properties file.

    Locale Name    IANA Charset Name
    en             ISO-8859-1
    cs             ISO-8859-2
    ja             Shift_JIS
    ko             EUC-KR
    zh             GB2312
    zh_TW          Big5

Figure 2. Sample mapping rules in encoding.properties

Step 3.2.3 Look for default.client.encoding, a Web container-specific JVM system property. If present, use that value as the input code set.

Step 3.2.4 As the final recourse, just use ISO-8859-1 as the input code set.

3.3 Deciding on the Output Code Set

Much as with the input request, on the output side a Servlet has to convert UCS-2 encoded data before sending it to the browsers. If a Servlet or a JSP developer explicitly specifies a charset attribute by invoking the javax.servlet.ServletResponse.setContentType() method, the output code set is known. In the absence of a ServletResponse.setContentType() invocation, again there is no clear way to arrive at a value for the output code set. To decide the value of the output encoding, the IBM Web container follows the heuristic strategy below. If the Web container decides on the output code set at a particular step, the succeeding steps are skipped.

Step 3.3.1 If the Servlet or JSP developer has explicitly specified a charset attribute, use the value of the attribute as the output code set.

Step 3.3.2 If the Servlet or JSP developer has explicitly invoked the javax.servlet.ServletResponse.setLocale() API, use encoding.properties to map the specified locale to a code set.

Step 3.3.3 Use ISO-8859-1 as the value of the output code set.

3.4 Fine-Tuning Code Set Converters

The code set names used in Internet protocols must be registered in the IANA charset database. For certain language environments, the official IANA charset names may have more than one JDK converter associated with them. For example, the most popular code set in Japanese PC environments is Shift-JIS, and there exist a large number of Shift-JIS converters. In fact, the JDK presently supports the Cp943, Cp943C, Cp942, Cp942C, SJIS, and MS932 converters. All of these converters are for UCS-2 ↔ Shift-JIS conversions. These converters are very similar but not identical. Figure 3 depicts four variants of UCS-2 → Shift_JIS conversions for the \u2015\uff5e\u2225\uff0d\uffe4\u2014\u301c\u2016\u2212\u00a6 string using the native2ascii command of JDK V1.3.

Figure 3. Sample Conversions

The JDK equates Shift-JIS to MS932, but some Web container installations may want to use Cp943C or SJIS for conversion to or from UCS-2. For fine-tuning the selection of input and output code set converters, IBM WebSphere provides converter.properties, a properties file for mapping IANA charset names to JDK converters. Figure 4 depicts a sample mapping, and a typical converter.properties file appears in Appendix A.

    IANA Charset Name    JDK Converter
    Shift_JIS            Cp943C
    EUC-JP               Cp33722C

Figure 4. Sample mapping rules in converter.properties

To take converter.properties into consideration, the following fine-tuning step is added to our input and output code set determination strategies.

Fine-Tuning Step Search converter.properties for a match with the IANA code set name. If there is a match, use the corresponding JDK converter for conversions to and from UCS-2; otherwise use the original IANA name as the JDK converter.

3.5 Customization

The IBM Web container determines the input and output code sets based on the various internationalization configuration parameters as detailed in Sections 3.2, 3.3, and 3.4. All of these internationalization configuration parameters are customizable by system administrators. Both encoding.properties, the mapping from locale to IANA charset, and converter.properties, the mapping from IANA charset to JDK converters, are exposed as properties files, and both can be altered to suit specific Web container installations. For example, in a Japanese PC-based environment, the ja → Shift_JIS mapping should suffice, whereas in a Linux client environment, the mapping should be changed to ja → EUC-JP. If all the Japanese Web content is encoded in UTF-8, the mapping rule must be changed to ja → UTF-8 for that particular installation. In a pure Unicode-based environment, all Web input is encoded in UTF-8. The IBM Web container can easily set the input code set to be UTF-8 for specific languages.
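Read together, the input steps of Section 3.2, the output steps of Section 3.3 and the fine-tuning step of Section 3.4 amount to a short chain of lookups over two customizable mappings. The Java sketch below is one illustrative reading of that chain, not the Web container's actual implementation. The class CodeSetChooser and all of its methods are hypothetical, the two maps merely stand in for encoding.properties and converter.properties, and the fallback from a name such as zh_CN to its language part zh is an assumption that the text does not spell out. Charset names are matched case-sensitively here for brevity.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper sketching Sections 3.2 to 3.4; not a WebSphere API. */
final class CodeSetChooser {
    private final Map<String, String> localeToCharset;    // stands in for encoding.properties
    private final Map<String, String> charsetToConverter; // stands in for converter.properties
    private final String defaultClientEncoding;           // default.client.encoding, may be null

    CodeSetChooser(Map<String, String> localeToCharset,
                   Map<String, String> charsetToConverter,
                   String defaultClientEncoding) {
        this.localeToCharset = localeToCharset;
        this.charsetToConverter = charsetToConverter;
        this.defaultClientEncoding = defaultClientEncoding;
    }

    /** Returns the charset attribute of a Content-Type value, or null if absent. */
    static String charsetOf(String contentType) {
        if (contentType == null) return null;
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String value = p.substring("charset=".length()).trim();
                return value.isEmpty() ? null : value;
            }
        }
        return null;
    }

    /** Locale to charset lookup; the language-only fallback is an assumption. */
    String charsetForLocale(String locale) {
        if (locale == null) return null;
        String cs = localeToCharset.get(locale);
        int underscore = locale.indexOf('_');
        if (cs == null && underscore > 0) {
            cs = localeToCharset.get(locale.substring(0, underscore));
        }
        return cs;
    }

    /** Fine-tuning step: use the mapped JDK converter, else the IANA name itself. */
    String fineTune(String ianaName) {
        String converter = charsetToConverter.get(ianaName);
        return converter != null ? converter : ianaName;
    }

    /** Steps 3.2.1 to 3.2.4, followed by the fine-tuning step. */
    String inputConverter(String contentType, String locale) {
        String cs = charsetOf(contentType);             // Step 3.2.1
        if (cs == null) cs = charsetForLocale(locale);  // Step 3.2.2
        if (cs == null) cs = defaultClientEncoding;     // Step 3.2.3
        if (cs == null) cs = "ISO-8859-1";              // Step 3.2.4
        return fineTune(cs);
    }

    /** Steps 3.3.1 to 3.3.3, followed by the fine-tuning step. */
    String outputConverter(String contentType, String locale) {
        String cs = charsetOf(contentType);             // Step 3.3.1
        if (cs == null) cs = charsetForLocale(locale);  // Step 3.3.2
        if (cs == null) cs = "ISO-8859-1";              // Step 3.3.3
        return fineTune(cs);
    }

    public static void main(String[] args) {
        Map<String, String> encoding = new HashMap<>();
        encoding.put("en", "ISO-8859-1");
        encoding.put("ja", "Shift_JIS");
        encoding.put("zh", "GB2312");
        encoding.put("zh_TW", "Big5");
        Map<String, String> converter = new HashMap<>();
        converter.put("Shift_JIS", "Cp943C");

        CodeSetChooser chooser = new CodeSetChooser(encoding, converter, "UTF-8");
        System.out.println(chooser.inputConverter("text/html; charset=ISO-8859-2", "ja")); // ISO-8859-2
        System.out.println(chooser.inputConverter(null, "ja"));   // Cp943C
        System.out.println(chooser.inputConverter(null, "hi"));   // UTF-8
        System.out.println(chooser.outputConverter(null, null));  // ISO-8859-1
    }
}
```

In this sketch, customizing an installation corresponds to changing map entries: replacing the ja entry with UTF-8 makes inputConverter(null, "ja") return UTF-8, because no converter.properties entry matches that name.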
To select UTF-8 for a language, the system administrator simply has to specify UTF-8 in the encoding.properties file for the appropriate locales. Entries for new locales can also be added easily. The default.client.encoding Web container property should be used as a catch-all, and it is recommended that it be set to UTF-8. The input code set for any unusual locale (for example, various Indic locales) will then automatically default to UTF-8.

Certain environments may need customization of the converter.properties file. As mentioned in Section 3.4, in Japanese environments, the Shift_JIS code set corresponds to more than one JVM converter. In fact, Shift-JIS can really be considered to be a vendor-unique code set, where the actual character sets and the Shift_JIS ↔ UCS-2 mappings depend on the vendor-specific implementations. If one needs to follow the JIS (Japanese Industrial Standard) or the UTC (Unicode Technical Committee) standard Shift_JIS code set conversion rules, it may suffice to map the Shift_JIS entry of converter.properties to the SJIS converter. As a side effect, some vendor-specific characters defined in Microsoft Windows or for the Macintosh may simply disappear. Figure 5 shows some NEC-defined characters, which will be filtered out by the JDK's SJIS converter.

Figure 5. Some NEC special characters filtered out by Java SJIS converter

If a particular installation needs to use an IBM-defined code conversion rule, especially for using IBM back-end data storage (DB2, IMS, etc.), Shift_JIS should be mapped to Cp943C, or some important characters may be corrupted in the Web application.

4 Examples

This section briefly describes illustrative examples using a Servlet and a JSP serving data in Unicode. The Unicode data is represented as escaped Unicode sequences.
The variable unicode_data in Examples 1 and 2 represents arbitrary data from a Shift_JIS database. The unicode_data string is sent to the browser in the Shift_JIS encoding, using the IANA charset parameter explicitly specified in the setContentType() call. Figures 6 and 7 show the results as displayed in MS Internet Explorer without and with fine-tuning.

Example 1. Servlet

    import java.io.IOException;
    import java.io.PrintWriter;
    import javax.servlet.ServletException;
    import javax.servlet.http.HttpServlet;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;

    public class Sample extends HttpServlet {
        // unicode_data is an example of a telephone number in Unicode. Normally, a Unicode
        // string is transmitted via JDBC, HTTP communication and so on. Here we present a
        // simulation using an escaped sequence.
        String unicode_data = "\u96fb\u8a71(Phone)\uff17\uff12\uff13\u2212\uff13\uff12\uff15\uff16";

        public void doGet(HttpServletRequest request, HttpServletResponse response)
                throws ServletException, IOException {
            // unicode_data is converted to Shift_JIS using a JDK converter
            response.setContentType("text/html; charset=Shift_JIS");
            PrintWriter pw = response.getWriter();
            pw.println("");
            pw.println("