Localisation and Internationalisation: Dealing With the Difficulties

Separating the internationalisation and localisation processes too much leads to extra work for companies trying to "go international." Barry Caplan explains how to minimise losses and maximise results.
Internationalisation (i18n) and localisation (l10n) are intricately linked processes. L10n service providers have paid a great deal of attention to their internal processes, and some of their larger customers have worked hard on their own. Developers, likewise, have paid attention to the low-level work of enabling software to support various character encodings. In my experience, however, neither of these advances addresses the needs of most companies attempting to deliver international products for the first time, companies that may not have the necessary time, resources, or experience to develop stable processes.
For these companies, the working assumption is that creating and delivering international products is possible and should not be difficult. After all, they reason, many other companies have obviously done it. It is hard to argue with that logic, yet time after time companies struggle to "go international" and to integrate best practices into their existing processes.
The hidden areas
In this article, I will discuss one of the hidden areas where i18n and l10n processes and techniques overlap and cause trouble. There are other issues I do not address here, but understanding this one and applying the results will go a long way towards making your international efforts as effective as possible.
Most important to the success of any localisation effort is a thorough understanding of the state of internationalisation of the product and project. I18n and l10n are tightly linked, yet the tasks involved in each are often performed by separate groups (see the article at 189205 for a brief explanation of the differences between i18n and l10n).
The two parts of the project, i18n and l10n, are often separated. They may be done by different teams, sometimes in widely separated parts of a company's org chart. For example, i18n may be part of Product Development while localisation sits in Operations. It is common for Sales to be the group driving, and sometimes even owning, both parts, even though both are far from the traditional charter of a Sales team.
Marketing groups may be on their own when it comes to localising outward-facing websites. And who actually performs the work? The people doing i18n and l10n may be outsourcers, part of the main company, or a combination of both, and anyone could be located anywhere in the world. These issues introduce a severe communication problem that needs to be managed carefully.
Measures of success
There is an old saying: "You can't manage what you can't measure." Yet an enormous part of i18n and l10n work is managed without metrics at all. I propose at least two places where metrics would make processes run much more smoothly. Space does not allow me to propose specific metrics here, but I hope to encourage everyone to consider these issues carefully.
The consensus definition of i18n is usually along the lines of "prepare a generic, locale-neutral product that is ready for l10n".
This generic definition is crafted to ignore the vast differences in coding styles, systems, and tools used to create a product that needs localisation.
Coherence and cohesiveness
What does it mean for a product to be internationalised? How can we describe the level of i18n in a way that lets us track it over time and eventually correlate it with metrics from the subsequent l10n effort?
The vast majority of software products consist of a number of components that interact with each other. Depending on the skill of the software architects, these components may be nicely layered. The technical terms for this are coherence and cohesiveness, terms that refer to how well the blocks fit together yet stand independently.
Architecture and i18n
Many software engineers and managers without formal training in computer science tend to be unaware of these concepts, yet their importance to QA, maintenance, and the addition of new features is well known in academic circles.
Certain aspects of these features are important in i18n work. Let's use a semi-fictional example of software and examine its architecture with respect to internationalisation issues.
What I call a software module may range in size from a few lines of code to a complex subsystem. The important aspect for our purposes is that a module always accepts input from another module, transforms the input to some output format, and then passes the output either back to the first module or on to a third.
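As a minimal sketch of that idea, the module concept can be expressed as a single interface; the name and signature here are invented purely for illustration.

```java
// A minimal, hypothetical sketch of the module concept just described: input
// arrives from an upstream module, is transformed, and is handed downstream.
// The interface name is invented for illustration only.
public interface TextModule {
    // Input and output are raw bytes; every implementation carries its own,
    // often implicit, assumption about the character encoding of both.
    byte[] transform(byte[] input);
}
```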
Character encoding
Software modules often need to pass plain-text data among themselves. As localisation professionals familiar with character encoding are aware, a large number of character encodings are in common use, and they are not always compatible with each other. Managing this single technical issue is a core part of internationalisation engineering.
Because the software modules are often created independently of each other, each may have its own expectation of how textual data is formatted at the input, internal processing, and output stages. This means that even though each module may (or may not) be internally consistent in how it manages text encoding, the system as a whole may not be. In fact, it probably isn't. Textual data may undergo various encoding transformations along its overall path, and these may not be consistent, correct, or sufficient.
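A small, self-contained Java sketch shows the kind of mismatch this causes: one module emits UTF-8 bytes, the next assumes ISO-8859-1, and the text is silently corrupted rather than rejected. The sample string is invented.

```java
import java.nio.charset.StandardCharsets;

// Demonstrates what happens when two modules disagree about text encoding:
// the data is not rejected, it is quietly turned into "mojibake".
public class EncodingMismatchDemo {
    public static void main(String[] args) {
        String original = "Résumé für Ångström";            // text with non-ASCII characters

        // Module A writes its output as UTF-8 bytes.
        byte[] moduleAOutput = original.getBytes(StandardCharsets.UTF_8);

        // Module B incorrectly assumes the bytes are ISO-8859-1.
        String moduleBInput = new String(moduleAOutput, StandardCharsets.ISO_8859_1);

        System.out.println("Original:  " + original);
        System.out.println("Corrupted: " + moduleBInput);   // e.g. "RÃ©sumÃ© fÃ¼r Ã…ngstrÃ¶m"
    }
}
```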
In practice, it is rare that a software team can describe this situation in detail after internationalisation commences, let alone before. But it is also clear that a thorough understanding of the path that locale-related data follows is important. This is a ripe area for identifying and applying metrics.
Website application
Our semi-fictional example is a web application used to receive and respond to customer emails in high volume. It illustrates many of the principles we are concerned with. Before we look at how to measure i18n status, let's take a look at what the system does.
Data arrives at the system as ordinary email messages via the standard SMTP protocol. The messages are then retrieved from a mail server via the POP protocol, just as most desktop mail clients do. In this case, though, the messages are broken down into their constituent parts, and the parts are stored in a database, which might be Oracle, SQL Server, Postgres, or any number of other database systems.
Once the email message is in the database, a Customer Service Representative (CSR), using a web-browser-based interface, can display the message and compose a response to the original sender. The response may include text created by natural-language-processing modules that analyse the data. The final response is stored back in the database and then sent on its way via SMTP, just as any ordinary email is.
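The point where an inbound message's declared character encoding first becomes visible to this kind of application might look roughly like the following sketch, which uses the JavaMail API. The file name is hypothetical, and real code would also walk multipart structures and cope with missing or wrong charset declarations.

```java
import java.io.FileInputStream;
import java.io.InputStream;
import java.util.Properties;
import javax.mail.Session;
import javax.mail.internet.ContentType;
import javax.mail.internet.MimeMessage;

// Reads a stored raw message and reports the charset the sender *claims*
// the body uses. That claim may be absent or simply wrong.
public class InboundMessageSketch {
    public static void main(String[] args) throws Exception {
        Session session = Session.getDefaultInstance(new Properties());
        try (InputStream in = new FileInputStream("message.eml")) {   // hypothetical file
            MimeMessage message = new MimeMessage(session, in);

            ContentType contentType = new ContentType(message.getContentType());
            String declaredCharset = contentType.getParameter("charset");

            System.out.println("Subject:          " + message.getSubject());
            System.out.println("Declared charset: " + declaredCharset);
        }
    }
}
```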
The modules
With that brief description of the functionality, let's identify and analyse the actual modules the message passes through. Remember, we are focusing on the text for now. A similar analysis needs to be done for other locale-related data, such as dates and times, domain names and anything else a particular application needs to be aware of. Here are some of the modules that are external and not under the control of the software developers (beyond the choice of which version to use):
Inbound SMTP server. This software is supposed to receive messages according to a series of internet specifications called RFCs. But the RFCs have conflicting and subtle requirements regarding character encoding. Without identifying and tracking these, all downstream data flow may be suspect!
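One concrete example of those RFC subtleties: header fields such as Subject carry non-ASCII text as RFC 2047 "encoded words", which name their own charset inline. The sketch below uses JavaMail's MimeUtility; the sample header value is invented.

```java
import javax.mail.internet.MimeUtility;

// Decodes an RFC 2047 encoded-word header value. If any module upstream
// mangles this string, the charset information travels with the damage.
public class EncodedWordDemo {
    public static void main(String[] args) throws Exception {
        String rawSubject = "=?ISO-8859-1?Q?R=E9sum=E9_received?=";
        String decoded = MimeUtility.decodeText(rawSubject);
        System.out.println(decoded);   // expected: "Résumé received"
    }
}
```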
POP server. The POP server is also supposed to behave according to another set of RFC specifications. These RFCs may or may not affect character-encoding integrity. Without a careful understanding of these issues, downstream data may also be suspect.
Network transport layers (HTTP, etc.). These protocols also introduce their own ideas of which character encodings are identified, expected and allowed.
Choice of web server. The ways a browser interacts with a web server to identify the user's preferred locale settings are complex and ill-defined, and they may vary by server implementation as well.
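How that negotiation surfaces in server-side code depends on the server and framework; the servlet API is one common case, sketched below under that assumption.

```java
import java.io.IOException;
import java.util.Enumeration;
import java.util.Locale;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// Lists the locales the browser announced via the Accept-Language header,
// in preference order. If the header is absent, the server's own default
// locale is silently substituted.
public class LocaleProbeServlet extends HttpServlet {
    @Override
    protected void doGet(HttpServletRequest request, HttpServletResponse response)
            throws IOException {
        response.setContentType("text/plain; charset=UTF-8");

        Enumeration<Locale> locales = request.getLocales();
        while (locales.hasMoreElements()) {
            response.getWriter().println("Preferred locale: " + locales.nextElement());
        }
    }
}
```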
Web server scripting language. The web pages the CSR sees are necessarily generated dynamically. This means there are programs, communicating with the web server, that are responsible for generating the HTML sent to the browser. The choice of programming languages and tools for this task introduces another set of variables concerning which character encodings are supported (perhaps correctly and completely, perhaps not).
Browser scripts (Javascript). The HTML that is ultimately sent to the browser may contain Javascript for client-side processing of data. Javascript implementations, though, vary across browsers in some key ways with respect to text handling.
Server- and client-side Java Virtual Machines. As with the client-side Javascript implementation issues, the choice of Java as a programming tool, on either the server or the client, raises issues. Java relies on a Java Virtual Machine (JVM) to provide services that traditional programs received directly from the operating system. However, the services that are available, and the correctness of their implementations, vary across versions of the various JVMs.
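A short illustration of that variation: any text conversion that does not name a charset silently uses the JVM's default, which differs between machines, operating systems and JVM configurations.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Shows the difference between an implicit conversion (depends on whichever
// JVM the code happens to run on) and an explicit one (same everywhere).
public class DefaultCharsetDemo {
    public static void main(String[] args) {
        System.out.println("JVM default charset: " + Charset.defaultCharset());

        String text = "naïve café";

        byte[] platformBytes = text.getBytes();                       // implicit, environment-dependent
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);     // explicit, portable

        System.out.println("Implicit length: " + platformBytes.length);
        System.out.println("UTF-8 length:    " + utf8Bytes.length);
    }
}
```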
Database drivers. The layer of software directly responsible for communicating with a database is called the database driver. This module is responsible for converting text data, as presented by the application, into a form the database can actually use. It also converts in the other direction when records are retrieved from the database. The supported conversions may be limited, and again, the available support may not always be correct.
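The following JDBC sketch shows where that conversion happens. The URL parameters shown (useUnicode, characterEncoding) are specific to MySQL's Connector/J driver and are given purely as an example; other drivers expose different, sometimes undocumented, settings. The table, column and credentials are hypothetical.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

// Inserts a non-ASCII string; the driver, not the application, converts the
// Java string into the byte encoding the database expects. That conversion
// is where text can be silently corrupted.
public class DriverEncodingSketch {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:mysql://localhost/support"
                   + "?useUnicode=true&characterEncoding=UTF-8";   // driver-specific example

        try (Connection conn = DriverManager.getConnection(url, "app", "secret");
             PreparedStatement stmt =
                     conn.prepareStatement("INSERT INTO messages (body) VALUES (?)")) {
            stmt.setString(1, "Grüße aus Zürich");
            stmt.executeUpdate();
        }
    }
}
```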
Database character-encoding capabilities. Independent of the database driver, the set of character encodings that can be stored in a database is limited.
Client-side browser. Different browsers support locale-related display and formatting differently, if at all.
Client-side fonts. Independent of the browser's algorithms for displaying text, a font may or may not be available for a given character and encoding.
Client and server operating systems. The locale-related services provided by the operating systems on which all of the above modules must run are far from standardised. There are usually many more issues to be discovered in modules close to the operating system.
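A small illustration of that operating-system dependence: code that formats data without naming a locale inherits whatever default the host environment provides, so the same build behaves differently on differently configured client and server machines.

```java
import java.text.DateFormat;
import java.text.NumberFormat;
import java.util.Date;
import java.util.Locale;

// Compares formatting with the inherited default locale against an
// explicitly named one.
public class DefaultLocaleDemo {
    public static void main(String[] args) {
        System.out.println("Inherited default locale: " + Locale.getDefault());

        // Implicitly uses the host's default locale.
        System.out.println(NumberFormat.getInstance().format(1234567.89));
        System.out.println(DateFormat.getDateInstance(DateFormat.LONG).format(new Date()));

        // Explicit locale: same output everywhere.
        System.out.println(NumberFormat.getInstance(Locale.GERMANY).format(1234567.89));
    }
}
```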
Each of these modules is certain to have issues that affect whether text passing through it will behave as expected. In some cases the specifications are ambiguous. In some cases the implementation may be incorrect. In others, data may be left in a format that is difficult to use. And other modules in the system may simply be out-of-date versions.
This is just the description for plain-text data. Similar analyses need to be done for all of the types of locale-sensitive data you may have identified in your system, and there will surely be many of them.
Without understanding in detail how the code your developers are responsible for interacts with and takes advantage of the services in the modules you cannot change, it is impossible to say what the level of internationalisation in your project is.
Recommendations
For each type of locale-sensitive data in your system, create a detailed dataflow diagram covering every module that the data passes through. Draw each module as a node on a graph, and connect the nodes with all input and output data flows.
Label each connection with an arrow indicating the direction, and with information about the state of the data at that point, such as the encoding scheme for text. Label each node with the module name and the type of transformation (or "NONE") that takes place there.
If more than one transformation takes place in a node, split the node and add new connections until each node contains only a single transformation.
System analysts among the readers will recognise the result as a dataflow diagram in the Yourdon tradition, except that this time, instead of using it to design a new system, we are using it to understand an existing one. This type of understanding of the system's architecture pays big dividends in identifying where i18n development effort needs to occur, where the trouble spots are, and how complete the i18n effort is.
A spreadsheet or other tracking system can be set up with an entry for each element (node, connector) on each dataflow diagram. Bug reports, project schedules and so on can then be tracked to a specific point in the chart. Over time, this data can be analysed to see where the tricky spots in the code or the overall data flow turned out to be. Finally, once this analysis has taken place, it can be used to predict trouble in the localisation process itself.
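A bare-bones data model for that kind of tracking might look like the sketch below. All class names, field names and the bug reference are invented for illustration; a spreadsheet serves the same purpose.

```java
import java.util.ArrayList;
import java.util.List;

// Each node and connector in the dataflow diagram becomes a record that
// bug reports and schedule items can be pinned to.
public class DataflowInventory {
    static class Node {
        String module;          // e.g. "POP server", "Database driver"
        String transformation;  // e.g. "ISO-8859-1 -> UTF-8", or "NONE"
        Node(String module, String transformation) {
            this.module = module;
            this.transformation = transformation;
        }
    }

    static class Connector {
        Node from;
        Node to;
        String encodingInFlight;               // state of the text on this hop
        List<String> openIssues = new ArrayList<>();
        Connector(Node from, Node to, String encodingInFlight) {
            this.from = from;
            this.to = to;
            this.encodingInFlight = encodingInFlight;
        }
    }

    public static void main(String[] args) {
        Node pop = new Node("POP server", "NONE");
        Node parser = new Node("Message parser", "declared charset -> UTF-8");
        Connector hop = new Connector(pop, parser, "as received (charset per message)");
        hop.openIssues.add("BUG-142 (hypothetical): charset parameter ignored on folded headers");

        System.out.println(hop.from.module + " -> " + hop.to.module
                + " [" + hop.encodingInFlight + "], open issues: " + hop.openIssues.size());
    }
}
```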
Barry Caplan has more than 10 years' experience analysing complex software systems for internationalisation issues and building cross-functional global software teams that deliver. He can be reached at for help with your i18n and l10n process issues. Be sure to register at his new web publication for "News/Tools/Process for Global Software".