Software localization

The educational technology and digital learning wiki
Jump to navigation Jump to search

Draft

This article or section is currently under construction

In principle, someone is working on it and there should be a better version in a not so distant future.
If you want to modify this page, please discuss it with the person working on it (see the "history")

<pageby nominor="false" comments="false"/>

Introduction: What do do mean by localization ?

Software localization (or localisation) means translation of a software interface and messages to another language plus adaptation of some formats (e.g. measures, dates and currency) plus adaptation to local cultures.

Software localization is a complex process. According to [ Susan Armstrong], (simple) translation implies terminology research, editing, proofreading and page layout. Localisation tends to involve additional activities, such as: Multilingual project management, Advice on translation strategy, Conversion of file formats in which material is held, Alignment and maintenance of translation memories, Software and online help engineering and testing, Multilingual product support.

What is software internationalization, localization, globalization ?

In a first step, software should be internationalized (I18N), i.e. desiged in a way that it can be adapted to various languages and regions (not just one) without engineering changes to the programming logic. In principle, it is stuff that has to be done once. But sometimes I18N is done in stages, e.g. often developers forget that some languages are more verbose than others (need more space) or work in other directions (e.g. right-to-left).

In a second step, software has to localized, i.e. translated and adapted to local cultures. Localization (L10N) means adaption of a product following the needs of a particular population in a precise geographic region. Needs include linguistic, cultural and ergonomic aspects. (Le grand dictionnaire terminologique). McKethan and White (2005) define localization as “the process of adapting an internationalized product to a specific language, script, cultural, and coded character set environment. In localization, the same semantics are preserved while the syntax may be changed.” The authors further argue that “Localization goes beyond mere translation. The user must be able to not only select the desired language, but other local conventions as well. For instance, one can select German as a language, but also Switzerland as the specific locale of German. Locale allows for national or locale-specific variations on the usage of format, currency, spellchecker, punctuation, etc., all within the single German language area.”

Gregory M. Shreve (retrieved 16:27, 21 January 2010 (UTC)) adds adaptation of "non-textual materials" like colors, logos and pictures. “Localization is the process of preparing locale-specific versions of a product and consists of the translation of textual material into the language and textual conventions of the target locale and the adaptation of non-textual materials and delivery mechanisms to take into account the cultural requirements of that locale.”

The W3C Internationalzation web site also argues that

localization is often a substantially more complex issue. It can entail customization related to:

  1. Numeric, date and time formats
  2. Use of currency
  3. Keyboard usage
  4. Collation and sorting
  5. Symbols, icons and colors
  6. Text and graphics containing references to objects, actions or ideas which, in a given culture, may be subject to misinterpretation or viewed as insensitive.
  7. Varying legal requirements
  8. and many more things.
Localization may even necessitate a comprehensive rethinking of logic, visual design, or presentation if the way of doing business (eg., accounting) or the accepted paradigm for learning (eg., focus on individual vs. group) in a given locale differs substantially from the originating culture.

Usually, language localization extends to national subcultures. E.g. German would decline in De-de (German for Germany) and other versions for Switzerland (de-ch) and Austria (de-at), etc. National versions of a language include different words, different spellings (like "localization" vs. "localisation"), and sometimes different grammar. Conversely, a multi-lingual country like Switzerland would have German Swiss (de-ch), French Suiss (fr-ch) and Italian-Swiss (it-ch). These country-specific localizations have in common the way date/time, decimals and currency is represented. Software translators may adapt the following strategies:

  • Create a generic translation for one language, e.g. fr, and then add specific local variants like fr_fr, fr_be on top of it.
  • Create only one translation, e.g. fr_fr, and have users cope with it. Data/time and decimal/currency representation differences should be handled though. Most often, this is the case in free open source software.
  • Sometimes, localizations only exist for major sub-cultures, e.g. a de-de and de-ch version. Other german speaking cultures like de-at (too close to German) or de-lu and de-li (too small) are not covered.

Finally, the term Software globalization (G11N), also known as National Language Support, refers to the combination of software internationalization (I18N) and localization (L10N).

Abbreviations

Let's recapitulate and explain the abbreviations for internationalization, localization and globalization. The "funny" kind of abbreviations used in the localization community are so-called numeronyms as we shall explain for the I18N.

  • Internationalization, known as I18N. The 18 stands for the number of letters between the first i and last n in internationalization.
  • Localization, known as l10n (or L10N) , is composed of the l of localization, followed by 10 letters (ocalizatio) and the final n of localization.
  • Software globalization, known as G11N refers to: I18N + L10N
  • Finally, GILT stands for Globalization, Internationalization, Localization and Translation, i.e. all matters related to localization.

Localization technology

Most localization technology used is so-called Computer-assisted translation software. In addition, localization needs specific tools to deal with editing (extracting, managing, mergin) language strings in the computer code. Furthermore software localization must integrate various media, the software itself, in-context help, help systems, manuals, tutorials etc.

Issues

In this sections we shall explore some important issues. We shall stress the importance of ergonomic aspects as the #1 priority of localization. By ergonomic aspects we mean both "surface usability" (users can understand the meaning of UI interface elements and system messages) and cognitive ergonomics (user can get meaningful tasks done with the system).

Ergonomy considerations apply to both the products (software, manuals), and to the programmer and translator environments (tools, community, etc.)

Infrastructure and people

In a larger project, the list of types of participants can be quite long: E.g. Gregory M. Shreve identifies: Project managers, Translators (Generic), Localization Translators (Specialists), Terminologists, Internationalization/Localization Engineers (Software Background), Proofreaders, QA specialists, Testing engineers, Multilingual Desktop publishing specialists. These job descriptions implicitly define the kinds of tasks that are part of the localization process.

In a volonteer-based open source project we might see the following roles (that my be cumulated by the same person)

  • one person to coordinate software development and interface to the translation effort
  • one person to coach and coordinate software string translators and to interface with developers
  • one person to coach and coordinate manual and tutorial writing, including a glossary
  • volonteer translators for various translation tasks
  • volonteer users that provide feedback about the UI ergonomics, spelling, in-context help, etc.

Open source projects don't have the funding to pay professional translators. This situation has disadvantages but also some advantages.

Disadvantages: Quality of the translation, completion (untranslated strings for new versions, missing languages, etc.)
Advantages: Meaninfulness of the translation. Often translators are users, i.e. they have know how of the tool which "normal affordable" translators do not have. According to LISA, a second interesting fact is that “the natural tendency of the localizer is to magnify the difference between [two expressions in the source language that refer to the same thing, just to be on the safe side. The result is that localizers tend to find the problems in a product that were not previously noted, problems that may well have confused users in the source language as well.”

I (16:27, 21 January 2010 (UTC)) believe that there ought to be some strategies to improve volonteer translation efforts and to profit from emerging insights. The main issues might be:

(1) Motivating people to helping to translate and to continue helping

(2) Make sure that translation is usable by providing a decent enough translation support environment

(3) Make sure to profit from translator's domain knowledge, plus inconsistencies he may detect in the original source language interface and messages.

Localization targets

I18N and L10N should be thought of in terms of the whole system. A software product may include:

User documentation (includes several genres and available through several media, e.g. within the software itself, HTML, paper/PDF)
Manuals
Short contextual help
Glossary
Tutorials
....
Software (user)
Menus and Icons (visual command language)
Command languages (sometimes)
Messages (various output)
....
Software documentation
Documentation of language constants (and other useful elements)
Developer manuals (maybe)

Since translation work is often split up between many volonteers with domain knowledge, there ought to be some potential for synergy that FOSS project managers and developers might try to exploit and develop. E.g. people writing the tutorials also should make comments about the meaningfulness of system messages as well as other usability issues, and not just try to adapt the user to the system. A good focus object would be a multi-lingual glossary/thesaurus, i.e. a coordinated terminology project.

Terminology management / Glossary

Terminology management refers to the fact that critical and specialized terms must be clearly defined in the source language and be linked to target languages. With respect to software localization, this should be done independently of translation of UI elements and system messages.

The Localization Industry Standards Association (LISA) defines terminology management as follows: “Quality translation relies on the correct use of specialized terms. It improves reader understanding and reduces the time and costs associated with translation. Special terminology management systems store terms and their translations, so that terms can be translated consistently. Full-featured systems go beyond simple term lookup, however, to contain information about terms, such as part of speech, alternate terms and synonyms, product line information, and usage notes. They are generally integrated with translation memory systems and word processors to improve translator productivity.” (retrieved 15:51, 27 January 2010 (UTC)).

Silvia Pavel defines terminological data as “Any information relating to a term or the concept designated. Note: The more common terminological data include the entry term, equivalents, alternate terms, definitions, contexts, sources, usage labels, and subject fields.” This can lead to rather large records, e.g. have a look at the term Apprentissage/Learning in the Termium Plus database.

Terminology management is not a simple formal question of agreeing on some ways of translating words. It's related to fundamental aspects of language use in the sense of language as action and that requires common grounding. (Clark, 1996).

See the example section below for some online software that can help establish common lexical items, including ways to support process of meaning negotiation between various actors.

On the code side

Encoding
  • Nowadys, projects systematically should use some form of Unicode, e.g. UTF-8.
Space and Layout
  • Space of text fields: Some languages are more verbose and one must plan for that, by using wider icons, menu items, user input fields and such. A better solution is to use a "fluid" design, i.e. the size of boxes is adapted to text it contains.
Language messages

All output messages to the user must be defined as a kind of named constant that the programmers will use. Alternatively, it can be decided that source langage strings (plus a documentation string nearby in the code) are the constants. The name of the constant should be meaningful and available to translators (since at some point here must a common way to talk about a precise translation item). Languages files should be separate for the translators (e.g. in a textfile, a localization format, or through an online database).

The most difficult issues related to language messages have to with the fact that parts of the message will be generated at runtime. In other words, some messages can take parameters and therefore must include placeholders for these. E.g. in the Mediawiki project messages can handle plural, gender and special grammatical transformations for agglutinative languages like Finiish, i.e. depending on how a word is used a suffix may added, plus the base can be modified.

Reviewing / Evaluation

Localized content should be tested in target. In other words: Using the software for real or almost real by real target persons does provide good feedback. If this is not an option, the reviewer must be able to "imagine" a real situation, i.e. receive enough contextual information.

In addition, the reviewer has to develop a list of criteria for evaluation (a quality model). Criteria may emerge from testing the software in a real situation with real tasks.

In an open-source project like LAMS that allows for frequent updating one could ask volonteer teachers to provide feedback and to ask their learners to do the same (at least write down meaningless/difficult to understand) UI elements and system messages. If this is not possible, translators should at least play all the roles: Administer the system, design both a complex and a simple sequence, use it as teacher (including the monitoring) tool, and experience it as learner. Another option could be "problem/suggestions" button to be installed optionally by the administrator in the interfaces and that would allows user to complain and make suggestions in an easy, quick and non-obtrusive way. This idea needs to be fleshed out.

Examples of software that collects user information

  • The "report Broken Web Site" and "Report Web Forgery" buttons in Firefox (under the Help menu)
  • The dialog you get once MS Help dialogs end up without solutions, e.g. three questions: (1) was it useful, (2) why ? and (3) additional information. Now I would like to know what kind of volume / 1000 users this generates and how it is being used.
  • Some google help files have a similar interface. (1) Was this information helpful, followed by (2) Easy to understand? (3) Complete with enough details ?
  • Google translate. E.g. learning activity management system translated to french "système d'apprentissage en gestion de l'activité" which is totally wrong. Users can contribute a better translation. In addition they could activate the translators toolkit, which includes glossary management, chat, etc.

Computer-assisted translation software

Today exists a rather large set of toolboxes that aim to increase the quality and the efficiency of the translation processs. Some of these tools already have been introduced by example. Based on Wikipedia's Computer-assisted translation entry, we may distinguis the following types of tools. Some of these may be integrated into a single software, i.e a computer-assisted translation (CAT) tool.

(1) Spell checkers and grammar checkers:

Both can be integrated into wordprocessors, online CMSs, programming editors or work as add-on programs

(2) Terminology managers:

This includes a whole family of tools, ranging from simple tables (entry + description) to complex glossary and thesaurus managagers. Some of these exist online. A good examples is the Open Termonology Forum and Google translation tool, both discussed in this article.

Commercial programs include: LogiTerm, MultiTerm, Termex

Open source programs include:

Open source online programs:

  • The Wikionary (probably not available "as is", but one could install a Mediawiki and then retrieve all the templates)

(3) Terminology databases

These include publicly available databases. Good examples:

(4) Full text indexers

I.e. indexing search engines that allow to formulate complex queries. Instead of searching all the Internet they will look up translation memories (see below).

(5) Concordancers

“retrieve instances of a word or an expression and their respective context in a monolingual, bilingual or multiligual corpus, such as a bitext or a translation memory.” (wikipedia)

(6) Bi-texts

A merge of source text and its translation

(7) Translation memory managers (TMM)

TMM relates to the idea that there is a database with text segments in the source language and their translations in one or more target languages. A text segment corresponds to sentences or equivalent (e.g. titles or bullet items).

In fact a TMM is specialized text editor that displays segments to be translated in one column and possible translations in another one. The translator then can reuse old translations and adapt it to the new fragment (or if there is no match, create a totally new translation.

Technical infrastructure for sofware strings translation

Translators should "see" what they translate. This concerns several items. When translating a string, the translator should be able to see:

  • The name of the constant (which must be meaningful, e.g. "modulename.mainmenu.edit" or "modulename.errormsg.upload.xxx")
  • A meanigful short description. This description may include a link to a glossary.
  • All other translations (languages strings) the translator understands (e.g. if I translate to German, I'd like see both English and French)
  • (If possible) the constant displayed in the interface. That of course requires extra programming. Or even better: be able to edit strings directly on the interface

Tools for consistency: When translating hundreds of strings (and the situation gets worse if it's done by several people) there should be a way to search through all terms in all modules in three ways:

  • Find same expressions in the target language and display an other language next to it
  • Find same expressions in another language and display the target strings next to it
  • Be able to edit and consult a short glossary that includes the most important terms (might be combined with the general user manual)
  • (dreaming) direct access to some online translation dictionary like the english/frenchgrand dictionnaire terminologique

See also the terminology management / glossary above

Examples of terminology and translation tools used in FOSS projects

In this section, we will look at various examples that raised our interest.

The open terminology forum

The Open Terminology forum is a freely-accessible terminology database whose content is supplied and validated by the members of a non-profit organization. According to What is it? page, “anyone who can show that they have specialized knowledge can become a member of the Forum and participate in their area of expertise, whether they are a professional, a translator, a journalist or an academic. Members can add terminology information, comment on (but not change) that contributed by others, make personal lists of selections of terms and create their personal descriptions with their own HTML links.” , retrieved 15:51, 27 January 2010 (UTC). “Each area of specialization has its own terms. The Forum is a place where these terms can be recorded, refined, discussed and validated by those concerned. As every item of information shown is personally attributed to the contributor, members can build up their reputations”

The main window shows terms in a three column layout. Terms can searched with wildcards (* and ?). A user can then click on each of the "meaning", "English term" and "French equivalent" in order to further explore or to edit the context (with permission)

(1) Click on meaning, then Info will provide contextual information. E.g. the English term "classroom management" is not translated (should become "gestion de classe") is associated with the rather useless "for a lesson" meaning, but clicking on "for a lesson" will show a good definition plus a link.

(2) Click on a term will allow to add variants, occurences, synonyms, translations or a comment.

(3)Click on help will show agreement among experts and some other information.

Wiktionary

“Designed as the lexical companion to Wikipedia, the encyclopaedia project, Wiktionary has grown beyond a standard dictionary and now includes a thesaurus, a rhyme guide, phrase books, language statistics and extensive appendices. We aim to include not only the definition of a word, but also enough information to really understand it. Thus etymologies, pronunciations, sample quotations, synonyms, antonyms and translations are included.” (Main Page, retrieved 18:07, 29 January 2010 (UTC)). Advanced users may customize Wiktionary preferences. Entries can be search or looked up through various indexes.

The structure of Wiktionary entries (pages) seems to be a moving target, i.e. together with emerging needs complexity is added or page structure is modified. Also, various language variants don't look the same. I.e. the french version includes the use of icons in from of main headings (much prettier ...). Finally, exploitation of various possible headers and templates may be different in various culture.

Let's have a look at the Thesaurus discussion. The English dictionary version includes a separate thesaurus called Wikisaurus. The Germans on the other hand, do also have a small thesaurus, but globally speaking one can observe that their wiktionary seems to be more systematic, i.e. entries do include more thesaurus-like features. Here is small table we made from looking at a few entries (i.e. we don't claim it to be accurate):

In all Wiktionaries we can find:

  • Synonymy – same or similar meaning
  • Troponymy – manner, such as trim to cut

In the German Wiktionary we get in addition:

  • Antonymy – opposite meaning
  • Hyponymy – narrower meaning, subclass (Unterbegriffe)
  • Hypernymy – broader meaning, superclass (Oberbegriffe)

In none we found (although it can be covered somewhat by super/subclasses)

  • Meronymy – part, such as wheel of a car
  • Holonymy – whole, such as car of a wheel

We may argue that the German strategy seems to see the dictionary itself as a kind of thesaurus, whereas in the English world the two have been split, although not all participants did think that this was a good strategry. Daniel K. Schneider also prefers the German "inclusive" strategy.

Let's have a look at Internet, top of english, french and german pages (Jan 2010):

Wikionary entries are normal Mediawiki pages, but follow structuring conventions, e.g. concerning the heading structure. In addition, users must understand how to use and to find (!) templates. A Mediawiki template can be defined as some kind of extension language that allows for shortcuts, complex formatting, transclusions, etc. More precisely, Wiktionary's Index to templates provides the following definition: “Templates in Wiktionary are a convenient and consistent way of including a snippet of text in various, similar articles. Within the text of an entry, enclosing a template name inside the double curly braces {{xxx}} will instruct Wiktionary to insert right there in the entry when it is viewed the contents of the page [[Template:xxx]].”. In addition, there exist shortcuts which are a specialized types of a redirection page that can be used to get to a commonly used Wiktionary reference page more quickly. Fairly complex languages to learn that makes Daniel K. Schneider sometimes wonder whether Wikipedia and its sister projects are not the world viewed by IT-savy people...

Wiktionary pages may include translations if the spelling of the word is the same. E.g. the localisation page has both an English and a french entry. But usually, a page will refer to entries of other languages through a special template, which allows to transclude these into the page.

A minimal page typically has the following structure:

== English ==

===Noun===

=== References ===  	 

The structure of a more complete entry can be illustrated by the phrase word. Its table of contents looks like this.

1. English
    1.1 Pronunciation
    1.2 Etymology
    1.3 Noun
        1.3.1 Synonyms
        1.3.2 Derived terms
        1.3.3 See also
        1.3.4 Translations
    1.4 Verb
        1.4.1 Derived terms
        1.4.2 Related terms
        1.4.3 Translations
    1.5 External links
    1.6 Anagrams
2. French
    2.1 Pronunciation
    2.2 Noun
    2.3 Anagrams

The french section is included because the word "phrase" does exist in french and means "sentence". The translation sections (one for the noun and one for the verb) includes definitions from many languages by transclusion.

Let's now examine the wiki code of an other medium-sized entry in more detail. The source wiki code of localization entry looks like this:

Screenshot of http://en.wiktionary.org/wiki/localization (upper part / jan 2010
{{wikipedia}}
{{also|localisation}}
==English==

===Etymology===
From [[localize]] + [[-ation]];
compare French ''[[localisation#French|localisation]]''

===Alternative spellings===
* [[localisation]] (''British'')

===Noun===
{{en-noun}}

# The [[act]] of [[localize|localizing]].
# The [[state]] of being [[localized]].

====Derived terms====
* [[l10n]], [[L10n]]
* [[cerebral localization]]
* [[chromosomal localization]]

====Related terms====
* [[locale]]
* [[localism]]
* [[locality]]
* [[locate]]

====Translations====

{{trans-top|act of localizing}}
* Chinese:
:* Mandrin: {{zh-tsp||地方化|difang-hua}}
* Dutch: {{t+|nl|lokalisatie|f}}
* Finnish: {{t+|fi|lokalisoida}}, {{t+|fi|kotoistaa}}
{{trans-mid}}
[ .... contents removed .... ]
* Norwegian: {{t-|no|lokalisering|f}}
* Russian: {{t+|ru|локализация|sc=Cyrl}}
{{trans-bottom}}

{{trans-top|state of being localized}}
* Chinese:
:* Mandrin: {{zh-ts||本地化}}
* Dutch: {{t+|nl|lokalisatie|f}}
{{trans-mid}}
* Norwegian: {{t-|no|lokalisering|f}}
* Russian: [[локализованность]]
{{trans-bottom}}

{{trans-top|software engineering:
act or process of making a product suitable for use
in a particular country or region}}
* Chinese:
:* Mandrin: {{zh-ts||本地化}}
* Dutch: landelijke aanpassing
[ .... contents removed .... ]
* Russian: [[локализация]]
{{trans-bottom}}

{{checktrans-top}}
* {{ttbc|French}}: {{t+|fr|localisation|f}}
* {{ttbc|German}}: [[Lokalisation]]
* {{ttbc|Italian}}: [[localizzazione]]
{{trans-mid}}
(rookaraizeeshon)
* {{ttbc|Korean}}: [[지방화]]
* {{ttbc|Latin}}: [[localizatio]]
* {{ttbc|Portuguese}}: [[localização]]
* {{ttbc|Spanish}}: [[localización]]
* {{ttbc|Turkish}}: [[Bölge]]
{{trans-bottom}}

====See also====
* [[internationalization]] / [[i18n]]

As the reader may notice, there are many templates ({{....}}). An author may just create minimal entries or fix existing entries. Advanced authors, as we mentionned above, may choose from probably hundreds of different templates.

Let's now look at tools for discussion and user feedback.

Feedback box in the English Wiktionary (Jan 2010)

In the English version and some but not all other language version (Jan 2010), users may leave a feedback. Users can first signal a problem, i.e. choose among:

  • Good
  • Bad
  • Messy
  • Mistake in definition
  • Confusing
  • Could not find the word I want
  • Incomplete
  • Entry has inaccurate information
  • Definition is too complicated

Optionally, a user then may leave a note. It probably will append it to the discussion page. Each mediawiki page has an associated discussion page (including one for this page you are reading).

Discussion does sometimes happen on wiktionary.org. A nice and funny example is the e-dictionary's discussion page (about 10 times as long as the entry ...)

We believe that Wiktionary is an interesting project and for several reasons:

  • The flexibilty of the Mediawiki infrastructure allows to cope with emerging views about what a dictionary entry should be. Indeed, from a simple dictionary it evolved into a complex one that inlcudes etymologies, pronunciations, sample quotations, synonyms, antonyms and translations (and more related kind of terms in the German version).
  • Wiktionary also integrates well with other Wikimedia sister projects like Wikipedia or Wikiversity.

Remark on the history: Most Wikitionary entries, as you can read in the Wiktionary entry of Wikipedia were initially created by bots: “most of the entries and many of the definitions at the project's largest language editions were created by bots that found creative ways to generate entries or (rarely) automatically imported thousands of entries from previously published dictionaries. Seven of the 18 bots registered at the English Wiktionary[4] created 163,000 of the entries there.[5] Only 259 entries remain (each containing many definitions) on Wiktionary from the original import by Websterbot from public domain sources; the majority of those imports have been split out to thousands of proper entries manually.”

Translatewiki.net

“Translatewiki.net is a localisation platform for translation communities, language communities, and open source projects. It started out with localisation for MediaWiki. Later support for MediaWiki extensions, FreeCol and other open source projects was added.” (Translate Wiki, retrieved 14:16, 4 February 2010 (UTC)).


Localization of MediaWiki messages is not easy as a look at the guidelines demonstrates, since there exist various kinds of messages. They may include various kinds of parameters, simple switches (singular/plural, gender) and even some grammatical transformations needed for some languages.

The Google translator toolkit

According to the Google translator toolkit basics page, the kit can do the following:

  • Upload Word documents, OpenOffice, RTF, HTML, text, Wikipedia articles and knols.
  • Use previous human translations and machine translation to 'pretranslate' your uploaded documents.
  • Use our simple WYSIWYG editor to improve the pretranslation.
  • Invite others (by email) to edit or view your translations.
  • Edit documents online with whomever you choose.
  • Download documents to your desktop in their native formats --- Word, OpenOffice, RTF or HTML.
  • Publish your Wikipedia and knol translations back to Wikipedia or Knol.

When uploading you can/must:(as of 1/2010)

  • choose among the four supported formats
  • Use translation memory (global or local)
  • Use glossary (none, one of yours)

Once a text is uploaded, it stays in Google memory (at least for a while) and users can then do a side by side translation in the supported formats. Too bad that Google does not suport Mediawiki format without passing through Wikipedia.

Glossaries

Let's look at Google's Translation memories, glossaries, and placeholders. Part of this kit includes the possibility to upload glossaries and to save TMX files. A glossary is a CSV table with the following format: A arbitrary number of columns with locales (at least) one followed by an optional part of speech (pos) and description columns.

Google custom glossary format
en-AU fr gsw locale4 .... pos description
lams mouton lämmli noun Either a software system or variant of sheep. Do not use as a verb
gate verrou schperri noun An element in pedagogical sequencing that will filter access to the next element in various ways. (... more needed here ...)

Since CSV uses commas to separate items, Commas in a cell must be escaped by double-quotes ("). For example, to add the term hello, world!, use "hello, world!". Quotes within quotes are escaped by \. For example, She said, "hello." should be entered as "She said, \"hello.\"". Such glossaries can be made with a spreadsheet (e.g. the google docs) and then be exported as CSV and then imported to the translation tool.

Translation memory

“A translation memory (TM) is a database of human translations. As you translate new sentences, we automatically search all available translation memories for previous translations similar to your new sentence. If such sentences exist, we rank and then show them to you. Comparing your translation to previous human translations improves consistency and saves you time: you can reuse previous translations or adjust them to create new, more contextually appropriate translations. When you finish translating documents in Google Translator Toolkit, we save your translations to a translation memory so you or other translators can avoid duplicating work.” (Using translation memories, retrieved 15:51, 27 January 2010 (UTC)).

Mozilla Projects / Firefox

Let's examine a few features of the Mozilla L10N strategy.

In the Mozilla project, localization strings are managed through XUL. The following fragment defines two strings to be displayed as so-called tokens

 <caption label="&<b class="token">identityTitle.label</b>;"/>
 <description>&<b class="token">identityDesc.label</b>;</description>

identityTitle.label and identityDesc.label will be substituted by a strings defined in a DTD as entities.

<!ENTITY <b class="token">identityTitle.label</b> "Identity">
<!ENTITY <b class="token">identityDesc.label</b> "Each account has an identity, which is the ↵
↳ information that other people see when they read your messages.">

("↵ ↳" indicates a single line, broken for readability) If you want to find more example constants defined as XML entities, locate the firefox installation directory on your computer and examine chrome/ab-CD.jar file, e.g. en-US.jar.

The L10N tools include

  • Use of a text editor that can handle UTF-8 files
  • A langpack2cvstree.sh script that converts the en-US language package into another locale.
  • A command line/web tool, called compare locales: finds missing and obsolate strings in a localization
  • Example: short text that describes the tool
  • MozillaBuild: An easy way to install everything you need to checkout/pull and checkin/push your localization and run compare-locales on Windows.
  • Mozilla Translator is a tool to help translate programs
  • Narro, is a web application that allows online translation and coordination. You can see in operation at l10n.mozilla.org
  • Translate Toolkit (moz2po and po2moz): converts various sorts of Mozilla files to Gettext PO format for translation efforts using a PO editor and the other way round. It's used be Pootle for example (see below).
  • MozLCDB similiar to PO but more dedicated to Mozilla products
  • Pootle a web server for localisation that allows web-based contributions and management. Combined with the Translate Toolkit it allows Mozilla products to be localised online.
  • Virtaal is an off-line PO editor developed by the Pootle team.

We wonder a bit who uses which tools. If we understand the situation right, there are several ways to translate as long as the translation does find its way back into the CVS at some point.

Also, it is not suprising, that the project makes a clear distinction between "official releases" and others. Official releases include translation of the installation and migration process, localizing the start page and other web pages built into the product, customizing settings like "live bookmarks", locally relevant search engine plugins, and more.

Now I wonder if translation learning management systems into different languages could imply using a different pedagogical vocabulary. E.g. "french didactics" vs. "belgian instructional design" vs. "canadian constructivism" (without any English word) vs. Swiss "let's have a bit of all".

LAMS

LAMS is a system for authoring and delivering learning activities, i.e. a learning design and learning activity management software. Since I made a little commitment to help with translation, I also shall insert some comments regarding features and UI elements that we could make better - Daniel K. Schneider 14:33, 22 January 2010 (UTC).

This open source project developed an original online solution to support translation. Software translators will use two tools:

  • An internationalization site (see below)
  • A LAMS Translators server (that is upgrade after a translation effort so that a translator can check the look and feel on a live system and make sure that there are no conceptual errors)
Editing translation strings

Strings to translate are called labels. A string may include slots to filled in with data and use the syntax {0}, {1}, etc.

The LAMS Internationalization site presents a list of available software modules and within each module, completion is shown as the following picture shows.

Lams translation server screendump (retrieved 1/2010 from Translating LAMS (Note: One ought be able to list just the modules for one language).

Translators can display a module, each label is displayed as a list of

English String, a french translation, update date, translator
Display of module labels

The translator then can choose between:

  • editing an isolated label
  • bulk translate all labels (date and author information of all tags will be overwritten)
  • translate only missing labels

Below are a few screenshots that illustrate the principle

Editing a single label (Note: We might add the Java property label since its a unique identfier)
Bulk translating labels
Homogenization

For the moment, there is no support for language-specific glossary and terminology management.

Good translators may develop several strategies to make sure that a common terminology is used.

Terminology must be homogeneous, e.g. the English word "Cancel" should normally translate to the same french word, e.g. "annuller" (v.s. "abandonner"). One also could argue that translators sometimes have to use less meaningful words, because at some point MS made certain decisions and people are used to it.

"Non-natural" terminology like "branching" or "gates" raise additional challenges. In french I translated this to "verrou" after looking at the German "Sperre". For various reasons I don't like the obvious translation of "Gate" to "Portail" (meaning "Portal"). Verrou means "lock" but also could mean a barrier in some geological sense. Anyhow, if several translators participate or if one translator translates once in a while, there is a very high chance that very different words are being used to deal with the same object. Also in my case, only once I decided to use LAMS for real, I stumbled upon terminology (created by myself) and that I really found unclear.

Fortunately in LAMS (and this is a mission-critical feature), translators can search for labels in any language (Note: Not true actually only in English and the target language) and then see the result in a similar fashion as in the bulk translate window.

Result of a label search
Result of a label search in English
Result of a label search in Target language (french)

This feature allows to see (1) where certain words of the target language are being use (e.g. check if the same word is used for different things, which may or may not be a proble) and (2) more importantly, how a given english word or phrase has been translated.

Since the result displayed is an editable bulk translation window, re-translation is pretty fast.

Testing

Testing is basically done by playing with a translator's test server. Translators have full rights, e.g. can explore the admin interface, create new teacher and student users and play with these. When the project manager sees that new important translations have been comitted he will update the test server, in particular the weeks before a software upgrade.

Under the hood

($$$ addition needed)

  • Labels are Java properties and my include slots for arguments, e.g.

The {0} cannot be within the range of an existing condition.

Formats

In this section we shall have a look at some more populat formats used in Localization.

Lowlevel formats

There exist various methods to handle language strings in computer programs.

Most often, some sort of constants are being used in the code. These constants are then defined either in so-called languages files or in a database. Let's have a look at an examples now. In a mediawiki (the software used for this wiki), default translations of language strings sit in files that are imported to a database when the software is first installed. The format of these constants may be quite different. In the Wikipedia project, constants are defined as keys in associated arrays. The most important one is $messages and includes 2000-3000 strings.

$messages = array(
 ... 
 'pagecategories'     => '{{PLURAL:$1|Category|Categories}}'
 'about'              => 'About',
 
 ...
);

These strings can be very short, e.g. the 'about' key (constant) tranlates to 'About'. However one also finds messages that extend to half a screen, and that look like this (only a portion is shown):

$messages = array(
  .....
  'blockedtext'   => "<big>'''Your user name or IP address has been blocked.'''</big>

The block was made by $1.
The reason given is ''$2''.

* Start of block: $8
* Expiry of block: $6
* Intended blockee: $7

You can contact $1 or another [[{{MediaWiki:Grouppage-sysop}}|administrator]] to discuss the block.
You cannot use the 'e-mail this user' feature unless a valid e-mail address is specified
 in your [[Special:Preferences|account preferences]] and you have not been blocked from using it.
 ..... (contents removed)
Your current IP address is $3, and the block ID is #$5.
Please include all above details in any queries you make.',
 .....
)

Customizing the of mediawiki messages can be done easily by editing the interface strings directly from the wiki. People can find the message they want to change/translate in the Special:Allmessages page and then edit the relevant string in the MediaWiki: namespace. Once edited, these changes are live and will remain even after an upgrade. A side effect of this fairly easy to use system is that willing translators just customize their own wiki, and language files shipped with the mediawiki software are no longer quite up-to-date.

Let's now look at standardized solutions for easing translation of software strings. Language files can be generated from more highlevel formats, such as Gettext or XLIFF. Also Gettext and XLIFF can be generated from computer code and merged back into it. In other words, an alternative solution to creating web-based interfaces as in the Mediawiki and the LAMS project, is to extract Gettext or XLIFF files, put them into a web interface or ship them to translators, and then merge them back.

However the fact that translators have to pay attention to various sorts of placeholders remains a difficulty in all cases.

Gettext

Gettext is a translation strings strategy and technology, popular in open source. is based on the idea that keys used to retrieve local language strings corresponds to the original string used in the source code. Documentation also is added as programming comment just before the corresponing line. From the source code so-called .PO files that are then use by translators.

XLIFF

  • XLIFF (XML Localization Interchange File Format) is an XML-based format created to standardize localization. I.e. XLIFF is an alternative to Gettext. XLIFF was standardized by OASIS in 2002. According to the Specification 1.2, XLIFF is the XML Localization Interchange File Format designed by a group of software providers, localization service providers, and localization tools providers. It is intended to give any software provider a single interchange file format that can be understood by any localization provider. It is loosely based on the OpenTag version 1.2 specification and borrows from the TMX 1.2 specification. However, it is different enough from either one to be its own format.
<xliff version='1.2'
       xmlns='urn:oasis:names:tc:xliff:document:1.2'>
<file original='hello.txt' source-language='en' target-language='fr'
       datatype='plaintext'>
  <body>
    <trans-unit id='hi'>
      &lt;source&gt;Hello world &lt;/source&gt;
      <target>Bonjour le monde</target>
      <alt-trans>
       <target xml:lang='es'>Hola mundo</target>
      </alt-trans>
    </trans-unit>
 </body>
</file>
</xliff>

The most important elements are <trans-unit> and <bin-unit>. They contain the translatable portions of the document. The <trans-unit> element contains the text to be translated, the translations, and other related information. The <bin-unit> contains binary data that may or may not need to be translated; it also can contain translated versions of the binary object as well as other related information.

In the trans-unit element, text to be translated is contained in a <source> element, the translated tell will be within the <target> element.

There exist convertors from PO to XLIFF and the other way round.

Translation Memory eXchange

Translation Memory eXchange (TMX) is according to Wikipedia, “an open XML standard for the exchange of translation memory data created by computer-aided translation and localization tools. TMX is developed and maintained by OSCAR[1] (Open Standards for Container/Content Allowing Re-use), a special interest group of LISA[2] (Localization Industry Standards Association). Being in existence since 1998, the format allows easier exchange of translation memory between tools and/or translators with little or no loss of critical data[3]. The current version is 1.4b - it allows for the recreation of the original source and target documents from the TMX data. TMX 2.0 was released for public comment in March, 2007”, retrieved 14:43, 3 February 2010 (UTC).

The OAXAL framework

  1. Consistency in the authored source-language text
  2. Consistency in the translated text
  3. Substantial automation of the Localization process

This references model, according to the Reference Model for Open Architecture for XML Authoring and Localization 1.0, retrieved 18:07, 29 January 2010 (UTC) has two prime use cases:

  1. Support authoring and localization within a CMS environment
  2. Support consistency within a translation-only environment

To achieve these goals and handle the use cases OAXAL defined the following software stack:

The architectural model for OAXAL. Source http://wiki.oasis-open.org/oaxal/FrontPage

Let's shortly describe some of these elements:

  • ITS (Internationalization Tag Set (ITS) Version 1.0) “is a technology to easily create XML which is internationalized and can be localized effectively. On the one hand, the ITS specification identifies concepts (such as "directionality") which are important for internationalization and localization. On the other hand, the ITS specification defines implementations of these concepts (termed "ITS data categories") as a set of elements and attributes called the Internationalization Tag Set (ITS).”. This work is explained in a Best Practices for XML Internationalization working group note.

Basically ITS allows you to insert tags into an XML tree that will tell systems (and human readers) of XML what should not be translated....

Below we reproduce a simple ITS example:

<dbk:article
  xmlns:its="http://www.w3.org/2005/11/its" 
  xmlns:dbk="http://docbook.org/ns/docbook" 
  its:version="1.0" version="5.0" xml:lang="en">
 <dbk:info>
  <dbk:title>An example article</dbk:title>
  <dbk:author
    its:translate="no">
   <dbk:personname>
    <dbk:firstname>John</dbk:firstname>
    <dbk:surname>Doe</dbk:surname>
   </dbk:personname>
   <dbk:affiliation>
    <dbk:address>
     <dbk:email>foo@example.com</dbk:email>
    </dbk:address>
   </dbk:affiliation>
  </dbk:author>
 </dbk:info>
 <dbk:para>This is a short article.</dbk:para>
 </dbk:article>

OAXAL defines (fairly complicated) workflows for document life cycles. From a high-level perspective, the life-cycle for localization-only workflows (e.g. as in sofware localization) looks like this:

OAXAL Localization Only Workflow using xml:tm namespace. Source http://wiki.oasis-open.org/oaxal/FrontPage

The key step is to extract some XLIFF file with translation string that then will be used to translate and merged back.

  • Translation Memory eXchange (TMX) “is an open XML standard for the exchange of translation memory data created by computer-aided translation and localization tools. TMX is developed and maintained by OSCAR (Open Standards for Container/Content Allowing Re-use), a special interest group of LISA (Localization Industry Standards Association” (Wikipedia, retr. jan 2010).

Software

This section should include a short list of useful software. There exist other (better) indexes, e.g. the Compendium of Translation Software, a directory of commercial machine translation systems and computer-aided translation support tools, compiled by John Hutchins.

PO editors
Free translation tools
Free online translation support systems
  • Google translate. Note: There is a way to get support for mediawiki files. Upload a filed called *.mediawiki. E.g. edit wikipage and copy/paste source or get it by &action=raw URL Get parameter, etc. Do not whine if some formatting doesn't render well (each mediawiki is different). This is a non-documented feature and we guess that it may become official after some further development - Daniel K. Schneider 18:07, 29 January 2010 (UTC).
Frameworks
  • The Okapi Framework is a set of interface specifications, format definitions, components and applications that provides an environment to build interoperable tools for the different steps of the translation and localization process. (Seems to be a live project).
Links of links
Quality / Evaluation
  • Quality Model Builder - zipped file. (QMB) allows software designers and end-users to develop a hierarchical quality model to help in the evaluation of complex software.
  • ROTE - zipped file (ROTE) - is designed to help end-users rank the relative importance of their user requirements in order to guide software developers as to the priorities that should be attached to individual requirements.

open source localization guidelines

The idea behind writing this entry was to come up with some suggestions/guidelines concerning localization in volunteer-based substantial open source projects.

(under construction - 21:04, 28 January 2010 (UTC) !)

Integrate usability

Profit from the fact that translators probably will use the software and that they will stumble on surface usability and cognitive ergonomics issues. I.e. stuff that is hard to translate may be problematic in the source language to start with.

This is related to the next point (common grounding)

Establish common grounding for essential terms used

Early in the project, start a collaborative glossary/terminology project. The glossary should include all major terms used in the UI or the manuals. Such a project should have the following features for a single meaning term (for polysemic objects this must be expanded):

  • Entry term
  • Equivalents (but that should not be used in UI and the system messsages)
  • Definitions
  • Contexts (e.g. for an LMS, that would be types of pedagogical scenarios)
  • Sources
Provide a translator's toolbox for software strings

A online translation system should be created (or used) and have the following features:

  • Be able to concurrently display source + and target language (maybe two for comparison).
  • Search all terms in target or source language and then display all the terms again in triplets as above.
  • Display information strings and (if existing) constant/parameter names used by coders
  • Be able to annote each term with comments or maybe an semi-automatic link to a glossary where comments could be made (precision neeeded here). Also one ought to be able to find synonys.
  • Translated items should be clearly attributed to a single translator, including revisions if possible.
  • Support some form of translation memory, i.e. the system should provide a default translation of a string if possible.
Think about incentives

There should be some kind of incentives for the translators. One also might try to organize monetary incentives provided by localized communities (this is about OSS without resources to pay tranlators). E.g. a German fund might pay money to german translators and a French fund might pay a trip to Australia.

Have users participate

User's should be given the opportunity to participate. Windows, Google and Wiktionary do it, so can we. I.e. there should button that a user can hit for feedback. Feedback should be very simple, e.g. with radio buttons (but include the possibility to add a message.)

  • This is not clear (why)
  • This has a typo (what)
  • There is a better term (which/why)
  • This uses a different language (with respect to what)
  • etc.

The big question is where to position such a button and how it could detect what the user was talking about ...

Plan globally for all product items
  • Software localization in the narrow sense (UI and system messages)
  • In-software help (if existing)
  • Glossary (to be used by all actors, i.e. translators, developers, all category of software users, e.g. in e-learning authors, teachers, students)
  • Manuals of all sorts.

Links

Texts and web sites about software localization

Definitions
Organizations

Home Page]

How-to / tutorials
  • The Pavel Terminology Tutorial. This is substantial tutorial on all aspects regarding terminology made by the Canadian Translation bureau (make sure to explore all the menus, there are dozens of pages ...)
XLIFF
About language file formats

Glossaries and other structured data

Internationalization
  • Dotnet-culture.net provides information about date/time and decimal/currency representation.
Terminology databases
  • Le grand dictionnaire terminologique (english/french)
  • The Open Terminology Forum french, english, spanish, chinese (?). Limited to some subdomains / maintained by volonteers. This website might be of interest to some open source localization projects, either for hosting a project or to get some inspiration.
  • Termium Plus (The Government of Canada's terminology and linguistic data bank).
  • wiktionary.org (The free dictionary, a sister project of wikipedia). Includes translations.

Other

Example Projects and languages
Courses

Software links

Indexes

Bibliography

  • McKethan, Kenneth A. (Sandy)Jr. and Graciela White (2005). Demystifying Software Globalization, Translation Journal 9 (2), April 2005. HTML, retrieved 16:27, 21 January 2010 (UTC).
  • Esselink Bert (2000), A Practical Guide to Localization, , John Benjamins Publishing, ISBN 1-58811-006-0
  • Karunakar, G. and Shirvastava, R. (2007) Evolving Translations & Terminology - The Open Way, LRIL-2007 - National Seminar on Creation of Lexical Resources for Indian Language Computing and Processing at C-DAC Mumbai. PDF
  • Muegge, Uwe (2007). "Disciplining words: What you always wanted to know about terminology management". tcworld (tekom) (3): 17–19. PDF