Software localization: Difference between revisions

The educational technology and digital learning wiki
Jump to navigation Jump to search
Line 43: Line 43:
== Issues ==
== Issues ==


In this chapter we shall explorer some important issues. Since we shall focus on translation of open source software, we shall stress the importance of [[ergonomics|ergonomic]] aspects as the #1 priority of localization.  
In this chapter we shall explore some important issues. Since we shall focus on translation of open source software, we shall stress the importance of [[ergonomics|ergonomic]] aspects as the #1 priority of localization.  
By ''ergonomic aspects'' we mean both "surface [[usability]]" (users can understand the meaning of UI interface elements and system messages) '''and''' [[cognitive ergonomics]] (user can get meaningful tasks done with the system).  
By ''ergonomic aspects'' we mean both "surface [[usability]]" (users can understand the meaning of UI interface elements and system messages) '''and''' [[cognitive ergonomics]] (user can get meaningful tasks done with the system).  


Line 136: Line 136:
* [http://translate.google.com/ Google translate]. E.g.  [http://translate.google.com/#en|fr|learning%20activity%20management%20system learning activity management system] translated to french "système d'apprentissage en gestion de l'activité" which is totally wrong. Users can contribute a better translation. In addition they could activate the translators toolkit, which includes glossary management, chat, etc.
* [http://translate.google.com/ Google translate]. E.g.  [http://translate.google.com/#en|fr|learning%20activity%20management%20system learning activity management system] translated to french "système d'apprentissage en gestion de l'activité" which is totally wrong. Users can contribute a better translation. In addition they could activate the translators toolkit, which includes glossary management, chat, etc.


=== Technical infrastructure ===
=== Computer-assisted translation software ===


Translators should "see" what they translate. This implies concerns several items. When translating a string, the translator should be able to see:
Today exists a rather large set of toolboxes that aim to increase the quality and the efficiency of the translation processs. Some of these tools already have been introduced by example. Based on Wikipedia's [http://en.wikipedia.org/wiki/Computer-assisted_translation Computer-assisted translation] entry, we may distinguis the following types of tools:
 
'''(1) Spell checkers and grammar checkers''':
 
Both can be integrated into wordprocessors, online CMSs, programming editors or work as add-on programs
 
''' (2) Terminology managers''':
 
This includes a whole family of tools, ranging from simple tables (entry + description) to complex glossary and thesaurus managagers. Some of these exist online. A good examples is the [http://www.terminologie.com/ Open Termonology Forum] and Google translation tool, both discussed in this article.
 
Commercial programs include: LogiTerm, MultiTerm, Termex
 
Open source programs include:
 
Open source online programs:
* The Wikionary (probably not available "as is", but one could install a Mediawiki and then retrieve all the templates)
 
''' (3) Terminology databases '''
 
These include publicly available databases. Good examples:
* [http://en.wiktionary.org/ Wiktionary] (presented above)
* [http://www.granddictionnaire.com/BTML/FRA/r_Motclef/index800_1.asp Le grand dictionnaire terminologique]
* [http://www.terminologie.com/ Open Termonology Forum] (for some restricted domains only)
 
''' (4) Full text indexers '''
 
I.e. indexing search engines that allow to formulate complex queries. Instead of searching all the Internet they will look up translation memories (see below).
 
''' (5) Concordancers'''
 
{{quotation|retrieve instances of a word or an expression and their respective context in a monolingual, bilingual or multiligual corpus, such as a bitext or a translation memory.}} ([http://en.wikipedia.org/wiki/Computer-assisted_translation wikipedia])
 
''' (6) Bi-texts '''
 
A merge of source text and its translation
 
''' (7) Translation memory managers (TMM) '''
 
'''TMM''' relates to the idea that there is a database with text segments in the source language and their translations in one or more target languages. A text segment corresponds to sentences or equivalent (e.g. titles or bullet items).
 
In fact a TMM is specialized text editor that displays segments to be translated in one column and possible translations in another one. The translator then can reuse old translations and adapt it to the new fragment (or if there is no match, create a totally new translation.
 
=== Technical infrastructure for sofware strings translation ===
 
Translators should "see" what they translate. This concerns several items. When translating a string, the translator should be able to see:
* The name of the constant (which must be meaningful, e.g. "modulename.mainmenu.edit" or "modulename.errormsg.upload.xxx")
* The name of the constant (which must be meaningful, e.g. "modulename.mainmenu.edit" or "modulename.errormsg.upload.xxx")
* A meanigful short description. This description may include a link to a glossary.
* A meanigful short description. This description may include a link to a glossary.
Line 477: Line 521:
''The {0} cannot be within the range of an existing condition.''
''The {0} cannot be within the range of an existing condition.''


== Formats and software ==
== Formats ==
 
In this section we shall have a look at some more populat formats used in Localization.


=== Lowlevel formats ===
=== Lowlevel formats ===
Line 486: Line 532:


In a mediawiki (the software used for this wiki). Translations of language strings can either sit in files or in a database or both.
In a mediawiki (the software used for this wiki). Translations of language strings can either sit in files or in a database or both.


To ease translation work, language files can be generated from more highlevel formats, such as Gettext or XLIFF. Also Gettext and XLIFF can be generated from computer code and merged back into it.
To ease translation work, language files can be generated from more highlevel formats, such as Gettext or XLIFF. Also Gettext and XLIFF can be generated from computer code and merged back into it.
Line 524: Line 569:


There exist convertors from PO to XLIFF and the other way round.
There exist convertors from PO to XLIFF and the other way round.
=== Translation Memory eXchange ===
Translation Memory eXchange (TMX) is according to [http://en.wikipedia.org/wiki/Translation_Memory_eXchange Wikipedia], {{quotation|an open XML standard for the exchange of translation memory data created by computer-aided translation and localization tools. TMX is developed and maintained by OSCAR[1] (Open Standards for Container/Content Allowing Re-use), a special interest group of LISA[2] (Localization Industry Standards Association). Being in existence since 1998, the format allows easier exchange of translation memory between tools and/or translators with little or no loss of critical data[3]. The current version is 1.4b - it allows for the recreation of the original source and target documents from the TMX data. TMX 2.0 was released for public comment in March, 2007}}, retrieved 14:33, 3 February 2010 (UTC).


=== The OAXAL framework ===
=== The OAXAL framework ===
Line 576: Line 625:
* [http://en.wikipedia.org/wiki/Translation_Memory_eXchange Translation Memory eXchange] (TMX) {{quotation|is an open XML standard for the exchange of translation memory data created by computer-aided translation and localization tools. TMX is developed and maintained by OSCAR (Open Standards for Container/Content Allowing Re-use), a special interest group of LISA (Localization Industry Standards Association}} (Wikipedia, retr. jan 2010).
* [http://en.wikipedia.org/wiki/Translation_Memory_eXchange Translation Memory eXchange] (TMX) {{quotation|is an open XML standard for the exchange of translation memory data created by computer-aided translation and localization tools. TMX is developed and maintained by OSCAR (Open Standards for Container/Content Allowing Re-use), a special interest group of LISA (Localization Industry Standards Association}} (Wikipedia, retr. jan 2010).


=== Software ===
== Software ==
 
This section should include a short list of useful software


; PO editors
; PO editors
Line 592: Line 643:
** [http://okapi.sourceforge.net/Release/Utilities/Help/index.html Components]
** [http://okapi.sourceforge.net/Release/Utilities/Help/index.html Components]
** [http://okapi.sourceforge.net/downloads.html downloads] (include a localization toolbox and a translation memory editor).
** [http://okapi.sourceforge.net/downloads.html downloads] (include a localization toolbox and a translation memory editor).
; Links of links
* [http://en.wikipedia.org/wiki/Computer-assisted_translation http://en.wikipedia.org/wiki/Computer-assisted_translation#Comparison_of_different_CAT_tools Computer-assisted translation/Comparison of different CAT tools]


== open source localization guidelines ==
== open source localization guidelines ==
Line 728: Line 783:
* Karunakar, G. and Shirvastava, R. (2007) Evolving Translations & Terminology - The Open Way, LRIL-2007 - National Seminar on Creation of Lexical Resources for Indian Language Computing and Processing at C-DAC Mumbai. [http://tdil.mit.gov.in/april-jan-2008/8.4_Evolving_Translation.pdf PDF]
* Karunakar, G. and Shirvastava, R. (2007) Evolving Translations & Terminology - The Open Way, LRIL-2007 - National Seminar on Creation of Lexical Resources for Indian Language Computing and Processing at C-DAC Mumbai. [http://tdil.mit.gov.in/april-jan-2008/8.4_Evolving_Translation.pdf PDF]


* Muegge, Uwe (2007). "Disciplining words: What you always wanted to know about terminology management". tcworld (tekom) (3): 17–19. [http://www.tekom.de/upload/alg/tcworld_307.pdf PDF]


[[Category: Ergonomics and human-computer interaction]]
[[Category: Ergonomics and human-computer interaction]]
[[Category: Design methodologies]]
[[Category: Design methodologies]]

Revision as of 15:33, 3 February 2010

Draft

This article or section is currently under construction

In principle, someone is working on it and there should be a better version in a not so distant future.
If you want to modify this page, please discuss it with the person working on it (see the "history")

<pageby nominor="false" comments="false"/>

Introduction: What do do mean by localization ?

Software localization (or localisation) could mean simple translation of software a interface and messages to another language plus adaptation of some formats (e.g. measures, dates and currency). But usually, software localization implies more.

Firstly, the software should be internationalized (I18N), i.e. desiged in a way that it can be adapted to various languages and regions (not just one) without engineering changes to the programming logic. In principle, it is stuff that has to be done once. But sometimes I18N is done in stages, e.g. often developers forget that some languages are more verbose than others (need more space) or work in other directions (e.g. right-to-left).

In a second step, software has to localized, i.e. translated and adapted to local cultures. Localization (L10N) means adaption of a product following the needs of a particular population in a precise geographic region. Needs include linguistic, cultural and ergonomic aspects. (Le grand dictionnaire terminologique). McKethan and White (2005) define localization as “the process of adapting an internationalized product to a specific language, script, cultural, and coded character set environment. In localization, the same semantics are preserved while the syntax may be changed.” The authors further argue that “Localization goes beyond mere translation. The user must be able to not only select the desired language, but other local conventions as well. For instance, one can select German as a language, but also Switzerland as the specific locale of German. Locale allows for national or locale-specific variations on the usage of format, currency, spellchecker, punctuation, etc., all within the single German language area.”

Gregory M. Shreve (retrieved 16:27, 21 January 2010 (UTC)) adds adaptation of "non-textual materials" like colors, logos and pictures. “Localization is the process of preparing locale-specific versions of a product and consists of the translation of textual material into the language and textual conventions of the target locale and the adaptation of non-textual materials and delivery mechanisms to take into account the cultural requirements of that locale.”

The W3C Internationalzation web site also argues that

localization is often a substantially more complex issue. It can entail customization related to:

  1. Numeric, date and time formats
  2. Use of currency
  3. Keyboard usage
  4. Collation and sorting
  5. Symbols, icons and colors
  6. Text and graphics containing references to objects, actions or ideas which, in a given culture, may be subject to misinterpretation or viewed as insensitive.
  7. Varying legal requirements
  8. and many more things.
Localization may even necessitate a comprehensive rethinking of logic, visual design, or presentation if the way of doing business (eg., accounting) or the accepted paradigm for learning (eg., focus on individual vs. group) in a given locale differs substantially from the originating culture.

Usually, language localization extends to national subcultures. E.g. German would decline in De-de (German for Germany) and other versions for Switzerland (de-ch) and Austria (de-at), etc. National versions of a language include different words, different spellings (like "localization" vs. "localisation"), and sometimes different grammar. Conversely, a multi-lingual country like Switzerland would have German Swiss (de-ch), French Suiss (fr-ch) and Italian-Swiss (it-ch). These country-specific localizations have in common the way date/time, decimals and currency is represented. Software translators may adapt the following strategies:

  • Create a generic translation for one language, e.g. fr, and then add specific local variants like fr_fr, fr_be on top of it.
  • Create only one translation, e.g. fr_fr, and have users cope with it. Data/time and decimal/currency representation differences should be handled though. Most often, this is the case in free open source software.
  • Sometimes, localizations only exist for major sub-cultures, e.g. a de-de and de-ch version. Other german speaking cultures like de-at (too close to German) or de-lu and de-li (too small) are not covered.

Finally, the term Software globalization (G11N), also known as National Language Support, refers to the combination of software internationalization (I18N) and localization (L10N).

Let's recapitulate and explain the abbreviations for internationalization, localization and globalization. The "funny" kind of abbreviations used in the localization community are so-called numeronyms as we shall explain for the I18N.

  • Internationalization, known as I18N. The 18 stands for the number of letters between the first i and last n in internationalization.
  • Localization, known as l10n (or L10N) , is composed of the l of localization, followed by 10 letters (ocalizatio) and the final n of localization.
  • Software globalization, known as G11N refers to: I18N + L10N
  • Finally, GILT stands for Globalization, Internationalization, Localization and Translation, i.e. all matters related to localization.

Issues

In this chapter we shall explore some important issues. Since we shall focus on translation of open source software, we shall stress the importance of ergonomic aspects as the #1 priority of localization. By ergonomic aspects we mean both "surface usability" (users can understand the meaning of UI interface elements and system messages) and cognitive ergonomics (user can get meaningful tasks done with the system).

Ergonomy considerations apply to both the products (software, manuals), and to the programmer and translator environments (tools, community, etc.)

Infrastructure and people

In a larger project, the list of types of participants can be quite long: E.g. Gregory M. Shreve identifies: Project managers, Translators (Generic), Localization Translators (Specialists), Terminologists, Internationalization/Localization Engineers (Software Background), Proofreaders, QA specialists, Testing engineers, Multilingual Desktop publishing specialists.

In a volonteer-based open source project we might see the following roles (that my be cumulated by the same person)

  • one person to coordinate software development and translation
  • one person to coach software string translators
  • one person to coach manual writing, including a glossary
  • volonteer translators
  • volonteer users that provide feedback about the UI ergonomics, spelling, in-context help, etc.

Localization targets

I18N and L10N should be thought of in terms of the whole system. A software product may include:

User documentation (includes several genres and available through several media, e.g. within the software itself, HTML, paper/PDF)
Manuals
Short contextual help
Glossary
Tutorials
....
Software (user)
Menus and Icons (visual command language)
Command languages (sometimes)
Messages (various output)
....
Software documentation
Documentation of language constants (and other useful elements)
Developer manuals (maybe)

Since translation work is often split up between many volonteers with domain knowledge, there ought to be some potential for synergy that FOSS project managers and developers might try to exploit and develop. E.g. people writing the tutorials also should make comments about the meaningfulness of system messages as well as other usability issues, and not just try to adapt the user to the system.

In addition, translation of software modules, manuals, tutorials ought to be coordinated somewhere, see the terminology issue below.

Managing volonteers in an open source project

Open source projects don't have the funding to pay professional translators. This situation has disadvantages but also some advantages.

Disadvantages: Quality of the translation, completion (untranslated strings for new versions, missing languages, etc.)
Advantages: Meaninfulness of the translation. Often translators are users, i.e. they have know how of the tool which "normal affordable" translators do not have. According to [ LISA], a second interesting fact is that “the natural tendency of the localizer is to magnify the difference between [two expressions in the source language that refer to the same thing], just to be on the safe side. The result is that localizers tend to find the problems in a product that were not previously noted, problems that may well have confused users in the source language as well.”

I (16:27, 21 January 2010 (UTC)) believe that there ought to be some strategies to improve volonteer translation efforts and to profit from emerging insights. The main issues:

(1) Motivating people to help and to continue translating

(2) Make sure that translation is usable by providing a decent enough translation support environment (see next item)

(3) Make sure to profit from translator's domain knowledge, plus inconsistencies he may detect in the original source language interface and messages.

Terminology management / Glossary

Terminology management refers to the fact that critical and specialized terms must be clearly defined in the source language and be linked to target languages. With respect to software localization, this should be done independently of translation of UI elements and system messages.

The Localization Industry Standards Association (LISA) defines terminology management as follows: “Quality translation relies on the correct use of specialized terms. It improves reader understanding and reduces the time and costs associated with translation. Special terminology management systems store terms and their translations, so that terms can be translated consistently. Full-featured systems go beyond simple term lookup, however, to contain information about terms, such as part of speech, alternate terms and synonyms, product line information, and usage notes. They are generally integrated with translation memory systems and word processors to improve translator productivity.” (retrieved 15:51, 27 January 2010 (UTC)).

Silvia Pavel defines terminological data as “Any information relating to a term or the concept designated. Note: The more common terminological data include the entry term, equivalents, alternate terms, definitions, contexts, sources, usage labels, and subject fields.” This can lead to rather large records, e.g. have a look at the term Apprentissage/Learning in the Termium Plus database.

Terminology management is not a simple formal question of agreeing on some ways of translating words. It's related to fundamental aspects of language use in the sense of language as action and that requires common grounding. (Clark, 1996).

See the example section below for some online software that can help establish common lexical items, including ways to support process of meaning negotiation between various actors.

On the code side

Language files
All output messages to the user must be defined as a kind of name constant that the programmers will use. Alternatively, it can be decided that source langage strings (plus a documentation string nearby in the code) are the constants.
Name of the constant should be meaningful and available to translators (since at some point here must a common way to talk about a precise translation item)
Languages files must be separate for the translators (e.g. in a textfile, a localization format, or ad database)
Encoding
  • Nowadys, project should systematically use some form of Unicode
Space and Layout
  • Space of text fields: Some languages are more verbose and one must plan for that, by using wider icons, menu items, user input fields and such or else use a "fluid" design.

Reviewing

Localized content should be tested in target. In other words: Using the software for real or almost real by real target persons does provide good feedback. If this is not an option, the reviewer must be able to "imagine" a real situation, i.e. receive enough contextual information.

In an open-source project like LAMS that allows for frequent updating one could ask volonteer teachers to provide feedback and to ask their learners to do the same (at least write down meaningless/difficult to understand) UI elements and system messages. If this is not possible, translators should at least play all the roles: Administer the system, design both a complex and a simple sequence, use it as teacher (including the monitoring) tool, and experience it as learner. Another option could be "problem/suggestions" button to be installed optionally by the administrator in the interfaces and that would allows user to complain and make suggestions in an easy, quick and non-obtrusive way.

This idea needs to be fleshed out.

Examples of software that collects user information

  • The "report Broken Web Site" and "Report Web Forgery" buttons in Firefox (under the Help menu)
  • The dialog you get once MS Help dialogs end up without solutions, e.g. three questions: (1) was it useful, (2) why ? and (3) additional information. Now I would like to know what kind of volume / 1000 users this generates and how it is being used.
  • Some google help files have a similar interface. (1) Was this information helpful, followed by (2) Easy to understand? (3) Complete with enough details ?
  • Google translate. E.g. learning activity management system translated to french "système d'apprentissage en gestion de l'activité" which is totally wrong. Users can contribute a better translation. In addition they could activate the translators toolkit, which includes glossary management, chat, etc.

Computer-assisted translation software

Today exists a rather large set of toolboxes that aim to increase the quality and the efficiency of the translation processs. Some of these tools already have been introduced by example. Based on Wikipedia's Computer-assisted translation entry, we may distinguis the following types of tools:

(1) Spell checkers and grammar checkers:

Both can be integrated into wordprocessors, online CMSs, programming editors or work as add-on programs

(2) Terminology managers:

This includes a whole family of tools, ranging from simple tables (entry + description) to complex glossary and thesaurus managagers. Some of these exist online. A good examples is the Open Termonology Forum and Google translation tool, both discussed in this article.

Commercial programs include: LogiTerm, MultiTerm, Termex

Open source programs include:

Open source online programs:

  • The Wikionary (probably not available "as is", but one could install a Mediawiki and then retrieve all the templates)

(3) Terminology databases

These include publicly available databases. Good examples:

(4) Full text indexers

I.e. indexing search engines that allow to formulate complex queries. Instead of searching all the Internet they will look up translation memories (see below).

(5) Concordancers

“retrieve instances of a word or an expression and their respective context in a monolingual, bilingual or multiligual corpus, such as a bitext or a translation memory.” (wikipedia)

(6) Bi-texts

A merge of source text and its translation

(7) Translation memory managers (TMM)

TMM relates to the idea that there is a database with text segments in the source language and their translations in one or more target languages. A text segment corresponds to sentences or equivalent (e.g. titles or bullet items).

In fact a TMM is specialized text editor that displays segments to be translated in one column and possible translations in another one. The translator then can reuse old translations and adapt it to the new fragment (or if there is no match, create a totally new translation.

Technical infrastructure for sofware strings translation

Translators should "see" what they translate. This concerns several items. When translating a string, the translator should be able to see:

  • The name of the constant (which must be meaningful, e.g. "modulename.mainmenu.edit" or "modulename.errormsg.upload.xxx")
  • A meanigful short description. This description may include a link to a glossary.
  • All other translations (languages strings) the translator understands (e.g. if I translate to German, I'd like see both English and French)
  • (If possible) the constant displayed in the interface. That of course requires extra programming. Or even better: be able to edit strings directly on the interface

Tools for consistency: When translating hundreds of strings (and the situation gets worse if it's done by several people) there should be a way to search through all terms in all modules in three ways:

  • Find same expressions in the target language and display an other language next to it
  • Find same expressions in another language and display the target strings next to it
  • Be able to edit and consult a short glossary that includes the most important terms (might be combined with the general user manual)
  • (dreaming) direct access to some online translation dictionary like the english/frenchgrand dictionnaire terminologique

See also the terminology management / glossary above

Examples of terminology and translation tools used in FOSS projects

In this section, we will look at various examples that raised our interest.

The open terminology forum

The Open Terminology forum is a freely-accessible terminology database whose content is supplied and validated by the members of a non-profit organization. According to What is it? page, “anyone who can show that they have specialized knowledge can become a member of the Forum and participate in their area of expertise, whether they are a professional, a translator, a journalist or an academic. Members can add terminology information, comment on (but not change) that contributed by others, make personal lists of selections of terms and create their personal descriptions with their own HTML links.” , retrieved 15:51, 27 January 2010 (UTC). “Each area of specialization has its own terms. The Forum is a place where these terms can be recorded, refined, discussed and validated by those concerned. As every item of information shown is personally attributed to the contributor, members can build up their reputations”

The main window shows terms in a three column layout. Terms can searched with wildcards (* and ?). A user can then click on each of the "meaning", "English term" and "French equivalent" in order to further explore or to edit the context (with permission)

(1) Click on meaning, then Info will provide contextual information. E.g. the English term "classroom management" is not translated (should become "gestion de classe") is associated with the rather useless "for a lesson" meaning, but clicking on "for a lesson" will show a good definition plus a link.

(2) Click on a term will allow to add variants, occurences, synonyms, translations or a comment.

(3)Click on help will show agreement among experts and some other information.

Wiktionary

“Designed as the lexical companion to Wikipedia, the encyclopaedia project, Wiktionary has grown beyond a standard dictionary and now includes a thesaurus, a rhyme guide, phrase books, language statistics and extensive appendices. We aim to include not only the definition of a word, but also enough information to really understand it. Thus etymologies, pronunciations, sample quotations, synonyms, antonyms and translations are included.” (Main Page, retrieved 18:07, 29 January 2010 (UTC)). Advanced users may customize Wiktionary preferences. Entries can be search or looked up through various indexes.

The structure of Wiktionary entries (pages) seems to be a moving target, i.e. together with emerging needs complexity is added or page structure is modified. Also, various language variants don't look the same. I.e. the french version includes the use of icons in from of main headings (much prettier ...). Finally, exploitation of various possible headers and templates may be different in various culture.

Let's have a look at the Thesaurus discussion. The English dictionary version includes a separate thesaurus called Wikisaurus. The Germans on the other hand, do also have a small thesaurus, but globally speaking one can observe that their wiktionary seems to be more systematic, i.e. entries do include more thesaurus-like features. Here is small table we made from looking at a few entries (i.e. we don't claim it to be accurate):

In all Wiktionaries we can find:

  • Synonymy – same or similar meaning
  • Troponymy – manner, such as trim to cut

In the German Wiktionary we get in addition:

  • Antonymy – opposite meaning
  • Hyponymy – narrower meaning, subclass (Unterbegriffe)
  • Hypernymy – broader meaning, superclass (Oberbegriffe)

In none we found (although it can be covered somewhat by super/subclasses)

  • Meronymy – part, such as wheel of a car
  • Holonymy – whole, such as car of a wheel

We may argue that the German strategy seems to see the dictionary itself as a kind of thesaurus, whereas in the English world the two have been split, although not all participants did think that this was a good strategry. Daniel K. Schneider also prefers the German "inclusive" strategy.

Let's have a look at Internet, top of english, french and german pages (Jan 2010):

Wikionary entries are normal Mediawiki pages, but follow structuring conventions, e.g. concerning the heading structure. In addition, users must understand how to use and to find (!) templates. A Mediawiki template can be defined as some kind of extension language that allows for shortcuts, complex formatting, transclusions, etc. More precisely, Wiktionary's Index to templates provides the following definition: “Templates in Wiktionary are a convenient and consistent way of including a snippet of text in various, similar articles. Within the text of an entry, enclosing a template name inside the double curly braces {{xxx}} will instruct Wiktionary to insert right there in the entry when it is viewed the contents of the page [[Template:xxx]].”. In addition, there exist shortcuts which are a specialized types of a redirection page that can be used to get to a commonly used Wiktionary reference page more quickly. Fairly complex languages to learn that makes Daniel K. Schneider sometimes wonder whether Wikipedia and its sister projects are not the world viewed by IT-savy people...

Wiktionary pages may include translations if the spelling of the word is the same. E.g. the localisation page has both an English and a french entry. But usually, a page will refer to entries of other languages through a special template, which allows to transclude these into the page.

A minimal page typically has the following structure:

== English ==

===Noun===

=== References ===  	 

The structure of a more complete entry can be illustrated by the phrase word. Its table of contents looks like this.

1. English
    1.1 Pronunciation
    1.2 Etymology
    1.3 Noun
        1.3.1 Synonyms
        1.3.2 Derived terms
        1.3.3 See also
        1.3.4 Translations
    1.4 Verb
        1.4.1 Derived terms
        1.4.2 Related terms
        1.4.3 Translations
    1.5 External links
    1.6 Anagrams
2. French
    2.1 Pronunciation
    2.2 Noun
    2.3 Anagrams

The french section is included because the word "phrase" does exist in french and means "sentence". The translation sections (one for the noun and one for the verb) includes definitions from many languages by transclusion.

Let's now examine the wiki code of an other medium-sized entry in more detail. The source wiki code of localization entry looks like this:

Screenshot of http://en.wiktionary.org/wiki/localization (upper part / jan 2010
{{wikipedia}}
{{also|localisation}}
==English==

===Etymology===
From [[localize]] + [[-ation]];
compare French ''[[localisation#French|localisation]]''

===Alternative spellings===
* [[localisation]] (''British'')

===Noun===
{{en-noun}}

# The [[act]] of [[localize|localizing]].
# The [[state]] of being [[localized]].

====Derived terms====
* [[l10n]], [[L10n]]
* [[cerebral localization]]
* [[chromosomal localization]]

====Related terms====
* [[locale]]
* [[localism]]
* [[locality]]
* [[locate]]

====Translations====

{{trans-top|act of localizing}}
* Chinese:
:* Mandrin: {{zh-tsp||地方化|difang-hua}}
* Dutch: {{t+|nl|lokalisatie|f}}
* Finnish: {{t+|fi|lokalisoida}}, {{t+|fi|kotoistaa}}
{{trans-mid}}
[ .... contents removed .... ]
* Norwegian: {{t-|no|lokalisering|f}}
* Russian: {{t+|ru|локализация|sc=Cyrl}}
{{trans-bottom}}

{{trans-top|state of being localized}}
* Chinese:
:* Mandrin: {{zh-ts||本地化}}
* Dutch: {{t+|nl|lokalisatie|f}}
{{trans-mid}}
* Norwegian: {{t-|no|lokalisering|f}}
* Russian: [[локализованность]]
{{trans-bottom}}

{{trans-top|software engineering:
act or process of making a product suitable for use
in a particular country or region}}
* Chinese:
:* Mandrin: {{zh-ts||本地化}}
* Dutch: landelijke aanpassing
[ .... contents removed .... ]
* Russian: [[локализация]]
{{trans-bottom}}

{{checktrans-top}}
* {{ttbc|French}}: {{t+|fr|localisation|f}}
* {{ttbc|German}}: [[Lokalisation]]
* {{ttbc|Italian}}: [[localizzazione]]
{{trans-mid}}
(rookaraizeeshon)
* {{ttbc|Korean}}: [[지방화]]
* {{ttbc|Latin}}: [[localizatio]]
* {{ttbc|Portuguese}}: [[localização]]
* {{ttbc|Spanish}}: [[localización]]
* {{ttbc|Turkish}}: [[Bölge]]
{{trans-bottom}}

====See also====
* [[internationalization]] / [[i18n]]

As the reader may notice, there are many templates ({{....}}). An author may just create minimal entries or fix existing entries. Advanced authors, as we mentionned above, may choose from probably hundreds of different templates.

Let's now look at tools for discussion and user feedback.

Feedback box in the English Wiktionary (Jan 2010)

In the English version and some but not all other language version (Jan 2010), users may leave a feedback. Users can first signal a problem, i.e. choose among:

  • Good
  • Bad
  • Messy
  • Mistake in definition
  • Confusing
  • Could not find the word I want
  • Incomplete
  • Entry has inaccurate information
  • Definition is too complicated

Optionally, a user then may leave a note. It probably will append it to the discussion page. Each mediawiki page has an associated discussion page (including one for this page you are reading).

Discussion does sometimes happen on wiktionary.org. A nice and funny example is the e-dictionary's discussion page (about 10 times as long as the entry ...)

We believe that Wiktionary is an interesting project and for several reasons:

  • The flexibilty of the Mediawiki infrastructure allows to cope with emerging views about what a dictionary entry should be. Indeed, from a simple dictionary it evolved into a complex one that inlcudes etymologies, pronunciations, sample quotations, synonyms, antonyms and translations (and more related kind of terms in the German version).
  • Wiktionary also integrates well with other Wikimedia sister projects like Wikipedia or Wikiversity.

Remark on the history: Most Wikitionary entries, as you can read in the Wiktionary entry of Wikipedia were initially created by bots: “most of the entries and many of the definitions at the project's largest language editions were created by bots that found creative ways to generate entries or (rarely) automatically imported thousands of entries from previously published dictionaries. Seven of the 18 bots registered at the English Wiktionary[4] created 163,000 of the entries there.[5] Only 259 entries remain (each containing many definitions) on Wiktionary from the original import by Websterbot from public domain sources; the majority of those imports have been split out to thousands of proper entries manually.”

The Google translator toolkit

According to the Google translator toolkit basics page, the kit can do the following:

  • Upload Word documents, OpenOffice, RTF, HTML, text, Wikipedia articles and knols.
  • Use previous human translations and machine translation to 'pretranslate' your uploaded documents.
  • Use our simple WYSIWYG editor to improve the pretranslation.
  • Invite others (by email) to edit or view your translations.
  • Edit documents online with whomever you choose.
  • Download documents to your desktop in their native formats --- Word, OpenOffice, RTF or HTML.
  • Publish your Wikipedia and knol translations back to Wikipedia or Knol.

When uploading you can/must:(as of 1/2010)

  • choose among the four supported formats
  • Use translation memory (global or local)
  • Use glossary (none, one of yours)

Once a text is uploaded, it stays in Google memory (at least for a while) and users can then do a side by side translation in the supported formats. Too bad that Google does not suport Mediawiki format without passing through Wikipedia.

Glossaries

Let's look at Google's Translation memories, glossaries, and placeholders. Part of this kit includes the possibility to upload glossaries and to save TMX files. A glossary is a CSV table with the following format: A arbitrary number of columns with locales (at least) one followed by an optional part of speech (pos) and description columns.

Google custom glossary format
en-AU fr gsw locale4 .... pos description
lams mouton lämmli noun Either a software system or variant of sheep. Do not use as a verb
gate verrou schperri noun An element in pedagogical sequencing that will filter access to the next element in various ways. (... more needed here ...)

Since CSV uses commas to separate items, Commas in a cell must be escaped by double-quotes ("). For example, to add the term hello, world!, use "hello, world!". Quotes within quotes are escaped by \. For example, She said, "hello." should be entered as "She said, \"hello.\"". Such glossaries can be made with a spreadsheet (e.g. the google docs) and then be exported as CSV and then imported to the translation tool.

Translation memory

“A translation memory (TM) is a database of human translations. As you translate new sentences, we automatically search all available translation memories for previous translations similar to your new sentence. If such sentences exist, we rank and then show them to you. Comparing your translation to previous human translations improves consistency and saves you time: you can reuse previous translations or adjust them to create new, more contextually appropriate translations. When you finish translating documents in Google Translator Toolkit, we save your translations to a translation memory so you or other translators can avoid duplicating work.” (Using translation memories, retrieved 15:51, 27 January 2010 (UTC)).

Mozilla Projects / Firefox

Let's examine a few features of the Mozilla L10N strategy.

In the Mozilla project, localization strings are managed through XUL. The following fragment defines two strings to be displayed as so-called tokens

 <caption label="&<b class="token">identityTitle.label</b>;"/>
 <description>&<b class="token">identityDesc.label</b>;</description>

identityTitle.label and identityDesc.label will be substituted by a strings defined in a DTD as entities.

<!ENTITY <b class="token">identityTitle.label</b> "Identity">
<!ENTITY <b class="token">identityDesc.label</b> "Each account has an identity, which is the ↵
↳ information that other people see when they read your messages.">

("↵ ↳" indicates a single line, broken for readability) If you want to find more example constants defined as XML entities, locate the firefox installation directory on your computer and examine chrome/ab-CD.jar file, e.g. en-US.jar.

The L10N tools include

  • Use of a text editor that can handle UTF-8 files
  • A langpack2cvstree.sh script that converts the en-US language package into another locale.
  • A command line/web tool, called compare locales: finds missing and obsolate strings in a localization
  • Example: short text that describes the tool
  • MozillaBuild: An easy way to install everything you need to checkout/pull and checkin/push your localization and run compare-locales on Windows.
  • Mozilla Translator is a tool to help translate programs
  • Narro, is a web application that allows online translation and coordination. You can see in operation at l10n.mozilla.org
  • Translate Toolkit (moz2po and po2moz): converts various sorts of Mozilla files to Gettext PO format for translation efforts using a PO editor and the other way round. It's used be Pootle for example (see below).
  • MozLCDB similiar to PO but more dedicated to Mozilla products
  • Pootle a web server for localisation that allows web-based contributions and management. Combined with the Translate Toolkit it allows Mozilla products to be localised online.
  • Virtaal is an off-line PO editor developed by the Pootle team.

We wonder a bit who uses which tools. If we understand the situation right, there are several ways to translate as long as the translation does find its way back into the CVS at some point.

Also, it is not suprising, that the project makes a clear distinction between "official releases" and others. Official releases include translation of the installation and migration process, localizing the start page and other web pages built into the product, customizing settings like "live bookmarks", locally relevant search engine plugins, and more.

Now I wonder if translation learning management systems into different languages could imply using a different pedagogical vocabulary. E.g. "french didactics" vs. "belgian instructional design" vs. "canadian constructivism" (without any English word) vs. Swiss "let's have a bit of all".

LAMS

LAMS is a system for authoring and delivering learning activities, i.e. a learning design and learning activity management software. Since I made a little commitment to help with translation, I also shall insert some comments regarding features and UI elements that we could make better - Daniel K. Schneider 14:33, 22 January 2010 (UTC).

This open source project developed an original online solution to support translation. Software translators will use two tools:

  • An internationalization site (see below)
  • A LAMS Translators server (that is upgrade after a translation effort so that a translator can check the look and feel on a live system and make sure that there are no conceptual errors)
Editing translation strings

Strings to translate are called labels. A string may include slots to filled in with data and use the syntax {0}, {1}, etc.

The LAMS Internationalization site presents a list of available software modules and within each module, completion is shown as the following picture shows.

Lams translation server screendump (retrieved 1/2010 from Translating LAMS (Note: One ought be able to list just the modules for one language).

Translators can display a module, each label is displayed as a list of

English String, a french translation, update date, translator
Display of module labels

The translator then can choose between:

  • editing an isolated label
  • bulk translate all labels (date and author information of all tags will be overwritten)
  • translate only missing labels

Below are a few screenshots that illustrate the principle

Editing a single label (Note: We might add the Java property label since its a unique identfier)
Bulk translating labels
Homogenization

For the moment, there is no support for language-specific glossary and terminology management.

Good translators may develop several strategies to make sure that a common terminology is used.

Terminology must be homogeneous, e.g. the English word "Cancel" should normally translate to the same french word, e.g. "annuller" (v.s. "abandonner"). One also could argue that translators sometimes have to use less meaningful words, because at some point MS made certain decisions and people are used to it.

"Non-natural" terminology like "branching" or "gates" raise additional challenges. In french I translated this to "verrou" after looking at the German "Sperre". For various reasons I don't like the obvious translation of "Gate" to "Portail" (meaning "Portal"). Verrou means "lock" but also could mean a barrier in some geological sense. Anyhow, if several translators participate or if one translator translates once in a while, there is a very high chance that very different words are being used to deal with the same object. Also in my case, only once I decided to use LAMS for real, I stumbled upon terminology (created by myself) and that I really found unclear.

Fortunately in LAMS (and this is a mission-critical feature), translators can search for labels in any language (Note: Not true actually only in English and the target language) and then see the result in a similar fashion as in the bulk translate window.

Result of a label search
Result of a label search in English
Result of a label search in Target language (french)

This feature allows to see (1) where certain words of the target language are being use (e.g. check if the same word is used for different things, which may or may not be a proble) and (2) more importantly, how a given english word or phrase has been translated.

Since the result displayed is an editable bulk translation window, re-translation is pretty fast.

Testing

Testing is basically done by playing with a translator's test server. Translators have full rights, e.g. can explore the admin interface, create new teacher and student users and play with these. When the project manager sees that new important translations have been comitted he will update the test server, in particular the weeks before a software upgrade.

Under the hood

($$$ addition needed)

  • Labels are Java properties and my include slots for arguments, e.g.

The {0} cannot be within the range of an existing condition.

Formats

In this section we shall have a look at some more populat formats used in Localization.

Lowlevel formats

There exist various methods to handle language strings in computer programs.

Most often, some sort of constants are being used in the code. These constants are then defined either in so-called languages files or in a database. Let's have a look at some examples:

In a mediawiki (the software used for this wiki). Translations of language strings can either sit in files or in a database or both.

To ease translation work, language files can be generated from more highlevel formats, such as Gettext or XLIFF. Also Gettext and XLIFF can be generated from computer code and merged back into it.

Gettext

Gettext is a translation strings strategy and technology, popular in open source. is based on the idea that keys used to retrieve local language strings corresponds to the original string used in the source code. Documentation also is added as programming comment just before the corresponing line. From the source code so-called .PO files that are then use by translators.

XLIFF

  • XLIFF (XML Localization Interchange File Format) is an XML-based format created to standardize localization. I.e. XLIFF is an alternative to Gettext. XLIFF was standardized by OASIS in 2002. According to the Specification 1.2, XLIFF is the XML Localization Interchange File Format designed by a group of software providers, localization service providers, and localization tools providers. It is intended to give any software provider a single interchange file format that can be understood by any localization provider. It is loosely based on the OpenTag version 1.2 specification and borrows from the TMX 1.2 specification. However, it is different enough from either one to be its own format.
<xliff version='1.2'
       xmlns='urn:oasis:names:tc:xliff:document:1.2'>
<file original='hello.txt' source-language='en' target-language='fr'
       datatype='plaintext'>
  <body>
    <trans-unit id='hi'>
      &lt;source&gt;Hello world &lt;/source&gt;
      <target>Bonjour le monde</target>
      <alt-trans>
       <target xml:lang='es'>Hola mundo</target>
      </alt-trans>
    </trans-unit>
 </body>
</file>
</xliff>

The most important elements are <trans-unit> and <bin-unit>. They contain the translatable portions of the document. The <trans-unit> element contains the text to be translated, the translations, and other related information. The <bin-unit> contains binary data that may or may not need to be translated; it also can contain translated versions of the binary object as well as other related information.

In the trans-unit element, text to be translated is contained in a <source> element, the translated tell will be within the <target> element.

There exist convertors from PO to XLIFF and the other way round.

Translation Memory eXchange

Translation Memory eXchange (TMX) is according to Wikipedia, “an open XML standard for the exchange of translation memory data created by computer-aided translation and localization tools. TMX is developed and maintained by OSCAR[1] (Open Standards for Container/Content Allowing Re-use), a special interest group of LISA[2] (Localization Industry Standards Association). Being in existence since 1998, the format allows easier exchange of translation memory between tools and/or translators with little or no loss of critical data[3]. The current version is 1.4b - it allows for the recreation of the original source and target documents from the TMX data. TMX 2.0 was released for public comment in March, 2007”, retrieved 14:33, 3 February 2010 (UTC).

The OAXAL framework

  1. Consistency in the authored source-language text
  2. Consistency in the translated text
  3. Substantial automation of the Localization process

This references model, according to the Reference Model for Open Architecture for XML Authoring and Localization 1.0, retrieved 18:07, 29 January 2010 (UTC) has two prime use cases:

  1. Support authoring and localization within a CMS environment
  2. Support consistency within a translation-only environment

To achieve these goals and handle the use cases OAXAL defined the following software stack:

The architectural model for OAXAL. Source http://wiki.oasis-open.org/oaxal/FrontPage

Let's shortly describe some of these elements:

  • ITS (Internationalization Tag Set (ITS) Version 1.0) “is a technology to easily create XML which is internationalized and can be localized effectively. On the one hand, the ITS specification identifies concepts (such as "directionality") which are important for internationalization and localization. On the other hand, the ITS specification defines implementations of these concepts (termed "ITS data categories") as a set of elements and attributes called the Internationalization Tag Set (ITS).”. This work is explained in a Best Practices for XML Internationalization working group note.

Basically ITS allows you to insert tags into an XML tree that will tell systems (and human readers) of XML what should not be translated....

Below we reproduce a simple ITS example:

<dbk:article
  xmlns:its="http://www.w3.org/2005/11/its" 
  xmlns:dbk="http://docbook.org/ns/docbook" 
  its:version="1.0" version="5.0" xml:lang="en">
 <dbk:info>
  <dbk:title>An example article</dbk:title>
  <dbk:author
    its:translate="no">
   <dbk:personname>
    <dbk:firstname>John</dbk:firstname>
    <dbk:surname>Doe</dbk:surname>
   </dbk:personname>
   <dbk:affiliation>
    <dbk:address>
     <dbk:email>foo@example.com</dbk:email>
    </dbk:address>
   </dbk:affiliation>
  </dbk:author>
 </dbk:info>
 <dbk:para>This is a short article.</dbk:para>
 </dbk:article>

OAXAL defines (fairly complicated) workflows for document life cycles. From a high-level perspective, the life-cycle for localization-only workflows (e.g. as in sofware localization) looks like this:

OAXAL Localization Only Workflow using xml:tm namespace. Source http://wiki.oasis-open.org/oaxal/FrontPage

The key step is to extract some XLIFF file with translation string that then will be used to translate and merged back.

  • Translation Memory eXchange (TMX) “is an open XML standard for the exchange of translation memory data created by computer-aided translation and localization tools. TMX is developed and maintained by OSCAR (Open Standards for Container/Content Allowing Re-use), a special interest group of LISA (Localization Industry Standards Association” (Wikipedia, retr. jan 2010).

Software

This section should include a short list of useful software

PO editors
Free translation tools
Free online translation support systems
  • Google translate. Note: There is a way to get support for mediawiki files. Upload a filed called *.mediawiki. E.g. edit wikipage and copy/paste source or get it by &action=raw URL Get parameter, etc. Do not whine if some formatting doesn't render well (each mediawiki is different). This is a non-documented feature and we guess that it may become official after some further development - Daniel K. Schneider 18:07, 29 January 2010 (UTC).
Frameworks
  • The Okapi Framework is a set of interface specifications, format definitions, components and applications that provides an environment to build interoperable tools for the different steps of the translation and localization process. (Seems to be a live project).
Links of links

open source localization guidelines

The idea behind writing this entry was to come up with some suggestions/guidelines concerning localization in volunteer-based substantial open source projects.

(under construction - 21:04, 28 January 2010 (UTC) !)

Integrate usability

Profit from the fact that translators probably will use the software and that they will stumble on surface usability and cognitive ergonomics issues. I.e. stuff that is hard to translate may be problematic in the source language to start with.

This is related to the next point (common grounding)

Establish common grounding for essential terms used

Early in the project, start a collaborative glossary/terminology project. The glossary should include all major terms used in the UI or the manuals. Such a project should have the following features for a single meaning term (for polysemic objects this must be expanded):

  • Entry term
  • Equivalents (but that should not be used in UI and the system messsages)
  • Definitions
  • Contexts (e.g. for an LMS, that would be types of pedagogical scenarios)
  • Sources
Provide a translator's toolbox for software strings

A online translation system should be created (or used) and have the following features

  • display equivalents of a term in each language and at least be able to concurrently display source + two target languages.
  • search all terms in target or source language and then display all the terms again in triplets as above
  • display information and (if existing) constant/parameter string used by coders
  • be able to annote each term with comments or maybe an semi-automatic link to a glossary where comments could be made (precision neeeded here)
  • Translated items should be clearly attributed to a single translator, including revisions if possible.
Think about incentives

There should be some kind of incentives for the translators. One also might try to organize monetary incentives provided by localized communities (this is about OSS without resources to pay tranlators). E.g. a German fund might pay money to german translators and a French fund might pay a trip to Australia.

Have users participate

User's should be given the opportunity to participate. Windows, Google and Wiktionary do it, so can we. I.e. there should button that a user can hit for feedback. Feedback should be very simple, e.g. with radio buttons (but include the possibility to add a message.)

  • This is not clear (why)
  • This has a typo (what)
  • There is a better term (which/why)
  • This uses a different language (with respect to what)
  • etc.

The big question is where to position such a button and how it could detect what the user was talking about ...

Plan globally for all product items
  • Software localization in the narrow sense (UI and system messages)
  • In-software help (if existing)
  • Glossary (to be used by all actors, i.e. translators, developers, all category of software users, e.g. in e-learning authors, teachers, students)
  • Manuals of all sorts.

Links

Texts and web sites about software localization

Definitions
Organizations

Home Page]

How-to / tutorials
  • The Pavel Terminology Tutorial. This is substantial tutorial on all aspects regarding terminology made by the Canadian Translation bureau (make sure to explore all the menus, there are dozens of pages ...)
XLIFF
About language file formats

Glossaries and other structured data

Internationalization
  • Dotnet-culture.net provides information about date/time and decimal/currency representation.
Terminology databases
  • Le grand dictionnaire terminologique (english/french)
  • The Open Terminology Forum french, english, spanish, chinese (?). Limited to some subdomains / maintained by volonteers. This website might be of interest to some open source localization projects, either for hosting a project or to get some inspiration.
  • Termium Plus (The Government of Canada's terminology and linguistic data bank).
  • wiktionary.org (The free dictionary, a sister project of wikipedia). Includes translations.

Other

Example Projects and languages
Courses

Software links

Indexes

Bibliography

  • McKethan, Kenneth A. (Sandy)Jr. and Graciela White (2005). Demystifying Software Globalization, Translation Journal 9 (2), April 2005. HTML, retrieved 16:27, 21 January 2010 (UTC).
  • Esselink Bert (2000), A Practical Guide to Localization, , John Benjamins Publishing, ISBN 1-58811-006-0
  • Karunakar, G. and Shirvastava, R. (2007) Evolving Translations & Terminology - The Open Way, LRIL-2007 - National Seminar on Creation of Lexical Resources for Indian Language Computing and Processing at C-DAC Mumbai. PDF
  • Muegge, Uwe (2007). "Disciplining words: What you always wanted to know about terminology management". tcworld (tekom) (3): 17–19. PDF