A TEI Project

Part 3: TEI and localization

1. TEI internationalisation project

The aim is to lower the barrier to entry for non-English-speaking users by:

2. Reminder: definitions

Internationalization (I18N)
Internationalization is the process of generalizing a product so that it can handle multiple languages and cultural conventions without the need for redesign. Internationalization takes place at the level of program design and document development.
Localization (L10N)
Localization is the process of taking a product and making it linguistically and culturally appropriate to a given target locale (country/region and language) where it will be used.

3. Examples of translation

4. Localisation of examples

What does this
 <l>Sire Thopas was a doghty swayn;</l>
 <l>White was his face as payndemayn,</l>
 <l>His lippes rede as rose;</l>
 <l>His rode is lyk scarlet in grayn,</l>
 <l>And I yow telle in good certayn,</l>
 <l>He hadde a semely nose.</l>
mean to a Chinese scholar?

5. More examples

Element names are often easy to understand; but what if we use English in attribute values?
Next morning a boy in that dormitory confided to his
bosom friend, a <distinct type="psSlang">fag</distinct> of
Macrea's, that there was trouble in their midst which King <distinct type="archaic">would
fain</distinct> keep secret.
Here the type attribute of <distinct> carries the English-derived value ‘psSlang’ (short for ‘public school slang’), and the marked word ‘fag’ gives little help in decoding it.

6. More examples

The names of the elements may stand in the way of easy comprehension:
<persName key="EGBR1">
 <roleName type="office">Governor</roleName>
 <forename sort="2">Edmund</forename>
 <forename full="init" sort="3">G.</forename>
 <addName type="nick">Jerry</addName>
 <addName type="epithet">Moonbeam</addName>
 <surname sort="1">Brown</surname>
 <genName full="abb">Jr</genName>.
</persName>

This can only really be taken advantage of by someone who:
  1. appreciates the cultural context of ‘forename’ and ‘surname’
  2. can mentally expand ‘nick’ to ‘nickname’ (and knows what a nickname is)
  3. can appreciate whether a ‘Governor Edmund G. Jerry Moonbeam Brown Jr.’ is a politician, a kind of food, or a new dance

7. Translation choices

The user of the Guidelines may prefer to:
  1. read ‘contiene un único documento TEI, compuesto de una cabecera TEI (TEI header) y un cuerpo de texto (text), aislado o como parte de un elemento corpusTei (teiCorpus)’ instead of ‘contains a single TEI-conformant document, comprising a TEI header and a text, either in isolation or as part of a teiCorpus element.’ in the documentation
  2. use element names of <líneaDirección>, <ligneAdresse>, <linDireccio> or <AdressZeile> instead of <addrLine>
  3. see culturally-adapted examples, such as
     <byline rend="ur">唐・元稹</byline>

8. TEI I18N and L10N process 1: Unicode

9. Defining a non-Unicode character

A new character, assigned to a position in the Unicode Private Use Area (PUA), and with a standardized form as a fallback:
 <glyph xml:id="z103">
  <glyphName>LATIN LETTER Z WITH TWO STROKES</glyphName>
  <mapping type="standardized">Z</mapping>
  <mapping type="PUA">U+E304</mapping>
 </glyph>
This can now be referred to using the <g> element, as in
<g ref="#z103"/>

10. Defining a non-Unicode character (2)

It is also possible to override what appears in the text by using markup like this
<g ref="#z103">z</g>
where the content of the <g> element can be used immediately without any lookup.
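
The lookup described above can be sketched in a few lines of Python. This is only an illustration: the function name `resolve_g` is hypothetical, the <glyph> definition is assumed to sit inside a <charDecl> wrapper, and the TEI namespace is left out for brevity.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:id under the predefined XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Assumption: the glyph definition from the slide, wrapped in <charDecl>.
CHAR_DECL = """<charDecl>
 <glyph xml:id="z103">
  <glyphName>LATIN LETTER Z WITH TWO STROKES</glyphName>
  <mapping type="standardized">Z</mapping>
  <mapping type="PUA">U+E304</mapping>
 </glyph>
</charDecl>"""


def resolve_g(g, char_decl, mapping_type="standardized"):
    """Return the text to display for a <g> element (hypothetical helper)."""
    if g.text:
        # Inline content is used immediately, without any lookup.
        return g.text
    target = g.get("ref", "").lstrip("#")
    for glyph in char_decl.iter("glyph"):
        if glyph.get(XML_ID) == target:
            for mapping in glyph.findall("mapping"):
                if mapping.get("type") == mapping_type:
                    return mapping.text
    return None  # unknown reference or no mapping of that type
```

Calling `resolve_g` on `<g ref="#z103"/>` follows the pointer to the glyph definition, while `<g ref="#z103">z</g>` short-circuits to its own content.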

11. TEI I18N and L10N process 2: TEI literate programming

The TEI is written in a high-level markup language for specifying XML schemas and their documentation. This is an XML vocabulary known as ODD (One Document Does it all):
  1. The element and attribute sets making up the schema are formally specified using a special XML vocabulary
  2. The specification language also includes support for macros (like DTD entities, or schema patterns), a hierarchical class system for attributes and elements, and the creation of pre-defined groups of elements known as modules.
  3. Content models for elements and attributes are written using an embedded RELAX NG XML notation, but tools are available to generate schemas in any of RELAX NG, DTD language, or W3C XML Schema.
  4. Documentation describing the supported elements, attributes, value lists etc is managed along with their specification, together with use cases, examples, and other supporting material.
ODD is a standard TEI module.

12. Use of ODD

The TEI's 22 modules (containing 500 elements) can be combined together and customized as desired using the ODD language. Customization may include: The last two may make documents which break compatibility.

13. ODD for translation

The ODD language makes provision for translating element names, attribute names, and descriptions, and for preserving the information needed to convert translated documents back to canonical form.

The technical documentation elements (<gloss> and <desc>) for TEI elements and attributes etc can be specified multiple times, in different languages, distinguished by the standard xml:lang attribute.

There is also a container (<equiv>) to specify the relationship of an element, attribute or value to standardised schemes or ontologies.

14. ODD example

<elementSpec module="header" ident="taxonomy">
 <desc>defines a typology used to classify texts either
   implicitly, by means of a bibliographic citation, or explicitly by a structured taxonomy.</desc>
    <rng:ref name="category"/>
     <rng:ref name="model.biblLike"/>
     <rng:ref name="category"/>
  <egXML><taxonomy xml:id="tax.b">
    <bibl>Brown Corpus</bibl>
    <category xml:id="tax.b.a">
     <catDesc>Press Reportage</catDesc>
     <category xml:id="tax.b.a1">
     <category xml:id="tax.b.a2">
     <category xml:id="tax.b.a3">
     <category xml:id="tax.b.a4">
     <category xml:id="tax.b.a5">
     <category xml:id="tax.b.a6">
    <category xml:id="tax.b.d">
     <category xml:id="tax.b.d1">
     <category xml:id="tax.b.d2">
      <catDesc>Periodicals and tracts</catDesc>

15. Translating element names

The objects identified by the ident attribute in the TEI can be given an alternate name by use of the <altIdent> element; so the example above could be rewritten as
<elementSpec xmlns="http://www.tei-c.org/ns/1.0" module="header" ident="taxonomy">
<altIdent xmlns="http://www.tei-c.org/ns/1.0" xml:lang="fr">taxinomie</altIdent>
providing a French name for the element.

16. How does that work in the schema?

The normal schema, using RELAX NG compact syntax, has the definition
taxonomy =
  ## (taxonomy) defines a typology used to classify texts either
  ## implicitly, by means of a bibliographic citation,
  ## or explicitly by a structured taxonomy.
  element taxonomy { taxonomy.content, taxonomy.attributes }
taxonomy.content = category+ | (model.biblLike, category*)
taxonomy.attributes = att.global.attributes, empty
in which the element <taxonomy> is defined by the containing pattern taxonomy; it is the pattern name which other elements use, not the element name.

17. Translated schema

If the schema were translated into Greek, it would look like this:
taxonomy = element ταξινομια { taxonomy.content, taxonomy.attributes } ...
where the ‘pattern name’ remains the same. This type of schema markup is generated by the TEI tools, picking up the information from <altIdent>.
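
As a rough sketch of how a tool might pick up the <altIdent> information, the Python below builds the element line of a RELAX NG compact definition from an <elementSpec>. It is not the TEI stylesheets' actual implementation; the function name `element_pattern` and the reduced <elementSpec> are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Assumption: a cut-down <elementSpec> holding only the French <altIdent>.
SPEC = """<elementSpec xmlns="http://www.tei-c.org/ns/1.0" module="header" ident="taxonomy">
 <altIdent xml:lang="fr">taxinomie</altIdent>
</elementSpec>"""


def element_pattern(spec, lang=None):
    """Build the 'pattern = element name { ... }' line (hypothetical helper)."""
    ident = spec.get("ident")
    name = ident  # the canonical name is used unless a translation matches
    for alt in spec.findall(TEI + "altIdent"):
        if lang is not None and alt.get(XML_LANG) == lang:
            name = alt.text.strip()
    # The pattern name always stays canonical; only the element name changes.
    return f"{ident} = element {name} {{ {ident}.content, {ident}.attributes }}"


print(element_pattern(ET.fromstring(SPEC), "fr"))
```

Because other patterns refer to `taxonomy` rather than to the element name, nothing else in the schema has to change when the element is renamed.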

18. Translating descriptions

We can expand the TEI source to add Chinese translations alongside the English originals, and the appropriate text can be passed to the generated schemas or documentation:
<elementSpec xmlns="http://www.tei-c.org/ns/1.0" module="header" ident="taxonomy">
<desc xmlns="http://www.tei-c.org/ns/1.0">defines a typology used to classify texts either
implicitly, by means of a bibliographic citation, or explicitly by a structured taxonomy.</desc>
<desc xmlns="http://www.tei-c.org/ns/1.0" version="2007-05-02" xml:lang="zh-tw">定義文件分類的類型學,可以是潛在地以書目資料的方式,或是明確地以結構分類法的方式來分類。</desc>
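
One way to choose the appropriate text is to prefer the <desc> whose xml:lang matches the requested language and fall back to the untagged English original. The Python below is a minimal sketch under that assumption; `pick_desc` is a hypothetical name and the descriptions are abbreviated.

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Assumption: shortened versions of the English and Chinese descriptions.
SPEC = """<elementSpec xmlns="http://www.tei-c.org/ns/1.0" module="header" ident="taxonomy">
 <desc>defines a typology used to classify texts</desc>
 <desc version="2007-05-02" xml:lang="zh-tw">定義文件分類的類型學</desc>
</elementSpec>"""


def pick_desc(spec, lang):
    """Return the <desc> text for lang, else the untagged original."""
    fallback = None
    for desc in spec.findall(TEI + "desc"):
        # Collect all text, including child elements, and normalise whitespace.
        text = " ".join("".join(desc.itertext()).split())
        desc_lang = desc.get(XML_LANG)
        if desc_lang == lang:
            return text
        if desc_lang is None:
            fallback = text  # the English original carries no xml:lang
    return fallback


print(pick_desc(ET.fromstring(SPEC), "zh-tw"))
```

A request for a language with no translation yet, such as `"fr"` here, returns the English text.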

19. Using a translated schema in practice

If we take a Spanish play, and translate the element names to Spanish, a text like this will be much more familiar-looking to encoders in Spanish-speaking countries:
 <div1 tipo="part">
  <div2 tipo="act">
   <encabezado tipo="main">Jornada primera</encabezado>
   <div3 tipo="scene">
    <encabezado tipo="main">Cuadro único</encabezado>
    <acotacion formato="centered">
     <resaltado formato="bold">(Salen </resaltado>REBOLLEDO,
    <resaltado formato="bold">la</resaltado>CHISPA<resaltado formato="bold">
           soldados</resaltado>.<resaltado formato="bold">)</resaltado>
It is straightforward to write a transformation (e.g. in XSL) which reads the TEI source with the element names and <altIdent> information, and puts the text back into canonical form.
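
The slide suggests XSL for this; the same idea is sketched here in Python. The two lookup tables are hand-written assumptions for illustration (in practice they would be derived from the <altIdent> elements in the TEI source), and namespaces are ignored.

```python
import xml.etree.ElementTree as ET

# Assumed Spanish-to-canonical mappings, for illustration only.
ELEMENT_NAMES = {"encabezado": "head", "acotacion": "stage", "resaltado": "hi"}
ATTRIBUTE_NAMES = {"tipo": "type", "formato": "rend"}


def canonicalise(root):
    """Rename translated elements and attributes in place; return the root."""
    for el in root.iter():
        el.tag = ELEMENT_NAMES.get(el.tag, el.tag)
        renamed = {ATTRIBUTE_NAMES.get(k, k): v for k, v in el.attrib.items()}
        el.attrib.clear()
        el.attrib.update(renamed)
    return root


src = '<div2 tipo="act"><encabezado tipo="main">Jornada primera</encabezado></div2>'
print(ET.tostring(canonicalise(ET.fromstring(src)), encoding="unicode"))
```

Names that do not appear in the tables, such as `div2`, pass through unchanged, so a partially translated document is handled the same way.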

20. Internationalised interfaces for TEI applications

An application which turns TEI XML into HTML for web display, and provides a heading such as ‘Contents’ when it meets <divGen type="toc"/>, will have to provide appropriate translations. Eg:
ISO language code    Text
hi                   Mula Shabda
pl                   Spis treści
pt                   Índice geral
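
A minimal sketch of such a lookup, in Python, using only the strings shown above. The function name `toc_heading` and the fallback to English are assumptions, not a description of the TEI stylesheets.

```python
# Headings generated for <divGen type="toc"/>, keyed by ISO language code.
TOC_HEADINGS = {
    "en": "Contents",
    "hi": "Mula Shabda",
    "pl": "Spis treści",
    "pt": "Índice geral",
}


def toc_heading(lang, default="en"):
    """Return the heading for lang, falling back to the default language."""
    primary = lang.lower().split("-")[0]  # 'pt-BR' is looked up as 'pt'
    return TOC_HEADINGS.get(primary, TOC_HEADINGS[default])


print(toc_heading("pl"))
```

Reducing the code to its primary subtag lets regional variants reuse the base translation until a more specific one is supplied.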

21. TEI schema-making tools

The ODD language files are processed to produce schemas in the chosen language, using the Roma tools (web service and command-line script). This has support for varying the languages of its interface, but must also allow for supporting the following output schemes:

22. The application of the W3C ITS guidelines to TEI work

The W3C Internationalization Tag Set (ITS) encodes information for translators and localisers.

The ITS consists of a set of elements and attributes for annotating a text with information for further processing, covering both internationalization and localization.

23. ITS information location

The central ITS notion is that information about elements and attributes is expressed as a set of data categories.

24. ITS data categories

On an instance element the following attributes may be attached or derived:
Should this object be translated?
Is there some localisation hint?
What type of hint is it?
Does this object describe a technical term?
Where is the term defined?
What is the text direction?
Is there some Ruby annotation?

25. TEI text with ITS rules and markup attached


   <its:translateRule translate="no" selector="//t:body/t:p"/></its:rules>
   <p>Hello <hi>world</hi></p>
   <p its:translate="yes">translate me</p>
where the ITS rules say that <p> elements should not normally be translated, but the second <p> has an explicit override.
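
The interaction between a global rule and a local override can be sketched in Python as follows. This is a simplified illustration rather than a conformant ITS processor: it assumes the override is written as a local its:translate attribute, it ignores inheritance by child elements, and the names `RULES` and `translate_flags` are hypothetical.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
ITS_NS = "http://www.w3.org/2005/11/its"

# Assumption: a tiny TEI body with one plain <p> and one with an override.
DOC = """<TEI xmlns="http://www.tei-c.org/ns/1.0"
     xmlns:its="http://www.w3.org/2005/11/its">
  <text><body>
    <p>Hello <hi>world</hi></p>
    <p its:translate="yes">translate me</p>
  </body></text>
</TEI>"""

# Global rules as (selector, translate) pairs; './/' replaces the leading '//'
# because ElementTree only accepts relative paths.
RULES = [(".//t:body/t:p", "no")]


def translate_flags(root, rules):
    """Map each annotated element to 'yes' or 'no'."""
    flags = {}
    for selector, value in rules:  # later rules overwrite earlier ones
        for el in root.findall(selector, {"t": TEI_NS}):
            flags[el] = value
    for el in root.iter():  # local attributes beat global rules
        local = el.get(f"{{{ITS_NS}}}translate")
        if local is not None:
            flags[el] = local
    return flags


root = ET.fromstring(DOC)
flags = translate_flags(root, RULES)
for p in root.findall(".//t:p", {"t": TEI_NS}):
    print("".join(p.itertext()), "->", flags[p])
```

Applying the global rules first and the local attributes second reproduces the precedence described above: the first paragraph is excluded from translation and the second is kept.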

26. ITS for ODD

We can express the relationship between the structural elements and the documentation elements in ODD with the following ITS rules, which say that the default is to not translate anything, but give a set of elements which are to be translated:
 <its:ns prefix="tei" uri="http://www.tei-c.org/ns/1.0"/>
 <its:translateRule translate="no" selector="//tei:*"/>
 <its:translateRule translate="yes" selector="//tei:desc"/>
 <its:translateRule translate="yes" selector="//tei:gloss"/>
 <its:translateRule translate="yes" selector="//tei:valDesc"/>
 <its:translateRule translate="yes" selector="//tei:p[@rend='dataDesc']"/>
 <its:translateRule translate="yes" selector="//tei:remarks"/></its:rules>

27. Example of ITS implementation

We can show graphically, using an ITS tool, which elements need a translated equivalent (those in green)

28. Scale of TEI work

The scale of work involved is not impossible to contemplate.
The TEI contains:
The work needed for each language is to:
At a system level, we need to:

29. TEI plans

The TEI Consortium is working with TEI scholars to advance I18N and L10N in various languages.

French, Spanish, Italian, and Japanese are largely complete for translated <desc> and <gloss> texts; Chinese and German are in progress.

We hope to process Portuguese, Greek, Czech and Korean automatically using SYSTRAN.

Translation of all interface strings in XSL stylesheets to Bulgarian, Chinese, Dutch, French, German, Greek, Hindi, Italian, Japanese, Polish, Portuguese, Romanian, Russian, Serbian, Slovenian, Spanish, Swedish, Thai, and Turkish is complete.

30. TEI chapter source in English and French

31. Example of translated ODD


 <gloss>TEI document</gloss>
 <gloss version="2008-01-30" xml:lang="ja">TEI文書</gloss>
 <gloss version="2007-06-12" xml:lang="fr">document TEI</gloss>
 <gloss version="2006-10-18" xml:lang="de">TEI-Dokument</gloss>
 <gloss version="2007-05-04" xml:lang="es">documento TEI</gloss>
 <gloss version="2007-05-02" xml:lang="zh-tw">TEI文件</gloss>
 <gloss version="2007-01-21" xml:lang="it">documento TEI</gloss>
 <desc>contains a single TEI-conformant document,
   comprising a TEI header and a text, either in isolation or as part of a <gi>teiCorpus</gi> element.</desc>
 <desc version="2008-01-30" xml:lang="ja"> TEI準拠の文書を示す.</desc>
 <desc version="2007-06-12" xml:lang="fr">contient un seul document, conforme à la TEI, qui
   comprend un en-tête TEI et un texte, soit de façon isolée soit comme une partie d’un élément <gi>teiCorpus</gi></desc>
 <desc version="2006-10-18" xml:lang="de">
   enthält ein einzelnes TEI-konformes Dokument, das aus TEI-Header (Dateikopf) und Text besteht, entweder als eigenständige Datei oder als Teil eines Elements <gi>teiCorpus</gi>.</desc>
 <desc version="2007-05-04" xml:lang="es">contiene un único documento TEI-conforme, que comprende un encabezado y un texto, sea este aislado o parte de un elemento <gi>teiCorpus</gi></desc>
 <desc version="2007-05-02" xml:lang="zh-tw">符合TEI標準的單一文件,包括ㄧ個TEI標頭以及ㄧ個文本,可單獨出現或是處於元素<gi>teiCorpus</gi>(tei文集)之中。</desc>
 <desc version="2007-01-21" xml:lang="it">contiene un documento TEI-conforme, comprendente un'intestazione e un testo, sia esso isolato o parte di un elemento <gi>teiCorpus</gi></desc>
   <rng:ref name="teiHeader"/>
      <rng:ref name="model.resourceLike"/>
      <rng:ref name="text"/>
    <rng:ref name="text"/>
  <sch:ns prefix="tei" uri="http://www.tei-c.org/ns/1.0"/>
  <sch:ns prefix="rng" uri="http://relaxng.org/ns/structure/1.0"/>
  <attDef ident="version" usage="opt">
   <desc>The version of the TEI scheme</desc>
   <desc version="2008-01-30" xml:lang="ja">TEIスキームの版を示す.</desc>
   <desc version="2007-06-12" xml:lang="fr">la version du schéma TEI</desc>
   <desc version="2006-10-18" xml:lang="de"> Version des TEI-Schemas</desc>
   <desc version="2007-05-04" xml:lang="es">Versión del esquema TEI</desc>
   <desc version="2007-05-02" xml:lang="zh-tw">TEI架構的版本</desc>
   <desc version="2007-01-21" xml:lang="it">versione dello schema TEI</desc>
    <rng:data type="decimal"/>
   <valDesc>A number identifying the version of the TEI guidelines</valDesc>
       <title>The shortest TEI Document Imaginable</title>
       <p>First published as part of TEI P2.</p>
       <p>No source: this is an original work.</p>
      <p>This is about the shortest TEI document imaginable.</p>
  <p>This element is required.</p>
  <ptr target="#DS"/>
  <ptr target="#CCDEF"/>

32. Using oXygen editor with Chinese annotation

33. Example of reference documentation

34. Example of reference documentation in Japanese

35. Interface translation in Japanese

36. Reference documentation in Japanese, with German annotation

Date: 2008-03-05