By Gershon Joseph and Rodolfo Raya
27 August, 2007.
The original white paper published by the OASIS DITA Translation Subcommittee is available as PDF here.
OASIS (Organization for the Advancement of Structured Information Standards) is a not-for-profit, international consortium that drives the development, convergence, and adoption of e-business standards. Members themselves set the OASIS technical agenda, using a lightweight, open process expressly designed to promote industry consensus and unite disparate efforts. The consortium produces open standards for Web services, security, e-business, and standardization efforts in the public sector and for application-specific markets. OASIS was founded in 1993. More information can be found on the OASIS website at http://www.oasis-open.org.
The purpose of the OASIS DITA Technical Committee (TC) is to define and maintain the Darwin Information Typing Architecture (DITA) and to promote the use of the architecture for creating standard information types and domain-specific markup vocabularies. The Translation Subcommittee defines best practices and guidelines for DITA authoring, translation and localization, and recommends solutions for industry requirements for consideration by the OASIS DITA TC. The group recommends widespread adoption of these concepts through liaisons with industry, other standards, and providers of commercial and open source tools.
Many organizations have previously translated content that was authored in non-XML tools, such as desktop publishing applications. When an organization migrates its legacy content into the new DITA authoring environment, what should it do about its legacy translation memory? This legacy translation memory (TM) represents a large financial investment that cannot simply be discarded because a new authoring architecture is being adopted.
This paper describes best practices that will help organizations use their legacy TM for future translation projects that are authored in DITA XML. These practices will allow them to minimize the expense of ongoing translations of XML-based content.
In general, there is no need to translate the existing content after migration to DITA before adding new content to the documents. Without following these best practices, the conversion to DITA will be more expensive for each language.
This practice assumes that content reuse in DITA will be based on the use of the conref mechanism rather than on the use of user-defined entities. The discussion about conrefs applies as well to user-defined entities.
This section describes the recommended process at a high level, independent of the tools used and the features they support. The best practice recommends segmenting the TM at the sentence level, which achieves better matching, both during the migration of content to DITA and before the DITA content is translated.
In general, sentence-level segmentation provides better matching. However, working with segmentation at the block or paragraph level improves the quality of the translation. For example, you may need three sentences in Spanish to translate two English sentences. The resulting Spanish translation will read better if the paragraph is translated as a block instead of as isolated sentences. Therefore, you may want to set the TM back to block segmentation following the transform to DITA.
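The trade-off between the two segmentation levels can be sketched in a few lines of Python. This is only an illustration: the regular expression is a simplistic stand-in for real segmentation rules, and the function names and sample sentences are hypothetical, not taken from any CAT tool.

```python
import re

# Illustrative rule only: break after ., ! or ? when followed by whitespace.
# Production tools use configurable segmentation rules instead.
SENTENCE_BREAK = re.compile(r"(?<=[.!?])\s+")

def segment_sentences(block):
    """Split one block of text into sentence-level segments."""
    return [s for s in SENTENCE_BREAK.split(block.strip()) if s]

def segment_block(block):
    """Treat the whole block as a single segment."""
    return [block.strip()]

def exact_hits(segments, memory):
    """Return the segments that already exist in the memory."""
    return [s for s in segments if s in memory]

# Hypothetical legacy paragraph and its edited form after migration.
legacy = "Open the cover. Remove the battery."
migrated = "Open the cover. Remove the battery before cleaning."

sentence_hits = exact_hits(segment_sentences(migrated), set(segment_sentences(legacy)))
block_hits = exact_hits(segment_block(migrated), set(segment_block(legacy)))

print(sentence_hits)  # ['Open the cover.']
print(block_hits)     # []
```

With sentence-level segments the unchanged first sentence is still found in the memory, whereas the block-level segment no longer matches once any sentence in the paragraph has changed.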
The process includes adjusting the tagging and segmentation rules of your TM so that it is better aligned with the DITA content. Creating a better-aligned TM should improve TM matching by 10-20%. Whether the effort and expense are worthwhile depends on the size of the DITA documents to be translated and the number of target languages.
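Why tag alignment affects matching can be shown with a small sketch. It assumes a legacy segment whose inline formatting was stored as a hypothetical `<b>` tag, while the migrated content uses the DITA `<uicontrol>` element. The similarity score comes from Python's `difflib` and is only a stand-in for whatever scoring a real CAT tool applies.

```python
import re
from difflib import SequenceMatcher

# Matches any inline tag such as <b>, </b> or <uicontrol>.
INLINE_TAG = re.compile(r"<[^>]+>")

def strip_inline_tags(segment):
    """Remove inline markup so that only the translatable text is compared."""
    return INLINE_TAG.sub("", segment)

def similarity(a, b):
    """Return a ratio between 0.0 and 1.0; 1.0 means the strings are identical."""
    return SequenceMatcher(None, a, b).ratio()

# Hypothetical segments: same text, different inline tagging.
legacy_segment = "Press <b>Start</b> to begin."
dita_segment = "Press <uicontrol>Start</uicontrol> to begin."

raw_score = similarity(legacy_segment, dita_segment)
aligned_score = similarity(strip_inline_tags(legacy_segment),
                           strip_inline_tags(dita_segment))

print(raw_score < 1.0)       # True: the differing tags lower the score
print(aligned_score == 1.0)  # True: the text itself is identical
```

Stripping tags is the crudest form of alignment. Mapping legacy tags to their DITA equivalents would preserve the inline structure while still removing the mismatch.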
Please take time to acquaint yourself with the relevant localization industry standards. Open standards allow you to have more choice and flexibility when establishing the best way to handle the transition to DITA. A full list is provided at the end of this document. At the forefront of these standards are XLIFF for exchanging localization text with language service providers and xml:tm, which takes the reuse principle down to the sentence level and integrates perfectly with DITA.
If you are beginning with non-XML content, most likely in a desktop publishing application, do the following:
The contents of a DITA non-inline element, for example <p>, <section> and <table>.
Computer Aided Translation (CAT) helps the translator translate the source content. CAT tools usually leverage Translation Memory to match sentences and inline phrases that were previously translated. In addition, some CAT tools use Machine Translation to translate glossary and other company-specific terms (extracted from a terminology database).
The level of accuracy with which CAT tools can match content being translated to the TM. The levels of matching are defined as follows:
Machine Translation (MT) is a technology that translates content directly from source without human intervention. Used in isolation, MT usually generates an unusable translation. However, when integrated into a CAT tool to translate specific terminology, MT is a useful technology.
Translation Memory (TM) is a technology that reuses translations previously stored in the database used by the translation tool. TM preserves the translation output for reuse with subsequent translations.
Translation Memory eXchange (TMX) is an industry standard format for exchanging TM between CAT tools.
XML Localisation Interchange File Format (XLIFF) is a document format used for the interchange of translatable text between CAT tools.
Segmentation Rules eXchange (SRX) is an industry standard for establishing and exchanging sentence-level breaks.
XML Based Text Memory (xml:tm) is an industry standard that takes the DITA principle of text reuse (both author and translation memory) down to the sentence level. It also provides a standard mechanism for establishing in-context exact matching.
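To make the TMX entry above more concrete, the following sketch reads a minimal TMX document into a Python dictionary using only the standard library. The sample content, the `example` tool name, and the `load_tm` function are assumptions made for illustration. A real TM export would contain many more translation units and may carry inline markup inside each segment.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Minimal, hypothetical TMX document with one translation unit.
TMX_SAMPLE = """<tmx version="1.4">
  <header creationtool="example" creationtoolversion="1.0" segtype="sentence"
          o-tmf="example" adminlang="en" srclang="en" datatype="xml"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Open the cover.</seg></tuv>
      <tuv xml:lang="es"><seg>Abra la tapa.</seg></tuv>
    </tu>
  </body>
</tmx>"""

def load_tm(tmx_text, source_lang, target_lang):
    """Map each source-language segment to its target-language segment."""
    memory = {}
    root = ET.fromstring(tmx_text)
    for tu in root.iter("tu"):
        # Collect the text of every language variant in this translation unit.
        variants = {
            tuv.get(XML_LANG): "".join(tuv.find("seg").itertext())
            for tuv in tu.findall("tuv")
        }
        if source_lang in variants and target_lang in variants:
            memory[variants[source_lang]] = variants[target_lang]
    return memory

memory = load_tm(TMX_SAMPLE, "en", "es")
print(memory["Open the cover."])  # Abra la tapa.
```

The `segtype` attribute in the header records the segmentation level of the memory, which is the setting this paper recommends switching between sentence and block.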