< draft-davis-t-langtag-ext-03.txt   draft-davis-t-langtag-ext-04.txt >
Internet Engineering Task Force M. Davis Internet Engineering Task Force M. Davis
Internet-Draft Google Internet-Draft Google
Intended status: Informational A. Phillips Intended status: Informational A. Phillips
Expires: January 12, 2012 Lab126 Expires: February 8, 2012 Lab126
Y. Umaoka Y. Umaoka
IBM IBM
C. Falk C. Falk
Infinite Automata Infinite Automata
July 11, 2011 August 7, 2011
BCP 47 Extension T - Transformed Content BCP 47 Extension T - Transformed Content
draft-davis-t-langtag-ext-03 draft-davis-t-langtag-ext-04
Abstract Abstract
This document specifies an Extension to BCP 47 which provides subtags This document specifies an Extension to BCP 47 which provides subtags
for specifying the source language or script of transformed content, for specifying the source language or script of transformed content,
including content that has been transliterated, transcribed, or including content that has been transliterated, transcribed, or
translated, or in some other way influenced by the source. It also translated, or in some other way influenced by the source. It also
provides for additional information used for identification. provides for additional information used for identification.
Status of this Memo Status of this Memo
skipping to change at page 1, line 39 skipping to change at page 1, line 39
Internet-Drafts are working documents of the Internet Engineering Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF). Note that other groups may also distribute Task Force (IETF). Note that other groups may also distribute
working documents as Internet-Drafts. The list of current Internet- working documents as Internet-Drafts. The list of current Internet-
Drafts is at http://datatracker.ietf.org/drafts/current/. Drafts is at http://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six months Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress." material or to cite them other than as "work in progress."
This Internet-Draft will expire on January 12, 2012. This Internet-Draft will expire on February 8, 2012.
Copyright Notice Copyright Notice
Copyright (c) 2011 IETF Trust and the persons identified as the Copyright (c) 2011 IETF Trust and the persons identified as the
document authors. All rights reserved. document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents Provisions Relating to IETF Documents
(http://trustee.ietf.org/license-info) in effect on the date of (http://trustee.ietf.org/license-info) in effect on the date of
publication of this document. Please review these documents publication of this document. Please review these documents
skipping to change at page 2, line 15 skipping to change at page 2, line 15
to this document. to this document.
Table of Contents Table of Contents
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 3 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.1. Requirements Language . . . . . . . . . . . . . . . . . . 4 1.1. Requirements Language . . . . . . . . . . . . . . . . . . 4
2. BCP47 Required Information . . . . . . . . . . . . . . . . . . 4 2. BCP47 Required Information . . . . . . . . . . . . . . . . . . 4
2.1. Introduction . . . . . . . . . . . . . . . . . . . . . . . 4 2.1. Introduction . . . . . . . . . . . . . . . . . . . . . . . 4
2.2. Structure . . . . . . . . . . . . . . . . . . . . . . . . 6 2.2. Structure . . . . . . . . . . . . . . . . . . . . . . . . 6
2.3. Canonicalization . . . . . . . . . . . . . . . . . . . . . 7 2.3. Canonicalization . . . . . . . . . . . . . . . . . . . . . 7
2.4. BCP47 Registration Form . . . . . . . . . . . . . . . . . 7 2.4. BCP47 Registration Form . . . . . . . . . . . . . . . . . 8
2.5. Field Definitions . . . . . . . . . . . . . . . . . . . . 7 2.5. Field Definitions . . . . . . . . . . . . . . . . . . . . 8
2.6. Registration of Field Subtags . . . . . . . . . . . . . . 9 2.6. Registration of Field Subtags . . . . . . . . . . . . . . 10
2.7. Machine-Readable Data . . . . . . . . . . . . . . . . . . 10 2.7. Registration of Additional Fields . . . . . . . . . . . . 10
3. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . 11 2.8. Committee Responses to Registration Proposals . . . . . . 10
4. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 12 2.9. Machine-Readable Data . . . . . . . . . . . . . . . . . . 11
5. Security Considerations . . . . . . . . . . . . . . . . . . . 12 3. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . 13
6. References . . . . . . . . . . . . . . . . . . . . . . . . . . 12 4. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 14
6.1. Normative References . . . . . . . . . . . . . . . . . . . 12 5. Security Considerations . . . . . . . . . . . . . . . . . . . 14
6.2. Informative References . . . . . . . . . . . . . . . . . . 12 6. References . . . . . . . . . . . . . . . . . . . . . . . . . . 14
Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 13 6.1. Normative References . . . . . . . . . . . . . . . . . . . 14
6.2. Informative References . . . . . . . . . . . . . . . . . . 14
Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 15
1. Introduction 1. Introduction
[BCP47] permits the definition and registration of language tag [BCP47] permits the definition and registration of language tag
extensions "that contain a language component and are compatible with extensions "that contain a language component and are compatible with
applications that understand language tags". This document defines applications that understand language tags". This document defines
an extension for specifying the source of content that has been an extension for specifying the source of content that has been
transformed, including text that has been transliterated, transformed, including text that has been transliterated,
transcribed, or translated, or in some other way influenced by the transcribed, or translated, or in some other way influenced by the
source. It may be used in queries to request content that has been source. It may be used in queries to request content that has been
transformed. The "singleton" identifier for this extension is 't'. transformed. The "singleton" identifier for this extension is 't'.
Language tags, as defined by [BCP47], are useful for identifying the Language tags, as defined by [BCP47], are useful for identifying the
language of content. There are mechanisms for specifying variant language of content. There are mechanisms for specifying variant
subtags for special purposes. However, these variants are subtags for special purposes. However, these variants are
insufficient for specifying content that has undergone insufficient for specifying content that has undergone
transformations, including content that has been transliterated, transformations, including content that has been transliterated,
transcribed, or translated. That is, for fully specifying such transcribed, or translated. The correct interpretation of the
content, it is important to specify the source language and/or content may depend upon knowledge of the conventions used for the
script. In addition, it may also be important to identify a transformation.
particular specification for the transformation.
For example, suppose that Italian or Russian cities on a map are Suppose that Italian or Russian cities on a map are transcribed for
transcribed for Japanese users. Each name needs to be transliterated Japanese users. Each name needs to be transliterated into katakana
into katakana using rules appropriate for the specific source and using rules appropriate for the specific source and target language.
target language. When tagging such data, it is important to be able When tagging such data, it is important to be able to indicate not
to indicate not only the resulting content language ("ja" in this only the resulting content language ("ja" in this case), but also the
case), but also the source language. source language.
Transforms such as transliteration may vary depending not only on the Transforms such as transliterations may vary depending not only on
basis of the source and target script, but also on language. Thus the basis of the source and target script, but also on the source and
the Russian <U+041F U+0443 U+0442 U+0438 U+043D> (which corresponds target language. Thus the Russian <U+041F U+0443 U+0442 U+0438
to the Cyrillic <PE, U, TE, I, EN>) transliterates into "Putin" in U+043D> (which corresponds to the Cyrillic <PE, U, TE, I, EN>)
English but "Poutine" in French. The identifier could be used to transliterates into "Putin" in English but "Poutine" in French. The
indicate a desired mechanical transformation in an API, or could be identifier could be used to indicate a desired mechanical
used to tag data that has been converted (mechanically or by hand) transformation in an API, or could be used to tag data that has been
according to a transliteration method. converted (mechanically or by hand) according to a transliteration
method.
In addition, many different conventions have arisen for how to
transform text, even between the same languages and scripts. For
example, "Gaddafi" is commonly transliterated from Arabic to English
as any of (G/Q/K/Kh)a(d/dh/dd/dhdh/th/zz)af(i/y). Some examples of
standardized conventions used for transcribing or transliterating
text include:
a. United Nations Group of Experts on Geographical Names (UNGEGN)
b. US Library of Congress (LOC)
c. US Board on Geographic Names (BGN)
d. Korean Ministry of Culture, Sports and Tourism (MCST)
e. International Organization for Standardization (ISO)
The usage of this extension is not limited to formal transformations, The usage of this extension is not limited to formal transformations,
and may include other instances where the content is in some other and may include other instances where the content is in some other
way influenced by the source. For example, this extension could be way influenced by the source. For example, this extension could be
used to designate a request for a speech recognizer that is tailored used to designate a request for a speech recognizer that is tailored
specifically for 2nd-language speakers who are 1st-language speakers specifically for 2nd-language speakers who are 1st-language speakers
of a particular language (e.g. a recognizer for "English spoken with of a particular language (e.g. a recognizer for "English spoken with
a Chinese accent"). a Chinese accent").
1.1. Requirements Language 1.1. Requirements Language
skipping to change at page 4, line 47 skipping to change at page 5, line 15
| | transformed from the Cyrillic script. | | | transformed from the Cyrillic script. |
+---------------------+---------------------------------------------+ +---------------------+---------------------------------------------+
Note that the sequence of subtags governed by 't' cannot contain a Note that the sequence of subtags governed by 't' cannot contain a
singleton (a single-character subtag), because that would start a new singleton (a single-character subtag), because that would start a new
extension. For example, the tag "ja-t-i-ami" does not indicate that extension. For example, the tag "ja-t-i-ami" does not indicate that
the source is in "i-ami", because "i-ami" is not a regular language the source is in "i-ami", because "i-ami" is not a regular language
tag in [BCP47]. That tag would express an empty 't' extension tag in [BCP47]. That tag would express an empty 't' extension
followed by an 'i' extension. followed by an 'i' extension.
It is sometimes necessary to indicate additional information about The t extension is not intended for use in structured data that
the transformation. This additional information is optionally already provides separate source and target language identifiers.
supplied after the source in a series of one or more fields, where For example, this is the case in localization interchange formats
each field consists of a field separator subtag followed by one or such as XLIFF. In such cases, it would be inappropriate to use "ja-
more non-separator subtags. Each field separator subtag consists of t-it" for the target language tag because the source language tag
a single letter followed by a single digit. "it" would already be present in the data. Instead one would use the
language tag "ja".
As noted earlier, it is sometimes necessary to indicate additional
information about a transformation. This additional information is
optionally supplied after the source in a series of one or more
fields, where each field consists of a field separator subtag
followed by one or more non-separator subtags. Each field separator
subtag consists of a single letter followed by a single digit.
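The field structure described above can be illustrated with a short Python sketch. It is not part of the draft: the function name split_t_extension and its return shape are assumptions made for illustration, and it deliberately ignores edge cases such as private-use ('x') subtags and full BCP 47 validation of the source tag.

```python
import re

# A field separator subtag is a single letter followed by a single digit.
FIELD_SEP = re.compile(r"[a-z][0-9]")


def split_t_extension(tag):
    """Split a tag into (target, source, fields) around the 't' singleton.

    Hypothetical helper for illustration only; it assumes a well-formed
    tag and does not handle private-use sequences.
    """
    subtags = tag.lower().split("-")
    if "t" not in subtags:
        return "-".join(subtags), "", {}
    t_index = subtags.index("t")
    target = "-".join(subtags[:t_index])
    source, fields, current = [], {}, None
    for subtag in subtags[t_index + 1:]:
        if len(subtag) == 1:
            break  # any other singleton starts a new extension
        if FIELD_SEP.fullmatch(subtag):
            current = subtag  # start of a new field such as 'm0'
            fields[current] = []
        elif current is None:
            source.append(subtag)  # still inside the source language tag
        else:
            fields[current].append(subtag)
    return target, "-".join(source), fields
```

With this sketch, "und-Cyrl-t-und-latn-m0-ungegn" splits into the target "und-cyrl", the source "und-latn", and one 'm0' field holding "ungegn", while "ja-t-i-ami" yields an empty source because 'i' is read as the start of another extension.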
A transformation mechanism is an optional field that indicates the A transformation mechanism is an optional field that indicates the
specification used for the transformation, such as "UNGEGN" for the specification used for the transformation, such as "UNGEGN" for the
United Nations Group of Experts on Geographical Names United Nations Group of Experts on Geographical Names
transliterations and transcriptions. It uses the 'm0' field transliterations and transcriptions. It uses the 'm0' field
separator followed by certain subtags. separator followed by certain subtags.
For example: For example:
+------------------------------------+------------------------------+ +------------------------------------+------------------------------+
skipping to change at page 8, line 21 skipping to change at page 9, line 4
One field is initially specified in [UTS35]: the transform mechanism. One field is initially specified in [UTS35]: the transform mechanism.
That field is summarized here: That field is summarized here:
a. The transform mechanism consists of a sequence of subtags a. The transform mechanism consists of a sequence of subtags
starting with the 'm0' separator followed by one or more starting with the 'm0' separator followed by one or more
mechanism subtags. Each mechanism subtag has a length of 3 to 8 mechanism subtags. Each mechanism subtag has a length of 3 to 8
alphanumeric characters. The sequence as a whole provides an alphanumeric characters. The sequence as a whole provides an
identification of the specification for the transform, such as identification of the specification for the transform, such as
the mechanism subtag 'UNGEGN' in "und-Cyrl-t-und-latn-m0-ungegn". the mechanism subtag 'UNGEGN' in "und-Cyrl-t-und-latn-m0-ungegn".
In many cases, only one mechanism subtag is necessary, but In many cases, only one mechanism subtag is necessary, but
multiple subtags MAY be defined in [UTS35] where necessary. multiple subtags MAY be defined in [UTS35] where necessary.
b. Any purely numeric subtag is a representation of a date in the b. Any purely numeric subtag is a representation of a date in the
Gregorian calendar. It MAY occur in any mechanism field. If it Gregorian calendar. It MAY occur in any mechanism field, but it
does occur: SHOULD only be used where necessary. If it does occur:
* it MUST occur as the final subtag in the field, * it MUST occur as the final subtag in the field
* it MUST NOT be the only subtag in the field, and * it MUST NOT be the only subtag in the field
* it MUST consist of a sequence of digits of the form YYYY, * it MUST consist of a sequence of digits of the form YYYY,
YYYYMM, or YYYYMMDD. YYYYMM, or YYYYMMDD
For example, 20110623 represents June 23th, 2011. A date subtag * it SHOULD be as short as possible
SHOULD only be used where necessary, and then SHOULD be as short
as possible. For example, suppose that the BGN transliteration Examples:
specification for Cyrillic to Latin had three versions, dated
June 11th, 1999; Dec 30th, 1999; and May 1st, 2011. In that * 20110623 represents June 23rd, 2011.
case, the corresponding first two DATE subtags would require
months to be distinctive (199906 and 199912), but the last subtag * There are 3 dated versions of the UNGEGN transliteration
would only require the year (2011). specification for Hebrew to Latin. They can be represented by
the following language tags:
+ und-Hebr-t-und-Latn-m0-ungegn-1972
+ und-Hebr-t-und-Latn-m0-ungegn-1977
+ und-Hebr-t-und-Latn-m0-ungegn-2007
* Suppose that the BGN transliteration specification for
Cyrillic to Latin had three versions, dated June 11th, 1999;
Dec 30th, 1999; and May 1st, 2011. In that case, the
corresponding first two DATE subtags would require months to
be distinctive (199906 and 199912), but the last subtag would
only require the year (2011).
c. Some mechanisms may use a versioning system that is not c. Some mechanisms may use a versioning system that is not
distinguished by date, or not by date alone. In the latter case, distinguished by date, or not by date alone. In the latter case,
the version will be of a form specified by [UTS35] for that the version will be of a form specified by [UTS35] for that
mechanism. For example, if the mechanism XXX uses versions of mechanism. For example, if the mechanism XXX uses versions of
the form v21a, then a tag could look like "ja-t-it-m0-xxx-v21a". the form v21a, then a tag could look like "ja-t-it-m0-xxx-v21a".
If there are multiple subversions distinguished by date, then a If there are multiple subversions distinguished by date, then a
tag could look like "ja-t-it-m0-xxx-v21a-2007". tag could look like "ja-t-it-m0-xxx-v21a-2007".
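The constraints in items a and b can be checked mechanically. The following Python sketch is illustrative only: the name check_mechanism_field is hypothetical, the input is assumed to be the list of subtags that follow 'm0', and calendar validity of the month and day is not checked.

```python
import re

# Each mechanism subtag has 3 to 8 alphanumeric characters.
MECHANISM_SUBTAG = re.compile(r"[a-z0-9]{3,8}")
# A date subtag has the form YYYY, YYYYMM, or YYYYMMDD.
DATE_SUBTAG = re.compile(r"[0-9]{4}|[0-9]{6}|[0-9]{8}")
NUMERIC = re.compile(r"[0-9]+")


def check_mechanism_field(subtags):
    """Return a list of problems found in the subtags of an 'm0' field.

    Hypothetical helper; an empty list means no problem was detected.
    """
    if not subtags:
        return ["field has no subtags"]
    problems = []
    last = len(subtags) - 1
    for position, raw in enumerate(subtags):
        subtag = raw.lower()
        if not MECHANISM_SUBTAG.fullmatch(subtag):
            problems.append(f"{raw}: not 3 to 8 alphanumeric characters")
        if NUMERIC.fullmatch(subtag):
            # Purely numeric subtags are treated as dates.
            if not DATE_SUBTAG.fullmatch(subtag):
                problems.append(f"{raw}: not YYYY, YYYYMM, or YYYYMMDD")
            if position != last:
                problems.append(f"{raw}: date is not the final subtag")
            if len(subtags) == 1:
                problems.append(f"{raw}: date is the only subtag")
    return problems
```

For example, the subtags of "m0-ungegn-2007" and "m0-xxx-v21a-2007" pass, while a field consisting only of "2007" is reported because a date must not be the only subtag.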
A language tag with the t extension MAY be used to request a specific A language tag with the t extension MAY be used to request a specific
skipping to change at page 9, line 39 skipping to change at page 10, line 37
| | versioning conventions used for the mechanism. | | | versioning conventions used for the mechanism. |
+-------------+-----------------------------------------------------+ +-------------+-----------------------------------------------------+
Proposals for clarifications of descriptions or additional aliases Proposals for clarifications of descriptions or additional aliases
may also be requested by filing a ticket. may also be requested by filing a ticket.
The committee MAY define a template for submissions that requests The committee MAY define a template for submissions that requests
more information, if it is found that such information would be more information, if it is found that such information would be
useful in evaluating proposals. useful in evaluating proposals.
2.7. Registration of Additional Fields
In the event that it proves necessary to add an additional field
(such as 'm2'), it can be requested by filing a ticket at
cldr.unicode.org [2]. The proposal in the ticket MUST contain a full
description of the proposed field semantics and subtag syntax, and
MUST conform to the ABNF syntax for "field" presented in
Section 2.2.
2.8. Committee Responses to Registration Proposals
The committee MUST post each proposal publicly within 2 weeks after The committee MUST post each proposal publicly within 2 weeks after
reception, to allow for comments. The committee must respond reception, to allow for comments. The committee must respond
publicly to each proposal within 4 weeks after reception. publicly to each proposal within 4 weeks after reception.
The response MAY: The response MAY:
o request more information or clarification o request more information or clarification
o accept the proposal, optionally with modifications to the subtag o accept the proposal, optionally with modifications to the subtag
or description or description
o reject the proposal, because of significant objections raised on o reject the proposal, because of significant objections raised on
the mailing list or due to problems with constraints in this the mailing list or due to problems with constraints in this
document or in [UTS35] document or in [UTS35]
Accepted tickets result an a new entry in the machine-readable CLDR Accepted tickets result in a new entry in the machine-readable CLDR
BCP47 data, or in the case of a clarified description, modifications BCP47 data, or in the case of a clarified description, modifications
to the description attribute value for an existing entry. to the description attribute value for an existing entry.
2.7. Machine-Readable Data 2.9. Machine-Readable Data
EDITORIAL NOTE: The following parallels the structure used for the EDITORIAL NOTE: The following parallels the structure used for the
'u' extension [RFC6067], for which the Unicode Consortium is the 'u' extension [RFC6067], for which the Unicode Consortium is the
maintaining authority. The data and specification will be available maintaining authority. The data and specification will be available
by the time this internet draft has been approved. The description by the time this internet draft has been approved. The description
field is in the process of being added to CLDR. field is in the process of being added to CLDR.
Beginning with CLDR version 1.7.2, machine-readable files are Beginning with CLDR version 1.7.2, machine-readable files are
available listing the data defined for BCP47 extensions for each available listing the data defined for BCP47 extensions for each
successive version of [UTS35]. These releases are listed on successive version of [UTS35]. These releases are listed on
skipping to change at page 11, line 15 skipping to change at page 12, line 22
| | with all and only that | Group of Experts on | | | with all and only that | Group of Experts on |
| | information necessary to | Geographical Names; | | | information necessary to | Geographical Names; |
| | distinguish one name from | American Library | | | distinguish one name from | American Library |
| | others with which it might be | Association-Library | | | others with which it might be | Association-Library |
| | confused. Descriptions are | of Congress | | | confused. Descriptions are | of Congress |
| | not intended to provide | | | | not intended to provide | |
| | general background | | | | general background | |
| | information. | | | | information. | |
| since | Indicates the first version | 1.9, 2.0.1 | | since | Indicates the first version | 1.9, 2.0.1 |
| | of CLDR where the name | | | | of CLDR where the name | |
| | appears. | | | | appears. (Required for new | |
| | items.) | |
| alias | Alternative name of the key | | | alias | Alternative name of the key | |
| | or type, not limited in | | | | or type, not limited in | |
| | number of characters. | | | | number of characters. | |
| | Aliases are intended for | | | | Aliases are intended for | |
| | backwards compatibility, not | | | | backwards compatibility, not | |
| | to provide all possible | | | | to provide all possible | |
| | alternate names or | | | | alternate names or | |
| | designations | | | | designations. (Optional) | |
+-------------+-------------------------------+---------------------+ +-------------+-------------------------------+---------------------+
The file for the transform extension is "transform.xml". The initial
version of that file contains the following information.
<key extension="t" name="m0" description=
"Transliteration extension mechanism">
<type name="ungegn" description=
"United Nations Group of Experts on Geographical Names"/>
<type name="alaloc" description=
"American Library Association-Library of Congress"/>
<type name="bgn" description=
"US Board on Geographic Names"/>
<type name="mcst" description=
"Korean Ministry of Culture, Sports and Tourism"/>
<type name="iso" description=
"International Organization for Standardization"/>
<type name="din" description=
"Deutsches Institut fuer Normung"/>
<type name="gost" description=
"Euro-Asian Council for Standardization, Metrology
and Certification"/>
</key>
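As a rough illustration of how such data might be consumed, the Python sketch below reads an abridged, well-formed copy of the listing with the standard library. The SAMPLE string and the function name mechanism_descriptions are assumptions for this example; the parser used here is non-validating, so it does not apply the DTD.

```python
import xml.etree.ElementTree as ET

# Abridged copy of the 'm0' key, written as well-formed XML for this sketch.
SAMPLE = """\
<key extension="t" name="m0" description="Transliteration extension mechanism">
  <type name="ungegn" description="United Nations Group of Experts on Geographical Names"/>
  <type name="bgn" description="US Board on Geographic Names"/>
</key>
"""


def mechanism_descriptions(xml_text):
    """Map each type name under the key to its description (hypothetical helper)."""
    key = ET.fromstring(xml_text)
    return {t.get("name"): t.get("description") for t in key.iter("type")}
```

Calling mechanism_descriptions(SAMPLE) gives a dictionary keyed by mechanism subtag, which could be used to check whether a subtag following 'm0' is one of the listed types.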
To get the version information in XML when working with the data To get the version information in XML when working with the data
files, the XML parser must be validating. When the 'core.zip' file files, the XML parser must be validating. When the 'core.zip' file
is unzipped, the 'dtd' directory will be at the same level as the is unzipped, the 'dtd' directory will be at the same level as the
'bcp47' directory; that is required for correct validation. For each 'bcp47' directory; that is required for correct validation. For each
release after CLDR 1.8, types introduced in that release are also release after CLDR 1.8, types introduced in that release are also
marked in the data files by the XML attribute "since", such as in the marked in the data files by the XML attribute "since", such as in the
following example: following example:
<type name="adp" since="1.9"/> <type name="adp" since="1.9"/>
The data is also currently maintained in a source code repository, The data is also currently maintained in a source code repository,
 End of changes. 21 change blocks. 
56 lines changed or deleted 131 lines changed or added
