--- 1/draft-ietf-iri-bidi-guidelines-01.txt 2012-03-09 19:13:59.478672315 +0100
+++ 2/draft-ietf-iri-bidi-guidelines-02.txt 2012-03-09 19:13:59.494670961 +0100
@@ -1,45 +1,46 @@
 Internationalized Resource Identifiers                         M. Duerst
 (iri)                                           Aoyama Gakuin University
 Internet-Draft                                               L. Masinter
 Intended status: BCP                                               Adobe
-Expires: September 3, 2012                                     A. Allawi
+Expires: September 10, 2012                                    A. Allawi
                                                   Diwan Software Limited
-                                                           March 2, 2012
+                                                           March 9, 2012

     Guidelines for Internationalized Resource Identifiers with Bi-
                   directional Characters (Bidi IRIs)
-                  draft-ietf-iri-bidi-guidelines-01
+                  draft-ietf-iri-bidi-guidelines-02

 Abstract

-   This specification gives guidelines for selection, use, presentation
-   of International Resource Identifiers (IRI) which include characters
-   with in inherent right-to-left (rtl) writing direction.
+   This specification gives guidelines for selection, use, and
+   presentation of International Resource Identifiers (IRIs) which
+   include characters with inherent right-to-left (rtl) writing
+   direction.

 Status of this Memo

    This Internet-Draft is submitted in full conformance with the
    provisions of BCP 78 and BCP 79.

    Internet-Drafts are working documents of the Internet Engineering
    Task Force (IETF). Note that other groups may also distribute
    working documents as Internet-Drafts. The list of current Internet-
    Drafts is at http://datatracker.ietf.org/drafts/current/.

    Internet-Drafts are draft documents valid for a maximum of six months
    and may be updated, replaced, or obsoleted by other documents at any
    time. It is inappropriate to use Internet-Drafts as reference
    material or to cite them other than as "work in progress."

-   This Internet-Draft will expire on September 3, 2012.
+   This Internet-Draft will expire on September 10, 2012.

 Copyright Notice

    Copyright (c) 2012 IETF Trust and the persons identified as the
    document authors. All rights reserved.
    This document is subject to BCP 78 and the IETF Trust's Legal
    Provisions Relating to IETF Documents
    (http://trustee.ietf.org/license-info) in effect on the date of
    publication of this document. Please review these documents
@@ -59,123 +60,137 @@
    outside the IETF Standards Process, and derivative works of it may
    not be created outside the IETF Standards Process, except to format
    it for publication as an RFC or to translate it into languages other
    than English.

 Table of Contents

    1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 3
      1.1. Notation . . . . . . . . . . . . . . . . . . . . . . . . . 3
    2. Logical Storage and Visual Presentation . . . . . . . . . . . . 3
-   3. Bidi IRI Structure . . . . . . . . . . . . . . . . . . . . . . 4
+   3. Bidi IRI Structure . . . . . . . . . . . . . . . . . . . . . . 5
    4. Input of Bidi IRIs . . . . . . . . . . . . . . . . . . . . . . 6
    5. Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
    6. IANA Considerations . . . . . . . . . . . . . . . . . . . . . . 8
    7. Security Considerations . . . . . . . . . . . . . . . . . . . . 8
    8. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . 8
-   9. References . . . . . . . . . . . . . . . . . . . . . . . . . . 8
-     9.1. Normative References . . . . . . . . . . . . . . . . . . . 8
+   9. References . . . . . . . . . . . . . . . . . . . . . . . . . . 9
+     9.1. Normative References . . . . . . . . . . . . . . . . . . . 9
      9.2. Informative References . . . . . . . . . . . . . . . . . . 9
    Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 9

 1. Introduction

    Some UCS characters, such as those used in the Arabic and Hebrew
-   scripts, have an inherent right-to-left (rtl) writing direction.
-   IRIs containing these characters (called bidirectional IRIs or Bidi
-   IRIs) require additional attention because of the non-trivial
-   relation between logical representation (used for digital
-   representation and for reading/spelling) and visual representation
-   (used for display/printing).
+   scripts, have an inherent right-to-left (rtl) writing direction as
+   opposed to characters, such as those in Latin scripts, that have an
+   inherent left-to-right (ltr) direction. IRIs containing rtl
+   characters (called bidirectional IRIs or Bidi IRIs) require
+   additional attention because of the non-trivial relation between
+   their logical and visual ordering. The logical order represents the
+   order in which the characters are read and stored on computers. The
+   visual order represents the order in which the characters are drawn
+   on a computer display or printout in the way a human expects to read
+   them.

-   Because of the complex interaction between the logical
+   Generally, alphabetic characters in scripts like Arabic and Hebrew
+   are drawn rtl while numbers are drawn ltr. Symbols, such as slash
+   '/' and period '.', take their visual direction from the surrounding
+   characters.
+
+   Because of this complex interaction between the logical
    representation, the visual representation, and the syntax of a Bidi
    IRI, a balance is needed between various requirements. The main
-   requirements are
+   requirements are:

    1. user-predictable conversion between visual and logical
       representation;

    2. the ability to include a wide range of characters in various
       parts of the IRI; and

    3. minor or no changes or restrictions for implementations.

 1.1. Notation

-   In this document, Bidi Notation is used for bidirectional examples:
-   Lower case letters stand for Latin letters or other letters that are
-   written left to right, whereas upper case letters represent Arabic or
-   Hebrew letters that are written right to left.
+   In this document, "Bidi Notation" is used for the given Bidi IRI
+   examples as follows: Lower case letters a-z stand for characters that
+   are written with a left to right ordering (such as Latin characters),
+   whereas upper case letters A-Z represent characters that are written
+   right to left (such as Arabic or Hebrew characters). Numbers and
+   symbols represent themselves.

    In this document, the key words "MUST", "MUST NOT", "REQUIRED",
    "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY",
    and "OPTIONAL" are to be interpreted as described in [RFC2119].

 2. Logical Storage and Visual Presentation

-   When stored or transmitted in digital representation, bidirectional
-   IRIs MUST be in full logical order and MUST conform to the IRI syntax
-   rules (which includes the rules relevant to their scheme). This
-   ensures that bidirectional IRIs can be processed in the same way as
-   other IRIs.
+   When stored or transmitted in digital representation, Bidi IRIs MUST
+   be in full logical order and MUST conform to the IRI syntax rules
+   (which includes the rules relevant to their scheme). This ensures
+   that Bidi IRIs can be processed in the same way as other IRIs.

-   Bidirectional IRIs MUST be rendered by using the Unicode
-   Bidirectional Algorithm [UNIV6], [UNI9]. Bidirectional IRIs MUST be
-   rendered in the same way as they would be if they were in a left-to-
-   right embedding; i.e., as if they were preceded by U+202A, LEFT-TO-
-   RIGHT EMBEDDING (LRE), and followed by U+202C, POP DIRECTIONAL
-   FORMATTING (PDF). Setting the embedding direction can also be done
-   in a higher-level protocol (e.g., the dir='ltr' attribute in HTML).
+   Bidi IRIs MUST be visually ordered by the Unicode Bidirectional
+   Algorithm [UNIV6], [UNI9]. Bidi IRIs MUST be rendered in the same
+   way as they would be if they were in a left-to-right embedding.

-   There is no requirement to use the above embedding if the display is
-   still the same without the embedding.
-   For example, a bidirectional
-   IRI in a text with left-to-right base directionality (such as used
-   for English or Cyrillic) that is preceded and followed by whitespace
-   and strong left-to-right characters does not need an embedding.
-   Also, a bidirectional relative IRI reference that only contains
-   strong right-to-left characters and weak characters and that starts
-   and ends with a strong right-to-left character and appears in a text
-   with right-to-left base directionality (such as used for Arabic or
-   Hebrew) and is preceded and followed by whitespace and strong
-   characters does not need an embedding.
+   In conformance with the Unicode Bidirectional Algorithm, embedding
+   MAY be done in one of two ways:

-   In some other cases, using U+200E, LEFT-TO-RIGHT MARK (LRM), may be
-   sufficient to force the correct display behavior. However, the
-   details of the Unicode Bidirectional algorithm are not always easy to
-   understand. Implementers are strongly advised to err on the side of
-   caution and to use embedding in all cases where they are not
-   completely sure that the display behavior is unaffected without the
-   embedding.
+   1. precede the IRI with U+202A, LEFT-TO-RIGHT EMBEDDING (LRE), and
+      follow with U+202C, POP DIRECTIONAL FORMATTING (PDF); or
+
+   2. use a higher-level protocol (e.g., the dir='ltr' attribute in
+      HTML).
+
+   Preceding and following the Bidi IRI with U+200E, LEFT-TO-RIGHT MARK
+   (LRM), is NOT RECOMMENDED, as there are cases where this may not be
+   sufficient to match full left-to-right embedding.
+
+   There is no requirement to use embedding if the display is still the
+   same without the embedding. For example, a Bidi IRI in a text with
+   left-to-right base directionality (such as used for English or
+   Cyrillic) that is preceded and followed by whitespace and strong
+   left-to-right characters does not need an embedding.
+   Also, a
+   bidirectional relative IRI reference that only contains strong right-
+   to-left characters and weak characters (such as symbols) and that
+   starts and ends with a strong right-to-left character and appears in
+   a text with right-to-left base directionality (such as used for
+   Arabic or Hebrew) and is preceded and followed by whitespace and
+   strong characters does not need an embedding.
+
+   However, implementers are RECOMMENDED to use embedding in all cases
+   where they are not completely sure that the display behavior is
+   unaffected without the embedding.

    The Unicode Bidirectional Algorithm ([UNI9], section 4.3) permits
    higher-level protocols to influence bidirectional rendering. Such
    changes by higher-level protocols MUST NOT be used if they change the
    rendering of IRIs.

    The bidirectional formatting characters that may be used before or
    after the IRI to ensure correct display are not themselves part of
    the IRI. IRIs MUST NOT contain bidirectional formatting characters
    (LRM, RLM, LRE, RLE, LRO, RLO, and PDF). They affect the visual
    rendering of the IRI but do not appear themselves. It would
    therefore not be possible to input an IRI with such characters
    correctly.

 3. Bidi IRI Structure

-   The Unicode Bidirectional Algorithm is designed mainly for running
-   text. To make sure that it does not affect the rendering of
-   bidirectional IRIs too much, some restrictions on bidirectional IRIs
-   are necessary. These restrictions are given in terms of delimiters
-   (structural characters, mostly punctuation such as "@", ".", ":", and
-   "/") and components (usually consisting mostly of letters and
-   digits).
+   The Unicode Bidirectional Algorithm is designed mainly for plain
+   text. To make sure that it does not affect the rendering of Bidi
+   IRIs outside of the requirements of this document, some restrictions
+   on Bidi IRIs are necessary.
+   These restrictions are given in terms of
+   delimiters (structural characters, mostly punctuation such as "@",
+   ".", ":", and "/") and components (usually consisting mostly of
+   letters and digits).

    The following syntax rules from the ABNF of [RFC3987bis] correspond
    to components for the purpose of Bidi behavior: iuserinfo, ireg-name,
    isegment, isegment-nz, isegment-nz-nc, ireg-name, iquery, and
    ifragment.

    Specifications that define the syntax of any of the above components
    MAY divide them further and define smaller parts to be components
    according to this document. As an example, the restrictions of
    [RFC3490] on bidirectional domain names correspond to treating each
@@ -219,22 +234,22 @@

 4. Input of Bidi IRIs

    Bidi input methods MUST generate Bidi IRIs in logical order while
    rendering them according to Section 2. During input, rendering
    SHOULD be updated after every new character is input to avoid end-
    user confusion.

 5. Examples

-   This section gives examples of bidirectional IRIs, in Bidi Notation.
-   It shows legal IRIs with the relationship between logical and visual
+   This section gives examples of Bidi IRIs in Bidi Notation. It shows
+   legal IRIs with the relationship between their logical and visual
    representation and explains how certain phenomena in this
    relationship may look strange to somebody not familiar with
    bidirectional behavior, but familiar to users of Arabic and Hebrew.
    It also shows what happens if the restrictions given in Section 3 are
    not followed. The examples below can be seen at [BidiEx], in Arabic,
    Hebrew, and Bidi Notation variants.

    To read the bidi text in the examples, read the visual representation
    from left to right until you encounter a block of rtl text. Read the
    rtl block (including slashes and other special characters) from right
@@ -267,42 +282,43 @@
    Visual representation: "http://DC.BA.ef/gh/LK/JI.html"

    Each sequence of rtl components is read rtl, in the same way as each
    sequence of rtl words in an ltr text is read rtl.
    Example 5: Example 2, applied to components of different kinds:
    Logical representation: "http://ab.cd.EF/GH/ij/kl.html"
    Visual representation: "http://ab.cd.HG/FE/ij/kl.html"
    The inversion of the domain name label and the path component may be
    unexpected, but it is consistent with other bidi behavior. For
    reassurance that the domain component really is "ab.cd.EF", it may be
-   helpful to read aloud the visual representation following the bidi
-   algorithm. After "http://ab.cd." one reads the RTL block
-   "E-F-slash-G-H", which corresponds to the logical representation.
+   helpful to read aloud the visual representation following the Unicode
+   Bidirectional Algorithm. After "http://ab.cd." one reads the RTL
+   block "E-F-slash-G-H", which corresponds to the logical
+   representation.

    Example 6: Same as Example 5, with more rtl components:
    Logical representation: "http://ab.CD.EF/GH/IJ/kl.html"
    Visual representation: "http://ab.JI/HG/FE.DC/kl.html"
    The inversion of the domain name labels and the path components may
    be easier to identify because the delimiters also move.

    Example 7: A single rtl component includes digits:
    Logical representation: "http://ab.CDE123FGH.ij/kl/mn/op.html"
    Visual representation: "http://ab.HGF123EDC.ij/kl/mn/op.html"
    Numbers are written ltr in all cases but are treated as an additional
    embedding inside a run of rtl characters. This is completely
    consistent with usual bidirectional text.

    Example 8 (not allowed): Numbers are at the start or end of an rtl
    component:
    Logical representation: "http://ab.cd.ef/GH1/2IJ/KL.html"
    Visual representation: "http://ab.cd.ef/LK/JI1/2HG.html"
-   The sequence "1/2" is interpreted by the bidi algorithm as a
+   The sequence "1/2" is interpreted by the Bidirectional Algorithm as a
    fraction, fragmenting the components and leading to confusion. There
    are other characters that are interpreted in a special way close to
    numbers; in particular, "+", "-", "#", "$", "%", ",", ".", and ":".
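
The reordering seen in Examples 5 through 8 can be approximated for Bidi Notation with a small Python sketch. This is a toy model under stated assumptions, not an implementation of the Unicode Bidirectional Algorithm: it assumes a left-to-right embedding, treats A-Z as rtl letters and a-z as ltr letters, and joins digits separated by a single "/", ",", ".", ":", "+", or "-" into one number. The names to_visual and reorder_run are hypothetical, and characters such as "#", "$", and "%" are not modeled.

```python
import re

# Toy model for Bidi Notation only (an assumption of this sketch, not the
# Unicode Bidirectional Algorithm): A-Z are rtl letters, a-z are ltr letters.

# An rtl run starts at an rtl letter and extends over rtl letters, digits,
# and symbols, provided it ends on an rtl letter or a digit.
RTL_RUN = re.compile(r"[A-Z](?:[^a-z]*[A-Z0-9])?")

# Inside a run, a number is a digit sequence; a single separator between two
# digit sequences joins them into one number (so "1/2" stays together).
# Every other character is a token of its own.
TOKEN = re.compile(r"\d+(?:[/,.:+\-]\d+)*|.")


def reorder_run(run: str) -> str:
    """Reverse one rtl run while keeping each number in ltr order."""
    return "".join(reversed(TOKEN.findall(run)))


def to_visual(logical: str) -> str:
    """Approximate the visual order of a Bidi Notation IRI (ltr embedding)."""
    return RTL_RUN.sub(lambda match: reorder_run(match.group()), logical)


if __name__ == "__main__":
    for logical in (
        "http://ab.cd.EF/GH/ij/kl.html",         # Example 5
        "http://ab.CD.EF/GH/IJ/kl.html",         # Example 6
        "http://ab.CDE123FGH.ij/kl/mn/op.html",  # Example 7
        "http://ab.cd.ef/GH1/2IJ/KL.html",       # Example 8
    ):
        print(logical, "->", to_visual(logical))
```

Applied to the logical representations of Examples 5 through 8, the sketch yields the visual representations listed above. In Example 8 the number "1/2" is kept as one unit inside the reversed run, which is what visually splits the components "GH1" and "2IJ".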
    Example 9 (not allowed): The numbers in the previous example are
    percent-encoded:
    Logical representation: "http://ab.cd.ef/GH%31/%32IJ/KL.html",
    Visual representation: "http://ab.cd.ef/LK/JI%32/%31HG.html"

    Example 10 (allowed but not recommended):
@@ -360,31 +375,31 @@
    [RFC3491] Hoffman, P. and M. Blanchet, "Nameprep: A Stringprep
              Profile for Internationalized Domain Names (IDN)",
              RFC 3491, March 2003.

    [RFC3987bis]
              Duerst, M., Masinter, L., and M. Suignard,
              "Internationalized Resource Identifiers (IRIs)",
              August 2011, .

-   [UNI9]    Davis, M., "The Bidirectional Algorithm", Unicode Standard
-             Annex #9, March 2004,
+   [UNI9]    Davis, M., "The Unicode Bidirectional Algorithm", Unicode
+             Standard Annex #9, March 2004,
              .

    [UNIV6]   The Unicode Consortium, "The Unicode Standard, Version
              6.0.0 (Mountain View, CA, The Unicode Consortium, 2011,
              ISBN 978-1-936213-01-6)", October 2010.

 9.2. Informative References

-   [BidiEx]  "Examples of bidirectional IRIs",
+   [BidiEx]  "Examples of Bidi IRIs",
              .

    [RFC3987] Duerst, M. and M. Suignard, "Internationalized Resource
              Identifiers (IRIs)", RFC 3987, January 2005.

 Authors' Addresses

    Martin Duerst
    Aoyama Gakuin University
    5-10-1 Fuchinobe