[Docs] [txt|pdf|xml|html] [Tracker] [WG] [Email] [Diff1] [Diff2] [Nits]

Versions: 00 01 02 03 04 05 RFC 6596

Network Working Group                                            M. Ohye
Internet-Draft                                                  J. Kupke
Intended status: Informational                             March 6, 2012
Expires: September 7, 2012


                      The Canonical Link Relation
                 draft-ohye-canonical-link-relation-05

Abstract

   RFC5988 specified a way to define relationships between links on the
   web.  This document describes a new type of such relationship,
   "canonical", to designate an IRI as preferred over resources with
   duplicative content.

Editorial Note (To be removed by RFC Editor)

   Distribution of this document is unlimited.  Comments should be sent
   to the IETF Apps-Discuss mailing list (see
   <https://www.ietf.org/mailman/listinfo/apps-discuss>).

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on September 7, 2012.

Copyright Notice

   Copyright (c) 2012 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents



Ohye & Kupke            Expires September 7, 2012               [Page 1]

Internet-Draft         The Canonical Link Relation            March 2012


   carefully, as they describe your rights and restrictions with respect
   to this document.  Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

1.  Introduction

   The canonical link relation specifies the preferred IRI from
   resources with duplicative content.  Common implementations of the
   canonical link relation are to specify the preferred version of an
   IRI from duplicate pages created with the addition of IRI parameters
   (e.g., session IDs), or to specify the single-page version as
   preferred over the same content separated on multiple component
   pages.

   In regard to the link relation type, "canonical" can be described
   informally as the author's preferred version of a resource.  More
   formally, the canonical link relation specifies the preferred IRI
   from a set of resources that return the context IRI's content in
   duplicated form.  Once specified, applications such as search engines
   can focus processing on the canonical, and references to the context
   (referring) IRI can be updated to reference the target (canonical)
   IRI.

2.  Notational Conventions

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in [RFC2119].

3.  The Canonical Link Relation

   The target (canonical) IRI MUST identify content that is either
   duplicative or a superset of the content at the context (referring)
   IRI.  Authors who declare the canonical link relation ought to
   anticipate that applications such as search engines can:

   o  Index content only from the target IRI (i.e. content from the
      context IRIs will be likely disregarded as duplicative)

   o  Consolidate IRI properties, such as link popularity, to the target
      IRI

   o  Display the target IRI as the representative IRI

   The target (canonical) IRI MAY:




Ohye & Kupke            Expires September 7, 2012               [Page 2]

Internet-Draft         The Canonical Link Relation            March 2012


   o  Specify a relative IRI (see [RFC3986] Section 4.2)

   o  Be self-referential (context IRI identical to target IRI)

   o  Exist on a different hostname or domain

   o  Have different scheme names, such as "http" to "https," or
      "gopher" to "ftp"

   o  Be a superset of the content at the context IRI

      *  As an example, each component page (e.g., page-1.html, page-
         2.html) of a multi-page article MAY specify the "view-all"
         version (e.g., page-all.html), the superset of their content,
         as the target IRI.  This is because the content from each
         component page is contained within the view-all version.  Given
         this implementation, applications can mark page-1.html and
         page-2.html as duplicates of page-all.html, process content
         only from page-all.html, and disregard the component pages.
         All references can then be made to the view-all version (page-
         all.html, the target IRI), and no content will have been lost
         in this process.

      *  Using the same example above, page-2.html SHOULD NOT designate
         page-1.html as the target (canonical) IRI because this may
         cause a loss of data.  When page-2.html desginates page-1.html
         as the canonical, only content from the target IRI, page-
         1.html, will be processed. page-2.html may be marked as a
         duplicate of page-1.html and its content disregarded.

   o  Be the source IRI of a temporary redirect.  For HTTP, this refers
      to status codes 302, 303, or 307 (Sections 10.3.3, 10.3.4, and
      10.3.8, respectively, of [RFC2616]).

   To better ensure that applications properly handle the canonical link
   relation, administrators ought to consider the following guidelines:

   o  Specify only one canonical link relation for a resource.  (It
      would be confusing to consider/label/designate more than one IRI
      as authoritative.)

   o  Avoid desginating the target (canonical) as:

      *  The source IRI of a permanent redirect (for HTTP, this refers
         to 300 and 301 response codes, defined in Sections 10.3.1 and
         10.3.2 of [RFC2616])





Ohye & Kupke            Expires September 7, 2012               [Page 3]

Internet-Draft         The Canonical Link Relation            March 2012


      *  An IRI that also specifies a canonical link relation to an IRI
         other than itself

      *  An IRI that returns an error code, such as 4xx response in HTTP
         (Section 10.4 of [RFC2616])

      *  The first page of a multi-page article or multi-page listing of
         items (since the first page is not duplicative or a superset of
         the context IRI).  For example, page-2.html and page-3.html of
         an article SHOULD NOT specify page-1.html as the canonical.
         This may cause a loss of data from page-2.html and page-3.html
         as they will be marked duplicative of page-1.html with only
         content from page-1.html being processed.

   When the canonical link relation is declared improperly, such as
   creating chained canonicals (i.e., target IRI specifies the source
   IRI of a permanent redirect) or designating a target IRI which
   returns a 4xx response, applications can use their own heuristics
   when processing the resource.  For instance, an application can
   choose to ignore any improper canonical designation and continue to
   process the remaining content on a page.

4.  Examples

   The following example illustrates:

   o  Three IRIs that serve duplicate content

   o  One IRI which is the canonical or "preferred version"

   o  Two IRIs with additional query parameters, making them the non-
      preferred version of the content (duplicates).  The canonical link
      relation is therefore specified on these duplicates.

   If the preferred version of a IRI and its content exists at:
   http://www.example.com/page.php?item=purse

   Then duplicate content IRIs such as:
   http://www.example.com/page.php?item=purse&category=bags
   http://www.example.com/page.php?item=purse&category=bags&sid=1234

   may designate the canonical link relation in HTML as specified in
   [REC-html401-19991224]:
   <link rel="canonical"
           href="http://www.example.com/page.php?item=purse">

   or as a relative IRI:
   <link rel="canonical" href="page.php?item=purse">



Ohye & Kupke            Expires September 7, 2012               [Page 4]

Internet-Draft         The Canonical Link Relation            March 2012


   or alternatively, in the HTTP header field as specified in Section 5
   of [RFC5988]:
   Link: <http://www.example.com/page.php?item=purse>; rel="canonical"

   This signals to applications, such as search engines, that these are
   duplicates of the target (canonical) IRI:
   http://www.example.com/page.php?item=purse.

   Applications may then select the canonical value as the display IRI
   (such as in search results), and additional IRI properties such as
   indexing and ranking signals, can be transferred as well.

5.  Recommendations

   Before adding the canonical link relation, verification of the
   following is RECOMMENDED:

   1.  The content of the context IRI is duplicated within the content
       of the target (canonical) IRI.

   2.  For HTTP, Permanent HTTP redirects (Section 10.3.2 of [RFC2616]),
       the traditional strong indicator that a IRI's content has been
       permanently moved, could not be implemented in place of the
       canonical link relation.

   3.  In the case where the target (canonical) IRI is a superset of
       content from the context IRI (i.e., the case where page-1.html
       and page-2.html designate page-all.html as the canonical), that
       the user experience is strongly taken into consideration, both in
       regard to possible increased load time and potential complexity
       in navigation.

6.  IANA Considerations

   IANA is asked to register the Canonical Link Relation below as per
   [RFC5988].

   Relation Name:

      CANONICAL

   Description:

      Designates the preferred version of a resource (the IRI and its
      contents).

   Reference:




Ohye & Kupke            Expires September 7, 2012               [Page 5]

Internet-Draft         The Canonical Link Relation            March 2012


      This specification.

   Notes:

      None.

   Application Data:

      None.

7.  Security Considerations

   When a site is compromised, the canonical link relation can be
   implemented with malicious intent to designate the attacker's IRI as
   the preferred version of the content.  While this technique is
   largely unnoticeable to humans, automated programs may cluster the
   compromised resource as duplicative of the attacker's target IRI,
   transferring properties such as link popularity away from the
   compromised resource to the attacker's designated canonical.
   (Naturally, even a site that is not compromised could provide
   inaccurate or misleading information about which URI is canonical.)

8.  Internationalisation Considerations

   In designating a canonical IRI, please see section 8 of [RFC5988] for
   information on URI encoding.

9.  Normative References

   [REC-html401-19991224]  Le Hors, A., Raggett, D., and I. Jacobs,
                           "HTML 4.01 Specification", W3C
                           Recommendation REC-html401-19991224,
                           December 1999, <http://www.w3.org/TR/1999/
                           REC-html401-19991224>.

                           Latest version available at
                           <http://www.w3.org/TR/html401>.

   [RFC2119]               Bradner, S., "Key words for use in RFCs to
                           Indicate Requirement Levels", BCP 14,
                           RFC 2119, March 1997.

   [RFC2616]               Fielding, R., Gettys, J., Mogul, J., Frystyk,
                           H., Masinter, L., Leach, P., and T. Berners-
                           Lee, "Hypertext Transfer Protocol --
                           HTTP/1.1", RFC 2616, June 1999.

   [RFC3986]               Berners-Lee, T., Fielding, R., and L.



Ohye & Kupke            Expires September 7, 2012               [Page 6]

Internet-Draft         The Canonical Link Relation            March 2012


                           Masinter, "Uniform Resource Identifier (URI):
                           Generic Syntax", STD 66, RFC 3986,
                           January 2005.

   [RFC5988]               Nottingham, M., "Web Linking", RFC 5988,
                           October 2010.

Appendix A.  Implementations

   Automated programs that implement functionality with regard for the
   canonical link relation include:

   o  Google, canonical link relation HTML and HTTP header support,
      within the same domain and across domains:

      *  <http://googlewebmastercentral.blogspot.com/2009/02/
         specify-your-canonical.html>

      *  <http://googlewebmastercentral.blogspot.com/2011/06/
         supporting-relcanonical-http-headers.html>

      *  <http://googlewebmastercentral.blogspot.com/2009/12/
         handling-legitimate-cross-domain.html>

   o  Yahoo, canonical link relation HTML support within the same
      domain:

      *  <http://www.ysearchblog.com/2009/02/12/
         fighting-duplication-adding-more-arrows-to-your-quiver/>

   o  Bing, canonical link relation HTML support within the same domain:

      *  <http://www.bing.com/community/site_blogs/b/webmaster/archive/
         2009/02/12/
         partnering-to-help-solve-duplicate-content-issues.aspx>

Authors' Addresses

   Maile Ohye

   EMail: maileohye@gmail.com
   URI:   http://maileohye.com/


   Joachim Kupke

   EMail: joachim@kupke.za.net




Ohye & Kupke            Expires September 7, 2012               [Page 7]


Html markup produced by rfcmarkup 1.107, available from http://tools.ietf.org/tools/rfcmarkup/