Unofficial Nameprep/IDNA/Stringprep Site

Scribe: Erik van der Poel (ideas and contributions from many, though they are not responsible)

1. Introduction
2. Summary of Issues
3. Homographs
3.1. Punctuation Mark Homographs
3.2. Letter and Digit Homographs
3.3. TrueType Homographs
4. Display Issues
5. Unicode Mapping and Normalization
6. Unicode Version
7. Stringprep
8. ACE Prefix
8.1. Whether the ACE Prefix Must Be Changed
8.2. How to Change the ACE Prefix
9. Recommendations for Registries
10. Recommendations for Applications
11. Related Work
12. Acknowledgements
13. Spec Issues

1. Introduction

This is an unofficial site for collecting issues and information regarding Nameprep (RFC 3491), IDNA (RFC 3490) and Stringprep (RFC 3454). It is not intended to be a list of errata. Some of these issues may be resolved by making changes to the IDNA-related specs and guidelines. Revising Stringprep is quite another matter, but it is also discussed briefly.

The official errata are: IDNA errata (for RFC 3490) and Punycode errata (for RFC 3492).

This document was created when it started to become difficult to keep track of all the issues discussed on the idn@ops.ietf.org mailing list. Comments are welcome. Please send them to idn@ops.ietf.org or to erik@vanderpoel.org.

2. Summary of Issues

A homograph is a character that resembles another. Before the RFCs were published, the IDN Working Group had already identified and considered the problem that many Unicode characters, including letters and punctuation marks, look similar to one another. The punctuation mark homograph issue has nevertheless received new attention recently, since it was pointed out on the IDN mailing list that Unicode contains a slash homograph that Nameprep does not prohibit and that could be used in a deep subdomain inside a URI to fool users.

There are problems with the display of certain Unicode characters in some implementations.

The letter homograph issue continues to receive attention. We may wish to consider mapping some of these homographs to other (base) characters or to provide more detailed warnings in any RFC revisions.

Experience in the field has shown that some of Unicode's mappings and normalizations confuse both registrants and users. This is not related to spoofing, since these characters no longer appear once Nameprep has been applied. However, a registrant may be confused when the input character is transformed into a different one. A Nameprep revision might be able to address this issue.

New Unicode versions continue to add characters. Any RFC revision should consider the careful addition of new characters, mappings and normalizations from the latest Unicode version.

3. Homographs

3.1. Punctuation Mark Homographs

A recent email to the IDN mailing list focussed attention on the danger of punctuation marks in subdomains. Here is an example of a spoofed punctuation mark:
  http://good.com/very.evil.biz/phish.html

Here, the slash (/) after good.com is actually a homograph of the real ASCII slash, and is being used to trick the user into thinking that they are accessing the good.com domain. It is not possible to solve this problem at the level of a TLD or 2LD registry since any domain owner can create such a spoofed name at any level.

Prior to the advent of IDNA, applications solved the punctuation mark problem by avoiding such characters almost entirely. RFC 952 specified the LDH rule (Letters, Digits, Hyphen), and RFC 1123 (STD 3) relaxed it a bit, to allow a digit at the beginning too.

One possible solution for the punctuation mark spoofing problem is to restrict IDNs to Unicode characters of category Lx (letters), Mx (marks) and Nx (numbers), and the hyphen. [Check apostrophe (French, 2019, 0027, 02BC), middle dot in Catalan and palochka.]
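The category-based restriction suggested above can be sketched as follows. This is only an illustration, assuming Python's standard unicodedata module; the function names allowed_in_idn and rejected_chars are hypothetical, and the rule is applied to a single label, not to a whole domain name (the dot separator is handled elsewhere).

```python
import unicodedata

def allowed_in_idn(ch):
    """Hypothetical rule: allow letters (L*), marks (M*), numbers (N*) and the hyphen."""
    if ch == "-":
        return True
    return unicodedata.category(ch)[0] in ("L", "M", "N")

def rejected_chars(label):
    """Return the code points in a label that the rule above would reject."""
    return [f"U+{ord(ch):04X}" for ch in label if not allowed_in_idn(ch)]

# The slash homograph U+2044 is a math symbol (Sm), so it is rejected.
print(rejected_chars("com\u2044evil"))

# The characters in the bracketed note behave differently under this rule:
# U+0027, U+2019 and U+00B7 are punctuation, while U+02BC is a modifier letter.
for ch in "\u0027\u2019\u02bc\u00b7":
    print(f"U+{ord(ch):04X}", unicodedata.category(ch), allowed_in_idn(ch))
```

Note that such a rule would not catch everything in the homograph table: U+1160, for example, has category Lo and would still be allowed.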

Here are the legal punctuation marks used in various protocols. Some of them are relatively far from the domain name. They are listed in order of proximity to the domain name.

Protocol  Punctuation  Example
DNS       .            foo.bar.com
URI       /            http://foo.com/bar.html
          :            http://foo.com:8080/
          @            http://user@host.com/
          #            http://foo.com/bar.html#anchor
          ?            http://foo.com/bar.cgi?query=blah&name=value
          ;            http://foo.com/bar.html;params
          %            http://foo.com/bar%AB%CD
          &            http://foo.com/bar.cgi?query=blah&name=value
          =            http://foo.com/bar.cgi?query=blah&name=value
email     @            jane@doe.org
          >            Jane Doe <jane@doe.org>
          (            jane@doe.org (Jane Doe)
          <            Jane Doe <jane@doe.org>
          )            jane@doe.org (Jane Doe)
          "            "Jane D. Oe" <jane@doe.org>

And here are some of the homographs for some of the punctuation marks:

Unicode  Glyph  Current Status   Example
         space
         \
         -
         ~
         |
002E     .      label separator
3002            label separator
FF0E            label separator
FF61            label separator
06D4     ۔      bidi error
0702     ܂      bidi error
002F     /      std3ascii error  http://good.com/evil.biz/
2044            allowed          http://good.com⁄evil.biz/
2215            allowed          http://good.com∕evil.biz/
3033            allowed
30CE            allowed
4E00
1160                             http://good.comᅠevil.biz/
16C1
202E
2758                             http://good.com❘evil.biz/
3009                             http://good.com〉evil.biz/
300B                             http://good.com》evil.biz/

3.2. Letter and Digit Homographs

Letter homographs could be a problem in IDNs because of the phishing potential. This problem has been extensively discussed on the IDN list, both in the past and more recently. Rather than listing all the homographs that were discussed on the mailing list, and running the risk of duplicating work already being performed at the Unicode Consortium, we refer to the Unicode Technical Report (draft UTR #36) and its confusables table. We will have to investigate each homograph to determine its current status in Nameprep and then decide whether to prohibit, map or warn about it in Nameprep2. Note that the UTR also mentions numeric spoofs, including an example where a string that looks like 89 actually has the numeric value 42.
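The numeric spoof can be reproduced with Python's standard unicodedata module. The two digits below (U+09EA BENGALI DIGIT FOUR and U+0B68 ORIYA DIGIT TWO) were chosen here for illustration and are assumed to be of the kind the UTR has in mind; the helper mixes_digit_systems is a hypothetical heuristic, not something taken from any spec.

```python
import unicodedata

# Looks roughly like "89" in many fonts, but the digits come from two scripts.
spoof = "\u09ea\u0b68"

def digit_values(s):
    """Return the numeric value of each decimal digit in s."""
    return [unicodedata.digit(ch) for ch in s]

def mixes_digit_systems(s):
    """Heuristic: True if decimal digits from more than one digit block appear.

    Each block of decimal digits is contiguous, so subtracting a digit's value
    from its code point yields the code point of that block's zero.
    """
    zeros = {ord(ch) - unicodedata.digit(ch)
             for ch in s if unicodedata.category(ch) == "Nd"}
    return len(zeros) > 1

print(digit_values(spoof))         # [4, 2]
print(int(spoof))                  # 42
print(mixes_digit_systems(spoof))  # True
print(mixes_digit_systems("89"))   # False
```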

3.3. TrueType Homographs

I have written a program that parses various TrueType font tables to determine which pairs of Unicode codepoints end up using the same glyphs.
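The program itself is not reproduced here. The core idea can be sketched as inverting a font's character-to-glyph map and keeping the glyphs that are reached from more than one code point. The sketch below assumes the map has already been extracted from the font's cmap table into a dict; the sample data and glyph numbers are invented purely for illustration.

```python
from collections import defaultdict

def shared_glyphs(cmap):
    """Given {code point: glyph id}, return {glyph id: [code points]} for
    every glyph that is used by two or more code points."""
    by_glyph = defaultdict(list)
    for code_point, glyph in cmap.items():
        by_glyph[glyph].append(code_point)
    return {g: sorted(cps) for g, cps in by_glyph.items() if len(cps) > 1}

# Hypothetical cmap fragment: Latin A, Greek Alpha and Cyrillic A share glyph 36.
sample_cmap = {0x0041: 36, 0x0391: 36, 0x0410: 36, 0x0042: 37, 0x002F: 18}

for glyph, code_points in shared_glyphs(sample_cmap).items():
    print(glyph, [f"U+{cp:04X}" for cp in code_points])
```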

4. Display Issues

U+1160 is HANGUL JUNGSEONG FILLER, normally used to transform nonstandard Korean syllables to standard ones. However, this transformation is an additional transformation not included in Unicode normalization. Some IDNA implementations display this character as a space. Phishers may try to take advantage of this.

If a domain label starts with a combining mark, the mark may combine with the preceding dot in some implementations.
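An application could detect this case before display. The following is a minimal sketch using Python's standard unicodedata module; the function name is hypothetical, and it assumes that any character of general category M* counts as a combining mark.

```python
import unicodedata

def labels_starting_with_mark(domain):
    """Return the labels of a dotted name whose first character is a mark (M*)."""
    return [label for label in domain.split(".")
            if label and unicodedata.category(label[0]).startswith("M")]

# U+0338 COMBINING LONG SOLIDUS OVERLAY at the start of the second label
print(labels_starting_with_mark("good.\u0338com"))
print(labels_starting_with_mark("good.com"))
```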

There may be other Unicode display problems in the implementations. These may eventually be included in this section too.

5. Unicode Mapping and Normalization

Some of the Unicode case mappings and normalizations adopted by Nameprep have confused registrants and end-users. This is a different issue from the spoofing issue, since these characters do not appear in the output of the Nameprep process. In some cases, input characters appear to be changed in a significant or confusing way. One example of a potentially confusing character is U+2102 DOUBLE-STRUCK CAPITAL C, the symbol for the set of complex numbers. Nameprep specifies that double-struck C is mapped to regular small 'c' using an "additional folding" rule that is based on Unicode's Normalization Form KC, where double-struck C is normalized to regular capital C. Also, since the double-struck C is not commonly typed on a keyboard, it is not clear why Nameprep should map it to something else, as opposed to simply prohibiting it.
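Both transformations of the double-struck C can be observed with Python's standard library, whose stringprep module exposes the RFC 3454 tables. This is only a demonstration of the behaviour described above, not a full Nameprep implementation.

```python
import stringprep
import unicodedata

DOUBLE_STRUCK_C = "\u2102"

# Normalization Form KC turns the double-struck C into a regular capital C.
nfkc = unicodedata.normalize("NFKC", DOUBLE_STRUCK_C)

# Stringprep table B.2 (case folding for use with NFKC) maps it to small 'c'.
mapped = stringprep.map_table_b2(DOUBLE_STRUCK_C)

print(unicodedata.name(DOUBLE_STRUCK_C))
print(repr(nfkc), repr(mapped))
```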

The double-struck C has the Unicode category Lu (uppercase letter), so the Unicode category alone does not help us distinguish this type of character for the purposes of prohibition in Nameprep. Of course, the character code itself can be used to distinguish it, but there are a lot of characters in Unicode, so it would be a lot of error-prone and controversial work to investigate each character. Instead, we may wish to automate the process of generating tables of characters for the RFCs, as was done for their first versions.

The Character Decomposition Mapping for U+2102 has the tag <font> which means font variant. It may be possible to use these tags to distinguish Unicode characters for Nameprep prohibition. Some parts of Unicode are guaranteed to be stable, but this tag is listed as one of the things that might change. It may be a good idea to discuss this issue with Unicode, to try to ensure stability, or get some idea of when this part might start being stable.
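The tag can be read from the Unicode Character Database through Python's unicodedata module. The helper below is a hypothetical sketch of how characters might be selected by tag; it does not say which tags a revised Nameprep should actually prohibit.

```python
import unicodedata

def decomposition_tag(ch):
    """Return the compatibility tag of a character's decomposition mapping
    (e.g. 'font', 'compat'), or None if there is no tag."""
    decomp = unicodedata.decomposition(ch)
    if decomp.startswith("<"):
        return decomp[1:decomp.index(">")]
    return None

print(unicodedata.category("\u2102"))  # Lu: the category alone does not help
print(decomposition_tag("\u2102"))     # font
print(decomposition_tag("\u0e33"))     # compat
print(decomposition_tag("\u00e9"))     # None (canonical decomposition, no tag)
```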

If we do base Nameprep rules on these tags, we would not be using Normalization Form KC as is. It would be a slightly modified process. [Watch out for U+0E33 and 3 Lao compatibility decomposables.]

6. Unicode Version

The current versions of Nameprep, IDNA and Stringprep reference Unicode 3.2.0. Unicode 4.1.0 is due in March 2005. Nameprep and Stringprep have been designed in such a way that new Unicode characters can be accommodated.

Between 3.2 and 4.0, Unicode had an incompatible change in the normalization. A new version of Nameprep would have to specify whether the new normalizations are incorporated or not.

7. Stringprep

Stringprep is the base upon which Nameprep has been built. Nameprep can be revised without revising Stringprep, by simply having Nameprep provide some of its own character tables, rules, etc. On the other hand, if we believe that other Stringprep profiles might benefit from, and not be broken by, the types of changes that we have in mind for Nameprep, then we should certainly consider changing Stringprep too. This takes some very careful consideration of and coordination with other profiles of Stringprep.

Stringprep has a normative reference to the Unicode Normalization spec (UAX #15). The first version of Stringprep (RFC 3454) refers to Revision 22 of UAX #15. However, a mistake was found in UAX #15, and it was fixed in Revision 25. The mistake is described in Public Review Issue #29. We must decide whether to have a new version of Stringprep point to Revision 25 or higher of UAX #15.

8. ACE Prefix

8.1. Whether the ACE Prefix Must Be Changed

From Adam Costello:

Nameprep was designed so that if Unicode evolves in a backward compatible way (new characters added, old normalization mappings unchanged), then Nameprep can evolve along with it in a backward compatible way, with no need to change the ACE prefix.

(Actually Unicode normalization did change in an incompatible way between 3.2 and 4.0 when five erroneous compatibility mappings were fixed, but the NormalizationCorrections.txt file allows applications to choose whether to keep the errors for backward compatibility. A new Nameprep would have to specify which choice to make.)

Also from Adam Costello:

Coming up with the necessary and sufficient conditions will be tricky, but now that you've got me thinking about it, I think I can supply one sufficient condition: If the only changes you make are to add characters to the prohibited table, I don't think you need to change the ACE prefix. This would cause some valid IDN labels under the old spec to become invalid under the new spec, and would cause some valid ACE labels under the old spec to become bogo-ACE labels under the new spec. (The bogo-ACE phenomenon already exists: there are labels that begin with the ACE prefix but don't validate during ToUnicode and therefore display as literal ASCII strings.) It would not cause anything to encode or decode to something different than it used to.
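The bogo-ACE behaviour can be illustrated with Python's built-in "idna" codec, which implements the RFC 3490 operations. The function display_label is a hypothetical sketch of what a displaying application might do: the codec raises an error when ToUnicode fails, and the fallback to the literal ASCII string is added here to mirror the behaviour described in the parenthetical above.

```python
ACE_PREFIX = "xn--"

def display_label(label):
    """Return the Unicode form of an ACE label, or the literal ASCII label
    when it does not validate (a "bogo-ACE" label)."""
    if not label.lower().startswith(ACE_PREFIX):
        return label
    try:
        return label.encode("ascii").decode("idna")
    except UnicodeError:
        # Does not decode or does not round-trip: show it as it is.
        return label

print(display_label("xn--mnchen-3ya"))  # a valid ACE label
print(display_label("xn--abc-"))        # has the prefix but does not round-trip
```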

8.2. How to Change the ACE Prefix

[I'm just wondering whether it might be a good idea to explore what exactly would have to happen if we were to change the ACE prefix. Such an exploration might lead us to avoid changing Nameprep incompatibly. E.]

9. Recommendations for Registries

From: Asmus Freytag

For a registry that's not monolingual, the IICORE set might be useful, a set of approximately 10,000 ideographs that's designed to cover basic use for any language using Han ideographs.

It is under development by the IRG.

10. Recommendations for Applications

It is pretty clear that none of the organizations can completely solve the homograph problem on its own. Unicode can warn about these issues, but that is all they can do. They cannot remove characters. The IETF is currently discussing the prohibition of certain characters or character types. Even if the IETF publishes updated versions of the specs, there will still be the problem of certain characters being unfamiliar to many users (simply because they do not know all the legitimate characters in the world), thereby leaving them exposed to the phishers. The registries can enforce rules at their level, but nobody has yet shown that they can truly enforce any rules at other levels. So, the browser developers must address that problem.

There are several issues here. One is that domain names are typically displayed inside something else, e.g. a URI. This, in itself, gives the phishers something to work with. So the browser developers must think about other ways to display domain names. This is not very easy. People exchange URIs via email and other means all the time. Apps turn those URIs into clickable links, as a service to users. Where an app does not, users can copy and paste the URI into the browser's URI field. Both of these methods could be improved to distinguish the domain name from its context in the interests of security. Note that Firefox displays the domain name of an HTTPS URI in the lower right corner of the window.

Another problem is that humans are only familiar with a small set of characters. Some people know many characters (e.g. readers of East Asian scripts), but most know far fewer. Now, within the set of characters that each user is familiar with, there are no homograph problems (or just a few). However, as soon as you stray outside any single user's familiar set, there are many homographs, near-homographs and unfamiliar symbols. When a typical computer user is faced with something unfamiliar, they are quite likely to shrug it off and assume it's just one of those "computer" things that they cannot understand. This is something that IDN phishers could take advantage of, if the browsers do not take steps to highlight or hide the unfamiliar characters (i.e. those outside the languages listed in the HTTP Accept-Language preference setting and/or the language of the browser localization). Of course, highlighting and hiding are not sufficient. Education is also very important.
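A browser could approximate the "familiar set" check along the following lines. This is a rough sketch under several assumptions: Python's standard library has no script property, so the first word of the Unicode character name is used as a crude stand-in for the script; ASCII is always treated as familiar; and the mapping from Accept-Language values to scripts is not shown, so the set of familiar scripts is simply passed in. The function name is hypothetical.

```python
import unicodedata

def unfamiliar_chars(label, familiar_scripts):
    """Return the characters of a label that fall outside the user's
    familiar scripts, so that a UI could highlight them."""
    flagged = []
    for ch in label:
        if ch.isascii():
            continue
        # Crude script guess: first word of the character name, e.g. "CYRILLIC".
        script = unicodedata.name(ch, "UNKNOWN").split(" ")[0]
        if script not in familiar_scripts:
            flagged.append(ch)
    return flagged

# "paypal" with U+0430 CYRILLIC SMALL LETTER A in place of the first Latin 'a'
print(unfamiliar_chars("p\u0430ypal", {"LATIN"}))
print(unfamiliar_chars("caf\u00e9", {"LATIN"}))
```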

11. Related Work

There are several organizations and individuals working in this area, and some of them are working on documents that could have significant overlap with this Web site or any documents that we may wish to start writing. Such overlap would be wasteful, and keeping our content in sync with theirs would be difficult, so here is a list to keep track of all the related work:

12. Acknowledgements

Adam Costello
Asmus Freytag
Cary Karp
Daniel Veditz
Darin Fisher
Doug Ewell
Edmon Chung
Eric Johanson
Gary Krall
George Gerrity
Gervase Markham
Jaap Akkerhuis
James Seng
JFC Jefsey Morfin
John Klensin
Kat Momoi
Kim Davies
Mark Davis
Martin Duerst
Martin Loewis
Michel Suignard
Naoki Hotta
Neil Harris
Pat Kane
Paul Hoffman
Roozbeh Pournader
Simon Josefsson
Soobok Lee
Stephane Bortzmeyer
Tedd Sperling
William Tan
Yngve Pettersen
Yoshiro Yoneya

13. Spec Issues

     No security issues such as string length increases or new
     allowed values are introduced by the encoding process or the
     use of these encoded values, apart from those introduced by
     the ACE encoding itself.

  What does this mean, exactly?  Are any new allowed values introduced
  by the ACE encoding?  This part could be clearer.

It might mean that IDNA does not introduce any new ASCII domain
names, it only introduces new non-ASCII domain names.  In any case,
that's true.

  We knew that punctuation could be hazardous, and we expected that
  it would be severely restricted by the registries.  I don't think
  we understood that punctuation could be used to spoof top-level
  domains even if every top-level registry prohibited punctuation.

Maybe this is something to add to the Security section of Nameprep-bis.

  As for application implementors, we made no attempt to mention
  every kind of hazard we had thought of; we just wanted to give a
  motivating example to start them thinking about what safeguards
  would be appropriate for their applications.

I agree with this line of thinking, as long as the slash homograph
(or the "looks like a punctuation mark" problem) is mentioned.

  The draft UTR#36, unlike the IDNA spec, recommends using Nameprep
  for displayed domain names, to simplify detection of confusable
  names.

Maybe Nameprep-bis should also recommend lower-casing ASCII labels
for this reason?  

- In section 1.1, a sentence was added to the end of the first
  paragraph to make it clear that implementations must follow the
  same steps in Stringprep, not just use the tables from Stringprep.
  The numbered list was added as well:

   Specifically, Nameprep mandates a set of steps that must be
   executed in the same manner and the same order as they are
   described in Stringprep.
      1) Map 2) Normalize 3) Prohibit 4) Check bidi

- In section 5, removed "This profile MUST be used with the IDNA
  protocol" because it is expected that other protocols might use
  this profile as well.
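The four steps quoted above (Map, Normalize, Prohibit, Check bidi) can be sketched with the tables in Python's standard stringprep module. This is an illustrative sketch, not a conforming implementation: the function name nameprep_sketch is hypothetical, the check for unassigned code points is omitted, and the list of prohibited tables is the one assumed here for the Nameprep profile.

```python
import stringprep
from unicodedata import ucd_3_2_0 as ucd  # Unicode 3.2.0 data, as used by RFC 3454

# Prohibition tables assumed for the Nameprep profile.
PROHIBITED = (
    stringprep.in_table_c12, stringprep.in_table_c22, stringprep.in_table_c3,
    stringprep.in_table_c4, stringprep.in_table_c5, stringprep.in_table_c6,
    stringprep.in_table_c7, stringprep.in_table_c8, stringprep.in_table_c9,
)

def nameprep_sketch(label):
    # 1) Map: drop characters "mapped to nothing" (B.1), case-fold the rest (B.2).
    mapped = "".join(stringprep.map_table_b2(ch)
                     for ch in label if not stringprep.in_table_b1(ch))
    # 2) Normalize: Normalization Form KC.
    normalized = ucd.normalize("NFKC", mapped)
    # 3) Prohibit: reject any code point found in a prohibited table.
    for ch in normalized:
        if any(in_table(ch) for in_table in PROHIBITED):
            raise ValueError(f"prohibited code point U+{ord(ch):04X}")
    # 4) Check bidi: if any R/AL character is present, no L character is
    #    allowed and the string must both start and end with an R/AL character.
    if any(stringprep.in_table_d1(ch) for ch in normalized):
        if any(stringprep.in_table_d2(ch) for ch in normalized):
            raise ValueError("mixed right-to-left and left-to-right characters")
        if not (stringprep.in_table_d1(normalized[0])
                and stringprep.in_table_d1(normalized[-1])):
            raise ValueError("right-to-left label must start and end with R/AL")
    return normalized

print(nameprep_sketch("Ex\u00adample"))     # soft hyphen removed, 'E' folded
print(nameprep_sketch("good\u2044evil"))    # the slash homograph passes through
```

The order matters: a character may only become prohibited, or only become harmless, after mapping and normalization have been applied, which is why the quoted text insists on executing the steps in sequence.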

RFC 3492, section 8:

   However, there can still be multiple Unicode representations of
   the "same" text, for various definitions of "same".  This problem
   is addressed to some extent by the Unicode standard under the
   topic of canonicalization, and this work is leveraged for domain
   names by Nameprep [NAMEPREP].

canonicalization -> normalization


RFC 3454, appendix C:

   The tables in this appendix consist of lines with one prohibited
   code point per line.  The format of the lines are the value of
   the code point, a semicolon, and a comment which is the name of
   the code point.

Not true. Some lines have more than one code point.


RFC 3454, appendix C: FFF9-FFFC appear in 2 subsections, without
explanation.