This is an unofficial site for collecting issues and information regarding Nameprep (RFC 3491), IDNA (RFC 3490) and Stringprep (RFC 3454). This is not intended to be a list of errata. Some of these issues may be resolved by making changes to the IDNA-related specs and guidelines. Stringprep revision is quite another matter, but is also discussed somewhat.
The official errata are: IDNA errata (for RFC 3490) and Punycode errata (for RFC 3492).
This document was created when it started to become difficult to keep track of all the issues discussed on the idn@ops.ietf.org mailing list. Comments are welcome. Please send them to idn@ops.ietf.org or to erik@vanderpoel.org.
A homograph is a character that resembles another. The IDN Working Group had identified and considered the problem of similar-looking Unicode characters, including letters and punctuation marks, before the RFCs were published. Nevertheless, the punctuation mark homograph issue has received new attention recently, since it was pointed out on the IDN mailing list that Unicode contains a slash homograph that is not prohibited by Nameprep and could be used in a deep subdomain inside a URI to fool users.
There are problems with the display of certain Unicode characters in some implementations.
The letter homograph issue continues to receive attention. We may wish to consider mapping some of these homographs to other (base) characters or to provide more detailed warnings in any RFC revisions.
Experience in the field has shown that some of Unicode's mappings and normalizations confuse both registrants and users. This is not related to spoofing, since these characters are not there after Nameprepping. However, a registrant may be confused when the input character is transformed into a different one. A Nameprep revision might be able to address this issue.
New Unicode versions continue to add characters. Any RFC revision should consider the careful addition of new characters, mappings and normalizations from the latest Unicode version.
http://good.com/very.evil.biz/phish.html
Here, the slash (/) after good.com is actually a homograph of the real ASCII slash, and is being used to trick the user into thinking that they are accessing the good.com domain. It is not possible to solve this problem at the level of a TLD or 2LD registry since any domain owner can create such a spoofed name at any level.
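To make the trick concrete, here is a minimal Python sketch of what a URI parser sees. The function name `authority_of` is hypothetical, the parsing is deliberately naive (it ignores userinfo and ports), and U+2044 FRACTION SLASH is assumed as the homograph.

```python
def authority_of(uri: str) -> str:
    """Return the text between '://' and the first ASCII '/' (naive sketch)."""
    _, _, rest = uri.partition("://")
    host, _, _ = rest.partition("/")
    return host

# U+2044 FRACTION SLASH stands in for the slash after good.com.
spoofed = "http://good.com\u2044very.evil.biz/phish.html"
host = authority_of(spoofed)

# The last two labels name the domain that is actually contacted.
print(host.split(".")[-2:])  # ['evil', 'biz']
```

The user reads "good.com" followed by a path, but the parser treats everything up to the first real slash as one host name under evil.biz.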
Prior to the advent of IDNA, applications solved the punctuation mark problem by avoiding such characters almost entirely. RFC 952 specified the LDH rule (Letters, Digits, Hyphen), and RFC 1123 (STD 3) relaxed it a bit, to allow a digit at the beginning too.
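As an illustration, the LDH rule with the RFC 1123 relaxation might be checked like this. This is only a sketch: the name `is_ldh_label` is made up here, and the 63-octet limit per label is taken from the DNS specification rather than from the text above.

```python
import re

# Letters, digits, hyphen; no hyphen at either end; 1 to 63 characters per label.
LDH_LABEL = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?")

def is_ldh_label(label: str) -> bool:
    """True if the label satisfies the LDH rule as relaxed by RFC 1123."""
    return LDH_LABEL.fullmatch(label) is not None

print(is_ldh_label("3com"))     # True: a leading digit is allowed since RFC 1123
print(is_ldh_label("foo_bar"))  # False: underscore is not a letter, digit or hyphen
```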
One possible solution for the punctuation mark spoofing problem is to restrict IDNs to Unicode characters of category Lx (letters), Mx (marks) and Nx (numbers), and the hyphen. [Check apostrophe (French, 2019, 0027, 02BC), middle dot in Catalan and palochka.]
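A rough Python sketch of such a category filter follows. The function names are hypothetical, and the result depends on the Unicode version shipped with the interpreter, which is newer than the Unicode 3.2.0 that Nameprep references.

```python
import unicodedata

def allowed_by_category(ch: str) -> bool:
    """True if ch is a letter (L*), mark (M*), number (N*) or the ASCII hyphen."""
    return ch == "-" or unicodedata.category(ch)[0] in "LMN"

def rejected_chars(label: str) -> list:
    """List the code points in a single label that the category rule would reject."""
    return [f"U+{ord(c):04X}" for c in label if not allowed_by_category(c)]

print(rejected_chars("com\u2044evil"))  # ['U+2044']: FRACTION SLASH is a symbol
print(rejected_chars("l'ami"))          # ['U+0027']: ASCII apostrophe is punctuation
print(rejected_chars("l\u02BCami"))     # []: MODIFIER LETTER APOSTROPHE is a letter
```

Note that this rule alone does not catch everything in the homograph table below: U+30CE and U+1160 are both classified as letters and would still pass.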
Here are the legal punctuation marks used in various protocols. Some of them are relatively far from the domain name. They are listed in order of proximity to the domain name.
| Protocol | Punctuation | Examples |
| DNS | . | foo.bar.com |
| URI | / | http://foo.com/bar.html |
| URI | : | http://foo.com:8080/ |
| URI | @ | http://user@host.com/ |
| URI | # | http://foo.com/bar.html#anchor |
| URI | ? | http://foo.com/bar.cgi?query=blah&name=value |
| URI | ; | http://foo.com/bar.html;params |
| URI | % | http://foo.com/bar%AB%CD |
| URI | & | http://foo.com/bar.cgi?query=blah&name=value |
| URI | = | http://foo.com/bar.cgi?query=blah&name=value |
| Email | @ | jane@doe.org |
| Email | > | Jane Doe <jane@doe.org> |
| Email | ( | jane@doe.org (Jane Doe) |
| Email | < | Jane Doe <jane@doe.org> |
| Email | ) | jane@doe.org (Jane Doe) |
| Email | " | "Jane D. Oe" <jane@doe.org> |
And here are some of the homographs for some of the punctuation marks:
| Unicode | Glyph | Current Status | Example |
| space | | | |
| \ | | | |
| - | | | |
| ~ | | | |
| vertical bar | | | |
| 002E | . | label separator | |
| 3002 | 。 | label separator | |
| FF0E | ． | label separator | |
| FF61 | ｡ | label separator | |
| 06D4 | ۔ | bidi error | |
| 0702 | ܂ | bidi error | |
| 002F | / | std3ascii error | http://good.com/evil.biz/ |
| 2044 | ⁄ | allowed | http://good.com⁄evil.biz/ |
| 2215 | ∕ | allowed | http://good.com∕evil.biz/ |
| 3033 | 〳 | allowed | |
| 30CE | ノ | allowed | |
| 4E00 | 一 | | |
| 1160 | ᅠ | | http://good.comᅠevil.biz/ |
| 16C1 | ᛁ | | |
| 202E | (invisible) | | |
| 2758 | ❘ | | http://good.com❘evil.biz/ |
| 3009 | 〉 | | http://good.com〉evil.biz/ |
| 300B | 》 | | http://good.com》evil.biz/ |
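The four "label separator" rows in the table correspond to the characters that IDNA treats as dots. A small sketch of splitting a name on all four (the name `split_labels` is hypothetical):

```python
import re

# U+002E, U+3002, U+FF0E and U+FF61: the "label separator" rows of the table.
IDNA_DOTS = re.compile("[\u002E\u3002\uFF0E\uFF61]")

def split_labels(name: str) -> list:
    """Split a domain name on any of the four separator characters."""
    return IDNA_DOTS.split(name)

print(split_labels("foo\u3002bar\uFF0Ecom"))  # ['foo', 'bar', 'com']
```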
Letter homographs could be a problem in IDNs because of the phishing potential. This problem has been extensively discussed on the IDN list, both in the past and more recently. Instead of listing all the homographs that were discussed on the mailing list and running the risk of duplicating work already being performed at the Unicode Consortium, here are links to the Unicode Technical Report and its confusables table. We will have to investigate each homograph to determine its current status in Nameprep and then decide whether to prohibit, map or warn about them in Nameprep2. Note that the UTR also mentions numeric spoofs including an example where a string that looks like 89 actually has the numeric value 42.
U+1160 is HANGUL JUNGSEONG FILLER, normally used to transform nonstandard Korean syllables to standard ones. However, this transformation is an additional transformation not included in Unicode normalization. Some IDNA implementations display this character as a space. Phishers may try to take advantage of this.
If a domain label starts with a combining mark, it may combine with a preceding dot in some implementations. Try it.
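A check for this case could look like the following sketch. The function name is hypothetical, and the test uses the general category (Mn, Mc, Me) from the interpreter's own Unicode data.

```python
import unicodedata

def starts_with_mark(label: str) -> bool:
    """True if the first character of the label is a combining mark (Mn, Mc or Me)."""
    return bool(label) and unicodedata.category(label[0]).startswith("M")

name = "www.\u0301example.com"  # second label begins with COMBINING ACUTE ACCENT
print([starts_with_mark(label) for label in name.split(".")])  # [False, True, False]
```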
There may be other Unicode display problems in the implementations. These may eventually be included in this section too.
Some of the Unicode case mappings and normalizations adopted by Nameprep have confused registrants and end-users. This is a different issue from the spoofing issue, since these characters do not appear in the output of the Nameprep process. In some cases, input characters appear to be changed in a significant or confusing way. One example of a potentially confusing character is U+2102 DOUBLE-STRUCK CAPITAL C = the set of complex numbers. Nameprep specifies that double-struck C is mapped to regular small 'c' using an "additional folding" rule that is based on Unicode's Normalization Form KC, where double-struck C is normalized to regular capital C. Also, since the double-struck C is not commonly typed on a keyboard, it is not clear why Nameprep should map it to something else, as opposed to simply prohibiting it.
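The double-struck C transformation can be reproduced with Python's standard library. This is only an approximation of what Nameprep does: `str.lower()` stands in for the Stringprep case-folding table, and the interpreter's Unicode data is newer than 3.2.0.

```python
import unicodedata

ch = "\u2102"                             # DOUBLE-STRUCK CAPITAL C
nfkc = unicodedata.normalize("NFKC", ch)  # compatibility normalization
folded = nfkc.lower()                     # crude stand-in for Nameprep case folding

print(unicodedata.name(ch), "->", nfkc, "->", folded)
# DOUBLE-STRUCK CAPITAL C -> C -> c
```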
The double-struck C has the Unicode category Lu (uppercase letter), so the Unicode category alone does not help us distinguish this type of character for the purposes of prohibition in Nameprep. Of course, the character code itself can be used to distinguish it, but there are a lot of characters in Unicode, so it would be a lot of error-prone and controversial work to investigate each character. Instead, we may wish to automate the process of generating tables of characters for the RFCs, as was done for their first versions.
The Character Decomposition Mapping for U+2102 has the tag <font> which means font variant. It may be possible to use these tags to distinguish Unicode characters for Nameprep prohibition. Some parts of Unicode are guaranteed to be stable, but this tag is listed as one of the things that might change. It may be a good idea to discuss this issue with Unicode, to try to ensure stability, or get some idea of when this part might start being stable.
If we do base Nameprep rules on these tags, we would not be using Normalization Form KC as is. It would be a slightly modified process. [Watch out for U+0E33 and 3 Lao compatibility decomposables.]
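The decomposition tags can be read directly from the Unicode Character Database. A sketch in Python, with a hypothetical helper name `compat_tag`:

```python
import unicodedata

def compat_tag(ch: str) -> str:
    """Return the compatibility tag such as '<font>', or '' if there is none."""
    decomp = unicodedata.decomposition(ch)
    return decomp.split()[0] if decomp.startswith("<") else ""

print(compat_tag("\u2102"))  # <font>
print(compat_tag("\u0E33"))  # <compat>
```

A table generator for a revised Nameprep could, for example, prohibit everything tagged `<font>` while still applying other compatibility mappings. Which tags to treat this way is exactly the open question discussed above.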
The current versions of Nameprep, IDNA and Stringprep reference Unicode 3.2.0. Unicode 4.1.0 is due in March 2005. Nameprep and Stringprep have been designed in such a way that new Unicode characters can be accommodated.
Between 3.2 and 4.0, Unicode had an incompatible change in the normalization. A new version of Nameprep would have to specify whether the new normalizations are incorporated or not.
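One way to look at such differences is to compare normalization results under two Unicode versions. Python happens to ship a Unicode 3.2.0 snapshot as `unicodedata.ucd_3_2_0`, so a sketch is possible. The helper name `compare_nfkc` is hypothetical, U+2F868 is believed to be one of the corrected code points, and whether the two results differ depends on how the interpreter's 3.2.0 snapshot treats the corrections.

```python
import unicodedata
from unicodedata import ucd_3_2_0  # Unicode 3.2.0 snapshot kept in the stdlib

def compare_nfkc(ch: str) -> tuple:
    """Return (NFKC under Unicode 3.2.0, NFKC under the interpreter's Unicode)."""
    return ucd_3_2_0.normalize("NFKC", ch), unicodedata.normalize("NFKC", ch)

old, new = compare_nfkc("\U0002F868")
print(ucd_3_2_0.unidata_version, f"U+{ord(old):04X}")
print(unicodedata.unidata_version, f"U+{ord(new):04X}")
```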
Stringprep is the base upon which Nameprep has been built. Nameprep can be revised without revising Stringprep, by simply having Nameprep provide some of its own character tables, rules, etc. On the other hand, if we believe that other Stringprep profiles might benefit from, and not be broken by, the types of changes that we have in mind for Nameprep, then we should certainly consider changing Stringprep too. This takes some very careful consideration of and coordination with other profiles of Stringprep.
Stringprep has a normative reference to the Unicode Normalization spec (UAX #15). The first version of Stringprep (RFC 3454) refers to Revision 22 of UAX #15. However, a mistake was found in UAX #15, and it was fixed in Revision 25. The mistake is described in Public Review Issue #29. We must decide whether to have a new version of Stringprep point to Revision 25 or higher of UAX #15.
From Adam Costello:
Nameprep was designed so that if Unicode evolves in a backward compatible way (new characters added, old normalization mappings unchanged), then Nameprep can evolve along with it in a backward compatible way, with no need to change the ACE prefix.
(Actually Unicode normalization did change in an incompatible way between 3.2 and 4.0 when five erroneous compatibility mappings were fixed, but the NormalizationCorrections.txt file allows applications to choose whether to keep the errors for backward compatibility. A new Nameprep would have to specify which choice to make.)
Also from Adam Costello:
Coming up with the necessary and sufficient conditions will be tricky, but now that you've got me thinking about it, I think I can supply one sufficient condition: If the only changes you make are to add characters to the prohibited table, I don't think you need to change the ACE prefix. This would cause some valid IDN labels under the old spec to become invalid under the new spec, and would cause some valid ACE labels under the old spec to become bogo-ACE labels under the new spec. (The bogo-ACE phenomenon already exists: there are labels that begin with the ACE prefix but don't validate during ToUnicode and therefore display as literal ASCII strings.) It would not cause anything to encode or decode to something different than it used to.
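The bogo-ACE behaviour mentioned in the quote can be sketched with Python's built-in `idna` codec, which implements the RFC 3490 operations. The function name `display_form` is hypothetical, and the label `xn--abc-` is simply an example chosen because it does not round-trip.

```python
def display_form(label: str) -> str:
    """Return the Unicode form of an ACE label, or the literal ASCII if it does not validate."""
    try:
        return label.encode("ascii").decode("idna")
    except UnicodeError:
        return label  # bogo-ACE (or non-ASCII input): show it unchanged

print(display_form("xn--bcher-kva"))  # bücher
print(display_form("xn--abc-"))       # xn--abc- (begins with the prefix but fails ToUnicode)
```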
From: Asmus Freytag
For a registry that's not monolingual, the IICORE set might be useful, a set of approximately 10,000 ideographs that's designed to cover basic use for any language using Han ideographs.
It is under development by the IRG.
It is pretty clear that none of the organizations can completely solve the homograph problem on its own. Unicode can warn about these issues, but that is all they can do. They cannot remove characters. The IETF is currently discussing the prohibition of certain characters or character types. Even if the IETF publishes updated versions of the specs, there will still be the problem of certain characters being unfamiliar to many users (simply because they do not know all the legitimate characters in the world), thereby leaving them exposed to the phishers. The registries can enforce rules at their level, but nobody has yet shown that they can truly enforce any rules at other levels. So, the browser developers must address that problem.
There are several issues here. One is that domain names are typically displayed inside something else, e.g. a URI. This, in itself, gives the phishers something to work with. So the browser developers must think about other ways to display domain names. This is not very easy. People exchange URIs via email and other means all the time. Apps turn those URIs into clickable links, as a service to users. Where an app does not, users can copy and paste the URI into the browser's URI field. Both of these methods could be improved to distinguish the domain name from its context in the interests of security. Note that Firefox displays the domain name of an HTTPS URI in the lower right corner of the window.
Another problem is that humans are only familiar with a small set of characters. Some humans know many characters (e.g. East Asians), but most know far fewer than that. Now, within the set of characters that each user is familiar with, there are no homograph problems (or just a few). However, as soon as you stray outside any single user's familiar set, there are many homographs, near-homographs and unfamiliar symbols. When a typical computer user is faced with something unfamiliar, they are quite likely to shrug it off and assume it's just one of those "computer" things that they cannot understand. This is something that IDN phishers could take advantage of, if the browsers do not take steps to highlight or hide the unfamiliar characters (i.e. those outside the languages listed in the HTTP Accept-Language preference setting and/or the language of the browser localization). Of course, highlighting and hiding are not sufficient. Education is also very important.
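As a very rough illustration of the "familiar set" idea, a browser could compare the scripts in a label against a set it considers familiar to the user. The sketch below is an assumption-laden approximation: Python's standard library has no script property, so the script is guessed from the first word of each character name, and both function names and the default `familiar` set are hypothetical. Mapping Accept-Language values to scripts is left out entirely.

```python
import unicodedata

def scripts_in(label: str) -> set:
    """Approximate the scripts of the letters in a label from their character names."""
    scripts = set()
    for ch in label:
        if unicodedata.category(ch).startswith("L"):
            # e.g. 'CYRILLIC SMALL LETTER A' -> 'CYRILLIC'; unnamed letters -> 'UNKNOWN'
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts

def is_unfamiliar(label: str, familiar=frozenset({"LATIN"})) -> bool:
    """True if the label contains letters outside the familiar scripts."""
    return not scripts_in(label) <= familiar

print(is_unfamiliar("paypal"))       # False
print(is_unfamiliar("p\u0430ypal"))  # True: U+0430 is CYRILLIC SMALL LETTER A
```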
There are several organizations and individuals working in this area, and some of them are working on documents that could have significant overlap with this Web site or any documents that we may wish to start writing. Such overlap would be wasteful, and keeping our content in sync with theirs would be difficult, so here is a list to keep track of all the related work:
No security issues such as string length increases or new
allowed values are introduced by the encoding process or the
use of these encoded values, apart from those introduced by
the ACE encoding itself.
What does this mean, exactly? Are any new allowed values introduced
by the ACE encoding? This part could be clearer.
It might mean that IDNA does not introduce any new ASCII domain
names, it only introduces new non-ASCII domain names. In any case,
that's true.
We knew that punctuation could be hazardous, and we expected that it would be severely restricted by the registries. I don't think we understood that punctuation could be used to spoof top-level domains even if every top-level registry prohibited punctuation. Maybe this is something to add to the Security section of Nameprep-bis. As for application implementors, we made no attempt to mention every kind of hazard we had thought of; we just wanted to give a motivating example to start them thinking about what safeguards would be appropriate for their applications. I agree with this line of thinking, as long as the slash homograph (or the "looks like a punctuation mark" problem) is mentioned. The draft UTR#36, unlike the IDNA spec, recommends using Nameprep for displayed domain names, to simplify detection of confusable names. Maybe Nameprep-bis should also recommend lower-casing ASCII labels for this reason?
- In section 1.1, a sentence was added to the end of the first paragraph to make it clear that implementations must follow the same steps in Stringprep, not just use the tables from Stringprep. The numbered list added as well:
Specifically, Nameprep mandates a set of steps that must be executed in the same manner and the same order as they are described in Stringprep.
1) Map
2) Normalize
3) Prohibit
4) Check bidi
- In section 5, removed "This profile MUST be used with the IDNA protocol" because it is expected that other protocols might use this profile as well.
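The four steps named in the change log above (Map, Normalize, Prohibit, Check bidi) can be sketched with Python's `stringprep` module, which exposes the RFC 3454 tables. This is an illustration under assumptions, not a conforming implementation: the name `nameprep_sketch` is hypothetical, unassigned code points are not checked, and the choice of prohibition tables follows our reading of the Nameprep profile.

```python
import stringprep
from unicodedata import ucd_3_2_0  # Stringprep is defined against Unicode 3.2.0

PROHIBITED = (
    stringprep.in_table_c12, stringprep.in_table_c22, stringprep.in_table_c3,
    stringprep.in_table_c4, stringprep.in_table_c5, stringprep.in_table_c6,
    stringprep.in_table_c7, stringprep.in_table_c8, stringprep.in_table_c9,
)

def nameprep_sketch(label: str) -> str:
    # 1) Map: drop the "mapped to nothing" characters (B.1), then case-fold (B.2).
    mapped = "".join(stringprep.map_table_b2(c) for c in label
                     if not stringprep.in_table_b1(c))
    # 2) Normalize with NFKC.
    normalized = ucd_3_2_0.normalize("NFKC", mapped)
    # 3) Prohibit: reject any character found in the selected C tables.
    for c in normalized:
        if any(in_table(c) for in_table in PROHIBITED):
            raise ValueError(f"prohibited code point U+{ord(c):04X}")
    # 4) Check bidi: right-to-left labels must not mix in left-to-right characters
    #    and must start and end with a right-to-left character.
    rand_al = [stringprep.in_table_d1(c) for c in normalized]
    if any(rand_al):
        if any(stringprep.in_table_d2(c) for c in normalized):
            raise ValueError("RandALCat and LCat characters mixed")
        if not (rand_al[0] and rand_al[-1]):
            raise ValueError("label must start and end with a RandALCat character")
    return normalized

print(nameprep_sketch("\u2102"))        # c
print(nameprep_sketch("Ex\u00ADample"))  # example (soft hyphen is mapped to nothing)
```

Running the steps in a different order can change the result, for instance if prohibition were checked before mapping removed or folded a character, which is presumably why the draft spells the order out.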
RFC 3492, section 8: "However, there can still be multiple Unicode representations of the "same" text, for various definitions of "same". This problem is addressed to some extent by the Unicode standard under the topic of canonicalization, and this work is leveraged for domain names by Nameprep [NAMEPREP]." Suggested correction: canonicalization -> normalization.
RFC 3454, appendix C: "The tables in this appendix consist of lines with one prohibited code point per line. The format of the lines are the value of the code point, a semicolon, and a comment which is the name of the code point." This is not true: some lines have more than one code point.
RFC 3454, appendix C: FFF9-FFFC appear in 2 subsections, without explanation.