Open Bug 406993 Opened 17 years ago Updated 2 years ago

Autoconversion of plaintext URLs into HTML Links fails for many obvious cases where URL is enclosed by delimiter characters like “ ”, ‹›,<> etc. (incorrect / useless parsing of various quotation marks or brackets)

Categories

(Thunderbird :: Message Reader UI, defect)

Tracking

(Not tracked)

People

(Reporter: Otto.Stolz, Unassigned)

Details

(Keywords: useless-UI, ux-error-prevention)

User-Agent:       Mozilla/5.0 (Windows; U; Windows NT 5.1; de; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11
Build Identifier: Version 2.0.0.9 (20071031)

When a message contains the string
  ‹http://www.php.net/manual/en/migration5.php›,
Thunderbird includes the closing angle quotation mark in the highlighting of the link; when clicked, the link refers to <http://de2.php.net/manual/en/migration5.php%E2%80%BA> rather than to <http://de2.php.net/manual/en/migration5.php>.
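
The stray suffix in the broken link is the UTF-8 percent-encoding of the closing quotation mark. A minimal Python sketch (standard library only; the variable names are illustrative and not part of the report) reproduces it:

```python
from urllib.parse import quote, unquote

# U+203A SINGLE RIGHT-POINTING ANGLE QUOTATION MARK, the closing quote in the message
closing_quote = "\u203a"

# Percent-encode the UTF-8 bytes of the character
encoded = quote(closing_quote)
print(encoded)

# Decoding the suffix seen in the broken link gives the quotation mark back
decoded = unquote("%E2%80%BA")
print(decoded == closing_quote)
```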

Reproducible: Always

Steps to Reproduce:
1. Send yourself a UTF-8-encoded message containing the string quoted above (including the angle quotes).
2. Open that message
3. Click on the migration5.php link, to see the problem.
4. In your browser's address field, remove the characters »%E2%80%BA« from the URL and hit the ENTER key, to see the link working without the characters added by Thunderbird.
Actual Results:  
Step 3: The browser reports a broken link.
Step 4: The browser displays the desired page.

Expected Results:  
Step 3: Thunderbird should not include the (UTF-8 encoded) angle quote in the URL sent to the browser.
In due course, the browser would display the desired page, in step 3.
Version: unspecified → 2.0
Of course, › shouldn't be used. The standard says < and >
Which standard? RFC 2396 «Uniform Resource Identifiers (URI): Generic Syntax» does not specify the delimiters to be used around a URI. Rather, it says, in section 2 «URI Characters and Escape Sequences»: «URI consist of a restricted set of characters [...]. Characters used conventionally as delimiters around URI were excluded.»

Hence, when attempting to recognize a URI in the message text, Thunderbird should include only characters possibly belonging to a URI, and exclude any character definitely not belonging to it.

The easiest way to do so is to start at one of the common URI schemes followed by a colon, and include the longest possible sequence of characters that could possibly be included in a URI, viz:
"!" | "#" | "$" | "%" | "&" | "'" | 
"(" | ")" | "*" | "+" | "," | "-" | "." | "/" | 
"0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | 
"8" | "9" | ":" | ";" | "=" | "?" |
"@" | "A" | "B" | "C" | "D" | "E" | "F" | "G" | 
"H" | "I" | "J" | "K" | "L" | "M" | "N" | "O" |
"P" | "Q" | "R" | "S" | "T" | "U" | "V" | "W" | 
"X" | "Y" | "Z" | "_" |
"a" | "b" | "c" | "d" | "e" | "f" | "g" | 
"h" | "i" | "j" | "k" | "l" | "m" | "n" | "o" |
"p" | "q" | "r" | "s" | "t" | "u" | "v" | "w" | 
"x" | "y" | "z" | "~"

This would fix the bug reported.
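
As a rough illustration of this approach (not Thunderbird's actual code), the following Python sketch scans from a known scheme and stops at the first character outside the list above. The scheme list and the names `URI_CHARS`, `SCHEME_RE`, and `extract_uris` are assumptions made for the example.

```python
import re
import string

# Characters that may occur in a URI according to the list above.
URI_CHARS = set(string.ascii_letters + string.digits + "!#$%&'()*+,-./:;=?@_~")

# Assumption: a small, illustrative set of schemes; a real client knows more.
SCHEME_RE = re.compile(r"(?:https?|ftp)://")

def extract_uris(text):
    """Return URIs found in text, each ending before the first non-URI character."""
    uris = []
    pos = 0
    while True:
        match = SCHEME_RE.search(text, pos)
        if match is None:
            return uris
        end = match.end()
        # Extend the match while the next character is allowed in a URI
        while end < len(text) and text[end] in URI_CHARS:
            end += 1
        uris.append(text[match.start():end])
        pos = end

print(extract_uris("‹http://www.php.net/manual/en/migration5.php›,"))
```

Note that trailing ASCII punctuation such as a comma or full stop is itself in the allowed set, so this sketch would still swallow it when it directly follows the URI.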

However, a better and more secure method would be to parse, according to the complete URI syntax given in RFC 2396, chapter 11, the string starting with one of the well-known URI schemes, and stop just before the first character that does not fit that syntax.

In any case, including an ANGLE QUOTATION MARK, or any other non-ASCII character, in a URI is definitely wrong.
Not necessarily, for recognition. Consider IDNs which are very much valid, and e.g. how urls are unescaped in the url-bar for firefox3.

See RFC 2396 appendix E.
Component: Mail Window Front End → Message Reader UI
QA Contact: front-end → message-reader
Otto, do you agree with comment #3? Is this bug invalid because of RFC 2396, appendix E?
Whiteboard: closeme 2010-07-05
My point is that Thunderbird should not include delimiting characters in the URL extracted from plain text contained in e-mail messages.

RFC 2396, appendix E, does not invalidate that point; rather, it adds more hints on dealing with white space, especially line breaks, possibly embedded in the URI (but not belonging to it). Thunderbird should follow that recommendation as far as possible; however, there is a problem with a possible hyphen preceding a line break: the hyphen may, or may not, belong to the URI. RFC 2396, appendix E, recommends trying both possibilities; however, I do not see an easy way to handle this between Thunderbird and Firefox -- let alone an arbitrary browser. Perhaps Thunderbird could remember the last URI it passed to the browser, and then try the other possibility if the user clicks on the same URI again.
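
To make the hyphen problem concrete, here is a minimal Python sketch that produces both readings of a URI wrapped across lines. The function name `candidates_for_wrapped_uri` and the example.org URI are hypothetical; how a mail client would then choose between the candidates is left open, as in the comment above.

```python
import re

def candidates_for_wrapped_uri(wrapped):
    """Return the possible URIs for a URI that was wrapped across lines.

    Whitespace around each line break is dropped. If a line ends with a
    hyphen, the hyphen may or may not belong to the URI, so both readings
    are returned (the reading that keeps the hyphen comes first).
    """
    # Split on line breaks together with the whitespace surrounding them
    parts = re.split(r"[ \t]*\r?\n[ \t]*", wrapped.strip())
    results = [parts[0]]
    for part in parts[1:]:
        extended = []
        for prefix in results:
            extended.append(prefix + part)
            if prefix.endswith("-"):
                # Alternative reading: the hyphen was introduced by the line break
                extended.append(prefix[:-1] + part)
        results = extended
    return results

print(candidates_for_wrapped_uri("http://example.org/some-\n  page.html"))
```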

About my original point, RFC 2396, appendix E, is rather vague: it just gives three typical examples, but does not limit the list of possible delimiters in any way. Interestingly, it explicitly mentions “angle brackets”, but then illustrates that case with greater-than, and less-than, symbols. Likewise, it illustrates its claim on quote symbols with the (ASCII) QUOTATION MARK (U+0022), rather than the preferred characters, viz. LEFT DOUBLE QUOTATION MARK (U+201C) and RIGHT DOUBLE QUOTATION MARK (U+201D). Apparently, the author of RFC 2396, appendix E, has not spent much thought on Unicode; rather, the whole discussion appears to be focussed on ASCII text.

I have not spent much thought on IDNs yet; so I am trying now to fill that gap. My central point is clear: when extracting a URI from plain text, Thunderbird must deliver a syntactically valid URI and must not include any characters beyond such URI. Hence, the RFCs to consult are RFC 3490, RFC 3491, and RFC 3492.

The “ACE label” form is not problematic at all, as it complies with the original RFC 3986 URI syntax (according to RFC 3492, section 5, the non-ASCII characters are encoded using ASCII letters, digits, and hyphens, exclusively). Hence, the problem at hand is how to extract the “IDN” form from any surrounding plain text.
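
For illustration, Python's built-in `idna` codec (which implements the RFC 3490 conversion) can show that an ACE-form host name stays within ASCII letters, digits, hyphens, and dots. The host name below is an arbitrary example, not taken from this bug.

```python
# Illustrative host name containing a non-ASCII letter
idn_host = "münchen.de"

# Convert to the ACE form; the non-ASCII label becomes an "xn--" label
ace_host = idn_host.encode("idna").decode("ascii")
print(ace_host)

# Every character of the ACE form is an ASCII letter, digit, hyphen, or dot
allowed = set("abcdefghijklmnopqrstuvwxyz0123456789-.")
print(set(ace_host) <= allowed)

# The conversion is reversible
print(ace_host.encode("ascii").decode("idna") == idn_host)
```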

In principle, there are two strategies to accomplish this:
• start at a known URI scheme and include into the URI all subsequent characters that comply with the IDN syntax;
• start at a known URI scheme and include subsequent characters up to, but excluding a possible delimiter.

Apparently, Thunderbird has hitherto used the latter strategy; however, this depends on a complete list of possible delimiting characters. I deem this approach not feasible, as you will never keep up with the creativity and fancy of e-mail users.

Even after the introduction of IDNs, I deem the former strategy (and only the former one) promising.

Note that IDNs follow the original RFC 3986 URI syntax, only expanded with many additional letters. Hence, you could use an enhanced syntax that includes these additional letters. To find those additional letters, you could peruse the list of allowed characters that the top-level domain registrars have published for their respective realms. For a start, cf. “http://en.wikipedia.org/wiki/Internationalized_domain_name#Top-level_domains_known_to_accept_IDN_registration” or “http://de.wikipedia.org/wiki/Internationalizing_Domain_Names_in_Applications#Zeichens.C3.A4tze”.
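
A hedged sketch of such an enhanced syntax in Python follows. As a stand-in for the registrars' published character lists, it simply assumes that any Unicode letter or digit may occur in a URI; that assumption is broader than any real registry policy, and the names `is_uri_char` and `extract_first_uri` are made up for this example.

```python
import re

ASCII_URI_PUNCT = set("!#$%&'()*+,-./:;=?@_~")
SCHEME_RE = re.compile(r"(?:https?|ftp)://")  # illustrative scheme list

def is_uri_char(ch):
    """Assumption: any letter or digit, ASCII or not, may occur in an IDN-style URI."""
    return ch in ASCII_URI_PUNCT or ch.isalnum()

def extract_first_uri(text):
    """Return the first URI in text, ending before the first character outside the syntax."""
    match = SCHEME_RE.search(text)
    if match is None:
        return None
    end = match.end()
    while end < len(text) and is_uri_char(text[end]):
        end += 1
    return text[match.start():end]

# Quotation marks are neither letters nor digits, so they end the URI
print(extract_first_uri("„http://bücher.example/seite“"))
```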

In a nutshell:
• Thunderbird should not include delimiter characters in URIs extracted from plain text.
• The set of possible delimiter characters is open, at least fuzzy; in contrast, the URI syntax is clearly defined. Hence, Thunderbird should peruse the latter rather than the former.
• Even with IDNs, the URI syntax is clearly defined. The set of characters possibly contained in URIs and the (fuzzy) set of possible delimiters are disjoint.
Whiteboard: closeme 2010-07-05
PS.

In contrast to Thunderbird 3.0.4, Firefox 3.6.3 apparently does it right – at least with U+203A SINGLE RIGHT-POINTING ANGLE QUOTATION MARK. Just try the several URIs contained in this very bug report, when viewing it with Firefox.
I have pasted the string ‹http://www.php.net/manual/en/migration5.php› into a new message using Mozilla/5.0 (Windows NT 6.1; rv:7.0a1) Gecko/20110529 Thunderbird/7.0a1, and it works fine for me. Otto, could you please attach a "malformed" message here for testing?
Whiteboard: [closeme 2011-06-17]
(In reply to comment #7)
> Otto please could you attach here a "malformed" message to test?

See below.

If you tell me your e-mail address, I’ll send you a copy of that sample directly to your e-mail account, so you can peruse it. My e-mail address can be found in the sample below.

Thank you for looking after this bug,
and best wishes,
  Otto Stolz

Return-Path: <Otto.Stolz@uni-konstanz.de>
Received: from uni-konstanz.de ([unix socket])
	 by uni-konstanz.de (Cyrus v2.3.16) with LMTPA;
	 Mon, 30 May 2011 18:36:50 +0200
X-Sieve: CMU Sieve 2.3
Received: from pyrimidin.rz.uni-konstanz.de (pyrimidin.rz.uni-konstanz.de [134.34.240.46])
	by uni-konstanz.de (Postfix) with ESMTP id 7C4C4540C
	for <Otto.Stolz@uni-konstanz.de>; Mon, 30 May 2011 18:36:50 +0200 (CEST)
Received: from nkongsamba.rz.uni-konstanz.de ([134.34.240.62])
  by unitis.rz.uni-konstanz.de with ESMTP; 30 May 2011 16:36:50 +0000
Received: from [192.168.1.33] (dslb-188-098-136-224.pools.arcor-ip.net [188.98.136.224])
	(using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits))
	(No client certificate requested)
	by nkongsamba.rz.uni-konstanz.de (Postfix) with ESMTPSA id 50485A00A3
	for <Otto.Stolz@uni-konstanz.de>; Mon, 30 May 2011 18:36:50 +0200 (CEST)
Message-ID: <4DE3C7A2.4030206@uni-konstanz.de>
Date: Mon, 30 May 2011 18:36:50 +0200
From: Otto Stolz <Otto.Stolz@uni-konstanz.de>
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; de; rv:1.9.1.11) Gecko/20100711 Thunderbird/3.0.6
MIME-Version: 1.0
To: Otto Stolz <Otto.Stolz@uni-konstanz.de>
Subject: https://bugzilla.mozilla.org/show_bug.cgi?id=406993
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit

Less-than, and greater-than symbols:
   <http://www.php.net/manual/en/migration5.php> (003C and 003E)

With Mozilla/5.0 (Windows; U; Windows NT 5.1; de; rv:1.9.1.11) 
Gecko/20100711 Thunderbird/3.0.6, only the variant above does
work correctly; all of the variants below still fail, as discussed
in Bugzilla, id=406993.

French style double angle quotation marks:
   «http://www.php.net/manual/en/migration5.php» (00BB and 00AB)

French style single angle quotation marks:
   ‹http://www.php.net/manual/en/migration5.php› (203A and 2039)

American style double quotation marks:
   “http://www.php.net/manual/en/migration5.php” (201C and 201D)

American style single quotation marks:
   ‘http://www.php.net/manual/en/migration5.php’ (2018 and 2019)

German style double angle quotation marks:
   «http://www.php.net/manual/en/migration5.php» (00AB and 00BB)

German style single angle quotation marks:
   ‹http://www.php.net/manual/en/migration5.php› (2039 and 203A)

German style double quotation marks:
   „http://www.php.net/manual/en/migration5.php“ (201E and 201C)

German style single quotation marks:
   ‚http://www.php.net/manual/en/migration5.php‘ (201A and 2018)

Angle brackets:
   〈http://www.php.net/manual/en/migration5.php〉 (2329 and 232A)
   〈http://www.php.net/manual/en/migration5.php〉 (3008 and 3009)
Whiteboard: [closeme 2011-06-17]
The way Thunderbird parses the plaintext URLs of comment 8 is complete nonsense.
You can see clearly that bugzilla does the correct parsing for comment 8, but TB doesn't.

Regardless of protocols, consider this:

1) TB parses known protocols as the beginning of a URL, and a (white)space or line break character as the end of that plaintext URL (so if you actually need a space inside your URL, you have to use %20 instead)
2) if the plaintext character *before* the protocol is *not* a whitespace or line break (e.g. it's a quote character), and the last character before the URL-terminating whitespace is the *same* or corresponding character (e.g. a quote character again), I believe the odds are 99 to 1 that this character is *not* part of the URL. More so if it's a special (delimiter?) character, which I believe are the only characters which we accept before the protocol.

Compare:
hhttp://asdf.com/h -> not parsed, because protocol not recognized (ok)
“http://asdf.com/” -> parsed as URL: [http://asdf.com/”], which is nonsense! (brackets added by me)
- The protocol is correctly recognized in spite of the leading “, which we don't parse as part of the URL. When a corresponding character is at the end, before the URL-terminating space, why do we parse it as part of the URL?

Even without terminating space, we could (should) accept delimiter characters as delimiters:

> Did you see this“http://asdf.com/”wth!?

Again, the odds are 99 to 1 that, if there's a “ before the protocol, the next (corresponding) ” is intended as a delimiter and not as part of the URL!
Surprisingly, we *do* get it right for

> Did you see(http://asdf.com/)wth!?

Finally, when you copy URLs containing "special" characters from FF location bar, they are always escaped using %xy syntax. So the very presence of the real special (delimiter) character in plaintext is an indication that this is not part of the URL. And I guess the RFCs also recommend something similar.
Intelligent parsing needs to be limited to delimiter candidates (anything that looks like quotes or brackets).

E.g., I would not expect % or & to be recognized as a delimiter:
> %http://asdf.com/% -> correctly parsed as %[http://asdf.com/%] (brackets by me)
> &http://asdf.com/& -> correctly parsed as &[http://asdf.com/&]

This intelligent parsing is also needed because word processors often convert double quotes into styled quotes:
"http://asdf.com" -> “http://asdf.com/”
So when users copy from there, we shouldn't break things unnecessarily.
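
The heuristic from points 1) and 2) can be sketched in Python as follows. The delimiter table `PAIRS`, the scheme list, and the function name are assumptions for illustration only and do not reflect Thunderbird's real implementation.

```python
import re

# Assumption: an illustrative table of opening -> closing delimiter candidates
PAIRS = {
    "<": ">", "(": ")", "[": "]", '"': '"', "'": "'",
    "\u201c": "\u201d",   # “ ”
    "\u2018": "\u2019",   # ‘ ’
    "\u00ab": "\u00bb",   # « »
    "\u00bb": "\u00ab",   # » «
    "\u2039": "\u203a",   # ‹ ›
    "\u203a": "\u2039",   # › ‹
    "\u201e": "\u201c",   # „ “
    "\u201a": "\u2018",   # ‚ ‘
}
SCHEME_RE = re.compile(r"(?:https?|ftp)://")  # illustrative scheme list

def extract_delimited_url(text):
    """Return the first URL, cut at the partner of a delimiter preceding the scheme."""
    match = SCHEME_RE.search(text)
    if match is None:
        return None
    start = match.start()
    # Point 1: by default the URL runs up to the next whitespace
    ws = re.search(r"\s", text[start:])
    end = start + ws.start() if ws else len(text)
    # Point 2: if a delimiter candidate precedes the scheme, stop at its partner
    opener = text[start - 1] if start > 0 else ""
    closer = PAIRS.get(opener)
    if closer is not None:
        close_at = text.find(closer, match.end(), end)
        if close_at != -1:
            end = close_at
    return text[start:end]

print(extract_delimited_url("Did you see this“http://asdf.com/”wth!?"))
print(extract_delimited_url("%http://asdf.com/% next"))
```

Because `%` and `&` are not in the table, they stay part of the URL, as in the examples above. Stopping at the first closing candidate is a simplification: a URL that itself contains a closing parenthesis would be cut short.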
Confirming as a valid request for enhancement of the current nonsensical behaviour.
Status: UNCONFIRMED → NEW
Ever confirmed: true
OS: Windows XP → All
Hardware: x86 → All
Summary: Link enclosed in single angle quotation marks is incorrectly parsed → Autoconversion of plaintext URLs into HTML Links fails for many obvious cases where URL is enclosed by delimiter characters like “ ”, ‹›,<> etc. (incorrect / useless parsing of various quotation marks or brackets)
(In reply to Thomas D. from comment #10)
> Intelligent parsing needs to be limited to delimiter candidates

I repeat: Intelligent parsing should NOT try to guess delimiter characters.

Rather, URLs should be parsed according to the URL syntax (allowing for
white space that may be embedded), and terminate right before any character
that does not comply with said syntax.
Severity: normal → S3