Non-Latin URLs – Are You Ready for Testing?

Up until last week, Internet domain names were a pretty mature business.  Then the folks at ICANN decided to shake things up by enabling non-Latin character ccTLDs (country code Top Level Domains, like .il and .uk).  What does that mean for you?  Well, here’s a quick test.  Try visiting this URL: http://موقع.وزارة-الأتصالات.مصر/.

What you’re looking at is an Internationalized Domain Name, or IDN for short.  It doesn’t contain western or “Latin” letters, and chances are everything you know about URLs is about to get turned backwards (in this case, literally).  What’s worse is that different browsers handle this kind of domain name differently, and there’s no one right answer.

Are you a software tester?  Then your ship has come in because IDNs open up a whole new category of software bugs.  Let’s take a look at a few big trouble areas, but hang on tight because this gets goofy fast.

From the ICANN announcement:

The three new top-level domains are السعودية. (“Al-Saudiah”), امارات. (“Emarat”) and مصر. (“Misr”). All three are Arabic script domains, and will enable domain names written fully right-to-left.

Right to Left TLDs
Take a look at the URL in the first paragraph (which goes to the Egyptian Ministry of Communications and Information Technology).  After the http:// you’ll see the Misr (Egypt) TLD, followed by a period, and then the domain name.  This makes sense because Arabic is written right-to-left, but it would be like reading the BBC’s URL as http://uk.co.bbc.www.

Of course, you can’t write out any old URL right-to-left – just those in right-to-left scripts such as Arabic.  Which means that when it comes to displaying or reading a domain name, figuring out the direction of its script is an important first step to knowing whether the TLD shows up first or last on screen.
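
To make that concrete, here’s a minimal Python sketch.  It assumes the hostname is held in logical order (the order the characters are typed and stored), where the TLD is still the last label; the right-to-left flip happens when the text is drawn on screen.  The helper names split_labels and is_rtl_label are made up for illustration and don’t come from any library.

```python
import unicodedata

def split_labels(hostname: str) -> list[str]:
    """Split a hostname into its labels, in logical (stored) order."""
    return hostname.rstrip(".").split(".")

def is_rtl_label(label: str) -> bool:
    """True if any character has a right-to-left bidi class (R or AL)."""
    return any(unicodedata.bidirectional(ch) in ("R", "AL") for ch in label)

host = "موقع.وزارة-الأتصالات.مصر"
labels = split_labels(host)

# In logical order the TLD is the last label, whatever the display shows.
tld = labels[-1]
print("TLD label:", tld)
print("TLD is right-to-left:", is_rtl_label(tld))
print("'uk' is right-to-left:", is_rtl_label("uk"))
```

A test that only looks at what’s rendered in the URL bar can’t tell you which label the app treats as the TLD, so it’s worth checking the stored string as well.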

New Opportunities for Phishing
The next problem is even worse, and there’s no good solution.  If you open the first URL in Firefox, you’ll notice that the URL bar shows it as a long string of Latin text.  Safari, on the other hand, displays it properly.

Why does Firefox break the URL?  Because IDNs have the potential to be very dangerous for web security and phishing.  As more languages are approved for IDNs by ICANN, the number of valid character sets will grow.  This introduces conflicts with international characters that look very similar to Latin characters.

For example, Russian Cyrillic will be a huge problem according to this article from Mashable.  The Russian letters р, а, and у are treated as totally different characters from the Latin p, a, and y.  Conveniently, those three letters are all you need to spell the first five letters of paypal, meaning the Cyrillic раураl.com is a totally different domain from paypal.com (copy and paste those two domains in your URL bar – you’ll see).

This opens a whole new approach for phishing attacks.  For this reason, Firefox defaults to displaying IDNs as gibberish to help manage this confusion.  Safari, on the other hand, tries to guess whether it should show the real text or gibberish.  Either way, that’s only in the URL bar and not in actual links, meaning everyone has to be more careful.
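
If you want to see the lookalike problem from a tester’s chair, here’s a rough Python sketch.  It flags labels that mix scripts by looking at the first word of each character’s Unicode name (LATIN, CYRILLIC and so on).  That’s a crude heuristic I’m assuming for illustration, not what Firefox or Safari actually do, and the function names scripts_in and looks_mixed are hypothetical.

```python
import unicodedata

def scripts_in(label: str) -> set[str]:
    """Return rough script names for the letters in a label.

    Uses the first word of each character's Unicode name
    (e.g. 'LATIN', 'CYRILLIC') as a stand-in for its script.
    """
    return {
        unicodedata.name(ch, "UNKNOWN").split()[0]
        for ch in label
        if ch.isalpha()
    }

def looks_mixed(label: str) -> bool:
    """True if the label's letters come from more than one script."""
    return len(scripts_in(label)) > 1

genuine = "paypal"
# Cyrillic er, a, u, er, a followed by a Latin 'l'
spoof = "\u0440\u0430\u0443\u0440\u0430l"

print("strings equal:", genuine == spoof)
print("scripts in spoof:", sorted(scripts_in(spoof)))
print("mixed:", looks_mixed(genuine), looks_mixed(spoof))
```

Keep in mind that a spoof written entirely in one non-Latin script would sail past this check, so treat it as a starting point for test ideas rather than a defense.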

Same Domain, Different Names
The fact that certain browsers handle these domains differently is yet another problem.  The Unicode form of the Egyptian URL above is http://موقع.وزارة-الأتصالات.مصر/, but the same address can also be written in its ASCII (Punycode) form as http://xn--4gbrim.xn----ymcbaaajlc6dj7bxne2c.xn--wgbh1c/.  Those are one and the same, even though they look entirely different.
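
Here’s a small Python sketch of how you might compare the two forms in a test.  It leans on Python’s built-in "idna" codec, which implements the older IDNA 2003 rules, so take it as an approximation of what a given browser does.  The helpers to_ascii, to_unicode and same_host are names I’ve invented for the example.

```python
def to_ascii(hostname: str) -> str:
    """Convert a hostname to its ASCII ("xn--") form."""
    return hostname.encode("idna").decode("ascii")

def to_unicode(hostname: str) -> str:
    """Convert an ASCII ("xn--") hostname back to its Unicode form."""
    return hostname.encode("ascii").decode("idna")

def same_host(a: str, b: str) -> bool:
    """Compare two hostnames after converting both to the ASCII form."""
    return to_ascii(a).lower() == to_ascii(b).lower()

unicode_host = "موقع.وزارة-الأتصالات.مصر"
ascii_host = to_ascii(unicode_host)

print(ascii_host)
print(to_unicode(ascii_host) == unicode_host)
print(same_host(unicode_host, ascii_host))
```

Converting both sides to the ASCII form before comparing is a design choice for this sketch; the point is simply that a plain string comparison of the two forms will say they’re different hosts.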

Conclusion
Software testing just got a lot more complicated, but here are a few ideas to get you started with IDNs:

  • First, does it matter if the app handles IDNs at all?  Not every app cares about URLs.  Use good judgment before testing.
  • Next, does the app handle international URLs with IDNs correctly?  Feel free to use this URL as a test:
    http://موقع.وزارة-الأتصالات.مصر/
  • Does the web app handle right-to-left domain names correctly?  Again, use the URL above as a test.
  • How does the app handle domains that could be phishing targets?
  • Is the app able to differentiate between the international and Latin versions of a domain?
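
For the checklist above, it helps to feed the app both spellings of the same URL.  Here’s a hedged Python sketch that builds them from one input.  It assumes a simple URL with no port, username or password (those would be dropped), it uses Python’s built-in "idna" codec (IDNA 2003 rules), and url_variants is a hypothetical helper name.

```python
from urllib.parse import urlsplit, urlunsplit

def url_variants(url: str) -> dict[str, str]:
    """Return the Unicode and ASCII spellings of the same URL.

    Assumption: the URL has no port or userinfo; only the
    hostname part of the netloc is kept.
    """
    parts = urlsplit(url)
    host = parts.hostname or ""
    ascii_host = host.encode("idna").decode("ascii")
    unicode_host = ascii_host.encode("ascii").decode("idna")
    return {
        "unicode": urlunsplit(parts._replace(netloc=unicode_host)),
        "ascii": urlunsplit(parts._replace(netloc=ascii_host)),
    }

variants = url_variants("http://موقع.وزارة-الأتصالات.مصر/")
for name, value in variants.items():
    print(name, value)
```

Run every URL field, link parser and auto-linker in the app against both values and check that they end up pointing at the same place.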

Did I forget anything? Let me know below.

5 Responses to “Non-Latin URLs – Are You Ready for Testing?”

  1. Santhosh Shivanand Tuppad said:

    @Readers,
    Cool news. Now the challenge is for even Microsoft, Mozilla Firefox, Safari to make their browser compatible with NON-Latin characters domain names.

    More fun for testers to test for NON-Latin character domain names.

    Thanks,
    Santhosh Shivanand Tuppad

  2. Mislav said:

    Firefox has special protection in place that is supposed to show the IDN except when it detects it’s a phishing attempt. I don’t know how it works exactly or why it failed to display the IDN in your example.

    Safari protection against phishing involves blocking certain Unicode tables, most importantly Cyrillic. If the IDN has these characters, the whole IDN will be displayed decoded.

    Internet Explorer supports the IDNs if I remember correctly, and Chrome (at least the nightly) always displays them decoded.

    One more thing to test: auto-linking scripts, especially on Twitter.com and Twitter clients. These scripts are going to be the first that break horribly when people start pasting these URLs around.

  3. Eitan said:

    It is very good article!

  4. Stanton Champion said:

    Mislav – interesting points all around. Thanks for the insights.

  5. Stanton Champion said:

    Ack, this gets even worse. For email addresses, does the TLD come at the beginning or the end of the address on a right-to-left domain? So do you send email to موقع.وزارة-الأتصالات.مصر@testuser or testuser@موقع.وزارة-الأتصالات.مصر or something else entirely?

    This is going to break a whole lot of stuff.
