phil ringnalda : Proper charset encoding in Mozilla Firefox toolbar search Sherlock .src files

Proper charset encoding in Mozilla Firefox toolbar search Sherlock .src files

Apparently it’s not very clear how to specify the charset that should be used to encode the query for Mozilla/Firefox search plugins (the Sherlock .src files that power the search box next to the addressbar): the documentation on the Mycroft page isn’t actually right (though as usual in programming documentation that allows user comments, it’s corrected in a comment), and not only are a great many of the plugins on Mycroft wrong, so are the default plugins that ship with Firefox and the Application Suite (though at least the Google plugin is wrong in a survivable way).

Elevator pitch:

If you don’t specify a queryCharset (or don’t correctly specify a queryEncoding as an integer that maps to a charset), the search is encoded with the user’s default charset, and since some of the default search pages expect ISO-8859-1 and some expect UTF-8, if you search for non-ASCII characters some will fail no matter what.

Full story in the extended entry, to protect the innocent, and those who will never write a search plugin.

It’s not too surprising that so many plugins get it wrong: Apple seems to have written queryCharset and queryEncoding completely out of their history of Sherlock, so now there’s only a page in the intl project and that comment in the Mycroft docs to explain it.

The charset that will be used to encode the query is determined like this (links will obviously bit-rot pretty quickly):

1. The original query string is encoded as UTF-8 (encodeURIComponent()).
2. nsInternetSearchService.cpp first looks for a queryCharset attribute on the search element in the engine .src file.
3. If there’s no queryCharset, it then looks for a queryEncoding attribute.
   1. If there’s a queryEncoding, and it’s a known integer that maps to a charset, that’s used.
   2. If there’s no queryEncoding, or it’s not a (known) integer, then the user’s pref for intl.charset.default is used.
   3. If all else has failed, ISO-8859-1 is used.
4. The original query string gets jerked around through some unescaping and reescaping, and pops out the other end encoded in whichever charset was selected above.
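
The lookup order above can be sketched in a few lines of Python. This is only an illustration of the fallback chain as described here, not the code in nsInternetSearchService.cpp: the function name `select_query_charset` is made up, and the integer table holds only the two queryEncoding values mentioned in this post, so the real table is assumed to be larger.

```python
from typing import Mapping, Optional

# Partial, illustrative table: only the two queryEncoding integers that
# this post names. The real mapping is assumed to contain more entries.
KNOWN_QUERY_ENCODINGS = {
    2336: "EUC-JP",
    2561: "Shift_JIS",
}


def select_query_charset(search_attrs: Mapping[str, str],
                         user_default: Optional[str]) -> str:
    """Pick the charset used to encode the query (hypothetical helper)."""
    # A queryCharset attribute wins outright.
    query_charset = search_attrs.get("queryCharset")
    if query_charset:
        return query_charset

    # Otherwise queryEncoding counts only if it is a known integer.
    raw = search_attrs.get("queryEncoding")
    if raw is not None:
        try:
            code = int(raw)
        except ValueError:
            code = None
        if code in KNOWN_QUERY_ENCODINGS:
            return KNOWN_QUERY_ENCODINGS[code]

    # Then the user's intl.charset.default pref, if there is one.
    if user_default:
        return user_default

    # Last resort.
    return "ISO-8859-1"


print(select_query_charset({"queryCharset": "UTF-8"}, "ISO-8859-1"))  # UTF-8
print(select_query_charset({"queryEncoding": "2336"}, "UTF-8"))       # EUC-JP
print(select_query_charset({}, None))                                 # ISO-8859-1
```

Note that in this sketch a queryEncoding holding a charset name rather than an integer is simply skipped, so the user's default charset ends up being used.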

So, no matter what it’s going to find a charset to try to convert with: it may fail, if you use a queryCharset that won’t work with textToSubURI->ConvertAndEscape(), or it may be either the user’s default charset or ISO-8859-1 if you don’t specify anything, but it’s going to find something, and if you don’t specify the right thing, it’s going to be a problem.

With the default default charset, ISO-8859-1, search Yahoo! for übel from Firefox’s toolbar, and you’ll search for %FCbel, despite telling Yahoo! you were using UTF-8 with &ei=UTF-8, and since the %FC is meaningless as UTF-8, Yahoo! will just search for bel.

Or, with your default charset changed to UTF-8, search Ask Jeeves in Seamonkey (you ship an Ask Jeeves plugin by default?!) for übel and you’ll search for %C3%BCbel which, since Ask Jeeves uses ISO-8859-1, it will interpret as being a search for Ã¼bel.
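
Both failures are easy to reproduce outside the browser. The sketch below uses Python's `urllib.parse` as a stand-in for what the browser and the two search engines do. The `errors="ignore"` argument is an assumption, chosen to mimic a server that silently drops bytes it can't decode.

```python
from urllib.parse import quote, unquote

query = "übel"

# What the browser sends, depending on which charset it picked.
as_latin1 = quote(query, encoding="iso-8859-1")
as_utf8 = quote(query, encoding="utf-8")
print(as_latin1)  # %FCbel
print(as_utf8)    # %C3%BCbel

# A server expecting UTF-8 gets ISO-8859-1 bytes: 0xFC is not valid
# UTF-8, so (assuming it drops the bad byte) only "bel" is left.
print(unquote(as_latin1, encoding="utf-8", errors="ignore"))  # bel

# A server expecting ISO-8859-1 gets UTF-8 bytes: every byte decodes,
# but the two bytes of "ü" turn into two separate characters.
print(unquote(as_utf8, encoding="iso-8859-1"))  # Ã¼bel
```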

The solution is simple enough, the same one the validator insists you use for your own HTML: always specify a charset. Don’t use queryEncoding, since it’s a strange legacy thing for a limited set of charsets, and people copying you won’t know to change your “2336” for EUC-JP to “2561” for Shift-JIS. Just use queryCharset. If you are writing a plugin for something that will take more than one encoding (like Google or Yahoo!), also include an input for the variable that says which encoding you used. And use UTF-8 whenever you have a choice, since transcoding Greek to ISO-8859-1 just results in “??????”.
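
To see why UTF-8 is the safer choice, here is a small Python sketch of the Greek case. Python's `errors="replace"` is only a stand-in for whatever the browser's converter does with unmappable characters, and the sample word is an arbitrary choice.

```python
from urllib.parse import quote

# "Ελλάδα" (six Greek letters), written with escapes so the example
# doesn't depend on how this page itself is encoded.
greek = "\u0395\u03bb\u03bb\u03ac\u03b4\u03b1"

# ISO-8859-1 has no Greek letters, so each one becomes "?".
lossy = greek.encode("iso-8859-1", errors="replace")
print(lossy)  # b'??????'

# UTF-8 can represent all of them, at two bytes per letter.
print(quote(greek, encoding="utf-8"))
```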

For example, a working version of the Yahoo! plugin would include:

<SEARCH
   version="7.1"
   name="Yahoo"
   description="Yahoo Search"
   method="GET"
   action="http://search.yahoo.com/search"
   queryCharset="UTF-8"
>
<input name="p" user>
<input name="ei" value="UTF-8">
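
As a rough check of what that plugin should produce, the following Python sketch builds the request URL from the same action, inputs and charset. The helper name `build_search_url` is made up, and `urlencode` is only an approximation of the browser's escaping (it may differ in details such as how spaces are escaped).

```python
from urllib.parse import urlencode

# Values copied from the plugin fragment above.
ACTION = "http://search.yahoo.com/search"
QUERY_CHARSET = "UTF-8"


def build_search_url(user_query: str) -> str:
    """Approximate the GET URL the plugin would request (hypothetical helper)."""
    # "p" carries the user's query; "ei" tells Yahoo! which charset was used.
    params = {"p": user_query, "ei": "UTF-8"}
    return ACTION + "?" + urlencode(params, encoding=QUERY_CHARSET)


print(build_search_url("übel"))
# http://search.yahoo.com/search?p=%C3%BCbel&ei=UTF-8
```

The point is that the charset used to encode the query and the value of ei agree, so the server decodes the bytes the way they were encoded.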

To be done:

Either morph bug 270120 into a “fix all intl issues with all default Firefox plugins” bug or file a new one.
File or find a Seamonkey bug for their, er, interesting set of default plugins.
Consider bugging someone about actually putting something up in /products/firefox/plugins/ where the defaults say to look for automatic updates.
Consider running through all the Aviary l10n plugins looking for trouble, since oddly enough all those intl people didn’t actually test their i18n very well. Decide that it’s SEP.

Sweet, there’s close to twenty tabs in two windows I can close now!

This entry was posted on Saturday, January 15th, 2005 at 4:04 pm and is filed under mozilla.

5 Comments

Comment by Axel Hecht #

2005-01-16 03:37:43

One thing that is *really* badly documented is the fact that the default encoding of the plugin itself is x-mac-roman. I actually forgot how to specify a different one. Phil, you just dug through the code, could you add that?

Note that as long as some of the plugins on the l10n cvs rep reference the src’s on mozdev, we need to fix those on mozdev *and* in the l10n cvs rep. The plugin update does not only check for modification date, but also for length of the src, so it reverts your changes back to the original.

Comment by Phil Ringnalda #

2005-01-16 14:31:20

That is spectacularly ugly. I understand the reason behind it, because at the time we were being compatible with the way Apple did things with Sherlock, but still. To get the engine name to display in your own charset, you specify sourceTextEncoding = {int}, where the int is one of the things in InternetSearchDataSource::MapScriptCodeToCharsetName.

So, say you want a Cyrillic name, бщ (I hope that’s meaningless). You specify sourceTextEncoding = "7" to say that you are encoding the name in x-mac-cyrillic, Ghu help you. Of course, you aren’t using a Mac that uses that as its native encoding, so you look at Kosta Kostis’ chart and see that means you want characters 225 and 249. Find a chart for your native encoding, copy-paste the characters for those codepoints, and hope it actually works (I started off with an example of Greek, but something seems a little off with the x-mac-greek to Unicode translation, or maybe I just confused myself).

If you need Shift-JIS (1), Big5 (2), EUC-KR (3), or GB2312 (25), you are set, otherwise you’ll be translating into a Mac charset (assuming one exists for the characters you need, and we can do something useful with it: our x-mac-thai conversion seems to be… not.).
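
The chart lookup can also be done mechanically. The Python sketch below maps the sourceTextEncoding integers mentioned in this comment to Python codec names and encodes the engine name. The table is partial (only the values named here), the helper name is made up, and Python's `mac_cyrillic` codec is assumed to match Mozilla's x-mac-cyrillic for these two letters.

```python
# Partial table: only the sourceTextEncoding integers named in this
# comment, mapped to the corresponding Python codec names.
SOURCE_TEXT_ENCODINGS = {
    1: "shift_jis",
    2: "big5",
    3: "euc_kr",
    7: "mac_cyrillic",
    25: "gb2312",
}


def encode_engine_name(name: str, source_text_encoding: int) -> bytes:
    """Return the bytes to put in the .src file (hypothetical helper)."""
    codec = SOURCE_TEXT_ENCODINGS[source_text_encoding]
    return name.encode(codec)


raw = encode_engine_name("бщ", 7)
print(list(raw))  # [225, 249]
```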

I wish I thought it would be widely used enough to try to talk someone into implementing an equivalent to queryCharset, so that you could use a real charset name as sourceTextCharset and only fall back to sourceTextEncoding if that’s not found, but since even Baidu has to be known as much as Baidu as 百度 so that people will know it by URL, nobody seems to bother much with l10n of engine names: even the Netscape Japan plugin that was the impetus for fixing it in the first place isn’t named in Japanese (at least, not the version on Mycroft).

Comment by Jonathon Delacour #

2005-01-16 15:36:03

Is that why my Japanese Movie Database plugin — which worked fine on the PC that I wrote it on — didn’t work on my new Macintosh until I removed the queryEncoding="EUC-JP" line?

You might want to include in your Who’s My Audience categories, people who’ve written a Firefox search plugin without really understanding what they were doing.

Comment by Phil Ringnalda #

2005-01-16 16:59:48