Sunday 8 June 2008

Ultrasphinx & UTF-8 data

On filmamora.com we obviously have a lot of data that is not your standard ascii - Spanish film titles and Actor's name especially.

After switching from Ferret to Sphinx as a search engine (more about this in another post) everything was going great using the ultrasphinx plugin. That is until I tested searching on a name like 'Belén'. No films were found with an actress with that name - though I knew there were loads of actresses with that name and therefore films.

The problem was the data settings in the database. Even though I was displaying the names correctly this was due to a transliteration between the characters in the database and the webpage. The data was not being stored correctly.

You MUST create tables as default charset utf8 to get the correct results:

Step 1) Create the table in a migration

create_table(:films, :options => 'DEFAULT CHARSET=utf8') do |t|
t.string :title, :null => false
....
end


Step 2) Tell rails that the database is in utf-8 by putting this in your database.yml:

development:
adapter: mysql
database: filmamora_development
encoding: utf8


Step 3) Make sure the data going IN is in utf8. Filmamora gets a lot of data from scraping our wholesales site (using scrapi), so I do this:

film_info_es = dvdgo_movie.scrape(URI.parse("#{server}/#{page}"),
:parser_options => {"char-encoding" => "latin0",
"output-encoding" => "utf8", "quote-marks" => "true" },
:user_agent => "Mozilla/5.0 (Macintosh; U; Intel Mac OS X; en-US; rv:1.8.1.12) Gecko/20080201 Firefox/2.0.0.12")


Step 4) search!
@films = Ultrasphinx::Search.new(:query => @query, :class_names => ['Film'],
:page => params[:page]||1, :per_page => 12,
:filters => {"available" => 1}).run

Now, if you were (like us) unlucky enough to have NOT done this from the start and can't re-run your migrations (we have a catalogue of 35,000 items!), you can do this to your tables:

ALTER TABLE mytablename DEFAULT CHARACTER SET utf8

We then deleted all our character data (film titles, actor's name, synopsis) and reimported. We kept our id's (which we needed to make sure our existing orders and wishlists, etc didn't go wrong) but got the new data.

In theory you don't need to set the charset on each table. In MySQL you can create a database with a default charset:

CREATE TABLE tablename DEFAULT CHARACTER SET utf8;

And then all your tables should use it too.

But, I don't think that rake db:create does this... (maybe it does now that I have set the character set in the database.yml... hmmm). I also am not sure if the rake db tasks to create tables explicitly set the charset or not. Anyone?

I'm going to write a few posts on sphinx/ultrasphinx because it is so awesome and I had some issues finding clear info on how to use it, so I might as well write it up.

No comments: