Showing posts with label ultrasphinx. Show all posts
Showing posts with label ultrasphinx. Show all posts

Wednesday, 11 June 2008

Ultrasphinx: ' != "

Just ran into a wee ultrasphinx problem.

I had been doing this:

@query = "actor:'#{@person.name}' OR director:'#{@person.name}'"


but it turns out that did nothing! You need to surround query segments in double quotes for it to actually parse them as a single unit:

@query = "actor:\"#{@person.name}\" OR director:\"#{@person.name}\""

Tuesday, 10 June 2008

Ultrasphinx cheatsheet: indexing habtm relationships

In the previous post I quickly summarized how I am doing the indexing of fields of a model. Quite simple I know.

I struggled a little bit with getting the other models I wanted indexed to come through.

Before, as I mentioned, I was using Ferret. Now, despite all of its memory-hog bugginess it is simple to understand because you can modify stuff at the post-sql stage. So before, for instance, I had a method on the film model that did this:

def participants_name
participants.collect{|a| a.name}.join(' ')
end


That gave me a big long string and I could pass that to acts_as_ferret and it would use that to find all the films with Johnny Depp in them.

The problem came with Sphinx not doing any post-sql processing. I wanted the same results, but how?
The model looks like this:

class Film < ActiveRecord::Base
has_and_belongs_to_many :participants
...
end



At first I thought I could index participants and then step through all of their films, but that seemed too much work.

So... here comes the cheatsheet part!

How to index HABTM relationships

Again, maybe this is in the documentation but I had a hard time piecing it together.

I do this:

is_index :concatenate => [
{:class_name => "Participant",
:field => "name",
:table_alias => "p1",
:as => "actor",
:association_sql => "LEFT OUTER JOIN films_participants ON (films.`id` = films_participants.`film_id`) LEFT OUTER JOIN participants AS p1 ON (p1.`id` = films_participants.`participant_id`)"}
]


So, where as the :fields array of is_indexed is used for the fields of a model, :concatenate is where you set up the habtm relationships.
To break it down
1 - :class_name => the name of the class that this model relates to. We need to set this because you can reference by a different name as we'll see in a moment

2 - :field => same as the fields table on this model, what field do we want to index on. Not it is called 'field' and not 'fields'! You can only do one.

3 - :table_alias => i.e. GODSEND! I have 2 Participant habtm in my film model - one for actors and one for directors. The directors :concatentate definition is almost exactly the same as the actors one. If we didn't have :table_alias Sphinx wouldn't be able to tell them apart. Remember, Sphinx uses all this cleverness to generate one (very clever!) bit of sql.

4 - :as => what you want to call this in your search. I can then write a query that says "actor:'Johnny Depp'" and it will search only on that field.

5 - :association_sql => yes, Virginia, I am sorry but you have to write your own sql. It will ALWAYS look like this, so just copy and past mine and replace the table and column names.


That's it.

You can now do a query like this:

@films = Ultrasphinx::Search.new(:query => "Johnny Depp", :class_names => ['Film']).run



And it will search for any film titles with Johnny Depp in them AND any actors in the film. Sweet!

Now, didn't I say something about directors?

I have another habtm relationship:
 has_and_belongs_to_many :directors, :class_name => "Participant", 
:join_table => "directors_films"


Same other model (Participant). So now my :concatenate will look like this:

:concatenate => [              
{:class_name => "Participant", :field => "name", :table_alias => "p1",
:as => "actor", :association_sql => "LEFT OUTER JOIN films_participants ON (films.`id` = films_participants.`film_id`) LEFT OUTER JOIN participants AS p1 ON (p1.`id` = films_participants.`participant_id`)"},
{:class_name => "Participant", :field => "name", :table_alias => "p2",
:as => "director", :association_sql => "LEFT OUTER JOIN directors_films ON (films.`id` = directors_films.`film_id`) LEFT OUTER JOIN participants AS p2 ON (p2.`id` = directors_films.`participant_id`)"}]


Two references to Participant, but :table_alias saves the day.

Monday, 9 June 2008

Ultrasphinx cheatsheet: indexing fields you need!

The Sphinx/Ultrasphinx combination is great for searching. Very fast and much less load on your database than other options. It also takes a lot less memory and is more reliable.

But, the documentation is maddening. Here is what I have learned about implementing it.

To index a field on a model:

is_indexed :fields => [{:field => :field_name, :sortable => true}]


The sortable lets you sort on it later in searches. The thing to remember is that sphinx knows nothing except what you explicitly tell it. So it won't know about fields you don't mention, or what you might want to do with those fields. Don't include the :sortable flag and you can't sort on it.

To that end, we have a field called 'available' that we set/unset if an item in the catalogue shouldn't be displayed online. Now, we never really want to search on that, but, we do want to include it in searches as a filter. So, the first line of our model is really this:

is_indexed  :fields => [{:field => :title, :facet => true, :sortable => true}, :available]


That means we can do this:

@films = Ultrasphinx::Search.new(:query => "title:Foo", :class_names => ['Film'], :filters => {"available" => 1}).run


And we'll just get a list of available films named Red (or with Red in the title).

Sunday, 8 June 2008

Ultrasphinx & UTF-8 data

On filmamora.com we obviously have a lot of data that is not your standard ascii - Spanish film titles and Actor's name especially.

After switching from Ferret to Sphinx as a search engine (more about this in another post) everything was going great using the ultrasphinx plugin. That is until I tested searching on a name like 'Belén'. No films were found with an actress with that name - though I knew there were loads of actresses with that name and therefore films.

The problem was the data settings in the database. Even though I was displaying the names correctly this was due to a transliteration between the characters in the database and the webpage. The data was not being stored correctly.

You MUST create tables as default charset utf8 to get the correct results:

Step 1) Create the table in a migration

create_table(:films, :options => 'DEFAULT CHARSET=utf8') do |t|
t.string :title, :null => false
....
end


Step 2) Tell rails that the database is in utf-8 by putting this in your database.yml:

development:
adapter: mysql
database: filmamora_development
encoding: utf8


Step 3) Make sure the data going IN is in utf8. Filmamora gets a lot of data from scraping our wholesales site (using scrapi), so I do this:

film_info_es = dvdgo_movie.scrape(URI.parse("#{server}/#{page}"),
:parser_options => {"char-encoding" => "latin0",
"output-encoding" => "utf8", "quote-marks" => "true" },
:user_agent => "Mozilla/5.0 (Macintosh; U; Intel Mac OS X; en-US; rv:1.8.1.12) Gecko/20080201 Firefox/2.0.0.12")


Step 4) search!
@films = Ultrasphinx::Search.new(:query => @query, :class_names => ['Film'],
:page => params[:page]||1, :per_page => 12,
:filters => {"available" => 1}).run

Now, if you were (like us) unlucky enough to have NOT done this from the start and can't re-run your migrations (we have a catalogue of 35,000 items!), you can do this to your tables:

ALTER TABLE mytablename DEFAULT CHARACTER SET utf8

We then deleted all our character data (film titles, actor's name, synopsis) and reimported. We kept our id's (which we needed to make sure our existing orders and wishlists, etc didn't go wrong) but got the new data.

In theory you don't need to set the charset on each table. In MySQL you can create a database with a default charset:

CREATE TABLE tablename DEFAULT CHARACTER SET utf8;

And then all your tables should use it too.

But, I don't think that rake db:create does this... (maybe it does now that I have set the character set in the database.yml... hmmm). I also am not sure if the rake db tasks to create tables explicitly set the charset or not. Anyone?

I'm going to write a few posts on sphinx/ultrasphinx because it is so awesome and I had some issues finding clear info on how to use it, so I might as well write it up.