The limits of pg_search

At work, I’ve been tinkering with Postgres, trying to optimize the search we’ve implemented for our company’s app. For those of you who don’t know, Postgres ships with a lot of great built-in functionality for full-text search. In our Rails app, we use a gem called pg_search that builds on these features to quickly set up a search engine for your site. Say you want to search through a list of contacts; with pg_search, you just have to define a simple search scope.

class Contact < ActiveRecord::Base
  include PgSearch

  pg_search_scope :search, against: [:first_name, :last_name]
end

Now we can simply call the search method on the Contact model to find matching records quickly and easily.

Contact.search('Tim')

We also make use of pg_search’s other search feature, multisearchable. The code required to set it up is very similar to a simple pg_search_scope.

class Contact < ActiveRecord::Base
  include PgSearch

  multisearchable against: [:first_name, :last_name]
end

The difference is that multisearchable creates a PgSearch::Document for each record: a row that points back to the original searchable object. So if you want to search across multiple tables, you can query the single pg_search_documents table and pull up any records belonging to a specific type. The great thing about multisearchable is that it returns an ActiveRecord::Relation, which means you can chain commands onto your search result.

PgSearch.multisearch('Tim').where(searchable_type: 'Contact')
PgSearch.multisearch('Tim Park').where(searchable_type: 'Contact').limit(5)

You can even make use of more advanced search techniques, such as dictionaries, indexes, and weighting.

PgSearch.multisearch_options = {
  using: {
    tsearch: {
      prefix: true,          # match partial words, so 'Ti' will find 'Tim'
      dictionary: 'english'  # stem words using Postgres's english dictionary
    }
  }
}

One of the problems we’ve come across is that while multisearchable is convenient and fast, it takes up a lot of space. Any time you create a new record, you are also creating a PgSearch::Document to point to it. And because our PgSearch::Documents make use of a GIN-indexed tsvector column, creating and updating the pg_search_documents table is taxing on our system and an expensive use of space. Unfortunately, we can’t fall back on a simple pg_search_scope, because it doesn’t really support searches that require table JOINs.
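For context, pg_search’s multisearch migration generator creates a table roughly like the sketch below; the tsv column is our own addition, not part of the stock generator. Every searchable record gets a row here with a denormalized copy of its text, which is why the table grows so quickly.

class CreatePgSearchDocuments < ActiveRecord::Migration
  def up
    create_table :pg_search_documents do |t|
      t.text :content  # denormalized copy of the searchable columns
      t.belongs_to :searchable, polymorphic: true, index: true
      t.timestamps
    end

    # Our own addition: an indexed tsvector column for faster matching.
    add_column :pg_search_documents, :tsv, :tsvector
    execute "CREATE INDEX index_pg_search_documents_on_tsv ON pg_search_documents USING gin(tsv);"
  end

  def down
    drop_table :pg_search_documents
  end
end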

In an effort to remove the need for the pg_search_documents table, I tried to write a method that could search our database without pg_search and multisearchable. I began by indexing all of the tables we search with a tsvector, which reduces a document to a normalized, easily searchable form. Thoughtbot published a great blog post detailing how to generate a migration that creates an indexed tsvector on your tables (a sketch of such a migration follows the excerpt below). We use GIN indexing, but for smaller apps a GiST index will likely be better. For more information, you can read the Postgres documentation, but here are the main takeaways:

In choosing which index type to use, GiST or GIN, consider these performance differences:

  GIN index lookups are about three times faster than GiST

  GIN indexes take about three times longer to build than GiST

  GIN indexes are moderately slower to update than GiST indexes, but about 10 times slower if fast-update support was disabled (see Section 54.3.1 for details)

  GIN indexes are two-to-three times larger than GiST indexes

As a rule of thumb, GIN indexes are best for static data because lookups are faster. For dynamic data, GiST indexes are faster to update. Specifically, GiST indexes are very good for dynamic data and fast if the number of unique words (lexemes) is under 100,000, while GIN indexes will handle 100,000+ lexemes better but are slower to update.
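Here’s a minimal sketch of the kind of migration the Thoughtbot post describes, assuming a contacts table with first_name and last_name columns (the column, trigger, and index names here are illustrative):

class AddTsvectorToContacts < ActiveRecord::Migration
  def up
    add_column :contacts, :tsv, :tsvector

    execute <<-SQL
      CREATE INDEX index_contacts_on_tsv ON contacts USING gin(tsv);

      -- Keep tsv up to date on every insert and update.
      CREATE TRIGGER contacts_tsv_update BEFORE INSERT OR UPDATE
      ON contacts FOR EACH ROW EXECUTE PROCEDURE
      tsvector_update_trigger(tsv, 'pg_catalog.english', first_name, last_name);
    SQL

    # Backfill existing rows so they are searchable immediately.
    execute "UPDATE contacts SET tsv = to_tsvector('pg_catalog.english', coalesce(first_name, '') || ' ' || coalesce(last_name, ''));"
  end

  def down
    execute "DROP TRIGGER contacts_tsv_update ON contacts;"
    remove_column :contacts, :tsv
  end
end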

Ultimately, I was able to create an alt_search method that executes a raw SQL query to return a list of results. It’s important to note that on the contacts table, the :tsv column is a tsvector of :first_name and :last_name, while the :tsv column on the addresses table is a tsvector of :address, :city, :state, and :zip_code. The catch is that exec_query returns rows as plain hashes rather than model objects, so to end up with a real ActiveRecord::Relation, I had to query the contacts table once again to find the results by ID.

def self.alt_search(query)
  # Quote the user-supplied term before interpolating it into raw SQL,
  # so the query is safe from SQL injection.
  term = ActiveRecord::Base.connection.quote(query)
  sql = <<-SQL_QUERY
    SELECT contacts.id
    FROM contacts
    INNER JOIN (
      SELECT contacts.id AS pg_search_id,
        ts_rank(contacts.tsv, to_tsquery('english', #{term}), 0) AS rank
      FROM contacts
      WHERE contacts.tsv @@ to_tsquery('english', #{term})
      ORDER BY rank DESC
      OFFSET 0
    ) pg_search ON contacts.id = pg_search.pg_search_id

    UNION

    SELECT contacts.id
    FROM contacts
    INNER JOIN (
      SELECT addresses.id AS pg_search_id,
        ts_rank(addresses.tsv, to_tsquery('english', #{term}), 0) AS rank
      FROM addresses
      WHERE addresses.tsv @@ to_tsquery('english', #{term})
      ORDER BY rank DESC
      OFFSET 0
    ) pg_search_addresses ON contacts.address_id = pg_search_addresses.pg_search_id
    -- NOTE: the join above assumes a contacts.address_id foreign key;
    -- adjust it to match how contacts and addresses are associated in your schema.
  SQL_QUERY

  results = ActiveRecord::Base.connection.exec_query(sql)
  matched_ids = results.map { |row| row['id'].to_i }
  # Querying by ID gives us back a real ActiveRecord::Relation
  # (empty, rather than nil, when nothing matches).
  Contact.where(id: matched_ids)
end
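Calling it looks just like the earlier scopes, and because it always returns a relation, it chains the same way:

Contact.alt_search('Tim')
Contact.alt_search('Tim').where(last_name: 'Park').limit(5)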

At first, this method seemed promising: it was almost an order of magnitude faster than multisearchable queries. However, when you start working with large tables (millions of records), pg_search shines. The power of the PgSearch::Document is that it serves as an indexed table of its own: you can query all of your tables at once and then filter through the documents. Implementing pg_search is also much easier. Imagine trying to refactor that SQL query whenever your tables change or you add new model attributes!


For now, it seems as though we’ll have to live with PgSearch::Documents and look for other options. Many of the most visited websites in the world use Apache Solr, and it seems we may be headed that way. But if you’re in a similar situation, I recommend that you at least take the time to GIN-index your pg_search_documents table; a sketch of such a migration is below. We have found that this increases search speeds by up to 50%, which can be dramatic when you’re searching a table with millions of records.
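If you’re on the stock multisearch setup, searches run against the content column, so the GIN index is an expression index. Here’s a sketch, assuming the 'english' dictionary configured earlier (if you maintain a dedicated tsvector column as we do, index that column instead):

class AddGinIndexToPgSearchDocuments < ActiveRecord::Migration
  def up
    execute <<-SQL
      CREATE INDEX index_pg_search_documents_on_content
      ON pg_search_documents
      USING gin(to_tsvector('english', content));
    SQL
  end

  def down
    execute "DROP INDEX index_pg_search_documents_on_content;"
  end
end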

EDIT: After some research and experimentation, we decided to use Elasticsearch. Our Elasticsearch node is hosted by Bonsai, an add-on readily available through the Heroku marketplace. Elasticsearch allowed us to drop our pg_search_documents table, which had grown to almost 10 GB! I highly recommend you look into Elasticsearch if you would like to add search capability to your app. If you’re looking for some direction, shoot me an email!
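If you want a feel for what the switch involves, here’s a minimal sketch using the elasticsearch-model gem. This is a generic illustration, not our production setup:

require 'elasticsearch/model'

class Contact < ActiveRecord::Base
  include Elasticsearch::Model
  include Elasticsearch::Model::Callbacks  # keep the index in sync on save and destroy
end

Contact.import                      # one-time bulk index of existing rows
Contact.search('Tim').records.to_a  # fetch matches as ActiveRecord objects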
