Elasticsearch: A Brief Introduction

At my previous company, we relied on PostgreSQL’s built-in text search to search through data across multiple tables. While Postgres search was adequate for some time, it was clear that our growing customer base needed a faster and more reliable solution. So I decided to use Elasticsearch, a powerful search engine based on Apache Lucene. It has been a journey learning how to use Elasticsearch’s query DSL effectively, and I wanted to share some thoughts for those of you who are interested in learning more.

If you are really interested in becoming a pro, I recommend reading through Elasticsearch: The Definitive Guide. It’s digestible, and I read it front to back before starting my project to get a broad view of the tool’s possibilities and best practices.


Elasticsearch is a JSON document-based, schema-free database. This means that instead of your standard relational database with each record being a row in a table, a record in an Elasticsearch cluster is simply a JSON document. The attributes for each record exist in the document itself, like below. This allows us to index documents in a distributed manner, because related documents aren’t confined to a single table – more on that later. The sophisticated built-in indexing and querying tools also allow us to retrieve documents quickly, sifting through millions of records almost instantly. But that’s just the beginning; Elasticsearch has many other features that make it easy to use and reliable.

{
    "_index": "users",
    "_type": "user",
    "_id": "1",
    "_version": 1,
    "found": true,
    "_source": {
      "id": 1,
      "name": "Tim"
    }
}

Database Architecture

Elasticsearch has three major components that help organize data in a safe and easily accessible manner. At the highest level is a cluster, which you can think of as the database itself. Within a cluster is a collection of nodes, or servers. Nodes talk to each other to manage data, but only the Master Node coordinates how data are stored and retrieved. Each node can contain any number of shards, which – for now – you can think of as self-contained indices.
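
If you want to see this structure on a running cluster, the cluster health API summarizes it for you. The endpoint is real, but the response below is abbreviated and the numbers are purely illustrative:

GET /_cluster/health

{
  "cluster_name": "my_cluster",
  "status": "green",
  "number_of_nodes": 3,
  "active_primary_shards": 3,
  "active_shards": 6
}

A “green” status means every primary and replica shard is assigned to a node; “yellow” means some replicas are unassigned, and “red” means a primary is missing.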

In the figure below (source), you have three primary shards and three replica shards in green and grey respectively. The primary shard, P1, might contain my /users index, while P2 and P3 are my /articles and /comments indices. Because of the distributed nature of the shards, only two indices are competing for each node’s (server’s) computing resources. In addition, replica shards, designated with an R, will index all new documents as well and can handle search requests simultaneously. This makes indexing and retrieval much faster overall.

[Figure: elasticsearch-cluster – three nodes holding primary (P) and replica (R) shards]

But there’s another added benefit – a replica shard never lives on the same node as its primary counterpart. This means that if a node fails, the corresponding replica on another node is automatically promoted to primary and takes over incoming requests. There’s no downtime, and you can happily work on rescuing your failed node while the cluster continues to handle requests with ease. Compare that to constantly backing up a SQL database!

One Index, Multiple Shards

Let’s dig into the concept of indices and shards a little more. Before, I said documents are indexed in a distributed manner. What I meant is that documents are not restricted to a single table. This can be easily demonstrated with primary and replica shards. Every primary shard and its replica index the same documents, so if the primary fails, the replica can be promoted to the new primary. We can extend this concept to a single index with multiple shards. In the figure above, imagine we have three shards that serve ONE index, /users.

In that situation, all documents for the /users index get stored in a distributed manner. One third of the documents get indexed in each primary/replica shard pair. Now this is very different from a relational database where ALL documents related to each other are stored in a single table. But the advantage is that each shard can perform a search on the index’s documents at the same time.
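
If you want that layout explicitly, you can set the shard and replica counts yourself when creating the index. Here’s a minimal sketch – the counts are just for illustration, and in practice they should match your data and hardware:

PUT /users
{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 1
  }
}

With one replica per primary, Elasticsearch spreads three primary shards and three replicas across the available nodes, just like the figure above.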

For example, let’s say that I want to search for users whose first name is Tim. The GET query goes to the Elasticsearch cluster and the Master Node sends it to the /users index. It knows that the /users index exists within shards P1, P2, P3. So, each shard searches through its documents and returns all relevant hits. Those hits get aggregated and the final search result is returned to the requesting client. Compare that to a normal table search in a relational database where every record in the table must be scanned. In Elasticsearch, each shard is only responsible for one third of the documents, which makes searching much faster.


Search Made Easy

The essential quality of a good search engine is accuracy. This is easy to do with simple queries, but what happens when your user clicks on that dreaded button called “Advanced Search”? How do you find a user on a dating site who is male, from Colorado, at least 30 years old, likes oranges, but not apples, and is available on weekends from 10 AM to 4 PM?  Luckily, Elasticsearch’s query DSL is incredibly powerful and simple to use. Just compare the two queries below, written in SQL and Elasticsearch.

SELECT *
FROM users
WHERE first_name = 'Tim'
OR last_name = 'Tim'

GET /users/_search
{
  "query": {
    "multi_match": {
      "query": "Tim",
      "fields": ["first_name^2", "last_name"]
    }
  }
}

We’re matching multiple fields, first_name and last_name, and we want either field to contain the word “Tim”. The SQL query is quite clear, but the Elasticsearch query gives us a slightly more transparent language, which is crucial with more complex searches. We also get a glimpse into the power of Elasticsearch. What if we think people with “Tim” as a first_name are twice as relevant as people with “Tim” as a last_name? With a simple caret, we can say that the first_name field is 2 times more important than last_name. Try doing that with SQL!

In order to make those complex searches easier to digest, you can customize how certain attributes are indexed and analyzed. This increases the efficiency and speed of filtering documents. Within an index, your configurations are called mappings and they can be manipulated to yield powerful results.

For example, Elasticsearch has a built-in “english” analyzer that chops long strings into digestible chunks. Among other things, it stems words (“dogs” becomes “dog”) and removes pre-determined stop words, like “not”. There is also a “standard” analyzer that simply splits text into individual words without stemming or removing stop words. As an illustration, the english analyzer would convert the following sentence into discrete tokens:

I'm not happy about the foxes

Tokens: i'm, happi, about, fox
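
You can see exactly what an analyzer will do to a string with the _analyze API. Here’s a quick sketch; note that on older Elasticsearch versions you pass the analyzer as a query-string parameter and the text as the raw request body instead of this JSON form:

GET /_analyze
{
  "analyzer": "english",
  "text": "I'm not happy about the foxes"
}

The response lists each token along with its position and offsets, which is handy when you’re debugging why a search does or doesn’t match.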

This is great because it reduces the amount of data we need to store for each string. But if you’re looking for an exact match on “not happy”, then we have a problem. Strings where people are “happy” and “not happy” will be equally scored in the search results. One elegant solution would be to use multifields in Elasticsearch to store the field twice, with both analyzers.

Below, we’re storing (PUTting) a mapping into “my_index”. We’re telling Elasticsearch that a blog object has an attribute called title, which is a string that should be analyzed with the english analyzer. Within the title field, we can create a “fields” parameter which also stores the same string with the (default) standard analyzer. This extra sub-field can be accessed with a simple dot, “title.standard”.

PUT /my_index
{
  "mappings": {
    "blog": {
      "properties": {
        "title": { 
          "type":      "string",
          "analyzer":  "english",
          "fields": {
            "standard": { 
              "type":  "string"
            }
          }
        }
      }
    }
  }
}

Now we can simply do a multi_match query that searches both the english analyzed and standard analyzed fields.

GET /my_index/_search
{
  "query": {
    "multi_match": {
      "query":    "not happy foxes",
      "fields": [ "title", "title.standard" ]
    }
  }
}

To drive the point home, let’s think of another solution. We can tweak the mapping in the index to have the sub-field be called “original”. This time, we will make sure it’s “not_analyzed” meaning that the original string is preserved as is – “I’m not happy about the foxes”.

"fields": {
  "original": { 
    "type":  "string",
    "index": "not_analyzed"
  }
}

Now, in general we can search for stories with “happy fox” in the title. But if we want an advanced search to get an exact match, we can simply search in the “title.original” field and forget about the english analyzed field.

GET /my_index/_search
{
  "query": {
    "multi_match": {
      "query":    "I'm not happy about the foxes",
      "fields":   [ "title.original" ]
    }
  }
}

NOTE: For searching one field, we wouldn’t have to use the “multi_match” query, but I kept it in to make things easy to follow.


Aggregations

Aggregations build upon the search query DSL so you can use your data for all sorts of analyses. With two simple concepts, metrics and buckets, you can perform complex analyses on data you’ve already indexed. A metric is just what it sounds like: a calculation over a set of documents, such as a count or an average. A bucket is a collection of documents grouped by some distinct field. The power of aggregations comes from being able to nest buckets! Since the filtering mechanism works the same way as a search query, you can analyze tens of millions of documents very quickly. Let’s take a quick look.

Perhaps we have the test scores of every student in the state of Colorado. We can easily perform an aggregation that will tell us the average grade (0–100) for all students. We query Elasticsearch with the “aggregations” keyword, name our aggregation “avg_grade_all_users”, and then simply ask for the average of the “grade” field.

{
  "aggregations" : {
    "avg_grade_all_users" : {
      "avg" : {
        "field" : "grade" 
      } 
    }
  }
}

Simple, right? In tandem with the search query DSL, you can even start refining your search. Let’s say we want the average grade of all students over 16 years old. We simply prepend our aggregation with a “query” that asks for all documents where the student’s age is greater than (gt) 16. That’s it! The separation of the search query filter and the aggregation makes it easy to see what exactly is happening. First, we only look for records for students over 16 years old, then we take the average of their grades.

{
  "size": 0,
  "query": {
    "range": {
      "age": {
        "gt": 16
      }
    }
  },
  "aggregations" : {
    "avg_grade_over_sixteen" : {
      "avg" : {
        "field" : "grade" 
      } 
    }
  }
}

Let’s take it one step further by separating test scores for students over 16 by gender. We take the pre-existing aggregations query and add a “filters” argument to tell Elasticsearch to split our parent bucket into two child buckets. While our query looks a bit longer, it’s not hard to follow what’s happening. We have a high-level aggregation that is splitting up boys and girls into two buckets. Within each of the “boys” and “girls” buckets is a lower-level aggregation that averages the grades for that bucket.

{
  "size": 0,
  "query": {
    "range": {
      "age": {
        "gt": 16
      }
    }
  },
  "aggregations" : {
    "avg_grade_over_sixteen" : {
      "filters": {
        "filters": {
          "boys": {
            "match": { "gender": "male" }
          },
          "girls": {
            "match": { "gender": "female" }
          }
        }
      },
      "aggregations": {
        "avg_grade": {
          "avg" : {
            "field" : "grade" 
          }
        } 
      }
    }
  }
}

The response looks something like this:

{
  ...,
  "aggregations" : {
    "avg_grade_over_sixteen" : {
      "buckets" : {
        "boys" : {
          "doc_count" : 146,
          "avg_grade" : {
            "value" : 83
          }
        },
        "girls" : {
          "doc_count" : 181,
          "avg_grade" : {
            "value" : 89
          }
        }
      }
    }
  }
}

It’s impossible to cover everything that Elasticsearch can do, but I hope that this gives you a glimpse into its utility. The greatest aspect of Elasticsearch is that most people can just load it up and start using it right out of the box. It already determines how to distribute shards across nodes and will index documents dynamically, even if you don’t know anything about mappings. But once you dig into the search query DSL, you uncover the power of conducting complex searches in seconds and running aggregations to analyze data over millions of documents at once.
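
To see that out-of-the-box behavior for yourself, index a document into an index that doesn’t exist yet. Assuming automatic index creation is enabled (the default), Elasticsearch creates the index, generates a dynamic mapping, and stores the document in one step – the names below are just for illustration:

PUT /users/user/1
{
  "id": 1,
  "name": "Tim"
}

GET /users/user/1

The GET returns a document just like the one at the top of this post, without you ever having defined a schema.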


Testing RSpec model validations

I recently came across an issue while trying to write RSpec tests for model validations that I think is worth addressing. It’s worth testing validations because they may change over time, and you want to make sure your test suite accounts for them. For example, I have a character limit for messages, which is a critical model validation. If I were to accidentally delete the validation or change the character limit, there would be a lot of unexpected consequences.

The first validation I wanted to test was a character limit on messages that was only triggered if the message was sent by a user. To do so, I created a from_user? method that checks whether message.sender_type == 'user'. The model looks like this:

class Message < ActiveRecord::Base
  validates_presence_of :body, :status, :contact
  validates :body, length: { in: 1..420 }, if: :from_user?, on: :create

  def from_user?
    self.sender_type == 'user'
  end
end

At first, it seemed like I could easily use thoughtbot’s shoulda gem to write a simple test.

describe 'character limit for user-sent messages' do
  it { should validate_length_of(:body).is_at_least(1).is_at_most(420).on(:create) }
end

However, the validation kept failing. After a little bit of digging, I realized it was because I never specified a subject. In particular, I never specified that my subject was a user-sent message. I found that there are two solutions to this problem.

The first is to specify a subject with a FactoryGirl.build. It’s important that you use ‘build’ and not ‘create’, because the ‘create’ method will immediately run all model validations, which is exactly the behavior the matcher is supposed to exercise. In addition, when you ‘build’ the model, make sure you leave out the attribute that you’re testing. In my case, I’m trying to test a body validation, so it wouldn’t make sense to do a FactoryGirl.build where I specify the message body.

The second solution is to write a ‘before’ hook that lets the test assume that the subject is from a user, which will trigger the validation. Either method works well, but I opted to use the second option to practice a new technique.

Option 1

describe 'character limit for user-sent messages' do
  subject { FactoryGirl.build :message, contact: contact, status: 'sent', sender_type: 'user' }
  it { should validate_length_of(:body).is_at_least(1).is_at_most(420).on(:create) }
end

Option 2

describe 'character limit for user-sent messages' do
  before { allow(subject).to receive(:from_user?).and_return(true) }
  it { should validate_length_of(:body).is_at_least(1).is_at_most(420).on(:create) }
end

Keeping track of Rails transactions

Ever heard of transaction_include_any_action?([:action])? You may never need to use this specialized part of ActiveRecord, but I thought I would shed some light on it anyway.

Recently, I was writing a module that helped incorporate several models into Elasticsearch. Concerns::Searchable, as I named it, helps create mappings for models, provides a base method for specialized searches, and also has an after_commit hook to index records. I first looked into ActiveRecord::Transactions to customize when a record is indexed into Elasticsearch.

module Concerns
  module Searchable
    extend ActiveSupport::Concern

    included do
      include Elasticsearch::Model
      after_commit :update_elasticsearch_index

      ...

      def update_elasticsearch_index
        # Delete the document from the index when the record is destroyed or
        # archived; otherwise, (re)index its current state
        if transaction_include_any_action?([:destroy]) || (transaction_include_any_action?([:update]) && self.archived?)
          ElasticsearchIndexWorker.perform_async(:delete, self.id)
        else
          ElasticsearchIndexWorker.perform_async(:index, self.id)
        end
      end
    end
  end
end

What I wanted was for Elasticsearch to delete records that were deleted in Postgres OR “archived”. That way, when a user goes to search for a record, they won’t accidentally get results from “archived” records that are being kept in Postgres for other purposes. Using ActiveRecord::Transactions helped me customize the logic for indexing after_commit.
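
For context, ElasticsearchIndexWorker is just a background job; its implementation isn’t shown here, but a rough sketch – assuming Sidekiq, the elasticsearch-model gem, and a single Message model for simplicity – might look like this:

class ElasticsearchIndexWorker
  include Sidekiq::Worker

  def perform(operation, record_id)
    case operation.to_s
    when 'index'
      # Re-index the current state of the record
      Message.find(record_id).__elasticsearch__.index_document
    when 'delete'
      # The record may already be gone from Postgres, so go through the
      # low-level client instead of loading the model
      Message.__elasticsearch__.client.delete(
        index: Message.index_name,
        type:  Message.document_type,
        id:    record_id
      )
    end
  end
end

In a real application the worker would also need to know which model it’s operating on, since Concerns::Searchable is mixed into several models.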


NOTE: One small hiccup you may encounter is that the argument passed to the method has to be an array.

This is okay:

transaction_include_any_action?([:destroy, :update])

This is not:

transaction_include_any_action?(:destroy)

Adding foreign key constraints in Rails

Here’s a great article about adding foreign key constraints to handle cascading deletion of records. It’s a good idea to handle foreign key constraints through the database because Rails doesn’t always fire an object’s callbacks upon deletion. So if you have a dependent: :destroy callback set up, it might get skipped in some situations.

https://robots.thoughtbot.com/referential-integrity-with-foreign-keys
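
As a minimal sketch of what that looks like in a migration – assuming Rails 4.2+ and hypothetical comments and articles tables – the constraint and cascade rule live in the database itself:

class AddForeignKeyToComments < ActiveRecord::Migration
  def change
    # Let the database delete comments when their parent article is deleted,
    # regardless of whether Rails callbacks ever fire
    add_foreign_key :comments, :articles, on_delete: :cascade
  end
end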

Include vs. Extend…why not both?

If you don’t already know, there is a subtle difference between including and extending a module into one of your classes. include mixes a module’s methods in as instance methods, while extend mixes them in as class methods. See the example below for clarification:

module FooModule
  def foo
    puts 'foo'
  end
end

----------

class Bar
  include FooModule
end

Bar.new.foo => 'foo'
Bar.foo => NoMethodError: undefined method 'foo'

----------

class Bar
  extend FooModule
end

Bar.new.foo => NoMethodError: undefined method 'foo'
Bar.foo => 'foo'

So what if you want to create a module that will grant you access to both class and instance methods simultaneously? Luckily, using Rails’ ActiveSupport::Concern will help you separate your instance and class methods easily.

module FooModule
  extend ActiveSupport::Concern

  def foo
    puts 'foo'
  end

  module ClassMethods
    def baz
      puts 'baz'
    end
  end
end

----------

class Bar
  include FooModule
end

Bar.new.foo => 'foo'
Bar.foo => NoMethodError: undefined method 'foo'
Bar.new.baz => NoMethodError: undefined method 'baz'
Bar.baz => 'baz'