Elasticsearch: A Brief Introduction

At my previous company, we relied on PostgreSQL’s built-in text search to search through data across multiple tables. While Postgres search was adequate for a while, it became clear that our growing customer base needed a faster and more reliable solution. So I decided to use Elasticsearch, a powerful search engine built on Apache Lucene. Learning to use Elasticsearch’s query DSL effectively has been a journey, and I wanted to share some thoughts for those of you who are interested in learning more.

If you are really interested in becoming a pro, I recommend reading through Elasticsearch: The Definitive Guide. It’s digestible, and I read it front to back before starting my project in order to get a broad view of the tool’s possibilities and best practices.


Elasticsearch is a JSON document-based, schema-free database. This means that instead of your standard relational database, where each record is a row in a table, a record in an Elasticsearch cluster is simply a JSON document. The attributes for each record live in the document itself, like the one below. This design allows us to index documents in a distributed manner, because related documents are not restricted to the same table – more on that later. The sophisticated built-in indexing and querying tools also allow us to retrieve documents quickly, sifting through millions of records almost instantly. But that’s just the beginning; Elasticsearch has many other features that make it easy to use and reliable.

{
  "_index": "users",
  "_type": "user",
  "_id": "1",
  "_version": 1,
  "found": true,
  "_source": {
    "id": 1,
    "name": "Tim"
  }
}
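
The document above is what Elasticsearch returns when you fetch a record, but getting one into the cluster is just as simple. As a minimal sketch (the index, type, and field names simply mirror the example above), indexing a document is a single HTTP request:

PUT /users/user/1
{
  "id": 1,
  "name": "Tim"
}

Elasticsearch then wraps the _source you sent with the metadata fields (_index, _type, _id, _version) you see in the response above.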

Database Architecture

Elasticsearch has three major components that help organize data in a safe and easily accessible manner. At the highest level is a cluster, which you can think of as the database itself. Within a cluster is a collection of nodes, or servers. Nodes talk to each other to manage data, but only the master node coordinates how data is stored and retrieved. Each node can contain any number of shards, each of which – for the time being – you can think of as an index in itself.
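
You choose how many primary and replica shards an index gets when you create it. A minimal sketch (the index name and counts here are just illustrative) for a three-primary-shard index with one replica per shard looks like this:

PUT /users
{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 1
  }
}

The number of primary shards is fixed once the index is created, but the number of replicas can be changed at any time.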

In the figure below (source), you have three primary shards and three replica shards, in green and grey respectively. The primary shard P1 might contain my /users index, while P2 and P3 hold my /articles and /comments indices. Because of the distributed nature of the shards, only two indices are competing for each node’s (server’s) computing resources. In addition, the replica shards, designated with an R, index all new documents as well and can handle search requests simultaneously. This makes indexing and retrieval much faster overall.

[Figure: an Elasticsearch cluster of three nodes, with primary shards (green) and replica shards (grey) spread across them]

But there’s another added benefit – a replica shard never lives on the same node as its primary counterpart. This means that in the event that a node fails, a replica on another node is automatically promoted to primary and takes over incoming requests. There’s no downtime, and you can happily work on rescuing your failed node while the cluster continues to handle requests with ease. Compare that to constantly backing up a SQL database!
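
You can watch this resilience at work with the cluster health endpoint, which is part of the standard API:

GET /_cluster/health

A "green" status means every primary and replica shard is allocated, "yellow" means some replicas are unassigned (as on a single-node cluster), and "red" means at least one primary shard is missing.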

One Index, Multiple Shards

Let’s dig into the concept of indices and shards a little more. Earlier, I said documents are indexed in a distributed manner. What I meant is that documents are not restricted to a single table. This is easiest to see with primary and replica shards. A primary shard and its replica index the same documents, so if the primary fails, the replica is promoted to primary. We can extend this concept to a single index with multiple shards. In the figure above, imagine we have three shards that serve ONE index, /users.

In that situation, all documents for the /users index are stored in a distributed manner. One third of the documents get indexed in each primary/replica shard pair. This is very different from a relational database, where ALL documents related to each other are stored in one table. But the advantage is that each shard can perform a search on the index’s documents at the same time.

For example, let’s say that I want to search for users whose first name is Tim. The GET query goes to the Elasticsearch cluster, and the master node sends it to the /users index. It knows that the /users index lives on shards P1, P2, and P3, so each shard searches through its own documents and returns all relevant hits. Those hits get aggregated, and the final search results are returned to the requesting client. Compare that to a normal table search in a relational database, where every record in the table must be searched. In Elasticsearch, each shard is only responsible for one third of the documents, which makes searching much faster.


Search Made Easy

The essential quality of a good search engine is accuracy. This is easy to do with simple queries, but what happens when your user clicks on that dreaded button called “Advanced Search”? How do you find a user on a dating site who is male, from Colorado, at least 30 years old, likes oranges, but not apples, and is available on weekends from 10 AM to 4 PM?  Luckily, Elasticsearch’s query DSL is incredibly powerful and simple to use. Just compare the two queries below, written in SQL and Elasticsearch.

SELECT *
FROM users
WHERE first_name = 'Tim'
OR last_name = 'Tim'

GET /users/_search
{
  "query": {
    "multi_match": {
      "query":  "Tim",
      "fields": ["first_name^2", "last_name"]
    }
  }
}

We’re matching on multiple fields, first_name and last_name, and we want either field to contain the word “Tim”. The SQL query is quite clear, but the Elasticsearch query gives us a slightly more transparent language, which is crucial with more complex searches. We also get a glimpse into the power of Elasticsearch. What if we think people with “Tim” as a first_name are twice as relevant as people with “Tim” as a last_name? With a simple caret, we can say that the first_name field is two times more important than last_name. Try doing that with SQL!
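
And to come back to that “Advanced Search” scenario, these pieces compose through the bool query. Here is a rough sketch for the dating-site example above (the field names are hypothetical, not taken from a real mapping):

GET /users/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "gender": "male" } },
        { "match": { "state": "Colorado" } },
        { "range": { "age": { "gte": 30 } } },
        { "match": { "likes": "oranges" } }
      ],
      "must_not": [
        { "match": { "likes": "apples" } }
      ]
    }
  }
}

Each clause is a full query in its own right, so boosts, ranges, and multi_match queries all nest inside a bool query the same way.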

In order to make those complex searches easier to digest, you can customize how certain attributes are indexed and analyzed. This increases the efficiency and speed of filtering documents. Within an index, your configurations are called mappings and they can be manipulated to yield powerful results.

For example, Elasticsearch has a built-in “english” analyzer that chops long strings into digestible chunks. Among other things, it stems words (“dogs” becomes “dog”) and removes predetermined stop words, like “not”. There is also a “standard” analyzer that simply splits text into individual lowercase words, without stemming or stop word removal. As an example, the english analyzer converts the following sentence into these discrete tokens:

I'm not happy about the foxes

Tokens: i'm, happi, about, fox
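
You don’t have to take my word for it; the analyze API shows you exactly which tokens an analyzer produces. The exact request format varies a bit by Elasticsearch version, but it looks roughly like this:

GET /_analyze
{
  "analyzer": "english",
  "text": "I'm not happy about the foxes"
}

Swap “english” for “standard” and you get back every word, unstemmed and with the stop words intact.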

Stemming and dropping stop words like this is great because it reduces the amount of data we need to store for each string. But if you’re looking for an exact match on “not happy”, then we have a problem: strings where people are “happy” and “not happy” will be scored equally in the search results. One elegant solution is to use multifields in Elasticsearch to store the field twice, once with each analyzer.

Below, we’re storing (PUTing) a mapping into “my_index”. We’re telling Elasticsearch that a blog object has an attribute called title, which is a string that should be analyzed with the english analyzer. Within the title field, we create a “fields” parameter that also stores the same string with the (default) standard analyzer. This extra sub-field can be accessed with a simple dot, “title.standard”.

PUT /my_index
{
  "mappings": {
    "blog": {
      "properties": {
        "title": {
          "type":     "string",
          "analyzer": "english",
          "fields": {
            "standard": {
              "type": "string"
            }
          }
        }
      }
    }
  }
}

Now we can simply do a multi_match query that searches both the english analyzed and standard analyzed fields.

GET /my_index/_search
{
  "query": {
    "multi_match": {
      "query":    "not happy foxes",
      "fields": [ "title", "title.standard" ]
    }
  }
}

To drive the point home, let’s think of another solution. We can tweak the mapping in the index to have the sub-field be called “original”. This time, we will make sure it’s “not_analyzed” meaning that the original string is preserved as is – “I’m not happy about the foxes”.

"fields": {
  "original": {
    "type":  "string",
    "index": "not_analyzed"
  }
}

Now, in general we can search for stories with “happy fox” in the title. But if we want an advanced search to get an exact match, we can simply search in the “title.original” field and forget about the english analyzed field.

GET /my_index/_search
{
  "query": {
    "multi_match": {
      "query":    "I'm not happy about the foxes",
      "fields":   [ "title.original" ]
    }
  }
}

NOTE: For searching one field, we wouldn’t have to use the “multi_match” query, but I kept it in to make things easy to follow.


Aggregations

Aggregations build upon the search query DSL so you can use your data for all sorts of analyses. With two simple concepts, metrics and buckets, you can perform all sorts of complex analyses on already indexed data. A metric is what it sounds like: a measure computed over a set of documents, such as a count or an average. A bucket is a group of documents filtered by some distinct criterion. The power of aggregations comes from being able to nest buckets! And since the filtering mechanism works the same way as a search query, you can analyze tens of millions of documents very quickly. Let’s take a quick look.

Perhaps we have the test scores of every student in the state of Colorado. We can easily perform an aggregation that will tell us the average grade (0–100) across all students. We query Elasticsearch with the “aggregations” keyword, name our aggregation “avg_grade_all_users”, and then simply ask for the average of the “grade” field.

{
  "aggregations" : {
    "avg_grade_all_users" : {
      "avg" : {
        "field" : "grade" 
      } 
    }
  }
}

Simple, right? In tandem with the search query DSL, you can even start refining your search. Let’s say we want the average grade of all students over 16 years old. We simply prepend our aggregation with a “query” that asks for all documents where the student’s age is greater than (gt) 16, and set “size” to 0 so Elasticsearch skips returning the matching documents themselves. That’s it! The separation of the search query filter and the aggregation makes it easy to see exactly what is happening: first we restrict ourselves to students over 16 years old, then we take the average of their grades.

{
  "size": 0,
  "query": {
    "range": {
      "age": {
        "gt": 16
      }
    }
  },
  "aggregations" : {
    "avg_grade_over_sixteen" : {
      "avg" : {
        "field" : "grade" 
      } 
    }
  }
}

Let’s take it one step further by separating test scores for students over 16 by gender. We take the pre-existing aggregations query and add a “filters” argument to tell Elasticsearch to split our parent bucket into two child buckets. While the query looks a bit longer, it’s not hard to follow what’s happening. We have a high-level aggregation that splits boys and girls into two buckets. Within each of the “boys” and “girls” buckets is a lower-level aggregation that averages the grades for that bucket.

{
  "size": 0,
  "query": {
    "range": {
      "age": {
        "gt": 16
      }
    }
  },
  "aggregations" : {
    "avg_grade_over_sixteen" : {
      "filters": {
        "filters": {
          "boys": {
            "match": { "gender": "male" }
          },
          "girls": {
            "match": { "gender": "female" }
          }
        }
      },
      "aggregations": {
        "avg_grade": {
          "avg" : {
            "field" : "grade" 
          }
        } 
      }
    }
  }
}

The response looks something like this:

{
  ...,
  "aggregations" : {
    "avg_grade_over_sixteen" : {
      "buckets" : {
        "boys" : {
          "doc_count" : 146,
          "avg_grade" : {
            "value" : 83
          }
        },
        "girls" : {
          "doc_count" : 181,
          "avg_grade" : {
            "value" : 89
          }
        }
      }
    }
  }
}

It’s impossible to cover everything that Elasticsearch can do, but I hope that this gives you a glimpse into its utility. The greatest aspect of Elasticsearch is that for most people, you can just load it up and start using it right out of the box. It already determines how to distribute shards across nodes and will index documents dynamically, even if you don’t know anything about mappings. But once you dig into the search query DSL, you uncover the power of conducting complex searches in seconds and running aggregations to analyze data over millions of documents at once.


Indexing with the Elasticsearch Bulk API

If you’ve ever worked with the elasticsearch-rails gem, then you’ve likely used the import method to populate your local Elasticsearch indices. This is great for development because it lets you quickly index documents and get on with the important work, like refining your search queries to your desired specification.
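
For anyone who hasn’t seen it, the import call is a one-liner, assuming your model includes Elasticsearch::Model:

# Delete, re-create, and repopulate the articles index in batches.
Article.import force: true

The force: true option re-creates the index before importing, which is convenient in development but not something to run casually against production.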

However, this technique doesn’t hold up when you try to index a large number of documents in a production or staging environment. You should always work with Elasticsearch in a staging environment first, to see whether your search queries behave as expected with several thousand documents before pushing to production. Similarly, you want to have a plan for indexing several million documents quickly.

The best strategy is to use Elasticsearch’s built-in Bulk API, which enables you to index tens of thousands of documents in one request! I elected to use a Sidekiq worker to handle the creation of a bulk request with 1,000 documents. This way, I can iterate over an entire table and batch my indexing into 1,000-document requests.
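
Under the hood, a bulk request body is just newline-delimited JSON: each action line is followed by the document source it applies to. A hedged sketch of what one of those requests looks like on the wire (the index, type, and fields here are illustrative):

POST /articles_v1/article/_bulk
{ "index": { "_id": 1 } }
{ "id": 1, "title": "First article" }
{ "index": { "_id": 2 } }
{ "id": 2, "title": "Second article" }

The Sidekiq worker below builds the Ruby equivalent of this payload and lets the elasticsearch-ruby client handle the serialization.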

class ElasticsearchBulkIndexWorker
  include Sidekiq::Worker
  sidekiq_options retry: 5, queue: 'low'

  # Memoized Elasticsearch client so we only build it once per job.
  def client
    @client ||= Elasticsearch::Client.new host: CONFIG['ELASTICSEARCH_URL']
  end

  # model: string name of the model, e.g. 'article'
  # starting_index: first id of the 1,000-record batch to index
  def perform(model, starting_index)
    klass = model.capitalize.constantize
    batch_for_bulk = []
    klass.where(id: starting_index..(starting_index + 999)).each do |record|
      # Skip archived records; everything else goes into the bulk payload.
      batch_for_bulk.push({ index: { _id: record.id, data: record.as_indexed_json } }) unless record.try(:archived)
    end
    return if batch_for_bulk.empty?

    client.bulk(
      index: "#{model.pluralize}_v1",
      type: model,
      body: batch_for_bulk
    )
  end
end

In order to iterate over an entire table, I simply created a rake task to handle counting and feeding batch offsets to the worker. Notice that ElasticsearchBulkIndexWorker takes a starting_index argument to handle batching. It also takes a string naming the model, which is used to find the appropriate records and match them with the similarly named index.

namespace :elasticsearch do
  task build_article_index: :environment do
    (1..Article.last.id).step(1000).each do |starting_index|
      ElasticsearchBulkIndexWorker.perform_async('article', starting_index)
    end
  end

  task build_comment_index: :environment do
    (1..Comment.last.id).step(1000).each do |starting_index|
      ElasticsearchBulkIndexWorker.perform_async('comment', starting_index)
    end
  end
end

This should help you index a large number of documents quickly. For more information, check out the Elasticsearch Bulk API documentation to figure out how to tune write consistency during your bulk indexing. Good luck!