Elasticsearch: A Brief Introduction

At my previous company, we relied on PostgreSQL’s built-in text search to search through data across multiple tables. While Postgres search was adequate for some time, it was clear that our growing customer base needed a faster and more reliable solution. So I decided to use Elasticsearch, a powerful search engine built on Apache Lucene. It has been a journey learning how to use Elasticsearch’s query DSL effectively, and I wanted to share some thoughts for those of you who are interested in learning more.

If you are really interested in becoming a pro, I recommend reading through Elasticsearch: The Definitive Guide. It’s digestible, and I read it front to back before starting on my project to get a broad sense of the possibilities and best practices of this tool.


Elasticsearch is a JSON document-based, schema-free database. This means that instead of a standard relational database where each record is a row in a table, a record in an Elasticsearch cluster is simply a JSON document. The attributes for each record live in the document itself, like the example below. This design lets us index documents in a distributed manner, because documents of the same type are not confined to a single table – more on that later. The sophisticated built-in indexing and querying tools also allow us to retrieve documents quickly, sifting through millions of records almost instantly. But that’s just the beginning; Elasticsearch has many other features that make it easy to use and reliable.

{
    "_index": "users",
    "_type": "user",
    "_id": "1",
    "_version": 1,
    "found": true,
    "_source": {
      "id": 1,
      "name": "Tim",
  }
}

Database Architecture

Elasticsearch has three major components that help organize data in a safe and easily accessible manner. At the highest level is a cluster, which you can think of as the database itself. Within a cluster is a collection of nodes, or servers. Nodes talk to each other to manage data, with an elected master node coordinating cluster-wide changes such as how shards are allocated. Each node can contain any number of shards, which – for the time being – you can think of as an index in its own right.

In the figure below (source), you have three primary shards and three replica shards in green and grey respectively. The primary shard, P1, might contain my /users index, while P2 and P3 are my /articles and /comments indices. Because of the distributed nature of the shards, only two indices are competing for each node’s (server’s) computing resources. In addition, replica shards, designated with an R, will index all new documents as well and can handle search requests simultaneously. This makes indexing and retrieval much faster overall.

[Figure: a three-node Elasticsearch cluster, with primary shards P1–P3 (green) and their replica shards (grey) spread across the nodes]

But there’s another added benefit – a replica shard never lives on the same node as its primary counterpart. This means that in the event that a node fails, a replica on another node will be automatically promoted to primary and take over incoming requests. There’s no downtime, and you can happily work on rescuing your failed node while the cluster continues to handle requests with ease. Compare that to constantly backing up a SQL database!
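
If you want to see how your own cluster is holding up, the cluster health endpoint reports the node and shard counts. A quick sketch of what GET /_cluster/health returns (the exact fields vary a little between Elasticsearch versions):

{
  "status": "green",
  "number_of_nodes": 3,
  "active_primary_shards": 3,
  "active_shards": 6,
  "unassigned_shards": 0
}

A “green” status means every primary and replica shard is allocated, “yellow” means some replicas are unassigned (as on a single-node cluster), and “red” means at least one primary shard is missing.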

One Index, Multiple Shards

Let’s dig into the concept of indices and shards a little more. Earlier, I said documents are indexed in a distributed manner; what I meant is that documents are not restricted to a single table. Primary and replica shards demonstrate this: a primary shard and its replica index the same documents, so if the primary fails, the replica can be promoted to the new primary. We can extend this concept to a single index with multiple shards. In the figure above, imagine we have three shards that serve ONE index, /users.

In that situation, all documents for the /users index get stored in a distributed manner: roughly one third of the documents get indexed in each primary/replica shard pair. This is very different from a relational database, where ALL related records are stored in a single table. But the advantage is that each shard can search its slice of the index’s documents at the same time.
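
The number of primary shards (and replicas) is something you choose when the index is created; the primary count cannot easily be changed afterwards without reindexing. A minimal sketch of creating the three-shard /users index described above:

PUT /users
{
  "settings": {
    "number_of_shards":   3,
    "number_of_replicas": 1
  }
}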

For example, let’s say I want to search for users whose first name is Tim. The GET request goes to the Elasticsearch cluster, and the node that receives it knows that the /users index lives in shards P1, P2, and P3. Each of those shards searches through its own documents and returns its relevant hits. The hits get aggregated, and the final result set is returned to the requesting client. Compare that to a search in a relational database, where every row in the table must be scanned. In Elasticsearch, each shard is only responsible for one third of the documents, which makes searching much faster.


Search Made Easy

The essential quality of a good search engine is accuracy. This is easy to achieve with simple queries, but what happens when your user clicks on that dreaded button called “Advanced Search”? How do you find a user on a dating site who is male, from Colorado, at least 30 years old, likes oranges but not apples, and is available on weekends from 10 AM to 4 PM? (We’ll come back to this example below.) Luckily, Elasticsearch’s query DSL is incredibly powerful and simple to use. Just compare the two queries below, written in SQL and Elasticsearch.

SELECT *
FROM users
WHERE first_name = 'Tim'
OR last_name = 'Tim'

GET /users/_search
{
  "query": {
    "multi_match": {
      "query":  "Tim",
      "fields": [ "first_name^2", "last_name" ]
    }
  }
}

We’re matching multiple fields, first_name and last_name, and we want either field to contain the word “Tim”. The SQL query is quite clear, but the Elasticsearch query gives us a slightly more transparent language, which is crucial with more complex searches. We also get a glimpse into the power of Elasticsearch. What if we think people with “Tim” as a first_name are twice as relevant as people with “Tim” as a last_name? With a simple caret, we can say that the first_name field is twice as important as last_name. Try doing that with SQL!
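
And to come back to the “Advanced Search” question from earlier, the query DSL lets you compose clauses with a bool query. Here’s a rough sketch (the field names are made up for the dating-site example, and I’ve left the weekend-availability window out for brevity):

GET /users/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "gender": "male" } },
        { "match": { "state": "Colorado" } },
        { "match": { "likes": "oranges" } },
        { "range": { "age": { "gte": 30 } } }
      ],
      "must_not": [
        { "match": { "likes": "apples" } }
      ]
    }
  }
}

Each clause reads almost like the sentence describing the search, which is what keeps the DSL pleasant to work with once queries grow.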

In order to make those complex searches easier to digest, you can customize how certain attributes are indexed and analyzed. This increases the efficiency and speed of filtering documents. Within an index, your configurations are called mappings and they can be manipulated to yield powerful results.

For example, Elasticsearch has a built-in “english” analyzer that helps chop long strings into digestible chunks. Among other things, it stems words (“dogs” becomes “dog”) and removes pre-determined stop words, like “not”. There is also a “standard” analyzer that simply splits text on word boundaries and lowercases it, without stemming or removing stop words. To keep the example simple, the english analyzer would convert the following sentence into these discrete tokens:

I'm not happy about the foxes

Tokens: i'm, happi, about, fox
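
You can try this yourself with the _analyze API – a quick sketch (the exact request format varies a bit between Elasticsearch versions; older releases take the analyzer and text as query-string parameters rather than a JSON body):

GET /_analyze
{
  "analyzer": "english",
  "text":     "I'm not happy about the foxes"
}

The response lists the tokens shown above, along with their positions and offsets.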

This is great because it reduces the amount of data we need to store for each string. But if you’re looking for an exact match on “not happy”, then we have a problem: documents where people are “happy” and “not happy” will be scored equally in the search results. One elegant solution is to use multi-fields in Elasticsearch to store the field twice, once with each analyzer.

Below, we’re storing (PUTting) a mapping into “my_index”. We’re telling Elasticsearch that a blog document has an attribute called title, which is a string that should be analyzed with the english analyzer. Within the title field, we add a “fields” parameter that also stores the same string with the (default) standard analyzer. This extra sub-field can be accessed with a simple dot: “title.standard”.

PUT /my_index
{
  "mappings": {
    "blog": {
      "properties": {
        "title": { 
          "type":      "string",
          "analyzer":  "english",
          "fields": {
            "standard": { 
              "type":  "string"
            }
          }
        }
      }
    }
  }
}

Now we can simply do a multi_match query that searches both the english analyzed and standard analyzed fields.

GET /my_index/_search
{
  "query": {
    "multi_match": {
      "query":    "not happy foxes",
      "fields": [ "title", "title.standard" ]
    }
  }
}

To drive the point home, let’s consider another solution. We can tweak the mapping in the index so the sub-field is called “original”. This time, we mark it “not_analyzed”, meaning the original string is preserved exactly as is – “I’m not happy about the foxes”.

"fields": {
  "original": { 
    "type":  "string",
    "index": "not_analyzed"
  }
}

Now, in general we can search for stories with “happy fox” in the title. But if we want an advanced search to get an exact match, we can simply search in the “title.original” field and forget about the english analyzed field.

GET /my_index/_search
{
  "query": {
    "multi_match": {
      "query":    "I'm not happy about the foxes",
      "fields":   [ "title.original" ]
    }
  }
}

NOTE: For searching one field, we wouldn’t have to use the “multi_match” query, but I kept it in to make things easy to follow.
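
For reference, the single-field version is just a plain match query – a minimal sketch:

GET /my_index/_search
{
  "query": {
    "match": {
      "title.original": "I'm not happy about the foxes"
    }
  }
}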


Aggregations

Aggregations build upon the search query DSL so you can use your data for all sorts of analyses. With two simple concepts, metrics and buckets, you can perform complex analyses over data that has already been indexed. A metric is exactly what it sounds like: a measure computed over a set of documents, such as a count or an average. A bucket is a group of documents that share some criterion, such as a distinct field value. The power of aggregations comes from being able to nest buckets! And since the filtering mechanism works the same way as a search query, you can analyze tens of millions of documents very quickly. Let’s take a quick look.

Perhaps we have the test scores of every student in the state of Colorado. We can easily perform an aggregation that tells us the average grade (0–100) across all students. We query Elasticsearch with the “aggregations” keyword, name our aggregation “avg_grade_all_users”, and then simply ask for the average of the “grade” field.

{
  "aggregations" : {
    "avg_grade_all_users" : {
      "avg" : {
        "field" : "grade" 
      } 
    }
  }
}

Simple, right? In tandem with the search query DSL, you can start refining your analysis. Let’s say we want the average grade of all students over 16 years old. We simply prepend our aggregation with a “query” that asks for all documents where the student’s age is greater than (gt) 16, using a range query. That’s it! (The “size”: 0 below just tells Elasticsearch not to return the matching documents themselves, only the aggregation results.) The separation of the search query filter and the aggregation makes it easy to see exactly what is happening: first we restrict ourselves to students over 16 years old, then we take the average of their grades.

{
  "size": 0,
  "query": {
    "range": {
      "age": {
        "gt": 16
      }
    }
  },
  "aggregations" : {
    "avg_grade_over_sixteen" : {
      "avg" : {
        "field" : "grade" 
      } 
    }
  }
}

Let’s take it one step further by separating the test scores of students over 16 by gender. We take the pre-existing aggregation query and add a “filters” aggregation to tell Elasticsearch to split our parent bucket into two child buckets. While the query looks a bit longer, it’s not hard to follow what’s happening: a high-level aggregation splits documents into “boys” and “girls” buckets, and within each bucket a lower-level aggregation averages the grade for that bucket.

{
  "size": 0,
  "query": {
    "range": {
      "age": {
        "gt": 16
      }
    }
  },
  "aggregations" : {
    "avg_grade_over_sixteen" : {
      "filters": {
        "filters": {
          "boys": {
            "match": { "gender": "male" }
          },
          "girls": {
            "match": { "gender": "female" }
          }
        }
      },
      "aggregations": {
        "avg_grade": {
          "avg" : {
            "field" : "grade" 
          }
        } 
      }
    }
  }
}

The response looks something like this:

{
  ...,
  "aggregations" : {
    "avg_grade_over_sixteen" : {
      "buckets" : {
        "boys" : {
          "doc_count" : 146,
          "avg_grade" : {
            "value" : 83
          }
        },
        "girls" : {
          "doc_count" : 181,
          "avg_grade" : {
            "value" : 89
          }
        }
      }
    }
  }
}

It’s impossible to cover everything that Elasticsearch can do, but I hope that this gives you a glimpse into its utility. The greatest aspect of Elasticsearch is that for most people, you can just load it up and start using it right out of the box. It already determines how to distribute shards across nodes and will index documents dynamically, even if you don’t know anything about mappings. But once you dig into the search query DSL, you uncover the power of conducting complex searches in seconds and running aggregations to analyze data over millions of documents at once.


Testing RSpec model validations

I recently came across an issue while trying to write RSpec tests for model validations that I think is worth addressing. It’s worth testing validations because they may change over time and you want to make sure your test suite accounts for them. For example, I have character limits on messages, which is a critical model validation. If I were to accidentally delete the validation or change the character limit, there would be a lot of unexpected consequences.

The first validation I wanted to test was a character limit on messages that is only triggered if the message was sent by a user. To do so, I created a from_user? method that checks whether message.sender_type == 'user'. The model looks like this:

class Message < ActiveRecord::Base
  validates_presence_of :body, :status, :contact
  validates :body, length: { in: 1..420 }, if: :from_user?, on: :create

  def from_user?
    self.sender_type == 'user'
  end
end

At first, it seemed like I could easily use thoughtbot’s shoulda-matchers gem to write a simple test.

describe 'character limit for user-sent messages' do
  it { should validate_length_of(:body).is_at_least(1).is_at_most(420).on(:create) }
end

However, the test kept failing. After a little bit of digging, I realized it was because I never specified a subject. In particular, I never specified that my subject was a user-sent message. I found two solutions to this problem.

The first is to specify a subject with a FactoryGirl.build. It’s important that you use ‘build’ and not ‘create’, because the ‘create’ method will trigger all model validations, which is exactly what you’re trying to test. In addition, when you ‘build’ the model, make sure you leave out the attribute that you’re testing. In my case, I’m trying to test a body validation, so it wouldn’t make sense to do a FactoryGirl.build where I specify the message body.

The second solution is to write a ‘before’ hook that lets the test assume that the subject is from a user, which will trigger the validation. Either method works well, but I opted to use the second option to practice a new technique.

Option 1

describe 'character limit for user-sent messages' do
  subject { FactoryGirl.build :message, contact: contact, status: 'sent' }
  it { should validate_length_of(:body).is_at_least(1).is_at_most(420).on(:create) }
end

Option 2

describe 'character limit for user-sent messages' do
  before { allow(subject).to receive(:from_user?).and_return(true) }
  it { should validate_length_of(:body).is_at_least(1).is_at_most(420).on(:create) }
end
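
For completeness, both options assume a message factory is defined somewhere in the test suite. It isn’t part of the original snippets, but a hypothetical version might look like this (attribute values are just illustrative defaults):

FactoryGirl.define do
  factory :message do
    status      'sent'
    sender_type 'user'   # makes from_user? return true, so the length validation applies
    association :contact
  end
end

Note that body is deliberately left out, per the advice above about not pre-filling the attribute you’re testing.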

Thoughts as a Developer Advocate

It has been almost a year since I graduated from DaVinci Coders, where I learned about Ruby and Rails. Since then, I have been a TA, worked as a consultant, and ultimately landed a job at TextUs. It hasn’t been easy, but it has been fun, exciting, and rewarding. Part of the reason I decided to pursue this career path was because of my brother, Nate, who encouraged me to take advantage of my technical savvy. Here are his thoughts on graduating from DevBootcamp in San Francisco, and how going to a bootcamp doesn’t necessarily mean you have to become a developer.

Nathan M Park – A Letter to bootcamp grads, 6 months as a Developer Advocate

Keeping track of Rails transactions

Ever heard of transaction_include_any_action?([:action])? You may never need to use this specialized part of ActiveRecord, but I thought I would shed some light on it anyway.

Recently, I was writing a module that helped incorporate several models into Elasticsearch. Concerns::Searchable, as I named it, helps create mappings for models, provides a base method for specialized searches, and also has an after_commit hook to index records. I first looked into ActiveRecord::Transactions to customize when a record is indexed into Elasticsearch.

module Concerns
  module Searchable
    extend ActiveSupport::Concern

    included do
      include Elasticsearch::Model
      after_commit :update_elasticsearch_index

      ...

      def update_elasticsearch_index
        if transaction_include_any_action?([:destroy]) || (transaction_include_any_action?([:update]) && self.archived?)
          ElasticsearchIndexWorker.perform_async(:delete, self.id)
        else
          ElasticsearchIndexWorker.perform_async(:index, self.id)
        end
      end
    end
  end
end

What I wanted was for Elasticsearch to delete records that were deleted in Postgres OR “archived”. That way, when a user goes to search for a record, they won’t accidentally get results from “archived” records that are being kept in Postgres for other purposes. Using ActiveRecord::Transactions helped me customize the logic for indexing after_commit.
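
For context, each searchable model simply includes the concern. A minimal sketch (Article here is only an illustration – any model you want indexed works the same way):

class Article < ActiveRecord::Base
  # Pulls in Elasticsearch::Model and the after_commit indexing hook above
  include Concerns::Searchable
end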


NOTE: One small hiccup you may encounter is that the argument passed to the method has to be an array.

This is okay:

transaction_include_any_action?([:destroy, :update])

This is not:

transaction_include_any_action?(:destroy)

Adding foreign key constraints in Rails

Here’s a great article about adding foreign key constraints to handle cascading deletion of records. It’s a good idea to enforce referential integrity at the database level because Rails doesn’t always fire an object’s callbacks upon deletion (delete_all and direct database deletes skip them, for example). So if you rely on a dependent: :destroy callback, it might get skipped in some situations.

https://robots.thoughtbot.com/referential-integrity-with-foreign-keys
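
As a rough sketch (assuming Rails 4.2+ and a comments table that belongs to articles – the names here are just for illustration), a migration that enforces cascading deletes at the database level might look like this:

class AddForeignKeyToComments < ActiveRecord::Migration
  def change
    # Delete a comment automatically when its parent article row is deleted
    add_foreign_key :comments, :articles, on_delete: :cascade
  end
end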

Indexing with the Elasticsearch Bulk API

If you’ve ever worked with the elasticsearch-rails gem, then you’ve likely used the import method to populate your local Elasticsearch indices. This is great for development because it lets you quickly index documents and work on the important stuff, like refining search queries to your desired specification.

However, this technique doesn’t hold up when you try to index a large number of documents in a production or staging environment. You should always work with Elasticsearch in a staging environment first to see if your search queries behave as expected with several thousand documents before pushing to production. Similarly, you want to have a plan for indexing several million documents quickly.

The best strategy is to use Elasticsearch’s built-in Bulk API, which enables you to index tens of thousands of documents in one request! I elected to use a Sidekiq worker to handle the creation of a bulk request of 1,000 documents. This way, I can iterate over an entire table and batch my indexing into 1,000-document requests.

class ElasticsearchBulkIndexWorker
  include Sidekiq::Worker
  sidekiq_options retry: 5, queue: 'low'

  # Memoized client pointed at the configured Elasticsearch host
  def client
    @client ||= Elasticsearch::Client.new host: CONFIG['ELASTICSEARCH_URL']
  end

  def perform(model, starting_index)
    klass = model.capitalize.constantize
    batch_for_bulk = []
    klass.where(id: starting_index..(starting_index+999)).each do |record|
      batch_for_bulk.push({ index: { _id: record.id, data: record.as_indexed_json } }) unless record.try(:archived)
    end
    klass.__elasticsearch__.client.bulk(
      index: "#{model.pluralize}_v1",
      type: model,
      body: batch_for_bulk
    )
  end
end
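
One caveat worth flagging – and not handled in the worker above – is that the Bulk API does not raise when individual documents fail to index; it reports failures per item in its response. A hedged sketch of checking for that at the end of perform:

    response = klass.__elasticsearch__.client.bulk(
      index: "#{model.pluralize}_v1",
      type:  model,
      body:  batch_for_bulk
    )
    # The bulk response sets "errors" to true if any item failed
    if response['errors']
      failed = response['items'].select { |item| item['index'] && item['index']['error'] }
      Rails.logger.error "Bulk indexing failed for #{failed.size} #{model} documents"
    end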

In order to iterate over an entire table, I simply created a rake task to handle counting and feeding in numbers to the worker. Notice that the ElasticsearchBulkIndexWorker takes a starting_index argument to handle batching. Also, it takes a string, which is the name of the model used to find the appropriate records and match them with the similarly-named index.

namespace :elasticsearch do
  task build_article_index: :environment do
    (1..Article.last.id).step(1000).each do |starting_index|
      ElasticsearchBulkIndexWorker.perform_async('article', starting_index)
    end
  end

  task build_comment_index: :environment do
    (1..Comment.last.id).step(1000).each do |starting_index|
      ElasticsearchBulkIndexWorker.perform_async('comment', starting_index)
    end
  end
end

This should help you index a high number of documents quickly. For more information, check out the Elasticsearch Bulk API documentation to figure out how to improve write consistency during your bulk indexing. Good luck!

GlueCon 2016

The past two days, I have been attending Glue Conference in Broomfield, CO. As they say, “Glue is a case study of what a conference is supposed to be.” As a new software developer, this was the first tech conference I’ve ever attended, so I can’t gauge whether it was better or worse than average. That being said, I can say that I really enjoyed myself and learned A TON. Here are some of my reflections on the past 48 hours.


Pros
  1. You learn a LOT about emerging technologies and best practices for design and development. At my work, we have an API that our customers can interact with to integrate with their own software. We use Swagger to document our API, which works pretty well for our needs. One of the seminar tracks offered at GlueCon was all about APIs: design, implementation, maintenance, and deployment. What I took for granted at work was suddenly demystified; I could finally understand WHY our API was designed the way it is. Hearing people talk about big ideas helps you go beyond the HOW of developing.
  2. You get to have fun! I came into the conference with a very serious attitude. I told myself, I would be professional, network, and learn as much as I could. But everyone else seemed to have a different agenda. Obviously, people are there to learn and network, but they are also there to catch up with friends, make new ones, and just talk about what they’re interested in. I had a great time meeting tons of people, exchanging information, and just being able to share my excitement about being an up and coming developer. As an added bonus, you get a ton of free swag at these events – STICKERS GALORE!
  3. The keynote speakers are excellent. While some of the breakaway optional lectures were not the most exciting, all of the keynote speakers were quite good. While I wasn’t always able to follow them exactly (like the one about data encryption and authorization), I did get a sense of the scope of their ideas and the issues they were trying to address. More importantly, I learned how I fit in this grand ecosystem of development. It’s exciting when you begin to understand the role you play as a developer.
  4. You get to ask questions. Go to the booths and talk to people. Usually they want you to know all about their product, and if you don’t know what they’re talking about, they are willing to break it down and explain it to you. Before yesterday, I really had no idea what proxies were, but a gentleman at linkerd was willing to help me understand how their product fit in side-by-side with any existing app. Because there are so many people with so many specialties, you have the opportunity to be brave, ask questions, and learn so much from experts.

Cons
  1. You will feel like you know nothing. As a new developer, I was overwhelmed by the many facets of development that I still don’t understand. Most of the attendees were much older than me and had been in the tech industry for multiple decades. They would make jokes about the old days when they had to fight fires during deployments, while all I really know is how to use Heroku to deploy my apps. That being said, it motivated me to work harder and try to be a more complete developer. I don’t necessarily need to be an expert in everything, but understanding my place will make me a stronger coder in the long run.
  2. Some people are not friendly. Just a couple of times, I mentioned that I was a Ruby on Rails developer and the person I was talking to straight up scoffed at me. Another time, someone tried to make a joke about how they felt sorry for my struggles. Look, I understand that maybe you don’t like Ruby or you might have opinions about my language of choice, but that doesn’t mean you have to be dismissive. I found the best way to handle this behavior was to just smile, carry on, and leave as quickly and politely as possible. I’d rather spend my time talking to people who are willing to help me grow as a coder rather than patronize my career path.
  3. Some presentations are not good. There’s no getting around attending a few sub-par presentations. Not everyone can be a rock-star public speaker/developer/comedian/motivational speaker/coder/graphic designer. There was one talk I attended that was difficult to follow, and I tried my best to stay engaged and learn something. But I learned nothing because the speaker was too quiet and the content wasn’t presented well. Since conferences are very long and exhausting, my advice would be to leave the room, take a walk, get some fresh air, and recharge for the next few hours.

Advice
  1. Go register for a conference! They are AWESOME.
  2. Look for scholarships online or ask your company for a subsidy. Maybe your boss is super nice and will support you if you’re going for a specific reason.
  3. Email people you want to meet beforehand. Personally, I got a lot out of GlueCon specifically because I was able to meet with Lee Hinman from Elasticsearch, who gave me some great advice on implementing Elasticsearch in our Rails app at work.
  4. Go to the keynote addresses. You may not understand everything the speakers are talking about, but you’ll definitely learn something.
  5. Take breaks. Go to the vendors and get some swag or go for a walk. It’s hard to sit through 6 straight hours of lectures.

Personally, I think GlueCon is amazing and I am definitely going to return next year. Hope to see you there!


[Photo: People start to trickle in and the exhibitors get fired up]

[Photo: Joe Beda talks about authorization in the modern world]