Instructions for deploying an Elasticsearch Cluster with Titan

ElasticSearch

Elasticsearch is an open source, distributed, real-time search engine for the cloud. It allows you to deploy a scalable, auto-discovered cluster of nodes; as search demand grows, you simply need to add more nodes and the cluster will reorganize itself. Titan, a distributed graph engine by Aurelius, supports Elasticsearch as an option for indexing your vertices for fast lookup and retrieval. By default, Titan supports Elasticsearch running in the same JVM and storing data locally on the client, which is fine for embedded mode. However, once your Titan cluster starts growing, you have to respond by growing an Elasticsearch cluster side by side with the graph engine.

This tutorial shows how to quickly get an Elasticsearch cluster up and running on EC2 and then configure Titan to use it for indexing. It assumes you already have an EC2/Titan cluster deployed. Note that these instructions were written for a particular deployment, so please post any questions about specifics in the comments!

Step 1: Installation

NOTE: These instructions assume you’ve installed Java 6 or later.

By far the best way to install Elasticsearch on an Ubuntu EC2 instance is the Debian package that is provided as a download. This package installs an init.d script, places the configuration files in /etc/elasticsearch, and generally creates goodness that we don’t have to deal with. You can find the .deb on the Elasticsearch download page.

$ cd /tmp
$ wget https://download.elasticsearch.org/elasticsearch/elasticsearch/elasticsearch-0.90.7.deb
$ wget https://download.elasticsearch.org/elasticsearch/elasticsearch/elasticsearch-0.90.7.deb.sha1.txt
$ sha1sum elasticsearch-0.90.7.deb && cat elasticsearch-0.90.7.deb.sha1.txt 
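Comparing the two hashes by eye is easy to get wrong, so the comparison can also be scripted. The following is only a minimal sketch: the function name verify_sha1 is hypothetical, and it assumes the .sha1.txt file holds the hex digest as its first whitespace-separated field, which may not match the exact layout of the published file.

```shell
# Hypothetical helper: succeed only if FILE's SHA-1 digest matches the
# digest stored as the first field of SUMFILE (an assumed file layout).
verify_sha1() {
    file="$1"
    sumfile="$2"
    # First non-empty line, first field, lowercased to match sha1sum output.
    expected=$(awk 'NF { print $1; exit }' "$sumfile" | tr 'A-F' 'a-f')
    actual=$(sha1sum "$file" | awk '{ print $1 }')
    [ -n "$expected" ] && [ "$expected" = "$actual" ]
}

# Usage (not run here):
#   verify_sha1 elasticsearch-0.90.7.deb elasticsearch-0.90.7.deb.sha1.txt \
#       && echo "checksum OK"
```

Because the function returns a non-zero status on a mismatch or an empty checksum file, it can gate the install step in a provisioning script.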

Note that you may have to pass wget the --no-check-certificate flag, or you could use curl instead; just make sure that you use the correct filenames. Also ensure that the checksums match (and, to be even more paranoid, check them against the Elasticsearch website). Installation is simple:

$ sudo dpkg -i elasticsearch-0.90.7.deb

Elasticsearch will now be running on your machine with the default configuration. To check this you can do the following:

$ sudo service elasticsearch status
$ ps -ef | grep elasticsearch

But while we configure it, it doesn’t really need to be running:

$ sudo service elasticsearch stop

In particular, installing the package does the following things you should be aware of:

  1. Creates the elasticsearch:elasticsearch user and group
  2. Installs the library into /usr/share/elasticsearch
  3. Creates the logging directory at /var/log/elasticsearch
  4. Creates the configuration directory at /etc/elasticsearch
  5. Creates a data directory at /var/lib/elasticsearch
  6. Creates a temp work directory at /tmp/elasticsearch
  7. Creates an init script at /etc/init.d/elasticsearch
  8. Creates a defaults file for the init script at /etc/default/elasticsearch
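
If you want to confirm that the package left these pieces in place, a quick check along the following lines may help. This is a sketch rather than anything the package provides: check_es_layout is a hypothetical name, the optional root prefix is our own addition (handy for chroots or dry runs), and the temp work directory is deliberately skipped since it may not exist until the service has run.

```shell
# Hypothetical helper: report which of the paths from the list above are
# missing. ROOT is an optional prefix (an assumption for chroots/testing);
# leave it empty to check the live filesystem.
check_es_layout() {
    root="${1:-}"
    missing=0
    for p in /usr/share/elasticsearch /var/log/elasticsearch \
             /etc/elasticsearch /var/lib/elasticsearch \
             /etc/init.d/elasticsearch /etc/default/elasticsearch; do
        if [ ! -e "$root$p" ]; then
            echo "missing: $p"
            missing=$((missing + 1))
        fi
    done
    # Succeed only when nothing was missing.
    [ "$missing" -eq 0 ]
}

# Usage (not run here):
#   check_es_layout && echo "package layout looks complete"
```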

Because of our particular Titan deployment, this is not good enough for what we’re trying to accomplish, so the next step is configuration.

Step 2: Configuration

The configuration we’re looking for is an auto-discovered EC2 elastic cluster that is bound to the default ports, and works with data on the attached volume rather than on the much smaller root disk. In order to autodiscover on EC2 we have to install an AWS plugin, which can be found on the cloud aws plugin GitHub page:

$ cd /usr/share/elasticsearch
$ bin/plugin -install elasticsearch/elasticsearch-cloud-aws/1.15.0

Elasticsearch is configured via a YAML file at /etc/elasticsearch/elasticsearch.yml, so open it in your editor and add the settings we used, shown below:

path:
    conf: /etc/elasticsearch
    data: /raid0/elasticsearch
    work: /raid0/tmp/elasticsearch
    logs: /var/log/elasticsearch
cluster:
    name: DC2
cloud:
    aws:
        access_key: ${AWS_ACCESS_KEY_ID}
        secret_key: ${AWS_SECRET_ACCESS_KEY}
discovery:
    type: ec2

For us, the other defaults worked just fine. So let’s go through this a bit. First off, make sure that all of the paths exist and that they have the correct ownership and permissions. The raid0 folder is where we have mounted an EBS volume that contains enough non-ephemeral storage for our data services. Although this does add some network overhead, it prevents data loss when the instance is terminated. However, if you’re not working with EBS, using the root directory defaults is probably fine, and if you’ve mounted the volume in a different location, adjust the paths to match.

$ sudo mkdir /raid0/elasticsearch
$ sudo chown elasticsearch:elasticsearch /raid0/elasticsearch
$ sudo chmod 775 /raid0/elasticsearch
$ sudo mkdir -p /raid0/tmp/elasticsearch
$ sudo chmod 777 /raid0/tmp
$ sudo chown elasticsearch:elasticsearch /raid0/tmp/elasticsearch
$ sudo chmod 775 /raid0/tmp/elasticsearch
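
If you are provisioning several nodes, the directory setup above can be wrapped in a small function that is safe to run more than once. This is only a sketch: ensure_dir is a hypothetical name, and the owner argument is optional so the function can be tried without root.

```shell
# Hypothetical helper: create DIR (and any parents), set MODE on it, and
# optionally hand it to OWNER (user:group). Run as root when OWNER is given.
ensure_dir() {
    dir="$1"
    mode="$2"
    owner="${3:-}"
    mkdir -p "$dir" || return 1
    if [ -n "$owner" ]; then
        chown "$owner" "$dir" || return 1
    fi
    chmod "$mode" "$dir"
}

# Usage (not run here), mirroring the commands above:
#   ensure_dir /raid0/elasticsearch 775 elasticsearch:elasticsearch
#   ensure_dir /raid0/tmp 777
#   ensure_dir /raid0/tmp/elasticsearch 775 elasticsearch:elasticsearch
```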

Editor’s Note: I just discovered that you can actually set these options with the dpkg command so that you don’t have to do it manually. See the elasticsearch as a service on linux guide for more.

The cluster name, in our case DC2, needs to be the same for every node in the cluster; this is also vital for EC2 discovery. Keeping the default name, elasticsearch, could make discovery more difficult. Also note that each node can be named separately, but by default the name is selected randomly from a list of 3000 or so Marvel characters. The cloud and discovery options allow discovery through EC2.
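
The ${AWS_ACCESS_KEY_ID} and ${AWS_SECRET_ACCESS_KEY} placeholders in the YAML above are presumably resolved from the environment of the Elasticsearch process. That is an assumption on our part, and how the service’s environment gets populated depends on your init setup. Under that assumption, a small guard like the following sketch (require_env is a hypothetical name) can catch a missing variable before the node starts without credentials.

```shell
# Hypothetical helper: fail if any named environment variable is unset
# or empty, printing the offending names to stderr.
require_env() {
    rc=0
    for name in "$@"; do
        # printenv only sees exported variables; "|| true" keeps an unset
        # variable from aborting scripts that run under "set -e".
        value=$(printenv "$name" || true)
        if [ -z "$value" ]; then
            echo "not set: $name" >&2
            rc=1
        fi
    done
    return "$rc"
}

# Usage (not run here):
#   require_env AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY || exit 1
```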

You should now be able to run the cluster:

$ sudo service elasticsearch start

Check the logs to make sure there are no errors, and that the cluster is running. If so, you should be able to navigate to the following URL:

http://localhost:9200/_cluster/health?pretty=true

By replacing localhost with the hostname, you can see the status of the cluster, as well as the number of nodes. But wait, why are no more nodes being added? Don’t keep waiting! The reason is that Titan has probably already been configured to use a local Elasticsearch, which is occupying port 9300, the communication and control port for the ES cluster.
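
Rather than reloading the health URL in a browser, the node count can be pulled out of the response from the command line. The sketch below assumes the response contains a "number_of_nodes" field, as the cluster health output does; nodes_from_health is a hypothetical name, and the sed-based extraction is a deliberately simple stand-in for a real JSON parser.

```shell
# Hypothetical helper: read a _cluster/health JSON response on stdin and
# print the value of its "number_of_nodes" field. The quoted key keeps it
# from matching "number_of_data_nodes".
nodes_from_health() {
    sed -n 's/.*"number_of_nodes"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p' \
        | head -n 1
}

# Usage (not run here):
#   curl -s 'http://localhost:9200/_cluster/health?pretty=true' | nodes_from_health
```

Running this on each node after a restart gives a quick way to see whether discovery has actually picked up the other members.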

Step 3: Configuring Titan

Titan is blocking the Elasticsearch cluster with its own local Elasticsearch, and anyway, we want Titan to use the Elasticsearch cluster! Let’s reconfigure Titan. First, open up your favorite editor and change the configuration in /opt/titan/config/yourgraph.properties to the following:

storage.backend=cassandra
storage.hostname=${LOCAL_IPADDR}

storage.index.search.backend=elasticsearch
storage.index.search.client-only=true
storage.index.search.hostname=${ES_ADDR},${ES_ADDR},${ES_ADDR}

Hopefully you don’t have to change the storage.backend and storage.hostname configurations. Remove the storage.index.search.local-mode configuration as well as the storage.index.search.directory configuration, and add the storage.index.search settings shown above.

For storage.index.search.hostname, add a comma-separated list of every node in the ES cluster (for now).
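
With more than a handful of nodes, typing that list by hand gets tedious. As a small sketch, the property line can be generated from a list of addresses; es_hostname_property is a hypothetical name and the addresses in the usage comment are made up for illustration.

```shell
# Hypothetical helper: print the Titan index hostname property for the
# given Elasticsearch node addresses, joined with commas.
es_hostname_property() {
    # "$*" joins the arguments with the first character of IFS; the
    # subshell keeps the IFS change from leaking into the caller.
    ( IFS=,; printf 'storage.index.search.hostname=%s\n' "$*" )
}

# Usage (not run here; example addresses only):
#   es_hostname_property 10.0.0.1 10.0.0.2 10.0.0.3
```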

That’s it! Reload Titan, and you should soon see the cluster grow to include all the nodes you configured, as well as a speed up in queries to the Titan graph!

Benjamin Bengfort

Chief Data Scientist at Cobrain
Benjamin is a data scientist with a passion for massive machine learning involving gigantic natural language corpora, and has been leveraging that passion to develop a keen understanding of recommendation algorithms at Cobrain in Bethesda, MD where he serves as the Chief Data Scientist. With a professional background in military and intelligence, and an academic background in economics and computer science, he brings a unique set of skills and insights to his work. Ben believes that data is a currency that can pave the way to discovering insights and solve complex problems. He is also currently pursuing a PhD in Computer Science at the University of Maryland.
