Setting up a Titan Cluster on Cassandra and ElasticSearch on AWS EC2

Guest Post by Jenny Kim

The purpose of this post is to provide a walkthrough of a Titan cluster setup and to highlight some key gotchas I’ve learned along the way. This walkthrough will utilize the following versions of each software package:

Versions

  • Titan 0.4.1 (server distribution)
  • Cassandra, installed via the DataStax Auto-Clustering Community AMI
  • ElasticSearch, deployed on the same nodes (see the Deploying an ElasticSearch Cluster post)

The cluster in this walkthrough will utilize 2 M1.Large instances, which mirrors our current Staging cluster setup. A typical production graph cluster utilizes 4 M1.XLarge instances.

NOTE: While the Datastax Community AMI requires, at minimum, M1.Large instances, the exact instance type and cluster size should depend on your expected graph size, concurrent request volume, and replication and consistency needs.

Part 1: Setup a Cassandra cluster

I followed Titan’s EC2 instructions for standing up Titan on a Cassandra cluster using the Datastax Auto-Clustering AMI:

Step 1: Setting up Security Group

Navigate to the EC2 Console Dashboard, then click on Security Groups under Network & Security.

Create a new security group. Click Inbound. Set the “Create a new rule” dropdown menu to “Custom TCP rule”.

Add a rule for port 22 from source 0.0.0.0/0.

Add a rule for ports 1024-65535 from the security group members. If you don’t want to open all unprivileged ports among security group members, then at least open 7000 (inter-node gossip), 7199 (JMX), and 9160 (Thrift clients) among security group members; a command-line sketch follows below.

Tip: the “Source” dropdown will autocomplete security group identifiers once “sg” is typed in the box, so you needn’t have the exact value ready beforehand.
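
If you prefer the command line, the same rules can be created with the AWS CLI. This is a minimal sketch, assuming the CLI is configured; the group name titan-cluster is hypothetical:

# Create the group
aws ec2 create-security-group --group-name titan-cluster --description "Titan cluster"

# SSH from anywhere
aws ec2 authorize-security-group-ingress --group-name titan-cluster --protocol tcp --port 22 --cidr 0.0.0.0/0

# Unprivileged ports among members of the group itself
aws ec2 authorize-security-group-ingress --group-name titan-cluster --protocol tcp --port 1024-65535 --source-group titan-cluster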

Step 2: Launch DataStax Cassandra AMI

Launch the DataStax AMI in your desired zone

On the Instance Details page of the Request Instances Wizard, set “Number of Instances” to your desired number of Cassandra nodes (e.g., 2).

Set “Instance Type” to at least m1.large.

On the Advanced Instance Options page of the Request Instances Wizard, select the “as text” radio button under “User Data”, then enter the following in the text box:

--clustername [cassandra-cluster-name]
--totalnodes [number-of-instances]
--version community 
--opscenter no

[number-of-instances] in this configuration must match the number of EC2 instances configured on the previous wizard page (e.g., 2). [cassandra-cluster-name] can be any string used for identification. For example:

--clustername titan-staging
--totalnodes 2
--version community 
--opscenter no

On the Tags page of the Request Instances Wizard you can apply any desired tags. These tags exist only at the EC2 administrative level and have no effect on the Cassandra daemons’ configuration or operation.

  • It is useful to set a tag here that ElasticSearch can use to discover this node when identifying its cluster members. We will revisit this tag in the ElasticSearch section.

On the Create Key Pair page of the Request Instances Wizard, either select an existing key pair or create a new one. The PEM file containing the private half of the selected key pair will be required to connect to these instances.

On the Configure Firewall page of the Request Instances Wizard, select the security group created earlier.

Review and launch instances on the final wizard page. The AMI will take a few minutes to load.

Step 3: Verify Successful Instance Launch

SSH into any Cassandra instance node:

ssh -i [your-private-key].pem ubuntu@[public-dns-name-of-any-cassandra-instance]

Run the Cassandra nodetool utility to inspect the state of the Cassandra token ring:

nodetool -h 127.0.0.1 ring

  • You should see as many nodes in this command’s output as instances launched in the previous steps. Status should say UP for all rows.
  • Note that the AMI takes a few minutes to configure each instance. A shell prompt will appear upon successful configuration when you SSH into the instance.

If, upon shelling in, Cassandra still appears to be loading, press Ctrl-C to quit and restart Cassandra with:

sudo service cassandra restart
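
If you want to watch the bootstrap progress while waiting, you can tail the Cassandra system log (this path assumes the DataStax AMI’s default log location):

tail -f /var/log/cassandra/system.log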

Part 2: Install Titan

Titan can be embedded within each Cassandra node instance or installed remotely from the cluster. I installed Titan on each Cassandra instance, but do not run it in embedded mode.

Step 1: Download Titan

SSH into a Cassandra instance node and, within the ubuntu home directory, download the Titan 0.4.1 server distribution ZIP:

wget http://s3.thinkaurelius.com/downloads/titan/titan-server-0.4.1.zip

Unzip the Titan distribution and move it to /opt/:

unzip titan-server-0.4.1.zip
sudo mv titan-server-0.4.1 /opt/

cd to /opt/ and create a symlink from /opt/titan to /opt/titan-server-0.4.1:

cd /opt/
sudo ln -s titan-server-0.4.1 titan

Step 2: Configure Titan

We need to create a specific Titan configuration file that can be used when we run the Gremlin shell. This configuration will include our storage settings, cache settings, and search index settings.

Create a new properties file (e.g., mygraph.properties) within the /opt/titan/conf folder:

vim /opt/titan/conf/mygraph.properties

The storage settings should specify the backend as Cassandra and include the private IP of one of the Cassandra nodes. Additional Cassandra-specific configuration options are listed here: https://github.com/thinkaurelius/titan/wiki/Using-Cassandra#cassandra-specific-configuration

storage.backend=cassandra
storage.hostname=172.12.191.2
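
Note that storage.hostname also accepts a comma-separated list, which gives the client more than one node to contact at startup (the second IP here is illustrative):

storage.hostname=172.12.191.2,172.12.191.3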

The database cache settings should be enabled in a production environment. Full documentation is found here: https://github.com/thinkaurelius/titan/wiki/Database-Cache. For our purposes, we will just enable the db-cache and set the clean-wait time (milliseconds to wait before cleaning the cache), the cache time (maximum milliseconds to hold items in the cache), and the cache size (fraction of the total heap space available to the JVM that Titan runs in).

cache.db-cache = true
cache.db-cache-clean-wait = 50
cache.db-cache-time = 10000
cache.db-cache-size = 0.25

The search index settings should specify “elasticsearch” as the external search backend, and configure it for the remote ElasticSearch cluster.

  • First you must set up an ElasticSearch cluster, which we have done using the same nodes as the Cassandra cluster.
  • Refer to the Deploying an ElasticSearch Cluster post for instructions on how to do this.
  • Make note of the cluster name and host IPs for all nodes in the ES cluster.

Based on the above ES cluster settings, add the following to your properties file, replacing the hostname and cluster-name placeholders with your specific settings:

storage.index.search.backend=elasticsearch
storage.index.search.hostname=[comma-separated-es-node-ips]
storage.index.search.cluster-name=[es-cluster-name]
storage.index.search.index-name=titan
storage.index.search.client-only=true
storage.index.search.sniff=false
storage.index.search.local-mode=false
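
For example, using the host aliases and ES cluster name that appear again in the Rexster configuration later in this post:

storage.index.search.hostname=secondaryOne,secondaryTwo
storage.index.search.cluster-name=fife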

Save the file, and test in Gremlin:

bin/gremlin.sh
gremlin> g = TitanFactory.open('conf/mygraph.properties')

You should see Gremlin connect to the Cassandra cluster and return you to an empty Gremlin prompt. Success! Keep Gremlin open for the next step.
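
As a quick, read-only sanity check that writes no data (important, because indices must be created before any properties are written), count the vertices of the still-empty graph:

gremlin> g.V.count()

This should return 0 on a freshly created graph.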

Step 3: Run Indices

Now, before we add any data to our graph, we need to do a one-time setup of our Titan property and label indices. This must be done with caution because, in Titan, indexes cannot be modified, dropped, or added on existing properties (see Titan Limitations).

Note that we created a script for our indices and tracked it in GitHub so we could quickly adapt our indices when updating and reloading a new graph. Also keep in mind that Titan 0.4.1 has a new index syntax that is not backwards compatible with the old Titan 0.3.2 syntax.
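
For illustration, a minimal indices script in the 0.4.1 syntax might look like the following. The property key and edge label names are hypothetical, and 'search' refers to the ElasticSearch index configured in mygraph.properties:

(indices.groovy)
// Property key indexed in both the standard index and ElasticSearch
g.makeKey('name').dataType(String.class).indexed(Vertex.class).indexed('search', Vertex.class).make()
// A plain edge label
g.makeLabel('related').make()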

In the Gremlin shell, copy and paste the indices script. If all indices run successfully, commit, shut down, and exit:

gremlin> g.commit()
gremlin> g.shutdown()
gremlin> exit

(Optional) Part 3: Load GraphSON

If you are doing a bulk load of GraphSON into Titan, you can do so via Faunus or Gremlin. The GraphSON format for each method is unique, so you will need to ensure that your GraphSON format adheres to the expected rules. This walkthrough will focus on the Gremlin GraphSON load.

Save the GraphSON file (e.g., gremlin.graphson.json) to the root of the Titan directory.

Edit the bin/gremlin.sh script to raise the maximum JVM heap size to 4GB (ensure the machine has enough memory), which avoids “GC overhead limit exceeded” errors:

(bin/gremlin.sh)
(Line 25)  JAVA_OPTIONS="-Xms32m -Xmx4096m"

Create a new groovy script to load the GraphSON and auto-commit:

(loader.groovy)
g = TitanFactory.open('conf/mygraph.properties')
g.loadGraphSON('gremlin.graphson.json')
g.commit()

Run the Groovy script through Gremlin:

bin/gremlin.sh -e loader.groovy

This will take a while… a 500 MB GraphSON file generally takes about 1.5 hours to finish loading (assuming no errors).

Part 4: Configure Rexster

Titan 0.4.1 Server now ships with Rexster Server, so you can run a Titan-configured Rexster server by running the rexster.sh script from Titan’s bin directory.

Step 1: Rexster Configuration

To run Rexster, you will need to create a Rexster configuration XML, which Rexster by default expects to find at $TITAN_HOME/rexhome/config/rexster.xml (i.e., /opt/titan/rexhome/config/rexster.xml).

Under /opt/titan/rexhome, create a config directory:

mkdir /opt/titan/rexhome/config

Create a rexster.xml document under the config directory. Alternatively, copy the /opt/titan/conf/rexster-cassandra-es.xml file that ships with Titan to /opt/titan/rexhome/config/rexster.xml.

The Rexster configuration needs a few changes to properly connect to the Cassandra cluster and ElasticSearch cluster. Here is an example configuration: https://gist.github.com/spaztic1215/7e4303b75184098e64fc

Update the base-uri property on Line 5 to the current instance’s DNS:

<base-uri>http://ec2-54-193-46-179.us-west-1.compute.amazonaws.com</base-uri>

Update the graph properties to connect to Cassandra and ES. Here, secondaryOne and secondaryTwo are the hostnames of our two cluster nodes and fife is our ES cluster name; the property elements mirror the Titan settings used in mygraph.properties (see the gist above for the full file):

<graph>
    <graph-name>graph</graph-name>
    <graph-type>com.thinkaurelius.titan.tinkerpop.rexster.TitanGraphConfiguration</graph-type>
    <graph-read-only>false</graph-read-only>
    <properties>
        <storage.backend>cassandra</storage.backend>
        <storage.hostname>secondaryOne</storage.hostname>
        <cache.db-cache-clean-wait>100</cache.db-cache-clean-wait>
        <storage.keyspace>titan</storage.keyspace>
        <cache.db-cache>true</cache.db-cache>
        <cache.db-cache-size>0.3</cache.db-cache-size>
        <storage.index.search.backend>elasticsearch</storage.index.search.backend>
        <storage.index.search.local-mode>false</storage.index.search.local-mode>
        <storage.index.search.hostname>secondaryOne,secondaryTwo</storage.index.search.hostname>
        <storage.index.search.cluster-name>fife</storage.index.search.cluster-name>
        <storage.index.search.index-name>titan</storage.index.search.index-name>
        <storage.index.search.client-only>false</storage.index.search.client-only>
        <storage.index.search.sniff>false</storage.index.search.sniff>
    </properties>
    <extensions>
        <allows>
            <allow>tp:gremlin</allow>
        </allows>
    </extensions>
</graph>

Step 2: Test Rexster

Now we should test Rexster Server to make sure it is connecting properly to Titan and ES:

Change directory to $TITAN_HOME (i.e., /opt/titan) and start Rexster manually:

bin/rexster.sh -s

You should see a series of Rexster console messages indicating that it has connected to the Cassandra cluster and ES cluster. You can verify that Rexster is running by pointing a browser at:

http://[public-dns-name]:8182

Verify that the Titan graph is found by Rexster:

http://[public-dns-name]:8182/graphs/graph/
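
The same checks work from a shell with curl; both endpoints return JSON (the hostname is a placeholder):

curl http://[public-dns-name]:8182/graphs
curl http://[public-dns-name]:8182/graphs/graph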

Step 3: Create a Rexster Upstart Script

Once you’ve confirmed that Rexster can successfully start, we will create an upstart script to manage the Rexster start/stop process.

Under /etc/init, create a configuration called rexster.conf.

A sample rexster upstart configuration can be found here: https://gist.github.com/spaztic1215/5bfc2ee2d370b933c8ca
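
As a rough sketch of what such a configuration contains (the gist above is the authoritative version; the runlevels shown here are assumptions):

(/etc/init/rexster.conf)
description "Rexster Server"

start on runlevel [2345]
stop on runlevel [016]

chdir /opt/titan
exec bin/rexster.sh -s >> /raid0/log/rexster/rexster.log 2>&1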

Note that the above configuration assumes you are using the Datastax AMI, which includes a raid0 directory, and thus logs to the /raid0/log/rexster directory. You must create this directory before starting the script:

mkdir /raid0/log/rexster

Save the file, and start Rexster with upstart:

sudo start rexster

Check the log to ensure successful startup:

cd /raid0/log/rexster
tail rexster.log

Success!

You now have a fully configured Cassandra, ElasticSearch, Titan, and Rexster setup on a single node. Once you’ve applied this configuration to all of your nodes, you can start Rexster Server on the entire cluster, set up an ELB that contains your Rexster instances and binds port 80 to port 8182, and start accepting Rexster requests at the ELB’s domain.
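
For reference, here is a sketch of that ELB setup using the AWS CLI; the load balancer name, availability zone, and instance IDs are all hypothetical:

aws elb create-load-balancer --load-balancer-name rexster-lb --listeners Protocol=HTTP,LoadBalancerPort=80,InstanceProtocol=HTTP,InstancePort=8182 --availability-zones us-west-1a

aws elb register-instances-with-load-balancer --load-balancer-name rexster-lb --instances i-0123456a i-0123456b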


Jenny Kim

Jenny Kim is a senior software engineer at Cobrain, where she works with the data science team. Jenny graduated from the University of Maryland with a B.S. in Computer Science and a B.A. in American Studies. She received her Master’s in Information Systems Technology from The George Washington University in December 2013. In her free time, Jenny enjoys volunteering at local film festivals, obsessive vacuuming, and relaxing with the family Shih Tzu.
