Guest Post by Jenny Kim
The purpose of this post is to provide a walkthrough of a Titan cluster setup and highlight some key gotchas I’ve learned along the way. This walkthrough uses the following versions of each software package:
- Datastax Cassandra Auto-Clustering Community AMI Version 2.4
- Oracle Java 1.7 (should be automatically included in the Datastax AMI)
- Titan 0.4.1 Full Distribution
- ElasticSearch 0.90.7
The cluster in this walkthrough will utilize 2 M1.Large instances, which mirrors our current Staging cluster setup. A typical production graph cluster utilizes 4 M1.XLarge instances.
NOTE: While the Datastax Community AMI requires, at minimum, M1.Large instances, the exact instance type and cluster size should depend on your expected graph size, concurrent requests, and replication and consistency needs.
Part 1: Setup a Cassandra cluster
I followed Titan’s EC2 instructions for standing up Titan on a Cassandra cluster using the Datastax Auto-Clustering AMI:
Step 1: Setting up Security Group
Navigate to the EC2 Console Dashboard, then click on Security Groups under Network & Security.
Create a new security group. Click Inbound. Set the “Create a new rule” dropdown menu to “Custom TCP rule”.
Add a rule for port 22 from source 0.0.0.0/0.
Add a rule for ports 1024-65535 from the security group members. If you don’t want to open all unprivileged ports among security group members, then at least open 7000, 7199, and 9160 among security group members.
Tip: the “Source” dropdown will autocomplete security group identifiers once “sg” is typed in the box, so you needn’t have the exact value ready beforehand.
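The same rules can also be expressed on the command line. This is a minimal sketch, assuming the AWS CLI is installed and configured; the walkthrough itself uses the console. The function name print_sg_rules and the group ID sg-EXAMPLE are hypothetical, and the function only prints the commands so they can be reviewed before anything is changed. It uses the narrower option of opening just ports 7000, 7199, and 9160 among group members.

```shell
#!/usr/bin/env bash
# Sketch: print (do not run) AWS CLI commands for the inbound rules above.
# print_sg_rules and the group ID passed to it are illustrative assumptions.

print_sg_rules() {
  local sg_id="$1"
  local port

  # SSH from anywhere: port 22, source 0.0.0.0/0
  echo "aws ec2 authorize-security-group-ingress --group-id ${sg_id} --protocol tcp --port 22 --cidr 0.0.0.0/0"

  # Cassandra ports, open only to members of the same security group
  for port in 7000 7199 9160; do
    echo "aws ec2 authorize-security-group-ingress --group-id ${sg_id} --protocol tcp --port ${port} --source-group ${sg_id}"
  done
}

# Example: review the generated commands for a placeholder group ID
print_sg_rules "sg-EXAMPLE"
```

Printing rather than executing keeps the sketch safe to run anywhere; the output can be copied into a terminal once the group ID has been replaced with a real one.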
Step 2: Launch DataStax Cassandra AMI
Launch the DataStax AMI in your desired zone
On the Instance Details page of the Request Instances Wizard, set “Number of Instances” to your desired number of Cassandra nodes (e.g., 2).
Set “Instance Type” to at least m1.large.
On the Advanced Instance Options page of the Request Instances Wizard, set the “as text” radio button under “User Data”, then fill this into the text box.
--clustername [cassandra-cluster-name] --totalnodes [number-of-instances] --version community --opscenter no
[number-of-instances] in this configuration must match the number of EC2 instances configured on the previous wizard page (e.g., 2). [cassandra-cluster-name] can be any string used for identification. For example:
--clustername titan-staging --totalnodes 2 --version community --opscenter no
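Because --totalnodes has to match the instance count entered in the wizard, it can help to generate the User Data string rather than type it by hand. The following is a minimal sketch; build_user_data is a hypothetical helper, not part of the Datastax AMI, and it only emits the four options shown above.

```shell
#!/usr/bin/env bash
# Sketch: build the User Data line for the Datastax AMI from two inputs.
# build_user_data is an illustrative helper name, not an AMI tool.

build_user_data() {
  local cluster_name="$1"
  local total_nodes="$2"

  # The node count must be a positive integer
  case "$total_nodes" in
    ''|*[!0-9]*)
      echo "total nodes must be a positive integer" >&2
      return 1
      ;;
  esac
  if [ "$total_nodes" -lt 1 ]; then
    echo "total nodes must be a positive integer" >&2
    return 1
  fi

  echo "--clustername ${cluster_name} --totalnodes ${total_nodes} --version community --opscenter no"
}

# Example: the staging cluster from this walkthrough
build_user_data titan-staging 2
```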
On the Tags page of the Request Instances Wizard you can apply any desired configurations. These tags exist only at the EC2 administrative level and have no effect on the Cassandra daemons’ configuration or operation.
- It is useful here to set a tag for ElasticSearch to discover this node when identifying its cluster nodes. We will revisit this tag in the ElasticSearch section.
On the Create Key Pair page of the Request Instances Wizard, either select an existing key pair or create a new one. The PEM file containing the private half of the selected key pair will be required to connect to these instances.
On the Configure Firewall page of the Request Instances Wizard, select the security group created earlier.
Review and launch instances on the final wizard page. The AMI will take a few minutes to load.
Step 3: Verify Successful Instance Launch
SSH into any Cassandra instance node:
ssh -i [your-private-key].pem ubuntu@[public-dns-name-of-any-cassandra-instance]
Run the Cassandra nodetool to inspect the state of the Cassandra token ring:
nodetool -h 127.0.0.1 ring
- You should see as many nodes in this command’s output as instances launched in the previous steps. Status should say UP for all rows.
- Note that the AMI takes a few minutes to configure each instance. A shell prompt will appear upon successful configuration when you SSH into the instance.
If upon shelling in, Cassandra still appears to be loading, Ctrl-C to quit and restart Cassandra with:
sudo service cassandra restart
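The ring check can be scripted instead of read by eye. This is a minimal sketch, assuming each node appears on its own line of the nodetool ring output with a status field reading "Up" (matched case-insensitively here); count_up_nodes and ring_is_healthy are hypothetical helper names.

```shell
#!/usr/bin/env bash
# Sketch: count nodes reported as up in `nodetool ring` output read from stdin.
# Assumes one node per line with a whitespace-separated status field of "Up".

count_up_nodes() {
  awk '{
    for (i = 1; i <= NF; i++) {
      if (toupper($i) == "UP") { n++; break }
    }
  } END { print n + 0 }'
}

# Succeeds only if the number of up nodes equals the expected cluster size.
ring_is_healthy() {
  local expected="$1"
  local up
  up="$(count_up_nodes)"
  [ "$up" -eq "$expected" ]
}

# Usage on a Cassandra node (not executed here):
#   nodetool -h 127.0.0.1 ring | ring_is_healthy 2 && echo "ring OK"
```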
Part 2: Install Titan
Titan can be embedded within each Cassandra node-instance, or installed remotely from the cluster. I installed Titan on each Cassandra instance, but did not run it in embedded mode.
Step 1: Download Titan
SSH into a Cassandra instance node and within the ubuntu home directory, download the Titan 0.4.1 server distribution ZIP:
Unzip the Titan directory and move to /opt/:
unzip titan-server-0.4.1.zip
sudo mv titan-server-0.4.1 /opt/
cd to /opt/ and create a symlink from /opt/titan to /opt/titan-server-0.4.1
cd /opt/
sudo ln -s titan-server-0.4.1 titan
Step 2: Configure Titan
We need to create a specific Titan configuration file that can be used when we run the Gremlin shell. This configuration will include our storage settings, cache settings, and search index settings.
Create a new properties file (e.g., mygraph.properties) within the /opt/titan/conf folder:
The storage settings should specify the backend as Cassandra and include the private IP of one of the Cassandra nodes. Additional Cassandra configuration options are listed here: https://github.com/thinkaurelius/titan/wiki/Using-Cassandra#cassandra-specific-configuration
The database cache settings should be enabled in a Production environment. Full documentation is found here: https://github.com/thinkaurelius/titan/wiki/Database-Cache. For our purposes, we will just enable the db-cache, set the clean time (milliseconds to wait to clean cache), cache-time (max milliseconds to hold items in cache), and cache-size (percentage of total heap space available to the JVM that Titan runs in).
cache.db-cache = true
cache.db-cache-clean-wait = 50
cache.db-cache-time = 10000
cache.db-cache-size = 0.25
The search index settings should specify “elasticsearch” as the external search backend, and configure it for the remote ElasticSearch cluster.
- First you must set up an ElasticSearch cluster, which we have done using the same nodes as the Cassandra cluster.
- Refer to the Deploying an ElasticSearch Cluster post for instructions on how to do this.
- Make note of the cluster name and host IPs for all nodes in the ES cluster.
Based on the above ES cluster settings, add the following to your properties file, replacing the bracketed hostname and cluster-name placeholders with your specific settings:
storage.index.search.backend=elasticsearch
storage.index.search.hostname=[comma-separated-es-host-ips]
storage.index.search.cluster-name=[es-cluster-name]
storage.index.search.index-name=titan
storage.index.search.client-only=true
storage.index.search.sniff=false
storage.index.search.local-mode=false
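Putting the three groups of settings together, the whole properties file can be generated with a small script. This is a minimal sketch: write_titan_properties is a hypothetical helper, and the two storage lines use the Titan keys storage.backend and storage.hostname as an assumption about what the storage section contains. The cache and search values are the ones given in this walkthrough.

```shell
#!/usr/bin/env bash
# Sketch: write a Titan properties file with storage, cache, and search settings.
# write_titan_properties is an illustrative helper; all four arguments are
# values you supply for your own cluster.

write_titan_properties() {
  local out_file="$1"      # e.g. /opt/titan/conf/mygraph.properties
  local cassandra_ip="$2"  # private IP of one Cassandra node
  local es_hosts="$3"      # comma-separated ElasticSearch host IPs
  local es_cluster="$4"    # ElasticSearch cluster name

  cat > "$out_file" <<EOF
# Storage: Cassandra backend (assumed keys: storage.backend, storage.hostname)
storage.backend=cassandra
storage.hostname=${cassandra_ip}

# Database cache
cache.db-cache = true
cache.db-cache-clean-wait = 50
cache.db-cache-time = 10000
cache.db-cache-size = 0.25

# External search index on the remote ElasticSearch cluster
storage.index.search.backend=elasticsearch
storage.index.search.hostname=${es_hosts}
storage.index.search.cluster-name=${es_cluster}
storage.index.search.index-name=titan
storage.index.search.client-only=true
storage.index.search.sniff=false
storage.index.search.local-mode=false
EOF
}

# Usage (not executed here; the IPs and cluster name are placeholders):
#   write_titan_properties /opt/titan/conf/mygraph.properties \
#     10.0.0.1 10.0.0.1,10.0.0.2 my-es-cluster
```

Generating the file from arguments makes it easier to keep the configuration identical across nodes while changing only the host-specific values.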
Save the file, and test in Gremlin:
bin/gremlin.sh
gremlin> g = TitanFactory.open('conf/mygraph.properties')
You should see Gremlin connect to the Cassandra cluster and return a blank Gremlin prompt. Success! Keep Gremlin open for the next Step.
Step 3: Run Indices
Now before we add any data to our graph, we need to do a one-time setup of any Titan and ElasticSearch property and label indices. This must be done with caution because in Titan, indexes cannot be modified, dropped, or added on existing properties (Titan Limitations).
Note that we created a script for our indices and tracked it in GitHub so we can quickly adapt our indices when updating and reloading a new graph. Also keep in mind that Titan 0.4.1 has a new index syntax that is not backwards compatible with the old Titan 0.3.2 syntax.
In the Gremlin shell, copy and paste the indices script. If all indices run successfully, commit, shut down, and exit:
gremlin> g.commit()
gremlin> g.shutdown()
gremlin> exit
(Optional) Part 3: Load GraphSON
If you are doing a bulk load of GraphSON into Titan, you can do so via Faunus or Gremlin. The GraphSON format for each method is unique, so you will need to ensure that your GraphSON format adheres to the expected rules. This walkthrough will focus on the Gremlin GraphSON load.
Save the GraphSON file (e.g., gremlin.graphson.json) to the root of the Titan directory.
Edit the bin/gremlin.sh script file to increase the maximum JVM heap size to 4GB (after making sure the machine has enough memory), to avoid “GC overhead limit exceeded” errors.
(bin/gremlin.sh, line 25)
JAVA_OPTIONS="-Xms32m -Xmx4096m"
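If the heap setting has to be changed on several nodes, the edit can be scripted. This is a minimal sketch, assuming the script contains a JAVA_OPTIONS= line with an existing -Xmx value; set_gremlin_max_heap is a hypothetical helper, and it leaves every other flag on the line untouched.

```shell
#!/usr/bin/env bash
# Sketch: rewrite the -Xmx value on the JAVA_OPTIONS line of a script in place.
# set_gremlin_max_heap is an illustrative helper name.

set_gremlin_max_heap() {
  local script="$1"  # e.g. bin/gremlin.sh
  local heap="$2"    # e.g. 4096m
  local tmp
  local status

  tmp="$(mktemp)" || return 1

  # Only touch the -Xmx flag, and only on the JAVA_OPTIONS line
  if sed "/^[[:space:]]*JAVA_OPTIONS=/ s/-Xmx[0-9][0-9]*[kKmMgG]/-Xmx${heap}/" "$script" > "$tmp"; then
    # Write back through the original file so its permissions are preserved
    cat "$tmp" > "$script"
    status=$?
  else
    status=1
  fi

  rm -f "$tmp"
  return "$status"
}

# Usage from the Titan directory (not executed here):
#   set_gremlin_max_heap bin/gremlin.sh 4096m
```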
Create a new groovy script to load the GraphSON and auto-commit:
(loader.groovy)
g = TitanFactory.open('conf/mygraph.properties')
g.loadGraphSON('gremlin.graphson.json')
g.commit()
Run the Groovy script through Gremlin
bin/gremlin.sh -e loader.groovy
This will take a while…a 500 MB graphSON file generally takes about 1.5 hours to finish loading (assuming no errors).
Part 4: Configure Rexster
Titan 0.4.1 Server now ships with Rexster Server, so you can run a Titan-configured Rexster server by running the rexster.sh script from Titan’s bin directory.
Step 1: Rexster Configuration
To run Rexster, you will need to create a Rexster configuration XML, which Rexster by default expects to be under $TITAN_HOME/rexhome/config/rexster.xml (e.g., /opt/titan/rexhome/config/rexster.xml).
Under /opt/titan/rexhome create a config directory
Create a rexster.xml document under the config directory. Alternatively, copy the /opt/titan/conf/rexster-cassandra-es.xml file into /opt/titan/rexhome/config/rexster.xml
The Rexster configuration needs a few changes to properly connect to the Cassandra cluster and ElasticSearch cluster. Here is an example configuration: https://gist.github.com/spaztic1215/7e4303b75184098e64fc
Update the base-uri property on Line 5 to the current instance’s DNS:
Update the graph properties to connect to Cassandra and ES:
graph com.thinkaurelius.titan.tinkerpop.rexster.TitanGraphConfiguration false cassandra secondaryOne 100 titan true 0.3 elasticsearch false secondaryOne,secondaryTwo fife titan false false tp:gremlin
Step 2: Test Rexster
Now we should test Rexster Server to make sure it is connecting properly to Titan and ES:
Change directory to $TITAN_HOME (e.g., /opt/titan) and start Rexster manually
You should see a bunch of Rexster console messages that indicate that it has connected to the Cassandra cluster and ES cluster. You can verify that Rexster is running in a browser by going to:
Verify that the Titan graph is found by Rexster:
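A minimal sketch for building the URLs to check, assuming Rexster listens on port 8182 (the port the ELB maps to later in this post), that it serves its list of graphs under /graphs, and that the graph is named graph in rexster.xml. The helper name rexster_url is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: build Rexster REST URLs for a host and an optional graph name.
# Port 8182 and the /graphs path are assumptions about the Rexster setup.

rexster_url() {
  local host="$1"
  local graph="${2:-}"

  if [ -n "$graph" ]; then
    echo "http://${host}:8182/graphs/${graph}"
  else
    echo "http://${host}:8182/graphs"
  fi
}

# Example: URLs for a local Rexster server
rexster_url localhost
rexster_url localhost graph

# To query them from the shell instead of a browser (not executed here):
#   curl -s "$(rexster_url localhost graph)"
```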
Step 3: Create a Rexster Upstart Script
Once you’ve confirmed that Rexster can successfully start, we will create an upstart script to manage the Rexster start/stop process.
Under /etc/init, create a configuration called rexster.conf
A sample rexster upstart configuration can be found here: https://gist.github.com/spaztic1215/5bfc2ee2d370b933c8ca
Note that the above configuration assumes that you are using the Datastax AMI which includes a raid0 directory, and thus logs to the /raid0/log/rexster directory. You must create this directory before starting the script:
Save file, and start rexster with upstart:
sudo start rexster
Check the log to ensure successful startup:
cd /raid0/log/rexster
tail rexster.log
You now have a fully configured Cassandra, ElasticSearch, Titan, Rexster setup on a single node. Once you’ve applied this configuration to all your nodes, you can start Rexster Server on the entire cluster, set up an ELB that contains your Rexster instances binding port 80 to 8182, and start accepting Rexster requests from the ELB’s domain.
Jenny Kim is a senior software engineer at Cobrain, where she works with the data science team. Jenny graduated from the University of Maryland with a B.S. in Computer Science and a B.A. in American Studies. She acquired her Master’s in Information Systems Technology from The George Washington University in December 2013. In her free time, Jenny enjoys volunteering at local film festivals, obsessive vacuuming, and relaxing with the family Shih Tzu.