We (Ben Bengfort and Sean Murphy) are very excited to be holding the Teaching the Elephant to Read tutorial at the sold-out Strata + Hadoop World 2013 on Monday, October 28th. We will be discussing and using numerous software packages that can be time consuming to install on various operating systems and laptops. If you are taking our tutorial, we strongly encourage you to set aside an hour or two this weekend to follow the instructions below to install and configure the virtual machine needed for the class. The instructions have been tested and debugged, so you shouldn't have too many problems (famous last words!).

Important Notes

1. You will need a 64-bit machine and operating system for this tutorial. The virtual machine/image that we will be building and using has been tested on Mac OS X (up through Mavericks) and 64-bit Windows.
2. This process could take an hour or longer depending on the bandwidth of your connection, as you will need to download approximately 1 GB of software.

1) Install and Configure your Virtual Machine

First, you will need to install VirtualBox, free virtualization software from Oracle. Download the 64-bit version appropriate for your machine from the VirtualBox downloads page (https://www.virtualbox.org/wiki/Downloads).

Once VirtualBox is installed, you will need to grab an Ubuntu x64 Server 12.04 LTS image; you can download that directly from Ubuntu (http://releases.ubuntu.com/12.04/).


There are numerous tutorials online for creating a virtual machine from this image with VirtualBox. We recommend that you configure your virtual machine with at least 1 GB of RAM and a 12 GB hard drive.

2) Set Up Linux

Honestly, you don't have to do this. If you already have a user account that can sudo, you are good to go and can skip ahead to installing the software. If not, create a hadoop user with the following commands.
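
A minimal sketch, assuming you want a dedicated user named hadoop with sudo rights:

```bash
# Create a dedicated hadoop user (adduser will prompt for a password and details).
sudo adduser hadoop

# Add the new user to Ubuntu's sudo group so it can administer the machine.
sudo usermod -a -G sudo hadoop
```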

Log out and log back in as “hadoop.”

Now you need to install some software.
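
We haven't preserved the exact package list, but at a minimum you'll want SSH, build tools, and pip for the Python packages used later; something like:

```bash
# Refresh the package index and upgrade what's already installed.
sudo apt-get update && sudo apt-get upgrade -y

# SSH for Hadoop's daemons, build tooling, and pip for Python dependencies.
# (This package list is an assumption; adjust as needed.)
sudo apt-get install -y openssh-server build-essential python-dev python-pip
```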

The above installs may take some time.

At this point you should probably generate some SSH keys (Hadoop needs them, and they let you ssh into the VM from your host so you can get out of the VirtualBox console).
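
For example, to create an RSA key pair with an empty passphrase:

```bash
# Generate an RSA key pair; -P "" leaves the passphrase blank.
ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa
```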

Make sure that you leave the passphrase blank; Hadoop will need the keys if you're setting up a cluster for more than one user. Also note that it is good practice to keep the administrator separate from the hadoop user, but since this is a development cluster, we're just taking a shortcut and leaving them the same.

One final step: copy the key into authorized_keys so that it is authorized for ssh.
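
```bash
# Append the public key to authorized_keys so ssh to localhost needs no password.
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

# Verify that it works (accept the host key the first time you're asked).
ssh localhost
```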

3) Install and Configure Hadoop

Hadoop requires Java, and since we're using Ubuntu, we're going to use OpenJDK rather than Sun's because Ubuntu doesn't provide a .deb package for Oracle Java. Hadoop supports OpenJDK with a few minor caveats; see the HadoopJavaVersions page on the Hadoop wiki. If you'd like to install a different version, the wiki also covers installing Java for Hadoop.
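
On Ubuntu 12.04 this is a single package (we're assuming OpenJDK 7 here; OpenJDK 6 also works with Hadoop 1.x):

```bash
# Install OpenJDK 7 from the Ubuntu repositories.
sudo apt-get install -y openjdk-7-jdk
```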

Do a quick check to make sure you have the right version of Java installed:
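
```bash
java -version
# Expect an OpenJDK runtime, e.g.: java version "1.7.0_25" ... OpenJDK 64-Bit Server VM
```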

Now we need to disable IPv6 on Ubuntu; there is an issue where Hadoop, when binding to 0.0.0.0, also binds to the IPv6 address. This isn't too hard to fix: simply edit (with the editor of your choice, I prefer vim) the /etc/sysctl.conf file using the following command
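
```bash
sudo vim /etc/sysctl.conf
```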

and add the following lines to the end of the file:
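
These are the standard system-wide IPv6-disabling settings:

```
# disable ipv6
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1
```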

Unfortunately you'll have to reboot your machine for this change to take effect. You can then check the status with the following command (0 means IPv6 is enabled, 1 means it's disabled):
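
```bash
cat /proc/sys/net/ipv6/conf/all/disable_ipv6
```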

Next, download a stable Hadoop release and unpack it in a location of your choice. We've debated internally which directory to place Hadoop and other distributed services like Cassandra or Titan in, but we've landed on /srv thanks to this post. Unpack the file, change the owner to the hadoop user, and then create a symlink from the versioned directory to a local hadoop link. This will allow you to point the link at whatever version is current without worrying about losing track of versioning.
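
A sketch of the whole sequence, assuming Hadoop 1.2.1 (the stable release at the time; substitute your preferred mirror and version):

```bash
# Download a Hadoop 1.x release from the Apache archive.
wget http://archive.apache.org/dist/hadoop/common/hadoop-1.2.1/hadoop-1.2.1.tar.gz

# Unpack into /srv, hand ownership to the hadoop user, and symlink the version.
sudo tar -xzf hadoop-1.2.1.tar.gz -C /srv
sudo chown -R hadoop:hadoop /srv/hadoop-1.2.1
sudo ln -s /srv/hadoop-1.2.1 /srv/hadoop
```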

Now we have to configure some environment variables so that everything executes correctly, and while we're at it we'll create a couple of aliases in our bash profile to make our lives a bit easier. Edit the ~/.profile file in your home directory and add the following to the end:
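
Something along these lines (the JAVA_HOME path assumes the 64-bit OpenJDK 7 install above, and the aliases are just conveniences we find handy):

```bash
# Hadoop environment
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
export HADOOP_HOME=/srv/hadoop
export PATH=$PATH:$HADOOP_HOME/bin

# Convenience aliases
alias hfs="hadoop fs"
alias hls="hfs -ls"
```

Reload the profile with `source ~/.profile` (or log out and back in) so the changes take effect.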

We’ll continue configuring the Hadoop environment. Edit the following files in /srv/hadoop/conf/:

core-site.xml
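
Typical single-node values look like this; the temp directory matches the /app/hadoop/tmp path we create below, and the port follows Michael Noll's convention:

```xml
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/app/hadoop/tmp</value>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:54310</value>
  </property>
</configuration>
```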

hdfs-site.xml
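
A replication factor of 1 is all a single-node cluster needs:

```xml
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
```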

mapred-site.xml
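
And point the JobTracker at localhost (again, the port number is the conventional single-node choice):

```xml
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:54311</value>
  </property>
</configuration>
```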

That's it, configuration over! But before we get going we have to format the distributed filesystem in order to use it. We'll store our filesystem in the /app/hadoop/tmp directory, as per Michael Noll and as we set in the core-site.xml configuration. We'll have to set up this directory and then format the name node.
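
```bash
# Create the HDFS data directory and give it to the hadoop user.
sudo mkdir -p /app/hadoop/tmp
sudo chown -R hadoop:hadoop /app/hadoop

# Format the name node. Only do this once: it wipes the distributed filesystem!
hadoop namenode -format
```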

You should now be able to run Hadoop’s start-all.sh command to start all the relevant daemons:
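
```bash
/srv/hadoop/bin/start-all.sh
```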

And you can use the jps command to see what’s running:
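
```bash
jps
# A healthy single-node setup lists NameNode, DataNode, SecondaryNameNode,
# JobTracker, TaskTracker, and Jps itself.
```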

Furthermore, you can access the various Hadoop web interfaces as follows:
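
On a stock Hadoop 1.x installation the defaults are:

NameNode (HDFS): http://localhost:50070/
JobTracker (MapReduce): http://localhost:50030/
TaskTracker: http://localhost:50060/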

To stop Hadoop, simply run the stop-all.sh command.
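
```bash
/srv/hadoop/bin/stop-all.sh
```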

4) Install Python Packages and the Code for the Class

To run the code in this section, you'll need to install some Python packages as dependencies, in particular the NLTK library. The simplest way to install these packages is with the requirements.txt file that comes with the code in our repository, which we'll clone into a directory called tutorial.
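
A sketch of the steps (the repository URL here is illustrative, so double-check it against the tutorial materials):

```bash
# Clone the course code into a directory called tutorial.
git clone https://github.com/bbengfort/strata-teaching-the-elephant-to-read.git tutorial
cd tutorial

# Install the Python dependencies.
sudo pip install -r requirements.txt
```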

However, if you simply want to install the dependencies yourself, here are the contents of the requirements.txt file:
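
We won't reproduce the exact pinned versions here; at a minimum it includes NLTK (the repository's copy of the file is authoritative):

```
# Illustrative contents: check the repository for the exact packages and versions.
nltk
numpy
```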

You'll also have to download the NLTK data packages, which will install to /usr/share/nltk_data unless you set an environment variable called NLTK_DATA. The best way to install all this data is as follows:
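
```bash
# Download the NLTK data collections to the shared location
# (sudo is needed to write to /usr/share).
sudo python -m nltk.downloader -d /usr/share/nltk_data all
```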