Experimenting: Apache Cassandra on Azure

I found the time to have a serious go at Cassandra. Last weekend I experimented with getting a load-balanced cluster up and running in Windows Azure. Here’s how that went.

Disclaimer: I’m no expert, and this may not be the absolute best approach to Cassandra on Azure, but it worked for me :).

Introduction & Resources:

As I found out, there are a couple of ways to do this:

  • Debian with the vanilla version of Cassandra from cassandra.apache.org. This takes a bit more effort to get up and running; as far as I could see, most of the configuration has to be done manually.
  • Debian with the DataStax Cassandra installation. Simple enough, and it installs as a service, which makes things a bit easier.
  • Debian with DataStax OpsCenter Community edition. This gives you a web interface to build and manage clusters.

Having found the first two approaches a bit fiddly, I went with the third.

Setting up Virtual Machines in Azure:

Log in to Windows Azure, select Virtual Machines on the right, and click the New button at the bottom. At the time of writing, portal.azure.com does not support managing virtual machines. Create a new Ubuntu virtual machine; I selected version 13.10, as that was what was available at the time of writing. I selected the Small VM size with 1 core and 1.75GB of RAM. Since Cassandra makes heavy use of memtables, this would not be suitable for a production-quality Cassandra deployment. For more information on creating virtual machines on Azure, go here.

Once created, go to the Endpoints tab for the virtual machine and add a new stand-alone endpoint for Cassandra:

Cassandra and OpsCenter endpoints for Azure

Port 9160 is the default port for Thrift, the client protocol Cassandra uses. Don’t worry about creating a load-balanced set yet; we will do this later.

I set up my first Cassandra node to also be my OpsCenter host. This should be fine for a learning experience, but is not advisable for serious Cassandra setups. Since we’ll be installing OpsCenter on this node, we might as well go ahead and add the OpsCenter port, 8888. The other Azure nodes you set up only need the Cassandra port.

Configuring the Virtual Machines:

SSH into the machine.

Cassandra requires the Oracle JVM, which is not installed by default on the Ubuntu instance we just created.

$ sudo add-apt-repository ppa:webupd8team/java
$ sudo apt-get update
$ sudo apt-get install oracle-java7-installer
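Before moving on, it can be worth confirming that the JVM actually installed. The following is only a minimal sketch: the function names `java_version_line` and `is_java7` are my own, and it only looks at the version string, so it does not tell Oracle's JVM apart from another Java 7 build.

```shell
# java_version_line: prints the first line of `java -version`.
# Java writes its version banner to stderr, hence the 2>&1.
java_version_line() {
  java -version 2>&1 | head -n 1
}

# is_java7 LINE: succeeds if LINE reports a 1.7.x runtime,
# for example: java version "1.7.0_51"
is_java7() {
  case "$1" in
    *'"1.7.'*) return 0 ;;
    *) return 1 ;;
  esac
}

# Example use on the VM (commented out so nothing runs on load):
# is_java7 "$(java_version_line)" && echo "Java 7 found" || echo "Java 7 missing"
```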

Next we’ll be installing the Datastax OpsCenter:

$ echo "deb http://debian.datastax.com/community stable main" | sudo tee -a /etc/apt/sources.list.d/datastax.community.list
$ curl -L http://debian.datastax.com/debian/repo_key | sudo apt-key add -
$ sudo apt-get update
$ sudo apt-get install opscenter
$ sudo service opscenterd start

Note that we added the Community repository from DataStax; this is a free version that we can use for playing with Cassandra. There is also an Enterprise version that requires a registered DataStax account.

Now you should be able to go to http://yourServiceName.cloudapp.net:8888 and reach OpsCenter. You should get a dark-looking page offering to either manage an existing cluster or build a new one. Since we plan to use more than just this node, we’ll come back to this later.
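If the page does not load, a quick way to narrow things down is to test whether the port is reachable at all, first from the VM itself and then from your own machine. This is a rough sketch that assumes bash and the coreutils `timeout` command; `port_open` and `report_port` are names I made up for illustration.

```shell
# port_open HOST PORT: succeeds if a TCP connection opens within 3 seconds.
# Relies on bash's built-in /dev/tcp redirection, so no extra tools are needed.
port_open() {
  local host="$1" port="$2"
  timeout 3 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null
}

# report_port HOST PORT: prints a one-line summary for a single endpoint.
report_port() {
  if port_open "$1" "$2"; then
    echo "$1:$2 open"
  else
    echo "$1:$2 closed"
  fi
}

# Example use (commented out so nothing runs on load):
# report_port localhost 8888                        # on the VM: is opscenterd listening?
# report_port yourServiceName.cloudapp.net 8888     # from outside: is the endpoint set up?
```

If the first check says open and the second says closed, the Azure endpoint is the likely culprit rather than OpsCenter itself.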

Adding more Virtual Machines:

To fully utilize Cassandra’s distributed nature, we’ll go ahead and add some more machines to what will be our cluster. First, go back to your original VM and open the Endpoints tab again. Select the Cassandra endpoint we created, and click Edit at the bottom to bring up the endpoint details. At the bottom there is a checkbox to “Create a Load-Balanced Set”; check it and then hit save. Applying this will take a few minutes while Azure configures VIPs and routing. This means that when a request goes to yourServiceName.cloudapp.net:9160, it will be load balanced across the available machines that implement that endpoint. This is a key element of Cassandra’s robust and scalable nature.

Now we can spin up some new machines as we did earlier.

  • Ensure that the SSH key, or the username and password, is the same as for the first machine we made.
  • Add the machine to the same cloud service as the first machine. This ensures that all the VMs can share the same DNS name for load balancing, and keeps them all in the same data center.
  • I kept mine in the same storage account, although you don’t have to.
  • Don’t worry about availability sets for now.
  • Don’t worry about the endpoints option for now; we’ll set this up after the machine is created.

Once the machines are created, go to the Endpoints tab for each machine and add a new endpoint. Select “Add an endpoint to an existing load-balanced set”, and select the one we created earlier.

Now we’ll need to SSH into each of these machines and add the Oracle JVM, as well as the DataStax repository that will be used to install Cassandra.

$ sudo add-apt-repository ppa:webupd8team/java
$ echo "deb http://debian.datastax.com/community stable main" | sudo tee -a /etc/apt/sources.list.d/datastax.community.list
$ curl -L http://debian.datastax.com/debian/repo_key | sudo apt-key add -
$ sudo apt-get update
$ sudo apt-get install oracle-java7-installer
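Since all the VMs sit behind one DNS name, each one is reached from outside on its own public SSH port (shown on its Endpoints tab). To save some typing, here is a small sketch that prints the ssh command for each node. The function name `ssh_targets`, the user name, and the port numbers in the example are all made up; substitute the values from your own portal.

```shell
# ssh_targets USER HOST PORT...: prints one ssh command per public SSH port.
# It only prints the commands; it does not connect to anything.
ssh_targets() {
  local user="$1" host="$2" port
  shift 2
  for port in "$@"; do
    echo "ssh -p ${port} ${user}@${host}"
  done
}

# Example use (user and ports are placeholders):
# ssh_targets azureuser yourServiceName.cloudapp.net 22 50022 50023
```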

And we’re all set to create our cluster.

Creating a Cassandra Cluster in OpsCenter:

Head back to that URL we visited earlier (http://yourServiceName.cloudapp.net:8888), where we were presented with the option to create a new cluster, and let’s go ahead and do that.

You should see something like this (image courtesy of datastax.com):

In the package option, select DataStax Community with the latest version. This removes the need to enter DataStax credentials.

The Node IPs are the internal IP addresses of the VMs we created. These machines will talk to each other to distribute the data replicas, and it’s more efficient if this communication is kept internal to the data center. The internal address can be found on the dashboard for each VM.
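If you are already logged in to a VM, you can also read the internal address from the machine itself instead of the dashboard. This is a sketch under a couple of assumptions: `hostname -I` is available (it is on Ubuntu) and the first address it lists is the internal one. The helper names `internal_ip` and `is_ipv4` are my own; the second is just a sanity check before pasting addresses into the OpsCenter form.

```shell
# internal_ip: prints the first address reported by `hostname -I` on this VM.
internal_ip() {
  hostname -I | awk '{print $1}'
}

# is_ipv4 ADDR: succeeds if ADDR looks like a dotted-quad IPv4 address.
is_ipv4() {
  local ip="$1" octet
  case "$ip" in
    *[!0-9.]*|*..*|.*|*.) return 1 ;;   # stray characters or empty octets
  esac
  local IFS=.
  set -- $ip                            # split on dots into $1..$n
  [ "$#" -eq 4 ] || return 1
  for octet in "$@"; do
    [ "${#octet}" -le 3 ] && [ "$octet" -le 255 ] || return 1
  done
  return 0
}

# Example use on a VM (commented out so nothing runs on load):
# ip="$(internal_ip)"; is_ipv4 "$ip" && echo "Node IP: $ip"
```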


Add the SSH credentials (this is why it was important that all the VMs had the same user created). Internally, the SSH port for each machine is still port 22; if you visit the Endpoints tab for each machine, you’ll see it has a different public-facing port. This is important because OpsCenter will attempt to bootstrap each of our clustered machines via SSH.

Hit Build! OpsCenter will now SSH into each VM, download the latest community version of Apache Cassandra and install it, as well as a service that is used to communicate with the OpsCenter controller.

Once this is done, you will be able to go to the Nodes tab and see your cluster! From here you can add new nodes to the cluster, or manage nodes individually. Data replication is done on a per-keyspace basis, so don’t worry if this didn’t come up during our configuration.
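To make the per-keyspace point concrete: replication is chosen when you create a keyspace, not when you build the cluster. Here is a minimal sketch that generates the CQL for it. The function name `keyspace_cql` and the keyspace name `demo` are made up, and SimpleStrategy with a replication factor of 3 is just an illustrative choice for a small single data center cluster like this one.

```shell
# keyspace_cql NAME RF: prints a CQL statement that creates keyspace NAME
# with SimpleStrategy and a replication factor of RF.
keyspace_cql() {
  local name="$1" rf="$2"
  printf "CREATE KEYSPACE %s WITH replication = {'class': 'SimpleStrategy', 'replication_factor': %s};\n" "$name" "$rf"
}

# Example use (commented out so nothing runs on load). Piping the statement
# into cqlsh on one of the nodes is an assumption about your setup:
# keyspace_cql demo 3 | cqlsh
```

With three nodes and a replication factor of 3, every node would hold a copy of every row in that keyspace.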

Looking Forward:

I’m currently building a load generator with Node.js to see if I can give my cluster a bit of a thrashing. It shouldn’t be hard, considering it’s on three very small VMs. Hopefully I’ll write more on this later.

If you have any other thoughts or corrections on this article, please let me know on Twitter (@flacnut).
