Tuesday, March 10, 2015

Apache Storm: How to add external JARs or packages into CLASSPATH while running 'storm jar'?

I have been playing with 'storm-kafka' and 'storm-hbase' lately. Basically, they are projects/tools that one can use to integrate Kafka and HBase with Storm.

My project was to have Kafka as the spout and HBase as the bolt. In other words, my application will pull data from Kafka and then write the output to HBase. In other other words (pun intended :), I need to have the storm-kafka and storm-hbase JARs included when I run my Storm topology.

There are a few ways to do this:
(1) Put the JARs under STORM_BASE_DIR
(2) Put the JARs under STORM_BASE_DIR/lib
(3) Put the package under STORM_CONF_DIR
(4) Include the package into the topology JAR

After trying the above few methods, my favourite is method #4. However, it is not without its own pain points.

Let me explain why I do not like the other methods.

Method #1
By putting JARs into STORM_BASE_DIR, I have a feeling that I have 'corrupted' the directory. Messing up a standard directory of a product is not my cup of tea.

Method #2
See Method #1 above.

Method #3
Since the storm.py code does not search the directory declared as STORM_CONF_DIR (or USER_CONF_DIR) for JARs, you would have to put the package files in that directory. How many times have I said 'messy'? :)

Now, let's discuss Method #4. I say it is my favourite, but I never said it is the best. That is because it will greatly grow your JAR file size if you have some really big external JARs to include (besides your topology). However, I feel that it is the most acceptable approach because it is more manageable than the other methods (at least to me :).

Hence, if you are looking into including or adding external JARs while running your Storm topology, I would suggest including those JARs in your topology JAR for the time being, until there is a neater way to do this!
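For Maven-based topologies, one common way to do Method #4 is the maven-shade-plugin, which bundles dependencies such as storm-kafka and storm-hbase into a single "fat" topology JAR. This is only a sketch: the version numbers are illustrative (match them to your cluster), and storm-core stays at 'provided' scope because the cluster already supplies it.

```xml
<!-- Sketch: bundle storm-kafka / storm-hbase into the topology JAR.
     Versions are examples; align them with your cluster (HDP 2.2 ships Storm 0.9.3). -->
<dependencies>
  <dependency>
    <groupId>org.apache.storm</groupId>
    <artifactId>storm-core</artifactId>
    <version>0.9.3</version>
    <scope>provided</scope>  <!-- supplied by the cluster; do NOT shade it in -->
  </dependency>
  <dependency>
    <groupId>org.apache.storm</groupId>
    <artifactId>storm-kafka</artifactId>
    <version>0.9.3</version>
  </dependency>
  <dependency>
    <groupId>org.apache.storm</groupId>
    <artifactId>storm-hbase</artifactId>
    <version>0.9.3</version>
  </dependency>
</dependencies>

<build>
  <plugins>
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-shade-plugin</artifactId>
      <executions>
        <execution>
          <phase>package</phase>
          <goals><goal>shade</goal></goals>
        </execution>
      </executions>
    </plugin>
  </plugins>
</build>
```

After 'mvn package', the shaded JAR under target/ is what you pass to 'storm jar'.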

Environment = HDP 2.2 (Storm 0.9.3)

Source code of storm.py and how the CLASSPATH is determined
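As a rough paraphrase of what that storm.py source does (this is a simplified sketch, not the actual code; function names mirror the 0.9.x launcher but the real script does a bit more, e.g. path normalization), the CLASSPATH is assembled from JARs directly under the Storm base dir, JARs under its lib/ subdir, and then the extra entries such as the topology JAR and the conf directory. Note the conf dir is added as a directory, not scanned for JARs, which is why Method #3 does not work with loose JAR files.

```python
import os

def get_jars_full(adir):
    """List every *.jar file directly under adir (like storm.py's get_jars_full)."""
    if not os.path.isdir(adir):
        return []
    return [os.path.join(adir, f) for f in os.listdir(adir) if f.endswith(".jar")]

def get_classpath(storm_dir, conf_dir, topology_jar):
    """Simplified paraphrase of how 'storm jar' builds the CLASSPATH in 0.9.x:
    JARs under STORM_DIR, then JARs under STORM_DIR/lib, then the extra
    entries (topology JAR and the conf directory itself)."""
    entries = get_jars_full(storm_dir)
    entries.extend(get_jars_full(os.path.join(storm_dir, "lib")))
    entries.extend([topology_jar, conf_dir])
    return os.pathsep.join(entries)
```

Because only STORM_DIR and STORM_DIR/lib are scanned for JARs, methods #1 and #2 work but pollute the installation, and a JAR dropped into the conf dir is simply never seen.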

Tuesday, March 3, 2015

Apache Mesos: Mesos + Marathon + Docker = ?

In my previous post, I talked about merging resources from multiple nodes into one using Apache Mesos.

I also talked about the 2 reasons I decided to pick up Mesos:
(1) Google Kubernetes
(2) Docker

In this post, I am going to share the method you can use to deploy a Docker container on a Mesos cluster.

* Mesos and Zookeeper have to be up and running. For installation instructions, please refer to my previous post.
** I used CentOS 6.6 as the platform. So, some commands might differ on other platforms.
*** Make sure "docker-io" (the Docker package) is installed and the daemon is running on all Mesos slave nodes.

(1) Check and update (if needed) the /etc/mesos/zk file on the node you wish to install Marathon.

vi /etc/mesos/zk

** Make sure the IP address of the Zookeeper server is written there

(2) Install Marathon (using the Mesosphere repository created in the previous post) on the node selected above (the Master node is recommended for testing purposes and ease of maintenance):

yum install marathon

(3) Make sure Marathon is up and running after the installation. Otherwise, start it using:

initctl start marathon 

** Use "ps -ef" command to verify the Marathon process is running with the proper IP addresses (instead of 'localhost') of the zookeeper-server and Master node. If it is not, check the /etc/mesos/zk file again and restart.

(4) Update all the Mesos slave nodes with the following:

echo 'docker,mesos' > /etc/mesos-slave/containerizers

echo '5mins' > /etc/mesos-slave/executor_registration_timeout

(5) Restart all the Mesos slave nodes:

initctl restart mesos-slave 

(6) By now, you should be ready to deploy a Docker container on the Mesos cluster. To do so, you have to create a JSON file for the Docker container you wish to deploy:


{
   "container": {
      "type": "DOCKER",
      "docker": {
         "image": "",
         "network": "HOST"
      }
   },
   "id": "centos63",
   "instances": 1,
   "cpus": 4,
   "mem": 2048,
   "uris": [],
   "cmd": "/usr/sbin/httpd -DFOREGROUND"
}


There are a few things to take note:
(a) For the "image" parameter, you would need to specify an image that is reachable by all of the Mesos slaves, because you would not know for sure which slave or slaves the Master will select to run the container.

** If you are using your own insecure private registry, please make sure you edit the Docker "default" file (eg. /etc/sysconfig/docker or /etc/default/docker) to declare the registry as insecure, then restart the Docker service (service docker restart).
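On CentOS 6 that file is typically /etc/sysconfig/docker. The exact variable name varies with the docker-io version (it may be "other_args" or "OPTIONS"), so treat this as a sketch; the registry address below is a placeholder for your own.

```
# /etc/sysconfig/docker -- sketch; the variable may be 'other_args' or
# 'OPTIONS' depending on the docker-io version.
# Replace the address with your own private registry.
other_args="--insecure-registry=myregistry.example.com:5000"
```

Remember this has to be done on every Mesos slave node, since any of them may be asked to pull the image.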


(b) AFAIK, Mesos (0.21.1) only supports 2 network modes for now - HOST or BRIDGE.

(c) The "cmd" parameter works like CMD in Docker.

(7) Once the JSON file is ready, you can submit it to Marathon using the POST method:

curl -X POST -H "Content-Type: application/json" http://<marathon host>:8080/v2/apps -d@<JSON filename> 

Marathon GUI: After the CURL command and when Mesos is deploying the container

Marathon GUI: The container is successfully deployed and RUNNING

Mesos GUI: Shows one active task running on "mesos4" slave node

Mesos GUI: Clicking on the task shows the details (it's a SANDBOX)

Mesos GUI: STDOUT and STDERR are streamed from the container to the sandbox

On 'mesos4' node, the image is downloaded from the private repo and a container is running

On 'mesos4' node, 'docker inspect <container id>' shows the networking mode is HOST as configured

On 'mesos3' node, a HTTP connection shows HTTPD container is indeed running on 'mesos4' node