Saturday, September 12, 2015

Running Apache Spark on Mesos

Running Apache Spark on Mesos is easier than I thought!

I have 3 nodes at home (hdp1, hdp2 and hdp3) which are running Hortonworks Data Platform (HDP). I have HDFS, Zookeeper and Spark installed on the cluster.

To test running Spark on Mesos, I decided to reuse the cluster for simplicity's sake.

All I did was:
(1) install Mesos master on node hdp1.
(2) install Mesos slave on nodes hdp2 and hdp3.
(3) configure the master and slaves accordingly.

(i) For more information about installing and configuring Mesos, you can refer to this blog entry.
(ii) I reused the Zookeeper that was installed with HDP.
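
To make the Zookeeper part a bit more concrete, here is a rough sketch of how the Zookeeper URL for Mesos can be put together. I am assuming here that Zookeeper runs on all three nodes on its default port 2181 and that Mesos uses the "/mesos" znode; the work directory is also just an example, so adjust all of these to your own cluster. The script only prints the commands and does not start anything.

```shell
# Assumption: Zookeeper runs on hdp1, hdp2 and hdp3 on the default port 2181.
ZK_HOSTS="${ZK_HOSTS:-hdp1 hdp2 hdp3}"
ZK_PORT="${ZK_PORT:-2181}"

# Join the hosts into "host:port,host:port,..." form.
ZK_LIST=""
for h in ${ZK_HOSTS}; do
  ZK_LIST="${ZK_LIST:+${ZK_LIST},}${h}:${ZK_PORT}"
done

# Assumption: Mesos keeps its state under the /mesos znode.
ZK_URL="zk://${ZK_LIST}/mesos"
echo "Zookeeper URL: ${ZK_URL}"

# With a single master (hdp1), a quorum of 1 is enough.
# The work directory below is an example path, not taken from my setup.
echo "on hdp1:        mesos-master --zk=${ZK_URL} --quorum=1 --work_dir=/var/lib/mesos"
echo "on hdp2, hdp3:  mesos-slave --master=${ZK_URL}"
```

If you installed Mesos from packages, the same values usually go into configuration files rather than command-line flags, so check the blog entry above for the details.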

Once Mesos was up and running, I carried out a very simple test:
(1) Use "spark-shell" as the Spark client to connect to the Mesos cluster.
(2) Load a sample text file into HDFS.
(3) Use spark-shell to perform a line count on the loaded text file.

First, let's look at the command that connects spark-shell to a running Mesos cluster.

TIP: All you need to do is use the "--master" option to point to the Mesos cluster at "mesos://<IP-or-hostname of Mesos master>:5050".
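
In case it helps, here is a minimal sketch of that command as a small script. The Mesos master is on hdp1 in my setup; the variable names are just ones I made up for illustration. The script only launches spark-shell if it is actually on the PATH, otherwise it prints the command it would have run.

```shell
# The Mesos master runs on hdp1 and listens on port 5050 in this setup.
# MESOS_MASTER_HOST and MESOS_MASTER_PORT are illustrative variable names.
MESOS_MASTER_HOST="${MESOS_MASTER_HOST:-hdp1}"
MESOS_MASTER_PORT="${MESOS_MASTER_PORT:-5050}"
MASTER_URL="mesos://${MESOS_MASTER_HOST}:${MESOS_MASTER_PORT}"

echo "Mesos master URL: ${MASTER_URL}"

# Launch the shell only when spark-shell is available on this machine.
if command -v spark-shell >/dev/null 2>&1; then
  spark-shell --master "${MASTER_URL}"
else
  echo "spark-shell not found; would run: spark-shell --master ${MASTER_URL}"
fi
```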

Then, let's load a sample text file into HDFS using the "hadoop fs -put" command.
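
As a sketch of this step: the file name "sample.txt", its contents and the HDFS directory "/tmp" below are placeholders I picked for illustration, not the ones from my actual test. The script creates a 4-line file, counts its lines locally so there is a reference value to compare against later, and uploads it only if the hadoop command is available.

```shell
# Placeholder names: pick your own file and HDFS target directory.
LOCAL_FILE="${LOCAL_FILE:-sample.txt}"
HDFS_DIR="${HDFS_DIR:-/tmp}"

# Create a small 4-line sample file (placeholder text).
printf '%s\n' "line one" "line two" "line three" "line four" > "${LOCAL_FILE}"

# Count the lines locally; Spark should report the same number later.
LOCAL_COUNT=$(wc -l < "${LOCAL_FILE}" | tr -d ' ')
echo "local line count: ${LOCAL_COUNT}"

# Upload the file to HDFS only when the hadoop client is installed.
if command -v hadoop >/dev/null 2>&1; then
  hadoop fs -put "${LOCAL_FILE}" "${HDFS_DIR}/" && hadoop fs -ls "${HDFS_DIR}/${LOCAL_FILE}"
else
  echo "hadoop not found; would run: hadoop fs -put ${LOCAL_FILE} ${HDFS_DIR}/"
fi
```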

Once the sample text file is loaded, let's create a Spark RDD from it.

TIP: You can use the "sc.textFile" function and point it to "hdfs://<namenode>:8020/<path to file>".

After the RDD is created, let's run a count on it using the "count()" function.
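
The two steps above (creating the RDD and counting it) can also be run non-interactively. Here is a rough sketch that writes the Scala lines to a small script file and hands it to spark-shell with the "-i" option. The namenode host (hdp1), the HDFS path and the script name "linecount.scala" are assumptions for illustration, so replace them with your own values.

```shell
# Assumptions: the namenode is hdp1 and the file was uploaded to /tmp/sample.txt.
NAMENODE_HOST="${NAMENODE_HOST:-hdp1}"
HDFS_PATH="${HDFS_PATH:-/tmp/sample.txt}"
HDFS_URI="hdfs://${NAMENODE_HOST}:8020${HDFS_PATH}"
MASTER_URL="${MASTER_URL:-mesos://hdp1:5050}"
SCRIPT="linecount.scala"   # illustrative file name

# Write the Scala statements that spark-shell should execute.
cat > "${SCRIPT}" <<EOF
// Build an RDD with one element per line of the file, then count the lines.
val lines = sc.textFile("${HDFS_URI}")
println("line count: " + lines.count())
// Leave the shell once the count has been printed.
System.exit(0)
EOF

# Run the script through spark-shell only when it is available.
if command -v spark-shell >/dev/null 2>&1; then
  spark-shell --master "${MASTER_URL}" -i "${SCRIPT}"
else
  echo "spark-shell not found; generated ${SCRIPT}:"
  cat "${SCRIPT}"
fi
```

Typing the same two Scala lines directly at the spark-shell prompt works just as well, which is what I did in my test.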

You can see from the screenshots above and below that Spark (through Mesos) submitted 2 tasks, which were executed on nodes hdp2 and hdp3 respectively.

You can also see from the screenshot above that the result returned is "4", meaning the file has 4 lines, which is correct!

So, that is how easy it is to run Spark on Mesos.

I hope you will try it out!

