Tuesday, March 10, 2015

Apache Storm: How to add external JARs or packages into CLASSPATH while running 'storm jar'?

I have been playing with 'storm-kafka' and 'storm-hbase' lately. Basically, they are projects/tools that one can use to integrate Kafka and HBase with Storm.

My project was to have Kafka as the spout and HBase as the bolt. In other words, my application pulls data from Kafka and then writes the output to HBase. In other other words (pun intended :), I need the storm-kafka and storm-hbase JARs on the CLASSPATH when I run my Storm topology.

There are a few ways to do this:
(1) Put the JARs under STORM_BASE_DIR
(2) Put the JARs under STORM_BASE_DIR/lib
(3) Put the unpacked package files under STORM_CONF_DIR
(4) Include the package in the topology JAR

After trying all of the above, my favourite is Method #4. However, it is not without its own pain points.

Let me explain why I do not like the other methods.

Method #1
=======
By putting JARs into STORM_BASE_DIR, I have a feeling that I have 'corrupted' the directory. Messing up a standard directory of a product is not my cup of tea.

Method #2
=======
See Method #1 above.

Method #3
=======
The storm.py code does not search the directory declared as STORM_CONF_DIR (or USER_CONF_DIR) for JARs, so a JAR dropped there is not picked up. You would have to put the unpacked package files in that directory instead. How many times have I said 'messy'? :)
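
To make the difference between the first three methods concrete, here is a minimal sketch of the behaviour described above. It is not the actual storm.py source: the function names `jars_in` and `build_classpath` are hypothetical, and the ordering of the entries is an assumption made for illustration.

```python
import os


def jars_in(directory):
    """Return the .jar files sitting directly inside `directory`, sorted."""
    if not os.path.isdir(directory):
        return []
    return sorted(
        os.path.join(directory, name)
        for name in os.listdir(directory)
        if name.endswith(".jar")
    )


def build_classpath(base_dir, conf_dir, topology_jar):
    """Assemble a CLASSPATH string the way the post describes it.

    Hypothetical helper, not the real storm.py code.
    """
    entries = jars_in(base_dir)                         # Method #1: JARs in the base dir
    entries += jars_in(os.path.join(base_dir, "lib"))   # Method #2: JARs in base dir/lib
    entries.append(topology_jar)                        # Method #4: the topology JAR itself
    # Method #3: the conf dir is added as a plain directory entry.
    # It is never scanned for JARs, so only unpacked files in it are visible.
    entries.append(conf_dir)
    return os.pathsep.join(entries)
```

With this layout, a JAR copied into the conf directory never appears in the resulting string, which is why Method #3 only works with unpacked package files.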

Now, let's discuss Method #4. I said it is my favourite, but I never said it is the best. That is because your JAR file will grow considerably if you have some really big external JARs to include (besides your topology). However, I feel that it is the most acceptable approach because it is more manageable than the other methods (at least to me :).

Hence, if you are looking to add external JARs when running your Storm topology, I would suggest including those JARs in your topology JAR for the time being, until there is a neater way to do this!
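
One thing to watch with Method #4: the standard Java class loader does not look inside JARs that are nested in another JAR, so the contents of the external JARs have to be merged into the topology JAR rather than copied in as .jar files. Below is a minimal sketch of such a merge using Python's zipfile module. The function name `merge_jars` and the "first JAR wins" rule for duplicate entries are my own assumptions for illustration, not something Storm provides, and a real build tool handles many more corner cases.

```python
import zipfile

# Signature files from signed dependency JARs would no longer match the
# merged archive, so they are left out.
SIGNATURE_SUFFIXES = (".SF", ".DSA", ".RSA")


def merge_jars(topology_jar, dependency_jars, output_jar):
    """Copy the entries of the topology JAR and its dependencies into one JAR.

    Hypothetical helper. The first JAR to provide an entry wins, so the
    topology's own META-INF/MANIFEST.MF is the one that is kept.
    Returns the sorted list of entry names written.
    """
    seen = set()
    with zipfile.ZipFile(output_jar, "w", zipfile.ZIP_DEFLATED) as out:
        for jar_path in [topology_jar] + list(dependency_jars):
            with zipfile.ZipFile(jar_path) as jar:
                for info in jar.infolist():
                    name = info.filename
                    if name in seen:
                        continue  # keep the earlier copy of a duplicate entry
                    if name.startswith("META-INF/") and name.upper().endswith(SIGNATURE_SUFFIXES):
                        continue  # drop stale signature files
                    seen.add(name)
                    out.writestr(info, jar.read(name))
    return sorted(seen)
```

Listing the merged JAR afterwards with "jar tvf" should show the dependency classes at the top level of the archive (for example under org/...), not as nested .jar files.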

NOTE:
Environment = HDP 2.2 (Storm 0.9.3)



Source code of storm.py and how CLASSPATH is determined

7 comments:

  1. Can you shed more light on Method #4? I packaged my topology and the external jars into one jar, but the problem remains: it cannot locate the external jars.

  2. Hi, sorry for the late reply. I have been busy with work and family lately.
    I have a few questions for you:
    (1) How did you package the external jars together with your topology? Can you share the "jar tvf " output?
    (2) When you say it cannot locate the external JARs, I would assume you got errors when executing "storm jar", right?

    Replies
    1. Hi, thanks for the reply.
      1. I exported the project as a jar, keeping all the required jars in the project root and adding them to the build path.
      2. Yes. The error appeared only when I tried to execute it with "storm jar".

    2. Also, can you answer this: http://stackoverflow.com/questions/30800240/storm-supervisor-exits-when-processing-event ? I have this error in a production cluster.

  3. Hi. I am very sorry for the late reply.

    I was wrong in saying that I packaged everything into one jar. Could you suggest a method for packaging into one jar? I also do not know how to use Maven. I have written all my code, and starting a new project would be a painful task for me.

    The sole reason I need external jars right now is that I am using the log4j jars for logging and cannot use slf4j because of some dependencies. I want to use log4j for my topology only, without disturbing Storm's logback logging.

    Replies
    1. Hi Mani,
      I am overwhelmed with work and am learning some more new technologies (graph databases, etc.). I feel ashamed to have neglected this blog (something I plan to rectify in the near future).

      I hope you have managed to resolve your problem.

      If not, hopefully the following will provide some clues to you:
      (1) the most straightforward way to package your files into one jar is the "jar cvf" command.
      https://docs.oracle.com/javase/8/docs/technotes/tools/unix/jar.html

      It is not necessary to use Maven if you have all the files (including dependencies) ready in a directory.

      (2) For log4j, as long as you have the log4j JAR included in your final JAR (for Storm), you should not have any problem referring to it in your code.

      Hope that helps!

  4. If you are using HDP with Storm 0.9.3, then in the Ambari UI, under custom storm-site, you can add a custom variable called topology.classpath and put the path to the third-party jars there.
    Take a look at https://issues.apache.org/jira/browse/STORM-54
