My project used Kafka as the spout and HBase as the bolt. In other words, my application pulls data from Kafka and then writes the output to HBase. In other other words (pun intended :), I need to have the storm-kafka and storm-hbase JARs on the classpath when I run my Storm topology.
There are a few ways to do this:
(1) Put the JARs under STORM_BASE_DIR
(2) Put the JARs under STORM_BASE_DIR/lib
(3) Put the unpacked package (the class files, not the JAR) under STORM_CONF_DIR
(4) Include the packages in the topology JAR
After trying all of the above, my favourite is method #4. However, it is not without its own pain points.
Let me explain why I do not like the other methods.
Method #1
=======
By putting JARs into STORM_BASE_DIR, I have a feeling that I have 'corrupted' the directory. Messing up a standard directory of a product is not my cup of tea.
Method #2
=======
See Method #1 above.
Method #3
=======
Since the storm.py code does not search the directory declared as STORM_CONF_DIR (or USER_CONF_DIR) for JARs, dropping the JARs there does nothing. You would have to put the unpacked package files (the class directories) in that directory instead. How many times have I said 'messy'? :)
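To make the difference between methods #1 to #3 concrete, here is a simplified sketch of how I understand the classpath gets assembled. It is not the actual storm.py source: the function names (jars_in, build_classpath) and parameters are my own, and the real script may differ in details. The point it illustrates is that only the base directory and its lib/ subdirectory are scanned for *.jar files, while the conf directory is added as a plain directory entry.

```python
import os


def jars_in(directory):
    """Return full paths of the *.jar files directly inside `directory`.

    The scan is not recursive, mirroring the behaviour described above.
    """
    return sorted(
        os.path.join(directory, name)
        for name in os.listdir(directory)
        if name.endswith(".jar")
    )


def build_classpath(storm_base_dir, topology_jar, conf_dir):
    """Hypothetical reconstruction of the classpath used by `storm jar`.

    JARs are picked up from the base directory and from lib/.
    The conf directory is appended as-is, so a JAR placed inside it is
    NOT on the classpath; only loose class files under it would be found.
    """
    entries = jars_in(storm_base_dir)
    entries += jars_in(os.path.join(storm_base_dir, "lib"))
    entries += [topology_jar, conf_dir]
    return os.pathsep.join(entries)
```

With this picture in mind, a JAR copied into the conf directory never shows up as its own classpath entry, which is why method #3 only works with unpacked class files.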
Now, let's discuss Method #4. I said it is my favourite, but I never said it is the best. That is because it will grow your JAR file size greatly if you have some really big external JARs to include (besides your topology). However, I feel that it is the most acceptable approach because it is more manageable than the other methods (at least to me :).
Hence, if you are looking to include external JARs when running your Storm topology, I would suggest that you bundle those JARs into your topology JAR for the time being, until there is a neater way to do this!
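If you go with method #4, it is worth checking that the external classes really ended up inside the topology JAR before submitting it. Below is a minimal sketch using Python's standard zipfile module. The helper name missing_packages is my own, and the package prefixes in the usage note are assumptions about where storm-kafka and storm-hbase keep their classes, so adjust them to whatever your JARs actually contain.

```python
import zipfile


def missing_packages(jar_path, package_prefixes):
    """Return the prefixes that have no .class entry inside the JAR.

    A JAR is a ZIP archive, so entries look like 'some/package/Foo.class'.
    An empty result means every expected package is bundled.
    """
    with zipfile.ZipFile(jar_path) as jar:
        class_entries = [n for n in jar.namelist() if n.endswith(".class")]
    return [
        prefix
        for prefix in package_prefixes
        if not any(entry.startswith(prefix) for entry in class_entries)
    ]
```

For example, missing_packages("my-topology.jar", ["storm/kafka/", "org/apache/storm/hbase/"]) should return an empty list when both dependencies are bundled (the JAR name and both prefixes here are placeholders, not verified paths).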
NOTE:
Environment = HDP 2.2 (Storm 0.9.3)
Source code of storm.py and how CLASSPATH is determined