Running Apache Spark on Mesos is easier than I thought!
I have 3 nodes at home (hdp1, hdp2 and hdp3) which are running Hortonworks Data Platform (HDP). I have HDFS, Zookeeper and Spark installed on the cluster.
To test running Spark on Mesos, I decided to reuse the cluster for simplicity's sake.
All I did was:
(1) install Mesos master on node hdp1.
(2) install Mesos slave on nodes hdp2 and hdp3.
(3) configure the master and slaves accordingly.
NOTE:
(i) For more information about the installation and configuration of Mesos, you can refer to this blog entry.
(ii) I reused the Zookeeper that was installed with HDP.
Once Mesos was up and running, I carried out a very simple test, described as follows:
(1) Use "spark-shell" as the Spark client to connect to the Mesos cluster.
(2) Load a sample text file into HDFS.
(3) Use spark-shell to perform a line count on the loaded text file.
First, let's see which command connects spark-shell to a running Mesos cluster.
TIP: All you need to do is use the "--master" option to point to the Mesos cluster at "mesos://<IP-or-hostname of Mesos master>:5050".
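For reference, here is a minimal sketch of that command for this cluster. It assumes the Mesos master on hdp1 listens on port 5050 as in the tip above; the variable names and the PATH check are mine, not part of the original setup. Depending on your installation, Spark may also need to be told where the Mesos native library and the Spark executor package are, which is not shown here.

```shell
# Assumption: the Mesos master runs on hdp1 and listens on port 5050.
MESOS_MASTER_HOST="hdp1"
MASTER_URL="mesos://${MESOS_MASTER_HOST}:5050"

# Start a Spark shell that uses the Mesos cluster as its cluster manager.
if command -v spark-shell >/dev/null 2>&1; then
  spark-shell --master "${MASTER_URL}"
else
  # Keeps the sketch harmless on a machine without Spark installed.
  echo "spark-shell not on PATH; would run: spark-shell --master ${MASTER_URL}"
fi
```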
Then, let's load a sample text file into HDFS using the "hadoop fs -put" command.
Once the sample text file is loaded, let's create a Spark RDD from it.
TIP: You can use the "sc.textFile" function and point it to "hdfs://<namenode>:8020/<path to file>".
After the RDD is created, let's run a count on it using the "count()" function.
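Put together, the load and count steps could look roughly like the sketch below. The file name sample.txt, the HDFS path /tmp/sample.txt and the NameNode host hdp1 are assumptions made for illustration; substitute your own. I typed the Scala statements interactively, whereas the sketch feeds the same two statements to spark-shell on standard input.

```shell
# Hypothetical names, not from the original post: adjust to your cluster.
LOCAL_FILE="sample.txt"          # local text file to upload
HDFS_PATH="/tmp/sample.txt"      # target path in HDFS
NAMENODE="hdp1"                  # assumed NameNode host
HDFS_URL="hdfs://${NAMENODE}:8020${HDFS_PATH}"
MASTER_URL="mesos://hdp1:5050"   # Mesos master on hdp1, port 5050

if command -v hadoop >/dev/null 2>&1 && command -v spark-shell >/dev/null 2>&1; then
  # Copy the local file into HDFS.
  hadoop fs -put "${LOCAL_FILE}" "${HDFS_PATH}"

  # Build the RDD from the HDFS file, then count its lines.
  spark-shell --master "${MASTER_URL}" <<EOF
val lines = sc.textFile("${HDFS_URL}")
println("line count: " + lines.count())
EOF
else
  # Keeps the sketch harmless on a machine without Hadoop or Spark.
  echo "hadoop or spark-shell not on PATH; nothing was run"
fi
```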
You can see from the screenshots above and below that Spark (through Mesos) submitted 2 tasks, which were executed on nodes hdp2 and hdp3 respectively.
You can also see from the screenshot above that the end result is returned as "4" (which means 4 lines in the file - which is correct!).
So, that is how easy it is to run Spark on Mesos.
Hope that you are going to try it out!
Saturday, September 12, 2015
Saturday, September 5, 2015
Neo4j: How to import data using neo4j-import?
One of the first things to do when learning a new database is to load some sample data, and Neo4j is no different.
With Neo4j, my preferred way of loading data is the neo4j-import command.
While the available documentation is decent, new users might face some challenges getting it to work (at least I was struggling at one point :).
That is why I am writing this blog entry: to help those who need a quick-start guide for loading data into Neo4j.
Let's do it in a cookbook manner...
(1) Read the following documentation:
http://neo4j.com/docs/stable/import-tool.html
It's ok if you do not understand fully. Please allow me to share with you an example.
(2) Let's create the CSV needed to represent NODE and RELATIONSHIP.
The example that I have here is a simple banking system that contains the following NODEs:
(a) CUSTOMER
(b) ACCOUNT
(c) ATM
and the following RELATIONSHIPs:
(a) CUSTOMER->ACCOUNT [OWNS]
(b) ACCOUNT->ACCOUNT [TXFER]
(c) CUSTOMER->ATM [WITHDRAW_FROM]
(3) First, let's create the CSV files for the NODEs.
According to the documentation (see excerpt below), the CSV header for a NODE must contain an ID and a LABEL.
[http://neo4j.com/docs/stable/import-tool-header-format.html]
Nodes
The following field types do additionally apply to node data sources:
- ID
- Each node must have a unique id which is used during the import. The ids are used to find the correct nodes when creating relationships. Note that the id has to be unique across all nodes in the import, even nodes with different labels.
- LABEL
- Read one or more labels from this field. For multiple labels, the values are separated by the array delimiter.
How do we do that? Simple...
[danielyeap@myhost neo4j-data]$ cat customer.csv
custID:ID(CUSTOMER),firstname,lastname,since:int,:LABEL
1,Daniel,Yeap,1999,CUSTOMER
2,John,Smith,1999,CUSTOMER
3,Tod,Pitt,2010,CUSTOMER
4,Isabel,Grager,2014,CUSTOMER
[danielyeap@myhost neo4j-data]$ cat atm.csv
atmId:ID(ATM),location,:LABEL
1,Damansara Uptown,ATM
2,Kota Damansara,ATM
3,Jalan Ipoh,ATM
[danielyeap@myhost neo4j-data]$ cat account.csv
custId:int,acctno:ID(ACCOUNT),:LABEL
1,11111,ACCOUNT
1,11112,ACCOUNT
2,22221,ACCOUNT
2,22222,ACCOUNT
3,33331,ACCOUNT
4,44441,ACCOUNT
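A side note on the "(CUSTOMER)", "(ATM)" and "(ACCOUNT)" suffixes in the ID columns: as far as I understand the import tool, they put each file's IDs into a separate ID space, which matters here because customers and ATMs reuse the same numbers. The sketch below only demonstrates that overlap with standard shell tools; the temporary directory and the helper name "ids" are my own choices.

```shell
set -eu
workdir="$(mktemp -d)"

# Same content as customer.csv and atm.csv above.
cat > "$workdir/customer.csv" <<'EOF'
custID:ID(CUSTOMER),firstname,lastname,since:int,:LABEL
1,Daniel,Yeap,1999,CUSTOMER
2,John,Smith,1999,CUSTOMER
3,Tod,Pitt,2010,CUSTOMER
4,Isabel,Grager,2014,CUSTOMER
EOF
cat > "$workdir/atm.csv" <<'EOF'
atmId:ID(ATM),location,:LABEL
1,Damansara Uptown,ATM
2,Kota Damansara,ATM
3,Jalan Ipoh,ATM
EOF

# Print the first column (the ID column in both files), skipping the header row.
ids() { tail -n +2 "$1" | cut -d, -f1 | sort; }

ids "$workdir/customer.csv" > "$workdir/customer.ids"
ids "$workdir/atm.csv" > "$workdir/atm.ids"

# comm -12 keeps only the lines present in both sorted inputs.
overlap="$(comm -12 "$workdir/customer.ids" "$workdir/atm.ids" | paste -sd' ' -)"
echo "IDs used by both CUSTOMER and ATM: ${overlap}"
```

Without separate ID spaces, customer 1 and ATM 1 would clash with the "unique across all nodes" rule quoted above.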
(4) Now that the NODEs are taken care of, let's craft our RELATIONSHIP CSV files.
Relationships
For relationship data sources, there’s three mandatory fields:
- TYPE
- The relationship type to use for the relationship.
- START_ID
- The id of the start node of the relationship to create.
- END_ID
- The id of the end node of the relationship to create.
[danielyeap@myhost neo4j-data]$ cat rel-cust-acct.csv
:START_ID(CUSTOMER),:END_ID(ACCOUNT),:TYPE
1,11111,OWNS
1,11112,OWNS
2,22221,OWNS
2,22222,OWNS
3,33331,OWNS
4,44441,OWNS
[danielyeap@myhost neo4j-data]$ cat rel-txfer.csv
TXID,:START_ID(ACCOUNT),:END_ID(ACCOUNT),amount:int,:TYPE
1,11111,11112,100,TXFER
2,11111,22222,200,TXFER
3,22221,33331,300,TXFER
4,44441,11112,400,TXFER
5,33331,22221,500,TXFER
6,11111,22222,600,TXFER
7,11111,33331,700,TXFER
8,11111,33331,800,TXFER
[danielyeap@myhost neo4j-data]$ cat rel-withdraw.csv
TXID,:START_ID(CUSTOMER),:END_ID(ATM),amount:int,:TYPE
1,1,1,100,WITHDRAW_FROM
2,1,2,200,WITHDRAW_FROM
3,2,3,300,WITHDRAW_FROM
4,4,1,400,WITHDRAW_FROM
5,3,2,500,WITHDRAW_FROM
6,1,2,600,WITHDRAW_FROM
7,1,3,700,WITHDRAW_FROM
8,1,3,800,WITHDRAW_FROM
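Before running the import, it can be worth checking that every START_ID and END_ID in a relationship file refers to a node ID that actually exists. Below is a minimal sketch of such a check for rel-cust-acct.csv using awk; the function name "missing_ids" and the temporary directory are hypothetical, and the same idea applies to the other two relationship files.

```shell
set -eu
workdir="$(mktemp -d)"

# Same content as customer.csv, account.csv and rel-cust-acct.csv above.
cat > "$workdir/customer.csv" <<'EOF'
custID:ID(CUSTOMER),firstname,lastname,since:int,:LABEL
1,Daniel,Yeap,1999,CUSTOMER
2,John,Smith,1999,CUSTOMER
3,Tod,Pitt,2010,CUSTOMER
4,Isabel,Grager,2014,CUSTOMER
EOF
cat > "$workdir/account.csv" <<'EOF'
custId:int,acctno:ID(ACCOUNT),:LABEL
1,11111,ACCOUNT
1,11112,ACCOUNT
2,22221,ACCOUNT
2,22222,ACCOUNT
3,33331,ACCOUNT
4,44441,ACCOUNT
EOF
cat > "$workdir/rel-cust-acct.csv" <<'EOF'
:START_ID(CUSTOMER),:END_ID(ACCOUNT),:TYPE
1,11111,OWNS
1,11112,OWNS
2,22221,OWNS
2,22222,OWNS
3,33331,OWNS
4,44441,OWNS
EOF

# missing_ids NODE_FILE NODE_ID_COLUMN REL_FILE REL_ID_COLUMN
# Prints every value in the relationship column that has no matching node ID.
missing_ids() {
  awk -F, -v ncol="$2" -v rcol="$4" '
    FNR == 1  { next }                    # skip the header row of each file
    NR == FNR { seen[$ncol] = 1; next }   # first file: remember the node IDs
    !($rcol in seen) { print $rcol }      # second file: report unknown IDs
  ' "$1" "$3"
}

missing_start="$(missing_ids "$workdir/customer.csv" 1 "$workdir/rel-cust-acct.csv" 1)"
missing_end="$(missing_ids "$workdir/account.csv" 2 "$workdir/rel-cust-acct.csv" 2)"

echo "dangling START_ID values: ${missing_start:-none}"
echo "dangling END_ID values: ${missing_end:-none}"
```

For the sample data both lines report "none", since every customer and account number in the relationship file also appears in the node files.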
(5) Now that we have all the files that we need to perform the data loading, it is time to execute the command:
[danielyeap@myhost neo4j-data]$ neo4j-import --into ~danielyeap/neo4j/data/graph.db --nodes customer.csv --nodes account.csv --nodes atm.csv --relationships rel-txfer.csv --relationships rel-cust-acct.csv --relationships rel-withdraw.csv
Importing the contents of these files into /home/danielyeap/neo4j/data/graph.db:
Nodes:
/home/danielyeap/neo4j-data/customer.csv
/home/danielyeap/neo4j-data/account.csv
/home/danielyeap/neo4j-data/atm.csv
Relationships:
/home/danielyeap/neo4j-data/rel-txfer.csv
/home/danielyeap/neo4j-data/rel-cust-acct.csv
/home/danielyeap/neo4j-data/rel-withdraw.csv
Available memory:
Free machine memory: 11.08 GB
Max heap memory : 2.73 GB
Nodes
[*>:??------------------------------------------------------------------------------||NODE:7.||] 10k
Done in 481ms
Prepare node index
[*DETECT:7.63 MB-------------------------------------------------------------------------------] 0
Done in 11ms
Calculate dense nodes
[*>:??-----------------------------------------------------------------------------------------] 0
Done in 297ms
Relationships
[*>:??--------------------------------------------------------------------------------------|P|] 10k
Done in 68ms
Node --> Relationship
[*>:??-----------------------------------------------------------------------------------------] 10k
Done in 10ms
Relationship --> Relationship
[*>:??-----------------------------------------------|v:??-------------------------------------] 10k
Done in 10ms
Node counts
[*COUNT:76.29 MB-------------------------------------------------------------------------------] 10k
Done in 218ms
Relationship counts
[*>:??------------------------------------------|COUNT-----------------------------------------] 10k
Done in 10ms
IMPORT DONE in 2s 664ms. Imported:
13 nodes
22 relationships
66 properties
==============================
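The totals reported by the importer can be cross-checked against the CSV files. The sketch below is just arithmetic on the row and column counts from the listings. It assumes that every header column with a name (including a named ID column such as custID:ID(CUSTOMER)) is stored as one property per row, while unnamed columns (:LABEL, :TYPE, :START_ID, :END_ID) are not; that assumption is mine, but it does reproduce the numbers above.

```shell
# Data rows per file, counted from the CSV listings (header rows excluded).
customer_rows=4; account_rows=6; atm_rows=3
own_rows=6; txfer_rows=8; withdraw_rows=8

# Named columns per file (assumed to be stored as properties).
customer_cols=4   # custID, firstname, lastname, since
account_cols=2    # custId, acctno
atm_cols=2        # atmId, location
own_cols=0        # only :START_ID, :END_ID and :TYPE
txfer_cols=2      # TXID, amount
withdraw_cols=2   # TXID, amount

nodes=$((customer_rows + account_rows + atm_rows))
relationships=$((own_rows + txfer_rows + withdraw_rows))
properties=$((customer_rows * customer_cols + account_rows * account_cols
              + atm_rows * atm_cols + own_rows * own_cols
              + txfer_rows * txfer_cols + withdraw_rows * withdraw_cols))

echo "${nodes} nodes, ${relationships} relationships, ${properties} properties"
```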
NOTE:
(i) You can refer to the documentation about the command line HERE.
(ii) The "--into" option must point to the directory where the Neo4j database is located (look in <neo4j_install_dir>/conf/neo4j-server.properties) =>
org.neo4j.server.database.location=/home/danielyeap/neo4j/data/graph.db
(iii) If there is an error about an existing database, you can use the following procedure to remove it:
~ While Neo4j is up and running, go to the directory set in "org.neo4j.server.database.location" and run "rm -rf *".
~ After all the files under that directory are removed, execute the "neo4j-import" command again.
~ After the import succeeds, restart Neo4j (command = neo4j restart).
~ Log in to the Neo4j web interface and you should see your data loaded.