
Thursday, 12 June 2014

Dynamic Partitioning With Hive


Automatic (Dynamic) Partitioning With Hive


Partitioning in Hive is usually considered a tedious task when loading data. You have to segregate the data manually and tell Hive which partition you want to load the data into. This requires a lot of manual intervention and effort.

Furthermore, partitioning is usually done on columns that do not exist in the data itself. But what if we want to partition on a column that does exist in our data? I used to be clueless in this scenario.

Another problem is querying one large HDFS file that contains historical data. If we can somehow divide the large file into smaller pieces, we can get a faster turnaround time when querying.

Consider a situation where we have one file with two fields: ORDERID and ORDERDATE. Let's say we receive a large number of orders each day. We want to view that file in Hive, but creating one large table and querying it is a heavy operation. A better way is to expose the file as a Hive table partitioned by date, with all the partitions filled automatically based on the values in the input file.

This problem can be solved by a four-step process (a HiveQL sketch follows the list):

1) Set a couple of properties in Hive to enable dynamic partitioning.

2) Create an external staging table "staging_order" and load the input file's data into this table.

3) Create a main production external table "production_order" with the date as the partition column.

4) Load the production table from the staging table so that the data is distributed into partitions automatically.
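
Here is a minimal HiveQL sketch of all four steps. The table layout follows the ORDERID/ORDERDATE example above; the HDFS locations and the comma delimiter are assumptions for illustration.

-- 1) Enable dynamic partitioning; nonstrict mode allows every partition to be dynamic.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- 2) Staging table pointing at the raw input file(s); location and delimiter are assumed.
CREATE EXTERNAL TABLE staging_order (
  orderid   STRING,
  orderdate STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/hive/staging/orders';

-- 3) Production table: ORDERDATE moves out of the column list into PARTITIONED BY.
CREATE EXTERNAL TABLE production_order (
  orderid STRING
)
PARTITIONED BY (orderdate STRING)
LOCATION '/user/hive/production/orders';

-- 4) Load from staging; Hive creates one partition per distinct ORDERDATE.
--    The dynamic partition column must come last in the SELECT list.
INSERT OVERWRITE TABLE production_order PARTITION (orderdate)
SELECT orderid, orderdate FROM staging_order;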



Executing Code With Mahout Without Using Maven


Recently I started learning Mahout using Mahout In Action. The first topic I chose was unsupervised learning, and the very first example I found was about clustering. I coded the K-Means clustering example and compiled it successfully. Then I followed the standard procedure of creating a JAR and running it on the Hadoop cluster to submit a Map-Reduce clustering job. But it did not work, and I kept getting a NoClassDefFoundError.

After some googling I realized that we could install Maven, build the project with it, and then execute it.


But I was looking for a way to execute the Mahout code on Hadoop without using Maven. Following is the class that I coded:
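
This is essentially the SimpleKMeansClustering listing popularized by Mahout In Action, written here against the Mahout 0.5 API; the sample points, the cluster count k, and the testdata/output paths are illustrative.

import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.clustering.WeightedVectorWritable;
import org.apache.mahout.clustering.kmeans.Cluster;
import org.apache.mahout.clustering.kmeans.KMeansDriver;
import org.apache.mahout.common.distance.EuclideanDistanceMeasure;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

public class SimpleKMeansClustering {

  // Nine 2-D sample points forming two obvious clusters,
  // around (1.5, 1.5) and (8.5, 8.5).
  private static final double[][] POINTS = {
      {1, 1}, {2, 1}, {1, 2}, {2, 2}, {3, 3},
      {8, 8}, {9, 8}, {8, 9}, {9, 9}};

  private static List<Vector> getPoints(double[][] raw) {
    List<Vector> points = new ArrayList<Vector>();
    for (double[] fr : raw) {
      Vector vec = new RandomAccessSparseVector(fr.length);
      vec.assign(fr);
      points.add(vec);
    }
    return points;
  }

  // Write the input vectors as a SequenceFile that the K-Means job can read.
  private static void writePointsToFile(List<Vector> points, String fileName,
      FileSystem fs, Configuration conf) throws IOException {
    SequenceFile.Writer writer = new SequenceFile.Writer(fs, conf,
        new Path(fileName), LongWritable.class, VectorWritable.class);
    long recNum = 0;
    VectorWritable vec = new VectorWritable();
    for (Vector point : points) {
      vec.set(point);
      writer.append(new LongWritable(recNum++), vec);
    }
    writer.close();
  }

  public static void main(String[] args) throws Exception {
    int k = 2;
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    List<Vector> vectors = getPoints(POINTS);
    new File("testdata/points").mkdirs();
    writePointsToFile(vectors, "testdata/points/file1", fs, conf);

    // Seed the initial cluster centers with the first k points.
    Path clusterPath = new Path("testdata/clusters/part-00000");
    SequenceFile.Writer writer = new SequenceFile.Writer(fs, conf,
        clusterPath, Text.class, Cluster.class);
    for (int i = 0; i < k; i++) {
      Vector vec = vectors.get(i);
      Cluster cluster = new Cluster(vec, i, new EuclideanDistanceMeasure());
      writer.append(new Text(cluster.getIdentifier()), cluster);
    }
    writer.close();

    // Run the K-Means Map-Reduce job (convergence delta 0.001, max 10 iterations).
    KMeansDriver.run(conf, new Path("testdata/points"),
        new Path("testdata/clusters"), new Path("output"),
        new EuclideanDistanceMeasure(), 0.001, 10, true, false);

    // Read back the clustered points and print each point's assigned cluster.
    SequenceFile.Reader reader = new SequenceFile.Reader(fs,
        new Path("output/clusteredPoints/part-m-00000"), conf);
    IntWritable key = new IntWritable();
    WeightedVectorWritable value = new WeightedVectorWritable();
    while (reader.next(key, value)) {
      System.out.println(value.toString() + " belongs to cluster " + key.toString());
    }
    reader.close();
  }
}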




Then I compiled the Java file to produce a class file:
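
The compile command was along these lines; the Hadoop core jar path is an assumption and will vary by installation:

javac -classpath .:mahout-core-0.5-cdh3u3-job.jar:/usr/lib/hadoop/hadoop-core-0.20.2-cdh3u3.jar SimpleKMeansClustering.java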


The above command created my class file: SimpleKMeansClustering.class

Now, as I didn't want to use Maven to build my project, I looked at Sean Owen's comment:

Use the "job" JAR file provided by Mahout. It packages up all the dependencies. You need to add your classes to it too.

So I went to my Mahout installation directory and located the job JAR.


I copied that file, mahout-core-0.5-cdh3u3-job.jar, to the directory where my Mahout code resides.

Then I added my class file to the main JAR, mahout-core-0.5-cdh3u3-job.jar, using the commands below, and simply executed that JAR, invoking my class SimpleKMeansClustering. My code ran properly in Map-Reduce fashion and generated the clustering output.
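
A sketch of those two commands, assuming the class file and the job JAR sit in the same directory:

# Add the compiled class to the Mahout job JAR.
jar uf mahout-core-0.5-cdh3u3-job.jar SimpleKMeansClustering.class

# Run the job on the Hadoop cluster, naming our class as the entry point.
hadoop jar mahout-core-0.5-cdh3u3-job.jar SimpleKMeansClustering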