Thursday, 12 June 2014

Executing Code With Mahout Without Using Maven


Recently I started learning Mahout using Mahout in Action. The first topic I chose was unsupervised learning, and the very first example I came across was about clustering. I coded the K-Means clustering example and compiled it successfully. Then I followed the standard procedure of creating a jar and running it on the Hadoop cluster to submit a Map-Reduce clustering job. But it did not work: I kept getting a NoClassDefFoundError.

After some googling I realized that one option is to install Maven, build the project with it, and then execute the job.
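For completeness, that Maven route boils down to declaring Mahout as a dependency with a version matching your cluster's Mahout build and letting Maven pull in the transitive Hadoop/Mahout jars. A sketch of the relevant fragment (the version shown matches my CDH3 install; adjust for yours):

```xml
<dependency>
  <groupId>org.apache.mahout</groupId>
  <artifactId>mahout-core</artifactId>
  <version>0.5</version>
</dependency>
```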


But I was looking for a way to execute Mahout code on Hadoop without using Maven. Following is the class that I coded:


import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.clustering.WeightedVectorWritable;
import org.apache.mahout.clustering.kmeans.Cluster;
import org.apache.mahout.clustering.kmeans.KMeansDriver;
import org.apache.mahout.common.distance.EuclideanDistanceMeasure;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

public class SimpleKMeansClustering {

    // Nine 2-D points forming two obvious groups: one around (1,1)-(3,3)
    // and one around (8,8)-(9,9).
    public static final double[][] points = { {1, 1}, {2, 1}, {1, 2},
                                              {2, 2}, {3, 3}, {8, 8},
                                              {9, 8}, {8, 9}, {9, 9} };

    // Writes the input vectors to a SequenceFile so the K-Means job can read them.
    public static void writePointsToFile(List<Vector> points, String fileName,
                                         FileSystem fs, Configuration conf) throws IOException {
        Path path = new Path(fileName);
        SequenceFile.Writer writer = new SequenceFile.Writer(fs, conf, path,
                LongWritable.class, VectorWritable.class);
        long recNum = 0;
        VectorWritable vec = new VectorWritable();
        for (Vector point : points) {
            vec.set(point);
            writer.append(new LongWritable(recNum++), vec);
        }
        writer.close();
    }

    // Converts the raw double[][] into Mahout vectors.
    public static List<Vector> getPoints(double[][] raw) {
        List<Vector> points = new ArrayList<Vector>();
        for (int i = 0; i < raw.length; i++) {
            double[] fr = raw[i];
            Vector vec = new RandomAccessSparseVector(fr.length);
            vec.assign(fr);
            points.add(vec);
        }
        return points;
    }

    public static void main(String[] args) throws Exception {
        int k = 2;
        List<Vector> vectors = getPoints(points);

        File testData = new File("testdata");
        if (!testData.exists()) {
            testData.mkdir();
        }
        testData = new File("testdata/points");
        if (!testData.exists()) {
            testData.mkdir();
        }

        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        writePointsToFile(vectors, "testdata/points/file1", fs, conf);

        // Seed the initial clusters with the first k points.
        Path path = new Path("testdata/clusters/part-00000");
        SequenceFile.Writer writer = new SequenceFile.Writer(fs, conf, path,
                Text.class, Cluster.class);
        for (int i = 0; i < k; i++) {
            Vector vec = vectors.get(i);
            Cluster cluster = new Cluster(vec, i, new EuclideanDistanceMeasure());
            writer.append(new Text(cluster.getIdentifier()), cluster);
        }
        writer.close();

        // Run K-Means: convergence delta 0.001, at most 10 iterations,
        // run the final clustering step, do not run sequentially.
        KMeansDriver.run(conf, new Path("testdata/points"), new Path("testdata/clusters"),
                new Path("output"), new EuclideanDistanceMeasure(), 0.001, 10, true, false);

        // Read back the clustered points and print each point's cluster assignment.
        SequenceFile.Reader reader = new SequenceFile.Reader(fs,
                new Path("output/" + Cluster.CLUSTERED_POINTS_DIR + "/part-m-00000"), conf);
        IntWritable key = new IntWritable();
        WeightedVectorWritable value = new WeightedVectorWritable();
        while (reader.next(key, value)) {
            System.out.println(value.toString() + " belongs to cluster " + key.toString());
        }
        reader.close();
    }
}


Then I compiled the Java file as:

javac -cp "/usr/lib/hadoop/lib/*:/usr/lib/hadoop/*:/usr/lib/mahout/*:/usr/lib/mahout/lib/*" SimpleKMeansClustering.java

The above command produced my class file: SimpleKMeansClustering.class

Now, since I did not want to use Maven to build my project, I looked at Sean Owen's comment:

Use the "job" JAR file provided by Mahout. It packages up all the dependencies. You need to add your classes to it too.

So I went to my Mahout installation directory:


> cd /usr/lib/mahout
> ls /usr/lib/mahout/
bin examples mahout-core-0.5-cdh3u3.jar mahout-examples-0.5-cdh3u3.jar mahout-math-0.5-cdh3u3.jar mahout-utils-0.5-cdh3u3.jar
conf lib mahout-core-0.5-cdh3u3-job.jar mahout-examples-0.5-cdh3u3-job.jar mahout-taste-webapp-0.5-cdh3u3.war
I copied the job jar from the listing above, mahout-core-0.5-cdh3u3-job.jar, to the directory where my Mahout code resides.

Then I added my class file to the job jar mahout-core-0.5-cdh3u3-job.jar using the following command, and simply executed the jar by invoking my class SimpleKMeansClustering. The code ran properly in Map-Reduce fashion and generated this output:


> jar uf mahout-core-0.5-cdh3u3-job.jar SimpleKMeansClustering.class
> hadoop jar mahout-core-0.5-cdh3u3-job.jar SimpleKMeansClustering
1.0: [1.000, 1.000] belongs to cluster 0
1.0: [2.000, 1.000] belongs to cluster 0
1.0: [1.000, 2.000] belongs to cluster 0
1.0: [2.000, 2.000] belongs to cluster 0
1.0: [3.000, 3.000] belongs to cluster 0
1.0: [8.000, 8.000] belongs to cluster 1
1.0: [9.000, 8.000] belongs to cluster 1
1.0: [8.000, 9.000] belongs to cluster 1
1.0: [9.000, 9.000] belongs to cluster 1
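The grouping in that output can be sanity-checked without Hadoop or Mahout at all: assign each of the nine points to the nearer of the two converged centroids using plain Euclidean distance. The sketch below is not Mahout code, just the assignment step k-means performs on every iteration; the centroids are the means of the two visible groups, (1.8, 1.8) and (8.5, 8.5).

```java
// Standalone sanity check of the clustering result: nearest-centroid
// assignment with Euclidean distance over the same nine points.
public class NearestCentroidCheck {

    // Means of the two groups the data naturally splits into.
    static final double[][] CENTROIDS = { {1.8, 1.8}, {8.5, 8.5} };

    // Euclidean distance between two 2-D points.
    static double dist(double[] a, double[] b) {
        double dx = a[0] - b[0], dy = a[1] - b[1];
        return Math.sqrt(dx * dx + dy * dy);
    }

    // Index of the centroid nearest to p.
    static int nearest(double[] p) {
        return dist(p, CENTROIDS[0]) <= dist(p, CENTROIDS[1]) ? 0 : 1;
    }

    public static void main(String[] args) {
        double[][] points = { {1, 1}, {2, 1}, {1, 2}, {2, 2}, {3, 3},
                              {8, 8}, {9, 8}, {8, 9}, {9, 9} };
        for (double[] p : points) {
            System.out.println("[" + p[0] + ", " + p[1] + "] belongs to cluster " + nearest(p));
        }
    }
}
```

The first five points land in cluster 0 and the last four in cluster 1, matching the Map-Reduce job's output.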





