Thursday 19 June 2014

Bayes' Theorem and Bayesian Statistics: A Simple Understanding


This is one of the important theorems, and one needs a clear understanding of it. It took me a while to understand it in a way I could relate to a more realistic example than a coin toss or a dice roll. Below is my understanding of this theorem and the statistical paradigms around it. Please feel free to correct me; I will appreciate your comments and views.

Bayes' Theorem: This theorem is the foundation of Bayesian statistics, so first we need to understand what Bayesian statistics is.

Consider a scenario involving a random process. The process produces multiple possible outcomes and depends upon some unknown features/parameters. Usually the process is described in terms of a probability distribution over its outcomes, which in turn depends upon those unknown parameters.

To visualize this, suppose we are organizing a football match between Brazil and Germany. Playing the match is the random process. It has multiple possible outcomes: the match may be drawn, either team may win or lose, and any total number of goals may be scored.

One way of describing the upcoming match is by the probability distribution of the total number of goals scored by both teams together. This distribution gives the probability of 0 goals, 1 goal, 2 goals, 3 goals, and so on for the match in question.

One way to calculate these probabilities is from the historical data of previous matches played between Germany and Brazil, without considering the parameters/features/factors that impact the total number of goals scored, such as the number of penalty kicks, corners, or free kicks awarded in the match. This paradigm of statistics is known as the Frequentist approach, since the parameters that impact the total number of goals are ignored. So if we want to know the probability that more than 5 goals will be scored in this match, we look at all the matches played between Brazil and Germany and count the goals scored in each. This approach does not consider the parameters that impact the number of goals, and it makes some assumptions:
  • The underlying parameters/features are constant; i.e., in our case the assumption is that the same number of penalty kicks, corners, or free kicks will be awarded in this match as in the earlier matches.
  • The number of parameters is fixed, so no additional parameters can be considered.
  • Playing a football match is a repeatable random process.

In Bayesian statistics, by contrast, we calculate the probability of a given outcome using given data (the data contains values for the parameters/features). So in the Bayesian approach the data is fixed and the parameters are described probabilistically. The main advantage of the Bayesian approach is the ability to use prior knowledge, which is also the main reason it is criticized: the understanding and assumptions behind the prior knowledge may differ from person to person, so different people may get different results.

In our example of the football match between Brazil and Germany, if we follow the Bayesian approach, the probability of scoring more than 5 goals (called the posterior) is calculated by considering the different parameters. For instance, we might calculate the probability assuming there will be 2 penalty kicks and 3 corners, and we may want to repeat the calculation for different combinations of penalty kicks and corners. Most importantly, prior knowledge about this process (the previous matches between Germany and Brazil) is used in the calculation.

So: Posterior probability (more than 5 goals) = (Prior probability * Likelihood of the parameters) / Marginal

So we need to know the prior probability. Prior probabilities are intrinsically subjective: your prior information is different from mine. Because knowledge and understanding of the prior can differ, different people can get different results from Bayes' Theorem.

Bayes' Theorem:

We need to briefly understand the following terms, keeping our football match example in mind:

Hypothesis = the number of goals scored will be greater than 5. Denoted by "H".

Parameters (the observed data) = the number of penalty kicks and corners in the match. Denoted by "D".

Posterior Probability = the probability of the hypothesis being true given the data. In our football match example, it is the probability that the number of goals scored during the match is greater than 5, given the number of penalty kicks and corners. This value can be calculated for each possible combination of penalty kicks and corners. Denoted by P(H|D).

Prior Probability = the probability that more than 5 goals are scored, without being given the number of penalty kicks and corners. This may be drawn from a large sample of past matches or derived from expert judgment. Denoted by P(H).

Likelihood = the probability of the observed parameter values given that our hypothesis is true. In our example there are two parameters: the number of penalty kicks and the number of corners. So the likelihood is the probability of these parameter values in matches where the number of goals scored was greater than 5 (our hypothesis). To calculate likelihood values we need training data: statistics about the previous matches between Brazil and Germany, with the number of penalty kicks, corners, and goals scored in each. Denoted by P(D|H).

Marginal/Evidence = the probability of observing these particular parameter values (the number of penalty kicks and corners) under all possible hypotheses. In this case there are two hypotheses: (a) our hypothesis, i.e. the total number of goals scored will be greater than 5, and (b) the complementary hypothesis that the number of goals scored will be less than or equal to 5. Denoted by P(D).

Now we want to predict the probability that more than 5 goals will be scored in the upcoming match, given that 2 penalty kicks and 3 corners are awarded.

Prior probability: this is fixed by our prior knowledge.

Likelihood: we need the likelihood of the number of penalty kicks being 2 and the number of corners being 3 in matches where more than 5 goals were scored.

Marginal: this is the probability of 2 penalty kicks and 3 corners occurring both when more than 5 goals are scored and when 5 or fewer goals are scored.

The theorem can be written as below; it shows that the posterior is directly proportional to the prior and the likelihood of the parameters.

           P(H|D) = (P(H)*P(D|H))/P(D)

Important assumption: all the features/parameters/factors are assumed to be independent, i.e. not affected by the presence or absence of the other features/parameters/factors. (This is the "naive" independence assumption behind Naive Bayes.)

Bayes' rule is a scoring algorithm: it produces probabilistic estimates. But it can be converted into a classification algorithm by selecting the most probable hypothesis, i.e. the hypothesis for which the posterior value is maximum. This is called the Maximum A Posteriori (MAP) decision rule.
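To make the whole calculation concrete, here is a minimal Python sketch. The prior, the likelihood values, and the parameter combination (2 penalty kicks, 3 corners) are made-up numbers for illustration only, not real match statistics:

    # Hypothesis H: more than 5 goals are scored. All numbers below are invented.
    prior_H = 0.20                     # P(H)
    prior_not_H = 1 - prior_H          # P(not H) = 0.80

    # Likelihood of observing D = (2 penalty kicks, 3 corners) under each
    # hypothesis, as it might be estimated from training data on past matches.
    likelihood_D_given_H = 0.30        # P(D|H)
    likelihood_D_given_not_H = 0.10    # P(D|not H)

    # Marginal/evidence P(D): probability of D under all hypotheses.
    marginal_D = prior_H * likelihood_D_given_H + prior_not_H * likelihood_D_given_not_H

    # Bayes' theorem: P(H|D) = P(H) * P(D|H) / P(D).
    posterior_H = prior_H * likelihood_D_given_H / marginal_D
    posterior_not_H = prior_not_H * likelihood_D_given_not_H / marginal_D
    print(posterior_H, posterior_not_H)   # ~0.43 and ~0.57

    # MAP decision rule: pick the hypothesis with the maximum posterior.
    print("more than 5 goals" if posterior_H > posterior_not_H else "5 or fewer goals")

With these invented numbers the MAP rule would pick "5 or fewer goals", since its posterior (~0.57) is the larger of the two.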

Thursday 12 June 2014

A Brief Description of Modelling Techniques in a Data Science Project


There are various statistical and machine learning modelling techniques that can be applied to solving data science problems. A few of these techniques are:

  • Classification: a type of supervised learning; dividing items into predefined categories. Various algorithms are available for classification, such as Naive Bayes, Decision Trees, Logistic Regression (if a decision boundary/threshold is defined), and Support Vector Machines. For example: predicting whether or not it will rain tomorrow. (See the sketch after this list.)
  • Regression/Scoring: a type of supervised learning; the task of predicting a numeric value or score. For example: predicting tomorrow's rainfall in inches. Algorithms used for regression analysis include linear regression and logistic regression (to predict probabilities).
  • Clustering: a type of unsupervised learning; the task of grouping items into the most similar groups. Algorithms available for clustering include K-Means and hierarchical clustering.
  • Recommendations: producing a list of recommendations in either of two ways: a) based on the user's past behaviour and on similar behaviour from other users, or b) based on a comparison between the content of items and a user profile. For example: the Nearest Neighbour algorithm.
  • Association Rules: finding relationships among variables, i.e. correlations or reasons behind the effects observed in the data. An algorithm available for association rule mining is Apriori.
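As a rough illustration of the supervised/unsupervised split above, here is a minimal Python sketch using scikit-learn; the tiny arrays are invented toy data, not from any real problem:

    import numpy as np
    from sklearn.naive_bayes import GaussianNB   # classification (supervised)
    from sklearn.cluster import KMeans           # clustering (unsupervised)

    # Toy data: two numeric features per item (invented for illustration).
    X = np.array([[1.0, 2.0], [1.2, 1.9], [8.0, 8.1], [7.9, 8.3]])
    y = np.array([0, 0, 1, 1])                   # known labels -> supervised

    # Classification: learn from labelled examples, then predict a new item.
    clf = GaussianNB().fit(X, y)
    print(clf.predict([[1.1, 2.1]]))             # -> [0]

    # Clustering: no labels given; the algorithm groups similar items itself.
    km = KMeans(n_clusters=2, n_init=10).fit(X)
    print(km.labels_)                            # two groups discovered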

Debugging With Pig: Using Illustrate


ILLUSTRATE is an interesting operator that is capable of generating test data automatically. It looks at the data present in your input, scans the Pig script for conditions, and checks which conditions have no matching data in your input file. Pig then creates data for those conditions on the fly and feeds it to your script. This functionality was introduced in Pig 0.4 and enhanced in 0.9. It is useful when you are running with a small amount of data, as in a unit test or local mode, and are not sure whether your sample data covers all the paths, or whether your input data contains all the scenarios you have coded for in the Pig script.


As per O'Reilly's Programming Pig, the following is the description of the ILLUSTRATE operator:

Illustrate takes a sample of your data and runs it through your script, but as it encounters operators that remove data (such as filter, join, etc.) it makes sure some records pass through the operator and some do not. When necessary, it will manufacture records that look like yours (i.e., that have the same schema) but are not in the sample it took.

To validate this I coded a sample Pig script that filters for one type of record.
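The original script screenshot is not available in this post; below is a minimal sketch of the kind of script described, assuming a comma-separated input of (id, region) and the "asia" filter mentioned in Case 2. The file name and schema are assumptions, not the original code:

    relA = LOAD 'input.txt' USING PigStorage(',') AS (id:int, region:chararray);
    relC = FILTER relA BY region == 'asia';   -- keep only one type of record
    ILLUSTRATE relC;                          -- ask Pig to manufacture covering data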


CASE 1: The input file does not contain any record that satisfies the filter for relC, so Pig manufactures a record that passes the filter.

My input file contains no record that satisfies the filter in relC.



None of the records in the input file passes the filter, but because of the ILLUSTRATE operator, Pig manufactured data on the fly that passes this condition. In my Pig job output the manufactured record was shown in blue; you do not need to modify your input data.




CASE 2: All records in the input file pass the validation in the script; there is no negative test case.

The file contains only "asia" records, which all pass the filter and go through to the output. But because of the ILLUSTRATE operator, Pig manufactured data on the fly that fails this condition; in my Pig job output the manufactured record was again shown in blue.




CASE 3: Multiple conditions in multiple statements inside the script.


I added two filters to the above script, sketched below.
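Again, the original screenshot is not available; here is a sketch of the two-filter version under the same assumptions (the second column and threshold are hypothetical):

    relA = LOAD 'input.txt' USING PigStorage(',') AS (id:int, region:chararray);
    relB = FILTER relA BY region == 'asia';   -- first condition
    relC = FILTER relB BY id > 100;           -- second condition (hypothetical)
    ILLUSTRATE relC;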


My input file is such that none of its records passes through both filters in the script.

Because of the ILLUSTRATE operator, we can see that Pig manufactured records that pass through each filter, so we can conclude that our script works well even though such data is not present in our input.

Discrete and Continuous Random Variables



Difference Between Discrete and Continuous Random Variables

To understand the difference between discrete and continuous random variables, we first need to understand what a random variable is and how it differs from an ordinary algebraic variable.

By Definition:

Random Variable: a random variable measures the outcome of a random process; it simply quantifies the outcome of that process. This quantification is required to perform further mathematical analysis. Random variables are usually represented by capital letters.

For example:
X is a random variable that measures the outcome of the toss of a fair coin. We can assign values to X as:
X = { 1 if heads, 0 otherwise }

Using this random variable we can calculate mathematical quantities, such as the probability of getting three heads in 3 tosses of a fair coin, since tossing a fair coin is a purely random process.

If Y counts the number of heads in the 3 tosses, this probability can be written as:
P(Y = 3)
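As a quick check, here is a minimal Python sketch of this calculation (purely illustrative):

    from itertools import product

    # Enumerate all 2**3 equally likely outcomes of three fair-coin tosses.
    outcomes = list(product([0, 1], repeat=3))          # 1 = heads, 0 = tails
    favourable = [o for o in outcomes if sum(o) == 3]   # Y = 3 heads
    print(len(favourable) / len(outcomes))              # 0.125, i.e. (1/2)**3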

Non-Random Variables: these are variables that measure the outcome of a non-random process. We can solve for these variables, calculate relationships among them, and draw inferences about them. They are usually denoted by small letters.

For example:

From x + 5 = 10 we can easily deduce that x has the value 5. If y = 2x + 3, this represents the equation of a line; we can plot it and conclude that y is a linear function of x.

Random variables can be further divided into continuous random variables and discrete random variables. This classification is based on the measure or feature that the variable represents.

Discrete Random Variable: "discrete" means individually separate or distinct, while a variable is something that can take different values at different times. So a discrete random variable takes values from a well-defined countable set (which may be infinite), and those values are produced as the outcome of a random process. Say we define the set of all non-negative integers S = {0, 1, 2, 3, 4, ...} and a discrete variable X that denotes the even numbers from S. Being discrete, X can take values like 2, 4, 6, 8, 10, and so on, but it cannot take any value between 2 and 4, since there is no even integer in between. For another example, suppose we have a data set of names and dates of birth. Both variables take values from the set of all human names and the set of all legitimate dates of birth; there is nothing "in between" two names. So these variables are discrete in nature.

In other words, a discrete random variable represents an aspect of a random process that can be counted, although the count can go up to infinity.

Continuous Random Variables: these are variables that can take all possible values within an interval, generated as the outcome of a random process. For example, a variable Y representing a real number is a continuous random variable if it can take any value in the interval (2, 3): Y can be 2.01, 2.001, 2.999999, and so on. Another example would be the weight of all American males. So continuous random variables represent aspects of a random process whose outcomes are not countable: it is not possible to count all the values that a continuous random variable can take.

To classify a random variable as continuous or discrete, we need to understand how the variable is measured; before making this decision, understand the context and methodology of the measurement. Suppose we are looking at a data set with the weights of all males in India, and each weight was rounded off to the nearest kilogram: in this case the weight variable becomes discrete. However, if the weight is taken on a high-precision scale that can measure down to milligrams, it becomes a continuous variable.

Random variables are used for further mathematical analysis, and the widely used technique is the probability distribution. Both continuous and discrete random variables can be described by different kinds of probability distributions: discrete random variables are usually expressed by a probability mass function (PMF), and continuous random variables by a probability density function (PDF). We will talk about these later. A valid PMF must have probabilities that sum to exactly 1, and a valid PDF must have an area under the curve (the integral of the density from the start of its range to the end) equal to exactly 1, i.e. 100%.
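A minimal Python sketch of both normalization checks; the die PMF is exact, and the density integral is a numerical approximation:

    import numpy as np

    # PMF of a fair six-sided die: the probabilities must sum to exactly 1.
    pmf = [1/6] * 6
    print(sum(pmf))                    # 1.0 (up to floating point)

    # PDF of a standard normal: the area under the curve must equal 1 (100%).
    x = np.linspace(-10, 10, 100001)
    pdf = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)
    print(np.trapz(pdf, x))            # ~1.0 (numerical integration)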

Good Luck !!!

Dynamic Partitioning With Hive


Automatic (Dynamic) Partitioning With Hive


Partitioning in Hive is usually considered a tedious task during data loading: you have to manually segregate the data and tell Hive which partition you want to load it into. This requires a lot of manual intervention and effort.

Furthermore, partitioning is usually done on columns that do not exist in the data itself. What if we want to partition on a column that does exist in our data? I used to be clueless in this scenario.

Another problem is querying one large HDFS file that contains historical data. If we can somehow divide the large file into small pieces, we get a faster turnaround time when querying.

Consider a situation where we have one file with two fields, ORDERID and ORDERDATE, and say we receive a large number of orders each day. We want to view this file in Hive, but creating one large table and querying it is a heavy operation. A better way is to view the file as a Hive table partitioned by date, with all the partitions filled automatically depending on the values in the input file.

This problem can be solved by a simple four-step process:

1) Set a couple of properties in Hive.

2) Create an external staging table "staging_order" and load the input file's data into this table.

3) Create the main production external table "production_order" with the date as the partition column.

4) Load the production table from the staging table so that the data is distributed into partitions automatically, as in the sketch below.
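Putting the steps together, here is a minimal HiveQL sketch. The table names follow the post; the column types, delimiter, and HDFS paths are assumptions for illustration:

    -- Step 1: enable dynamic partitioning.
    SET hive.exec.dynamic.partition = true;
    SET hive.exec.dynamic.partition.mode = nonstrict;

    -- Step 2: external staging table over the raw input file.
    CREATE EXTERNAL TABLE staging_order (orderid STRING, orderdate STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION '/data/staging_order';

    -- Step 3: production table partitioned by the date column.
    CREATE EXTERNAL TABLE production_order (orderid STRING)
    PARTITIONED BY (orderdate STRING)
    LOCATION '/data/production_order';

    -- Step 4: load; Hive creates one partition per distinct ORDERDATE value.
    INSERT OVERWRITE TABLE production_order PARTITION (orderdate)
    SELECT orderid, orderdate FROM staging_order;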



Executing Code With Mahout Without Using Maven


Recently I started learning Mahout using Mahout in Action. The first topic I chose was unsupervised learning, and the very first example I found was about clustering. I was able to code the K-Means clustering example and compile it successfully. I then followed the standard procedure of creating a JAR and running it on the Hadoop cluster to submit a MapReduce clustering job. But it did not work, and I kept getting a NoClassDefFoundError.

After some googling I realized that we could install Maven, build the project with Maven, and then execute it.


But I was looking for a way to execute Mahout code on Hadoop without using Maven. The class I had coded was the K-Means example, SimpleKMeansClustering.




Then I compiled the Java file to create a class file, along the lines of the sketch below:
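The original command screenshot is not available; the compile step would have been something like this (the JAR name is taken from later in the post, the rest is an assumption):

    # Compile with the Mahout job JAR on the classpath (paths are illustrative).
    javac -classpath mahout-core-0.5-cdh3u3-job.jar SimpleKMeansClustering.java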


The above command created my class file: SimpleKMeansClustering.class

Now, as I did not want to use Maven to build my project, I looked at Sean Owen's comment:

Use the "job" JAR file provided by Mahout. It packages up all the dependencies. You need to add your classes to it too.

So I went to my Mahout installation directory.


There I found the job JAR, mahout-core-0.5-cdh3u3-job.jar, and copied it to the directory where my Mahout code resides.

Then I added my class file to the main job JAR, mahout-core-0.5-cdh3u3-job.jar, using the command below, and simply executed the JAR, invoking my class SimpleKMeansClustering. My code executed properly in MapReduce fashion and generated its output.
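The exact commands are not preserved in the post; here is a sketch of the two steps described, using the standard jar and hadoop command-line tools:

    # Add the compiled class to the Mahout job JAR, which bundles all dependencies.
    jar uf mahout-core-0.5-cdh3u3-job.jar SimpleKMeansClustering.class

    # Run the class on the Hadoop cluster via the updated job JAR.
    hadoop jar mahout-core-0.5-cdh3u3-job.jar SimpleKMeansClustering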