Thursday, 12 June 2014

Debugging With Pig: Using Illustrate


Illustrate is an interesting operator that can generate test data automatically. It samples the data in your input, scans the Pig script for conditions, and works out which of those conditions your input file has no data for. Pig then creates data for those conditions on the fly and feeds it to your script. This functionality was introduced in Pig 0.4 and enhanced in 0.9. It is useful when you are running with a small amount of data, as in a unit test or in local mode, and are not sure whether your sample data covers every path. It is also useful when you are not sure that your input data contains all the scenarios that you have coded in the Pig script.


 O'Reilly's Programming Pig describes the illustrate operator as follows:

Illustrate takes a sample of your data and runs it through your script, but as it encounters operators that remove data (such as filter, join, etc.), it makes sure that some records pass through the operator and some do not. When necessary, it will manufacture records that look like yours (i.e., that have the same schema) but are not in the sample it took.
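
The behaviour described in the quote can be modelled in a few lines of Python. This is only a toy sketch of the idea, not Pig's actual implementation: the function name illustrate_filter, the dict-based records, the "0" filler value, and the rule for manufacturing a record (copy a sampled record and overwrite the filtered field) are all assumptions made for illustration.

```python
def illustrate_filter(sample, field, wanted, filler="0"):
    """Toy model of illustrate for one equality filter: field == wanted.

    Returns (examples, passed): the sample plus any manufactured records,
    and the subset of those examples that satisfies the filter.
    Hypothetical helper; assumes filler != wanted.
    """
    if not sample:
        raise ValueError("need at least one sampled record to copy the schema from")

    examples = [dict(record) for record in sample]
    has_pass = any(r[field] == wanted for r in examples)
    has_fail = any(r[field] != wanted for r in examples)

    # Manufactured records reuse the first sampled record as a template,
    # so they keep the same schema as the real data.
    template = examples[0]
    if not has_pass:
        examples.append({**template, field: wanted})
    if not has_fail:
        examples.append({**template, field: filler})

    passed = [r for r in examples if r[field] == wanted]
    return examples, passed


# A sample in which no record satisfies cont == 'asia'.
sample = [{"sex": "MALE", "country": "INDIA", "cont": "AMERICA"}]
examples, passed = illustrate_filter(sample, "cont", "asia")
print(examples)
print(passed)
```

With a sample in which every record already passes, the same function would instead add one record that fails the filter, so that both outcomes are always represented.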

To validate this, I wrote a sample Pig script that filters for one type of record. My script looks like this:

inputfile = load 'second_hdfs' using PigStorage('|') as (sex:chararray,country:chararray,cont:chararray);
describe inputfile;
relC = filter inputfile by cont == 'asia';
--store relC into 'second_out3' using PigStorage('|');
illustrate relC;

CASE 1: The input file does not contain any record that satisfies the filter for relC, so Pig manufactures a record that passes the filter.

My input file looks like this. No record in the file satisfies the filter in relC:

MALE|INDIA|AMERICA
MALE|INDIA|AFRICA
MALE|INDIA|ANTARTICA


None of the records in the input file passes the filter. But because of the illustrate operator, Pig has manufactured a record on the fly that satisfies the condition. My Pig job output looks like this. The record MALE | INDIA | asia is the one cooked up by Pig on the fly. You do not need to modify your input data.


2013-06-05 07:29:07,461 [main] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
-------------------------------------------------------------------------
| inputfile | sex: bytearray | country: bytearray | cont: bytearray |
-------------------------------------------------------------------------
| | MALE | INDIA | asia |
| | MALE | INDIA | ANTARTICA |
-------------------------------------------------------------------------
-------------------------------------------------------------------------
| inputfile | sex: chararray | country: chararray | cont: chararray |
-------------------------------------------------------------------------
| | MALE | INDIA | asia |
| | MALE | INDIA | ANTARTICA |
-------------------------------------------------------------------------
--------------------------------------------------------------------
| relC | sex: chararray | country: chararray | cont: chararray |
--------------------------------------------------------------------
| | MALE | INDIA | asia |


CASE 2: Every record in the input file passes the filter in the script, so the file contains no negative test case:

MALE|india|asia
MALE|INDIA|asia
FEMALE|india|asia

Every record in the file has cont equal to "asia", so each one passes the filter and goes to the output. But because of the illustrate operator, Pig has manufactured a record on the fly that fails this condition. My Pig job output looks like this. The record FEMALE | india | 0 is the one cooked up by Pig on the fly.

2013-06-03 11:58:08,649 [main] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
-------------------------------------------------------------------------
| inputfile | sex: bytearray | country: bytearray | cont: bytearray |
-------------------------------------------------------------------------
| | FEMALE | india | 0 |
| | FEMALE | india | asia |
-------------------------------------------------------------------------
-------------------------------------------------------------------------
| inputfile | sex: chararray | country: chararray | cont: chararray |
-------------------------------------------------------------------------
| | FEMALE | india | 0 |
| | FEMALE | india | asia |
-------------------------------------------------------------------------
--------------------------------------------------------------------
| relC | sex: chararray | country: chararray | cont: chararray |
--------------------------------------------------------------------
| | FEMALE | india | asia |
--------------------------------------------------------------------



CASE 3: Multiple conditions in multiple statements inside the script.


I changed the script above to chain two filters:

inputfile = load 'second_hdfs' using PigStorage('|') as (sex:chararray,country:chararray,cont:chararray);
describe inputfile;
relC = filter inputfile by cont == 'africa';
store relC into 'second_out3' using PigStorage('|');
--explain relC;
relD = filter relC by country == 'PAKISTAN';
store relD into 'second_out3a' using PigStorage('|');
illustrate relD;

My input file looks like the one below, and it is clear from the script above that none of the records passes the filters:

MALE|INDIA|ASIA
MALE|INDIA|ASIA
FEMALE|INDIA|ASIA
MALE|INDIA|ASIA
MALE|SRILANKA|ASIA
MALE|BRAZIL|AMERICA

Thanks to the illustrate operator, we can see that Pig has manufactured records that exercise each filter: MALE | PAKISTAN | africa passes both filters, and MALE | 0 | africa passes the first filter but fails the second. We can conclude that our script works as intended even though that data is not present in our input.

| inputfile | sex: bytearray | country: bytearray | cont: bytearray |
-------------------------------------------------------------------------
| | MALE | PAKISTAN | africa |
| | MALE | 0 | africa |
| | MALE | INDIA | ASIA |
-------------------------------------------------------------------------
-------------------------------------------------------------------------
| inputfile | sex: chararray | country: chararray | cont: chararray |
-------------------------------------------------------------------------
| | MALE | PAKISTAN | africa |
| | MALE | 0 | africa |
| | MALE | INDIA | ASIA |
-------------------------------------------------------------------------
--------------------------------------------------------------------
| relC | sex: chararray | country: chararray | cont: chararray |
--------------------------------------------------------------------
| | MALE | PAKISTAN | africa |
| | MALE | 0 | africa |
--------------------------------------------------------------------
--------------------------------------------------------------------
| relD | sex: chararray | country: chararray | cont: chararray |
--------------------------------------------------------------------
| | MALE | PAKISTAN | africa |
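
The chained-filter case can be modelled with a small Python sketch as well. As a loud caveat, this is a toy model and not how Pig works internally: the function name illustrate_chain, the dict-based records, the "0" filler value, and the assumption that each filter is an equality test on a different field are all hypothetical choices made for illustration.

```python
def illustrate_chain(sample, filters, filler="0"):
    """Toy model of illustrate for a chain of equality filters.

    filters is an ordered list of (field, wanted) pairs, each on a
    different field. Returns the sample plus manufactured records so that
    every filter in the chain sees one passing and one failing record.
    Hypothetical helper; assumes filler differs from every wanted value.
    """
    if not sample:
        raise ValueError("need at least one sampled record to copy the schema from")

    def stages_passed(record):
        # Count how many filters the record gets through before being dropped.
        count = 0
        for field, wanted in filters:
            if record[field] != wanted:
                break
            count += 1
        return count

    examples = [dict(record) for record in sample]
    template = examples[0]

    # For each k we want a record that passes exactly the first k filters.
    for k in range(len(filters) + 1):
        if any(stages_passed(r) == k for r in examples):
            continue
        record = dict(template)
        for field, wanted in filters[:k]:
            record[field] = wanted          # satisfy the first k filters
        if k < len(filters):
            record[filters[k][0]] = filler  # fail the next filter
        examples.append(record)
    return examples


sample = [{"sex": "MALE", "country": "INDIA", "cont": "ASIA"}]
filters = [("cont", "africa"), ("country", "PAKISTAN")]
for record in illustrate_chain(sample, filters):
    print(record)
```

Under these assumptions the sketch keeps the sampled ASIA record, which fails the first filter, and adds one record that fails only the second filter and one that passes both.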
