Thursday 12 June 2014

Debugging With Pig: Using Illustrate


Illustrate is an interesting operator that can generate test data automatically. It samples the data present in your input, scans the Pig script for conditions, and works out which of those conditions your input file has no data for. Pig then creates data for those conditions on the fly and feeds it to your script. This functionality was introduced in Pig 0.4 and enhanced in 0.9. It is useful when you are running with a small amount of data, as in a unit test or local mode, and are not sure whether your sample data covers every path. It is also useful when you are not sure that your input data contains all the scenarios you have coded in your Pig script.


O'Reilly's Programming Pig describes the illustrate operator as follows:

Illustrate takes a sample of your data and runs it through your script, but as it encounters operators that remove data (such as filter, join, etc.) it makes sure some records pass through the operator and some do not. When necessary, it will manufacture records that look like yours (i.e., that have the same schema) but are not in the sample it took.
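To make that description concrete, here is a small toy model in Python. It is only a sketch of the idea, not Pig's actual algorithm: it handles a single equality filter, and the function name `illustrate_filter` and the field names `name` and `region` are my own assumptions for illustration.

```python
def illustrate_filter(sample, field, wanted):
    """Toy model of illustrate for one equality filter `field == wanted`.

    Returns (passing, failing, synthesized). `synthesized` holds the records
    invented because the sample alone left one side of the filter empty.
    This is a hypothetical helper, not part of Pig.
    """
    passing = [r for r in sample if r[field] == wanted]
    failing = [r for r in sample if r[field] != wanted]
    synthesized = []
    if sample and not passing:
        # Copy a real record so the schema matches, then force a match.
        fake = dict(sample[0], **{field: wanted})
        passing.append(fake)
        synthesized.append(fake)
    if sample and not failing:
        # Copy a real record and change the field so the filter rejects it.
        fake = dict(sample[0], **{field: "not_" + wanted})
        failing.append(fake)
        synthesized.append(fake)
    return passing, failing, synthesized


# Hypothetical sample in which no record satisfies region == "asia".
sample = [{"name": "a", "region": "europe"}, {"name": "b", "region": "africa"}]
passing, failing, synthesized = illustrate_filter(sample, "region", "asia")
print(synthesized)
```

The point of the sketch is that both sides of the filter end up with at least one record, and anything invented keeps the same fields as the real data.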

To validate this, I wrote a sample Pig script that filters one type of record. My script looks like:


CASE 1: The input file does not contain any record that satisfies the filter for relC, so Pig manufactures a record that passes the filter.

My input file looks like this. No record in the file satisfies the filter in relC.



None of the records in the input file passes the filter. But because of the illustrate operator, Pig has manufactured, on the fly, a record that satisfies this condition. My Pig job output looks like this. The record in blue was cooked up by Pig on the fly. You do not need to modify your input data.




CASE 2: The input file contains only records that pass the validation in the script. It does not contain any negative test case:

It is clear that every record in the file is "asia", so each should reach the output because the filter passes it. But because of the illustrate operator, Pig has manufactured, on the fly, a record that fails this condition. My Pig job output looks like this. The record in blue was cooked up by Pig on the fly.




CASE 3: Multiple conditions in multiple statements inside the script.


I added two filters to the script above:


My input file looks like the one below, and it is clear from the script above that none of the records passes through it:

Thanks to the illustrate operator, we can see that Pig has manufactured a record that passes through the filters, and we can conclude that our script works well even though that data is not present in our input file.
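The same idea extends to several filters in a row. The Python sketch below is again a toy model under my own assumptions, not Pig's implementation: `illustrate_chain`, the `year` field, and the two example conditions are hypothetical. It only shows why a record has to be manufactured at each stage that would otherwise leave nothing for the next one.

```python
def illustrate_chain(sample, conditions):
    """Toy model: push records through successive equality filters.

    `conditions` is a list of (field, wanted) pairs, one per filter statement.
    When a stage would leave nothing, one record is synthesized so that later
    stages still see data. Returns (survivors, synthesized).
    This is a hypothetical helper, not part of Pig.
    """
    current = list(sample)
    synthesized = []
    for field, wanted in conditions:
        survivors = [r for r in current if r[field] == wanted]
        if current and not survivors:
            # Copy an incoming record (same schema) and force it to pass.
            fake = dict(current[0], **{field: wanted})
            survivors.append(fake)
            synthesized.append(fake)
        current = survivors
    return current, synthesized


# Hypothetical input where no record passes either filter.
sample = [{"name": "a", "region": "europe", "year": 2013}]
survivors, synthesized = illustrate_chain(
    sample, [("region", "asia"), ("year", 2014)]
)
print(survivors)
```

In this model the record that reaches the end satisfies every condition, even though nothing in the input did.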
