Problem Scenario 2

PLEASE READ THE INTRODUCTION TO THIS SERIES. CLICK ON HOME LINK AND READ THE INTRO BEFORE ATTEMPTING TO SOLVE THE PROBLEMS

A video walk-through of the solution to this problem can be found here [Click here]

Click here for the video version of this series. This takes you to the YouTube playlist of videos.

Problem 2:
  1. Using Sqoop, copy the data available in the MySQL products table to the folder /user/cloudera/products on HDFS as a text file. Columns should be delimited by the pipe character '|'.
  2. Move all the files from the /user/cloudera/products folder to the /user/cloudera/problem2/products folder.
  3. Change the permissions of all the files under /user/cloudera/problem2/products so that the owner has read, write and execute permissions, the group has read and write permissions, and others have just read and execute permissions.
  4. Read the data in /user/cloudera/problem2/products and perform the following operations using a) the DataFrames API, b) Spark SQL and c) the RDD aggregateByKey method. Your solution should therefore have three sets of steps. Sort the resultant dataset by category id.
    • filter so that your RDD/DataFrame contains only products whose price is less than 100 USD
    • on the filtered data set, find the highest value in the product_price column under each category
    • on the filtered data set, find the total number of products under each category
    • on the filtered data set, find the average price of the products under each category
    • on the filtered data set, find the minimum price of the products under each category
  5. Store the results as Avro files using Snappy compression under these folders respectively:
    • /user/cloudera/problem2/products/result-df
    • /user/cloudera/problem2/products/result-sql
    • /user/cloudera/problem2/products/result-rdd
Solution: 
Try your best to solve the above scenario without going through the solution below. If you manage to, use the solution to compare your results. If you cannot, I strongly recommend that you go through the concepts again, this time in more depth. Each step below provides a solution to the corresponding point in the problem scenario.

Step 1
sqoop import \
--connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
--username retail_dba \
--password cloudera \
--table products \
--as-textfile \
--target-dir /user/cloudera/products \
--fields-terminated-by '|';
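
As a quick sanity check (optional, and assuming the default quickstart setup used above), you can peek at the imported data from spark-shell to confirm the pipe delimiter:

scala> sc.textFile("/user/cloudera/products").take(2).foreach(println);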

Step 2
hadoop fs -mkdir /user/cloudera/problem2/
hadoop fs -mkdir /user/cloudera/problem2/products
hadoop fs -mv /user/cloudera/products/* /user/cloudera/problem2/products/

Step 3
//Read is 4, Write is 2 and Execute is 1.
//Owner:  Read, Write, Execute = 4 + 2 + 1 = 7
//Group:  Read, Write          = 4 + 2     = 6
//Others: Read, Execute        = 4 + 1     = 5

hadoop fs -chmod 765 /user/cloudera/problem2/products/*


Step 4: 

scala> var products = sc.textFile("/user/cloudera/problem2/products").map(x=> {var d = x.split('|'); (d(0).toInt,d(1).toInt,d(2).toString,d(3).toString,d(4).toFloat,d(5).toString)});


scala> case class Product(productID:Integer, productCatID: Integer, productName: String, productDesc:String, productPrice:Float, productImage:String);


scala> var productsDF = products.map(x=> Product(x._1,x._2,x._3,x._4,x._5,x._6)).toDF();
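
Before running the aggregations, it can help to confirm that the DataFrame picked up the case class field names. A quick check, assuming the same spark-shell session:

scala> productsDF.printSchema();
scala> productsDF.show(5);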




Step 4-Data Frame Api: 

scala> import org.apache.spark.sql.functions._
scala> var dataFrameResult = productsDF.filter("productPrice < 100").groupBy(col("productCatID")).agg(max(col("productPrice")).alias("max_price"),countDistinct(col("productID")).alias("tot_products"),round(avg(col("productPrice")),2).alias("avg_price"),min(col("productPrice")).alias("min_price")).orderBy(col("productCatID"));
scala> dataFrameResult.show();
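
Since product_id is unique in the products table (it is the table's primary key), a plain count gives the same numbers as countDistinct here. If you prefer the shorter form, the same aggregation can be written as below (same session assumed; the variable name is just for illustration):

scala> var dataFrameResultAlt = productsDF.filter("productPrice < 100").groupBy(col("productCatID")).agg(max(col("productPrice")).alias("max_price"),count(col("productID")).alias("tot_products"),round(avg(col("productPrice")),2).alias("avg_price"),min(col("productPrice")).alias("min_price")).orderBy(col("productCatID"));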



Step 4 - Spark SQL: 

productsDF.registerTempTable("products");
var sqlResult = sqlContext.sql("select productCatID, max(productPrice) as maximum_price, count(distinct(productID)) as total_products, cast(avg(productPrice) as decimal(10,2)) as average_price, min(productPrice) as minimum_price from products where productPrice < 100 group by productCatID order by productCatID");
sqlResult.show();


Step 4 - RDD aggregateByKey: 

var rddResult = productsDF.map(x=>(x(1).toString.toInt,x(4).toString.toDouble)).filter(x=> x._2 < 100).aggregateByKey((0.0,0.0,0,9999999999999.0))((x,y)=>(math.max(x._1,y),x._2+y,x._3+1,math.min(x._4,y)),(x,y)=>(math.max(x._1,y._1),x._2+y._2,x._3+y._3,math.min(x._4,y._4))).map(x=> (x._1,x._2._1,(x._2._2/x._2._3),x._2._3,x._2._4)).sortBy(_._1);
rddResult.collect().foreach(println);
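
rddResult is a plain RDD of tuples, so the toDF() call in step 5 will produce generic column names (_1, _2, and so on). If you want readable names in the Avro output you can pass them to toDF explicitly; a small sketch, with illustrative column names matching the tuple order (category, max, average, count, min):

scala> var rddResultDF = rddResult.toDF("product_category_id","max_price","avg_price","tot_products","min_price");
scala> rddResultDF.show(5);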


Step 5: 



scala> import com.databricks.spark.avro._;
scala> sqlContext.setConf("spark.sql.avro.compression.codec","snappy");
scala> dataFrameResult.write.avro("/user/cloudera/problem2/products/result-df");
scala> sqlResult.write.avro("/user/cloudera/problem2/products/result-sql");
scala> rddResult.toDF().write.avro("/user/cloudera/problem2/products/result-rdd");
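
To confirm the results landed where expected and are readable, you can load one of them back; a quick check, assuming the same session with the spark-avro import already in place:

scala> sqlContext.read.avro("/user/cloudera/problem2/products/result-df").show(5);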











Comments:

  1. Hi Arun,

    First of all thank you for such a confidence boosting blog on CCA175.
    I would like to highlight a small correction in aggregateByKey transformation.
    The default value of the MIN_PRICE should not be 0.0. If it is 0.0, for every MATH.MIN(a,b) the output would be 0.0. As a workaround this could be replaced by a higher value(10000.0) which would ultimately be swapped by the MIN values in the process.

    Sample output:
    (58,241.0,170.0,4,115.0)
    (57,189.99,154.99,6,109.99)
    (56,159.99,159.99,2,159.99)
    (54,299.99,209.99,6,129.99)


    Regards,
    --Lax Dash

    ReplyDelete
    Replies
    1. Sorry about that. I actually did it the right way in the video; check the video tutorial between 31.08 and 31.10, where the correct solution already exists. I think I did not update the blog. Thanks for bringing it to my notice.

      Delete
  2. In the exam is it asked whether to solve any problem with RDD or DataFrames.
    Or what matters is the output?
    Please respond!!

    ReplyDelete
    Replies
    1. There is a very good possibility that you will get questions where you are asked to complete missing lines of code in a Spark program written in Scala or Python. While the exam looks at the final output, filling in the missing lines of code is the fastest option. Otherwise you will have to rewrite the code in your preferred way, which will not always result in the right answer and, most importantly, will be time consuming. I hope I answered your question.

      Delete
    2. Thank you Arun Sir!!
      I am following you video and have noticed that you are also using Scala.
      I also was wondering if the code snippet in the exam is in python or Scala. Or do we get a choice to at the first place which language are we going to write the exam. I have been practicing CCA175 with Scala.

      Delete
    3. According to the exam website you will have to understand both Scala and Python. The majority of the problem solutions consist of using the API, and the function names are the same in Python and Scala. I recommend that you solve all these problems using Python as well, so that you are well prepared to answer any question about completing a portion of code written in either Scala or Python.

      Delete
  3. var d = x.split('|'); -----> var d = x.split('\\|');

    ReplyDelete
    Replies
    1. Adarsh,

      thank you for following my posts. You don't have to escape when supplying a character literal for the pipe character: split(Char) splits on the character itself, whereas split(String) treats its argument as a regular expression, in which an unescaped | is the alternation operator. Here are all the variations; I hope the examples below help you remember what works and what does not for your exam. All the very best.

      -- WITH ESCAPE, PASSING A STRING LITERAL INSIDE DOUBLE QUOTES

      scala> sc.textFile("/user/cloudera/products").map(x=> {var d =x.split("\\|"); (d(0),d(1),d(2))}).take(1).foreach(println);
      Output - (1,2,Quest Q64 10 FT. x 10 FT. Slant Leg Instant U)

      -- WITHOUT ESCAPE, PASSING A STRING LITERAL INSIDE DOUBLE QUOTES
      scala> sc.textFile("/user/cloudera/products").map(x=> {var d =x.split("|"); (d(0),d(1),d(2))}).take(1).foreach(println);
      Output - (,1,|)

      -- WITHOUT ESCAPE, PASSING A CHAR LITERAL IN SINGLE QUOTES

      scala> sc.textFile("/user/cloudera/products").map(x=> {var d =x.split('|'); (d(0),d(1),d(2))}).take(1).foreach(println);
      Output - (1,2,Quest Q64 10 FT. x 10 FT. Slant Leg Instant U)

      Delete
    2. Thanks a lot Arun.. it really helped me ..

      Delete
    4. Thanks Arun for the very thorough problem stmts, am preparing for CCA175 using ur blog. Here I have question
      on your answer to adarsh.

      So the safe way to use for split is this
      x.split("\\|") ??

      I have always used x.split(",") for my comma delimited fields so far , so why not x.split("|") work for us , pls explain.

      Delete
    5. So Arun , can you explain which one to split is best out of three.

      Delete
  5. scala> var productsDF = products.map(x=> Product(x._1,x._2,x._3,x._4,x._5,x._6)).toDF();

    gives error on quickstart 5.10
    "Failed to start database 'metastore_db' with class loader org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@5fa9ef3d, see the next exception for details.
    at org.apache.derby.iapi.error.StandardException.newException(Unknown Source)
    at org.apache.derby.impl.jdbc.SQLExceptionFactory.wrapArgsForTransportAcrossDRDA(Unknown Source)
    ... 135 more
    Caused by: ERROR XSDB6: Another instance of Derby may have already booted the database /home/cloudera/metastore_db.
    at org.apache.derby.iapi.error.StandardException.newException(Unknown Source)
    at org.apache.derby.iapi.error.StandardException.newException(Unknown Source)
    at org.apache.derby.impl.store.raw.data.BaseDataFileFactory.privGetJBMSLockOnDB(Unknown Source)
    at org.apache.derby.impl.store.raw.data.BaseDataFileFactory.run(Unknown Source)
    at java.security.AccessController.doPrivileged(Native Method)
    "
    I you help to resolve it.

    ReplyDelete
    Replies
    1. Hi Team......can anyone help me how to fix the below error in quick start
      "Caused by: ERROR XSDB6: Another instance of Derby may have already booted the database /home/cloudera/metastore_db.
      at org.apache.derby.iapi.error.StandardException.newException(Unknown Source)

      Delete
  6. Hi Arun ! the question asks the user to sort the data by product_category_id (which bydefault i understand to be in ascending order). Whereas in your solution under "SPARK SQL" you have ordered the result set by descending order.

    " product_category_id desc"

    In the DF result set, you have sorted in ascending order - "orderBy(col("productCategory"));"

    am i missing anything here or should your answer be updated in this blog ?

    ReplyDelete
  7. hi arun,
    while am executing the step-4, am getting the below error as managed memory leak. And my spark version is spark 1.6.0 . Am doing this in cloudera quickstart vm.
    As it is error, it is not showing the result also.

    Can you please tell me how to change this ERROR to WARN so that i can see the result.

    dataFrameResult.show();

    17/11/09 13:19:01 WARN memory.TaskMemoryManager: leak 8.3 MB memory from org.apache.spark.unsafe.map.BytesToBytesMap@129e18f
    17/11/09 13:19:01 ERROR executor.Executor: Managed memory leak detected; size = 8650752 bytes, TID = 7
    17/11/09 13:19:01 ERROR executor.Executor: Exception in task 0.0 in stage 10.0 (TID 7)

    ReplyDelete
  8. Hi Arun,

    Thank you for the sharing, I found some issues which looks typo when trying your solutions.

    Please help to correct me if I am wrong, thanks.

    STEP 4 :
    scala> var products = sc.textFile("/user/cloudera/problem2/products/").map(x=> {var d = x.split('|'); (d(0).toInt,d(1).toInt,d(2).toString,d(3).toString,d(4).toFloat,d(5).toString)});


    Step 4-Data Frame Api:
    scala> var dataFrameResult = productsDF.filter("productPrice < 100").groupBy(col("productCatID")).agg(max(col("productPrice")).alias("max_price"),countDistinct(col("productID")).alias("tot_products"),round(avg(col("productPrice")),2).alias("avg_price"),min(col("productPrice")).alias("min_price")).orderBy(col("productCatID"));

    Step 4- spark sql:
    var sqlResult = sqlContext.sql("SELECT productCatID, max(productPrice) AS maximum_price, count(distinct(productID)) AS total_products, cast(avg(productPrice) as DECIMAL(10,2)) AS average_price, min(productPrice) AS minimum_price FROM products WHERE productPrice < 100 GROUP BY productCatID ORDER BY productCatID asc");

    ReplyDelete
  9. In the second question, if you are making both the directories at that moment itself, what dta is getting transferred when you are using -mv command?

    ReplyDelete
  10. Hi Arun, one important query, for calculating the avg. price you are rounding it off to two decimal places even though it's not explicitly mentioned in the problem statement. So, will it be mentioned explicitly during the exam? or do we need to take care of it implicitly?

    ReplyDelete
  11. Hi Arun, in the real exam, can I use pyspark?
    Or it's only allowed use spark-shell ?

    ReplyDelete
  12. Hi Arun,

    Thanks a ton for the wonderful blog, it is helping us a lot to get certified. I think you have missed to filter out the products of price lower than 100 USD while using RDD. Correct me if I'm wrong.

    ReplyDelete
  13. Hi
    could you please let me know that in CCA175 , will they provide any IDE like eclipse with scala or pythan code sinppet ?
    Please let me know

    Thanks
    Vinay Mallela

    ReplyDelete
    Replies
    1. Yes, you will have access to IntelliJ, Eclipse, Sublime Text, etc. It will all be inside the virtual machine you will have to log into. You will also be able to open a browser there, which will only show you links to the standard Cloudera documentation.

      Just make sure you are at ease and comfortable working with the Cloudera VM; it will be a similar environment.

      Delete
  17. Hi Arun, This blog is awesome. When I was doing problem 2 in step4 for spark I found that in sql query columns/fields mentioned as product_category_id,product_price & product_id but we have created the Dataframe with productCatID,productPrice & productID. Wehenver I am running sql query with product_category_id,product_price & product_id getting error as "org.apache.spark.sql.AnalysisException: cannot resolve 'product_price' given input columns: [productDesc, productImage, productPrice, productCatID, productID, productName];"

    So please check and correct it if it is wrong.

    Thank you.

    ReplyDelete
  19. In aggregateByKey answer believe will have to add the Filter <100 also. In Video its correct.

    ReplyDelete
  20. Hi, can you please provide the sample data of the products table? I have not understood the use of count distinct applied on the product_id column: this column is auto-increment, so its value is always going to be unique.

    ReplyDelete
    Replies
    1. Hi, you are right, there is no need to use count distinct on product_id; a plain count is enough.

      Delete
  21. I got this error , can anyone help me ?
    scala> val result_DF = productdf.filter("productPrice < 100").groupBy(col("productCategoryID")).agg(col(max("productPrice")).alias(max_price),countDistinct(col("productID")).alias(total_products),col(min("productPrice")).alias(min_price),col(avg("productPrice")).alias(moyenne_price))
    :39: error: type mismatch;
    found : org.apache.spark.sql.Column
    required: String
    val result_DF = productdf.filter("productPrice < 100").groupBy(col("productCategoryID")).agg(col(max("productPrice")).alias(max_price),countDistinct(col("productID")).alias(total_products),col(min("productPrice")).alias(min_price),col(avg("productPrice")).alias(moyenne_price))

    ReplyDelete
  22. For Pyspark

    prdd=sc.textFile("/user/cloudera11/products")
    from pyspark.sql import Row

    pdf=prdd.map(lambda x:Row(product_id=x.split("|")[0],product_category_id=x.split("|")[1],product_name=x.split("|")[2],product_price=x.split("|")[4])).toDF()

    pdf.registerTempTable("PT")

    pd=sqlContext.sql("select product_category_id, count(product_id),max(product_price),avg(product_price),min(product_price) from PT \
    where product_price < 100 \
    group by product_category_id \
    order by product_category_id")

    sqlContext.setConf("spark.sql.avro.compression.codec","Snappy")

    pd.write.format("com.databricks.spark.avro").save("/user/cloudera/problem2/products/result-df")

    ReplyDelete
  24. Hi Arun,
    I am trying to save a DF to a AVRO file with snappy compression.However didn’t notice any size difference between avro file without snappy compression and with snappy compression.Also in your example file size is almost same after snappy compression.I am using the below codes…

    top5CustPerMonthMapSortedMapDF.coalesce(2).write.format("com.databricks.spark.avro").save("/user/sushital1997/DRPROB6/avro/top_5_cust")

    AVRO compression
    sqlContext.setConf("spark.sql.avro.compression.codec","snappy")
    top5CustPerMonthMapSortedMapDF.save("/user/sushital1997/DRPROB6/avro/top_5_cust1_snappy","com.databricks.spark.avro")

    I also tried with the below code
    top5CustPerMonthMapSortedMapDF.coalesce(2).write.format("com.databricks.spark.avro").save("/user/sushital1997/DRPROB6/avro/top_5_cust_snappy").

    I heard that snappy doesn’t work with DataFrame.So what I should do during certification exam if they ask to save a DF to avro with snappy compression.

    ReplyDelete
  27. Hi Arun, Thanks a lot for this Blog . I am trying to do this problem in pyspark. I am getting the following error. I am able to create dataframe but i am not able to use dataframe.show(). if i do so. i am getting below error.UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 31: ordinal not in range(128)...i tried all options....but no use.. but RDD.take(n) is working.

    ReplyDelete
    Replies
    1. Hi siva,
      can you tell me for which line code you are getting the error??

      Delete
    2. In pyspark 2.4.6, the only work around I for it was to bring the data to the driver node and process the encoding in that node.

      I used the approach described on https://markhneedham.com/blog/2015/05/21/python-unicodeencodeerror-ascii-codec-cant-encode-character-uxfc-in-position-11-ordinal-not-in-range128/

      So the final result is:

      import sys
      reload(sys)
      sys.setdefaultencoding("utf-8")

      local = spark.read.jdbc("jdbc:mysql://mysql.XXXX.com.br",table="XXXX.products",\
      properties={"user":"XXXX","password":"XXXX","port":"3306"}).collect()
      local2 = map(list,local)
      local3=map(lambda r: ["|".join(map(str,r))],local2)
      sc.parallelize(local3,1).map(lambda x: x[0]).write.text("/user/marceltoledo/arun/cloudera/products")

      Delete
  28. hi arun i am getting error like this:

    var productsDF = products.map(x=> Product(x._1,x._2,x._3,x._4,x._5,x._6)).toDF();

    error: value map is not a member of object product

    ReplyDelete
  29. Just a suggestion : It may be good to post the result as well so that we all can verify. This is what I am getting. Is anyone else also getting this ? I want to know if my results are ok or if i need to fix anything

    +-------------------+-----------------+---------------------------+-----------------+-----------------+
    |product_category_id|max_product_price|total_products_per_category|avg_price_per_cat|min_price_per_cat|
    +-------------------+-----------------+---------------------------+-----------------+-----------------+
    | 59| 70.0| 10| 38.6| 28.0|
    | 58| 60.0| 13| 43.69| 22.0|
    | 57| 99.99| 18| 59.16| 0.0|
    | 56| 90.0| 22| 60.5| 9.99|
    | 55| 85.0| 24| 31.5| 9.99|
    | 54| 99.99| 18| 61.43| 34.99|
    | 53| 99.99| 8| 91.24| 69.99|
    | 52| 65.0| 19| 28.74| 10.0|
    | 51| 79.97| 10| 40.99| 28.0|
    | 50| 60.0| 14| 53.71| 34.0|
    | 49| 99.99| 13| 74.22| 19.98|
    | 48| 49.98| 7| 35.7| 19.98|
    | 47| 99.95| 14| 44.63| 21.99|
    | 46| 49.98| 9| 34.65| 19.98|
    | 45| 99.99| 7| 55.42| 27.99|
    | 44| 99.98| 15| 62.19| 21.99|
    | 43| 99.0| 1| 99.0| 99.0|
    | 42| 0.0| 1| 0.0| 0.0|
    | 41| 99.99| 37| 31.24| 9.59|
    | 40| 24.99| 24| 24.99| 24.99|
    | 39| 34.99| 12| 23.74| 19.99|
    | 38| 99.95| 14| 46.34| 19.99|
    | 37| 51.99| 24| 36.41| 4.99|
    | 36| 24.99| 24| 19.2| 12.99|
    | 35| 79.99| 9| 34.21| 9.99|
    | 34| 99.99| 9| 83.88| 34.99|
    | 33| 99.99| 19| 58.46| 10.8|
    | 32| 99.99| 10| 48.99| 19.99|
    | 31| 99.99| 7| 88.56| 79.99|
    | 30| 99.99| 7| 95.42| 68.0|
    | 29| 99.99| 15| 60.73| 4.99|
    | 27| 90.0| 24| 44.16| 18.0|
    | 26| 90.0| 24| 41.66| 18.0|
    etc etc

    ReplyDelete
  30. Can you show the rdd solution using pyspark ?

    ReplyDelete
  32. Hi Arun,
    in the DF asnwer, why you have used CountDistinct('product_id) to find the no product's..... this is product master table... do we need to use that... please advise.

    ReplyDelete
  34. for problem 2 :

    why can't i use the below command to achieve?

    hadoop fs -cp /user/cloudera/products /user/cloudera/problem2/products
    hadoop fs -rm /user/cloudera/products

    ReplyDelete
