PLEASE READ THE INTRODUCTION TO THIS SERIES. CLICK ON HOME LINK AND READ THE INTRO BEFORE ATTEMPTING TO SOLVE THE PROBLEMS
A video walk-through of the solution to this problem can be found here [Click here].
Click here for the video version of this series. This takes you to the YouTube playlist of videos.
Problem 1:
- Using sqoop, import the orders table into HDFS under the folder /user/cloudera/problem1/orders. Files should be loaded as Avro files using Snappy compression.
- Using sqoop, import the order_items table into HDFS under the folder /user/cloudera/problem1/order-items. Files should be loaded as Avro files using Snappy compression.
- Using Spark Scala, load the data at /user/cloudera/problem1/orders and /user/cloudera/problem1/order-items as DataFrames.
- Expected intermediate result: Order_Date, Order_Status, total_orders, total_amount. In plain English: find the total orders and total amount per status per day. The result should be sorted by order date descending, order status ascending, total amount descending and total orders ascending. The aggregation should be done using the methods below; however, the sorting can be done on a DataFrame or an RDD. Perform the aggregation in each of the following ways:
- a). Just by using the DataFrames API - here order_date should be in YYYY-MM-DD format
- b). Using Spark SQL - here order_date should be in YYYY-MM-DD format
- c). By using the combineByKey function on RDDs -- no need to format order_date or total_amount
- Store the result as parquet file into hdfs using gzip compression under folder
- /user/cloudera/problem1/result4a-gzip
- /user/cloudera/problem1/result4b-gzip
- /user/cloudera/problem1/result4c-gzip
- Store the result as parquet file into hdfs using snappy compression under folder
- /user/cloudera/problem1/result4a-snappy
- /user/cloudera/problem1/result4b-snappy
- /user/cloudera/problem1/result4c-snappy
- Store the result as CSV file into hdfs using No compression under folder
- /user/cloudera/problem1/result4a-csv
- /user/cloudera/problem1/result4b-csv
- /user/cloudera/problem1/result4c-csv
- Create a MySQL table named result and load the data from /user/cloudera/problem1/result4a-csv into it.
Solution:
Try your best to solve the above scenario without going through the solution below. If you could, then use the solution to compare your result. If you could not, then I strongly recommend that you go through the concepts again (this time in more depth). Each step below provides a solution to the points mentioned in the problem scenario.
Step 1:
sqoop import \
--connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
--username retail_dba \
--password cloudera \
--table orders \
--compress \
--compression-codec org.apache.hadoop.io.compress.SnappyCodec \
--target-dir /user/cloudera/problem1/orders \
--as-avrodatafile;
--connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
--username retail_dba \
--password cloudera \
--table orders \
--compress \
--compression-codec org.apache.hadoop.io.compress.SnappyCodec \
--target-dir /user/cloudera/problem1/orders \
--as-avrodatafile;
Step 2:
sqoop import \
--connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
--username retail_dba \
--password cloudera \
--table order_items \
--compress \
--compression-codec org.apache.hadoop.io.compress.SnappyCodec \
--target-dir /user/cloudera/problem1/order-items \
--as-avrodatafile;
--connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
--username retail_dba \
--password cloudera \
--table order_items \
--compress \
--compression-codec org.apache.hadoop.io.compress.SnappyCodec \
--target-dir /user/cloudera/problem1/order-items \
--as-avrodatafile;
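A quick optional check (my own addition, not part of the original steps) to confirm that both imports landed as avro part files before moving on:
hadoop fs -ls /user/cloudera/problem1/orders
hadoop fs -ls /user/cloudera/problem1/order-items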
Step 3:
import com.databricks.spark.avro._;
var ordersDF = sqlContext.read.avro("/user/cloudera/problem1/orders");
var orderItemDF = sqlContext.read.avro("/user/cloudera/problem1/order-items");
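Note (my own addition): the com.databricks.spark.avro classes ship with CDH 5.10, so the import above normally resolves as-is in spark-shell. If they are missing in your environment, one assumed way to pull them in is the spark-avro package; the version below is an assumption for Spark 1.6 / Scala 2.10:
spark-shell --packages com.databricks:spark-avro_2.10:2.0.1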
Step 4:
var joinedOrderDataDF = ordersDF.
join(orderItemDF, ordersDF("order_id") === orderItemDF("order_item_order_id"));
Step 4a:
import org.apache.spark.sql.functions._;
var dataFrameResult = joinedOrderDataDF.
groupBy(to_date(from_unixtime(col("order_date")/1000)).alias("order_formatted_date"),col("order_status")).
agg(round(sum("order_item_subtotal"),2).alias("total_amount"),countDistinct("order_id").alias("total_orders")).
orderBy(col("order_formatted_date").desc,col("order_status"),col("total_amount").desc,col("total_orders"));
dataFrameResult.show();
Step 4b:
joinedOrderDataDF.registerTempTable("order_joined");
var sqlResult = sqlContext.sql("select to_date(from_unixtime(cast(order_date/1000 as bigint))) as order_formatted_date, order_status, cast(sum(order_item_subtotal) as DECIMAL (10,2)) as total_amount, count(distinct(order_id)) as total_orders from order_joined group by to_date(from_unixtime(cast(order_date/1000 as bigint))), order_status order by order_formatted_date desc,order_status,total_amount desc, total_orders");
sqlResult.show();
Step 4c:
var comByKeyResult =
joinedOrderDataDF.
map(x=> ((x(1).toString,x(3).toString),(x(8).toString.toFloat,x(0).toString))).
combineByKey((x:(Float, String))=>(x._1,Set(x._2)),
(x:(Float,Set[String]),y:(Float,String))=>(x._1 + y._1,x._2+y._2),
(x:(Float,Set[String]),y:(Float,Set[String]))=>(x._1+y._1,x._2++y._2)).
map(x=> (x._1._1,x._1._2,x._2._1,x._2._2.size)).
toDF().
orderBy(col("_1").desc,col("_2"),col("_3").desc,col("_4"));
comByKeyResult.show();
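Since the three combineByKey arguments trip up many readers (see the comments further down), here is the same Step 4c logic once more, annotated line by line; the val name comByKeyAnnotated is mine, the logic is unchanged:
val comByKeyAnnotated = joinedOrderDataDF.
// key = (order_date, order_status); value = (order_item_subtotal, order_id)
map(x => ((x(1).toString, x(3).toString), (x(8).toString.toFloat, x(0).toString))).
combineByKey(
// createCombiner: first value seen for a key -> (running amount, set of distinct order ids)
(x: (Float, String)) => (x._1, Set(x._2)),
// mergeValue: fold one more (amount, order_id) pair from the same partition into the combiner
(x: (Float, Set[String]), y: (Float, String)) => (x._1 + y._1, x._2 + y._2),
// mergeCombiners: merge the partial (amount, id set) results of the same key across partitions
(x: (Float, Set[String]), y: (Float, Set[String])) => (x._1 + y._1, x._2 ++ y._2)).
// total_amount = summed subtotal, total_orders = number of distinct order ids
map(x => (x._1._1, x._1._2, x._2._1, x._2._2.size))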
Step 5:
- sqlContext.setConf("spark.sql.parquet.compression.codec","gzip");
- dataFrameResult.write.parquet("/user/cloudera/problem1/result4a-gzip");
- sqlResult.write.parquet("/user/cloudera/problem1/result4b-gzip");
- comByKeyResult.write.parquet("/user/cloudera/problem1/result4c-gzip");
Step 6:
- sqlContext.setConf("spark.sql.parquet.compression.codec","snappy");
- dataFrameResult.write.parquet("/user/cloudera/problem1/result4a-snappy");
- sqlResult.write.parquet("/user/cloudera/problem1/result4b-snappy");
- comByKeyResult.write.parquet("/user/cloudera/problem1/result4c-snappy");
Step 7:
- dataFrameResult.map(x=> x(0) + "," + x(1) + "," + x(2) + "," + x(3)).saveAsTextFile("/user/cloudera/problem1/result4a-csv")
- sqlResult.map(x=> x(0) + "," + x(1) + "," + x(2) + "," + x(3)).saveAsTextFile("/user/cloudera/problem1/result4b-csv")
- comByKeyResult.map(x=> x(0) + "," + x(1) + "," + x(2) + "," + x(3)).saveAsTextFile("/user/cloudera/problem1/result4c-csv")
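Two hedged alternatives for the CSV step (my own additions, not the original approach; the target paths are hypothetical and must not already exist). The first simply coalesces to a single output file; the second assumes the spark-csv package is on the classpath.
- dataFrameResult.rdd.map(x=> x(0) + "," + x(1) + "," + x(2) + "," + x(3)).coalesce(1).saveAsTextFile("/user/cloudera/problem1/result4a-csv-single")
- dataFrameResult.write.format("com.databricks.spark.csv").save("/user/cloudera/problem1/result4a-csv-pkg")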
Step 8:
a) login to MYSQL using below : mysql -h localhost -u retail_dba -p
(when prompted password use cloudera or any password that you have currently set)
b) create table retail_db.result(order_date varchar(255) not null,order_status varchar(255) not null, total_orders int, total_amount numeric, constraint pk_order_result primary key (order_date,order_status));
c)
sqoop export \
--table result \
--connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
--username retail_dba \
--password cloudera \
--export-dir "/user/cloudera/problem1/result4a-csv" \
--columns "order_date,order_status,total_amount,total_orders"
Are you covering all possible problems in each scenario?
While that is my goal, it may not be possible to cover all scenarios in a single problem. But the goal is to cover all exam topics across the series of problems. I am not targeting more than 10 posts. So far I have posted 3 as of the time this comment was posted. I have 2 getting ready to be published by end of day today. Will do 5 more by end of this week with other possible combinations.
Great help buddy..
It would be great if you could prepare each problem scenario to be inclusive of all exam objectives in future.. maybe like 1 sqoop import, 1 sqoop export / 1 flume, 8 spark questions (3 spark ETL, 4 spark SQL, 1 spark configuration). This structure would help us to consider these scenarios as real mock exams with solutions..
Thanks for your efforts on this so far..
Sure, I will create the next set of problems in the pattern you suggested.
Thanks for accepting the suggestion. Appreciate your efforts.
In the exam, mainly when it comes to Spark and Spark SQL questions, can we give solutions in whichever approach we prefer, or is it mandatory to use RDDs and Spark SQL as asked in the question?
Where can we get these sample datasets to come up with the solution and test it on our VM? Thx
All of the datasets come prepackaged in the Cloudera QuickStart VM. I have not used anything that is not available in the QuickStart. Install it from the link below:
http://www.cloudera.com/downloads/quickstart_vms/5-10.html
Thx. Will download and try it. I took the Cloudera course but that doesn't involve complicated aggregate functions etc. So this blog is helpful.
Yes this blog will help. I have uploaded a video walk-through of the strategy as well as one for each problem. Please go through the videos and let me know what you think.
Hello Arun,
What should be the approach to solve a problem wherein we have to copy the content of a hive table "abc" to a hdfs location "xyz/abcd" and change the delimiter to "|"?
Hi Arun, are you covering problems based on the new syllabus by cloudera?
Yes. Please check the prep plan post where I explained the same in detail.
Good job, it is a helpful blog.
I tried in step 4c to round the big decimal by adding this format "%.2f".format(x):
joinorders.map(x=>((x(1).toString,x(3).toString),(x(8).toString.toDouble,x(0).toString.toInt))).combineByKey((x:(Double,Int))=>(x._1,Set(x._2)),(x:(Double,Set[Int]),y:(Double,Int))=>(x._1+y._1,x._2 + y._2),(x:(Double,Set[Int]),y:(Double,Set[Int]))=>(x._1+ y._1,x._2++ y._2)).map(x=>(x._1._1,x._1._2,"%.3f".format(x._2._1),x._2._2.size))
hello Arun
thanks for your blog, really useful, just one question: why do you use combineByKey instead of groupBy? Because of performance, or does the exam ask for it? (It is still complicated for me to use and I can't find any site that explains it better.)
combineByKey gives more control. If you can solve the same problem using groupByKey then I recommend using it. However, remember that you may be asked questions to complete a programming sentence, and hence you need to be prepared to use all major Spark functions.
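For anyone who prefers to avoid combineByKey, a minimal sketch of the same Step 4c aggregation using reduceByKey (my own illustration, not from the original post; it assumes the joinedOrderDataDF built in Step 4 and the same column positions):
val reduceByKeyResult = joinedOrderDataDF.
// key = (order_date, order_status); value = (subtotal, single-element set of order_id)
map(x => ((x(1).toString, x(3).toString), (x(8).toString.toFloat, Set(x(0).toString)))).
// sum the subtotals and union the order-id sets per key
reduceByKey((a, b) => (a._1 + b._1, a._2 ++ b._2)).
map(x => (x._1._1, x._1._2, x._2._1, x._2._2.size))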
Thank you Arun. Your blog helped me to clear the CCA175 exam.
Very very happy to know. Congratulations!
Hello, please which editor is used during the CCA175 exam? Thank you.
Hi Arun,
While doing sqoop export, do we have to take care of anything? Say my table in MySQL has 7 columns when I describe it, with one field being NOT NULL, and I have to export using an HDFS file. I used the below command and it did not work out, please correct me:
sqoop import --connect jdbc:mysql://quickstart.cloudera:3306/retail_db --username --password --table departments --export-dir /hdfs/ocation --bath --columns colunames --batch --outdir output
Is anything wrong in the above command? My job is failing with some file format issue. Please help me.
Thanks,
Satish.
Your question is about export, but the command you posted is doing import. Are you sure this is the command you want me to review? Also, I don't know what the bath option is... I know you used batch, which I kind of understand, but not bath.
Hi Arun,
Mistyped, it is export.
Thanks,
Satish.
Hi Arun,
I took the exam yesterday and I met a problem when loading multiple files (parquet, avro) into the spark shell. However, today when I try to directly load them into the spark shell just like
avroLoadingTest = sqlContext.read.format("com.databricks.spark.avro").load("/user/raku/test/avroLoadingTest")
everything is fine. Now I am confused by this, because before the test I had tried loading a single file only, so I am not sure whether loading multiple files in a directory in one go is not supported or I just made some syntax mistake in the exam. Is it alright to use a single statement like the above to load multiple files in a directory into Spark? Thank you.
P.S. The scenarios you gave are really helpful.
Yes, you can load all the files when you specify a directory location. You can also load files from different directories at once.
Hi Arun.. Is there a possibility of having multiple types of files in one directory, or will we get one file format at a time? Thanks for your response in advance.
Hi Arun, if it is possible can you have Spark Databricks exam certification on your blog?
Hi Arun, thanks for the post... Can I know if this is exactly how the questions are in the exam? I mean, will each question have at least 10 tasks to complete? Please let me know, I'm going to take the exam next month.
Each question will have one or two tasks to complete.
Hi Arun,
Are you planning more problem scenarios or is the 7th one the last scenario? Your blog is really helpful. Thanks for creating the problem scenarios.
Buddy, you removed the comment by the time I enjoyed viewing it :). All the best for your CCA 175 exam buddy.
Congrats.
DeleteHi Arun,
ReplyDeleteReally a huge effort you are doing here so big thank to you.
i have two questions:
1/how can i export parquet file into mysql using sqoop? you did the export for the csv file, but when i tried the same command using the parquet directory it gave me some exceptions.
2/Are you sure the queries in 4a, 4b and 4c give the same result? its not the case for me.
Thanks!
This comment has been removed by the author.
DeleteBelow are responses to both questions
ReplyDelete1. Create a hive table that is backed by parquet file, then use hive table as source
2. Watch the video
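A minimal sketch of option 1 (my own illustration, not from the original post): expose the parquet output through a Hive table and let sqoop export read it via HCatalog. The Hive table name result_parquet here is a hypothetical assumption.
sqoop export \
--connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
--username retail_dba \
--password cloudera \
--table result \
--hcatalog-database default \
--hcatalog-table result_parquet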
Hi Arun --
Great content! Thanks for doing this.
Sorry, I agree with Taufik in that the results for 4a, 4b, and 4c do not match. The SQL and DF result sets do, but the combineByKey result set does not match the other two from the second row on. For example, the second-row total amount for the dataframe and SQL results is 16333.16, while for the combineByKey result the same row is 10999.05 (off by about 4000). What you are doing appears to be correct -- I am not sure what needs to be done differently and can't tell why the result for combineByKey is wrong. A MySQL query supports the other two result sets.
Hi Arun, thanks for a great blog. I was going through the Sqoop documentation "sqoop-1.4.6-cdh5.10.1". I have two questions: 1) Do we need to memorize compression technique names, since they are not mentioned in the documentation? 2) Are the compression techniques mentioned in any other document? Thanks in advance.
Working (hands-on) knowledge of compression techniques is absolutely required, not only for the exam but also in day-to-day life as a big data developer. You need to memorize the compression methods, and I recommend that you go through the file formats link of this blog. Best of luck.
Below are the available compressions from Apache. Just add org.apache.hadoop.io.compress before these compression methods/algorithms while writing sqoop code:
BZip2Codec, DefaultCodec, DeflateCodec, GzipCodec, Lz4Codec, SnappyCodec
Hello Arun,
Thanks for the posts. They are really helpful.
But I am confused about deciding the right way to solve a given problem, as there could be many possible solutions. For example, creating a hive table in, say, parquet file format can be done through Hive or through sqoop import. How do I know which is the right solution during certification? Please suggest. Suggestions from people with certification experience will be helpful too.
Thanks once again for all the effort.
Hi Arun,
Thank you for the blog. The questions are of great detail and were well structured.
It depends. The exam may ask you to do it with a certain method or just ask you for the result. If the exam question is asking you only for a result then you are free to choose whatever method you want. But always remember that Spark allows a lot of flexibility whereas sqoop is very limited, so choose a technology that helps you solve the problem in the fastest way. Time is premium; I have heard from so many that they could not complete the exam as they took too much time debugging their procedure. So be careful choosing a solution.
Hello Arun,
I have booked the exam and am waiting to take it once my preparation is done. I am getting confused a lot with Scala, so I decided to learn more about DataFrames as you have shown in problems 1 and 2, since they are easy compared to Scala. Is it OK to choose DataFrames in the exam for all types of problems? Can you add some more videos regarding DataFrames? That would be greatly appreciated.
ReplyDeleteHi Arun,
ReplyDeletethanks for your blog.
from exam point of view should we be aware of all the below 3 types? If i'm good in sql, can i ignore a &c?? Please advise..
a). Just by using Data Frames API - here order_date should be YYYY-MM-DD format
b). Using Spark SQL - here order_date should be YYYY-MM-DD format
c). By using combineByKey function on RDDS -- No need of formatting order_date or total_amount
be aware of all three. you may be asked to fill in the blanks and execute the program.awareness about all three will make you better prepared.
DeleteHello , I have been trying the problems scenarios given in your blog . but could you please provide the sample files for order-items so we can try in our laptop .for example lets say problem 1 sqooping order-items data . So I need sample data for this. Appreciate your response :)
ReplyDeleteBuddy, you need to have CDH downloaded from cloudera. CDH has all the data in mysql database that comes with it. Watch the video for an understanding how that data is pulled from mysql to hdfs.
ReplyDeleteThis comment has been removed by the author.
ReplyDeleteHello please which editor is use during CCA175 exam. Thank you
ReplyDeleteuse gedit.
DeleteThanks for your blog.
ReplyDeleteI have one question. Is it mandatory to practice the questions on Cloudera Platform only. I have experience working on Hortonworks platform. Can I directly attempt the exam by practicing on Hortonworks Platform
Practice on the Cloudera platform; this gives you acquaintance with the environment. Even seasoned big data developers failed the CCA 175 exam recently, not due to the complexity of the exam but due to unfamiliarity with the environment. They ran out of time and hence could not complete the required number of problem scenarios.
Deletevar dataFrameResult =
ReplyDeletedataFrameResult.show();
ERROR IN MY MACHINE PLEASE REPLY ME
Hi Arun,
I have cleared CCA Spark and Hadoop Developer because of your blog.
Thank you very much!! Hope to see more problem scenarios like this.
Regards,
Aswin Ramakrishnan
Hi Aswin,
Big congratulations!!
Can you please tell me which editor we are allowed to use during the exam? Is it vi or gedit?
Hi Arun sir Can I get the data of problems which you have explained.
Beautiful blog......... Great content presented in the best possible way so that the entire big data knowledge can be streamlined.
Hello Arun - I think you have done the coding in Scala in all your videos, so if someone wants to take the exam using Python, do you have the code snippets for the same? If yes, can you please share them. -Auro
ReplyDeleteHi Arun,
I have booked the exam and am holding off taking it until my preparation is finished. I am struggling a lot with Scala, so I chose to learn more about DataFrames as you have shown in problems 1 and 2, since they are simpler compared to Scala. Is it alright to pick DataFrames in the exam for all kinds of problems? Would you be able to include some more recordings regarding DataFrames? That would be greatly appreciated.
Regards,
Trep Helix
Hi Arun,
In the certification exam, will this databricks avro import work in the spark-shell? Because as per Cloudera they do not provide the "Databricks API" in the CDH install:
import com.databricks.spark.avro._;
Hi Arun,
I've watched your YouTube video but I still don't understand this part of the 4c exercise. I'm totally lost with the combineByKeyResult:
combineByKey((x:(Float, String))=>(x._1,Set(x._2)),
(x:(Float,Set[String]),y:(Float,String))=>(x._1 + y._1,x._2+y._2),
(x:(Float,Set[String]),y:(Float,Set[String]))=>(x._1+y._1,x._2++y._2)).
map(x=> (x._1._1,x._1._2,x._2._1,x._2._2.size))
Completely lost with the combineByKey. Could you please explain what you are doing here?
I've got my exam in two days.
Thank you
Even I couldn't understand it from the 2nd lambda fn :(
DeleteHI Arun,
This is a really informative blog..
But the CCA-175 exam content has changed. Can you help with a new playlist for the same?
Hi Sohil,
These would suffice for the CCA175 exam content as well.
Hi Arun/Venkat,
Can we submit more than one file as output, given that the "spark.sql.shuffle.partitions" value is more than 1 by default? Or do we have to submit one output file only?
Hi Jitesh,
By default "spark.sql.shuffle.partitions" is set to 200. This can even be configured to the value 1 based on performance tuning options.
Hi Venkat,
Thanks for your answer, but my question is still the same: can we submit more than one file? Because it will save time during the exam.
If you are expecting a yes or no kind of answer, the answer is YES.
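A minimal sketch of the two ways to control the number of output files mentioned in this exchange (my own illustration; the output path is hypothetical and assumes the dataFrameResult from Step 4a):
sqlContext.setConf("spark.sql.shuffle.partitions","1"); // fewer shuffle partitions => fewer part files
dataFrameResult.coalesce(1).write.parquet("/user/cloudera/problem1/result4a-single"); // or coalesce just before writing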
ReplyDeleteThis comment has been removed by the author.
ReplyDeleteI am trying to use aggregateByKey instead of combineByKey.but getting issue with count rest all looking good...could you please let me know what i am missing here,
ReplyDeleteval ordersDF = sqlContext.read.avro("/user/imthiyas90/problem1/orders")
val orderitemDF = sqlContext.read.avro("/user/imthiyas90/problem1/order_items")
ordersDF.registerTempTable("orders")
orderitemDF.registerTempTable("order_items")
val joinedDF = ordersDF.join(orderitemDF, ordersDF("order_id") === orderitemDF("order_item_order_id"))
val combine_agg = joinedDF.map(e => ((e(1).toString,e(3).toString),(e(8).toString.toDouble,e(0).toString.toInt))).aggregateByKey((0.0,0))((x:(Double,Int),y:(Double,Int)) =>(x._1+y._1,x._2+1),(x:(Double,Int),y:(Double,Int)) =>(x._1+y._1,x._2+y._2)).map(x => (x._1._1,x._1._2,x._2._1,x._2._2)).toDF.orderBy(col("_1").desc,col("_2"),col("_3").desc,col("_4"))
Hi Arun,
Thanks for providing such a wonderful blog for those who aspire to clear the CCA Spark and Hadoop certification. But what I found is that all of the solutions are explained in Scala; I am learning PySpark and don't know Scala. Could you please tell me if you have the same solutions in Python as well?
Thanks
Hi Arun,
I tried from_unixtime, but it is returning null values.
Thanks for the good info. It's a very good blog.
Hi, I appreciate your great efforts! Is it necessary to round the sum of order_item_subtotal to 2 decimal values? They have not mentioned it in the question.
When it comes to saving as CSV, can we do the following steps:
val resultRDD = resultDF.rdd.
map(_.toString().replace("[","").replace("]","")).
saveAsTextFile("/user/cloudera/problem1/result4a-csv")
Hi Arun,
Thanks a ton for these exercises and the videos.
I have cleared my CCA 175 today. These exercises helped a lot in my preparation and gave me the confidence to take up the test.
Thanks
Vamsi
Many congratulations for passing the exam. I am going to appear for the exam in a few days. I just wanted to know: did you get templates in the exam? Was there any question about avro? How many questions were there from each category? Your answers can really help since you have just given it recently. I will really appreciate it if you revert.
Arun, from_unixtime does not work in pyspark. Do we have any solution for the same?
Can we not register the DataFrames as temp tables so that we can use SQL? Can you please explain.
Hi Arun,
Thanks for the great effort.
In step 3, you've imported avro files from databricks. Will this import statement be given to me during the exam, or do I need to memorize it? It's not mentioned in the Spark docs, and as far as I know I'll only have access to the Spark docs.
Thanks
Hi Arun.. your video tutorials are good, but where can I get the sample data files that you used in your videos?
I want to try it out. Please give me the path from which I can download those files and folders and try them out on my local machine.
Hi Arun, how different will the problem scenarios of CCP DE575 be from what you have given here?
Hi, please can you share the top 20 rows of the result set to compare our output?
to_date(from_unixtime(cast(order_date/1000 as bigint))) as order_formatted_date is giving null values, but when I just query o.order_date from the orders temp table it gives the correct value, which is of data type String. Any help in this situation?
The order date should be of long data type; if we use to_date(from_unixtime(cast(order_date/1000 as bigint))) then it will give the correct output in YYYY-MM-DD format.
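For illustration (my own example, and the epoch value below is made up): the order_date in the avro import is epoch milliseconds stored as a long, so it has to be divided by 1000 before from_unixtime; applying the cast to a string date instead is what produces the nulls.
sqlContext.sql("select to_date(from_unixtime(cast(1374735600000/1000 as bigint))) as order_formatted_date").show();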
What API documentation is available during the exam in the VM itself?
Great efforts!! One suggestion from my end: please provide the solutions in Python as well.. it would be a great help.
ReplyDeleteHi,
ReplyDeleteIn the problem scenario 1, problem 1 when have been asked to use Snappy compression. The code which you have given contains "--compression-codec org.apache.hadoop.io.compress.SnappyCodec". Would like to know, are we supposed to remember org.apache.... to use the Snappy compression? or we can get it from hadoop or even from user guide?
This comment has been removed by the author.
ReplyDeleteThis comment has been removed by the author.
ReplyDelete3.Using Spark Scala load data at /user/cloudera/problem1/orders and /user/cloudera/problem1/orders-items items as dataframes.
ReplyDeletecan this be done as
ordersDF = sc.textFile("xxx/xxx")
ordersDF.toDF
Hello,
Thank you for this amazing blog. I am a bit confused: is this the entire demo question paper or just one question in the exam?
Hi Arun,
Thanks for your video. Request you to please add some dedicated videos on the groupByKey, combineByKey, reduceByKey... functions.
Good post! Thank you so much for sharing this pretty post; it was so good to read and useful to improve my knowledge. Keep blogging.
Thanks for sharing good information.
Hi,
I am not able to follow the combineByKey step.
1. In the second lambda, is the output of the first lambda treated as the input for the second lambda?
2. What will be the values of x and y in the second lambda?
I did watch some videos to understand combineByKey. Also, please share any good videos for combineByKey.
Hi Arun,
I am really motivated after seeing your preparation plan for certification. You have done a very good job. I am following your plan and have started trying hands-on. My humble request is: if you can provide the same code in Python instead of Scala it would be really helpful for me to understand, since I have worked with Python so far. Expecting some materials as well as more scenarios.
Thanks,
Aruna Asokan
ReplyDeleteHi Arun,
ReplyDeleteThanks for the blog and usful questions.
I'm trying to resolve problem 1 using pyspark and am stuck at the point 7 solution.
I'm trying to transform the sqlResult (in my code below, "resultDF") DF to an RDD using the map function, but I end up with some odd values in the mapped variable. Even when I save it as a text file like below, I'm seeing some junk data as shown below.
>>resultDF. \
map(lambda m: str(m(0))+','+str(m(1))+','+str(m(2))+','+str(m(3))). \
coalesce(4). \
saveAsTextFile("/user/....")
>>resultMap = resultDF. \
map(lambda m: str(m(0))+','+str(m(1))+','+str(m(2))+','+str(m(3)))
>>for i in resultMap.take(10):print(i)
Row(2014-07-24=0),Row(2014-07-24=1),Row(2014-07-24=2),Row(2014-07-24=3)
Even if I configure the compression to uncompressed, I still see the same result.
sqlContext.setConf("spark.sql.parquet.compression.codec","uncompressed");
Could you please check how I can save the resultDF to a text file (CSV)?
Thanks
Sekar
Hi Reddy, I think I can help. Firstly, I usually convert a DF to an RDD in Python like this:
MyNewRDD = resultDF.rdd.map(list)
Now for a text file with a comma delimiter (CSV):
MyNewRDD.map(lambda V:str(V[0])+','+str(V[1])+','+str(V[2])+','+str(V[3])).saveAsTextFile("path here ... ")
The default compression is uncompressed, so you don't have to do anything.
One tip for you: when you do compression on an RDD it's a bit different from Scala.
Eg: RDD.saveAsTextFile("path",compressionCodecClass='org.apache.hadoop.io.compress.GzipCodec')
Don't worry about the syntax; you can get it from the shell by running: help(myRDD.saveAsTextFile)
This is awesome.. Many questions in the exam ask to save a file as text. I was looking for a way to save it in pyspark for a while.
DeleteHI, where can i access the data set for the scenario, plus the correct output?
ReplyDelete
ReplyDeleteWell done! It is so well written and interactive. Keep writing such brilliant piece of work. Glad i came across this post. Last night even i saw similar wonderful R Programming tutorial on youtube so you can check that too for more detailed knowledge on R Programming.https://www.youtube.com/watch?v=rgFVq_Q6VF0
ReplyDeleteVery Impressive Data Science tutorial. The content seems to be pretty exhaustive and excellent and will definitely help in learning Data Science course. I'm also a learner taken up Data Science training and I think your content has cleared some concepts of mine. While browsing for Data Science tutorials on YouTube i found this fantastic video on Data Science. Do check it out if you are interested to know more.:-https://www.youtube.com/watch?v=gXb9ZKwx29U&t=237s
Please help me convert the combineByKey Scala code to pyspark.
ReplyDeleteFor Pyspark AVRO...
There is no need to install an additional package during the exam. You are given a base CDH5.10 cluster with the default installation. That includes Spark 1.6 and Avro packages. https://www.cloudera.com/documentation/enterprise/5-10-x/topics/spark_avro.html
Ex: op.write.format("com.databricks.spark.avro").save("/home/cloudera/retail_db/orders/avro1")
opdf = sqlContext.read.format("com.databricks.spark.avro").load("/home/cloudera/retail_db/orders/avro1")
Hello Arun,
What should be the approach to solve a problem wherein we have to copy the content of a hive table "abc" to a hdfs location "xyz/abcd" and change the delimiter to "|"?
ReplyDelete4a using pyspark
from pyspark.sql import functions as F
ojS.groupBy(F.to_date(F.from_unixtime(ojS.order_date/1000)).alias("Formated_order_date"),'order_status'). \
agg(F.sum('order_item_product_price').alias("total_amount"),F.count('order_id').alias("total_orders")). \
orderBy(['Formated_order_date','order_status','total_amount','total_orders'], ascending=[0, 1, 0, 1]).show()
Hi, did you use Cloudera or HDP Sandbox for running the pyspark shell ?
For question 4c, while creating the RDD using
val rdd = joindf.map(x=> ((x(1).toString,x(3).toString),(x(8).toString.toFloat,x(0).toString)))
the result is a Dataset instead of an RDD. What could be the issue?
Because of this I am not able to use combineByKey.
ReplyDeleteHi Arun,
Can you help me in understanding your combineByKey code? Below is my query, your comments please:
https://stackoverflow.com/questions/52677045/understanding-of-combinebykey
Hi
Below is the pyspark code:
pyspark --master yarn --jars /home/cloudera/Downloads/spark-avro_2.10-2.0.1.jar
order_df = sqlContext.read. \
format("com.databricks.spark.avro"). \
load("/user/cloudera/problem1/orders")
order_df.show()
order_item_df = sqlContext.read. \
format("com.databricks.spark.avro"). \
load("/user/cloudera/problem1/order-items")
order_item_df.show()
joined_df = order_df. \
join(order_item_df, order_df.order_id == order_item_df.order_item_order_id)
from pyspark.sql.functions import *
result_df = joined_df. \
groupby(to_date(from_unixtime(col("order_date")/1000)).alias("order_date"), col("order_status")). \
agg(countDistinct(col("order_id")).alias("total_orders"), round(sum(col("order_item_subtotal")), 2).alias("total_amount")). \
orderBy(col("order_date").desc(), col("order_status"),col("total_amount").desc(), col("total_orders"))
result_df.show()
joined_df.registerTempTable("joined_table")
sqlContext.sql("select * from joined_table limit 10").show()
result_sql = sqlContext. \
sql("select to_date(from_unixtime(cast(order_date/1000 as int))) as formatted_order_date, \
order_status, round(sum(order_item_subtotal) ,2) as total_amount, \
count(distinct(order_id)) as total_orders from joined_table \
group by order_date, order_status \
order by formatted_order_date desc, order_status, total_amount desc, total_orders")
result_sql.show()
result_rdd = joined_df. \
map(lambda x : ((x[1], str(x[3])), (float(x[8]), int(x[0])))). \
combineByKey(
lambda x : (x[0], set([x[1]])),
lambda x, y : (x[0] + y[0], x[1] | set([y[1]])),
lambda x, y : (x[0] + y[0], x[1] | y[1])). \
map(lambda x :(x[0][0], x[0][1], x[1][0], len(x[1][1]))). \
toDF().orderBy(col("_1").desc(), col("_2"), col("_3").desc(), col("_4"))
result_rdd.show()
Hi Arun,
This blog is really helpful for hadoop certification... Really appreciate your effort to put this content in one place..
I gave it a try last week, but couldn't clear it due to time management. I got one question where I had to rename the order_id column as "order id" in the select clause in sqlContext.sql, and save the result as a parquet file (or avro). I was able to rename it by using <>.toDF("order id") on my result, but I was not able to write to a file with a space in the column header. Could you please help to resolve this?
Hello Arun,
I couldn't understand this logic. Could you please break down each step and explain it for me?
result_rdd = OrdersJoin.map(lambda x:((x[1], x[3]), (float(x[8]), int(x[0])))). \
combineByKey(lambda x : (x[0], set([x[1]])), \
lambda x, y : (x[0] + y[0], x[1] | set([y[1]])), \
lambda x, y : (x[0] + y[0], x[1] | y[1])). \
map(lambda x :(x[0][0], x[0][1], x[1][0], len(x[1][1]))). \
toDF().orderBy(col("_1").desc(), col("_2"), col("_3").desc(), col("_4"))
How do we know the location of the database? Is it part of the directions in the problem? Because your problem one only shows the location in hdfs where the files are imported to.
Thanks
Can you please provide the dummy data file for the problem?
ReplyDeleteHello Arun,
Thank you so much for this resource, it has been invaluable for me in preparing for the CCA175 test.
I have a question for you that isn't in this blog, I was wondering if you could help me...
I'm creating a case class to convert an RDD to a DF and was wondering what data type the date would be in my case class.
ex: case class orderClass( order_id : Integer, orderDate : ?, )
I already tried finding the answer myself, you're sort of my last resort.
Thank you,
Hey Arun,
First of all, great comprehensive guidance from your blog.
I am just a beginner with big data and wanted to ask: while attempting the certification exam, can we google stuff if we don't know..:-)
Hi Arun, thank you very very much for this blog. Your contribution is great.
Below is my code for those who practice in pyspark:
https://github.com/Karthik-NS/solutions-in-pyspark-for-problems-in-Aruns-Blog-
Is this just one question, or 8 different questions in the exam?
Thanks for sharing your valuable information and time.
Thanks Arun. The blog helped me to clear the certification successfully!
Thanks for the blog. Really helpful.
Good post! Thank you so much for sharing this lovely article. It was so good to read and useful to upgrade my understanding...
Good blog, thanks for sharing this information.
Do you have a pyspark version of the code?
to_date(from_unixtime(cast(order_date/1000 as bigint))) as order_formatted_date -- why are you doing this? And what's the datatype of order_date here?
This is what I did: is mine right? Please confirm.
data = spark.sql('''select distinct count(order_id) count,sum(order_item_subtotal) total ,order_status,date_format(order_date,'EEE') day from orders o join orderitems oi
on o.order_id = oi.order_item_order_id group by o.order_status,day order by day desc, order_status, total desc, count''')
Hi Arun,
Are you using Spark 1.6 or Spark 2, please? Is there any difference?
Thanks
Hi Arun, this blog helped me a lot to pass the certification. We know that the exam was recently updated; however, it was very useful for me. Thanks a lot.
ReplyDeleteMost informative post.Thank you.
ReplyDeleteMachine Learning training in Pallikranai Chennai
Data science training in Pallikaranai
Python Training in Pallikaranai chennai
Bigdata training in Pallikaranai chennai
Spark with ML training in Pallikaranai chennai
Btreesystems Specializes in Offering Real-time Experience IT training on AWS, ..etc.. No.1 Software (IT) Training Institute in India aws training in chennai
ReplyDeletePython training in Chennai
data science training in chennai
hadoop training in chennai
machine learning training chennai
ReplyDeleteExcellent Blog! I would like to thank for the efforts you have made in writing this post. I am hoping the same best work from you in the future as well.
Data Scientist Course in pune
Good blog and good work, thanks for sharing this information
ReplyDeleteI encountered two errors while doing Problem Scenario 1, thanks for your help :
4c. var rddResult = joinedDF.map(x=> (x(1),x(3),x(0),x(9)));
#=>>>error: java.lang.ClassNotFoundException: scala.Any
at scala.reflect.internal.util.AbstractFileClassLoader.findClass(AbstractFileClassLoader.scala:62)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
7. dataframeResult.map(x=> x(0) + "," + x(1) + "," + x(2) + "," + x(3)).saveAsTextFile("/user/osboxes/problem1/result4a-csv")
#=>>>error: value saveAsTextFile is not a member of org.apache.spark.sql.Dataset[String]
sqlResult.map(x=> x(0) + "," + x(1) + "," + x(2) + "," + x(3)).saveAsTextFile("/user/osboxes/problem1/result4b-csv")
#=>>>error: value saveAsTextFile is not a member of org.apache.spark.sql.Dataset[String]
For information, I'm on Cloudera QuickStart VM 6.3.2
Thanks for the article on hadoop, keep writing and sharing.
Thanks Arun for sharing valuable Cloudera Spark certification (CCA 175) related questions, but the syllabus changed recently.
Regards,
Venu