Problem Scenario 1

PLEASE READ THE INTRODUCTION TO THIS SERIES. CLICK ON HOME LINK AND READ THE INTRO BEFORE ATTEMPTING TO SOLVE THE PROBLEMS

Video walk through of the solution to this problem can be found here [Click here]

Click here for the video version of this series. This takes you to the youtube playlist of videos.

Problem 1:

Using sqoop, import orders table into hdfs to folders /user/cloudera/problem1/orders. File should be loaded as Avro File and use snappy compression
Using sqoop, import order_items table into hdfs to folders /user/cloudera/problem1/order-items. Files should be loaded as avro file and use snappy compression
Using Spark Scala load data at /user/cloudera/problem1/orders and /user/cloudera/problem1/orders-items items as dataframes.
Expected Intermediate Result: Order_Date , Order_status, total_orders, total_amount. In plain english, please find total orders and total amount per status per day. The result should be sorted by order date in descending, order status in ascending and total amount in descending and total orders in ascending. Aggregation should be done using below methods. However, sorting can be done using a dataframe or RDD. Perform aggregation in each of the following ways
- a). Just by using Data Frames API - here order_date should be YYYY-MM-DD format
- b). Using Spark SQL - here order_date should be YYYY-MM-DD format
- c). By using combineByKey function on RDDS -- No need of formatting order_date or total_amount
Store the result as parquet file into hdfs using gzip compression under folder
- /user/cloudera/problem1/result4a-gzip
- /user/cloudera/problem1/result4b-gzip
- /user/cloudera/problem1/result4c-gzip
Store the result as parquet file into hdfs using snappy compression under folder
- /user/cloudera/problem1/result4a-snappy
- /user/cloudera/problem1/result4b-snappy
- /user/cloudera/problem1/result4c-snappy
Store the result as CSV file into hdfs using No compression under folder
- /user/cloudera/problem1/result4a-csv
- /user/cloudera/problem1/result4b-csv
- /user/cloudera/problem1/result4c-csv
create a mysql table named result and load data from /user/cloudera/problem1/result4a-csv to mysql table named result

Solution:

Try your best to solve the above scenario without going through the solution below. If you could then use the solution to compare your result. If you could not then I strongly recommend that you go through the concepts again (this time in more depth). Each step below provides a solution to the points mentioned in the Problem Scenario.

Step 1:

sqoop import \
--connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
--username retail_dba \
--password cloudera \
--table orders \
--compress \
--compression-codec org.apache.hadoop.io.compress.SnappyCodec \
--target-dir /user/cloudera/problem1/orders \
--as-avrodatafile;

Step 2:

sqoop import \
--connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
--username retail_dba \
--password cloudera \
--table order_items \
--compress \
--compression-codec org.apache.hadoop.io.compress.SnappyCodec \
--target-dir /user/cloudera/problem1/order-items \
--as-avrodatafile;

Step 3:

import com.databricks.spark.avro._;
var ordersDF = sqlContext.read.avro("/user/cloudera/problem1/orders");
var orderItemDF = sqlContext.read.avro("/user/cloudera/problem1/order-items");

Step 4:

var joinedOrderDataDF = ordersDF
.join(orderItemDF,ordersDF("order_id")===orderItemDF("order_item_order_id"))

Step 4a:

import org.apache.spark.sql.functions._;

var dataFrameResult =
dataFrameResult.show();

joinedOrderDataDF.

groupBy(to_date(from_unixtime(col("order_date")/1000)).alias("order_formatted_date"),col("order_status")).

agg(round(sum("order_item_subtotal"),2).alias("total_amount"),countDistinct("order_id").alias("total_orders")).

orderBy(col("order_formatted_date").desc,col("order_status"),col("total_amount").desc,col("total_orders"));

Step 4b:

joinedOrderDataDF.registerTempTable("order_joined");

var sqlResult = sqlContext.sql("select to_date(from_unixtime(cast(order_date/1000 as bigint))) as order_formatted_date, order_status, cast(sum(order_item_subtotal) as DECIMAL (10,2)) as total_amount, count(distinct(order_id)) as total_orders from order_joined group by to_date(from_unixtime(cast(order_date/1000 as bigint))), order_status order by order_formatted_date desc,order_status,total_amount desc, total_orders");

sqlResult.show();

Step 4c:

var comByKeyResult =
joinedOrderDataDF.
map(x=> ((x(1).toString,x(3).toString),(x(8).toString.toFloat,x(0).toString))).
combineByKey((x:(Float, String))=>(x._1,Set(x._2)),
(x:(Float,Set[String]),y:(Float,String))=>(x._1 + y._1,x._2+y._2),
(x:(Float,Set[String]),y:(Float,Set[String]))=>(x._1+y._1,x._2++y._2)).
map(x=> (x._1._1,x._1._2,x._2._1,x._2._2.size)).
toDF().
orderBy(col("_1").desc,col("_2"),col("_3").desc,col("_4"));

comByKeyResult.show();

Step 5:

sqlContext.setConf("spark.sql.parquet.compression.codec","gzip");
dataFrameResult.write.parquet("/user/cloudera/problem1/result4a-gzip");
sqlResult.write.parquet("/user/cloudera/problem1/result4b-gzip");
comByKeyResult.write.parquet("/user/cloudera/problem1/result4c-gzip");

Step 6:

sqlContext.setConf("spark.sql.parquet.compression.codec","snappy");
dataFrameResult.write.parquet("/user/cloudera/problem1/result4a-snappy");
sqlResult.write.parquet("/user/cloudera/problem1/result4b-snappy");
comByKeyResult.write.parquet("/user/cloudera/problem1/result4c-snappy");

Step 7:

dataFrameResult.map(x=> x(0) + "," + x(1) + "," + x(2) + "," + x(3)).saveAsTextFile("/user/cloudera/problem1/result4a-csv")
sqlResult.map(x=> x(0) + "," + x(1) + "," + x(2) + "," + x(3)).saveAsTextFile("/user/cloudera/problem1/result4b-csv")
comByKeyResult.map(x=> x(0) + "," + x(1) + "," + x(2) + "," + x(3)).saveAsTextFile("/user/cloudera/problem1/result4c-csv")

Step 8:

a) login to MYSQL using below : mysql -h localhost -u retail_dba -p
(when prompted password use cloudera or any password that you have currently set)

b) create table retail_db.result(order_date varchar(255) not null,order_status varchar(255) not null, total_orders int, total_amount numeric, constraint pk_order_result primary key (order_date,order_status));

c)

sqoop export \
--table result \
--connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
--username retail_dba \
--password cloudera \
--export-dir "/user/cloudera/problem1/result4a-csv" \
--columns "order_date,order_status,total_amount,total_orders"

186 comments:

UnknownMay 6, 2017 at 12:48 PM
Are you covering all possible problems in each scenario?
ReplyDelete
Replies
UnknownMay 6, 2017 at 12:54 PM
It would be if you could prepare each problem scenarios possibly inclusive of all exam objectives in future.. may be like 1- sqoop import, 1 -sqoop export/1- flume, 8- sprak questions.. ( 3- spark - etl, 4-spark sql, 1-spark configuraton). This structure would help us to consider these scenarios as real mock exams with solutions..

Thanks for your efforts on this so far..
ReplyDelete
Replies
SVMay 8, 2017 at 1:08 PM
where can we get these sample datasets to come up with solution and test it on our vm? Thx
ReplyDelete
Replies
SharadMay 21, 2017 at 11:51 PM
Hi Arun, are you covering problems based on the new syllabus by cloudera?
ReplyDelete
Replies
UnknownMay 22, 2017 at 3:34 AM
Yes. Please check the prep plan post where I explained the same in detail
ReplyDelete
Replies
BereketJune 1, 2017 at 12:53 PM
good job it is helpful blog
ReplyDelete
Replies
BereketJune 1, 2017 at 12:54 PM
i tied step4c: to round the big decimal by adding this format "%.2f".format(x)

joinorders.map(x=>((x(1).toString,x(3).toString),(x(8).toString.toDouble,x(0).toString.toInt))).combineByKey((x:(Double,Int))=>(x._1,Set(x._2)),(x:(Double,Set[Int]),y:(Double,Int))=>(x._1+y._1,x._2 + y._2),(x:(Double,Set[Int]),y:(Double,Set[Int]))=>(x._1+ y._1,x._2++ y._2)).map(x=>(x._1._1,x._1._2,"%.3f".format(x._2._1),x._2._2.size))
ReplyDelete
Replies
Sarcos : lider de enlace y logisticaJune 1, 2017 at 2:01 PM
hello Arun

thanks for your blog , really useful, just one question. why do you use combineByKey instead of group by..because performance? or the exam ask for use it.. (still for me os complicated to use and I dont find eny site who can explain better)
ReplyDelete
Replies
bereketJune 7, 2017 at 9:09 AM
thank you Arun. your blog helped me to clear exam cca175.
ReplyDelete
Replies
Satish Kumar KakolluJune 9, 2017 at 4:03 PM
Hi Arun,

While doing sqoop export do we have to takecare of anything say suppose my table in mysql is having 7 rows when I describe I see table with 7 columns with one field is having not null . I have to export using hdfs file . I used below query it did not work our please correct me

sqoop import --connect jdbc:mysql://quickstart.cloudera:3306/retail_db --username --password --table departments --export-dir /hdfs/ocation --bath --columns colunames --batch --outdir output

anything wrong in the about command my job is failing with some file format issue . Please help me

Thanks,
Satish.
ReplyDelete
Replies
Satish Kumar KakolluJune 9, 2017 at 5:00 PM
Hi Arun ,

Miss typed it is export.

Thanks,
Satish.
ReplyDelete
Replies
UnknownJune 10, 2017 at 12:23 AM
Hi Arun,

I took the exam yesterday and I met a problem when loading multiple files(parquet,avro) into the spark shell. However, today when I try directly load them into the spark shell just like

avroLoadingTest = sqlContext.read.format("com.databricks.spark.avro").load("/user/raku/test/avroLoadingTest")

everything is fine. Now I am confused by this thing, because before the test I had tried loading single file only, so I am not sure is loading multiple files in a directory in one time not supported or I just got some grammar mistake in the exam. Is it alright to use a single sentence like above to load multiple files in a directory into spark? Thank you.

P.S. The scenario you gave are really helpful
ReplyDelete
Replies
bereketJune 12, 2017 at 8:50 AM
Hi Arun if it is possible can you have spark Databricks exam certification on your blog.
ReplyDelete
Replies
UnknownJune 20, 2017 at 10:41 PM
Hi Arun thanks for the post ...can i know this is how exactly question is exam?? i mean each question will have at least 10 task to complete? please let me know i'm going to take exam in next month
ReplyDelete
Replies
UnknownJune 21, 2017 at 3:06 AM
each question will have one or two tasks to complete.
ReplyDelete
Replies
UnknownJune 21, 2017 at 4:28 AM
Hi Arun,

Are you planning to more problem scenarios or is the 7th one the last scenario?Your blog is really helpful.Thanks for creating the problem scenarios
ReplyDelete
Replies
UnknownJune 22, 2017 at 11:05 AM
This comment has been removed by the author.
ReplyDelete
Replies
UnknownJune 22, 2017 at 11:16 AM
This comment has been removed by the author.
ReplyDelete
Replies
UnknownJuly 1, 2017 at 6:40 PM
Hi Arun,

Really a huge effort you are doing here so big thank to you.

i have two questions:

1/how can i export parquet file into mysql using sqoop? you did the export for the csv file, but when i tried the same command using the parquet directory it gave me some exceptions.

2/Are you sure the queries in 4a, 4b and 4c give the same result? its not the case for me.

Thanks!
ReplyDelete
Replies
UnknownJuly 1, 2017 at 7:40 PM
Below are responses to both questions

1. Create a hive table that is backed by parquet file, then use hive table as source
2. Watch the video
ReplyDelete
Replies
Murali RachakondaJuly 18, 2017 at 7:13 PM
Hi Arun, Thanks for a great block. I was going through the Sqoop documentation "sqoop-1.4.6-cdh5.10.1". I have two questions 1) Do we need to by-heart compression technique names because it is not mentioned in the documentation 2) Is the compression technique is mentioned in any other document? Thanks in advance.
ReplyDelete
Replies
UnknownJuly 18, 2017 at 8:06 PM
This comment has been removed by the author.
ReplyDelete
Replies
KarthikaJuly 19, 2017 at 8:21 AM
Hello Arun,
Thanks for the posts . They are really helpful.
But I am confused with analyzing the right way to solve a given problem as there could be many solutions possible. For example , creating a hive table in say parquet file format can be done through Hive or sqoop import . How do I know which is the right solution during certification. Please suggest. Suggestions from people with certification experience will be helpful too.
Thanks once again for all the effort .
ReplyDelete
Replies
ssrkJuly 19, 2017 at 3:42 PM
Hi Arun,
Thank you for the blog. The questions are of great detail and the questions were well structured
ReplyDelete
Replies
UnknownJuly 23, 2017 at 1:17 PM
Hello Arun,
I have booked exam and waiting to take it once preperation is done. Am confusing a lot with scala i decided to learn more on dataframes as you have provided in problrm1 and 2 as it is easy compare to scala is it ok to choose dataframes in exam for all type of problems can you add some more videos regarding dataframes. that would be greatly appreciated.
ReplyDelete
Replies
Murali RachakondaJuly 27, 2017 at 2:35 PM
This comment has been removed by the author.
ReplyDelete
Replies
UnknownJuly 28, 2017 at 12:18 PM
Hi Arun,

thanks for your blog.
from exam point of view should we be aware of all the below 3 types? If i'm good in sql, can i ignore a &c?? Please advise..

a). Just by using Data Frames API - here order_date should be YYYY-MM-DD format
b). Using Spark SQL - here order_date should be YYYY-MM-DD format
c). By using combineByKey function on RDDS -- No need of formatting order_date or total_amount
ReplyDelete
Replies
UnknownJuly 30, 2017 at 5:45 AM
Hello , I have been trying the problems scenarios given in your blog . but could you please provide the sample files for order-items so we can try in our laptop .for example lets say problem 1 sqooping order-items data . So I need sample data for this. Appreciate your response :)
ReplyDelete
Replies
UnknownJuly 30, 2017 at 5:51 AM
Buddy, you need to have CDH downloaded from cloudera. CDH has all the data in mysql database that comes with it. Watch the video for an understanding how that data is pulled from mysql to hdfs.
ReplyDelete
Replies
UnknownAugust 5, 2017 at 2:10 AM
This comment has been removed by the author.
ReplyDelete
Replies
UnknownAugust 5, 2017 at 2:23 AM
Hello please which editor is use during CCA175 exam. Thank you
ReplyDelete
Replies
UnknownAugust 20, 2017 at 5:47 AM
Thanks for your blog.
I have one question. Is it mandatory to practice the questions on Cloudera Platform only. I have experience working on Hortonworks platform. Can I directly attempt the exam by practicing on Hortonworks Platform
ReplyDelete
Replies
UnknownAugust 23, 2017 at 1:55 PM
var dataFrameResult =
dataFrameResult.show();

ERROR IN MY MACHINE PLEASE REPLY ME
ReplyDelete
Replies
UnknownAugust 27, 2017 at 6:58 PM
Hi Arun,

I have cleared CCA spark and Hadoop Developer because of your blogspot.

Thank you very much!! Hope to see more problem scenarios more like this.

Regards,
Aswin Ramakrishnan
ReplyDelete
Replies
UnknownSeptember 15, 2017 at 2:55 AM
Hi Arun sir Can I get the data of problems which you have explained.
ReplyDelete
Replies
UnknownOctober 4, 2017 at 9:31 AM
Beautiful Blog.........Great content presented in the best possible way that the entire big data knowledge can be streamlined
ReplyDelete
Replies
AnnoyedOctober 27, 2017 at 7:54 PM
This comment has been removed by the author.
ReplyDelete
Replies
AnnoyedOctober 27, 2017 at 7:56 PM
Hello Arun - I think you have done the coding in Scala in all your videos so if someone wants to take the exam using Python do you have the code snippet for the same ? If yes can you please share the same -Auro
ReplyDelete
Replies
UnknownNovember 7, 2017 at 2:00 AM
Hi Arun,

I have booked exam and holding up to take it once planning is finished. Am mistaking a considerable measure for Scala I chose to take in more on information outlines as you have given in issue and 2 as it is simple contrast with Scala is it alright to pick information outlines in exam for all kind of issues would you be able to include some more recordings in regards to information outlines. that would be significantly refreshing.

Regards,
Trep Helix
ReplyDelete
Replies
crescendoNovember 7, 2017 at 11:55 PM
Hi Arun,

In certificate exam, this avro databrocks will work in the spark-shell. Because as per cloudera they do not provide "Databricks API" in the CDH install

import com.databricks.spark.avro._;
ReplyDelete
Replies
UnknownNovember 8, 2017 at 3:31 AM
Hi Arun,
I've WATCHED YOUR YOUTUBE VIDEO but I still DON'T UNDERSTAND THIS part of 4c exercise. I'm totally lost with the combineByKeyResult:

combineByKey((x:(Float, String))=>(x._1,Set(x._2)),
(x:(Float,Set[String]),y:(Float,String))=>(x._1 + y._1,x._2+y._2),
(x:(Float,Set[String]),y:(Float,Set[String]))=>(x._1+y._1,x._2++y._2)).
map(x=> (x._1._1,x._1._2,x._2._1,x._2._2.size))

Completeley lost with the combineByKey. Could you please explain me what you are doing here?
I've got my exam in two days.

Thank you
ReplyDelete
Replies
ShivaniNovember 10, 2017 at 2:18 AM
HI Arun,

This is really informative blog..

But CCA-175 exam content is changed. Do you help with new playlist for the same?
ReplyDelete
Replies
JiteshNovember 15, 2017 at 2:28 AM
This comment has been removed by the author.
ReplyDelete
Replies
JiteshNovember 15, 2017 at 2:31 AM
Hi Arun/Venkat,

Can we submit more than one file as output because "spark.sql.shuffle.partitions" value is more than 1, by default. OR we have submit one output file only.
ReplyDelete
Replies
UnknownNovember 15, 2017 at 3:28 AM
Hi Jitesh,

By default "spark.sql.shuffle.partitions" is set to 200. This can be even configured to value 1 based performance tuning options.
ReplyDelete
Replies
JiteshNovember 16, 2017 at 4:48 AM
Hi Venkat,

Thanks for your answer, But my question is still same, can we submit more than file? Because it will save time during exam.
ReplyDelete
Replies
UnknownNovember 16, 2017 at 4:59 AM
If you are expecting yes or no kind of answer. Answer is YES.
ReplyDelete
Replies
UnknownNovember 27, 2017 at 2:40 PM
This comment has been removed by the author.
ReplyDelete
Replies
UnknownNovember 27, 2017 at 2:43 PM
I am trying to use aggregateByKey instead of combineByKey.but getting issue with count rest all looking good...could you please let me know what i am missing here,

val ordersDF = sqlContext.read.avro("/user/imthiyas90/problem1/orders")
val orderitemDF = sqlContext.read.avro("/user/imthiyas90/problem1/order_items")

ordersDF.registerTempTable("orders")

orderitemDF.registerTempTable("order_items")

val joinedDF = ordersDF.join(orderitemDF, ordersDF("order_id") === orderitemDF("order_item_order_id"))

val combine_agg = joinedDF.map(e => ((e(1).toString,e(3).toString),(e(8).toString.toDouble,e(0).toString.toInt))).aggregateByKey((0.0,0))((x:(Double,Int),y:(Double,Int)) =>(x._1+y._1,x._2+1),(x:(Double,Int),y:(Double,Int)) =>(x._1+y._1,x._2+y._2)).map(x => (x._1._1,x._1._2,x._2._1,x._2._2)).toDF.orderBy(col("_1").desc,col("_2"),col("_3").desc,col("_4"))
ReplyDelete
Replies
UnknownDecember 1, 2017 at 3:07 AM
This comment has been removed by the author.
ReplyDelete
Replies
UnknownDecember 1, 2017 at 3:09 AM
Hi Arun,
Thanks for providing such a wonderful blog for those who aspire to clear CCA Spark And Hadoop Certification. But what I found is all of the solutions are explained in scala, I am learning Pyspark and don't know scala. Could you please tell if you have the same solutions in Python as well?
Thanks
ReplyDelete
Replies
UnknownDecember 3, 2017 at 7:54 PM
Hi Arun,
I tried from_unixtime. But is returning null values.
ReplyDelete
Replies
soukya reddyDecember 11, 2017 at 1:45 AM
Thanks for good info.It's very good blog
Big data Hadop Online course Hyderabad
ReplyDelete
Replies
UnknownDecember 28, 2017 at 7:03 AM
Hi, i appreciate your great efforts! Is it necessary to round the sum of order_item_subtotal to 2 decimal values? They have not mentioned in the question.
ReplyDelete
Replies
ausgeekJanuary 1, 2018 at 4:37 PM
when it comes to saving as csv can we do following steps:-

val resultRDD = resultDF.rdd.
map(_.toString().replace("[","").replace("]","")).
saveAsTextFile("/user/cloudera/problem1/result4a-csv")
ReplyDelete
Replies
VamsiJanuary 2, 2018 at 7:23 PM
Hi Arun,

Thanks a ton for these exercises and the videos.

I have cleared my CCA 175 today. These exercises has helped a lot in my preparation and gave the confidence to take up the test.

Thanks
Vamsi
ReplyDelete
Replies
UnknownJanuary 12, 2018 at 1:46 AM
It is nice blog keep updating your blog and i am waiting for your next update also Big Data Hadoop Online course
ReplyDelete
Replies
Manish GuptaJanuary 21, 2018 at 9:36 AM
Arun from_unixtime does not work in pyspark . do we have any solution for same ?
ReplyDelete
Replies
AnnoyedJanuary 24, 2018 at 12:50 PM
Can we not register the Dataframes as temp tables then we can use SQL ? Can you please explain .
ReplyDelete
Replies
Osama BahgatFebruary 7, 2018 at 1:02 PM
Hi Arun,
Thanks for the great effort.
In step 3, you've imported avro files from databricks, will this import statement be given to me during the exam or I need to memorize it as it's not mentioned in spark docs and as far as i know i'll only have access to spark docs.
Thanks
ReplyDelete
Replies
VIJAY PARMARFebruary 8, 2018 at 7:28 AM
hi arun.. your video tutorials is good. but from where i can get sample data files that you used in your videos.
i want to try out. please give me path from which i can download that files and folders and try out on my local machine
ReplyDelete
Replies
UnknownMarch 2, 2018 at 6:12 AM
Hi Arun, how different will be the problem scenario of CCP DE575 from what you have given here
ReplyDelete
Replies
UnknownMarch 31, 2018 at 8:10 PM
Hi Please can you share top 20 rows of result set to compare our output.
ReplyDelete
Replies
UnknownApril 4, 2018 at 3:46 AM
Nice post thank you for sharing keep updating the postBig data hadoop online Course Hyderabad
ReplyDelete
Replies
rohithApril 30, 2018 at 6:07 PM
to_date(from_unixtime(cast(order_date/1000 as bigint))) as order_formatted_date is giving null values but when I just query o.order_date from orders temp table it is giving correct value which is of data type String .Any help in this situation?
ReplyDelete
Replies
rohithMay 1, 2018 at 7:36 AM
The order date should be long data type and if we use to_date(from_unixtime(cast(order_date/1000 as bigint))) then it will correct output in YYYY-MM-DD format
ReplyDelete
Replies
zaheer shaikMay 13, 2018 at 7:11 AM
what all the api documentation available during exam in VM itself?
ReplyDelete
Replies
TejutejuMay 23, 2018 at 7:22 AM
It was really a nice article and i was really impressed by reading this Big data hadoop online Course Hyderabad

ReplyDelete
Replies
UnknownMay 24, 2018 at 3:33 AM
great efforts!!. One suggestion from my end .. Please provide the solutions in python as well .. it would be great help.
ReplyDelete
Replies
UnknownMay 29, 2018 at 11:10 PM
Hi,
In the problem scenario 1, problem 1 when have been asked to use Snappy compression. The code which you have given contains "--compression-codec org.apache.hadoop.io.compress.SnappyCodec". Would like to know, are we supposed to remember org.apache.... to use the Snappy compression? or we can get it from hadoop or even from user guide?
ReplyDelete
Replies
UnknownMay 30, 2018 at 11:08 AM
This comment has been removed by the author.
ReplyDelete
Replies
UnknownJune 2, 2018 at 10:38 AM
This comment has been removed by the author.
ReplyDelete
Replies
AnonymousJune 11, 2018 at 2:24 PM
3.Using Spark Scala load data at /user/cloudera/problem1/orders and /user/cloudera/problem1/orders-items items as dataframes.

can this be done as

ordersDF = sc.textFile("xxx/xxx")
ordersDF.toDF
ReplyDelete
Replies
UnknownJune 12, 2018 at 2:07 PM
Hello,
Thank you for this amazing blog. I am bit confuse is this the entire demo question paper or just one question in the exam?
ReplyDelete
Replies
UnknownJune 12, 2018 at 8:50 PM
Hi Arun,
Thank for your video, Request you to please add some dedicated videos upon groupByKey,CombineByKey,reduceByKey....functions
ReplyDelete
Replies
UnknownJune 13, 2018 at 7:18 PM
Good Post! Thank you so much for sharing this pretty post, it was so good to read and useful to improve my knowledge as updated one, keep blogging.

Big Data Hadoop Training in electronic city, Bangalore | #Big Data Hadoop Training in electronic city, Bangalore
ReplyDelete
Replies
rmouniakJune 14, 2018 at 4:18 AM
Thanks for sharing Good Information
Big data Hadoop Online Training Hyderabad
ReplyDelete
Replies
AnonymousJune 17, 2018 at 9:30 AM
Hi,
I am not able to follow the combineByKey step.

1. in the second lambda, is the output of first lambda treated as the input for second lambda?
2. what will be value of x, y in the second lambda.

I did watched some videos to understand combineByKey. ALso please share any good videos for combineByKey.

ReplyDelete
Replies
Aruna AsokanJune 27, 2018 at 10:34 PM
Hi Arun,

I have really motivated after seen your preparation plan for certification. you have done a very good job. I am following your plan and started trying hands-on. My humble request is like if you can provide the same code in python instead of scale it would be really helpful for me to understand since i have worked with python so far. expecting some materials as well as more scenarios.

Thanks,
Aruna Asokan
ReplyDelete
Replies
UnknownJuly 1, 2018 at 11:49 PM
This comment has been removed by the author.
ReplyDelete
Replies
UnknownJuly 1, 2018 at 11:52 PM
This comment has been removed by the author.
ReplyDelete
Replies
UnknownJuly 1, 2018 at 11:55 PM
Hi Arun,

Thanks for the blog and usful questions.

I'm trying to resolve the problem1 using pyspark and stuck at point 7 solution.

I'm trying to transform the sqlResult(in my below code "resultDF") DF to RDD using the map function ,but ending up some numeric valies in the map variable.Even I save as text file like below, I'm seeing some junk data as in bold below.

>>resultDF. \
map(lambda m: str(m(0))+','+str(m(1))+','+str(m(2))+','+str(m(3))). \
coalesce(4). \
saveAsTextFile("/user/....")

>>resultMap = resultDF. \
map(lambda m: str(m(0))+','+str(m(1))+','+str(m(2))+','+str(m(3)))

>>for i in resultMap.take(10):print(i)

Row(2014-07-24=0),Row(2014-07-24=1),Row(2014-07-24=2),Row(2014-07-24=3)

Even I tried to config the compression to uncompressed , but still I am seeing same result.

sqlContext.setConf("spark.sql.parquet.compression.codec","uncompressed");

could you please check how can I save the resultDF to textFile (csv).

Thanks
Sekar

ReplyDelete
Replies
UnknownJuly 6, 2018 at 12:28 AM
HI, where can i access the data set for the scenario, plus the correct output?
ReplyDelete
Replies
UnknownJuly 11, 2018 at 4:40 AM

Well done! It is so well written and interactive. Keep writing such brilliant piece of work. Glad i came across this post. Last night even i saw similar wonderful R Programming tutorial on youtube so you can check that too for more detailed knowledge on R Programming.https://www.youtube.com/watch?v=rgFVq_Q6VF0
ReplyDelete
Replies
UnknownJuly 17, 2018 at 1:37 AM

Very Impressive Data Science tutorial. The content seems to be pretty exhaustive and excellent and will definitely help in learning Data Science course. I'm also a learner taken up Data Science training and I think your content has cleared some concepts of mine. While browsing for Data Science tutorials on YouTube i found this fantastic video on Data Science. Do check it out if you are interested to know more.:-https://www.youtube.com/watch?v=gXb9ZKwx29U&t=237s
ReplyDelete
Replies
SevlarajJuly 27, 2018 at 5:00 AM
please help me convert combineByKey scala code to pyspark
ReplyDelete
Replies
TejutejuJuly 31, 2018 at 6:10 AM
After reading this blog i very strong in this topics and this blog really helpful to all Big data hadoop online Course Bangalore
ReplyDelete
Replies
SureshAugust 6, 2018 at 10:23 PM
This comment has been removed by the author.
ReplyDelete
Replies
Lee SmithAugust 15, 2018 at 2:29 AM
It was looking like a difficult challenge to pass CCA Spark and Hadoop Developer Exam - Performance Based Scenarios Exam because I was short of time and was not having any proper material. Then one of my friends suggested me to use CCA175 Braindumps for the preparation of my certification. I am thankful to Realexamdumps.com for free demo that won my trust. I found it the same as I was told.
ReplyDelete
Replies
RoseAugust 21, 2018 at 12:03 PM
This comment has been removed by the author.
ReplyDelete
Replies
UnknownSeptember 5, 2018 at 9:47 AM
For Pyspark AVRO...
There is no need to install an additional package during the exam. You are given a base CDH5.10 cluster with the default installation. That includes Spark 1.6 and Avro packages. https://www.cloudera.com/documentation/enterprise/5-10-x/topics/spark_avro.html

Ex: op.write.format("com.databricks.spark.avro").save("/home/cloudera/retail_db/orders/avro1")
opdf = sqlContext.read.format("com.databricks.spark.avro").load("/home/cloudera/retail_db/orders/avro1")
ReplyDelete
Replies
UnknownSeptember 7, 2018 at 1:30 PM
Hello Arun,

What should be the approach to solve a problem wherein we have to copy the content of a hive table "abc" to a hdfs location "xyz/abcd" and change the delimiter to "|"?
ReplyDelete
Replies
maniishSeptember 12, 2018 at 12:19 AM
It was not my first attempt for this certification but I am sure if I had chosen CCA175 braindumps before then it would be my first attempt with ultimate success. Like everyone I am very happy over my success but more than it I am thank full to Dumps4download for all the help and support they provided me throughout my preparation.
ReplyDelete
Replies
UnknownSeptember 13, 2018 at 12:41 PM
4a using pyspark

from pyspark.sql import functions as F

ojS.groupBy(F.to_date(F.from_unixtime(ojS.order_date/1000)).alias("Formated_order_date"),'order_status'). \
agg(F.sum('order_item_product_price').alias("total_amount"),F.count('order_id').alias("total_orders")). \
orderBy(['Formated_order_date','order_status','total_amount','total_orders'], ascending=[0, 1, 0, 1]).show()
ReplyDelete
Replies
AnonymousSeptember 18, 2018 at 8:24 PM
for the question 4c, while creating the rdd using
val rdd = joindf.map(x=> ((x(1).toString,x(3).toString),(x(8).toString.toFloat,x(0).toString)))

rdd is resulting as dataset instead of rdd. what could be the issue?

because of this I am not able to use combineByKey.

ReplyDelete
Replies
TejutejuSeptember 25, 2018 at 1:49 AM
Really nice blog post.provided a helpful information.I hope that you will post more updates like this Big Data Hadoop Online Training Bangalore

ReplyDelete
Replies
Siddhesh KalgaonkarOctober 6, 2018 at 1:20 AM
Hi Arun,

Can you help me in understanding your combineByKey code. Below is my query . your comments pls
https://stackoverflow.com/questions/52677045/understanding-of-combinebykey
ReplyDelete
Replies
SevlarajNovember 23, 2018 at 11:35 PM
Hi
Below are pyspark code

pyspark --master yarn --jars /home/cloudera/Downloads/spark-avro_2.10-2.0.1.jar

order_df = sqlContext.read. \
format("com.databricks.spark.avro"). \
load("/user/cloudera/problem1/orders")

order_df.show()

order_item_df = sqlContext.read. \
format("com.databricks.spark.avro"). \
load("/user/cloudera/problem1/order-items")

order_item_df.show()

joined_df = order_df. \
join(order_item_df, order_df.order_id == order_item_df.order_item_order_id)

from pyspark.sql.functions import *

result_df = joined_df. \
groupby(to_date(from_unixtime(col("order_date")/1000)).alias("order_date"), col("order_status")). \
agg(countDistinct(col("order_id")).alias("total_orders"), round(sum(col("order_item_subtotal")), 2).alias("total_amount")). \
orderBy(col("order_date").desc(), col("order_status"),col("total_amount").desc(), col("total_orders"))

result_df.show()

joined_df.registerTempTable("joined_table")

sqlContext.sql("select * from joined_table limit 10").show()

result_sql = sqlContext. \
sql("select to_date(from_unixtime(cast(order_date/1000 as int))) as formatted_order_date, \
order_status, round(sum(order_item_subtotal) ,2) as total_amount, \
count(distinct(order_id)) as total_orders from joined_table \
group by order_date, order_status \
order by formatted_order_date desc, order_status, total_amount desc, total_orders")

result_sql.show()

result_rdd = joined_df. \
map(lambda x : ((x[1], str(x[3])), (float(x[8]), int(x[0])))). \
combineByKey(
lambda x : (x[0], set([x[1]])),
lambda x, y : (x[0] + y[0], x[1] | set([y[1]])),
lambda x, y : (x[0] + y[0], x[1] | y[1])). \
map(lambda x :(x[0][0], x[0][1], x[1][0], len(x[1][1]))). \
toDF().orderBy(col("_1").desc(), col("_2"), col("_3").desc(), col("_4"))

result_rdd.show()
ReplyDelete
Replies
UnknownDecember 23, 2018 at 5:03 PM
Hi Arun,

This blog is really helpful for hadoop certification... Really appreciate your effort to put these content at one place..

I gave a try last week, but couldn't clear due to time management. I got one question where I have to rename order_id column as "order id" in select clause in sqlContext.sql, and save the result as parquet file (or avro), I was able to rename by using <>.toDF("order id") to my result, but I was not able to write to file with space in column header. Could you please help to resolve this.
ReplyDelete
Replies
Suresh BabuFebruary 8, 2019 at 6:16 AM
Hello Arun,

I Couldn't understand this logic. could you please break each step and explain for me.
result_rdd = OrdersJoin.map(lambda x:((x[1], x[3]), (float(x[8]), int(x[0])))). \
combineByKey(lambda x : (x[0], set([x[1]])), \
lambda x, y : (x[0] + y[0], x[1] | set([y[1]])), \
lambda x, y : (x[0] + y[0], x[1] | y[1])). \
map(lambda x :(x[0][0], x[0][1], x[1][0], len(x[1][1]))). \
toDF().orderBy(col("_1").desc(), col("_2"), col("_3").desc(), col("_4"))
ReplyDelete
Replies
UnknownFebruary 17, 2019 at 3:32 PM
how do we know the location of the database. Is it part of the direction in the problem because your problem one only shows the location in hdfs where the files is imported to.
Thanks
ReplyDelete
Replies
UnknownApril 11, 2019 at 11:44 AM
can you please provide the dummy data file for the problem?
ReplyDelete
Replies
FROSTYApril 11, 2019 at 8:10 PM
Hello Arun,

Thank you so much for this resource, it has been invaluable for me in preparing for the CCA175 test.

I have a question for you that isn't in this blog, I was wondering if you could help me...

I'm creating a case class to convert rdd to df and was wondering what data type date would be in my case class.

ex: case class orderClass( order_id : Integer, orderDate : ?, )

I already tried finding the answer myself, you're sort of my last resort.

Thank you,
ReplyDelete
Replies
aldrin maxxApril 25, 2019 at 12:06 PM
Hey Arun,

First of all great on comprehensive guidance from your blog.

I am just a beginner with big data and wanted to ask that while attempting the certification exam can we google stuff if we dont know..:-)
ReplyDelete
Replies
UnknownJuly 10, 2019 at 3:33 AM
Hi Arun thank you very very much for this blog. your contribution is great.

Below is my code for who practice in pyspark:

https://github.com/Karthik-NS/solutions-in-pyspark-for-problems-in-Aruns-Blog-

ReplyDelete
Replies
AdeelJuly 26, 2019 at 8:07 AM
Nice article admin thanks for share your atricle keep share your knowledge i am waiting for your new post check mens winter jackets polo shirts kindly review and reply me
ReplyDelete
Replies
AdeelJuly 26, 2019 at 8:09 AM
Nice article admin thanks for share your atricle keep share your knowledge i am waiting for your new post check mens winter jackets polo shirts kindly review and reply me
ReplyDelete
Replies
VISHALAugust 2, 2019 at 7:22 AM
Is this just one question, or 8 different question in exam.
ReplyDelete
Replies
manishaAugust 9, 2019 at 4:49 AM
Thanks for sharing your valuable information and time.
Hadoop Training in Delhi
Hadoop Training institute in Delhi
ReplyDelete
Replies
UnknownAugust 22, 2019 at 2:01 AM
Thanks Arun. The blog helped me to clear the certification successfully!
ReplyDelete
Replies
janicejayAugust 27, 2019 at 2:24 AM
I could not imagine such an amazing study experience but it helped me to secure my desired grades in my coveted IT certification. 7004 Dumps really changed my professional life by giving me a gift in the form of successful career opportunity. I could have hardly passed my examination without taking help from Avaya 7004 Dumps. I say thanks to Dumpspass4sure.com for all the help.
https://www.dumpspass4sure.com/avaya/7004-dumps.html
ReplyDelete
Replies
DumpsPass4sureSeptember 29, 2019 at 11:07 PM
I got success at the first attempt with Pass4sure CompTIA dumps and am happy for the way it happened. I chose this PDF guide quickly after looking into free demo questions. I think everyone must download free demo questions to take CompTIA exam material as their helping guide.
ReplyDelete
Replies
AnonymousNovember 17, 2019 at 7:41 PM
tHANKS FOR THE bLOG. REALLY HELPFUL
ReplyDelete
Replies
NikishaDecember 3, 2019 at 11:25 PM
Though it is challenging to transform the data from one computer to another, that is, data migration , but with the help of this program, I had not felt any need to change the source code for uploading the several database tables.
ReplyDelete
Replies
TechnogeekscsDecember 10, 2019 at 3:09 AM
Good post!Thank you so much for sharing this lovely article.It was so good to read and useful to upgrade my understanding...
Data Science Course in Pune
ReplyDelete
Replies
yamunaJanuary 20, 2020 at 4:11 AM
Good blog, thanks for sharing this information.
Spark Scala Training
ReplyDelete
Replies
Alfred AvinaJanuary 28, 2020 at 12:43 AM
The main motive of the Automated big data engineering is to spread the knowledge so that they can give more big data engineers to the world.

ReplyDelete
Replies
UnknownMarch 5, 2020 at 5:53 PM
do you have pyspark version of the code?
ReplyDelete
Replies
UnknownApril 9, 2020 at 9:38 AM
to_date(from_unixtime(cast(order_date/1000 as bigint))) as order_formatted_date why are you doing this? and what's the datatype of order_date here?
This is what I did: is mine right? Please confirm
data = spark.sql('''select distinct count(order_id) count,sum(order_item_subtotal) total ,order_status,date_format(order_date,'EEE') day from orders o join orderitems oi
on o.order_id = oi.order_item_order_id group by o.order_status,day order by day desc, order_status, total desc, count''')
ReplyDelete
Replies
thouApril 14, 2020 at 1:25 PM
Hi arun,
Are you using spak 1.6 or spark 2 please? Is there any difference?
Thanks
ReplyDelete
Replies
UnknownJune 23, 2020 at 10:08 AM
Hi Arun, This blog help me a lot to pass the certification. We know that the examn was recently updated, however, it was very useful for me. Thanks a lot.
ReplyDelete
Replies
AshwinAugust 17, 2020 at 8:25 AM
Most informative post.Thank you.
Machine Learning training in Pallikranai Chennai

Data science training in Pallikaranai

Python Training in Pallikaranai chennai

Bigdata training in Pallikaranai chennai

Spark with ML training in Pallikaranai chennai
ReplyDelete
Replies
BTree Systems, Software (IT) Training InstituteSeptember 10, 2020 at 11:02 PM
Btreesystems Specializes in Offering Real-time Experience IT training on AWS, ..etc.. No.1 Software (IT) Training Institute in India aws training in chennai
Python training in Chennai
data science training in chennai
hadoop training in chennai
machine learning training chennai
ReplyDelete
Replies
shitalOctober 31, 2020 at 2:10 AM

Excellent Blog! I would like to thank for the efforts you have made in writing this post. I am hoping the same best work from you in the future as well.

Data Scientist Course in pune
ReplyDelete
Replies
KALISJanuary 5, 2021 at 12:38 AM
Good blog and good work, thanks for sharing this information

I encountered two errors while doing Problem Scenario 1, thanks for your help :

4c. var rddResult = joinedDF.map(x=> (x(1),x(3),x(0),x(9)));
#=>>>error: java.lang.ClassNotFoundException: scala.Any
at scala.reflect.internal.util.AbstractFileClassLoader.findClass(AbstractFileClassLoader.scala:62)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
7. dataframeResult.map(x=> x(0) + "," + x(1) + "," + x(2) + "," + x(3)).saveAsTextFile("/user/osboxes/problem1/result4a-csv")
#=>>>error: value saveAsTextFile is not a member of org.apache.spark.sql.Dataset[String]
sqlResult.map(x=> x(0) + "," + x(1) + "," + x(2) + "," + x(3)).saveAsTextFile("/user/osboxes/problem1/result4b-csv")
#=>>>error: value saveAsTextFile is not a member of org.apache.spark.sql.Dataset[String]

For information, I'm on Cloudera QuickStart VM 6.3.2
ReplyDelete
Replies
AnonymousMarch 2, 2021 at 9:04 PM
Nice article please do visit my website for cloudera developer training
ReplyDelete
Replies
BabitMarch 4, 2021 at 11:23 PM
Thanks for sharing the such information with us.
Hadoop Certification in Pune
ReplyDelete
Replies
KathyMarch 31, 2021 at 3:29 AM
Well written articles like yours renews my faith in today's writers. The article is very informative. Thanks for sharing such beautiful information.
Best Data Migration tools
Penetration testing companies USA
What is Data Lake
Artificial Intelligence in Banking
What is Data analytics
Big data Companies USA
What is Data Lake
What is Data Migration
What is Data Science
ReplyDelete
Replies
KITS TechnologiesJuly 8, 2021 at 4:38 AM
java online courses
salesforce online courses
hadoop online courses
Data Science online courses
linux online courses
etl testing online courses
web methods online courses
business analyst online course
oracle adf online course
ReplyDelete
Replies
Free CSEET Online ClassesJuly 19, 2021 at 3:59 AM
You can't imagine what I am going to tell you. Our institution is offering you a CS executive classes and a free of cost CSEET classes and many more to explore. So please contact us or visit our website at https://uniqueacademyforcommerce.com/
ReplyDelete
Replies
KaparthicynixitJuly 21, 2021 at 5:32 AM
Thanku For Sharing Nice Informative Blog.
Visit us: uipath training in delhi
Visit us: ui path online training
Visit us: uipath training in hyderabad
Visit us: ui path course in bangalore
ReplyDelete
Replies
3RI TechnologiesJuly 29, 2021 at 3:59 AM
thanks for the article on hadoop keep writing and sharing.

Data Science Training in Pune
ReplyDelete
Replies
KITS TechnologiesAugust 6, 2021 at 3:58 AM
web methods online courses
business analyst online course
oracle adf online course
oracle rac online course
msbi online course
osb online training
azure devops online training
ReplyDelete
Replies
kishorSeptember 9, 2021 at 10:25 PM
best digital marketing agancy new tips for best digital marketing
free classified submission sites list
kishorsasemahal
ReplyDelete
Replies
kishorSeptember 12, 2021 at 1:24 AM
nices information thanku so much this information
bluehost-discounts
digital marketing tips
ReplyDelete
Replies
Health Insurance in UAESeptember 19, 2021 at 1:50 AM
thanku so much this information
valueking
ReplyDelete
Replies
Sreyobhilashi InstituteSeptember 24, 2021 at 12:14 AM
thanks arun to share valuable cloudera spark certification training related (cca 175 ) questions, but recently its changed syllabus.
Regards
Venu
bigdata training in Hyderabad
ReplyDelete
Replies
GeorgeNovember 2, 2021 at 1:44 AM
Set An Alert For RMB TO USD To Receive An Email When The Exchange Rate Changes. Alternatively, Bookmark The Page And Check Back Here Regularly.
ReplyDelete
Replies
FranklinNovember 9, 2021 at 9:06 PM
Metallurgical equipment in India
ReplyDelete
Replies
Elena JamesNovember 18, 2021 at 2:51 AM
Are You Thinking About Joining AVATRADE REVIEW ? If You Are Then Read The Review On AVATRADE Fx In This Article. We, Will, Tell You About The Company, The Brokers, The Platform, And All The Information You Need To Know. Read More Here.
ReplyDelete
Replies
superpowerbacklinksNovember 22, 2021 at 3:45 AM
beautiful blog thank you Chandan Online
Shankh Plate
Branch Lamp
Sphatik mala
ReplyDelete
Replies
DucatJanuary 31, 2022 at 12:43 AM
Nice Piece Of Information, Keep Sharing Such Informative Post.

big data hadoop course

Call on 7070905090 To Join Ducat Today
ReplyDelete
Replies
Brian JoeFebruary 2, 2022 at 3:38 AM

ukcat organic chemistry
bmat chemistry
organic chemistry tutor
NEET chemistry home tutor
ReplyDelete
Replies
kamakshiFebruary 4, 2022 at 4:28 AM
Thanks for this nice blog
Rudraksha mala
Rudraksha Bracelet
ReplyDelete
Replies
Welcome to Ducat IndiaMarch 12, 2022 at 1:36 AM
Great Post. Very informative. Keep Sharing!!

Apply Now for Big DATA Training Classes in Noida

For more details about the course fee, duration, classes, certification, and placement call our expert at 70-70-90-50-90
ReplyDelete
Replies
Tamil novelsApril 8, 2022 at 2:44 AM
Very Informative post. Thank you for sharing with us.
Thirukkural pdf
Sai Satcharitra in English pdf
Sai Satcharitra in Tamil pdf
Sai Satcharitra in Telugu pdf
Sai Satcharitra in Hindi pdf
ReplyDelete
Replies
AnonymousDecember 9, 2022 at 9:00 PM
aqualudo
ReplyDelete
Replies
SwarnAppFebruary 27, 2023 at 4:51 AM
Thanks for sharing such information with us.
Jewellery Billing Software
Jewellery Billing Software
ReplyDelete
Replies
SwarnAppMay 4, 2023 at 5:59 AM
thanks for the article keep writing and sharing.
Jewellery ERP Software UAE
Jewellery ERP Software UAE
ReplyDelete
Replies
anjani02September 11, 2025 at 12:35 AM
Pega course online: Design end-to-end case lifecycles, automate workflows, and build responsive UIs on Pega’s low‑code platform with hands‑on projects aligned to CSA skills for delivery‑ready confidence.
Model data objects, implement decision rules and integrations, and practice real scenarios like service requests, approvals, and SLA‑driven escalations for job‑oriented outcomes.
ReplyDelete
Replies

Add comment

Hadoop and Spark Developer - CCA 175