Problem Scenario 1

Please read the introduction to this series first: click the Home link and read the intro before attempting to solve the problems.

A video walkthrough of the solution to this problem can be found here [Click here]

Click here for the video version of this series; it takes you to the YouTube playlist for the series.

Problem 1:
  1. Using Sqoop, import the orders table into HDFS under the folder /user/cloudera/problem1/orders. Files should be stored as Avro data files with Snappy compression.
  2. Using Sqoop, import the order_items table into HDFS under the folder /user/cloudera/problem1/order-items. Files should be stored as Avro data files with Snappy compression.
  3. Using Spark with Scala, load the data at /user/cloudera/problem1/orders and /user/cloudera/problem1/order-items as DataFrames.
  4. Expected intermediate result: order_date, order_status, total_orders, total_amount. In plain English: find the total number of orders and the total amount per status per day. The result should be sorted by order date descending, order status ascending, total amount descending, and total orders ascending. The aggregation must be done with each of the methods below; the sorting can be done with either a DataFrame or an RDD.
    • a) Using only the DataFrames API - here order_date should be in YYYY-MM-DD format
    • b) Using Spark SQL - here order_date should be in YYYY-MM-DD format
    • c) Using the combineByKey function on RDDs - no need to format order_date or total_amount
  5. Store the result as Parquet files in HDFS using gzip compression under the folders:
    • /user/cloudera/problem1/result4a-gzip
    • /user/cloudera/problem1/result4b-gzip
    • /user/cloudera/problem1/result4c-gzip
  6. Store the result as Parquet files in HDFS using Snappy compression under the folders:
    • /user/cloudera/problem1/result4a-snappy
    • /user/cloudera/problem1/result4b-snappy
    • /user/cloudera/problem1/result4c-snappy
  7. Store the result as CSV files in HDFS with no compression under the folders:
    • /user/cloudera/problem1/result4a-csv
    • /user/cloudera/problem1/result4b-csv
    • /user/cloudera/problem1/result4c-csv
  8. Create a MySQL table named result and load the data from /user/cloudera/problem1/result4a-csv into it.

Solution: 

Try your best to solve the above scenario without going through the solution below. If you succeed, use the solution to compare your results; if you could not, I strongly recommend that you go through the concepts again, this time in more depth. Each step below addresses the corresponding point in the problem scenario.

Step 1: 

sqoop import \
--connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
--username retail_dba \
--password cloudera \
--table orders \
--compress \
--compression-codec org.apache.hadoop.io.compress.SnappyCodec \
--target-dir /user/cloudera/problem1/orders \
--as-avrodatafile;
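To confirm that the import actually produced Snappy-compressed Avro files, one quick check (assuming avro-tools is available, as it usually is on the Cloudera QuickStart VM; the part file name below is only illustrative) is to pull one part file down and read its metadata - the avro.codec entry should say snappy:

hdfs dfs -ls /user/cloudera/problem1/orders
hdfs dfs -get /user/cloudera/problem1/orders/part-m-00000.avro /tmp/orders-sample.avro
avro-tools getmeta /tmp/orders-sample.avro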



Step 2: 

sqoop import \
--connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
--username retail_dba \
--password cloudera \
--table order_items \
--compress \
--compression-codec org.apache.hadoop.io.compress.SnappyCodec \
--target-dir /user/cloudera/problem1/order-items \
--as-avrodatafile;



Step 3: 

import com.databricks.spark.avro._;
   var ordersDF = sqlContext.read.avro("/user/cloudera/problem1/orders");
   var orderItemDF = sqlContext.read.avro("/user/cloudera/problem1/order-items");
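The com.databricks.spark.avro import only resolves if the spark-avro jar is on the spark-shell classpath. On the CDH 5 QuickStart VM it is normally already there; if it is not, one way to bring it in when launching the shell (the version shown is an assumption - use whatever matches your install) is:

spark-shell --packages com.databricks:spark-avro_2.10:2.0.1
# or, if the jar is already on the local filesystem:
spark-shell --jars /path/to/spark-avro_2.10-2.0.1.jar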


Step 4: 

var joinedOrderDataDF = ordersDF
.join(orderItemDF,ordersDF("order_id")===orderItemDF("order_item_order_id"))
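Because the combineByKey solution in step 4c refers to columns by position, it is worth confirming the column order of the joined DataFrame before moving on. The index-based references used later assume order_id at position 0, order_date at 1, order_status at 3 and order_item_subtotal at 8, which is what the standard retail_db schema produces:

joinedOrderDataDF.printSchema();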


Step 4a: 

import org.apache.spark.sql.functions._;

var dataFrameResult = joinedOrderDataDF.
groupBy(to_date(from_unixtime(col("order_date")/1000)).alias("order_formatted_date"),col("order_status")).
agg(round(sum("order_item_subtotal"),2).alias("total_amount"),countDistinct("order_id").alias("total_orders")).
orderBy(col("order_formatted_date").desc,col("order_status"),col("total_amount").desc,col("total_orders"));

dataFrameResult.show();



Step 4b: 

joinedOrderDataDF.registerTempTable("order_joined");

var sqlResult = sqlContext.sql("select to_date(from_unixtime(cast(order_date/1000 as bigint))) as order_formatted_date, order_status, cast(sum(order_item_subtotal) as DECIMAL (10,2)) as total_amount, count(distinct(order_id)) as total_orders from order_joined group by to_date(from_unixtime(cast(order_date/1000 as bigint))), order_status order by order_formatted_date desc,order_status,total_amount desc, total_orders");

sqlResult.show();
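registerTempTable is the Spark 1.x API, which is what the CDH 5 QuickStart VM ships with. If you happen to be practising on Spark 2.x, the equivalent call is:

// Spark 2.x replacement for registerTempTable
joinedOrderDataDF.createOrReplaceTempView("order_joined");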


Step 4c: 

var comByKeyResult = joinedOrderDataDF.
// key on (order_date, order_status); value is (order_item_subtotal, order_id)
map(x=> ((x(1).toString,x(3).toString),(x(8).toString.toFloat,x(0).toString))).
combineByKey(
// createCombiner: start the running total and a set of distinct order ids
(x:(Float, String))=>(x._1,Set(x._2)),
// mergeValue: fold one more record into a partition-local accumulator
(x:(Float,Set[String]),y:(Float,String))=>(x._1 + y._1,x._2+y._2),
// mergeCombiners: combine accumulators coming from different partitions
(x:(Float,Set[String]),y:(Float,Set[String]))=>(x._1+y._1,x._2++y._2)).
// flatten to (order_date, order_status, total_amount, total_orders)
map(x=> (x._1._1,x._1._2,x._2._1,x._2._2.size)).
toDF().
orderBy(col("_1").desc,col("_2"),col("_3").desc,col("_4"));

comByKeyResult.show();
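combineByKey is not the only way to produce this result. A sketch of the same aggregation using aggregateByKey (same keying as above, and assuming Spark 1.6 where DataFrame.map returns an RDD) looks like this:

var aggByKeyResult = joinedOrderDataDF.
map(x=> ((x(1).toString,x(3).toString),(x(8).toString.toFloat,x(0).toString))).
aggregateByKey((0.0f, Set[String]()))(
// within a partition: add the subtotal and remember the order id
(acc, v) => (acc._1 + v._1, acc._2 + v._2),
// across partitions: add the sums and union the sets of order ids
(a, b) => (a._1 + b._1, a._2 ++ b._2)).
map(x=> (x._1._1,x._1._2,x._2._1,x._2._2.size)).
toDF().
orderBy(col("_1").desc,col("_2"),col("_3").desc,col("_4"));

aggByKeyResult.show();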



Step 5: 


  • sqlContext.setConf("spark.sql.parquet.compression.codec","gzip");
  • dataFrameResult.write.parquet("/user/cloudera/problem1/result4a-gzip");
  • sqlResult.write.parquet("/user/cloudera/problem1/result4b-gzip");
  • comByKeyResult.write.parquet("/user/cloudera/problem1/result4c-gzip");
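A quick sanity check after each write is to read the directory back and confirm both the data and the compression (with gzip, the part files typically carry a .gz.parquet suffix):

sqlContext.read.parquet("/user/cloudera/problem1/result4a-gzip").show(5);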


Step 6: 

  • sqlContext.setConf("spark.sql.parquet.compression.codec","snappy");
  • dataFrameResult.write.parquet("/user/cloudera/problem1/result4a-snappy");
  • sqlResult.write.parquet("/user/cloudera/problem1/result4b-snappy");
  • comByKeyResult.write.parquet("/user/cloudera/problem1/result4c-snappy");


Step 7: 

  • dataFrameResult.map(x=> x(0) + "," + x(1) + "," + x(2) + "," + x(3)).saveAsTextFile("/user/cloudera/problem1/result4a-csv")
  • sqlResult.map(x=> x(0) + "," + x(1) + "," + x(2) + "," + x(3)).saveAsTextFile("/user/cloudera/problem1/result4b-csv")
  • comByKeyResult.map(x=> x(0) + "," + x(1) + "," + x(2) + "," + x(3)).saveAsTextFile("/user/cloudera/problem1/result4c-csv")
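On Spark 1.6 (the QuickStart default) DataFrame.map returns an RDD, so saveAsTextFile works exactly as written above. If you are practising on Spark 2.x instead, map returns a Dataset and has no saveAsTextFile, so go through .rdd first, for example:

// Spark 2.x variant: convert to an RDD before saving as text
dataFrameResult.rdd.map(x=> x(0) + "," + x(1) + "," + x(2) + "," + x(3)).saveAsTextFile("/user/cloudera/problem1/result4a-csv")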


Step 8: 

a) Log in to MySQL using: mysql -h localhost -u retail_dba -p
  (when prompted for the password, use cloudera or whatever password you have currently set)


b) create table retail_db.result(order_date varchar(255) not null,order_status varchar(255) not null, total_orders int, total_amount numeric, constraint pk_order_result primary key (order_date,order_status)); 

c)
sqoop export \
--table result \
--connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
--username retail_dba \
--password cloudera \
--export-dir "/user/cloudera/problem1/result4a-csv" \
--columns "order_date,order_status,total_amount,total_orders"
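Once the export finishes, a quick check from the MySQL shell confirms that the rows landed in the result table:

select * from retail_db.result order by order_date desc limit 10;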










