Problem Scenario 6 [Data Analysis]

PLEASE READ THE INTRODUCTION TO THIS SERIES: CLICK THE HOME LINK AND READ THE INTRO BEFORE ATTEMPTING TO SOLVE THE PROBLEMS.

A video walkthrough of this problem is available at [CLICK HERE].

Click here for the video version of this series; it takes you to the YouTube playlist of videos.

This problem helps you strengthen and validate skills related to the data analysis objective of the certification exam.

The data model in MySQL on the Cloudera VM looks like this. [Note: only primary and foreign keys are included in the relational schema diagram shown below.]



Problem 6: Provide two solutions for steps 2 to 7:
  • using HiveQL over a Hive Context, and
  • using Spark SQL over a Spark SQL Context, or by using RDDs.
  1. Create a Hive metastore database named problem6 and import all tables from the MySQL retail_db database into the Hive metastore.
  2. On the spark shell, use the data available in the metastore as the source and perform steps 3, 4, 5, and 6. [This proves your ability to use the metastore as a source.]
  3. Rank products within each department by price, ordering by department ascending and rank descending. [This proves you can produce ranked and sorted data on joined data sets.]
  4. Find the top 10 customers with the most unique product purchases. If more than one customer has the same number of product purchases, the customer with the lowest customer_id takes precedence. [This proves you can produce aggregate statistics on joined data sets.]
  5. On the dataset from step 3, apply a filter so that only products priced below 100 are extracted. [This proves you can use subqueries and filter data.]
  6. On the dataset from step 4, extract details of the products purchased by the top 10 customers that are priced at less than 100 USD per unit. [This proves you can use subqueries and filter data.]
  7. Store the results of steps 5 and 6 in new metastore tables within Hive. [This proves your ability to use the metastore as a sink.]

Solution: 

Try your best to solve the above scenario without going through the solution below. If you succeed, use the solution to compare your results. If you could not solve it, I strongly recommend that you go through the concepts again (this time in more depth). Each step below solves the corresponding point in the problem scenario. Please watch the video for an in-depth explanation of the solution.

NOTE: The same solution can be implemented using a Spark SQL Context: just replace the Hive Context object below with a SQL Context object. The rest of the solution remains the same, i.e., the same pattern of querying, using temp tables, and storing the results back into Hive.

Step 1: 

Create the metastore database first (sqoop's --hive-database option expects it to exist), then run the import:

hive -e "create database problem6"

sqoop import-all-tables \
  --connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
  --username retail_dba \
  --password cloudera \
  --warehouse-dir /user/hive/warehouse/problem6.db \
  --hive-import \
  --hive-database problem6 \
  --create-hive-table \
  --as-textfile
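As an optional sanity check, you can confirm that all six retail_db tables landed in the new database:

hive -e "use problem6; show tables"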

Step 2: 

val hc = new org.apache.spark.sql.hive.HiveContext(sc)
hc.sql("use problem6")  // switch to the problem6 database so the unqualified table names below resolve
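A quick check that the metastore is visible from the shell (if this fails with a "Table not found" error, see the hive-site.xml symlink fix discussed in the comments below):

hc.sql("show tables").show()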

Step 3: 

val hiveResult = hc.sql("""
  select d.department_id, p.product_id, p.product_name, p.product_price,
         rank() over (partition by d.department_id order by p.product_price) as product_price_rank,
         dense_rank() over (partition by d.department_id order by p.product_price) as product_dense_price_rank
  from products p
  inner join categories c on c.category_id = p.product_category_id
  inner join departments d on c.category_department_id = d.department_id
  order by d.department_id, product_price_rank desc, product_dense_price_rank
""")
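The same ranking can be sketched with the DataFrame API instead of raw SQL; this is an illustrative alternative (variable names are mine), not the only accepted answer:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{rank, dense_rank, col}

val products    = hc.table("products")
val categories  = hc.table("categories")
val departments = hc.table("departments")

// rank within each department by price, as in the HiveQL version
val w = Window.partitionBy("department_id").orderBy("product_price")

val rankedDF = products
  .join(categories, products("product_category_id") === categories("category_id"))
  .join(departments, categories("category_department_id") === departments("department_id"))
  .select(departments("department_id"), products("product_id"), products("product_name"), products("product_price"))
  .withColumn("product_price_rank", rank().over(w))
  .withColumn("product_dense_price_rank", dense_rank().over(w))
  .orderBy(col("department_id").asc, col("product_price_rank").desc)

Note that window functions in Spark 1.x require a HiveContext, which we already have.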

Step 4: 

val hiveResult2 = hc.sql("""
  select c.customer_id, c.customer_fname,
         count(distinct oi.order_item_product_id) as unique_products
  from customers c
  inner join orders o on o.order_customer_id = c.customer_id
  inner join order_items oi on o.order_id = oi.order_item_order_id
  group by c.customer_id, c.customer_fname
  order by unique_products desc, c.customer_id
  limit 10
""")

Step 5: 

hiveResult.registerTempTable("product_rank_result_temp");
hc.sql("select * from product_rank_result_temp where product_price < 100").show();

Step 6: 

This re-runs the step 4 query (reusing hiveResult2 from step 4 works equally well); registering the result as a temp table lets us join against it:

val topCustomers = hc.sql("""
  select c.customer_id, c.customer_fname,
         count(distinct oi.order_item_product_id) as unique_products
  from customers c
  inner join orders o on o.order_customer_id = c.customer_id
  inner join order_items oi on o.order_id = oi.order_item_order_id
  group by c.customer_id, c.customer_fname
  order by unique_products desc, c.customer_id
  limit 10
""")

topCustomers.registerTempTable("top_cust");

val topProducts = hc.sql("""
  select distinct p.*
  from products p
  inner join order_items oi on oi.order_item_product_id = p.product_id
  inner join orders o on o.order_id = oi.order_item_order_id
  inner join top_cust tc on o.order_customer_id = tc.customer_id
  where p.product_price < 100
""")
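A DataFrame sketch of the same extraction, reusing top10DF from the step 4 sketch above (illustrative, not the canonical answer):

val p  = hc.table("products")
val oi = hc.table("order_items")
val o  = hc.table("orders")
val top10Ids = top10DF.select("customer_id")

val topProductsDF = p
  .join(oi, oi("order_item_product_id") === p("product_id"))
  .join(o, o("order_id") === oi("order_item_order_id"))
  .join(top10Ids, o("order_customer_id") === top10Ids("customer_id"))
  .where(p("product_price") < 100)
  .select(p("product_id"), p("product_name"), p("product_price"))
  .distinct()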

Step 7: 

hc.sql("create table problem6.product_rank_result as select * from product_rank_result_temp where product_price < 100");


hc.sql("create table problem 6.top_products as select distinct p.* from products p inner join order_items oi on oi.order_item_product_id = p.product_id inner join orders o on o.order_id = oi.order_item_order_id inner join top_cust tc on o.order_customer_id = tc.customer_id where p.product_price < 100");

Comments:

  1. Hi Arun, this blog has been awesome for me! When can we expect the next videos, on Flume / Kafka / Spark Streaming and configuration? Looking forward to learning more from you.

     Reply: The Flume problem is out already. I don't expect anything to be asked on Kafka. Spark Streaming is a grey area, as no one has reported seeing a Spark Streaming problem so far. I will post a problem on Spark Streaming shortly anyway.

  2. Hi Arun, do you really need the --create-hive-table flag in the first question?

  3. Hi Arun, I am getting the error "org.apache.spark.sql.AnalysisException: Table not found: products;" while running the query in Spark SQL. I followed all the steps correctly up to that point.

     Reply: Hey, how did you resolve this? I'm facing the same issue.

     Reply: Try linking hive-site.xml into Spark's conf directory before launching spark-shell:
       sudo ln -s /etc/hive/conf/hive-site.xml /etc/spark/conf/hive-site.xml
     Then launch Spark. It should work.

  4. Hi Arun, first of all, great effort! I am pretty weak in Hive but good in Spark, so I am confused about what the replacement for dense_rank is in Spark SQL.

     Reply (same author): Step 3 in particular is pretty confusing to me.

  5. Hi Arun, great blog for preparing for CCA175. I would like to know whether a single question in the certification contains as many steps or queries as above, or fewer. I am asking with the duration of the exam in mind: 2 hours for 10 questions gives us 12 minutes per question. Would we be able to solve all the steps in that amount of time?

     Reply: You can safely consider each step as a single question in the exam. The exam questions should not be time consuming if you understand the approach to take.

  6. Such a useful blog! I have a query: can I answer all the Spark-related questions using a Hive Context / SQL Context?

  7. This is more of a SQL / HiveQL question. In the query below, for "Rank products within department by price and order by department ascending and rank descending", is there a way to list only the top 3 ranked products within each department?

     select d.department_id, p.product_id, p.product_name, p.product_price, rank() over (partition by d.department_id order by p.product_price) as product_price_rank, dense_rank() over (partition by d.department_id order by p.product_price) as product_dense_price_rank from products p inner join categories c on c.category_id = p.product_category_id inner join departments d on c.category_department_id = d.department_id order by d.department_id, product_price_rank desc, product_dense_price_rank

     Reply: Never mind, I got it: wrap the ranked query in a subquery and filter on the rank.

       select * from (select product_id, product_price, category_id, category_name, rank() over (partition by category_id order by product_price) price_rank, dense_rank() over (partition by category_id order by product_price) price_dense_rank from products join categories on product_category_id = category_id order by category_id, price_rank, price_dense_rank) tmp where price_dense_rank <= 3;

       Thanks though,
       lakshmi

     Reply: Do we need both rank and dense_rank here? Please let me know. I guess dense_rank alone would satisfy the requirement.

  8. Hi Arun, great post, it helped me a lot. One question I have: is there no other way to push data into Hive without going through a HiveContext?

     I have tried DF.write.mode("append").saveAsTable("schema.table") and it works, but I am still wondering if there is a better way.

  9. Hi Arun, thanks for such a great post! I need help with this problem. I got an error while running my SQL query in the Spark Hive context:
     org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree:
     TungstenExchange hashpartitioning(department_id#93,200), None
     +- Project [department_id#93,product_id#84,product_category_id#85,product_name#86,product_price#88]

  10. Need help: I am getting the error "org.apache.spark.sql.AnalysisException: Table not found: products;" while running the query in Spark SQL.

  11. Using CDH 5.13 in a KVM image, I had to run
      sudo ln -s /etc/hive/conf.dist/hive-site.xml /etc/spark/conf/hive-site.xml
      to get Spark to see the Hive metastore that has the problem6 database.

  12. Hi Arun, detailed and very helpful for CCA175 prep. I was wondering: can we use only Spark SQL for data analysis instead of Spark RDD or DataFrame functions, or will the question specify the technology to be used?

  13. Thanks for the excellent problem questions, Arun. It's a great help, along with Durga Gadiraju's itversity labs and YouTube videos. You both have the blessings of so many people.

  14. Where is it mentioned to use the problem6 database?

  15. Great efforts, Arun. Keep up the great work.

  16. Hi, I ran the following in Spark 2:

      val a = sc.parallelize(List("dog", "tiger", "lion", "cat", "spider", "eagle"), 2); val b = a.keyBy(_.length); b.groupByKey.collect

      Expected result:
      Array((4,ArrayBuffer(lion)), (6,ArrayBuffer(spider)), (3,ArrayBuffer(dog, cat)), (5,ArrayBuffer(tiger, eagle)))

      Actual result:
      Array((4,CompactBuffer(lion)), (6,CompactBuffer(spider)), (3,CompactBuffer(dog, cat)), (5,CompactBuffer(tiger, eagle)))

      Will this create any issue with the answer being considered correct in CCA175? Please assist.

  17. Hi Arun, I am planning to take the CCA175 exam. Could you please put up a practice test for CCA175 with the new syllabus?

      Thanks,
      Ranju Thomas
