Practice 2: SparkSQL, DataFrames
A. JSON: Analyzing tweets via SparkSQL
Load this JSON file of tweets into a Spark DataFrame and complete the following SQL query tasks:
- Print the structure (schema) of the JSON file
- Read all tweets from the input files (extract the message text from each tweet)
- Count the tweets per language to find the most active languages
- Get the earliest and latest tweet dates
- Get the top devices used among all Twitter users
- Find all the tweets by user
- Find how many tweets each user has
- Find all the persons mentioned in tweets
- Count how many times each person is mentioned
- Find the 10 most mentioned persons
- Find all the hashtags mentioned in a tweet
- Count how many times each hashtag is mentioned
- Find the 10 most popular hashtags
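The tweet tasks above can be sketched as a table of Spark SQL queries run against a temporary view. This is only a starting point under stated assumptions: the field names (text, lang, created_at, source, user.screen_name, entities.user_mentions, entities.hashtags) follow the classic Twitter API layout and may not match the provided file, and the names TWEET_QUERIES, counts and run_tweet_queries are hypothetical helpers introduced for illustration.

```python
from typing import Dict

# All field names used here are assumptions taken from the classic Twitter
# API layout; adjust them to whatever printSchema() reports for the file.
MENTIONS = (
    "SELECT m.screen_name AS person FROM tweets "
    "LATERAL VIEW explode(entities.user_mentions) mv AS m"
)
HASHTAGS = (
    "SELECT h.text AS hashtag FROM tweets "
    "LATERAL VIEW explode(entities.hashtags) hv AS h"
)


def counts(inner: str, col: str) -> str:
    """Wrap a one-column query in a GROUP BY that counts each distinct value."""
    return (
        f"SELECT {col}, COUNT(*) AS n FROM ({inner}) AS x "
        f"GROUP BY {col} ORDER BY n DESC"
    )


TWEET_QUERIES: Dict[str, str] = {
    "messages": "SELECT text FROM tweets",
    "languages": counts("SELECT lang FROM tweets", "lang"),
    # MIN/MAX only give true dates if created_at sorts chronologically;
    # a raw text timestamp would first need parsing with to_timestamp.
    "date_range": (
        "SELECT MIN(created_at) AS earliest, MAX(created_at) AS latest "
        "FROM tweets"
    ),
    "devices": counts("SELECT source FROM tweets", "source") + " LIMIT 10",
    # `user` is back-quoted because it can be a reserved word in Spark SQL.
    "tweets_by_user": (
        "SELECT `user`.screen_name AS user_name, text "
        "FROM tweets ORDER BY user_name"
    ),
    "tweets_per_user": counts(
        "SELECT `user`.screen_name AS user_name FROM tweets", "user_name"
    ),
    "mentions": MENTIONS,
    "mention_counts": counts(MENTIONS, "person"),
    "top_mentions": counts(MENTIONS, "person") + " LIMIT 10",
    "hashtags": HASHTAGS,
    "hashtag_counts": counts(HASHTAGS, "hashtag"),
    "top_hashtags": counts(HASHTAGS, "hashtag") + " LIMIT 10",
}


def run_tweet_queries(path: str, rows: int = 20) -> None:
    """Load the tweets, print the schema, and show each query's result."""
    # Imported here so the query table can be inspected without Spark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("tweets-sql").getOrCreate()
    # spark.read.json expects one JSON object per line by default; a single
    # multi-line JSON document would need the multiLine option instead.
    tweets = spark.read.json(path)
    tweets.printSchema()
    tweets.createOrReplaceTempView("tweets")
    for name, sql in TWEET_QUERIES.items():
        print(f"== {name} ==")
        spark.sql(sql).show(rows, truncate=False)
    spark.stop()
```

Calling run_tweet_queries with the path to the downloaded file prints the schema first, which is the quickest way to check the assumed field names before trusting any of the results.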
B. CSV: Analyzing GDELT data via SparkSQL
Load this gdel.csv file of GDELT data into a Spark DataFrame and complete the following SQL query tasks:
- List all recent events that took place in the USA or Russia
- Find the 10 most mentioned actors (persons)
- Join EventCodes with CameoCodes to get each event's description
- Find the 10 most mentioned events, with their descriptions
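A possible shape for the GDELT tasks is sketched below as Spark SQL queries over two temporary views. Several things are assumed rather than known: that gdel.csv has a header row with the standard GDELT event column names (SQLDATE, Actor1Name, Actor2Name, EventCode, ActionGeo_CountryCode, GLOBALEVENTID), that the CAMEO lookup is a separate CSV with columns CAMEOEVENTCODE and EVENTDESCRIPTION, and that "took place in" refers to ActionGeo_CountryCode, which uses FIPS codes (US, RS). The names GDELT_QUERIES and run_gdelt_queries, and the cameo_path parameter, are hypothetical.

```python
from typing import Dict

# Column and view names are assumptions; check them against the real header.
GDELT_QUERIES: Dict[str, str] = {
    # SQLDATE is YYYYMMDD, so sorting it descending lists recent events first.
    "recent_usa_russia": (
        "SELECT SQLDATE, Actor1Name, Actor2Name, EventCode, "
        "ActionGeo_CountryCode FROM events "
        "WHERE ActionGeo_CountryCode IN ('US', 'RS') "
        "ORDER BY SQLDATE DESC"
    ),
    # Count an actor whether it appears as Actor1 or Actor2 of an event.
    "top_actors": (
        "SELECT actor, COUNT(*) AS n FROM ("
        "SELECT Actor1Name AS actor FROM events "
        "UNION ALL SELECT Actor2Name AS actor FROM events) AS a "
        "WHERE actor IS NOT NULL "
        "GROUP BY actor ORDER BY n DESC LIMIT 10"
    ),
    "events_with_description": (
        "SELECT e.GLOBALEVENTID, e.EventCode, c.EVENTDESCRIPTION "
        "FROM events e JOIN cameo c ON e.EventCode = c.CAMEOEVENTCODE"
    ),
    "top_events": (
        "SELECT e.EventCode, c.EVENTDESCRIPTION, COUNT(*) AS n "
        "FROM events e JOIN cameo c ON e.EventCode = c.CAMEOEVENTCODE "
        "GROUP BY e.EventCode, c.EVENTDESCRIPTION "
        "ORDER BY n DESC LIMIT 10"
    ),
}


def run_gdelt_queries(events_path: str, cameo_path: str, rows: int = 20) -> None:
    """Register both CSV files as views and show each query's result."""
    # Imported here so the query table can be inspected without Spark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("gdelt-sql").getOrCreate()
    # Without inferSchema every column is read as a string, which keeps the
    # leading zeros of CAMEO codes such as "010" intact for the join.
    # Raw GDELT exports are tab-separated with no header; those would need
    # sep="\t" and an explicit schema instead of header=True.
    events = spark.read.csv(events_path, header=True)
    cameo = spark.read.csv(cameo_path, header=True)
    events.createOrReplaceTempView("events")
    cameo.createOrReplaceTempView("cameo")
    for name, sql in GDELT_QUERIES.items():
        print(f"== {name} ==")
        spark.sql(sql).show(rows, truncate=False)
    spark.stop()
```

Ranking events by COUNT(*) treats each row as one mention; if the file carries a NumMentions column, summing it is an alternative reading of "most mentioned".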