Practice 2: SparkSQL, DataFrames

A. JSON: Analyzing tweets via SparkSQL

Load this JSON file with tweets into a Spark DataFrame and complete the following tasks using SQL queries:

  • Print the structure (the schema) of the JSON file

  • Read all tweets from the input file (extract the message text from each tweet)

  • Count tweets per language to find the most active languages

  • Get the earliest and latest tweet dates

  • Get the top devices used among all Twitter users

  • Find all the tweets by user
  • Find how many tweets each user has

  • Find all the persons mentioned in tweets

  • Count how many times each person is mentioned

  • Find the 10 most mentioned persons

  • Find all the hashtags mentioned in a tweet

  • Count how many times each hashtag is mentioned
  • Find the 10 most popular hashtags
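
As a starting point for the tasks above, here is a minimal PySpark sketch. It is not a reference solution. It assumes the file follows the usual Twitter API layout, with fields named `text`, `lang`, `source` and `user.screen_name`; check these against the printed schema before relying on them. The path `tweets.json` and the helper names (`top_values_sql`, `extract_mentions`, `extract_hashtags`, `analyze_tweets`) are hypothetical.

```python
import re

# Simple patterns for @mentions and #hashtags inside the tweet text.
MENTION_RE = re.compile(r"@(\w+)")
HASHTAG_RE = re.compile(r"#(\w+)")


def extract_mentions(text):
    """Return the user names mentioned in a tweet text (without the '@')."""
    return MENTION_RE.findall(text or "")


def extract_hashtags(text):
    """Return the hashtags used in a tweet text (without the '#')."""
    return HASHTAG_RE.findall(text or "")


def top_values_sql(column, limit=10):
    """Build a 'most frequent values' query against the 'tweets' view."""
    return (
        f"SELECT {column} AS value, COUNT(*) AS n FROM tweets "
        f"WHERE {column} IS NOT NULL "
        f"GROUP BY {column} ORDER BY n DESC LIMIT {limit}"
    )


def analyze_tweets(spark, path="tweets.json"):
    """Run a few of the tasks; 'path' and all field names are assumptions."""
    from pyspark.sql import functions as F
    from pyspark.sql.types import ArrayType, StringType

    tweets = spark.read.json(path)            # one JSON object per line
    tweets.printSchema()                      # structure (schema) of the file
    tweets.createOrReplaceTempView("tweets")  # make it visible to SQL

    spark.sql("SELECT text FROM tweets WHERE text IS NOT NULL").show()
    spark.sql(top_values_sql("lang")).show()                # active languages
    spark.sql(top_values_sql("source")).show()              # top devices
    spark.sql(top_values_sql("`user`.screen_name")).show()  # tweets per user

    # Wrap the pure-Python extractor in a UDF, then explode the resulting
    # array so that every mention becomes its own row before counting.
    mentions_udf = F.udf(extract_mentions, ArrayType(StringType()))
    (tweets
        .select(F.explode(mentions_udf(F.col("text"))).alias("mention"))
        .groupBy("mention").count()
        .orderBy(F.desc("count"))
        .limit(10)
        .show())
```

With a running session, `analyze_tweets(SparkSession.builder.getOrCreate())` would execute the queries. The hashtag tasks follow the same pattern, with `extract_hashtags` in place of `extract_mentions`.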

B. CSV: Analyzing GDELT data via SparkSQL

Load this gdel.csv file with GDELT data into a Spark DataFrame and complete the following tasks using SQL queries:

  • List all recent events that took place in the USA and Russia
  • Find the 10 most mentioned actors (persons)

  • Join EventCodes with CameoCodes to get each event's description

  • Find the 10 most mentioned events with description
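
The actor and join tasks above could be approached roughly as follows. This is a sketch under several assumptions: that both files have a header row, that the event columns are named `Actor1Name`, `Actor2Name` and `EventCode` as in the GDELT event schema, and that the CAMEO lookup file has columns `CameoCode` and `EventDescription`. The path `cameo_codes.csv`, the view names and the function `analyze_gdelt` are hypothetical.

```python
# Count an actor once for each appearance as Actor1 or Actor2.
TOP_ACTORS_SQL = """
SELECT actor, COUNT(*) AS mentions
FROM (
    SELECT Actor1Name AS actor FROM gdelt
    UNION ALL
    SELECT Actor2Name AS actor FROM gdelt
) AS a
WHERE actor IS NOT NULL
GROUP BY actor
ORDER BY mentions DESC
LIMIT 10
"""

# Join events with the CAMEO lookup table to attach a description.
TOP_EVENTS_SQL = """
SELECT g.EventCode, c.EventDescription, COUNT(*) AS n
FROM gdelt g
JOIN cameo c ON g.EventCode = c.CameoCode
GROUP BY g.EventCode, c.EventDescription
ORDER BY n DESC
LIMIT 10
"""


def analyze_gdelt(spark, events_path="gdel.csv", codes_path="cameo_codes.csv"):
    """Register both files as views and run the two queries.

    Column names, header rows and 'codes_path' are assumptions.
    """
    # Without inferSchema every column is read as a string, which keeps
    # leading zeros in event codes such as "010".
    events = spark.read.csv(events_path, header=True)
    codes = spark.read.csv(codes_path, header=True)

    events.createOrReplaceTempView("gdelt")
    codes.createOrReplaceTempView("cameo")

    spark.sql(TOP_ACTORS_SQL).show(truncate=False)
    spark.sql(TOP_EVENTS_SQL).show(truncate=False)
```

Reading the event codes as strings is a deliberate choice here: if they were inferred as integers, "010" would become 10 and might no longer match the lookup table in the join.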
