Apache Spark: Working with DataFrames
From the original documentation:
With a `SparkSession`, applications can create DataFrames from an existing RDD, from a Hive table, or from Spark data sources.
Common DataFrame Operations
Operation | Description
---|---
`df.show()` | Show the contents of the DataFrame
`df.printSchema()` | Print the structure (schema) of the DataFrame
`df.select("name").where("age > 18")` | Select a specific column from the DataFrame, filtering rows
`df.filter(df("age") > 18)` | Filter records in the DataFrame by the desired condition
`df.groupBy("age")` | Group records by a specific column
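The operations above can be sketched in a small Scala program. This is a minimal, illustrative sketch: the `people.json` file and the `name`/`age` columns are assumptions, not part of the original text.

```scala
import org.apache.spark.sql.SparkSession

object DataFrameExample {
  def main(args: Array[String]): Unit = {
    // Create a SparkSession, the entry point for DataFrame operations
    val spark = SparkSession.builder()
      .appName("DataFrameExample")
      .master("local[*]")
      .getOrCreate()

    // Create a DataFrame from a Spark data source
    // (the file path and its columns are illustrative assumptions)
    val df = spark.read.json("people.json")

    df.show()         // show the contents of the DataFrame
    df.printSchema()  // print the schema of the DataFrame

    // select a column, filtering rows with a SQL-style predicate
    df.select("name").where("age > 18").show()

    // the same filter expressed with a Column expression
    df.filter(df("age") > 18).show()

    // group records by the "age" column and count each group
    df.groupBy("age").count().show()

    spark.stop()
  }
}
```

Note that `groupBy` alone returns a `RelationalGroupedDataset`; an aggregation such as `count()` is needed to get a DataFrame back.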
To use the Spark SQL component, add the required dependencies to your sbt or Maven project.
For example, for a Maven project the pom.xml file will contain the following dependencies:
```xml
<dependencies>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.11</artifactId>
        <version>2.1.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql_2.11</artifactId>
        <version>2.1.0</version>
    </dependency>
</dependencies>
```
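For an sbt project, the equivalent dependencies would go in `build.sbt` (a sketch; the `%%` operator appends the project's Scala binary version, here assumed to be 2.11, to the artifact name):

```scala
// build.sbt — equivalent Spark dependencies for an sbt project
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.1.0",
  "org.apache.spark" %% "spark-sql"  % "2.1.0"
)
```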