Apache Spark: Working with DataFrames

From the original documentation:

With a SparkSession, applications can create DataFrames from an existing RDD, from a Hive table, or from Spark data sources.

Common DataFrame Operations

Operation                               Description
df.show()                               Show the data in the DataFrame
df.printSchema()                        Print the structure (schema) of the DataFrame
df.select("name").where("age > 18")     Select a specific column from the DataFrame
df.filter(df("age") > 18)               Filter records in the DataFrame by the desired condition
df.groupBy("age")                       Group records by a specific column
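The operations above can be sketched in a small Scala program. This is a minimal example, not from the original documentation: the input file `people.json` and its "name" and "age" fields are hypothetical, and `local[*]` is used only for a local test run.

```scala
import org.apache.spark.sql.SparkSession

object DataFrameExample {
  def main(args: Array[String]): Unit = {
    // Entry point to Spark SQL; local[*] runs Spark locally with all cores
    val spark = SparkSession.builder()
      .appName("DataFrameExample")
      .master("local[*]")
      .getOrCreate()

    // people.json is a hypothetical input with "name" and "age" fields
    val df = spark.read.json("people.json")

    df.show()                                   // show the data
    df.printSchema()                            // print the schema
    df.select("name").where("age > 18").show()  // select a column with a condition
    df.filter(df("age") > 18).show()            // filter records
    df.groupBy("age").count().show()            // groupBy returns grouped data;
                                                // an aggregation such as count()
                                                // yields a DataFrame again
    spark.stop()
  }
}
```

Note that `groupBy` alone does not produce a DataFrame; it must be followed by an aggregation such as `count()` or `agg(...)`.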

To use the Spark SQL component, add the required dependencies to your sbt or Maven project.
For example, a Maven project's pom.xml file will contain the following dependencies:

<dependencies>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.11</artifactId>
            <version>2.1.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.11</artifactId>
            <version>2.1.0</version>
        </dependency>
</dependencies>
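For an sbt project, the equivalent dependency declarations would look like the following sketch, matching the Maven coordinates above (Scala 2.11, Spark 2.1.0); the exact `scalaVersion` patch level is an assumption:

```scala
// build.sbt — same coordinates as the Maven example above
scalaVersion := "2.11.8"  // assumed 2.11.x patch version

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.1.0",
  "org.apache.spark" %% "spark-sql"  % "2.1.0"
)
```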
