Spark ML Pipelines
ML Pipelines provide a uniform set of high-level APIs built on top of DataFrames that help users create and tune practical machine learning pipelines.
An ML Pipeline is defined as a sequence of stages, where each stage performs some transformation on the DataFrame. These stages are run in order, and the input DataFrame is transformed as it passes through each stage.
Basics
MLlib provides APIs for machine learning algorithms to make it easier to combine multiple algorithms into a single pipeline, or workflow. The key concepts of ML pipelines are:
DataFrame: This ML API uses DataFrame from Spark SQL as an ML dataset, which can hold a variety of data types. E.g., a DataFrame could have different columns storing text, feature vectors, true labels, and predictions.
Transformer: A Transformer is an algorithm which can transform one DataFrame into another DataFrame. E.g., an ML model is a Transformer which transforms a DataFrame with features into a DataFrame with predictions.
Estimator: An Estimator is an algorithm which can be fit on a DataFrame to produce a Transformer. E.g., a learning algorithm is an Estimator which trains on a DataFrame and produces a model.
Pipeline: A Pipeline chains multiple Transformers and Estimators together to specify an ML workflow.
Parameter: All Transformers and Estimators now share a common API for specifying parameters.
Pipeline
A Pipeline is specified as a sequence of stages, and each stage is either a Transformer or an Estimator. The stages are run in order, and the input DataFrame is transformed as it passes through each stage.
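The ordering of stages can be illustrated with a small plain-Python sketch. This is not the Spark API: a list of dicts stands in for a DataFrame, and every name below (Lowercase, VocabularyEstimator, VocabularyModel, fit_pipeline, run_pipeline) is hypothetical. It assumes the usual convention that, during fitting, a stage with a fit() method is first fitted on the current data and the resulting model is then used to transform the data for the next stage.

```python
from typing import Dict, List

Row = Dict[str, object]
Table = List[Row]  # stand-in for a DataFrame (assumption, not Spark)


class Lowercase:
    """Toy transformer stage: appends a lower-cased copy of the text column."""

    def transform(self, table: Table) -> Table:
        return [{**row, "lower": str(row["text"]).lower()} for row in table]


class VocabularyModel:
    """Toy fitted stage: appends per-word counts over a learned vocabulary."""

    def __init__(self, vocab: List[str]) -> None:
        self.vocab = vocab

    def transform(self, table: Table) -> Table:
        return [
            {**row, "counts": [str(row["lower"]).split().count(w) for w in self.vocab]}
            for row in table
        ]


class VocabularyEstimator:
    """Toy estimator stage: learns the sorted vocabulary of the 'lower' column."""

    def fit(self, table: Table) -> VocabularyModel:
        vocab = sorted({w for row in table for w in str(row["lower"]).split()})
        return VocabularyModel(vocab)


def fit_pipeline(stages: list, table: Table) -> list:
    """Walk the stages in order, fitting estimators on the data seen so far."""
    fitted = []
    current = table
    for stage in stages:
        if hasattr(stage, "fit"):
            stage = stage.fit(current)  # estimator -> fitted model
        fitted.append(stage)
        current = stage.transform(current)  # feed the output to the next stage
    return fitted


def run_pipeline(fitted: list, table: Table) -> Table:
    """Apply the already-fitted stages, in the same order, to new data."""
    for stage in fitted:
        table = stage.transform(table)
    return table


training = [{"text": "Spark ML"}, {"text": "ML pipelines"}]
fitted_stages = fit_pipeline([Lowercase(), VocabularyEstimator()], training)
result = run_pipeline(fitted_stages, [{"text": "Spark Spark ML"}])
print(result[0]["counts"])
```

Here the learned vocabulary is ["ml", "pipelines", "spark"], so the new row gets the counts [1, 0, 2]. The point of the sketch is only the ordering: the second stage can rely on the "lower" column because the first stage has already run.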
Transformer
A Transformer is an ML Pipeline component that transforms one DataFrame into another, usually by appending one or more columns. Technically, a Transformer implements a method transform(). For example:
A feature transformer might take a DataFrame, read a column (e.g., text), map it into a new column (e.g., feature vectors), and output a new DataFrame with the mapped column appended.
A learning model might take a DataFrame, read the column containing feature vectors, predict the label for each feature vector, and output a new DataFrame with predicted labels appended as a column.
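As a rough illustration of the feature-transformer case, the following plain-Python sketch reads a text column and appends a column of tokens. It is a toy model and not the Spark API: the list of dicts is a stand-in for a DataFrame, and the class name and column names are assumptions chosen for the example.

```python
from typing import Dict, List

Row = Dict[str, object]
Table = List[Row]  # stand-in for a DataFrame (assumption, not Spark)


class Tokenizer:
    """Toy feature transformer: reads a text column, appends a column of words."""

    def __init__(self, input_col: str, output_col: str) -> None:
        self.input_col = input_col
        self.output_col = output_col

    def transform(self, table: Table) -> Table:
        # Build new rows instead of mutating the input, mirroring the idea
        # that a transformation returns a new DataFrame.
        return [
            {**row, self.output_col: str(row[self.input_col]).lower().split()}
            for row in table
        ]


docs = [{"id": 0, "text": "Spark ML pipelines"}, {"id": 1, "text": "Hello Spark"}]
tokenized = Tokenizer("text", "words").transform(docs)
print(tokenized[0])
```

The output rows keep the original "id" and "text" columns and gain a "words" column, while the input rows are left unchanged.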
Estimator
An Estimator abstracts the concept of a learning algorithm or any algorithm that fits or trains on data. Technically, an Estimator implements a method fit(), which accepts a DataFrame and produces a Model, which is a Transformer. For example, a learning algorithm such as LogisticRegression is an Estimator, and calling fit() trains a LogisticRegressionModel, which is a Model and hence a Transformer.
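The fit() contract can be sketched in plain Python. This is a deliberately tiny stand-in, not LogisticRegression and not the Spark API: ThresholdEstimator and ThresholdModel are hypothetical names, the data is a list of dicts, and the "learning algorithm" is just the midpoint between the two class means of a single feature.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List

Row = Dict[str, float]
Table = List[Row]  # stand-in for a DataFrame (assumption, not Spark)


@dataclass(frozen=True)
class ThresholdModel:
    """Toy fitted model: it is itself a transformer that appends predictions."""

    threshold: float

    def transform(self, table: Table) -> Table:
        return [
            {**row, "prediction": 1.0 if row["feature"] > self.threshold else 0.0}
            for row in table
        ]


class ThresholdEstimator:
    """Toy estimator: fit() learns a threshold and returns a model."""

    def fit(self, table: Table) -> ThresholdModel:
        positive = mean(row["feature"] for row in table if row["label"] == 1.0)
        negative = mean(row["feature"] for row in table if row["label"] == 0.0)
        # The learned parameter is the midpoint between the two class means.
        return ThresholdModel(threshold=(positive + negative) / 2)


training = [
    {"feature": 1.0, "label": 0.0},
    {"feature": 2.0, "label": 0.0},
    {"feature": 6.0, "label": 1.0},
    {"feature": 7.0, "label": 1.0},
]
model = ThresholdEstimator().fit(training)
predictions = model.transform([{"feature": 3.0}, {"feature": 5.5}])
print(model.threshold, [row["prediction"] for row in predictions])
```

The estimator holds no learned state; everything learned from the data lives in the returned model, which is what makes the model usable as an ordinary transformer afterwards.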