Spark ML Pipelines

ML Pipelines provide a uniform set of high-level APIs built on top of DataFrames that help users create and tune practical machine learning pipelines.

A ML Pipeline is defines as a sequence of stages, where each stage perfom some transformation under the DataFrame. These stages are run in order, and the input DataFrame is transformed as it passes through each stage.

Basics

MLlib provides APIs for machine learning algorithms to make it easier to combine multiple algorithms into a single pipeline, or workflow. The key concepts of ML pipelines are:

  • DataFrame: This ML API usesDataFramefrom Spark SQL as an ML dataset, which can hold a variety of data types. E.g., aDataFramecould have different columns storing text, feature vectors, true labels, and predictions.

  • Transformer: ATransformeris an algorithm which can transform oneDataFrameinto anotherDataFrame. E.g., an ML model is aTransformerwhich transforms aDataFramewith features into aDataFramewith predictions.

  • Estimator: AnEstimatoris an algorithm which can be fit on aDataFrameto produce aTransformer. E.g., a learning algorithm is anEstimatorwhich trains on aDataFrameand produces a model.

  • Pipeline: APipelinechains multipleTransformers andEstimators together to specify an ML workflow.

  • Parameter: AllTransformers andEstimators now share a common API for specifying parameters.

Pipeline

A ML Pipeline is defines as a sequence of stages, where each stage perfom some transformation under the DataFrame. These stages are run in order, and the input DataFrame is transformed as it passes through each stage.

Transformer

A transformer is a ML Pipeline component that transforms a DataFrame into another DataFrame. Usually by appending one or more columns.For example:

  • A feature transformer might take a DataFrame, read a column (e.g., text), map it into a new column (e.g., feature vectors), and output a new DataFrame with the mapped column appended.

  • A learning model might take a DataFrame, read the column containing feature vectors, predict the label for each feature vector, and output a new DataFrame with predicted labels appended as a column.

Estimators 

An Estimator abstracts the concept of a learning algorithm or any algorithm that fits or trains on data. Technically, an Estimator implements a method fit(), which accepts a DataFrame and produces a Model, which is a Transformer. For example, a learning algorithm such as LogisticRegression is an Estimator, and calling fit() trains a LogisticRegressionModel, which is a Model and hence a Transformer.

results matching ""

    No results matching ""