Spark MLlib: Frequent Pattern Mining (FP-Growth)
Spark MLlib provides the FP-Growth algorithm for frequent pattern mining (association rule mining). It takes a dataset of transactions as input and finds the itemsets that occur frequently in it. Once the frequent itemsets have been identified, association rules can be derived from them.
From the original source:
spark.mllib's FP-growth implementation takes the following (hyper-)parameters:
minSupport: the minimum support for an itemset to be identified as frequent. For example, if an item appears 3 out of 5 transactions, it has a support of 3/5=0.6.
numPartitions: the number of partitions used to distribute the work.
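To make the support definition concrete, here is a minimal plain-Scala sketch that needs no Spark. The `support` helper and the five sample transactions are assumptions made up for illustration and are not part of spark.mllib; the helper simply computes the fraction of transactions that contain every item of an itemset.

```scala
object SupportSketch {
  // Hypothetical helper: fraction of transactions that contain every item of `itemset`.
  def support(transactions: Seq[Set[String]], itemset: Set[String]): Double =
    if (transactions.isEmpty) 0.0
    else transactions.count(t => itemset.subsetOf(t)).toDouble / transactions.size

  // Five made-up transactions, for illustration only.
  val transactions: Seq[Set[String]] = Seq(
    Set("BREAD", "MILK"),
    Set("BREAD", "TEA"),
    Set("MILK", "BISCUIT"),
    Set("BREAD", "MILK", "TEA"),
    Set("COFFEE")
  )

  def main(args: Array[String]): Unit = {
    // BREAD appears in 3 of 5 transactions
    println(support(transactions, Set("BREAD")))
    // BREAD and MILK appear together in 2 of 5 transactions
    println(support(transactions, Set("BREAD", "MILK")))
  }
}
```

With a minSupport of 0.5, the single item BREAD (support 3/5) would count as frequent in this toy data, while the pair BREAD and MILK (support 2/5) would not.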
Market Basket Analysis Example
As mentioned previously, this approach was originally proposed as a method for solving the market basket problem. Market basket analysis identifies items that customers are likely to buy together, e.g., customers who bought bread also bought milk.
An example of solving this task with Spark's FPGrowth algorithm is shown below. The source code is available in a GitHub repository.
The input dataset is a set of transactions, where each transaction is the list of products bought in a single purchase.
File with transactions:
MILK,BREAD,BISCUIT
BREAD,MILK,BISCUIT,CORNFLAKES
BREAD,MILK,CORNFLAKES
BREAD,MILK,TEA
BREAD,MILK,CORNFLAKES
BREAD,TEA,BOURNVITA
BREAD,COFFEE,COCK
..................
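Each line of the file becomes one transaction. The following is a small plain-Scala sketch of a more defensive parsing step, under the assumption that real input may contain stray spaces or repeated items. The `parseTransaction` helper is hypothetical. To the best of my knowledge, spark.mllib's FPGrowth expects the items within a transaction to be unique, which is why the sketch removes duplicates; treat that as an assumption to verify against your Spark version.

```scala
object TransactionParsing {
  // Hypothetical helper: turn one comma-separated line into a transaction
  // of trimmed, non-empty, unique items (original order preserved).
  def parseTransaction(line: String): Array[String] =
    line.split(",").map(_.trim).filter(_.nonEmpty).distinct

  def main(args: Array[String]): Unit = {
    // Made-up lines, including extra spaces and a repeated item
    val lines = Seq("MILK,BREAD,BISCUIT", "BREAD, MILK ,BREAD", "BREAD,TEA,")
    lines.map(parseTransaction).foreach(t => println(t.mkString("[", ",", "]")))
  }
}
```

In a Spark job, such a function could be passed to `map` on the RDD of lines in place of a bare `split(",")`.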
Source code:
import org.apache.spark.mllib.fpm.FPGrowth
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object FPGrowthEx {

  def main(args: Array[String]): Unit = {
    // initialize Spark configuration
    val conf = new SparkConf()
    conf.setAppName("market-basket-problem")
    conf.setMaster("local[2]")

    // create the SparkContext
    val sc = new SparkContext(conf)

    // path to the file with purchases, passed as the first argument
    val inputFile = args(0)

    // read the file with purchases
    val fileRDD = sc.textFile(inputFile)

    // each line is one transaction: a comma-separated list of products
    val products: RDD[Array[String]] = fileRDD.map(s => s.split(","))

    // get frequent patterns via FPGrowth
    val fpg = new FPGrowth()
      .setMinSupport(0.2)
    val model = fpg.run(products)

    // print each frequent itemset with its frequency
    model.freqItemsets.collect().foreach { itemset =>
      println(itemset.items.mkString("[", ",", "]") + ", " + itemset.freq)
    }

    // get association rules
    val minConfidence = 0.3
    val rules = model.generateAssociationRules(minConfidence)

    // print each rule as antecedent => consequent, confidence
    rules.collect().foreach { rule =>
      println(
        rule.antecedent.mkString("[", ",", "]")
          + " => " + rule.consequent.mkString("[", ",", "]")
          + ", " + rule.confidence)
    }
  }
}
Here is an example of the derived association rules (purchase patterns):
[COCK] => [COFFEE], 1.0
[BREAD] => [MILK], 0.42857142857142855
[BREAD] => [COFFEE], 0.2857142857142857
[MILK] => [BREAD], 0.9
[COFFEE] => [COCK], 0.5454545454545454
[COFFEE] => [BREAD], 0.5454545454545454
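The number printed after each rule is its confidence: of the transactions that contain the antecedent, the fraction that also contain the consequent. A minimal plain-Scala sketch of that calculation follows. The `count` and `confidence` helpers are hypothetical, and the data is only the seven transactions visible in the sample file above (the rest of the file is elided), so the values it produces are not expected to match the output listed here.

```scala
object ConfidenceSketch {
  // Number of transactions that contain every item of `itemset`.
  def count(transactions: Seq[Set[String]], itemset: Set[String]): Int =
    transactions.count(t => itemset.subsetOf(t))

  // confidence(A => B) = count(A and B together) / count(A); 0.0 if A never occurs.
  def confidence(transactions: Seq[Set[String]],
                 antecedent: Set[String],
                 consequent: Set[String]): Double = {
    val a = count(transactions, antecedent)
    if (a == 0) 0.0
    else count(transactions, antecedent ++ consequent).toDouble / a
  }

  // Only the seven transactions shown in the sample file; the full file is longer.
  val sample: Seq[Set[String]] = Seq(
    Set("MILK", "BREAD", "BISCUIT"),
    Set("BREAD", "MILK", "BISCUIT", "CORNFLAKES"),
    Set("BREAD", "MILK", "CORNFLAKES"),
    Set("BREAD", "MILK", "TEA"),
    Set("BREAD", "MILK", "CORNFLAKES"),
    Set("BREAD", "TEA", "BOURNVITA"),
    Set("BREAD", "COFFEE", "COCK")
  )

  def main(args: Array[String]): Unit = {
    // Every sample transaction with MILK also has BREAD (5 of 5)
    println(confidence(sample, Set("MILK"), Set("BREAD")))
    // 5 of the 7 sample transactions with BREAD also have MILK
    println(confidence(sample, Set("BREAD"), Set("MILK")))
  }
}
```

The asymmetry between the two directions is the point: a rule and its reverse share the same joint count but are divided by different antecedent counts, which is why [MILK] => [BREAD] and [BREAD] => [MILK] get different confidences in the output above.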