# set options to hide warnings
options(warn=-1)
# Load relevant libraries (with startup messages hidden)
suppressPackageStartupMessages({
library(tidyverse)
library(arules)
library(arulesViz)
library(arulesSequences)
})
Association Rule Mining and Frequent Sequence Mining
1. Introduction to Association Rule Mining¹
Association rule mining looks for combinations of items that frequently occur together in transactions.
A classic example
Retail basket analysis is one of the most common applications of association rules. By finding frequent itemsets, retailers can understand what is commonly bought together and use this information to increase sales in various ways, such as placing frequently bought-together items next to each other in a store.
Here is an example of five transactions from a grocery store:
transaction_id | items |
---|---|
1 | milk, bread |
2 | bread, butter |
3 | beer, diaper, pizza |
4 | milk, bread, butter |
5 | bread, butter |
What is an example of a frequent itemset from these transactions?
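One answer: {bread, butter} appears in three of the five transactions, so its support is 3/5 = 0.6. As a minimal sketch, we can encode the five transactions directly and mine frequent itemsets with eclat() from arules (the object names and the support threshold here are ours):
# Encode the five example transactions as an arules "transactions" object
grocery_list <- list(
  c("milk", "bread"),
  c("bread", "butter"),
  c("beer", "diaper", "pizza"),
  c("milk", "bread", "butter"),
  c("bread", "butter")
)
grocery_trans <- as(grocery_list, "transactions")

# Find all itemsets appearing in at least 40% of transactions
inspect(eclat(grocery_trans, parameter = list(supp = 0.4)))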
How can association rules be applied in education?
In education, various types of “transactions” happen at different levels of granularity.
Every year, students are admitted to a program. Each semester, students enroll in courses offered by the program. Each week, students enter different buildings on particular dates to take classes. During a particular class session, students interact with the instructor and their peers. All of these events can be considered transactions in education.
Take course enrollment for example. Using the market basket analogy, each student is a ‘customer’. In each semester, they make a ‘transaction’ by enrolling in a set of courses. The data will often take the form of term_ID, student_ID, and course_ID columns, and is typically stored in a Student Information System (SIS).
term_ID | student_ID | course_ID |
---|---|---|
1 | Maria | STAT_101 |
1 | Maria | PSY_101 |
1 | Maria | ECON_101 |
2 | Maria | STAT_201 |
2 | Maria | CS_101 |
2 | Maria | POL_101 |
1 | Chris | CS_101 |
1 | Chris | MATH_101 |
1 | Chris | PSY_101 |
2 | Chris | CS_201 |
2 | Chris | STAT_101 |
The same data in a different (basket) format:
term_ID | student_ID | transaction_ID | course_ID_set |
---|---|---|---|
1 | Maria | 1 | STAT_101, PSY_101, ECON_101 |
2 | Maria | 2 | STAT_201, CS_101, POL_101 |
1 | Chris | 3 | CS_101, MATH_101, PSY_101 |
2 | Chris | 4 | CS_201, STAT_101 |
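Converting between the two formats is a standard tidyverse reshaping step. A minimal sketch using the rows from the tables above (the sis_df name is ours):
# The long SIS records from the first table
sis_df <- tribble(
  ~term_ID, ~student_ID, ~course_ID,
  1, "Maria", "STAT_101",
  1, "Maria", "PSY_101",
  1, "Maria", "ECON_101",
  2, "Maria", "STAT_201",
  2, "Maria", "CS_101",
  2, "Maria", "POL_101",
  1, "Chris", "CS_101",
  1, "Chris", "MATH_101",
  1, "Chris", "PSY_101",
  2, "Chris", "CS_201",
  2, "Chris", "STAT_101"
)

# Collapse to one basket (transaction) per student-term
sis_df %>%
  group_by(term_ID, student_ID) %>%
  summarise(course_ID_set = paste(course_ID, collapse = ", "), .groups = "drop")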
Using the information available in the SIS, we can answer research questions such as:
- Which set of courses are often taken together by students?
- Which set of courses do students often take after taking CS_101 and STAT_101?
For transactions with different levels of temporal granularity, we can ask similar questions.
What are the limitations of association rules?
Association rules focus only on the most frequent (popular) items, so they miss the information in the “long tail” of user preferences that is essential for providing customized recommendations to customers/users.
In such scenarios, other recommender systems such as collaborative filtering (i.e., recommendations based on user similarity) or content-based filtering (i.e., recommendations based on item similarity) can help detect customers with similar interests even if the absolute number of transactions is small (e.g., customers who are interested in a niche category).
A demo from an education scenario
We have a sample dataset about course enrollment transactions.
read_csv('data/course.csv') %>%
head(10)
Rows: 246 Columns: 1
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): x
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 10 × 1
x
<chr>
1 POLS
2 ECON,MATH,STATS
3 ECON
4 BIO,PHY,CHEM,ENG
5 PSY,BIO
6 ENG,BIO
7 POLS,ENG,ECON,STATS,PHY
8 POLS,BIO,STATS,PHY,CHEM
9 BIO,PSY
10 STATS,CHEM,POLS,ENG
This dataset is very simple. It only contains one column, with each row representing a transaction.
Read the transactions from the .csv file using the read.transactions function:
# read in the csv file
transactions <- read.transactions(file = 'data/course.csv',
                                  format = 'basket',
                                  header = FALSE,
                                  sep = ',')
transactions
transactions in sparse format with
247 transactions (rows) and
10 items (columns)
Let’s check the most popular courses:
itemFrequencyPlot(transactions,
topN=10,
type="relative",
main="Relative Item Frequency Plot")
2. Measures of Association
Support
Support tells us how popular an itemset is, relative to the total number of transactions:
Support(X, Y) = P(X, Y) = Frequency(X, Y) / N
For example, in the five grocery transactions above, Support(bread, butter) = 3/5 = 0.6, because {bread, butter} appears in three of the five transactions.
Confidence
Confidence tells us how likely item Y is to be purchased given that item X is purchased:
Confidence(X, Y) = P(X, Y) / P(X)
Continuing the grocery example, Confidence(bread → butter) = Support(bread, butter) / Support(bread) = (3/5) / (4/5) = 0.75.
Consider the course enrollment example, say the rule PSY → POLS.
Note that the confidence measure might misrepresent the importance of an association. This is because it only accounts for how popular item X is (in our case PSY) but not item Y (in our case POLS).
If POLS is also very popular in general, there will be a higher chance that a transaction containing PSY will also contain POLS, thus inflating the confidence measure. To account for the base popularity of both items, we use a third measure called lift.
Lift
Lift tells us how likely item Y is to be purchased when item X is purchased, while controlling for how popular items X and Y are:
Lift(X, Y) = P(X, Y) / (P(X) × P(Y)) = Confidence(X, Y) / P(Y)
A lift greater than 1 means X and Y appear together more often than expected if they were independent; a lift below 1 means they appear together less often.
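As a quick check, all three measures can be computed for any rule set with interestMeasure() from arules. A minimal sketch (the demo_rules name and thresholds are ours):
# Mine a few rules from the course transactions, then score them
demo_rules <- apriori(transactions,
                      parameter = list(supp = 0.02, conf = 0.5),
                      control = list(verbose = FALSE))
interestMeasure(head(demo_rules, 3),
                measure = c("support", "confidence", "lift"),
                transactions = transactions)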
3. Mining Association Rules using the Apriori algorithm
association.rules <- apriori(transactions, parameter = list(supp = 0.02, conf = 0.6, maxlen = 6))
Apriori
Parameter specification:
confidence minval smax arem aval originalSupport maxtime support minlen
0.6 0.1 1 none FALSE TRUE 5 0.02 1
maxlen target ext
6 rules TRUE
Algorithmic control:
filter tree heap memopt load sort verbose
0.1 TRUE TRUE FALSE TRUE 2 TRUE
Absolute minimum support count: 4
set item appearances ...[0 item(s)] done [0.00s].
set transactions ...[10 item(s), 247 transaction(s)] done [0.00s].
sorting and recoding items ... [9 item(s)] done [0.00s].
creating transaction tree ... done [0.00s].
checking subsets of size 1 2 3 4 done [0.00s].
writing ... [19 rule(s)] done [0.00s].
creating S4 object ... done [0.00s].
inspect(association.rules[1:10])
lhs rhs support confidence coverage lift
[1] {MATH, PHY} => {CHEM} 0.06882591 0.6800000 0.10121457 1.646667
[2] {ECON, ENG, MATH} => {STATS} 0.02429150 0.6000000 0.04048583 1.610870
[3] {ENG, MATH, STATS} => {ECON} 0.02429150 0.6000000 0.04048583 1.807317
[4] {ENG, MATH, PHY} => {CHEM} 0.02429150 0.7500000 0.03238866 1.816176
[5] {ECON, MATH, PHY} => {CHEM} 0.02024291 0.7142857 0.02834008 1.729692
[6] {BIO, ECON, MATH} => {PSY} 0.02024291 0.6250000 0.03238866 1.882622
[7] {MATH, PSY, STATS} => {ECON} 0.02024291 0.6250000 0.03238866 1.882622
[8] {MATH, PHY, PSY} => {CHEM} 0.02429150 0.7500000 0.03238866 1.816176
[9] {CHEM, MATH, PSY} => {PHY} 0.02429150 0.6666667 0.03643725 1.937255
[10] {MATH, PHY, STATS} => {CHEM} 0.02024291 0.7142857 0.02834008 1.729692
count
[1] 17
[2] 6
[3] 6
[4] 6
[5] 5
[6] 5
[7] 5
[8] 6
[9] 6
[10] 5
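The rules above are listed in mining order; they can also be ranked by a quality measure before inspection. For example, sorting by lift with arules' sort() surfaces the most strongly associated course sets first:
# Show the five rules with the highest lift
inspect(head(sort(association.rules, by = "lift"), 5))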
With the identified association rules, we can make queries about association patterns related to given items.
# Find what course selections led to students choosing 'ECON'
item.association.rules <- apriori(transactions,
                                  parameter = list(supp = 0.02, conf = 0.6),
                                  appearance = list(default = "lhs", rhs = "ECON"))
Apriori
Parameter specification:
confidence minval smax arem aval originalSupport maxtime support minlen
0.6 0.1 1 none FALSE TRUE 5 0.02 1
maxlen target ext
10 rules TRUE
Algorithmic control:
filter tree heap memopt load sort verbose
0.1 TRUE TRUE FALSE TRUE 2 TRUE
Absolute minimum support count: 4
set item appearances ...[1 item(s)] done [0.00s].
set transactions ...[10 item(s), 247 transaction(s)] done [0.00s].
sorting and recoding items ... [9 item(s)] done [0.00s].
creating transaction tree ... done [0.00s].
checking subsets of size 1 2 3 4 done [0.00s].
writing ... [2 rule(s)] done [0.00s].
creating S4 object ... done [0.00s].
inspect(head(item.association.rules))
lhs rhs support confidence coverage lift
[1] {ENG, MATH, STATS} => {ECON} 0.02429150 0.600 0.04048583 1.807317
[2] {MATH, PSY, STATS} => {ECON} 0.02024291 0.625 0.03238866 1.882622
count
[1] 6
[2] 5
# Find what courses students select if choosing 'MATH' and 'PHY'
item2.association.rules <- apriori(transactions,
                                   parameter = list(supp = 0.02, conf = 0.6),
                                   appearance = list(default = "rhs", lhs = c("MATH", "PHY")))
Apriori
Parameter specification:
confidence minval smax arem aval originalSupport maxtime support minlen
0.6 0.1 1 none FALSE TRUE 5 0.02 1
maxlen target ext
10 rules TRUE
Algorithmic control:
filter tree heap memopt load sort verbose
0.1 TRUE TRUE FALSE TRUE 2 TRUE
Absolute minimum support count: 4
set item appearances ...[2 item(s)] done [0.00s].
set transactions ...[10 item(s), 247 transaction(s)] done [0.00s].
sorting and recoding items ... [9 item(s)] done [0.00s].
creating transaction tree ... done [0.00s].
checking subsets of size 1 2 3 done [0.00s].
writing ... [1 rule(s)] done [0.00s].
creating S4 object ... done [0.00s].
inspect(head(item2.association.rules))
lhs rhs support confidence coverage lift count
[1] {MATH, PHY} => {CHEM} 0.06882591 0.68 0.1012146 1.646667 17
plot(association.rules, engine = "plotly")
To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.
We can also plot the relationship between individual items in the rule sets using network visualizations.
- In the network visualization below, rules (or itemsets) are represented by circle vertices, and items are represented by square vertices.
- For each rule, the LHS items of this rule point towards the rule’s vertex, which points to the RHS items.
plot(association.rules, method = "graph", engine = "htmlwidget")
Finding optimal parameters
Let’s try Apriori with different levels of support and confidence:
# Support and confidence values
supportLevels <- c(0.1, 0.05, 0.01, 0.005)
confidenceLevels <- c(0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1)

# Empty integer vectors to store the number of rules found
rules_sup10 <- integer(length = 9)
rules_sup5 <- integer(length = 9)
rules_sup1 <- integer(length = 9)
rules_sup0.5 <- integer(length = 9)

# Apriori algorithm with a support level of 10%
for (i in 1:length(confidenceLevels)) {
  rules_sup10[i] <- length(apriori(transactions,
                                   parameter = list(sup = supportLevels[1],
                                                    conf = confidenceLevels[i],
                                                    target = "rules")))
}

# Apriori algorithm with a support level of 5%
for (i in 1:length(confidenceLevels)) {
  rules_sup5[i] <- length(apriori(transactions,
                                  parameter = list(sup = supportLevels[2],
                                                   conf = confidenceLevels[i],
                                                   target = "rules")))
}

# Apriori algorithm with a support level of 1%
for (i in 1:length(confidenceLevels)) {
  rules_sup1[i] <- length(apriori(transactions,
                                  parameter = list(sup = supportLevels[3],
                                                   conf = confidenceLevels[i],
                                                   target = "rules")))
}

# Apriori algorithm with a support level of 0.5%
for (i in 1:length(confidenceLevels)) {
  rules_sup0.5[i] <- length(apriori(transactions,
                                    parameter = list(sup = supportLevels[4],
                                                     conf = confidenceLevels[i],
                                                     target = "rules")))
}
# Data frame
num_rules <- data.frame(rules_sup10, rules_sup5, rules_sup1, rules_sup0.5, confidenceLevels)
# Number of rules found with a support level of 10%, 5%, 1% and 0.5%
ggplot(data=num_rules, aes(x=confidenceLevels)) +
# Plot line and points (support level of 10%)
geom_line(aes(y=rules_sup10, colour="Support level of 10%")) +
geom_point(aes(y=rules_sup10, colour="Support level of 10%")) +
# Plot line and points (support level of 5%)
geom_line(aes(y=rules_sup5, colour="Support level of 5%")) +
geom_point(aes(y=rules_sup5, colour="Support level of 5%")) +
# Plot line and points (support level of 1%)
geom_line(aes(y=rules_sup1, colour="Support level of 1%")) +
geom_point(aes(y=rules_sup1, colour="Support level of 1%")) +
# Plot line and points (support level of 0.5%)
geom_line(aes(y=rules_sup0.5, colour="Support level of 0.5%")) +
geom_point(aes(y=rules_sup0.5, colour="Support level of 0.5%")) +
# Labs and theme
labs(x="Confidence levels", y="Number of rules found",
title="Apriori algorithm with different support levels") +
theme_bw() +
theme(legend.title=element_blank())
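As an aside, the four near-identical loops above can be collapsed into a single parameter sweep; a sketch (the num_rules_long name is ours, and verbose output is suppressed):
# Count rules for every support/confidence combination in one pass
num_rules_long <- expand.grid(support = supportLevels,
                              confidence = confidenceLevels)
num_rules_long$n_rules <- mapply(
  function(s, c) length(apriori(transactions,
                                parameter = list(supp = s, conf = c,
                                                 target = "rules"),
                                control = list(verbose = FALSE))),
  num_rules_long$support, num_rules_long$confidence)

ggplot(num_rules_long,
       aes(x = confidence, y = n_rules, colour = factor(support))) +
  geom_line() +
  geom_point() +
  labs(x = "Confidence levels", y = "Number of rules found",
       colour = "Support level",
       title = "Apriori algorithm with different support levels") +
  theme_bw()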
4. Frequent Sequence Mining using the cSPADE algorithm
Frequent Sequence Mining is used to discover patterns shared among objects in which the elements occur in a specific order. For instance, students enroll in multiple courses over different semesters. In this case, we may use Frequent Sequence Mining to find that 40% of the students who enrolled in STAT_101 in term 1 continued to enroll in STAT_201 in term 2.
We have another dataset about course selection:
read_csv(file = "data/course_term.csv", col_names = FALSE) %>%
head(10)
Rows: 738 Columns: 1
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): X1
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 10 × 1
X1
<chr>
1 1;1;1;ECON
2 1;2;1;STATS
3 1;3;1;PHY
4 2;1;2;ECON,MATH
5 2;2;2;ENG,POLS
6 2;3;2;BIO,CHEM
7 3;1;1;MATH
8 3;2;1;STATS
9 3;3;1;PHY
10 4;1;3;PSY,ECON,STATS
Each row has four semicolon-separated fields: a sequence ID (the student), an event ID (the term), the basket size, and the courses taken that term. We read it with read_baskets() from arulesSequences:
# Read baskets keyed by student (sequenceID) and term (eventID)
trans_matrix <- read_baskets("data/course_term.csv",
                             sep = ";",
                             info = c("sequenceID", "eventID", "size"))
summary(trans_matrix)
transactions as itemMatrix in sparse format with
738 rows (elements/itemsets/transactions) and
291 columns (items) and a density of 0.003436426
most frequent items:
ECON ENG PSY,ECON STATS STATS,PSY (Other)
15 15 14 13 13 668
element (itemset/transaction) length distribution:
sizes
1
738
Min. 1st Qu. Median Mean 3rd Qu. Max.
1 1 1 1 1 1
includes extended item information - examples:
labels
1 BIO
2 BIO,CHEM
3 BIO,CHEM,ENG
includes extended transaction information - examples:
sequenceID eventID size
1 1 1 1
2 1 2 1
3 1 3 1
# Get frequent sequences and corresponding support values
s1 <- cspade(trans_matrix,
             parameter = list(support = 0.02),
             control = list(verbose = FALSE))
s1.df <- as(s1, "data.frame")
summary(s1)
set of 55 sequences with
most frequent items:
PHY PHY,POLS POLS ENG BIO (Other)
3 3 3 3 2 54
most frequent elements:
{ENG} {PHY} {PHY,POLS} {POLS} {BIO} (Other)
3 3 3 3 2 54
element (sequence) size distribution:
sizes
1 2
42 13
sequence length distribution:
lengths
1 2
42 13
summary of quality measures:
support
Min. :0.02033
1st Qu.:0.02033
Median :0.02439
Mean :0.02905
3rd Qu.:0.03252
Max. :0.06098
includes transaction ID lists: FALSE
mining info:
data ntransactions nsequences support
trans_matrix 738 246 0.02
inspect(head(s1))
items support
1 <{BIO}> 0.03252033
2 <{CHEM}> 0.02845528
3 <{CHEM,BIO}> 0.02032520
4 <{CHEM,PHY}> 0.02032520
5 <{ECON}> 0.05284553
6 <{ECON,MATH}> 0.03252033
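Many of the 55 frequent sequences are single itemsets (see the size distribution above). If only true ordered patterns are of interest, they can be filtered out; a sketch, assuming the size() accessor for sequences in arulesSequences behaves as documented:
# Keep only sequences with more than one element (i.e., spanning terms)
multi_step <- s1[size(s1) > 1]
inspect(head(multi_step))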
# Convert back to a data frame
itemsets_df <- as(s1, "data.frame") %>% as_tibble()

# Top 10 frequent sequences
itemsets_df %>%
  slice_max(support, n = 10) %>%
  ggplot(aes(x = fct_reorder(sequence, support),
             y = support,
             fill = sequence)) +
  geom_col() +
  geom_label(aes(label = support %>% scales::percent()), hjust = 0.5) +
  labs(x = "Sequence", y = "Support",
       title = "Most Frequent Sequences",
       caption = "Support is the percent of student sequences that contain the item set") +
  scale_fill_discrete(guide = "none") +
  scale_y_continuous(labels = scales::percent,
                     expand = expansion(mult = c(0, .1))) +
  coord_flip() +
  cowplot::theme_cowplot()
rules <- ruleInduction(s1,
                       confidence = 0.01,
                       control = list(verbose = FALSE))
inspect(head(rules, 5))
lhs rhs support confidence lift
1 <{PHY,POLS}> => <{POLS,CHEM}> 0.0203252 0.5000000 24.600000
2 <{PHY}> => <{POLS}> 0.0203252 0.4545455 9.318182
3 <{STATS,PHY,ENG}> => <{PHY,POLS,BIO}> 0.0203252 1.0000000 49.200000
4 <{STATS,PHY}> => <{PHY,POLS}> 0.0203252 1.0000000 24.600000
5 <{STATS,POLS}> => <{PHY,CHEM}> 0.0203252 1.0000000 49.200000
# Remove redundant rules, then convert to a tibble for wrangling
rules_cleaned <- rules[!is.redundant(rules)]
rules_df <- as(rules_cleaned, "data.frame") %>%
  as_tibble() %>%
  separate(col = rule, into = c('lhs', 'rhs'), sep = " => ", remove = FALSE)

rules_df %>%
  arrange(-confidence) %>%
  select(lhs, rhs, support, confidence, lift) %>%
  head() %>%
  knitr::kable()
lhs | rhs | support | confidence | lift |
---|---|---|---|---|
<{STATS,PHY,ENG}> | <{PHY,POLS,BIO}> | 0.0203252 | 1.0000000 | 49.20 |
<{STATS,PHY}> | <{PHY,POLS}> | 0.0203252 | 1.0000000 | 24.60 |
<{STATS,POLS}> | <{PHY,CHEM}> | 0.0203252 | 1.0000000 | 49.20 |
<{ECON,PHY}> | <{ENG,POLS}> | 0.0243902 | 1.0000000 | 30.75 |
<{POLS,STATS}> | <{CHEM,PHY}> | 0.0203252 | 1.0000000 | 49.20 |
<{POLS,ENG}> | <{CHEM,BIO}> | 0.0203252 | 0.8333333 | 41.00 |
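Finally, the rule set can be queried directly. A sketch, assuming subset() on sequence rules accepts quality measures as it does for rules in arules (the thresholds are ours):
# Keep only strong, high-lift sequence rules
strong_rules <- subset(rules_cleaned, lift > 10 & confidence >= 0.8)
inspect(strong_rules)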
Footnotes
1. Some content in this document was adapted from materials originally created by Dr. Quan Nguyen from the University of British Columbia.