Association Rule Mining and Frequent Sequence Mining

Author

Bodong Chen

1. Introduction to Association Rule Mining 1

Association rule mining looks for combinations of items that frequently occur together in transactions.

A classic example

Retail basket analysis is one of the most common applications of association rules. By finding frequent itemsets, retailers can understand what is commonly bought together and use this information to increase sales in various ways, such as placing frequently co-purchased items next to each other in a store.

Here is an example of five transactions from a grocery store:

transaction_id   items
1                milk, bread
2                bread, butter
3                beer, diaper, pizza
4                milk, bread, butter
5                bread, butter

What is an example of a frequent itemset from these transactions?
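For instance, {bread, butter} appears in three of the five transactions. Such co-occurrences can be counted by hand in base R; a minimal sketch (the `baskets` list transcribes the table above, and `itemset_count` is just an illustrative helper name):

```r
# The five grocery transactions from the table above
baskets <- list(
  c("milk", "bread"),
  c("bread", "butter"),
  c("beer", "diaper", "pizza"),
  c("milk", "bread", "butter"),
  c("bread", "butter")
)

# Count how many transactions contain every item of a candidate itemset
itemset_count <- function(itemset, trans) {
  sum(sapply(trans, function(t) all(itemset %in% t)))
}

itemset_count(c("bread", "butter"), baskets)  # appears in 3 of 5 transactions
itemset_count(c("milk", "bread"), baskets)    # appears in 2 of 5 transactions
```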

How can association rules be applied in education?

In education, various types of “transactions” are happening at different levels of granularity.

Every year, students are admitted to a program. Each semester, students are enrolled in courses offered by the program. Each week, students enter different buildings on particular dates to take classes. During a particular class session, students are interacting with the instructor and their peers. These events could all be considered transactions in education.

Take course enrollment for example. Following the market basket analogy, each student is a ‘customer’. In each semester, they make a ‘transaction’ by enrolling in a set of courses. The data often take the form of records with a term_ID, student_ID, and course_ID, and are typically stored in a Student Information System (SIS).

term_ID   student_ID   course_ID
1         Maria        STAT_101
1         Maria        PSY_101
1         Maria        ECON_101
2         Maria        STAT_201
2         Maria        CS_101
2         Maria        POL_101
1         Chris        CS_101
1         Chris        MATH_101
1         Chris        PSY_101
2         Chris        CS_201
2         Chris        STAT_101

Here is the same data in a different (basket) format:

term_ID   student_ID   transaction_ID   course_ID_set
1         Maria        1                STAT_101, PSY_101, ECON_101
2         Maria        2                STAT_201, CS_101, POL_101
1         Chris        3                CS_101, MATH_101, PSY_101
2         Chris        4                CS_201, STAT_101
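The two formats are easy to convert between. A sketch using dplyr, assuming the long-format SIS records are in a tibble (the `enrollments` name is just illustrative):

```r
library(tidyverse)

# Long-format SIS records: one row per (term, student, course)
enrollments <- tribble(
  ~term_ID, ~student_ID, ~course_ID,
  1, "Maria", "STAT_101",
  1, "Maria", "PSY_101",
  1, "Maria", "ECON_101",
  2, "Maria", "STAT_201",
  2, "Maria", "CS_101",
  2, "Maria", "POL_101",
  1, "Chris", "CS_101",
  1, "Chris", "MATH_101",
  1, "Chris", "PSY_101",
  2, "Chris", "CS_201",
  2, "Chris", "STAT_101"
)

# Collapse each (student, term) pair into a single 'basket' of courses
enrollments %>%
  group_by(student_ID, term_ID) %>%
  summarise(course_ID_set = paste(course_ID, collapse = ", "),
            .groups = "drop")
```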

Using the information available in the SIS, we can answer research questions such as:

  • Which sets of courses are often taken together by students?
  • Which set of courses do students often take after taking CS_101 and STAT_101?

For transactions with different levels of temporal granularity, we can ask similar questions.

What are the limitations of association rules?

Association rules focus only on the most frequent (popular) items, missing the information present in the “long tail” of user preferences that is essential for providing customized recommendations to customers/users.

In such scenarios, other recommender techniques such as collaborative filtering (i.e., recommendations based on user similarity) or content-based filtering (i.e., recommendations based on item similarity) can help detect customers with similar interests even if the absolute number of transactions is small (e.g., customers who are interested in a niche category).

A demo from an education scenario

# set options to hide warnings
options(warn=-1)

# Load relevant libraries (with startup messages hidden)
suppressPackageStartupMessages({
  library(tidyverse)
  
  library(arules)
  library(arulesViz)
  library(arulesSequences)
})

We have a sample dataset about course enrollment transactions.

read_csv('data/course.csv') %>%
  head(10)
Rows: 246 Columns: 1
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): x

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 10 × 1
   x                      
   <chr>                  
 1 POLS                   
 2 ECON,MATH,STATS        
 3 ECON                   
 4 BIO,PHY,CHEM,ENG       
 5 PSY,BIO                
 6 ENG,BIO                
 7 POLS,ENG,ECON,STATS,PHY
 8 POLS,BIO,STATS,PHY,CHEM
 9 BIO,PSY                
10 STATS,CHEM,POLS,ENG    

This dataset is very simple. It only contains one column, with each row representing a transaction.

Read the transactions from the .csv file using the read.transactions function:

# read in the csv file
transactions <- read.transactions(file = 'data/course.csv', 
                                  format = 'basket', 
                                  header = FALSE, 
                                  sep=',')
transactions
transactions in sparse format with
 247 transactions (rows) and
 10 items (columns)

Let’s check the most popular courses:

itemFrequencyPlot(transactions, 
                  topN=10, 
                  type="relative", 
                  main="Relative Item Frequency Plot")
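The same information is available numerically through `itemFrequency()`, which can be easier to read off than the plot; assuming the `transactions` object from above:

```r
# Relative support of each individual course, most popular first
sort(itemFrequency(transactions, type = "relative"), decreasing = TRUE)
```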

2. Measures of Association

Support

Support tells us how popular an itemset is relative to the total number of transactions.

Support(X,Y) = P(X,Y) = Frequency(X,Y) / N

where N is the total number of transactions.

Confidence

Confidence tells us how likely an item Y is purchased given item X is purchased.

Confidence(X,Y) = P(X,Y) / P(X)

Consider the course enrollment example: the confidence of the rule {PSY} → {POLS} tells us how likely a student who takes PSY also takes POLS.

Note that the confidence measure might misrepresent the importance of an association. This is because it only accounts for how popular item X is (in our case PSY) but not Y (in our case POLS).

If POLS is also very popular in general, there will be a higher chance that a transaction containing PSY will also contain POLS, inflating the confidence measure. To account for the base popularity of both items, we use a third measure called lift.

Lift

Lift tells us how likely item Y is purchased when item X is purchased, while controlling for how popular items Y and X are:

Lift(X,Y) = P(X,Y) / (P(X) × P(Y)) = Confidence(X,Y) / P(Y)

A lift greater than 1 indicates that X and Y occur together more often than would be expected if they were independent.
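To make the three measures concrete, here is how they can be computed by hand for the rule {bread} → {butter} in the five grocery transactions from Section 1 (a base-R sketch; `baskets` and `contains` are illustrative names):

```r
# The five grocery transactions from Section 1
baskets <- list(
  c("milk", "bread"),
  c("bread", "butter"),
  c("beer", "diaper", "pizza"),
  c("milk", "bread", "butter"),
  c("bread", "butter")
)
n <- length(baskets)

# TRUE/FALSE per transaction: does it contain every item in the itemset?
contains <- function(itemset) {
  sapply(baskets, function(t) all(itemset %in% t))
}

# Measures for the rule {bread} => {butter}
support    <- sum(contains(c("bread", "butter"))) / n     # P(X,Y) = 3/5 = 0.60
confidence <- support / (sum(contains("bread")) / n)      # P(X,Y)/P(X) = 0.75
lift       <- confidence / (sum(contains("butter")) / n)  # 0.75 / 0.6  = 1.25
c(support = support, confidence = confidence, lift = lift)
```

The lift of 1.25 says bread and butter co-occur 25% more often than they would if purchases were independent.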

3. Mining Association Rules Using the Apriori Algorithm

association.rules <- apriori(transactions, parameter = list(supp=0.02, conf=0.6, maxlen=6))
Apriori

Parameter specification:
 confidence minval smax arem  aval originalSupport maxtime support minlen
        0.6    0.1    1 none FALSE            TRUE       5    0.02      1
 maxlen target  ext
      6  rules TRUE

Algorithmic control:
 filter tree heap memopt load sort verbose
    0.1 TRUE TRUE  FALSE TRUE    2    TRUE

Absolute minimum support count: 4 

set item appearances ...[0 item(s)] done [0.00s].
set transactions ...[10 item(s), 247 transaction(s)] done [0.00s].
sorting and recoding items ... [9 item(s)] done [0.00s].
creating transaction tree ... done [0.00s].
checking subsets of size 1 2 3 4 done [0.00s].
writing ... [19 rule(s)] done [0.00s].
creating S4 object  ... done [0.00s].
inspect(association.rules[1:10])
     lhs                   rhs     support    confidence coverage   lift    
[1]  {MATH, PHY}        => {CHEM}  0.06882591 0.6800000  0.10121457 1.646667
[2]  {ECON, ENG, MATH}  => {STATS} 0.02429150 0.6000000  0.04048583 1.610870
[3]  {ENG, MATH, STATS} => {ECON}  0.02429150 0.6000000  0.04048583 1.807317
[4]  {ENG, MATH, PHY}   => {CHEM}  0.02429150 0.7500000  0.03238866 1.816176
[5]  {ECON, MATH, PHY}  => {CHEM}  0.02024291 0.7142857  0.02834008 1.729692
[6]  {BIO, ECON, MATH}  => {PSY}   0.02024291 0.6250000  0.03238866 1.882622
[7]  {MATH, PSY, STATS} => {ECON}  0.02024291 0.6250000  0.03238866 1.882622
[8]  {MATH, PHY, PSY}   => {CHEM}  0.02429150 0.7500000  0.03238866 1.816176
[9]  {CHEM, MATH, PSY}  => {PHY}   0.02429150 0.6666667  0.03643725 1.937255
[10] {MATH, PHY, STATS} => {CHEM}  0.02024291 0.7142857  0.02834008 1.729692
     count
[1]  17   
[2]   6   
[3]   6   
[4]   6   
[5]   5   
[6]   5   
[7]   5   
[8]   6   
[9]   6   
[10]  5   

With the identified association rules, we can make queries about association patterns related to given items.

# Find what course selections led to students choosing 'ECON'
item.association.rules <- apriori(transactions, parameter = list(supp=0.02, conf=0.6),
                                  appearance = list(default="lhs", rhs="ECON"))
Apriori

Parameter specification:
 confidence minval smax arem  aval originalSupport maxtime support minlen
        0.6    0.1    1 none FALSE            TRUE       5    0.02      1
 maxlen target  ext
     10  rules TRUE

Algorithmic control:
 filter tree heap memopt load sort verbose
    0.1 TRUE TRUE  FALSE TRUE    2    TRUE

Absolute minimum support count: 4 

set item appearances ...[1 item(s)] done [0.00s].
set transactions ...[10 item(s), 247 transaction(s)] done [0.00s].
sorting and recoding items ... [9 item(s)] done [0.00s].
creating transaction tree ... done [0.00s].
checking subsets of size 1 2 3 4 done [0.00s].
writing ... [2 rule(s)] done [0.00s].
creating S4 object  ... done [0.00s].
inspect(head(item.association.rules))
    lhs                   rhs    support    confidence coverage   lift    
[1] {ENG, MATH, STATS} => {ECON} 0.02429150 0.600      0.04048583 1.807317
[2] {MATH, PSY, STATS} => {ECON} 0.02024291 0.625      0.03238866 1.882622
    count
[1] 6    
[2] 5    
# Find what courses students select if choosing 'MATH' and 'PHY'
item2.association.rules <- apriori(transactions, parameter = list(supp=0.02, conf=0.6),
                                   appearance = list(default="rhs",lhs=c("MATH","PHY")))
Apriori

Parameter specification:
 confidence minval smax arem  aval originalSupport maxtime support minlen
        0.6    0.1    1 none FALSE            TRUE       5    0.02      1
 maxlen target  ext
     10  rules TRUE

Algorithmic control:
 filter tree heap memopt load sort verbose
    0.1 TRUE TRUE  FALSE TRUE    2    TRUE

Absolute minimum support count: 4 

set item appearances ...[2 item(s)] done [0.00s].
set transactions ...[10 item(s), 247 transaction(s)] done [0.00s].
sorting and recoding items ... [9 item(s)] done [0.00s].
creating transaction tree ... done [0.00s].
checking subsets of size 1 2 3 done [0.00s].
writing ... [1 rule(s)] done [0.00s].
creating S4 object  ... done [0.00s].
inspect(head(item2.association.rules))
    lhs            rhs    support    confidence coverage  lift     count
[1] {MATH, PHY} => {CHEM} 0.06882591 0.68       0.1012146 1.646667 17   
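Rule sets can also be filtered and reordered directly with arules’ `subset()` and `sort()`; for example, to keep only high-lift rules from `association.rules` and list the most confident first (the 1.8 threshold is arbitrary, chosen for illustration):

```r
# Keep rules whose lift exceeds 1.8, then show the most confident first
strong.rules <- subset(association.rules, lift > 1.8)
inspect(sort(strong.rules, by = "confidence"))
```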
plot(association.rules, engine = "plotly")
To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.

We can also plot the relationship between individual items in the rule sets using network visualizations.

  • In the network visualization below, rules (or itemsets) are represented by circle vertices, and items are represented by square vertices.
  • For each rule, the LHS items of this rule point towards the rule’s vertex, which points to the RHS items.
plot(association.rules, method = "graph",  engine = "htmlwidget")

Finding optimal parameters

Let’s try Apriori with different levels of support and confidence:

# Support and confidence values
supportLevels <- c(0.1, 0.05, 0.01, 0.005)
confidenceLevels <- c(0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1)

# Empty integers 
rules_sup10 <- integer(length=9)
rules_sup5 <- integer(length=9)
rules_sup1 <- integer(length=9)
rules_sup0.5 <- integer(length=9)

# Apriori algorithm with a support level of 10%
for (i in 1:length(confidenceLevels)) {
  
  rules_sup10[i] <- length(apriori(transactions, parameter=list(sup=supportLevels[1], 
                                   conf=confidenceLevels[i], target="rules")))
  
}

# Apriori algorithm with a support level of 5%
for (i in 1:length(confidenceLevels)){
  
  rules_sup5[i] <- length(apriori(transactions, parameter=list(sup=supportLevels[2], 
                                  conf=confidenceLevels[i], target="rules")))
  
}

# Apriori algorithm with a support level of 1%
for (i in 1:length(confidenceLevels)){
  
  rules_sup1[i] <- length(apriori(transactions, parameter=list(sup=supportLevels[3], 
                                  conf=confidenceLevels[i], target="rules")))
  
}

# Apriori algorithm with a support level of 0.5%
for (i in 1:length(confidenceLevels)){
  
  rules_sup0.5[i] <- length(apriori(transactions, parameter=list(sup=supportLevels[4], 
                                    conf=confidenceLevels[i], target="rules")))
  
}


# Data frame
num_rules <- data.frame(rules_sup10, rules_sup5, rules_sup1, rules_sup0.5, confidenceLevels)

# Number of rules found with a support level of 10%, 5%, 1% and 0.5%
ggplot(data=num_rules, aes(x=confidenceLevels)) +
  
  # Plot line and points (support level of 10%)
  geom_line(aes(y=rules_sup10, colour="Support level of 10%")) + 
  geom_point(aes(y=rules_sup10, colour="Support level of 10%")) +
  
  # Plot line and points (support level of 5%)
  geom_line(aes(y=rules_sup5, colour="Support level of 5%")) +
  geom_point(aes(y=rules_sup5, colour="Support level of 5%")) +
  
  # Plot line and points (support level of 1%)
  geom_line(aes(y=rules_sup1, colour="Support level of 1%")) + 
  geom_point(aes(y=rules_sup1, colour="Support level of 1%")) +
  
  # Plot line and points (support level of 0.5%)
  geom_line(aes(y=rules_sup0.5, colour="Support level of 0.5%")) +
  geom_point(aes(y=rules_sup0.5, colour="Support level of 0.5%")) +
  
  # Labs and theme
  labs(x="Confidence levels", y="Number of rules found", 
       title="Apriori algorithm with different support levels") +
  theme_bw() +
  theme(legend.title=element_blank())
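As an aside, the four near-identical loops above can be collapsed into a single pass over the support × confidence grid; a sketch (`verbose = FALSE` suppresses the per-run Apriori console output, and `rule_counts` is an illustrative name):

```r
# One apriori() call per (support, confidence) pair; result is a 9 x 4 matrix
rule_counts <- sapply(supportLevels, function(s) {
  sapply(confidenceLevels, function(conf) {
    length(apriori(transactions,
                   parameter = list(supp = s, conf = conf, target = "rules"),
                   control = list(verbose = FALSE)))
  })
})
colnames(rule_counts) <- paste0("sup_", supportLevels)
rownames(rule_counts) <- confidenceLevels
rule_counts
```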

4. Frequent Sequence Mining Using the cSPADE Algorithm

Frequent sequence mining is used to discover patterns shared among objects that occur in a specific order. For instance, students enroll in multiple courses over different semesters. In this case, we may use frequent sequence mining to find, say, that 40% of the students who enrolled in STAT_101 in term 1 continued to enroll in STAT_201 in term 2.

We have another dataset about course selection:

read_csv(file = "data/course_term.csv", col_names = FALSE) %>%
  head(10)
Rows: 738 Columns: 1
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): X1

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 10 × 1
   X1                  
   <chr>               
 1 1;1;1;ECON          
 2 1;2;1;STATS         
 3 1;3;1;PHY           
 4 2;1;2;ECON,MATH     
 5 2;2;2;ENG,POLS      
 6 2;3;2;BIO,CHEM      
 7 3;1;1;MATH          
 8 3;2;1;STATS         
 9 3;3;1;PHY           
10 4;1;3;PSY,ECON,STATS
library(arulesSequences)

trans_matrix <- read_baskets("data/course_term.csv", 
                             sep = ";", 
                             info = c("sequenceID","eventID", "size"))
summary(trans_matrix)
transactions as itemMatrix in sparse format with
 738 rows (elements/itemsets/transactions) and
 291 columns (items) and a density of 0.003436426 

most frequent items:
     ECON       ENG  PSY,ECON     STATS STATS,PSY   (Other) 
       15        15        14        13        13       668 

element (itemset/transaction) length distribution:
sizes
  1 
738 

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      1       1       1       1       1       1 

includes extended item information - examples:
        labels
1          BIO
2     BIO,CHEM
3 BIO,CHEM,ENG

includes extended transaction information - examples:
  sequenceID eventID size
1          1       1    1
2          1       2    1
3          1       3    1
# Get frequent sequences and corresponding support values
s1 <- cspade(trans_matrix, parameter = list(support = 0.02), control = list(verbose = F))
s1.df <- as(s1, "data.frame")
summary(s1)
set of 55 sequences with

most frequent items:
     PHY PHY,POLS     POLS      ENG      BIO  (Other) 
       3        3        3        3        2       54 

most frequent elements:
     {ENG}      {PHY} {PHY,POLS}     {POLS}      {BIO}    (Other) 
         3          3          3          3          2         54 

element (sequence) size distribution:
sizes
 1  2 
42 13 

sequence length distribution:
lengths
 1  2 
42 13 

summary of quality measures:
    support       
 Min.   :0.02033  
 1st Qu.:0.02033  
 Median :0.02439  
 Mean   :0.02905  
 3rd Qu.:0.03252  
 Max.   :0.06098  

includes transaction ID lists: FALSE 

mining info:
         data ntransactions nsequences support
 trans_matrix           738        246    0.02
inspect(head(s1))
   items            support 
 1 <{BIO}>       0.03252033 
 2 <{CHEM}>      0.02845528 
 3 <{CHEM,BIO}>  0.02032520 
 4 <{CHEM,PHY}>  0.02032520 
 5 <{ECON}>      0.05284553 
 6 <{ECON,MATH}> 0.03252033 
 
# Convert back to a data frame
itemsets_df <- as(s1, "data.frame") %>% as_tibble()

# Top 10 frequent sequences
itemsets_df %>%
  slice_max(support, n = 10) %>% 
  ggplot(aes(x = fct_reorder(sequence, support),
                    y = support,
                    fill = sequence)) + 
    geom_col() + 
    geom_label(aes(label = support %>% scales::percent()), hjust = 0.5) + 
    labs(x = "Sequence", y = "Support", title = "Most Frequent Sequences",
         caption = "**Support** is the percent of sequences that contain the item set") + 
    scale_fill_discrete(guide = "none") +
    scale_y_continuous(labels = scales::percent,
                       expand = expansion(mult = c(0, .1))) + 
    coord_flip() + 
    cowplot::theme_cowplot() 

rules <- ruleInduction(s1, 
                       confidence = 0.01, 
                       control = list(verbose = FALSE))

inspect(head(rules, 5))
   lhs                  rhs                 support confidence      lift 
 1 <{PHY,POLS}>      => <{POLS,CHEM}>     0.0203252  0.5000000 24.600000 
 2 <{PHY}>           => <{POLS}>          0.0203252  0.4545455  9.318182 
 3 <{STATS,PHY,ENG}> => <{PHY,POLS,BIO}>  0.0203252  1.0000000 49.200000 
 4 <{STATS,PHY}>     => <{PHY,POLS}>      0.0203252  1.0000000 24.600000 
 5 <{STATS,POLS}>    => <{PHY,CHEM}>      0.0203252  1.0000000 49.200000 
 
rules_cleaned <- rules[!is.redundant(rules)]
rules_df <- as(rules_cleaned, "data.frame") %>% 
  as_tibble() %>% 
  separate(col = rule, into = c('lhs', 'rhs'), sep = " => ", remove = F)

rules_df %>% 
  arrange(-confidence) %>% 
  select(lhs, rhs, support, confidence, lift) %>% 
  head() %>% 
  knitr::kable()
lhs                 rhs                 support    confidence  lift
<{STATS,PHY,ENG}>   <{PHY,POLS,BIO}>    0.0203252  1.0000000   49.20
<{STATS,PHY}>       <{PHY,POLS}>        0.0203252  1.0000000   24.60
<{STATS,POLS}>      <{PHY,CHEM}>        0.0203252  1.0000000   49.20
<{ECON,PHY}>        <{ENG,POLS}>        0.0243902  1.0000000   30.75
<{POLS,STATS}>      <{CHEM,PHY}>        0.0203252  1.0000000   49.20
<{POLS,ENG}>        <{CHEM,BIO}>        0.0203252  0.8333333   41.00

Footnotes

  1. Some content in this document was adapted from materials originally created by Dr. Quan Nguyen from the University of British Columbia.↩︎