Exploratory Analysis of Sequence Data

Author

Bodong Chen

0. Preparation for this module

We will use R in this module. Participants are expected to know R and encouraged to use RStudio, Jupyter Notebook, or an equivalent environment.
Participants are encouraged to familiarize themselves with R’s Tidyverse, especially the dplyr package.

1. Temporal Analysis of Data in Education ¹

Educational activities unfold over time and are temporal by nature. As a field, learning analytics aims to generate understanding of, and support for, such processes of learning. Indeed, a core characteristic of learning analytics is the generation of high-resolution temporal data about various types of actions.

However, temporality has typically been underexplored in both basic and applied learning research. Studies “frequently neglect to make full use of information relating to time and order” (Reimann, 2009, p. 239).

Types of temporal data in education

Many types of data are recorded in digital learning platforms. They have different temporal granularity and can be used to answer different questions.

For example:

Data source	Time granularity	Questions
SIS records	Semester/Year	What are the common learning pathways in a program based on students’ course enrollment data?
LMS log files	Second/Day/Semester	What are the common learning patterns of students during a course?
Eye-tracking	Millisecond/Second/Minute	What engagement patterns exist when students read a digital textbook?

Types of research questions that are temporal/sequential specific

Exploratory: What common patterns exist in the data?
Predictive: Can we predict the next event given a sequence? Can we predict a learning outcome given a sequence?
Causal: Does X cause Y?

Guiding questions for temporal considerations in learning analytics

According to Knight, Wise, & Chen (2017), insightful temporal analysis requires:

Conceptualising the temporal nature of learning constructs,
Translating these theoretical propositions into specific methodological approaches for the capture and analysis of temporal data, and
Practical methods for capturing temporal data features and using analyses to impact learning contexts

In learning analytics, it is often important to ask meta-questions when planning temporal analysis:

What are the key learning constructs and how are they conceptualized with respect to time?
Where are the learning constructs observed and how are they represented in data?
How are the theorized learning constructs analyzed with regard to their temporal features?
To whom and how do the learning analytics provide temporal insight that can lead to temporal impact on learning processes?

In this module, we will attempt to ask these questions and play with several popular tools for temporal and sequential analysis.

2. Exploratory Sequential Analysis using `TraMineR`

TraMineR is an R package for mining, describing, and visualizing sequences of states or events.

TraMineR is commonly used to:

Visualize sequences
Characterize sequences using descriptive statistics (e.g., frequencies, transition probabilities, entropy)
Cluster sequences based on similarity analysis of sequences
Conduct discrepancy analyses to study how sequences are related to covariates
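
To make one of these descriptive statistics concrete, the sketch below estimates first-order transition probabilities for a single toy sequence using base R only. The sequence and variable names are made up for illustration; for real sequence objects, TraMineR offers its own function, seqtrate(), which this sketch does not try to reproduce.

```r
# A made-up sequence of learning states for one student (hypothetical data)
states <- c("Read", "Read", "Write", "Write", "Discuss", "Review", "Review")

# Pair each state with the state that follows it
from <- states[-length(states)]
to   <- states[-1]

# Count the transitions, then turn each row into proportions
counts <- table(from, to)
rates  <- prop.table(counts, margin = 1)
round(rates, 2)
```

Each row of rates sums to 1 and can be read as "given the current state, how often did each state come next?"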

Install and load TraMineR

If you have not installed TraMineR, you can get it installed in R using the command below.

install.packages("TraMineR", dependencies = TRUE)

# Load TraMineR
library("TraMineR")


TraMineR stable version 2.2-8 (Built: 2023-09-18)

Website: http://traminer.unige.ch

Please type 'citation("TraMineR")' for citation information.

# Load tidyverse packages
library(tidyverse)

Warning: package 'dplyr' was built under R version 4.2.3

Warning: package 'stringr' was built under R version 4.2.3

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.0
✔ purrr     1.0.2

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Essential terminology in TraMineR

States vs. Events

States: Read, Write, Discuss, Review
Events: Read -> Write, Write -> Discuss (each change of state is an event)
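
As a minimal base R sketch of this distinction, the snippet below derives events from a made-up state sequence by finding the positions where the state changes. The data and variable names are hypothetical.

```r
# One learner's state at successive time points (hypothetical data)
states <- c("Read", "Read", "Write", "Write", "Discuss", "Review", "Review")

prev_state <- states[-length(states)]
next_state <- states[-1]

# An event occurs wherever the state differs from the previous one
changed <- prev_state != next_state
events  <- paste(prev_state[changed], "->", next_state[changed])
events
```

Seven recorded states yield only three events here, because staying in the same state does not count as an event.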

Let’s take a look at an example dataset.

TraMineR has several example datasets, including a dataset named actcal. This dataset contains 2000 individual sequences of monthly activity statuses (states) from January to December 2000.

There are four possible states:

A = Full-time paid job (> 37 hours)
B = Long part-time paid job (19-36 hours)
C = Short part-time paid job (1-18 hours)
D = Unemployed (no work)

?TraMineR::actcal # help info about the dataset

data(actcal) # load the data
glimpse(actcal) # check the data frame's structure

Rows: 2,000
Columns: 24
$ idhous00 <dbl> 60671, 25321, 53221, 13911, 145301, 40022, 61791, 3541, 10662…
$ age00    <int> 47, 21, 50, 37, 20, 27, 72, 34, 39, 36, 54, 26, 40, 36, 58, 3…
$ educat00 <fct> "maturity", "maturity", "full-time vocational school", "matur…
$ civsta00 <fct> "married", "single, never married", "married", "single, never…
$ nbadul00 <int> 3, 2, 2, 1, 3, 2, 1, 2, 2, 1, 3, 2, 3, 2, 2, 2, 2, 2, 2, 1, 2…
$ nbkid00  <int> 2, 0, 0, 0, 0, 0, 0, 1, 2, 0, 0, 0, 0, 1, 0, 1, 0, 2, 2, 0, 0…
$ aoldki00 <int> 17, -3, -3, -3, -3, -3, -3, 1, 8, -3, -3, -3, -3, 4, -3, 4, -…
$ ayouki00 <int> 14, -3, -3, -3, -3, -3, -3, 1, 6, -3, -3, -3, -3, 4, -3, 4, -…
$ region00 <fct> "Middleland (BE, FR, SO, NE, JU)", "Zurich", "Lake Geneva (VD…
$ com2.00  <fct> Industrial and tertiary sector communes, Suburban communes, P…
$ sex      <fct> woman, woman, woman, woman, man, woman, woman, man, man, man,…
$ birthy   <int> 1953, 1979, 1950, 1963, 1980, 1973, 1928, 1966, 1961, 1964, 1…
$ jan00    <fct> B, D, B, C, A, D, D, A, A, A, B, B, A, C, B, C, D, A, B, B, D…
$ feb00    <fct> B, D, B, C, A, B, D, A, A, A, B, B, A, C, B, C, D, A, B, B, D…
$ mar00    <fct> B, D, B, C, A, B, D, A, A, A, B, B, A, C, B, C, D, A, B, B, D…
$ apr00    <fct> B, D, B, C, A, B, D, A, A, A, B, B, A, C, B, C, D, A, B, B, D…
$ may00    <fct> B, A, B, C, A, B, D, A, A, A, B, B, A, C, B, C, D, A, B, B, D…
$ jun00    <fct> B, A, B, C, A, B, D, A, A, A, B, B, A, C, B, C, D, A, B, B, D…
$ jul00    <fct> B, A, B, C, A, B, D, A, A, A, B, B, A, C, B, C, D, A, B, B, D…
$ aug00    <fct> B, A, B, C, A, B, D, A, A, A, B, B, A, C, B, C, D, A, B, B, D…
$ sep00    <fct> B, A, B, C, A, B, D, A, A, A, B, B, A, C, B, C, D, A, B, B, D…
$ oct00    <fct> B, A, B, B, A, B, D, A, A, A, B, B, A, B, B, C, D, A, B, B, D…
$ nov00    <fct> B, A, B, B, A, B, D, A, A, A, B, B, A, B, B, C, D, A, B, B, D…
$ dec00    <fct> B, D, B, B, A, B, D, A, A, A, B, B, A, B, C, C, D, A, B, B, D…

# Create a state sequence object
actcal.seq <- seqdef(actcal, var = 13:24)

 [>] 4 distinct states appear in the data:

     1 = A

     2 = B

     3 = C

     4 = D

 [>] state coding:

       [alphabet]  [label]  [long label]

     1  A           A        A

     2  B           B        B

     3  C           C        C

     4  D           D        D

 [>] 2000 sequences in the data set

 [>] min/max sequence length: 12/12

# Plot the sequences
seqiplot(actcal.seq, with.legend = TRUE)

In the 2nd sequence, two events occurred: D -> A, and A -> D.
In the 4th sequence, one event occurred: C -> B.
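
These counts can also be checked in code. The sketch below assumes the TraMineR function seqtransn(), which reports the number of state changes in each sequence, and rebuilds the sequence object so that the snippet runs on its own.

```r
library(TraMineR)

data(actcal)                               # example data shipped with TraMineR
actcal.seq <- seqdef(actcal, var = 13:24)  # monthly states, jan00 to dec00

# Number of state changes (events) in the 2nd and 4th sequences
n_events <- seqtransn(actcal.seq[c(2, 4), ])
n_events
```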

The alphabet is the set of distinct states (or events) that can appear in the data. In the example above there are four distinct states (A, B, C, and D), so the alphabet has size 4.

alphabet(actcal.seq)

[1] "A" "B" "C" "D"

3. Data Manipulation with Sequences

Data formats

The ‘states-sequence’ (STS) format

Each row is an individual
In this format, the successive states (statuses) of an individual are given in consecutive columns. Each column corresponds to a predetermined time unit (here, one month)

head(actcal.seq)

     Sequence               
2848 B-B-B-B-B-B-B-B-B-B-B-B
1230 D-D-D-D-A-A-A-A-A-A-A-D
2468 B-B-B-B-B-B-B-B-B-B-B-B
654  C-C-C-C-C-C-C-C-C-B-B-B
6946 A-A-A-A-A-A-A-A-A-A-A-A
1872 D-B-B-B-B-B-B-B-B-B-B-B

The ‘state-permanence-sequence’ (SPS) format

Each row is an individual
Each successive distinct state in the sequence is given together with its duration

# print(head(actcal.seq), format='SPS')
actcal.sps <- seqformat(actcal, 13:24, from = "STS", to = "SPS", compress = TRUE)

 [>] converting STS sequences to 2000 SPS sequences

 [>] compressing SPS sequences

head(actcal.sps)

     Sequence           
2848 "(B,12)"           
1230 "(D,4)-(A,7)-(D,1)"
2468 "(B,12)"           
654  "(C,9)-(B,3)"      
6946 "(A,12)"           
1872 "(D,1)-(B,11)"
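
To see what the compression does, the following base R sketch rebuilds the SPS string of sequence 1230 by hand with run-length encoding. It is only an illustration of the idea, not how seqformat() is implemented; the variable names are made up.

```r
# Sequence 1230 in STS form: one state per month
sts <- c(rep("D", 4), rep("A", 7), "D")

# Run-length encoding gives each distinct successive state and its duration
runs <- rle(sts)

# Paste the (state, duration) pairs together in SPS style
sps <- paste0("(", runs$values, ",", runs$lengths, ")", collapse = "-")
sps
```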

The ‘time-stamped-event’ (TSE) format

Each row is an event
Each record of the TSE representation usually contains a case identifier, a time stamp and codes identifying the event occurring

tstate <- seqetm(actcal.seq, method='state')
actcal.tse <- seqformat(actcal, 13:24, from = "STS", to = "TSE", tevent=tstate)

 [!!] 'id' set to NULL as it is not specified (backward compatibility with TraMineR 1.8)

 [!!] replacing original IDs in the output by the sequence indexes

 [>] converting STS sequences to 2579 TSE sequences

head(actcal.tse)

  id time event
1  1    0     B
2  2    0     D
3  2    4     A
4  2   11     D
5  3    0     B
6  4    0     C
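
The rows for id 2 in this output can be reproduced by hand. The base R sketch below assumes, as the output suggests, that time is counted from 0 and that each event is named after the state being entered; it illustrates the format and is not TraMineR's implementation.

```r
# The 2nd sequence in STS form: one state per month
sts <- c(rep("D", 4), rep("A", 7), "D")

runs <- rle(sts)

# Each run starts after all the previous runs have elapsed (counting from 0)
starts <- cumsum(c(0, head(runs$lengths, -1)))

tse <- data.frame(id = 2, time = starts, event = runs$values)
tse
```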

The ‘spell’ (SPELL) format

Each row is a spell, i.e., an uninterrupted stay in one state
Each record of SPELL contains an ID, start time, end time, and the state

actcal.spell <- seqformat(actcal, 13:24, from = "STS", to = "SPELL")

 [>] converting STS sequences to 2579 spells

head(actcal.seq)

     Sequence               
2848 B-B-B-B-B-B-B-B-B-B-B-B
1230 D-D-D-D-A-A-A-A-A-A-A-D
2468 B-B-B-B-B-B-B-B-B-B-B-B
654  C-C-C-C-C-C-C-C-C-B-B-B
6946 A-A-A-A-A-A-A-A-A-A-A-A
1872 D-B-B-B-B-B-B-B-B-B-B-B

head(actcal.spell)

    id begin end states
1 2848     1  12      B
2 1230     1   4      D
3 1230     5  11      A
4 1230    12  12      D
5 2468     1  12      B
6  654     1   9      C
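
The three spells of sequence 1230 can likewise be derived by hand. This is a base R sketch with made-up variable names, meant only to show how begin and end relate to the STS columns.

```r
# Sequence 1230 in STS form: one state per month
sts <- c(rep("D", 4), rep("A", 7), "D")

runs <- rle(sts)

# A spell ends where its run ends and begins right after the previous spell
end   <- cumsum(runs$lengths)
begin <- end - runs$lengths + 1

spell <- data.frame(id = 1230, begin = begin, end = end, states = runs$values)
spell
```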

Converting between formats

seqformat(data, var = ..., from = "...", to = "...", compress = FALSE/TRUE)

Examples:

seqformat(actcal, 13:24, from = "STS", to = "SPELL")
seqformat(actcal, 13:24, from = "STS", to = "SPS", compress = TRUE)

Create a sequence object from data

We can use the seqdef function to create a sequence object from an existing dataframe.

Let’s work with the actcal dataset again. As a reminder, it consists of:

Sequence data, collected monthly for each participant, in the columns “jan00”, “feb00”, “mar00”, “apr00”, “may00”, “jun00”, “jul00”, “aug00”, “sep00”, “oct00”, “nov00”, and “dec00”
Covariates such as age, education, and region

glimpse(actcal)

Rows: 2,000
Columns: 24
$ idhous00 <dbl> 60671, 25321, 53221, 13911, 145301, 40022, 61791, 3541, 10662…
$ age00    <int> 47, 21, 50, 37, 20, 27, 72, 34, 39, 36, 54, 26, 40, 36, 58, 3…
$ educat00 <fct> "maturity", "maturity", "full-time vocational school", "matur…
$ civsta00 <fct> "married", "single, never married", "married", "single, never…
$ nbadul00 <int> 3, 2, 2, 1, 3, 2, 1, 2, 2, 1, 3, 2, 3, 2, 2, 2, 2, 2, 2, 1, 2…
$ nbkid00  <int> 2, 0, 0, 0, 0, 0, 0, 1, 2, 0, 0, 0, 0, 1, 0, 1, 0, 2, 2, 0, 0…
$ aoldki00 <int> 17, -3, -3, -3, -3, -3, -3, 1, 8, -3, -3, -3, -3, 4, -3, 4, -…
$ ayouki00 <int> 14, -3, -3, -3, -3, -3, -3, 1, 6, -3, -3, -3, -3, 4, -3, 4, -…
$ region00 <fct> "Middleland (BE, FR, SO, NE, JU)", "Zurich", "Lake Geneva (VD…
$ com2.00  <fct> Industrial and tertiary sector communes, Suburban communes, P…
$ sex      <fct> woman, woman, woman, woman, man, woman, woman, man, man, man,…
$ birthy   <int> 1953, 1979, 1950, 1963, 1980, 1973, 1928, 1966, 1961, 1964, 1…
$ jan00    <fct> B, D, B, C, A, D, D, A, A, A, B, B, A, C, B, C, D, A, B, B, D…
$ feb00    <fct> B, D, B, C, A, B, D, A, A, A, B, B, A, C, B, C, D, A, B, B, D…
$ mar00    <fct> B, D, B, C, A, B, D, A, A, A, B, B, A, C, B, C, D, A, B, B, D…
$ apr00    <fct> B, D, B, C, A, B, D, A, A, A, B, B, A, C, B, C, D, A, B, B, D…
$ may00    <fct> B, A, B, C, A, B, D, A, A, A, B, B, A, C, B, C, D, A, B, B, D…
$ jun00    <fct> B, A, B, C, A, B, D, A, A, A, B, B, A, C, B, C, D, A, B, B, D…
$ jul00    <fct> B, A, B, C, A, B, D, A, A, A, B, B, A, C, B, C, D, A, B, B, D…
$ aug00    <fct> B, A, B, C, A, B, D, A, A, A, B, B, A, C, B, C, D, A, B, B, D…
$ sep00    <fct> B, A, B, C, A, B, D, A, A, A, B, B, A, C, B, C, D, A, B, B, D…
$ oct00    <fct> B, A, B, B, A, B, D, A, A, A, B, B, A, B, B, C, D, A, B, B, D…
$ nov00    <fct> B, A, B, B, A, B, D, A, A, A, B, B, A, B, B, C, D, A, B, B, D…
$ dec00    <fct> B, D, B, B, A, B, D, A, A, A, B, B, A, B, C, C, D, A, B, B, D…

This time, we pass the sequence columns to seqdef() by name instead of by position.

actcal.seq <- seqdef(actcal, var = c("jan00", "feb00", "mar00",
         "apr00", "may00", "jun00", "jul00", "aug00", "sep00", "oct00",
         "nov00", "dec00"))

 [>] 4 distinct states appear in the data:

     1 = A

     2 = B

     3 = C

     4 = D

 [>] state coding:

       [alphabet]  [label]  [long label]

     1  A           A        A

     2  B           B        B

     3  C           C        C

     4  D           D        D

 [>] 2000 sequences in the data set

 [>] min/max sequence length: 12/12

We can see that there are 2000 sequences, each of length 12, built from 4 distinct states.

This is a relatively small and clean dataset: it has a well-defined time unit (one month), no missing values, and no misalignment in the timing of data collection (everyone started and ended at the same time).

Import a messier log file

Let’s try to process a messier example: the log file level1_logs.csv.

This file contains detailed log data of 184 players playing one level of a game named Baba Is You.

lvl1_logs <- as.data.frame(read_csv("data/level1_logs.csv"))

Rows: 3640 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): states
dbl (3): id, begin, end

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(lvl1_logs, 20)

    id begin end      states
1    1     1   1       start
2    1     5  17   key_press
3    1    18  18 rule_remove
4    1    18  63   key_press
5    1    63  63    rule_add
6    1    63  65   key_press
7    1    66  66         win
8    1    66  66   key_press
9   10     1   1       start
10  10     6   8   key_press
11  10     8   8 rule_remove
12  10     8  21   key_press
13  10    21  21    rule_add
14  10    21  22   key_press
15  10    22  22         win
16  10    22  22   key_press
17 100     1   1       start
18 100     5   9   key_press
19 100    10  10        idle
20 100    11  12   key_press

The original dataset is even messier. Below is the original log file before data transformation.

Rows: 37,484
Columns: 32
$ index              <chr> "161", "255", "258", "261", "264", "266", "267", "2…
$ ID                 <chr> "100", "100", "100", "100", "100", "100", "100", "1…
$ Session_timestamp  <chr> "0:1:10:566", "0:1:15:666", "0:1:15:833", "0:1:15:9…
$ Level_timestamp    <chr> "0:0:0:000", "0:0:5:100", "0:0:5:266", "0:0:5:366",…
$ Event_type         <chr> "event", "input", "input", "input", "input", "input…
$ Event              <chr> "start", "right", "left", "right", "up", "up", "up"…
$ Level              <chr> "1level:where do i go?", "1level:where do i go?", "…
$ Details            <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ filename           <chr> "100_0.txt", "100_0.txt", "100_0.txt", "100_0.txt",…
$ epoch_time         <chr> "1643220242566.0", "1643220247666.0", "164322024783…
$ object_orientation <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ x_coordinate       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ y_coordinate       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ objects            <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ Session_number     <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ Level_progression  <chr> "2.0", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ Ruleset            <chr> NA, "['baba is you ', 'wall is stop ']", "['baba is…
$ Levels_complete    <chr> "['0level:baba is you']", "['0level:baba is you']",…
$ LevelTime          <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ SessionTime        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ drop               <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ rule_remove_1      <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ retrap             <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ pure_hit           <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ buffer_hit         <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ Event2             <chr> "start", "key_press", "key_press", "key_press", "ke…
$ date_time          <dttm> 2022-01-26 13:04:02, 2022-01-26 13:04:07, 2022-01-…
$ session_time       <dbl> 71, 76, 76, 76, 80, 80, 80, 80, 81, 82, 82, 82, 83,…
$ level_time         <dbl> 0, 5, 5, 5, 9, 9, 9, 9, 11, 11, 12, 12, 12, 12, 13,…
$ session_time_start <dbl> 71, 71, 71, 71, 71, 71, 71, 71, 71, 71, 71, 71, 71,…
$ level_time2        <dbl> 1, 5, 5, 5, 9, 9, 9, 9, 10, 11, 11, 11, 12, 12, 12,…
$ segment_id         <int> 1, 2, 2, 2, 2, 2, 2, 2, 3, 4, 4, 4, 4, 4, 4, 4, 5, …

The cleaned dataset has the following variables/columns:

id refers to anonymized player ID
begin refers to the time a state begins within the session, in seconds
end refers to the time a state ends within the session, in seconds
states refers to a particular state, including start, key_press, rule_remove, restart, win, etc.

Now we need to make some decisions:

What is a sequence in our data?
- Each sequence could be a player
- Each sequence could be an attempt (a player could have attempted this level multiple times)
What is the time unit in our data?
- Every second?
- Every minute?
- Every 5 mins?
- Every hour?
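
The choice of time unit matters because coarser units merge short spells with their neighbours. The base R sketch below re-expresses the first three spells of player 1 in 5-second units; the unit size and the new column names are hypothetical choices for illustration.

```r
# First three spells of player 1, with begin and end in seconds
spells <- data.frame(
  id     = c(1, 1, 1),
  begin  = c(1, 5, 18),
  end    = c(1, 17, 18),
  states = c("start", "key_press", "rule_remove")
)

unit <- 5  # seconds per time unit (an arbitrary choice for this sketch)

# Map second s to its unit: seconds 1-5 -> 1, seconds 6-10 -> 2, and so on
spells$begin_unit <- (spells$begin - 1) %/% unit + 1
spells$end_unit   <- (spells$end - 1) %/% unit + 1
spells
```

With 5-second units, start and key_press both begin in unit 1, and rule_remove falls in the unit where key_press ends, so one of the overlapping states must be dropped or the states must be combined. One-second units avoid this at the cost of much longer sequences.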

The good news is that the data file is already in the SPELL format, which can be imported directly into TraMineR.

Create sequence data from SPELL format

We can use the seqdef() function to convert the SPELL data into STS format, which will be used for subsequent analyses.

Note: For some reason, seqdef() does not play well with tibbles or any format other than data.frame, so make sure to convert your data using the as.data.frame() function.

log_spell <- as.data.frame(lvl1_logs)
log_sts.seq <- seqdef(log_spell, var = c(id="id", begin="begin", end="end", status="states"), 
                   informat = "SPELL",  process = FALSE, overwrite = FALSE)

 [>] time axis: 1 -> 3616

 [>] converting SPELL data into 184 STS sequences (internal format)

 [>] found missing values ('NA') in sequence data

 [>] preparing 184 sequences

 [>] coding void elements with '%' and missing values with '*'

 [>] 9 distinct states appear in the data:

     1 = end

     2 = idle

     3 = key_press

     4 = restart

     5 = rule_add

     6 = rule_remove

     7 = start

     8 = undo

     9 = win

 [>] state coding:

       [alphabet]  [label]     [long label]

     1  end         end         end

     2  idle        idle        idle

     3  key_press   key_press   key_press

     4  restart     restart     restart

     5  rule_add    rule_add    rule_add

     6  rule_remove rule_remove rule_remove

     7  start       start       start

     8  undo        undo        undo

     9  win         win         win

 [>] 184 sequences in the data set

 [>] min/max sequence length: 12/3616

What can we observe from the output above?

There were 184 sequences in the data set
min/max sequence length: 12/3616
9 distinct states
coding void elements with ‘%’ and missing values with ‘*’

Print out the first 6 sequences:

head(log_sts.seq)

    Sequence                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         
1   start-*-*-*-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-rule_remove-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-win                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              
10  start-*-*-*-*-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          
100 start-*-*-*-key_press-key_press-key_press-key_press-key_press-idle-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-*-*-*-undo-key_press-key_press-key_press-key_press-undo-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-undo-undo-undo-undo-undo-undo-undo-undo-undo-undo-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-undo-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-*-*-*-*-*-*-*-*-*-*-*-*-*-key_press-key_press-key_press-undo-*-key_press-undo-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-undo-undo-undo-undo-undo-undo-undo-undo-undo-undo-undo-undo-undo-undo-undo-undo-undo-undo-undo-undo-*-*-*-key_press-key_press-key_press-rule_remove-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-win
101 start-*-*-*-*-*-*-*-*-*-key_press-key_press-key_press-key_press-key_press-idle-*-*-*-*-*-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-win                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     
102 start-*-*-*-*-*-*-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-rule_add-key_press-win                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             
103 start-*-*-*-*-*-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-rule_remove-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-win-*-*-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press-key_press

We can also print these sequences in the SPS format, which is easier to read:

print(log_sts.seq[1:6, ], format="SPS")

    Sequence                                                                                                                                                                                                                                                                                                  
1   (start,1)-(*,3)-(key_press,13)-(rule_remove,1)-(key_press,47)-(win,1)                                                                                                                                                                                                                                     
10  (start,1)-(*,4)-(key_press,17)                                                                                                                                                                                                                                                                            
100 (start,1)-(*,3)-(key_press,5)-(idle,1)-(key_press,30)-(*,3)-(undo,1)-(key_press,4)-(undo,1)-(key_press,12)-(undo,10)-(key_press,24)-(undo,1)-(key_press,23)-(*,13)-(key_press,3)-(undo,1)-(*,1)-(key_press,1)-(undo,1)-(key_press,44)-(undo,20)-(*,3)-(key_press,3)-(rule_remove,1)-(key_press,12)-(win,1)
101 (start,1)-(*,9)-(key_press,5)-(idle,1)-(*,5)-(key_press,24)-(win,1)                                                                                                                                                                                                                                       
102 (start,1)-(*,6)-(key_press,38)-(rule_add,1)-(key_press,1)-(win,1)                                                                                                                                                                                                                                         
103 (start,1)-(*,5)-(key_press,8)-(rule_remove,1)-(key_press,27)-(win,1)-(*,2)-(key_press,32)

Let’s plot the first 10 sequences (first 100 time units only) using the seqiplot() function.

seqiplot(log_sts.seq[1:10, 1:100], with.legend = TRUE, main = "Index plot (10 first sequences)")

As a final note, there is no single way to pre-process your data. In practice, there are a few ‘knobs’ that you can adjust:

What is a state? What is an event?
What is a sequence in your study? (e.g., each person is a sequence, or each learning session, however you define a learning session)
What is the time granularity of your sequence? (e.g., every second, every minute, every 5 minutes)
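
To illustrate the last knob, here is a minimal base R sketch that coarsens a per-time-unit state vector into windows of 5 time units by keeping the most frequent state in each window. The vector `states`, the helper `coarsen_states()` and the window width are hypothetical and not part of the module's dataset; ties are broken alphabetically, which is an arbitrary choice.

```r
# Hypothetical states for one learner, one entry per time unit
states <- c("start", "key_press", "key_press", "idle", "key_press",
            "undo", "undo", "undo", "key_press", "win")

# Keep the modal (most frequent) state within each window of `width` time units
coarsen_states <- function(x, width) {
  window_id <- ceiling(seq_along(x) / width)
  modal <- vapply(split(x, window_id),
                  function(w) names(which.max(table(w))),
                  character(1))
  unname(modal)
}

coarse <- coarsen_states(states, width = 5)
print(coarse)
```

Other aggregation rules (first state, last state, or a priority order among states) are equally defensible; which one fits depends on what a state means in your study.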

(Bonus) Truncations, gaps and missing values

Sequences defined as the list of successive states without duration information are typically of varying length.
In event sequences, the number of events experienced by each individual differs from one individual to the other.
The length of the follow up is not the same for all individuals or sequences may be right or left censored.
Sequences may not be left aligned depending on the time axis on which they are defined.
Data may not be available for all measuring points yielding internal gaps in the sequences.

Let’s simulate an example in which sequences are not aligned. Here, three sequences (s1, s2, s3) have different start and end dates, and s2 also has a gap in 1993. If the respondents entered the study at different points in time and we represent the data on a calendar time axis, the data could look like this:

s1 <- c("a","b","c","d",NA,NA)
s2 <- c(NA,"a","b",NA,"c","d")
s3 <- c(NA,NA, "a","b","c","d")

df <- data.frame(rbind(s1,s2,s3))
colnames(df) <- c(1990:1995)
df

   1990 1991 1992 1993 1994 1995
s1    a    b    c    d <NA> <NA>
s2 <NA>    a    b <NA>    c    d
s3 <NA> <NA>    a    b    c    d

Let’s create a sequence object from df. The default values of the seqdef() function are left=NA, gaps=NA and right="DEL". We will see what these options mean in a moment.

seqdef(df)

 [>] found missing values ('NA') in sequence data

 [>] preparing 3 sequences

 [>] coding void elements with '%' and missing values with '*'

 [>] 4 distinct states appear in the data:

     1 = a

     2 = b

     3 = c

     4 = d

 [>] state coding:

       [alphabet]  [label]  [long label]

     1  a           a        a

     2  b           b        b

     3  c           c        c

     4  d           d        d

 [>] 3 sequences in the data set

 [>] min/max sequence length: 4/6

   Sequence   
s1 a-b-c-d    
s2 *-a-b-*-c-d
s3 *-*-a-b-c-d

Sequences s2 and s3 do not begin in the first column of the matrix, and their leading missing values have been kept as part of the sequences. In this case it may be more appropriate to represent the data on a process time axis where all sequences are left aligned, meaning that their common origin is not a specific year but the beginning of the observed 4-year duration.

To achieve this, we can use the left="DEL" option.

seqdef(df,left="DEL")

 [>] found missing values ('NA') in sequence data

 [>] preparing 3 sequences

 [>] coding void elements with '%' and missing values with '*'

 [>] 4 distinct states appear in the data:

     1 = a

     2 = b

     3 = c

     4 = d

 [>] state coding:

       [alphabet]  [label]  [long label]

     1  a           a        a

     2  b           b        b

     3  c           c        c

     4  d           d        d

 [>] 3 sequences in the data set

 [>] min/max sequence length: 4/5

   Sequence 
s1 a-b-c-d  
s2 a-b-*-c-d
s3 a-b-c-d

Now all 3 sequences are left aligned, but sequence s2 still has a gap: each missing value inside a sequence is kept as an explicit missing element. We can also delete the missing values in the middle of a sequence by setting gaps="DEL".

seqdef(df,left="DEL", gaps="DEL")

 [>] found missing values ('NA') in sequence data

 [>] preparing 3 sequences

 [>] coding void elements with '%' and missing values with '*'

 [>] 4 distinct states appear in the data:

     1 = a

     2 = b

     3 = c

     4 = d

 [>] state coding:

       [alphabet]  [label]  [long label]

     1  a           a        a

     2  b           b        b

     3  c           c        c

     4  d           d        d

 [>] 3 sequences in the data set

 [>] min/max sequence length: 4/4

   Sequence
s1 a-b-c-d 
s2 a-b-c-d 
s3 a-b-c-d
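
Conceptually, combining left="DEL" and gaps="DEL" with the default right="DEL" amounts to dropping every missing value and shifting the remaining states to the left. The base R sketch below mimics that outcome on the same toy data. It only illustrates the idea and is not how seqdef() is implemented; `drop_missing()` is a hypothetical helper.

```r
s1 <- c("a", "b", "c", "d", NA, NA)
s2 <- c(NA, "a", "b", NA, "c", "d")
s3 <- c(NA, NA, "a", "b", "c", "d")

# Remove all missing values; what remains is implicitly left aligned
drop_missing <- function(x) x[!is.na(x)]

aligned <- lapply(list(s1 = s1, s2 = s2, s3 = s3), drop_missing)
collapsed <- sapply(aligned, paste, collapse = "-")
print(collapsed)
```

Deleting gaps discards the information that time passed between two states, so it is only appropriate when that elapsed time is irrelevant to the question at hand.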

In the sample dataset, we could likewise delete the gaps if pauses by the players are not important.

log_sts_nogap.seq <- seqdef(log_spell, var = c(id="id", begin="begin", end="end", status="states"), 
                   informat = "SPELL",  process = FALSE, overwrite = FALSE, gaps="DEL")

 [>] time axis: 1 -> 3616

 [>] converting SPELL data into 184 STS sequences (internal format)

 [>] found missing values ('NA') in sequence data

 [>] preparing 184 sequences

 [>] coding void elements with '%' and missing values with '*'

 [>] 9 distinct states appear in the data:

     1 = end

     2 = idle

     3 = key_press

     4 = restart

     5 = rule_add

     6 = rule_remove

     7 = start

     8 = undo

     9 = win

 [>] state coding:

       [alphabet]  [label]     [long label]

     1  end         end         end

     2  idle        idle        idle

     3  key_press   key_press   key_press

     4  restart     restart     restart

     5  rule_add    rule_add    rule_add

     6  rule_remove rule_remove rule_remove

     7  start       start       start

     8  undo        undo        undo

     9  win         win         win

 [>] 184 sequences in the data set

 [>] min/max sequence length: 12/728

Now the missing values are gone:

print(log_sts_nogap.seq[1:6, ], format="SPS")

    Sequence                                                                                                                                                                                                                                                     
1   (start,1)-(key_press,13)-(rule_remove,1)-(key_press,47)-(win,1)                                                                                                                                                                                              
10  (start,1)-(key_press,17)                                                                                                                                                                                                                                     
100 (start,1)-(key_press,5)-(idle,1)-(key_press,30)-(undo,1)-(key_press,4)-(undo,1)-(key_press,12)-(undo,10)-(key_press,24)-(undo,1)-(key_press,26)-(undo,1)-(key_press,1)-(undo,1)-(key_press,44)-(undo,20)-(key_press,3)-(rule_remove,1)-(key_press,12)-(win,1)
101 (start,1)-(key_press,5)-(idle,1)-(key_press,24)-(win,1)                                                                                                                                                                                                      
102 (start,1)-(key_press,38)-(rule_add,1)-(key_press,1)-(win,1)                                                                                                                                                                                                  
103 (start,1)-(key_press,8)-(rule_remove,1)-(key_press,27)-(win,1)-(key_press,32)

3. Exploratory Analysis of Sequences

Sequence index plot

We can easily plot the first 20 sequences using the seqiplot() function.

seqiplot(data, main = "Index plot (first 20 sequences)", idxs = 1:20)

# Plot data during the first 100 seconds only

seqiplot(log_sts.seq[,1:100], main = "Index plot (first 20 sequences)", idxs = 1:20)

State distribution plot

The seqdplot() function plots the state distribution at each time point.

seqdplot(data, main = "State distribution plot")

seqplot(log_sts.seq[,1:100], type="d", main = "State distribution plot", with.legend = TRUE)

Besides plotting the distribution of the states at each time point, you may want the underlying numbers. The seqstatd() function returns the table of state distributions together with the number of valid states and an entropy measure for each time unit. We will explain what entropy means later.

Here I return the table of state distributions for the first 5 time units as an example:

print(seqstatd(log_sts.seq[,1:5]))

                      [State frequencies]
            y1   y2   y3   y4    y5
end          0 0.00 0.00 0.00 0.000
idle         0 0.00 0.14 0.15 0.034
key_press    0 0.67 0.71 0.85 0.914
restart      0 0.00 0.00 0.00 0.000
rule_add     0 0.00 0.00 0.00 0.000
rule_remove  0 0.00 0.14 0.00 0.052
start        1 0.33 0.00 0.00 0.000
undo         0 0.00 0.00 0.00 0.000
win          0 0.00 0.00 0.00 0.000

                       [Valid states]
             y1 y2 y3 y4 y5
N           184  3  7 20 58

                       [Entropy index]
            y1   y2   y3   y4   y5
H            0 0.29 0.36 0.19 0.16
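
The entropy row can be checked by hand for a single time point. The sketch below assumes that the index is the Shannon entropy of the state distribution (natural logarithm) divided by its maximum, the log of the alphabet size (9 here). Under that assumption it recovers the 0.29 shown for y2, where 2 of the 3 valid states are key_press and 1 is start. `cross_entropy()` is a hypothetical helper, not a TraMineR function.

```r
# Normalized entropy of a state distribution given state counts
cross_entropy <- function(counts, alphabet_size) {
  p <- counts / sum(counts)
  p <- p[p > 0]                      # 0 * log(0) is treated as 0
  -sum(p * log(p)) / log(alphabet_size)
}

# Time point y2 above: 3 valid states, 2 key_press and 1 start
h_y2 <- cross_entropy(c(key_press = 2, start = 1), alphabet_size = 9)
print(round(h_y2, 2))
```

The agreement with the printed value supports the assumption about normalization but does not prove it; the seqstatd() help page is the authoritative source.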

Sequence frequency plot

The seqfplot() function plots the most frequent sequences. By default, the 10 most frequent sequences are plotted; you can adjust this with the idxs option. The sequences are ordered by decreasing frequency from the bottom up, and the bar widths are set proportional to the sequence frequency (pbarw = TRUE).

seqfplot(data, main = "Sequence frequency plot", idxs = 1:10)

seqplot(log_sts.seq[,1:100], type="f", main = "Sequence frequency plot", pbarw = TRUE)

You can also return the frequency table using the seqtab() function.

print(seqtab(log_sts.seq[,1:100]))

                                                                                               Freq
start/1                                                                                           2
start/1-*/4-key_press/17                                                                          2
start/1-*/4-key_press/20                                                                          2
start/1-*/1-idle/1-*/1-key_press/18-rule_add/1-*/1-win/1                                          1
start/1-*/1-key_press/1-*/1-rule_remove/1-key_press/95                                            1
start/1-*/1-key_press/12-*/3-undo/1-key_press/10                                                  1
start/1-*/1-key_press/2-rule_remove/1-key_press/10                                                1
start/1-*/10-end/1-start/1-*/27-key_press/24-*/6-end/1-*/1-key_press/25-rule_add/1-key_press/2    1
start/1-*/10-key_press/1-rule_remove/1-key_press/24-*/3-undo/1-key_press/2-win/1                  1
start/1-*/10-key_press/12-*/4-end/1-start/1-key_press/17-rule_add/1-win/1                         1
                                                                                               Percent
start/1                                                                                           1.09
start/1-*/4-key_press/17                                                                          1.09
start/1-*/4-key_press/20                                                                          1.09
start/1-*/1-idle/1-*/1-key_press/18-rule_add/1-*/1-win/1                                          0.54
start/1-*/1-key_press/1-*/1-rule_remove/1-key_press/95                                            0.54
start/1-*/1-key_press/12-*/3-undo/1-key_press/10                                                  0.54
start/1-*/1-key_press/2-rule_remove/1-key_press/10                                                0.54
start/1-*/10-end/1-start/1-*/27-key_press/24-*/6-end/1-*/1-key_press/25-rule_add/1-key_press/2    0.54
start/1-*/10-key_press/1-rule_remove/1-key_press/24-*/3-undo/1-key_press/2-win/1                  0.54
start/1-*/10-key_press/12-*/4-end/1-start/1-key_press/17-rule_add/1-win/1                         0.54

Transition rates

The seqtrate() function computes the transition rates between states or events. The outcome is a matrix in which each row gives the distribution of transitions from the originating state (or event) at time t to the states at time t + 1. The figures in each row sum to one, except for a state such as win below, from which no transition to a valid state is observed.

# ?seqtrate

seqtrate(log_sts.seq[,1:100])

 [>] computing transition probabilities for states end/idle/key_press/restart/rule_add/rule_remove/start/undo/win ...

                 [-> end]   [-> idle] [-> key_press] [-> restart] [-> rule_add]
[end ->]                0 0.000000000     0.08333333 0.0000000000   0.000000000
[idle ->]               0 0.239130435     0.73913043 0.0000000000   0.000000000
[key_press ->]          0 0.003524022     0.96558205 0.0007048044   0.006695642
[restart ->]            0 0.000000000     0.25000000 0.0000000000   0.000000000
[rule_add ->]           0 0.000000000     0.86363636 0.0000000000   0.000000000
[rule_remove ->]        0 0.000000000     0.98709677 0.0000000000   0.000000000
[start ->]              0 0.000000000     0.96428571 0.0000000000   0.000000000
[undo ->]               0 0.000000000     0.41891892 0.0000000000   0.000000000
[win ->]                0 0.000000000     0.00000000 0.0000000000   0.000000000
                 [-> rule_remove]  [-> start]   [-> undo]    [-> win]
[end ->]               0.00000000 0.916666667 0.000000000 0.000000000
[idle ->]              0.00000000 0.000000000 0.021739130 0.000000000
[key_press ->]         0.01597557 0.000587337 0.001762011 0.005168566
[restart ->]           0.00000000 0.750000000 0.000000000 0.000000000
[rule_add ->]          0.06818182 0.000000000 0.000000000 0.068181818
[rule_remove ->]       0.00000000 0.006451613 0.006451613 0.000000000
[start ->]             0.00000000 0.035714286 0.000000000 0.000000000
[undo ->]              0.00000000 0.000000000 0.581081081 0.000000000
[win ->]               0.00000000 0.000000000 0.000000000 0.000000000
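
To make the computation concrete, here is a minimal base R sketch that derives a transition rate matrix from two toy sequences by counting pairs of consecutive states and normalizing each row. The sequences are hypothetical, and the sketch ignores missing values and weights, which seqtrate() handles through its own options.

```r
# Hypothetical toy sequences (one character vector per learner)
seqs <- list(
  c("start", "key_press", "key_press", "undo", "key_press", "win"),
  c("start", "key_press", "undo", "undo", "key_press", "win")
)

# Pair each state at t with the state at t + 1, within each sequence
from <- unlist(lapply(seqs, function(x) head(x, -1)))
to   <- unlist(lapply(seqs, function(x) tail(x, -1)))

states <- sort(unique(c(from, to)))
counts <- table(factor(from, levels = states), factor(to, levels = states))

# Row-wise proportions; rows with no outgoing transition are set to 0
rates <- prop.table(counts, margin = 1)
rates[is.nan(rates)] <- 0
print(round(rates, 2))
```

In this toy example win never precedes another state, so its row is all zeros, mirroring the win row in the output above.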

Average time spent in each state

The seqmtplot() function visualizes the mean time spent in each state. You can also create multiple plots by group using the group option.

seqplot(log_sts.seq[,1:100], type = "mt", group = NULL, main = "Mean time")

Descriptive statistics of sequences

Sequence length using the `seqlength()` function

hist(seqlength(log_sts.seq), main = 'Histogram of sequence length', xlab="Sequence length in minutes")

Distinct states sequence (DSS) using the `seqdss()` function

seqdss(log_sts.seq[31:41,1:100])

    Sequence                                                                                                                                                                                      
144 start-key_press-undo-key_press                                                                                                                                                                
145 start-key_press                                                                                                                                                                               
147 start-key_press                                                                                                                                                                               
149 start-key_press                                                                                                                                                                               
15  start-key_press-end-start-key_press-rule_remove-key_press                                                                                                                                     
150 start-key_press                                                                                                                                                                               
151 start-key_press-rule_remove-key_press-restart-start-key_press-rule_remove-key_press-restart-start-key_press-rule_remove-key_press-idle-key_press-restart-start-key_press-rule_remove-key_press
152 start-key_press-rule_add                                                                                                                                                                      
153 start-key_press-rule_remove-key_press-rule_remove-key_press-restart-start-key_press-rule_add-key_press                                                                                        
154 start-key_press-rule_remove-key_press-rule_add-key_press                                                                                                                                      
155 start-key_press-rule_remove-key_press-rule_remove-key_press-rule_add-key_press-rule_remove-key_press-rule_remove-key_press-start-key_press
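
The idea behind the distinct state sequence can be sketched in base R with rle(), which collapses runs of identical consecutive values. The vector `x` below is a hypothetical sequence, and the sketch ignores missing values; it shows the concept rather than how seqdss() works internally.

```r
# Hypothetical states for one sequence, one entry per time unit
x <- c("start", "key_press", "key_press", "key_press", "undo", "undo", "key_press")

runs <- rle(x)
dss <- runs$values          # distinct successive states
durations <- runs$lengths   # time units spent in each spell

dss_string <- paste(dss, collapse = "-")
sps_string <- paste0("(", dss, ",", durations, ")", collapse = "-")
n_changes <- length(dss) - 1   # number of state changes

print(dss_string)
print(sps_string)
print(n_changes)
```

Pairing each distinct state with its run length gives the same information as the SPS format printed earlier.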

Shannon’s entropy

Shannon’s entropy is calculated as:

\[h = -\sum_{i=1}^{s} \pi_i \log \pi_i\]

where s is the size of the alphabet and \(\pi_i\) is the proportion of occurrences of the ith state in the considered sequence.

The entropy can be interpreted as the ‘uncertainty’ of predicting the states in a given sequence. If all states in the sequence are the same, the entropy is equal to 0. It reaches its maximum when all states of the alphabet occur equally often in the sequence.
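
As a worked example with made-up numbers, take a sequence of length 10 that spends 1 time unit in start, 8 in key_press and 1 in win, so that the proportions are 0.1, 0.8 and 0.1. Using the natural logarithm:

\[h = -(0.1 \log 0.1 + 0.8 \log 0.8 + 0.1 \log 0.1) \approx 0.230 + 0.179 + 0.230 = 0.639\]

For an alphabet of s = 9 states the largest possible value is \(\log 9 \approx 2.197\), so this sequence sits at roughly 0.639 / 2.197 ≈ 0.29 of the maximum.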

The seqient() function returns by default a normalized Shannon’s entropy.

seqplot(log_sts.seq[31:41,1:100], type = "i")

seqient(log_sts.seq[31:41,1:100], norm = TRUE)

       Entropy
144 0.15683342
145 0.09034818
147 0.05026935
149 0.08415498
15  0.12128234
150 0.04392312
151 0.25890388
152 0.09849430
153 0.25107006
154 0.12923190
155 0.15301814

Let’s plot a histogram of entropy for the whole dataset

hist(seqient(log_sts.seq), main = 'Histogram of sequence entropy', xlab="Normalized Shannon's entropy")

Let us look at sequences with low, medium and high entropy. To do so, we group the sequences into those with an entropy at or below the 55th percentile, between the 55th and 90th percentiles, and above the 90th percentile.

ient.quant <- quantile(seqient(log_sts.seq[,1:100]), c(0, 0.55, 0.9, 1))
ient.quant

       0%       55%       90%      100% 
0.0000000 0.1568334 0.2563379 0.3670561

ient.group <- cut(seqient(log_sts.seq[,1:100]), ient.quant, labels = c("55th or lower", "55th-90th", "above 90th"), include.lowest = TRUE)
ient.group <- factor(ient.group, levels = c("55th or lower", "55th-90th", "above 90th"))
table(ient.group)

ient.group
55th or lower     55th-90th    above 90th 
          103            62            19

options(repr.plot.width=15, repr.plot.height=8)
seqfplot(log_sts.seq[,1:100], group = ient.group, pbarw = TRUE)

Turbulence

Turbulence depends on the length of the sequence. It is based on the number \(\phi(x)\) of distinct subsequences that can be extracted from the distinct state sequence and on the variance of the consecutive times \(t_i\) spent in the distinct states.

seqiplot(log_sts.seq[31:41,1:100])

cbind(seqST(log_sts.seq[31:41,1:100]),seqient(log_sts.seq[31:41,1:100]))

    Turbulence    Entropy
144   5.327362 0.15683342
145   2.000000 0.09034818
147   2.000000 0.05026935
149   2.000000 0.08415498
15    7.853472 0.12128234
150   2.000000 0.04392312
151  21.751442 0.25890388
152   3.000000 0.09849430
153  10.352264 0.25107006
154   5.737932 0.12923190
155  14.306251 0.15301814
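
One ingredient of turbulence, the number of distinct subsequences of the distinct state sequence, can be counted with a short dynamic program. The sketch below illustrates that count only: it includes the empty subsequence, ignores the duration variance term, and `count_distinct_subseq()` is a hypothetical helper rather than a TraMineR function.

```r
# Count distinct subsequences (including the empty one) of a state vector
count_distinct_subseq <- function(dss) {
  total <- 1                  # the empty subsequence
  before_last <- numeric(0)   # count recorded just before each state's latest occurrence
  for (s in dss) {
    previous <- total
    repeated <- if (s %in% names(before_last)) before_last[[s]] else 0
    total <- 2 * total - repeated   # extend every subsequence, minus duplicates
    before_last[s] <- previous
  }
  total
}

phi_two   <- count_distinct_subseq(c("start", "key_press"))
phi_three <- count_distinct_subseq(c("start", "key_press", "rule_add"))
print(c(phi_two, phi_three))
```

For a two-state sequence such as start-key_press the count is 4, and log2(4) = 2 is consistent with the turbulence of 2 reported above for sequences 145, 147, 149 and 150. Likewise, the count of 8 for start-key_press-rule_add is consistent with the value of 3 for sequence 152.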

Finally, we can use a single function, seqindic(), to produce multiple descriptive statistics of the sequences. Here are a few of the available indicators:

“lgth” (sequence length)
“trans” (number of state changes)
“entr” (longitudinal normalized entropy)
“cplx” (complexity index)
“turb” (turbulence) or “turbn” (normalized turbulence)

# ?seqindic

seqindic(log_sts.seq[31:41,1:100], indic=c("lgth","trans","entr","turbn","cplx"))

    Lgth Trans       Entr       Cplx      Turbn
144   28     3 0.15683342 0.13200733 0.04376791
145   33     1 0.09034818 0.05313549 0.01011423
147   51     1 0.05026935 0.03170784 0.01011423
149   27     1 0.08415498 0.05689227 0.01011423
15    82     6 0.12128234 0.09478332 0.06931756
150   64     1 0.04392312 0.02640440 0.01011423
151  100    20 0.25890388 0.22870027 0.20988475
152   51     2 0.09849430 0.06276760 0.02022845
153   59    10 0.25107006 0.20805755 0.09459091
154   62     5 0.12923190 0.10292123 0.04792051
155  100    13 0.15301814 0.14175081 0.13458242

Footnotes

Some content in this document was adapted from materials originally created by Dr. Quan Nguyen from the University of British Columbia.