Readings for this section
https://www.tidytextmining.com/tidytext
Expectations for this section
A working knowledge of R.
The process of deriving information from text that can be used to examine research questions
Also called
Texts here can refer to
Text analytics involves
Language features commonly assessed in NLP include
The tidyverse is a collection of R packages designed for data science.
What are the underlying principles?
Use function names that are descriptive and explicit over those that are short and implicit
Focus on verbs (e.g., fit, arrange, etc.) for general methods.
Verb-noun pairs are particularly effective in naming; consider invert_matrix() as a hypothetical function name
Names should be as self-documenting as possible.
Functions should avoid returning a novel data structure. If the results fit an existing data structure, that structure should be used.
New columns can be added to the results without affecting the class of the data.
This may come at some cost in computational performance.
The data frame is the preferred data structure in tidyverse and tidymodels packages
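The naming principle above can be illustrated with a short sketch. `invert_matrix()` is the hypothetical name mentioned earlier, here wrapped around base R's `solve()`:

```r
# Hypothetical verb-noun function name: descriptive and self-documenting,
# unlike a terse alternative such as inv() or im().
invert_matrix <- function(x) {
  stopifnot(is.matrix(x), nrow(x) == ncol(x)) # fail early on non-square input
  solve(x) # with a single argument, solve() returns the matrix inverse
}

m <- matrix(c(2, 0, 0, 2), nrow = 2)
invert_matrix(m) # a 2 x 2 diagonal matrix with 0.5 on the diagonal
```

Note that the function also returns a plain matrix rather than a novel structure, consistent with the data structure principle above.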
We will be using the tidytext framework
Tidytext uses tidy data principles to make text analytics tasks easier, more effective, and consistent with commonly used tools in R.
Tidytext represents text as tidy data using data frame structures.
The tidy text format is a table with one token per row.
There are other R packages that do more advanced text analytics
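As a minimal sketch of the one-token-per-row format (the two example posts below are invented for illustration):

```r
library(dplyr)
library(tidytext)

# A toy corpus: one row per post (the text values are made up)
toy <- tibble(post = 1:2,
              text = c("Tidy data principles", "One token per row"))

# unnest_tokens() lowercases, strips punctuation, and gives one word per row
toy_words <- toy %>%
  unnest_tokens(word, text)
toy_words # 7 rows; the post id is repeated for each of its words
```

This is the same call we will use on the MOOC forum posts below, where the other columns (clickstream variables, completion, etc.) are carried along in the same way as `post` here.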
The tidytext pipeline
We are going to look at forum posts from a massive open online course (MOOC) offered by Ryan Baker in 2013 on Coursera.
For more information, see
Crossley, S. A., Paquette, L., Dascalu, M., McNamara, D., & Baker, R. (2016). Combining Click-Stream Data with NLP Tools to Better Understand MOOC Completion. In Gasevic, D., & Lynch, G. (eds.). Proceedings of the 6th International Learning Analytics and Knowledge (LAK) Conference. (pp. 6-14). New York, NY: ACM. doi: 10.1145/2883851.2883931
Big Data in Education
Purpose: Enable students to apply a range of educational data mining methods to answer education research questions and drive intervention and improvement in educational software and systems.
Students
Outcome variables
The texts for analysis
Final data differs from published data
The dataframe also includes a variety of clickstream data including
#clear all memory in R (warning: this removes all objects from your R workspace!)
rm(list=ls(all=TRUE))
Ensure that the dataframe is in the same folder as this R code. You can do this by setting the working directory to the source file location.
Call in our dataframe and break up by completion
#if you do not have these libraries, you will need to install them using the commented-out code below
#install.packages("tidyverse")
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.1 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (http://conflicted.r-lib.org/) to force all conflicts to become errors
library(tidytext)
library(stringr) #this is the tidyverse package for working with strings/characters
#get the forum data from the students that completed the MOOC
comp_texts <- read_csv("final_mooc_baker_data.csv")%>%
filter(Completion == 1)
## Rows: 298 Columns: 21
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): Annonized_name, text, avereage_lecture, average_forum_reads
## dbl (17): nw, No._contributions, No_of_new_threads_started, Page_View, Lectu...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
str(comp_texts)
## spc_tbl_ [180 × 21] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Annonized_name : chr [1:180] "Member_365" "Member_304" "Member_187" "Member_272" ...
## $ text : chr [1:180] "Well got into some trouble too. My input data is like this: Any hints.. or someone can put one or two lines wit"| __truncated__ "I am on Linux, but one obvious guess would be to try 64-bit version. It is possible that your Windows is actual"| __truncated__ "Hello Professor Baker, Do you have a bibliography of the literature you reference in the video lectures (e.g. P"| __truncated__ "Hi, to what does \"this detector\" in question 7 refer to? To the detector in the previous question 5 that alwa"| __truncated__ ...
## $ nw : num [1:180] 27 38 51 51 51 51 53 54 55 57 ...
## $ No._contributions : num [1:180] 1 2 2 2 3 1 1 1 1 2 ...
## $ No_of_new_threads_started : num [1:180] 0 0 1 0 2 0 0 0 0 1 ...
## $ Page_View : num [1:180] 513 780 406 457 801 429 860 484 381 452 ...
## $ Lecture_Action : num [1:180] 1689 536 1102 575 4727 ...
## $ Syllabus_Views : num [1:180] 4 7 5 7 5 10 15 8 4 0 ...
## $ averege_assignment_score : num [1:180] 0.917 1 0.937 0.94 0.878 ...
## $ avereage_lecture : chr [1:180] "7.5" "7.25" "9" "6.75" ...
## $ Average_num_quizzes : num [1:180] 4 2.71 5 5 5.5 ...
## $ average_forum_reads : chr [1:180] "9" "20.125" "9.333333333" "9.285714286" ...
## $ num_upvotes_(total) : num [1:180] 0 5 3 0 0 0 37 0 0 1 ...
## $ Forum_reputation : num [1:180] 0 0 1 0 0 0 0 1 1 0 ...
## $ average_video_viewing : num [1:180] 0.962 0.792 0.819 0.986 0.946 ...
## $ Time_before_deadline_for_first_attempt(Average): num [1:180] 1004320 619588 673750 1124493 732558 ...
## $ Time_before_deadline_for_last_attempt(Average) : num [1:180] 992574 618682 673185 854676 520490 ...
## $ average_page_views : num [1:180] 64.1 74.5 43.9 57.1 87.4 ...
## $ average_syllabus_views : num [1:180] 1 1.17 1.67 1 1.67 ...
## $ Completion : num [1:180] 1 1 1 1 1 1 1 1 1 1 ...
## $ Final_Score : num [1:180] 0.97 1 0.985 1 0.953 ...
## - attr(*, "spec")=
## .. cols(
## .. Annonized_name = col_character(),
## .. text = col_character(),
## .. nw = col_double(),
## .. No._contributions = col_double(),
## .. No_of_new_threads_started = col_double(),
## .. Page_View = col_double(),
## .. Lecture_Action = col_double(),
## .. Syllabus_Views = col_double(),
## .. averege_assignment_score = col_double(),
## .. avereage_lecture = col_character(),
## .. Average_num_quizzes = col_double(),
## .. average_forum_reads = col_character(),
## .. `num_upvotes_(total)` = col_double(),
## .. Forum_reputation = col_double(),
## .. average_video_viewing = col_double(),
## .. `Time_before_deadline_for_first_attempt(Average)` = col_double(),
## .. `Time_before_deadline_for_last_attempt(Average)` = col_double(),
## .. average_page_views = col_double(),
## .. average_syllabus_views = col_double(),
## .. Completion = col_double(),
## .. Final_Score = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
head(comp_texts) #look at the head of the dataframe
## # A tibble: 6 × 21
## Annonized_name text nw No._contributions No_of_new_threads_st…¹ Page_View
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Member_365 "Well… 27 1 0 513
## 2 Member_304 "I am… 38 2 0 780
## 3 Member_187 "Hell… 51 2 1 406
## 4 Member_272 "Hi, … 51 2 0 457
## 5 Member_313 "I ca… 51 3 2 801
## 6 Member_550 "Hi--… 51 1 0 429
## # ℹ abbreviated name: ¹​No_of_new_threads_started
## # ℹ 15 more variables: Lecture_Action <dbl>, Syllabus_Views <dbl>,
## # averege_assignment_score <dbl>, avereage_lecture <chr>,
## # Average_num_quizzes <dbl>, average_forum_reads <chr>,
## # `num_upvotes_(total)` <dbl>, Forum_reputation <dbl>,
## # average_video_viewing <dbl>,
## # `Time_before_deadline_for_first_attempt(Average)` <dbl>, …
dim(comp_texts)
## [1] 180 21
#get the forum data from the students that did not complete the MOOC
incomp_texts <- read_csv("final_mooc_baker_data.csv")%>%
filter(Completion == 0)
## Rows: 298 Columns: 21
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): Annonized_name, text, avereage_lecture, average_forum_reads
## dbl (17): nw, No._contributions, No_of_new_threads_started, Page_View, Lectu...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
dim(incomp_texts)
## [1] 118 21
What is Tokenization?
Basically, we want to keep the content words:
- verbs
- nouns
- adjectives
- adverbs
Get one word per row for each text while removing stop words
data(stop_words) #call in stop word list from tidytext (this is included in the package)
stop_words #(what's in there, over a thousand non-semantic words)
## # A tibble: 1,149 × 2
## word lexicon
## <chr> <chr>
## 1 a SMART
## 2 a's SMART
## 3 able SMART
## 4 about SMART
## 5 above SMART
## 6 according SMART
## 7 accordingly SMART
## 8 across SMART
## 9 actually SMART
## 10 after SMART
## # ℹ 1,139 more rows
#tokenize the data
comp_words <- comp_texts %>%
unnest_tokens(word, text) #one-token-per-row format tokenization in a new column (word) taken from text column
# a tibble with over 80,000 rows (one row per token)
str(comp_words)
## spc_tbl_ [82,865 × 21] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Annonized_name : chr [1:82865] "Member_365" "Member_365" "Member_365" "Member_365" ...
## $ nw : num [1:82865] 27 27 27 27 27 27 27 27 27 27 ...
## $ No._contributions : num [1:82865] 1 1 1 1 1 1 1 1 1 1 ...
## $ No_of_new_threads_started : num [1:82865] 0 0 0 0 0 0 0 0 0 0 ...
## $ Page_View : num [1:82865] 513 513 513 513 513 513 513 513 513 513 ...
## $ Lecture_Action : num [1:82865] 1689 1689 1689 1689 1689 ...
## $ Syllabus_Views : num [1:82865] 4 4 4 4 4 4 4 4 4 4 ...
## $ averege_assignment_score : num [1:82865] 0.917 0.917 0.917 0.917 0.917 ...
## $ avereage_lecture : chr [1:82865] "7.5" "7.5" "7.5" "7.5" ...
## $ Average_num_quizzes : num [1:82865] 4 4 4 4 4 4 4 4 4 4 ...
## $ average_forum_reads : chr [1:82865] "9" "9" "9" "9" ...
## $ num_upvotes_(total) : num [1:82865] 0 0 0 0 0 0 0 0 0 0 ...
## $ Forum_reputation : num [1:82865] 0 0 0 0 0 0 0 0 0 0 ...
## $ average_video_viewing : num [1:82865] 0.962 0.962 0.962 0.962 0.962 ...
## $ Time_before_deadline_for_first_attempt(Average): num [1:82865] 1e+06 1e+06 1e+06 1e+06 1e+06 ...
## $ Time_before_deadline_for_last_attempt(Average) : num [1:82865] 992574 992574 992574 992574 992574 ...
## $ average_page_views : num [1:82865] 64.1 64.1 64.1 64.1 64.1 ...
## $ average_syllabus_views : num [1:82865] 1 1 1 1 1 1 1 1 1 1 ...
## $ Completion : num [1:82865] 1 1 1 1 1 1 1 1 1 1 ...
## $ Final_Score : num [1:82865] 0.97 0.97 0.97 0.97 0.97 ...
## $ word : chr [1:82865] "well" "got" "into" "some" ...
## - attr(*, "spec")=
## .. cols(
## .. Annonized_name = col_character(),
## .. text = col_character(),
## .. nw = col_double(),
## .. No._contributions = col_double(),
## .. No_of_new_threads_started = col_double(),
## .. Page_View = col_double(),
## .. Lecture_Action = col_double(),
## .. Syllabus_Views = col_double(),
## .. averege_assignment_score = col_double(),
## .. avereage_lecture = col_character(),
## .. Average_num_quizzes = col_double(),
## .. average_forum_reads = col_character(),
## .. `num_upvotes_(total)` = col_double(),
## .. Forum_reputation = col_double(),
## .. average_video_viewing = col_double(),
## .. `Time_before_deadline_for_first_attempt(Average)` = col_double(),
## .. `Time_before_deadline_for_last_attempt(Average)` = col_double(),
## .. average_page_views = col_double(),
## .. average_syllabus_views = col_double(),
## .. Completion = col_double(),
## .. Final_Score = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
#remove stop words
comp_words <- comp_words %>%
anti_join(stop_words) #remove stop words using anti_join
## Joining with `by = join_by(word)`
# a tibble of around 33,000 words after stop word removal
#what are the word counts?
comp_words_freq <- comp_words %>%
count(word, sort = TRUE)
# a tibble with a count for each unique word (6,194 words), sorted by frequency
# most common word is 'data' which makes sense
head(comp_words_freq)
## # A tibble: 6 × 2
## word n
## <chr> <int>
## 1 data 611
## 2 1 489
## 3 lt 363
## 4 gt 313
## 5 2 279
## 6 question 276
str(comp_words_freq)
## spc_tbl_ [6,194 × 2] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ word: chr [1:6194] "data" "1" "lt" "gt" ...
## $ n : int [1:6194] 611 489 363 313 279 276 231 204 194 171 ...
## - attr(*, "spec")=
## .. cols(
## .. Annonized_name = col_character(),
## .. text = col_character(),
## .. nw = col_double(),
## .. No._contributions = col_double(),
## .. No_of_new_threads_started = col_double(),
## .. Page_View = col_double(),
## .. Lecture_Action = col_double(),
## .. Syllabus_Views = col_double(),
## .. averege_assignment_score = col_double(),
## .. avereage_lecture = col_character(),
## .. Average_num_quizzes = col_double(),
## .. average_forum_reads = col_character(),
## .. `num_upvotes_(total)` = col_double(),
## .. Forum_reputation = col_double(),
## .. average_video_viewing = col_double(),
## .. `Time_before_deadline_for_first_attempt(Average)` = col_double(),
## .. `Time_before_deadline_for_last_attempt(Average)` = col_double(),
## .. average_page_views = col_double(),
## .. average_syllabus_views = col_double(),
## .. Completion = col_double(),
## .. Final_Score = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
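The frequency table above is dominated by "lt", "gt", and bare digits; "lt" and "gt" are most likely residue of HTML-escaped < and > (&lt; and &gt;) in the raw forum posts. A hedged cleanup sketch, using a mock of the head of comp_words_freq (which tokens to drop is a judgment call, not part of the original pipeline):

```r
library(dplyr)
library(stringr)

# Mock of the head of comp_words_freq (counts copied from the output above)
freq <- tibble(word = c("data", "1", "lt", "gt", "2", "question"),
               n    = c(611, 489, 363, 313, 279, 276))

# Drop probable HTML-entity residue and purely numeric tokens
freq_clean <- freq %>%
  filter(!word %in% c("lt", "gt", "amp")) %>% # assumed entity residue
  filter(!str_detect(word, "^\\d+$"))         # bare numbers
freq_clean # only 'data' and 'question' remain from this head
```

The same two filter() steps could be dropped into the pipelines above before count() if you decide these tokens are noise.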
# a more efficient way to code the above (for students who did not complete the MOOC)
incomp_words_freq <- incomp_texts %>%
unnest_tokens(word, text) %>% #one-token-per-row format tokenization
anti_join(stop_words) %>% #remove stop words using anti_join
count(word, sort = TRUE)
## Joining with `by = join_by(word)`
head(incomp_words_freq)
## # A tibble: 6 × 2
## word n
## <chr> <int>
## 1 data 190
## 2 quiz 103
## 3 question 81
## 4 rapidminer 79
## 5 1 77
## 6 model 74
str(incomp_words_freq)
## spc_tbl_ [3,389 × 2] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ word: chr [1:3389] "data" "quiz" "question" "rapidminer" ...
## $ n : int [1:3389] 190 103 81 79 77 74 72 70 69 63 ...
## - attr(*, "spec")=
## .. cols(
## .. Annonized_name = col_character(),
## .. text = col_character(),
## .. nw = col_double(),
## .. No._contributions = col_double(),
## .. No_of_new_threads_started = col_double(),
## .. Page_View = col_double(),
## .. Lecture_Action = col_double(),
## .. Syllabus_Views = col_double(),
## .. averege_assignment_score = col_double(),
## .. avereage_lecture = col_character(),
## .. Average_num_quizzes = col_double(),
## .. average_forum_reads = col_character(),
## .. `num_upvotes_(total)` = col_double(),
## .. Forum_reputation = col_double(),
## .. average_video_viewing = col_double(),
## .. `Time_before_deadline_for_first_attempt(Average)` = col_double(),
## .. `Time_before_deadline_for_last_attempt(Average)` = col_double(),
## .. average_page_views = col_double(),
## .. average_syllabus_views = col_double(),
## .. Completion = col_double(),
## .. Final_Score = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
Visualize the most common words using ggplot
library(ggplot2) #call in ggplot
comp_words_freq %>%
filter(n > 75) %>% #only select common words (i.e., words with more than 75 occurrences)
mutate(word = reorder(word, n)) %>% #reorder based on size
ggplot(aes(n, word)) + #the AES
geom_col() + #Columns
xlab("Frequency") +
ylab("Word") +
ggtitle("Bar plot: Most frequent words complete")
incomp_words_freq %>%
filter(n > 30) %>% #more than 30 occurrences (the incomplete group has lower counts overall)
mutate(word = reorder(word, n)) %>% #reorder based on size
ggplot(aes(n, word)) + #the AES
geom_col() + #Columns
xlab("Frequency") +
ylab("Word") +
ggtitle("Bar plot: Most frequent words incomplete")
# Are there any differences?
Differences?
Incomplete:
- Greater focus on quiz
- Kappa
- Error
Complete:
- More numbers and functions
- More references to RapidMiner (the tool used in the MOOC)
Combine data for visualization
Create a dataframe with word counts for students who completed and did not complete the MOOC side by side, and plot them to see differences.
- This will create a better visualization.
- Additionally, when comparing two groups, we need to normalize counts by total word counts.
- This is very important: it ensures that corpora/texts with more words do not show higher counts simply because they contain more words.
#create shared tibble
comp_incomp_freq <- bind_rows(mutate(comp_words_freq, Completion = "complete"), mutate(incomp_words_freq, Completion = "incomplete")) %>% #use bind_rows to create the combined tibble; mutate adds a Completion label to each
mutate(proportion = (n / sum(n))*1000) %>% #normalize counts to proportions per 1,000 words
dplyr::select(-n) %>% #get rid of the raw count
pivot_wider(names_from = Completion, values_from = proportion) %>% #get separate columns for the complete and incomplete groups
na.omit() #keep only words that appear in both the complete and incomplete groups
str(comp_incomp_freq)
## tibble [2,102 × 3] (S3: tbl_df/tbl/data.frame)
## $ word : chr [1:2102] "data" "1" "lt" "gt" ...
## $ complete : num [1:2102] 13.62 10.9 8.09 6.98 6.22 ...
## $ incomplete: num [1:2102] 4.235 1.716 0.312 0.334 1.293 ...
## - attr(*, "na.action")= 'omit' Named int [1:5379] 72 88 97 132 137 138 153 156 161 166 ...
## ..- attr(*, "names")= chr [1:5379] "72" "88" "97" "132" ...
head(comp_incomp_freq)
## # A tibble: 6 × 3
## word complete incomplete
## <chr> <dbl> <dbl>
## 1 data 13.6 4.24
## 2 1 10.9 1.72
## 3 lt 8.09 0.312
## 4 gt 6.98 0.334
## 5 2 6.22 1.29
## 6 question 6.15 1.81
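One caveat about the pipeline above: `sum(n)` is computed over the combined tibble, so both groups share a single denominator rather than each being normalized by its own total word count. A sketch of per-group normalization with `group_by()`, using a tiny invented example rather than the MOOC data:

```r
library(dplyr)
library(tidyr)

# Invented mini example: two groups with different total sizes
freq <- tibble(Completion = c("complete", "complete", "incomplete", "incomplete"),
               word       = c("data", "quiz", "data", "quiz"),
               n          = c(60, 40, 15, 5))

per_group <- freq %>%
  group_by(Completion) %>%                   # denominator is each group's own total
  mutate(proportion = n / sum(n) * 1000) %>% # rate per 1,000 words within the group
  ungroup() %>%
  select(-n) %>%
  pivot_wider(names_from = Completion, values_from = proportion)
per_group # 'data': 600 per 1,000 (complete) vs 750 per 1,000 (incomplete)
```

With a shared denominator, the larger group's words get larger proportions by construction; grouping first removes that size effect before the two columns are compared.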
#visualize frequency by word
ggplot(comp_incomp_freq, aes(complete, incomplete, label = word)) + #scatterplot with points labeled by words
geom_text(size = 2) + #make words smaller to see
labs(x= "Complete", y="Incomplete") +
geom_smooth(method=lm, aes(x = complete, y = incomplete)) #put in a regression line
## `geom_smooth()` using formula = 'y ~ x'
## Warning: The following aesthetics were dropped during statistical transformation: label
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
## the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
## variable into a factor?
To check differences and similarities
Correlations and t-tests
library(psych) #psych has a nice function that allows descriptive statistics to be easily calculated for dataframes
##
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
describe(comp_incomp_freq) #get mean and SD by group
## vars n mean sd median trimmed mad min max range
## word* 1 2102 1051.50 606.94 1051.50 1051.50 779.11 1.00 2102.00 2101.00
## complete 2 2102 0.27 0.64 0.09 0.15 0.10 0.02 13.62 13.60
## incomplete 3 2102 0.10 0.20 0.04 0.06 0.03 0.02 4.24 4.21
## skew kurtosis se
## word* 0.00 -1.20 13.24
## complete 10.13 154.07 0.01
## incomplete 8.18 116.29 0.00
#correlation for similarities between word frequencies for complete and incomplete students
cor(comp_incomp_freq$complete, comp_incomp_freq$incomplete)
## [1] 0.8083514
#the word frequencies of the two groups are strongly correlated (r = .81)
#t-test for differences between the complete and incomplete groups
t.test(comp_incomp_freq$complete, comp_incomp_freq$incomplete)
##
## Welch Two Sample t-test
##
## data: comp_incomp_freq$complete and comp_incomp_freq$incomplete
## t = 11.126, df = 2495.5, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.1338705 0.1911568
## sample estimates:
## mean of x mean of y
## 0.2659353 0.1034216
#there is also a significant difference in mean word frequency between the groups