Readings for this section
https://www.tidytextmining.com/tidytext
Expectations for this section
A working knowledge of R.
The process of deriving information from text that can be used to examine research questions
Also called
Texts here can refer to
Text analytics involves
Language features commonly assessed in NLP include
The tidyverse is a collection of R packages designed for data science.
What are the underlying principles?
Use function names that are descriptive and explicit over those that are short and implicit
Focus on verbs (e.g., fit, arrange, etc.) for general methods.
Verb-noun pairs are particularly effective in naming; consider invert_matrix() as a hypothetical function name
Names should be as self-documenting as possible.
Functions should avoid returning a novel data structure. If the results fit an existing data structure, that structure should be used.
New columns can be added to the results without affecting the class of the data.
This may come at some cost in computational performance.
The data frame is the preferred data structure in tidyverse and tidymodels packages
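The naming principle above can be illustrated with a short sketch. `invert_matrix()` is the hypothetical name mentioned earlier, here wrapped around base R's `solve()`:

```r
# Hypothetical verb-noun function name: descriptive and self-documenting,
# unlike a terse alternative such as inv() or im().
invert_matrix <- function(x) {
  stopifnot(is.matrix(x), nrow(x) == ncol(x)) # fail early on non-square input
  solve(x) # with a single argument, solve() returns the matrix inverse
}

m <- matrix(c(2, 0, 0, 2), nrow = 2)
invert_matrix(m) # a 2 x 2 diagonal matrix with 0.5 on the diagonal
```

Note that the function also returns a plain matrix rather than a novel structure, consistent with the data structure principle above.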
We will be using the tidytext framework
Tidytext uses tidy data principles to make text analytics tasks easier, more effective, and consistent with commonly used tools in R.
Tidytext represents text as tidy data using data frame structures.
The tidy text format is a table with one token per row.
There are other R packages that do more advanced text analytics
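As a minimal sketch of the one-token-per-row format (the two example posts below are invented for illustration):

```r
library(dplyr)
library(tidytext)

# A toy corpus: one row per post (the text values are made up)
toy <- tibble(post = 1:2,
              text = c("Tidy data principles", "One token per row"))

# unnest_tokens() lowercases, strips punctuation, and gives one word per row
toy_words <- toy %>%
  unnest_tokens(word, text)
toy_words # 7 rows; the post id is repeated for each of its words
```

This is the same call we will use on the MOOC forum posts below, where the other columns (clickstream variables, completion, etc.) are carried along in the same way as `post` here.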
The tidytext pipeline
We are going to look at forum posts from a massive open online course (MOOC) offered by Ryan Baker in 2013 on Coursera.
For more information, see
Crossley, S. A., Paquette, L., Dascalu, M., McNamara, D., & Baker, R. (2016). Combining Click-Stream Data with NLP Tools to Better Understand MOOC Completion. In Gasevic, D., & Lynch, G. (eds.). Proceedings of the 6th International Learning Analytics and Knowledge (LAK) Conference. (pp. 6-14). New York, NY: ACM. doi: 10.1145/2883851.2883931
Big Data in Education
Purpose: Enable students to apply a range of educational data mining methods to answer education research questions and drive intervention and improvement in educational software and systems.
Students
Outcome variables
The texts for analysis
Final data differs from published data
The dataframe also includes a variety of clickstream data including
#clear all memory in R (warning: this removes all objects from your R workspace!)
rm(list=ls(all=TRUE))
Ensure that the dataframe is in the same folder as this R code. You can do this by setting the working directory to the source file location.
Call in our dataframe and break up by completion
#if you do not have these libraries, you will need to install them using the commented-out code below
#install.packages("tidyverse")
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.1 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (http://conflicted.r-lib.org/) to force all conflicts to become errors
library(tidytext)
library(stringr) #this is the tidyverse package for working with strings/characters
#get the forum data from the students that completed the MOOC
comp_texts <- read_csv("final_mooc_baker_data.csv")%>%
filter(Completion == 1)
## Rows: 298 Columns: 21
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): Annonized_name, text, avereage_lecture, average_forum_reads
## dbl (17): nw, No._contributions, No_of_new_threads_started, Page_View, Lectu...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
str(comp_texts)
## spc_tbl_ [180 × 21] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Annonized_name : chr [1:180] "Member_365" "Member_304" "Member_187" "Member_272" ...
## $ text : chr [1:180] "Well got into some trouble too. My input data is like this: Any hints.. or someone can put one or two lines wit"| __truncated__ "I am on Linux, but one obvious guess would be to try 64-bit version. It is possible that your Windows is actual"| __truncated__ "Hello Professor Baker, Do you have a bibliography of the literature you reference in the video lectures (e.g. P"| __truncated__ "Hi, to what does \"this detector\" in question 7 refer to? To the detector in the previous question 5 that alwa"| __truncated__ ...
## $ nw : num [1:180] 27 38 51 51 51 51 53 54 55 57 ...
## $ No._contributions : num [1:180] 1 2 2 2 3 1 1 1 1 2 ...
## $ No_of_new_threads_started : num [1:180] 0 0 1 0 2 0 0 0 0 1 ...
## $ Page_View : num [1:180] 513 780 406 457 801 429 860 484 381 452 ...
## $ Lecture_Action : num [1:180] 1689 536 1102 575 4727 ...
## $ Syllabus_Views : num [1:180] 4 7 5 7 5 10 15 8 4 0 ...
## $ averege_assignment_score : num [1:180] 0.917 1 0.937 0.94 0.878 ...
## $ avereage_lecture : chr [1:180] "7.5" "7.25" "9" "6.75" ...
## $ Average_num_quizzes : num [1:180] 4 2.71 5 5 5.5 ...
## $ average_forum_reads : chr [1:180] "9" "20.125" "9.333333333" "9.285714286" ...
## $ num_upvotes_(total) : num [1:180] 0 5 3 0 0 0 37 0 0 1 ...
## $ Forum_reputation : num [1:180] 0 0 1 0 0 0 0 1 1 0 ...
## $ average_video_viewing : num [1:180] 0.962 0.792 0.819 0.986 0.946 ...
## $ Time_before_deadline_for_first_attempt(Average): num [1:180] 1004320 619588 673750 1124493 732558 ...
## $ Time_before_deadline_for_last_attempt(Average) : num [1:180] 992574 618682 673185 854676 520490 ...
## $ average_page_views : num [1:180] 64.1 74.5 43.9 57.1 87.4 ...
## $ average_syllabus_views : num [1:180] 1 1.17 1.67 1 1.67 ...
## $ Completion : num [1:180] 1 1 1 1 1 1 1 1 1 1 ...
## $ Final_Score : num [1:180] 0.97 1 0.985 1 0.953 ...
## - attr(*, "spec")=
## .. cols(
## .. Annonized_name = col_character(),
## .. text = col_character(),
## .. nw = col_double(),
## .. No._contributions = col_double(),
## .. No_of_new_threads_started = col_double(),
## .. Page_View = col_double(),
## .. Lecture_Action = col_double(),
## .. Syllabus_Views = col_double(),
## .. averege_assignment_score = col_double(),
## .. avereage_lecture = col_character(),
## .. Average_num_quizzes = col_double(),
## .. average_forum_reads = col_character(),
## .. `num_upvotes_(total)` = col_double(),
## .. Forum_reputation = col_double(),
## .. average_video_viewing = col_double(),
## .. `Time_before_deadline_for_first_attempt(Average)` = col_double(),
## .. `Time_before_deadline_for_last_attempt(Average)` = col_double(),
## .. average_page_views = col_double(),
## .. average_syllabus_views = col_double(),
## .. Completion = col_double(),
## .. Final_Score = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
head(comp_texts) #look at the head of the dataframe
## # A tibble: 6 × 21
## Annonized_name text nw No._contributions No_of_new_threads_st…¹ Page_View
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Member_365 "Well… 27 1 0 513
## 2 Member_304 "I am… 38 2 0 780
## 3 Member_187 "Hell… 51 2 1 406
## 4 Member_272 "Hi, … 51 2 0 457
## 5 Member_313 "I ca… 51 3 2 801
## 6 Member_550 "Hi--… 51 1 0 429
## # ℹ abbreviated name: ¹​No_of_new_threads_started
## # ℹ 15 more variables: Lecture_Action <dbl>, Syllabus_Views <dbl>,
## # averege_assignment_score <dbl>, avereage_lecture <chr>,
## # Average_num_quizzes <dbl>, average_forum_reads <chr>,
## # `num_upvotes_(total)` <dbl>, Forum_reputation <dbl>,
## # average_video_viewing <dbl>,
## # `Time_before_deadline_for_first_attempt(Average)` <dbl>, …
dim(comp_texts)
## [1] 180 21
#get the forum data from the students that did not complete the MOOC
incomp_texts <- read_csv("final_mooc_baker_data.csv")%>%
filter(Completion == 0)
## Rows: 298 Columns: 21
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): Annonized_name, text, avereage_lecture, average_forum_reads
## dbl (17): nw, No._contributions, No_of_new_threads_started, Page_View, Lectu...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
dim(incomp_texts)
## [1] 118 21
What is Tokenization?
Basically, we want to keep the content words:
- verbs
- nouns
- adjectives
- adverbs
Get one word per row for each text while removing stop words
data(stop_words) #call in stop word list from tidytext (this is included in the package)
stop_words #(what's in there, over a thousand non-semantic words)
## # A tibble: 1,149 × 2
## word lexicon
## <chr> <chr>
## 1 a SMART
## 2 a's SMART
## 3 able SMART
## 4 about SMART
## 5 above SMART
## 6 according SMART
## 7 accordingly SMART
## 8 across SMART
## 9 actually SMART
## 10 after SMART
## # ℹ 1,139 more rows
#tokenize the data
comp_words <- comp_texts %>%
unnest_tokens(word, text) #one-token-per-row format tokenization in a new column (word) taken from text column
# a tibble with over 80,000 rows (one row per token)
str(comp_words)
## spc_tbl_ [82,865 × 21] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Annonized_name : chr [1:82865] "Member_365" "Member_365" "Member_365" "Member_365" ...
## $ nw : num [1:82865] 27 27 27 27 27 27 27 27 27 27 ...
## $ No._contributions : num [1:82865] 1 1 1 1 1 1 1 1 1 1 ...
## $ No_of_new_threads_started : num [1:82865] 0 0 0 0 0 0 0 0 0 0 ...
## $ Page_View : num [1:82865] 513 513 513 513 513 513 513 513 513 513 ...
## $ Lecture_Action : num [1:82865] 1689 1689 1689 1689 1689 ...
## $ Syllabus_Views : num [1:82865] 4 4 4 4 4 4 4 4 4 4 ...
## $ averege_assignment_score : num [1:82865] 0.917 0.917 0.917 0.917 0.917 ...
## $ avereage_lecture : chr [1:82865] "7.5" "7.5" "7.5" "7.5" ...
## $ Average_num_quizzes : num [1:82865] 4 4 4 4 4 4 4 4 4 4 ...
## $ average_forum_reads : chr [1:82865] "9" "9" "9" "9" ...
## $ num_upvotes_(total) : num [1:82865] 0 0 0 0 0 0 0 0 0 0 ...
## $ Forum_reputation : num [1:82865] 0 0 0 0 0 0 0 0 0 0 ...
## $ average_video_viewing : num [1:82865] 0.962 0.962 0.962 0.962 0.962 ...
## $ Time_before_deadline_for_first_attempt(Average): num [1:82865] 1e+06 1e+06 1e+06 1e+06 1e+06 ...
## $ Time_before_deadline_for_last_attempt(Average) : num [1:82865] 992574 992574 992574 992574 992574 ...
## $ average_page_views : num [1:82865] 64.1 64.1 64.1 64.1 64.1 ...
## $ average_syllabus_views : num [1:82865] 1 1 1 1 1 1 1 1 1 1 ...
## $ Completion : num [1:82865] 1 1 1 1 1 1 1 1 1 1 ...
## $ Final_Score : num [1:82865] 0.97 0.97 0.97 0.97 0.97 ...
## $ word : chr [1:82865] "well" "got" "into" "some" ...
## - attr(*, "spec")=
## .. cols(
## .. Annonized_name = col_character(),
## .. text = col_character(),
## .. nw = col_double(),
## .. No._contributions = col_double(),
## .. No_of_new_threads_started = col_double(),
## .. Page_View = col_double(),
## .. Lecture_Action = col_double(),
## .. Syllabus_Views = col_double(),
## .. averege_assignment_score = col_double(),
## .. avereage_lecture = col_character(),
## .. Average_num_quizzes = col_double(),
## .. average_forum_reads = col_character(),
## .. `num_upvotes_(total)` = col_double(),
## .. Forum_reputation = col_double(),
## .. average_video_viewing = col_double(),
## .. `Time_before_deadline_for_first_attempt(Average)` = col_double(),
## .. `Time_before_deadline_for_last_attempt(Average)` = col_double(),
## .. average_page_views = col_double(),
## .. average_syllabus_views = col_double(),
## .. Completion = col_double(),
## .. Final_Score = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
#remove stop words
comp_words <- comp_words %>%
anti_join(stop_words) #remove stop words using anti_join
## Joining with `by = join_by(word)`
# a tibble of around 33,000 words after stop word removal
#what are the word counts?
comp_words_freq <- comp_words %>%
count(word, sort = TRUE)
# a tibble with a count for each unique word (6,194 words), sorted by frequency
# most common word is 'data' which makes sense
head(comp_words_freq)
## # A tibble: 6 × 2
## word n
## <chr> <int>
## 1 data 611
## 2 1 489
## 3 lt 363
## 4 gt 313
## 5 2 279
## 6 question 276
str(comp_words_freq)
## spc_tbl_ [6,194 × 2] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ word: chr [1:6194] "data" "1" "lt" "gt" ...
## $ n : int [1:6194] 611 489 363 313 279 276 231 204 194 171 ...
## - attr(*, "spec")=
## .. cols(
## .. Annonized_name = col_character(),
## .. text = col_character(),
## .. nw = col_double(),
## .. No._contributions = col_double(),
## .. No_of_new_threads_started = col_double(),
## .. Page_View = col_double(),
## .. Lecture_Action = col_double(),
## .. Syllabus_Views = col_double(),
## .. averege_assignment_score = col_double(),
## .. avereage_lecture = col_character(),
## .. Average_num_quizzes = col_double(),
## .. average_forum_reads = col_character(),
## .. `num_upvotes_(total)` = col_double(),
## .. Forum_reputation = col_double(),
## .. average_video_viewing = col_double(),
## .. `Time_before_deadline_for_first_attempt(Average)` = col_double(),
## .. `Time_before_deadline_for_last_attempt(Average)` = col_double(),
## .. average_page_views = col_double(),
## .. average_syllabus_views = col_double(),
## .. Completion = col_double(),
## .. Final_Score = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
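The frequency table above is dominated by "lt", "gt", and bare digits; "lt" and "gt" are most likely residue of HTML-escaped < and > (&lt; and &gt;) in the raw forum posts. A hedged cleanup sketch, using a mock of the head of comp_words_freq (which tokens to drop is a judgment call, not part of the original pipeline):

```r
library(dplyr)
library(stringr)

# Mock of the head of comp_words_freq (counts copied from the output above)
freq <- tibble(word = c("data", "1", "lt", "gt", "2", "question"),
               n    = c(611, 489, 363, 313, 279, 276))

# Drop probable HTML-entity residue and purely numeric tokens
freq_clean <- freq %>%
  filter(!word %in% c("lt", "gt", "amp")) %>% # assumed entity residue
  filter(!str_detect(word, "^\\d+$"))         # bare numbers
freq_clean # only 'data' and 'question' remain from this head
```

The same two filter() steps could be dropped into the pipelines above before count() if you decide these tokens are noise.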
# a more efficient way to code the above (for students who did not complete the MOOC)
incomp_words_freq <- incomp_texts %>%
unnest_tokens(word, text) %>% #one-token-per-row format tokenization
anti_join(stop_words) %>% #remove stop words using anti_join
count(word, sort = TRUE)
## Joining with `by = join_by(word)`
head(incomp_words_freq)
## # A tibble: 6 × 2
## word n
## <chr> <int>
## 1 data 190
## 2 quiz 103
## 3 question 81
## 4 rapidminer 79
## 5 1 77
## 6 model 74
str(incomp_words_freq)
## spc_tbl_ [3,389 × 2] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ word: chr [1:3389] "data" "quiz" "question" "rapidminer" ...
## $ n : int [1:3389] 190 103 81 79 77 74 72 70 69 63 ...
## - attr(*, "spec")=
## .. cols(
## .. Annonized_name = col_character(),
## .. text = col_character(),
## .. nw = col_double(),
## .. No._contributions = col_double(),
## .. No_of_new_threads_started = col_double(),
## .. Page_View = col_double(),
## .. Lecture_Action = col_double(),
## .. Syllabus_Views = col_double(),
## .. averege_assignment_score = col_double(),
## .. avereage_lecture = col_character(),
## .. Average_num_quizzes = col_double(),
## .. average_forum_reads = col_character(),
## .. `num_upvotes_(total)` = col_double(),
## .. Forum_reputation = col_double(),
## .. average_video_viewing = col_double(),
## .. `Time_before_deadline_for_first_attempt(Average)` = col_double(),
## .. `Time_before_deadline_for_last_attempt(Average)` = col_double(),
## .. average_page_views = col_double(),
## .. average_syllabus_views = col_double(),
## .. Completion = col_double(),
## .. Final_Score = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
Visualize the most common words using ggplot
library(ggplot2) #call in ggplot
comp_words_freq %>%
filter(n > 75) %>% #only select common words (i.e., words with more than 75 occurrences)
mutate(word = reorder(word, n)) %>% #reorder based on size
ggplot(aes(n, word)) + #the AES
geom_col() + #Columns
xlab("Frequency") +
ylab("Word") +
ggtitle("Bar plot: Most frequent words complete")
incomp_words_freq %>%
filter(n > 30) %>% #more than 30 occurrences (the incomplete group has lower counts overall)
mutate(word = reorder(word, n)) %>% #reorder based on size
ggplot(aes(n, word)) + #the AES
geom_col() + #Columns
xlab("Frequency") +
ylab("Word") +
ggtitle("Bar plot: Most frequent words incomplete")
# Are there any differences?
Differences?
Incomplete:
- Greater focus on quiz
- Kappa
- Error
Complete:
- More numbers and functions
- More references to RapidMiner (the tool used in the MOOC)
Combine data for visualization
Create a dataframe with word counts for students who completed and did not complete the MOOC side by side, and plot them to see differences.
- This will create a better visualization.
- Additionally, when comparing two groups, we need to normalize counts by total word counts.
- This is very important: it ensures that corpora/texts with more words do not show higher counts simply because they contain more words.
#create shared tibble
comp_incomp_freq <- bind_rows(mutate(comp_words_freq, Completion = "complete"), mutate(incomp_words_freq, Completion = "incomplete")) %>% #use bind_rows to create the combined tibble; mutate adds a Completion label to each
mutate(proportion = (n / sum(n))*1000) %>% #normalize counts to proportions per 1,000 words
dplyr::select(-n) %>% #get rid of the raw count
pivot_wider(names_from = Completion, values_from = proportion) %>% #get separate columns for the complete and incomplete groups
na.omit() #keep only words that appear in both the complete and incomplete groups
str(comp_incomp_freq)
## tibble [2,102 × 3] (S3: tbl_df/tbl/data.frame)
## $ word : chr [1:2102] "data" "1" "lt" "gt" ...
## $ complete : num [1:2102] 13.62 10.9 8.09 6.98 6.22 ...
## $ incomplete: num [1:2102] 4.235 1.716 0.312 0.334 1.293 ...
## - attr(*, "na.action")= 'omit' Named int [1:5379] 72 88 97 132 137 138 153 156 161 166 ...
## ..- attr(*, "names")= chr [1:5379] "72" "88" "97" "132" ...
head(comp_incomp_freq)
## # A tibble: 6 × 3
## word complete incomplete
## <chr> <dbl> <dbl>
## 1 data 13.6 4.24
## 2 1 10.9 1.72
## 3 lt 8.09 0.312
## 4 gt 6.98 0.334
## 5 2 6.22 1.29
## 6 question 6.15 1.81
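One caveat about the pipeline above: `sum(n)` is computed over the combined tibble, so both groups share a single denominator rather than each being normalized by its own total word count. A sketch of per-group normalization with `group_by()`, using a tiny invented example rather than the MOOC data:

```r
library(dplyr)
library(tidyr)

# Invented mini example: two groups with different total sizes
freq <- tibble(Completion = c("complete", "complete", "incomplete", "incomplete"),
               word       = c("data", "quiz", "data", "quiz"),
               n          = c(60, 40, 15, 5))

per_group <- freq %>%
  group_by(Completion) %>%                   # denominator is each group's own total
  mutate(proportion = n / sum(n) * 1000) %>% # rate per 1,000 words within the group
  ungroup() %>%
  select(-n) %>%
  pivot_wider(names_from = Completion, values_from = proportion)
per_group # 'data': 600 per 1,000 (complete) vs 750 per 1,000 (incomplete)
```

With a shared denominator, the larger group's words get larger proportions by construction; grouping first removes that size effect before the two columns are compared.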
#visualize frequency by word
ggplot(comp_incomp_freq, aes(complete, incomplete, label = word)) + #scatterplot with points labeled by words
geom_text(size = 2) + #make words smaller to see
labs(x= "Complete", y="Incomplete") +
geom_smooth(method=lm, aes(x = complete, y = incomplete)) #put in a regression line
## `geom_smooth()` using formula = 'y ~ x'
## Warning: The following aesthetics were dropped during statistical transformation: label
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
## the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
## variable into a factor?
To check differences and similarities
Correlations and t-tests
library(psych) #psych has a nice function that allows descriptive statistics to be easily calculated for dataframes
##
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
describe(comp_incomp_freq) #get mean and SD by group
## vars n mean sd median trimmed mad min max range
## word* 1 2102 1051.50 606.94 1051.50 1051.50 779.11 1.00 2102.00 2101.00
## complete 2 2102 0.27 0.64 0.09 0.15 0.10 0.02 13.62 13.60
## incomplete 3 2102 0.10 0.20 0.04 0.06 0.03 0.02 4.24 4.21
## skew kurtosis se
## word* 0.00 -1.20 13.24
## complete 10.13 154.07 0.01
## incomplete 8.18 116.29 0.00
#correlation for similarities between word frequencies for complete and incomplete students
cor(comp_incomp_freq$complete, comp_incomp_freq$incomplete)
## [1] 0.8083514
#the word frequencies of the two groups are strongly correlated (r = .81)
#t-test for differences between the complete and incomplete groups
t.test(comp_incomp_freq$complete, comp_incomp_freq$incomplete)
##
## Welch Two Sample t-test
##
## data: comp_incomp_freq$complete and comp_incomp_freq$incomplete
## t = 11.126, df = 2495.5, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.1338705 0.1911568
## sample estimates:
## mean of x mean of y
## 0.2659353 0.1034216
#there is also a significant difference in mean word frequency between the groups