Core Questions
What is the big data perspective on educational data?
Is it sufficient just to have lots of data? How much does data quality matter?
Is the language text corpus that Halevy et al. discuss really characterized just by its size? What about the presence of high-quality translations in a large part of the corpus (specifically, the EU transcripts)?
MOOC data is well known to be large in number of learners, but it is often not very well reified: student thinking is not very visible in data from watching videos and completing quizzes where only final answers are entered. How much does this matter?
Halevy et al. recommend using concepts already present in the data, but a lot of educational data mining relies upon discovering new concepts (e.g., gaming the system) and/or connecting concepts in the world to log data (e.g., affect detection). What are the relative merits of each approach?
The approach cited in Halevy et al. leverages domain knowledge and scientific knowledge in linguistics. Does it contribute back to scientific knowledge in linguistics?
Despite the clear use of domain knowledge and scientific knowledge in linguistics, one major figure involved in this perspective has publicly said that existing learning sciences research is not useful for the design, analysis, and improvement of MOOCs. Why might this be? What are your thoughts about this?
Secondary Questions
MOOCs are tiny in scale compared to the linguistic data available from the World Wide Web; so are intelligent tutors and every other source of educational data. Even the sum total of all the data in Blackboard (almost totally unreified) is several orders of magnitude smaller. How can we leverage the unreasonable effectiveness of not-actually-that-big data?
Halevy et al. argue that imperfect connections and labels can be compensated for by sheer scale. Is there some way to know where we are on that tradeoff? How bad can data be at a given size?