Educational data mining



Educational Data Mining is a research area. Let's look at a few definitions:

“Educational Data Mining (EDM) is an emerging multidisciplinary research area, in which methods and techniques for exploring data originating from various educational information systems have been developed. EDM is both a learning science, as well as a rich application area for data mining, due to the growing availability of educational data. EDM contributes to the study of how students learn, and the settings in which they learn. It enables data-driven decision making for improving the current educational practice and learning material.” (Calders and Pechenizkiy, 2012).

“Educational Data Mining is an emerging discipline, concerned with developing methods for exploring the unique types of data that come from educational settings, and using those methods to better understand students, and the settings which they learn in. Whether educational data is taken from students' use of interactive learning environments, computer-supported collaborative learning, or administrative data from schools and universities, it often has multiple levels of meaningful hierarchy, which often need to be determined by properties in the data itself, rather than in advance. Issues of time, sequence, and context also play important roles in the study of educational data.” (Educational Data Mining Society home page, retrieved Jan 17, 2014)

“Educational Data Mining is an emerging discipline, concerned with developing methods for exploring the unique types of data that come from educational settings, and using those methods to better understand students, and the settings in which they learn.” (JEDM - Journal of Educational Data Mining, retrieved Jan 17, 2014)

Educational data mining is rooted in general data mining. However, there are specifics: “Data mining, also called Knowledge Discovery in Databases (KDD), is the field of discovering novel and potentially useful information from large amounts of data [Witten and Frank 1999]. It has been proposed that educational data mining methods are often different from standard data mining methods, due to the need to explicitly account for (and the opportunities to exploit) the multi-level hierarchy and non-independence in educational data [Baker in press]. For this reason, it is increasingly common to see the use of models drawn from the psychometrics literature in educational data mining publications [Barnes 2005; Desmarais and Pu 2005; Pavlik et al. 2008].” (Baker & Yacef, 2009)


The data mining pipeline

A typical research process is described by the Report on the Experiences of First Respondents to the Digging Into Data Challenge (Williford & Henry, 2012).

a. hypothesis and/or question formation;
b. selection of a corpus or corpora;
c. exploration of a corpus or corpora;
d. querying and correcting, modifying, or amending the data as needed;
e. pulling together subsets of data relevant to a given question;
f. making observations about those data; and
g. drawing conclusions from and/or interpreting those data.
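
The steps above can be sketched on a toy data set. The following Python fragment is only an illustration: the records, the field names (`forum_posts`, `grade`) and the research question are hypothetical and do not come from the report.

```python
from statistics import mean

# Step b: a hypothetical corpus, one record per student.
corpus = [
    {"student": "s1", "forum_posts": 12, "grade": 5.0},
    {"student": "s2", "forum_posts": 0, "grade": 3.5},
    {"student": "s3", "forum_posts": 7, "grade": None},  # grade is missing
    {"student": "s4", "forum_posts": 9, "grade": 4.5},
    {"student": "s5", "forum_posts": 1, "grade": 4.0},
]


def clean(records):
    """Step d: drop records whose grade is missing."""
    return [r for r in records if r["grade"] is not None]


def subset(records, min_posts):
    """Step e: split records into active and less active forum users."""
    active = [r for r in records if r["forum_posts"] >= min_posts]
    passive = [r for r in records if r["forum_posts"] < min_posts]
    return active, passive


def observe(records):
    """Step f: summarise a subset by its mean grade."""
    return mean(r["grade"] for r in records)


# Step a: the question. Do active forum users obtain higher grades?
# Step c: exploration, reduced here to looking at the corpus size.
print(len(corpus), "records in the corpus")

cleaned = clean(corpus)
active, passive = subset(cleaned, min_posts=5)

# Step g: interpreting the two means is left to the researcher.
print("active:", observe(active), "passive:", observe(passive))
```

With only five records, no conclusion could be drawn; the point is the order of the operations, not the result.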

Data mining vs. learning analytics

According to Baker (2013), “Since the 1960s, methods for extracting useful information from large data sets, termed analytics or data mining, have played a key role in fields such as physics and biology. In the last few years, the same trend has emerged in educational research and practice, an area termed learning analytics (LA; Ferguson, 2012) or educational data mining (EDM; Baker & Yacef, 2009). In brief, these two research areas seek to find ways to make beneficial use of the increasing amounts of data available about learners in order to better understand the processes of learning and the social and motivational factors surrounding learning.”

According to Baker and Siemens (2013), “Siemens and Baker (2012) noted that the two communities have considerable overlap (in terms of both research and researchers), and that the two communities strongly believe in conducting research that has applications that benefit learners as well as informing and enhancing the learning sciences.”

Data mining components

According to Baker & Yacef (2009), Romero & Ventura (2007) identify the following types of educational data mining:

  1. Statistics and visualization
  2. Web mining
    1. Clustering, classification, and outlier detection
    2. Association rule mining and sequential pattern mining
    3. Text mining

Baker & Yacef (2009) then summarize a new typology defined in Baker (2010):

  1. Prediction
    • Classification
    • Regression
    • Density estimation
  2. Clustering
  3. Relationship mining
    • Association rule mining
    • Correlation mining
    • Sequential pattern mining
    • Causal data mining
  4. Distillation of data for human judgment
  5. Discovery with models
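
To make the first category more concrete, here is a minimal sketch of prediction by regression: a least-squares line that predicts a final score from the number of completed exercises. The data and the function names (`fit_line`, `predict`) are invented for illustration; real EDM studies typically use richer models and proper validation.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one predictor: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Apply a fitted (slope, intercept) pair to a new value."""
    slope, intercept = model
    return slope * x + intercept


# Hypothetical observations: exercises completed and final score.
exercises = [1, 2, 3, 4, 5]
scores = [52, 58, 61, 67, 72]

model = fit_line(exercises, scores)
print("predicted score for 6 exercises:", round(predict(model, 6), 1))
```

Classification follows the same pattern with a categorical outcome (e.g. pass or fail) instead of a numeric one.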

Calders & Pechenizkiy (2011) map basic EDM tasks onto traditional data mining problems:

| Classic DM problem | Educational example | Author |
| --- | --- | --- |
| Classification | categorizing and profiling students, determine their learning styles and preferences | Cha et al., 2006 |
| Predictive modeling | inducing models that can predict whether (and when) a student will pass a course or not or will eventually graduate or drop out | Hämäläinen & Vinni, 2006; Dekker et al., 2009 |
| Clustering | grouping similar students (based on behavior, performance, etc) or grouping similar courses, assignments, etc together, exploring collaborative learning patterns | Perera et al., 2009 |
| Biclustering | finding which questions (tasks, courses, etc) are difficult/easy for which students | |
| Frequent pattern mining | finding (elective) courses often taken together or popular paths in study programs or actions in LMS | Zaïane, 2001 |
| Emerging pattern mining | finding patterns that capture significant differences in behavior of students who graduated vs. those students who did not or that explain the changes in behavior of student generations over different years | |
| Collaborative filtering and recommendations | recommending suitable learning objects, based on the analysis of the performance of other learners, recommending remedial classes to students | Perera et al., 2009 |
| Visual analytics | facilitating reasoning about the educational processes or learning results via interactive data/model visualization, e.g. visualizing collaborations of students | |
| Process mining | understanding the study curriculum, how students follow it, (not) obeying particular constraints, understanding bottlenecks in particular study programs | |
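
As an illustration of frequent pattern mining, the following sketch counts which pairs of courses are often taken together. The enrolment lists and the `min_support` threshold are made up, and a real study would rather use an established algorithm such as Apriori on much larger data.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(transactions, min_support):
    """Return course pairs whose support (share of students) is >= min_support."""
    counts = Counter()
    for courses in transactions:
        # Sort so that each pair is always counted under the same key.
        for pair in combinations(sorted(set(courses)), 2):
            counts[pair] += 1
    total = len(transactions)
    return {
        pair: count / total
        for pair, count in counts.items()
        if count / total >= min_support
    }


# Hypothetical enrolments: one list of courses per student.
enrolments = [
    ["stats", "ml", "databases"],
    ["stats", "ml"],
    ["stats", "hci"],
    ["ml", "databases"],
]

for pair, support in frequent_pairs(enrolments, min_support=0.5).items():
    print(pair, support)
```

Support is the simplest interestingness measure; Merceron & Yacef (2008) discuss other measures for association rules in educational data.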

Data mining relies on several types of sources:

  • Log files (if you have access)
  • Analytics databases filled with data from client-side JavaScript code (user actions such as entering a page can be recorded and the user can be traced through cookies)
  • Web page contents
  • Database contents
  • Productions (other than websites), e.g. word processing documents
  • ...
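
Log files usually have to be parsed into events before any mining can start. The sketch below assumes a hypothetical line format (ISO timestamp, user id, action, resource, separated by spaces); real web server or LMS logs differ and need their own parser.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical log lines; the last one is deliberately malformed.
LOG_LINES = [
    "2014-01-17T10:02:11 u1 view /course/intro",
    "2014-01-17T10:05:40 u1 submit /quiz/1",
    "2014-01-17T10:06:02 u2 view /course/intro",
    "malformed line",
]


def parse_line(line):
    """Turn one log line into an event dict, or None if it cannot be parsed."""
    parts = line.split()
    if len(parts) != 4:
        return None
    timestamp, user, action, resource = parts
    try:
        when = datetime.fromisoformat(timestamp)
    except ValueError:
        return None
    return {"time": when, "user": user, "action": action, "resource": resource}


def actions_per_user(lines):
    """Count the parsed events of each user, skipping unreadable lines."""
    counts = defaultdict(int)
    for line in lines:
        event = parse_line(line)
        if event is not None:
            counts[event["user"]] += 1
    return dict(counts)


print(actions_per_user(LOG_LINES))
```

Skipping unreadable lines silently is a simplification; in practice it is worth counting them, since a high share of rejected lines usually points to a wrong format assumption.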

Data logging and warehousing standards

According to Baker and Siemens (2013), standards for logging data exist; they cite Koedinger et al. (2010). However, most projects and systems seem to rely on ad hoc formats.

EDM for learning analytics

Types of analytics that can be obtained from EDM

  • quality of text
  • richness of content
  • content (with respect to some benchmark text)
  • similarity of content (among productions)
  • etc. (this list needs to be completed)
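
Two of these indicators can be approximated with very simple text measures. The sketch below uses the type-token ratio as a rough stand-in for richness of content and bag-of-words cosine similarity for similarity among productions. Both choices are assumptions made for illustration and are not prescribed by the literature cited on this page.

```python
import math
import re
from collections import Counter


def tokens(text):
    """Lower-case a text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def type_token_ratio(text):
    """Share of distinct words among all words (0.0 for an empty text)."""
    words = tokens(text)
    return len(set(words)) / len(words) if words else 0.0


def cosine_similarity(text_a, text_b):
    """Cosine of the angle between the word-count vectors of two texts."""
    va, vb = Counter(tokens(text_a)), Counter(tokens(text_b))
    dot = sum(va[w] * vb[w] for w in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical student productions.
essay_1 = "Data mining finds patterns in educational data."
essay_2 = "Educational data mining finds patterns in data from learners."

print("richness of essay 1:", round(type_token_ratio(essay_1), 2))
print("similarity:", round(cosine_similarity(essay_1, essay_2), 2))
```

The same similarity function can compare a production with a benchmark text, which corresponds to the third item in the list.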





  • JEDM - Journal of Educational Data Mining


Organizations and communities


  • Ryan Baker (Ryan Shaun Joazeiro de Baker). Includes many interesting online (or draft) EDM publications.


  • Baepler, P., & Murdoch, C. J. (2010). Academic Analytics and Data Mining in Higher Education. International Journal for the Scholarship of Teaching and Learning, 4(2).
  • Hübscher, R. & Puntambekar, S. (2008). Integrating knowledge gained from data mining with pedagogical knowledge. In Proceedings of the 1st International Conference on Educational Data Mining (EDM2008), 97–106. (PDF).
  • Baker, R., Siemens, G. (in press) Educational data mining and learning analytics. To appear in Sawyer, K. (Ed.) Cambridge Handbook of the Learning Sciences: 2nd Edition preprint draft pdf
  • Baker, R.S.J.d. (2010) Data Mining for Education. In McGaw, B., Peterson, P., Baker, E. (Eds.) International Encyclopedia of Education (3rd edition), vol. 7, pp. 112-118. Oxford, UK: Elsevier. Draft PDF
  • Baker, R.S.J.d. (2010) Mining Data for Student Models. In Nkambou, R., Mizoguchi, R., & Bourdeau, J. (Eds.) Advances in Intelligent Tutoring Systems, pp. 323-338. Secaucus, NJ: Springer. PDF Reprint
  • Baker, R.S.J.d. (2013) Learning, Schooling, and Data Analytics. Handbook on Innovations in Learning for States, Districts, and Schools, pp.179-190. Philadelphia, PA: Center on Innovations in Learning. pdf
  • Baker, R.S.J.d., Inventado, P.S. (in press) Educational Data Mining and Learning Analytics. To appear in J.A. Larusson, B. White (Eds.) Learning Analytics: From Research to Practice. Berlin, Germany: Springer. preprint draft pdf
  • Baker, R.S.J.d., Yacef, K. (2009) The State of Educational Data Mining in 2009: A Review and Future Visions. Journal of Educational Data Mining, 1 (1), 3-17. PDF
    • This is a frequently cited article
  • Baker, R.S.J.d., de Carvalho, A. M. J. A. (2008) Labeling Student Behavior Faster and More Precisely with Text Replays. Proceedings of the 1st International Conference on Educational Data Mining, 38-47.
  • Barnes, T. 2005. The q-matrix method: Mining student response data for knowledge. In Proceedings of the AAAI-2005 Workshop on Educational Data Mining.
  • Calders, Toon & Mykola Pechenizkiy (2011), Introduction to The Special Section on Educational Data Mining, SIGKDD Explorations 13 (2). PDF Reprint
  • Calders, Toon & Mykola Pechenizkiy (2011b), Data Mining for Improving Textbooks, SIGKDD Explorations 13 (2). PDF
  • Dekker, G., M. Pechenizkiy, and J. Vleeshouwers (2009). Predicting students drop out: A case study. In Proceedings of the 2nd International Conference on Educational Data Mining, EDM'09, pages 41-50.
  • Desmarais, M.C. and Pu, X. 2005. A Bayesian Student Model without Hidden Nodes and Its Comparison with Item Response Theory. International Journal of Artificial Intelligence in Education 15, 291-323.
  • Gobert, J.D., Sao Pedro, M., Raziuddin, J., Baker, R. (2013) From Log Files to Assessment Metrics: Measuring Students' Science Inquiry Skills Using Educational Data Mining. Journal of the Learning Sciences, 22 (4), 521-563 official pdf
  • Gobert, J.D., Sao Pedro, M.A., Baker, R.S.J.d., Toto, E., Montalvo, O. (2012) Leveraging Educational Data Mining for Real-time Performance Assessment of Scientific Inquiry Skills within Microworlds. Journal of Educational Data Mining, 4 (1), 111-143 pdf
  • Cha, H. J., Y. S. Kim, S. H. Park, T. B. Yoon, Y. M. Jung, and J.-H. Lee (2006). Learning styles diagnosis based on user interface behaviors for the customization of learning interfaces in an intelligent tutoring system. In Proceedings of the 8th International Conference on Intelligent Tutoring Systems, ITS 2006, volume 4053 of Lecture Notes in Computer Science , pages 513-524. Springer.
  • Hübscher, R., Puntambekar, S., & Nye, A. H. (2007). Domain specific interactive data mining. In Proceedings of Workshop on Data Mining for User Modeling, 11th International Conference on User Modeling, Corfu, Greece, 81–90. (PDF)
  • Koedinger, K.R., Baker, R.S.J.d., Cunningham, K., Skogsholm, A., Leber, B., Stamper, J. (2010) A Data Repository for the EDM community: The PSLC DataShop. In Romero, C., Ventura, S., Pechenizkiy, M., Baker, R.S.J.d. (Eds.) Handbook of Educational Data Mining. Boca Raton, FL: CRC Press, pp. 43-56.
  • Macfadyen, L. P. and Sorenson, P. (2010) “Using LiMS (the Learner Interaction Monitoring System) to track online learner engagement and evaluate course design.” In Proceedings of the 3rd international conference on educational data mining (pp. 301–302), Pittsburgh, USA.
  • Macfadyen, L. P., & Dawson, S. (2010). Mining LMS data to develop an “early warning” system for educators: a proof of concept. Computers & Education, 54(2), 588–599.
    • This is an often cited text defining how EDM could be directly useful to educators.
  • Merceron, A. and K. Yacef (2005b). Educational data mining: a case study. In C. K. Looi, G. McCalla, B. Bredeweg, and J. Breuker, editors, Proceedings of the 12th Conference on Artificial Intelligence in Education, pages 467-474, Amsterdam, The Netherlands, 2005. IOS Press.
  • Merceron, A. and K. Yacef. Interestingness measures for association rules in educational data. In Proceedings of Educational Data Mining Conference, pages 57-66, 2008.
  • Lent, B., R. Agrawal, R. Srikant: "Discovering Trends in Text Databases", Proc. of the 3rd Int'l Conference on Knowledge Discovery in Databases and Data Mining, Newport Beach, California, August 1997. PDF
  • Pavlik, P., Cen, H., Wu, L. and Koedinger, K. 2008. Using Item-type Performance Covariance to Improve the Skill Model of an Existing Tutor. In Proceedings of the 1st International Conference on Educational Data Mining, 77-86.
  • Perera, D.; J. Kay, K. Yacef, and I. Koprinska. Mining learners' traces from an online collaboration tool. In Proceedings of Educational Data Mining workshop, pages 60-69, 2007. HTML
  • Perera, D., J. Kay, I. Koprinska, K. Yacef, and O. R. Zaïane (2009). Clustering and sequential pattern mining of online collaborative learning data. IEEE Transactions on Knowledge and Data Engineering, 21(6):759-772.
  • Romero, C., Ventura, S.N., & Garcia, E. (2008). Data mining in course management systems: Moodle case study and tutorial, Computers & Education, 51(1), 368-384.
  • Romero, C., Ventura, S. (2010) Educational Data Mining: A Review of the State-of-the-Art. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, 40 (6). PDF (Access restricted)
  • Romero, C., & Ventura, S. (2007). Educational data mining: A survey from 1995 to 2005. Expert Systems with Applications, 33(1), 135-146. PDF reprint
  • Southavilay, V., K. Yacef, and R. A. Calvo. Analysis of collaborative writing processes using hidden Markov models and semantic heuristics. Submitted to the Workshop on Semantic Aspects in Data Mining (SADM) at ICDM2010, 2010.
  • Southavilay, V., K. Yacef, and R. A. Calvo (2010). Process mining to support students’ collaborative writing (best student paper award). In Educational Data Mining conference proceedings, pages 257-266, 2010.
  • Southavilay Vilaythong, Lina Markauskaite, Michael J. Jacobson (2013). From “Events” to “Activities”: Creating Abstraction Techniques for Mining Students’ Model-Based Inquiry Processes, EDM 2013 Conference. PDF
  • W. Hämäläinen and M. Vinni (2006). Comparison of machine learning methods for intelligent tutoring systems. In Proceedings of the 8th International Conference on Intelligent Tutoring Systems, ITS 2006, volume 4053 of Lecture Notes in Computer Science, pages 525-534. Springer.
  • Williford, Christa and Charles Henry. Research Design by Amy Friedlander, (2012). One Culture. Computationally Intensive Research in the Humanities and Social Sciences. A Report on the Experiences of First Respondents to the Digging into Data Challenge. CLIR pub151, ISBN 978-1-932326-40-6.
  • Winne, P.H., Baker, R.S.J.d. (2013) The Potentials of Educational Data Mining for Researching Metacognition, Motivation, and Self-Regulated Learning. Journal of Educational Data Mining, 5 (1), 1-8. [pdf]
  • Witten, I.H. and Frank, E. 1999. Data mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann, San Francisco, CA.
  • Y. Ma, B. Liu, C. K. Wong, P. S. Yu, and S. M. Lee (2000). Targeting the right students using data mining. In Proceedings of the 6th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD'00, pages 457-464, New York, USA, 2000. ACM.
  • Zaïane, O. R. (2001). Web usage mining for a better web-based learning environment. In Proceedings of the Conference on Advanced Technology for Education, Banff, Alberta, pages 60-64.