Natural language processing

The educational technology and digital learning wiki


Natural Language Processing (NLP), also known as Human Language Technologies, Computational Linguistics, or Speech Recognition and Synthesis, is a field of computer science that studies human language as an interface between computers and humans. The goal is to allow computers to process large amounts of natural language data, making them able to perform tasks such as automatic translation between languages, answering questions posed in natural language, or understanding and synthesizing speech.

NLP for educational applications has gained visibility outside of the NLP community. Applications such as automated writing evaluation (AWE), speech scoring, and plagiarism detection have already been used for high-stakes assessment and in instructional contexts (e.g., Massive Open Online Courses). Simulation and gaming applications used for instructional purposes, especially ones focused on language learning, also illustrate how NLP can be applied in educational contexts.

The NLP Pipeline

The first step towards giving a computer the ability to understand language is making it able to recognize words. This step is known as tokenization or word segmentation. In many languages, including English, words are often separated by white spaces, which makes tokenization seem very simple. However, this is not always the case. For example, in many contexts, New York should be considered a single word, while I’m needs to be separated into the two words I and am. Besides that, some languages, such as Chinese, Japanese, and Thai, do not use white spaces between words.
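The pitfalls above can be sketched in a few lines of Python. This is a minimal, illustrative tokenizer, not a production one: the contraction and multiword tables are tiny stand-in examples, and real systems learn such rules from data.

```python
# Toy tokenizer: whitespace splitting plus two illustrative rules.
CONTRACTIONS = {"I'm": ["I", "am"], "don't": ["do", "not"]}  # expand clitics
MULTIWORD = {("New", "York"): "New York"}                    # merge named entities

def tokenize(text):
    tokens = []
    for raw in text.split():
        raw = raw.strip(".,!?")                 # strip trailing punctuation
        tokens.extend(CONTRACTIONS.get(raw, [raw]))
    # Merge adjacent tokens that form a known multiword expression.
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in MULTIWORD:
            merged.append(MULTIWORD[(tokens[i], tokens[i + 1])])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

print(tokenize("I'm flying to New York."))
# ['I', 'am', 'flying', 'to', 'New York']
```

Note how a naive `text.split()` would instead produce "I'm" as one token and "New" and "York." as two, illustrating why tokenization is harder than it first appears.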

In addition to word segmentation, sentence segmentation (separating sentences) is a very important step towards understanding text. Once again, this may seem a simple task, since the separation between sentences is often based on punctuation, but punctuation characters are sometimes ambiguous. The period character ‘.’, for example, can be used not only to separate sentences but also in abbreviations like ‘Mr.’ and ‘Inc.’, in acronyms like ‘m.p.h.’, in website addresses, or in numbers like ’12.5’. For these reasons, many of the advanced techniques for word and sentence segmentation available nowadays are based on machine learning approaches.
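The ambiguity of the period can be made concrete with a small rule-based splitter. This is a sketch under simplifying assumptions: the abbreviation list is a tiny illustrative sample, whereas learned systems induce such exceptions from corpora.

```python
import re

# Illustrative abbreviation list; real systems learn these exceptions from data.
ABBREVIATIONS = {"Mr.", "Mrs.", "Dr.", "Inc.", "m.p.h."}

def split_sentences(text):
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        # A period ends a sentence only if the token is not a known
        # abbreviation and is not a number like 12.5.
        if (token.endswith(".") and token not in ABBREVIATIONS
                and not re.fullmatch(r"\d+(\.\d+)*\.?", token)):
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences

print(split_sentences("Mr. Smith drove 12.5 miles. He was late."))
# ['Mr. Smith drove 12.5 miles.', 'He was late.']
```

A splitter that simply broke on every ‘.’ would wrongly cut the text after "Mr." and "12.5", producing four fragments instead of two sentences.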

Once the segmentation process is done, we want to ascribe “meaning” to words and sentences. This is a very challenging step, mainly due to the ambiguous nature of language. Jurafsky and Martin[1] provide the following example to illustrate this challenge: the sentence I made her duck has at least five different meanings:

  • I cooked waterfowl for her.
  • I cooked waterfowl belonging to her.
  • I created the (plaster?) duck she owns.
  • I caused her to quickly lower her head or body.
  • I waved my magic wand and turned her into undifferentiated waterfowl.

The ambiguity of the sentence has several different sources. For example, the word make has two different meanings here: create or cook. If we consider the sentence in its spoken form, there is also phonetic ambiguity, since the first word could have been eye, or the second word maid. Processing natural language therefore requires that we resolve, or disambiguate, these ambiguities. Techniques like part-of-speech tagging, word-sense disambiguation, syntactic disambiguation, and lexical disambiguation can be used for this purpose. This means that to fully understand natural language, we need knowledge at the following levels:

  • Phonetics and Phonology: the sounds of human speech for a given word.
  • Morphology: how words are formed, and their relationship to other words in the same language.
  • Syntax: rules that define the structural relationships between words and, therefore, govern the structure of a sentence.
  • Semantics: knowledge of the meaning of each word in a language.
  • Pragmatics: understanding relationships between sentence meaning and the speaker’s intentions
  • Discourse: knowledge about linguistic units larger than a single utterance
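The duck example above can be restated as a part-of-speech problem. The toy lexicon below is an illustrative assumption (the tag inventory and entries are made up for this sketch); it only enumerates the readings licensed by the lexicon, which is exactly the search space a real POS tagger must narrow down to one tagging.

```python
from itertools import product

# Toy lexicon: each word maps to its possible parts of speech.
LEXICON = {
    "I": ["PRON"], "made": ["VERB"], "her": ["PRON", "DET"],
    "duck": ["NOUN", "VERB"],  # cooked waterfowl vs. lower one's head
}

def possible_taggings(words):
    """Enumerate every POS assignment licensed by the lexicon."""
    options = [LEXICON.get(w, ["X"]) for w in words]
    return [list(zip(words, tags)) for tags in product(*options)]

for tagging in possible_taggings(["I", "made", "her", "duck"]):
    print(tagging)
# 4 readings (her = PRON/DET, duck = NOUN/VERB); a tagger's job is to pick one
```

The two lexical ambiguities multiply into four candidate taggings, which is why "I made her duck" supports both the cooked-waterfowl and the lowered-her-head readings.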

NLP for Educational Applications

The 12th Workshop on Innovative Use of NLP for Building Educational Applications highlights the following applications of NLP in education:

  • Automated scoring/evaluation for written student responses
    • Content analysis for scoring/assessment
    • Analysis of the structure of argumentation
    • Grammatical error detection and correction
    • Discourse and stylistic analysis
    • Plagiarism detection
    • Machine translation for assessment, instruction and curriculum development
    • Detection of non-literal language (e.g., metaphor)
    • Sentiment analysis
    • Non-traditional genres (beyond essay scoring)
  • Intelligent Tutoring (IT) and Game-based assessment that incorporates NLP
    • Game-based learning
    • Dialogue systems in education
    • Hypothesis formation and testing
    • Multi-modal communication between students and computers
    • Generation of tutorial responses
    • Knowledge representation in learning systems
    • Concept visualization in learning systems
  • Learner cognition
    • Assessment of learners' language and cognitive skill levels
    • Systems that detect and adapt to learners' cognitive or emotional states
    • Tools for learners with special needs
  • Use of corpora in educational tools
    • Data mining of learner and other corpora for tool building
    • Annotation standards and schemas / annotator agreement
  • Tools and applications for classroom teachers and/or test developers
    • NLP tools for second and foreign language learners
    • Semantic-based access to instructional materials to identify appropriate texts
    • Tools that automatically generate test questions
    • Processing of and access to lecture materials across topics and genres
    • Adaptation of instructional text to individual learners' grade levels
    • Tools for text-based curriculum development
    • E-learning tools for personalized course content
    • Language-based educational games

To summarize, NLP plays an important role in the development of educational applications. It can be used to support a wide range of learning domains, including writing, speaking, reading, science, and mathematics, and it plays an especially important role in language learning. For this reason, it has gained visibility outside of the NLP community, and some NLP applications have already been deployed commercially. Automated writing evaluation (AWE) and speech scoring applications, for example, are already used in high-stakes assessment and instructional contexts, and are incorporated into Massive Open Online Course (MOOC) systems to handle the large volume of assignments. Plagiarism detection is also prevalent among commercially available NLP applications in education.
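As a closing illustration, the core idea behind many plagiarism detectors can be sketched very simply: compare the word n-grams shared by two texts. This is only a minimal sketch of the general n-gram overlap technique, not the algorithm of any particular commercial product.

```python
def ngrams(text, n=3):
    """Set of word n-grams of a lowercased text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def overlap(a, b, n=3):
    """Jaccard similarity between the n-gram sets of two texts."""
    A, B = ngrams(a, n), ngrams(b, n)
    if not A and not B:
        return 0.0
    return len(A & B) / len(A | B)

original = "natural language processing supports many educational applications"
copied   = "natural language processing supports several educational applications"
print(round(overlap(original, copied), 2))
# 0.25
```

Even a one-word substitution leaves several trigrams intact, so near-copies score far above unrelated texts; production systems add normalization, stemming, and efficient indexing on top of this idea.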


References


  1. Jurafsky, D., & Martin, J. H. (2014). Speech and language processing (Vol. 3). London: Pearson.