Latent semantic analysis and indexing

The educational technology and digital learning wiki
Jump to navigation Jump to search

Draft

Introduction

“Latent Semantic Indexing (LSI) is an indexing and retrieval method that uses a mathematical technique called Singular value decomposition (SVD) to identify patterns in the relationships between the terms and concepts contained in an unstructured collection of text. LSI is based on the principle that words that are used in the same contexts tend to have similar meanings. A key feature of LSI is its ability to extract the conceptual content of a body of text by establishing associations between those terms that occur in similar contexts.” (Deerwester et al, 1988 cited by Wikipedia

“Latent Semantic Analysis (LSA) is a theory and method for extracting and representing the contextual-usage meaning of words by statistical computations applied to a large corpus of text. The underlying idea is that the totality of information about all the word contexts in which a given word does and does not appear provides a set of mutual constraints that largely determines the similarity of meaning of words and set of words to each other. The adequacy of LSA's reflection of human knowledge has been established in a variety of ways. For example, its scores overlap those of humans on standard vocabulary and subject matter tests, it mimics human word sorting and category judgments, simulates word-word and passage-word lexical priming data and, as reported in Group Papers, accurately estimates passage coherence, learnability of passages by individual students and the quality and quantity of knowledge contained in an essay.” (What is LSA?, retrieved 12:10, 12 March 2012 (CET).

Software

Free latent semantic analysis software is difficult to find.

  • LSA package for R (developed by Fridolin Wild)
  • Gensim - Topic Modelling for Humans, implemented in Python. “Gensim aims at processing raw, unstructured digital texts (“plain text”). The algorithms in gensim, such as Latent Semantic Analysis, Latent Dirichlet Allocation or Random Projections, discover semantic structure of documents, by examining word statistical co-occurrence patterns within a corpus of training documents. These algorithms are unsupervised, which means no human input is necessary – you only need a corpus of plain text documents.” (introduction), retrieved 12:10, 12 March 2012 (CET)
  • SenseClusters by Ted Pedersen et al. This “is a package of (mostly) Perl programs that allows a user to cluster similar contexts together using unsupervised knowledge-lean methods. These techniques have been applied to word sense discrimination, email categorization, and name discrimination. The supported methods include the native SenseClusters techniques and Latent Semantic Analysis.” (SenseClusters, retrieved 12:10, 12 March 2012 (CET).

Links

Introductions

  • Patterns in Unstructured Data, A Presentation to the Andrew W. Mellon Foundation by Clara Yu, John Cuadrado, Maciej Ceglowski, J. Scott Payne (undated). A good introduction to LSI and its use in search engines.

Technical introductions


Bibliography

  • Landauer, T. K., & Dumais, S. T. (1996). How come you know so much? From practical problem to theory. In D. Hermann, C. McEvoy, M. Johnson, & P. Hertel (Eds.), Basic and applied memory: Memory in context. Mahwah, NJ: Erlbaum, 105-126.
  • Landauer, T. K., & Dumais, S. T. (1997). A solution to Plato's problem: The Latent Semantic Analysis theory of the acquisition, induction, and representation of knowledge. Psychological Review, 104, 211-240.
  • Dumais, Susan T. (2005). "Latent Semantic Analysis". Annual Review of Information Science and Technology 38: 188.