Latent semantic analysis and indexing

The educational technology and digital learning wiki

Draft

Introduction

Latent Semantic Indexing (LSI) and Latent Semantic Analysis (LSA) refer to a family of text indexing and retrieval methods.

We believe that LSI and LSA refer to the same topic. LSI tends to be used in the context of web search, whereas LSA is the term used in the context of various forms of academic content analysis. - Daniel K. Schneider 12:32, 12 March 2012 (CET)

“Latent Semantic Indexing (LSI) is an indexing and retrieval method that uses a mathematical technique called Singular value decomposition (SVD) to identify patterns in the relationships between the terms and concepts contained in an unstructured collection of text. LSI is based on the principle that words that are used in the same contexts tend to have similar meanings. A key feature of LSI is its ability to extract the conceptual content of a body of text by establishing associations between those terms that occur in similar contexts.” (Deerwester et al., 1988, cited by Wikipedia)
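
To make the role of the singular value decomposition in this definition more concrete, here is a minimal sketch in plain Python. The toy corpus, the function names, and the choice of two latent dimensions are all assumptions made for illustration; none of them come from the cited sources. Rather than calling a linear algebra library, the sketch recovers the leading singular values and right singular vectors of the term-document matrix A from the eigenpairs of A^T A by power iteration with deflation. That shortcut is adequate for seven tiny documents but is not how real LSI software computes the decomposition.

```python
import math
import random

# Hypothetical toy corpus: four "vehicle" documents and three "flower" documents.
DOCS = [
    "car engine",
    "automobile motor",
    "car automobile",
    "engine motor",
    "flower petal",
    "petal garden",
    "flower garden",
]


def term_document_matrix(documents):
    """Return (vocabulary, A) where A[t][d] counts term t in document d."""
    vocab = sorted({word for doc in documents for word in doc.split()})
    row_of = {word: i for i, word in enumerate(vocab)}
    matrix = [[0.0] * len(documents) for _ in vocab]
    for d, doc in enumerate(documents):
        for word in doc.split():
            matrix[row_of[word]][d] += 1.0
    return vocab, matrix


def gram_matrix(matrix):
    """Return A^T A, the symmetric document-by-document matrix."""
    n = len(matrix[0])
    return [[sum(row[i] * row[j] for row in matrix) for j in range(n)]
            for i in range(n)]


def top_eigenpairs(symmetric, k, iterations=300, seed=0):
    """Top-k (eigenvalue, eigenvector) pairs via power iteration and deflation."""
    n = len(symmetric)
    work = [row[:] for row in symmetric]
    rng = random.Random(seed)
    pairs = []
    for _ in range(k):
        v = [rng.random() + 0.1 for _ in range(n)]
        for _ in range(iterations):
            w = [sum(work[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            v = [x / norm for x in w]
        # Rayleigh quotient of the unit vector v gives its eigenvalue.
        value = sum(v[i] * work[i][j] * v[j] for i in range(n) for j in range(n))
        pairs.append((value, v))
        # Deflation: remove the component just found before the next round.
        for i in range(n):
            for j in range(n):
                work[i][j] -= value * v[i] * v[j]
    return pairs


def lsa_document_vectors(documents, k=2):
    """Coordinates of each document in a k-dimensional latent space."""
    _, matrix = term_document_matrix(documents)
    pairs = top_eigenpairs(gram_matrix(matrix), k)
    # Eigenvalues of A^T A are the squared singular values of A.
    return [[math.sqrt(max(value, 0.0)) * vec[d] for value, vec in pairs]
            for d in range(len(documents))]


def cosine(x, y):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0


_, A = term_document_matrix(DOCS)
raw_vectors = [list(column) for column in zip(*A)]  # one raw count vector per document
lsa_vectors = lsa_document_vectors(DOCS, k=2)

print("raw cosine(doc 0, doc 1):", cosine(raw_vectors[0], raw_vectors[1]))
print("LSA cosine(doc 0, doc 1):", cosine(lsa_vectors[0], lsa_vectors[1]))
print("LSA cosine(doc 0, doc 4):", cosine(lsa_vectors[0], lsa_vectors[4]))
```

In this toy corpus, "car engine" and "automobile motor" share no words, so their raw cosine similarity is zero. In the two-dimensional latent space their similarity should come out close to 1, because both documents are linked through "car automobile" and "engine motor". The similarity between a vehicle document and a flower document should stay close to 0. This is the kind of association between terms that occur in similar contexts that the definition describes.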

“Latent Semantic Analysis (LSA) is a theory and method for extracting and representing the contextual-usage meaning of words by statistical computations applied to a large corpus of text. The underlying idea is that the totality of information about all the word contexts in which a given word does and does not appear provides a set of mutual constraints that largely determines the similarity of meaning of words and set of words to each other. The adequacy of LSA's reflection of human knowledge has been established in a variety of ways. For example, its scores overlap those of humans on standard vocabulary and subject matter tests, it mimics human word sorting and category judgments, simulates word-word and passage-word lexical priming data and, as reported in Group Papers, accurately estimates passage coherence, learnability of passages by individual students and the quality and quantity of knowledge contained in an essay.” (What is LSA?, retrieved 12:10, 12 March 2012 (CET))

Software

Free, easy-to-use latent semantic analysis software is difficult to find. However, there are a number of good packages for the tech-savvy, e.g.:

  • Gensim - Topic Modelling for Humans, implemented in Python. “Gensim aims at processing raw, unstructured digital texts (“plain text”). The algorithms in gensim, such as Latent Semantic Analysis, Latent Dirichlet Allocation or Random Projections, discover semantic structure of documents, by examining word statistical co-occurrence patterns within a corpus of training documents. These algorithms are unsupervised, which means no human input is necessary – you only need a corpus of plain text documents.” (introduction), retrieved 12:10, 12 March 2012 (CET)
  • SenseClusters by Ted Pedersen et al. This “is a package of (mostly) Perl programs that allows a user to cluster similar contexts together using unsupervised knowledge-lean methods. These techniques have been applied to word sense discrimination, email categorization, and name discrimination. The supported methods include the native SenseClusters techniques and Latent Semantic Analysis.” (SenseClusters, retrieved 12:10, 12 March 2012 (CET))
  • S-Space Package, a Java package by David Jurgens and Keith Stevens. “The S-Space Package is a collection of algorithms for building Semantic Spaces as well as a highly-scalable library for designing new distributional semantics algorithms. Distributional algorithms process text corpora and represent the semantic for words as high dimensional feature vectors. These approaches are known by many names, such as word spaces, semantic spaces, or distributed semantics and rest upon the Distributional Hypothesis: words that appear in similar contexts have similar meanings.” (Project overview, retrieved 12:32, 12 March 2012 (CET))
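
The Distributional Hypothesis mentioned in the package descriptions above can be illustrated without installing any of these tools. The following is a minimal plain-Python sketch, not the API of Gensim, SenseClusters, or the S-Space Package. The toy sentences, the window size of two, and the function names are assumptions chosen for illustration. Each word is represented by a vector counting the words that appear near it, and two words are compared by the cosine of their vectors.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy sentences; "cat" and "dog" never occur in the same sentence.
SENTENCES = [
    "the cat drinks milk",
    "the dog drinks milk",
    "the cat chases the ball",
    "the dog chases the ball",
    "the car needs fuel",
    "the truck needs fuel",
]


def cooccurrence_vectors(sentences, window=2):
    """Map each word to a Counter of the words seen within `window` positions."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(x, y):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(count * y[key] for key, count in x.items())
    norm = (math.sqrt(sum(c * c for c in x.values()))
            * math.sqrt(sum(c * c for c in y.values())))
    return dot / norm if norm else 0.0


word_vectors = cooccurrence_vectors(SENTENCES)

print("cat vs dog:", cosine(word_vectors["cat"], word_vectors["dog"]))
print("cat vs car:", cosine(word_vectors["cat"], word_vectors["car"]))
```

Although "cat" and "dog" never appear together, they occur in identical contexts here, so their vectors point in the same direction. "cat" and "car" share only the context word "the" and score lower. Methods such as LSA typically go further than this sketch: they weight the raw counts and reduce the dimensionality of the resulting vectors.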

Links

Introductions

  • Patterns in Unstructured Data, A Presentation to the Andrew W. Mellon Foundation by Clara Yu, John Cuadrado, Maciej Ceglowski, J. Scott Payne (undated). A good introduction to LSI and its use in search engines.

Technical introductions


Bibliography

  • Landauer, T. K., & Dumais, S. T. (1996). How come you know so much? From practical problem to theory. In D. Hermann, C. McEvoy, M. Johnson, & P. Hertel (Eds.), Basic and applied memory: Memory in context. Mahwah, NJ: Erlbaum, 105-126.
  • Landauer, T. K., & Dumais, S. T. (1997). A solution to Plato's problem: The Latent Semantic Analysis theory of the acquisition, induction, and representation of knowledge. Psychological Review, 104, 211-240.
  • Dumais, Susan T. (2005). "Latent Semantic Analysis". Annual Review of Information Science and Technology 38: 188.
  • Jurgens, David and Keith Stevens, (2010). The S-Space Package: An Open Source Package for Word Space Models. In System Papers of the Association of Computational Linguistics.
  • Wild, Fridolin and Christina Stahl, (2007). Investigating Unstructured Texts with Latent Semantic Analysis, in Lenz, Hans-J. (ed). Advances in Data Analysis, Studies in Classification, Data Analysis, and Knowledge Organization, Part V, 383-390, DOI: 10.1007/978-3-540-70981-7_43