Latent semantic analysis and indexing

Revision as of 13:49, 12 March 2012

Draft

Introduction

Latent Semantic Indexing (LSI) and Latent Semantic Analysis (LSA) refer to a family of text indexing and retrieval methods.

We believe that LSI and LSA refer to the same topic, but LSI tends to be used in the context of web search, whereas LSA is the term used in various forms of academic content analysis. - Daniel K. Schneider 12:49, 12 March 2012 (CET)

“Latent Semantic Indexing (LSI) is an indexing and retrieval method that uses a mathematical technique called Singular value decomposition (SVD) to identify patterns in the relationships between the terms and concepts contained in an unstructured collection of text. LSI is based on the principle that words that are used in the same contexts tend to have similar meanings. A key feature of LSI is its ability to extract the conceptual content of a body of text by establishing associations between those terms that occur in similar contexts.” (Deerwester et al., 1988, cited by Wikipedia)
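The retrieval idea in this quotation can be illustrated with a minimal sketch. It assumes NumPy and a made-up four-document corpus; the names `to_vector`, `fold_in` and `cosine` are hypothetical helpers, not part of any of the packages listed on this page. The sketch builds a term-document matrix, keeps the two largest singular values, and ranks documents against a one-word query projected into the same reduced space.

```python
import numpy as np

# Toy corpus (made up for illustration): two documents about cars, two about cooking.
docs = [
    "car engine repair",
    "automobile engine repair",
    "soup recipe cooking",
    "pasta recipe cooking",
]

vocab = sorted({w for d in docs for w in d.split()})
index = {w: i for i, w in enumerate(vocab)}

def to_vector(text):
    """Raw term-count vector over the vocabulary (unknown words are ignored)."""
    v = np.zeros(len(vocab))
    for w in text.split():
        if w in index:
            v[index[w]] += 1.0
    return v

# Term-document matrix: one row per term, one column per document.
X = np.column_stack([to_vector(d) for d in docs])

# Truncated SVD: keep only the k largest singular values.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
U_k, s_k = U[:, :k], s[:k]
doc_vecs = Vt[:k, :].T  # one k-dimensional vector per document

def fold_in(text):
    """Project a new text into the k-dimensional latent space."""
    return (to_vector(text) @ U_k) / s_k

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

query = fold_in("automobile")
scores = [cosine(query, d) for d in doc_vecs]
for doc, score in zip(docs, scores):
    print(f"{score:6.3f}  {doc}")
```

In this toy setting the query "automobile" should score the first document as highly as the second, although the first contains only "car": both words occur in the same contexts ("engine", "repair"). Real collections would normally use weighted counts (for example tf-idf) and a far larger number of dimensions.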

“Latent Semantic Analysis (LSA) is a theory and method for extracting and representing the contextual-usage meaning of words by statistical computations applied to a large corpus of text. The underlying idea is that the totality of information about all the word contexts in which a given word does and does not appear provides a set of mutual constraints that largely determines the similarity of meaning of words and set of words to each other. The adequacy of LSA's reflection of human knowledge has been established in a variety of ways. For example, its scores overlap those of humans on standard vocabulary and subject matter tests, it mimics human word sorting and category judgments, simulates word-word and passage-word lexical priming data and, as reported in Group Papers, accurately estimates passage coherence, learnability of passages by individual students and the quality and quantity of knowledge contained in an essay.” (What is LSA?, retrieved 12:10, 12 March 2012 (CET))
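The claim that word contexts constrain word similarity can be sketched on a very small scale. The example below assumes NumPy and an invented term-by-passage count matrix; `raw_sim` and `latent_sim` are hypothetical helper names. It compares two words that never occur in the same passage, first on the raw counts and then in a two-dimensional space obtained by truncated SVD.

```python
import numpy as np

terms = ["doctor", "physician", "hospital", "guitar", "concert"]

# Rows: terms; columns: four invented passages (raw counts).
X = np.array([
    [1, 0, 0, 0],  # doctor
    [0, 1, 0, 0],  # physician
    [1, 1, 0, 0],  # hospital
    [0, 0, 1, 1],  # guitar
    [0, 0, 1, 1],  # concert
], dtype=float)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Truncated SVD with k = 2 latent dimensions.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
term_vecs = U[:, :k] * s[:k]  # one k-dimensional vector per term

def raw_sim(w1, w2):
    """Cosine similarity of two terms on the raw count rows."""
    return cosine(X[terms.index(w1)], X[terms.index(w2)])

def latent_sim(w1, w2):
    """Cosine similarity of two terms in the reduced space."""
    return cosine(term_vecs[terms.index(w1)], term_vecs[terms.index(w2)])

print("raw    doctor/physician:", round(raw_sim("doctor", "physician"), 3))
print("latent doctor/physician:", round(latent_sim("doctor", "physician"), 3))
print("latent doctor/guitar:   ", round(latent_sim("doctor", "guitar"), 3))
```

"doctor" and "physician" share no passage, so their raw similarity is zero, yet both co-occur with "hospital" and end up close together in the reduced space. The matrix is deliberately tiny; it only illustrates the mechanism and says nothing about the human-comparison results mentioned in the quotation.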

Software

Free, easy-to-use latent semantic analysis software is difficult to find. However, there are a number of good packages for the tech-savvy, e.g.:

  • LSA package for R developed by Fridolin Wild. “The basic idea of latent semantic analysis (LSA) is, that text do have a higher order (=latent semantic) structure which, however, is obscured by word usage (e.g. through the use of synonyms or polysemy). By using conceptual indices that are derived statistically via a truncated singular value decomposition (a two-mode factor analysis) over a given document-term matrix, this variability problem can be overcome.” (lsa: Latent Semantic Analysis, retrieved 12:49, 12 March 2012 (CET))
  • Gensim - Topic Modelling for Humans, implemented in Python. “Gensim aims at processing raw, unstructured digital texts (“plain text”). The algorithms in gensim, such as Latent Semantic Analysis, Latent Dirichlet Allocation or Random Projections, discover semantic structure of documents, by examining word statistical co-occurrence patterns within a corpus of training documents. These algorithms are unsupervised, which means no human input is necessary – you only need a corpus of plain text documents.” (introduction, retrieved 12:10, 12 March 2012 (CET))
  • SenseClusters by Ted Pedersen et al. This “is a package of (mostly) Perl programs that allows a user to cluster similar contexts together using unsupervised knowledge-lean methods. These techniques have been applied to word sense discrimination, email categorization, and name discrimination. The supported methods include the native SenseClusters techniques and Latent Semantic Analysis.” (SenseClusters, retrieved 12:10, 12 March 2012 (CET))
  • S-Space Package, a Java package by David Jurgens and Keith Stevens. “The S-Space Package is a collection of algorithms for building Semantic Spaces as well as a highly-scalable library for designing new distributional semantics algorithms. Distributional algorithms process text corpora and represent the semantic for words as high dimensional feature vectors. These approaches are known by many names, such as word spaces, semantic spaces, or distributed semantics and rest upon the Distributional Hypothesis: words that appear in similar contexts have similar meanings.” (Project overview, retrieved 12:49, 12 March 2012 (CET))

Links

Introductions

  • Patterns in Unstructured Data, A Presentation to the Andrew W. Mellon Foundation by Clara Yu, John Cuadrado, Maciej Ceglowski, J. Scott Payne (undated). A good introduction to LSI and its use in search engines.

Technical introductions

Bibliography

  • Landauer, T. K., & Dumais, S. T. (1996). How come you know so much? From practical problem to theory. In D. Hermann, C. McEvoy, M. Johnson, & P. Hertel (Eds.), Basic and applied memory: Memory in context. Mahwah, NJ: Erlbaum, 105-126.
  • Landauer, T. K., & Dumais, S. T. (1997). A solution to Plato's problem: The Latent Semantic Analysis theory of the acquisition, induction, and representation of knowledge. Psychological Review, 104, 211-240.
  • Dumais, Susan T. (2005). "Latent Semantic Analysis". Annual Review of Information Science and Technology 38: 188.
  • Jurgens, David and Keith Stevens (2010). The S-Space Package: An Open Source Package for Word Space Models. In System Papers of the Association for Computational Linguistics.
  • Wild, Fridolin and Christina Stahl (2007). Investigating Unstructured Texts with Latent Semantic Analysis. In Lenz, Hans-J. (ed.), Advances in Data Analysis, Studies in Classification, Data Analysis, and Knowledge Organization, Part V, 383-390. DOI: 10.1007/978-3-540-70981-7_43