Sentiment analysis with R
Analytics and data exploration
Module: Text mining with R
Status: draft | Level: beginner
Last updated: 2020/03/22
Category: R Tutorials
Introduction
For background, read http://en.wikipedia.org/wiki/Sentiment_analysis
The R package "sentiment" is fairly rudimentary. It works with a dictionary of words, each carrying a negative/positive charge between -5 and +5.
Here are a few rows of the afinn96 table, which contains 1480 items (afinn111 contains 2476); a small sketch showing how such a lexicon can be used to score a sentence follows the excerpt.
abandon -2     abandons -2     abandoned -2    absentee -1
absentees -1   aboard 1        abducted -2     abduction -2
abductions -2  abuse -3        abused -3       abuses -3
accept 1       accepting 1     accepts 1       accepted 1
....
axed -1        backed 1        backing 2       backs 1
bad -3         badly -3        bailout -2      bamboozle -2
.....
frikin -2      frustration -2  ftw 3           fuck -4
fucked -4      fuckers -4      fucking -4      fud -3
fulfill 2      fulfilled 2     fulfills 2      fun 4
....
worst -3       worth 2         wow 4           wowow 4
wowww 4        wrong -2        zealot -2       zealots -2
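As an illustration of the general idea, the following sketch scores a sentence by averaging the lexicon values of the words it contains. The tiny lexicon and the scoring rule are assumptions for illustration only, not the package's actual implementation.
# Toy sketch of lexicon-based scoring (not the sentiment package's actual code):
# average the polarity values of the words found in a small hand-made lexicon.
lexicon <- c(bad = -3, terrible = -3, fun = 4, wonderful = 4, accept = 1)
score_sentence <- function(text, lexicon) {
  words <- strsplit(tolower(text), "[^a-z]+")[[1]]    # crude tokenization
  hits  <- lexicon[words[words %in% names(lexicon)]]  # keep only lexicon words
  if (length(hits) == 0) return(0)                    # nothing matched
  mean(hits)                                          # average polarity
}
score_sentence("This is a bad but fun idea", lexicon)  # (-3 + 4) / 2 = 0.5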
Sentiment analysis with the sentiment package
Installation
# Package installation and test
install.packages("devtools")
library("devtools")
install_github("sentiment", "andrie")
library("sentiment")
We can run a quick test and look at the dictionaries that come with the package:
library("sentiment")
sentiment(c("There is a terrible mistake in this work", "This is wonderful!"))
# Two dictionaries for English (built-in)
afinn96
afinn111
# Simple test
sentiment(c("There is a terrible mistake in this work", "This is wonderful!", "this is bloody brilliant"))
Analysis of position papers from the English EduTech Wiki
To test this package, we take the "Position paper" category of EduTechWiki: «These position papers were written by students enrolled in course Education 6620, Issues and Trends in Educational Computing at Memorial University of Newfoundland, Newfoundland and Labrador, Canada.» Obviously, we should not expect strongly contrasted sentiments in such an academic corpus. We shall see...
The R script
Another version can be found here: Analyse_de_sentiments_avec_R/test_script
# ---- set working directory with setwd() if needed, then load the libraries
library("sentiment")
library(tm)
library(XML)
library(tm.plugin.webmining)
# Get the list of position papers from EduTechWiki
cat_pospap <- "http://edutechwiki.unige.ch/mediawiki/api.php?action=query&list=categorymembers&cmtitle=Category:Position_paper&cmlimit=500&cmtype=page&format=xml"
XML_list <- xmlTreeParse(cat_pospap,useInternalNodes = TRUE)
XML_list
XML2_list <- xpathSApply(XML_list, "//cm")
title_list = sapply(XML2_list, function(el) xmlGetAttr(el, "title"))
id_list = sapply(XML2_list, function(el) xmlGetAttr(el, "pageid"))
title_list[[1]]
id_list[[1]]
# --- Identify the URLs for each page (article)
# Start and end of the URL. Note the "pageid" parameter, which retrieves an article by its "pageid"
url_en_start <- "http://edutechwiki.unige.ch/mediawiki/api.php?action=parse&pageid="
url_en_end <- "&format=xml"
article_ids_list <- character(length(id_list))
for (i in 1:length(id_list)) {
  article_ids_list[i] <- paste(url_en_start, id_list[i], url_en_end, sep="")
}
# This is the list of articles
article_ids_list
# Define a reader function that will only read the "text" element
readMWXML <- readXML(spec = list(content = list("node", "//text"),
                                 heading = list("attribute", "//parse/@title")),
                     doc = PlainTextDocument())
# ----- download the page contents
pospap.source <- VCorpus(URISource(article_ids_list, encoding="UTF-8"),
                         readerControl=list(reader=readMWXML, language="en"))
names(pospap.source)
# Replace the "id" values (titles instead of unreadable URLs)
for (j in seq.int (pospap.source)) {
  meta(pospap.source[[j]], "id") <- title_list[j]
}
names(pospap.source)
# Add an enclosing html tag around each document - good voodoo
pospap.source <- tm_map (pospap.source, encloseHTML)
# Write the HTML fragments to files (not necessary, but allows inspection)
writeCorpus(pospap.source, path="./wiki_pospap_source")
# ------------------------------- Clean text into bags of words
pospap.cl1 <- tm_map(pospap.source, content_transformer(tolower))
pospap.cl2 <- tm_map(pospap.cl1, content_transformer(extractHTMLStrip))
pospap.cl2 <- tm_map (pospap.cl2, removePunctuation, preserve_intra_word_dashes = TRUE)
# curly quotes = \u2019
(kill_chars <- content_transformer (function(x, pattern) gsub(pattern, " ", x)))
pospap.cl2 <- tm_map (pospap.cl2, kill_chars, "\u2019")
pospap.cl2 <- tm_map (pospap.cl2, kill_chars,"'")
pospap.cl2 <- tm_map (pospap.cl2, kill_chars,"\\[modifier\\]")
pospap.cl2 <- tm_map (pospap.cl2, kill_chars,"[«»”“\"]")
pospap.essence <- tm_map (pospap.cl2, removeWords, stopwords("english"))
# Optional stemming step (uncomment to work with word stems instead of full words)
# pospap.roots <- tm_map (pospap.essence, stemDocument, language="english")
pospap.roots <- tm_map (pospap.essence, stripWhitespace)
# test
pospap.roots[[1]]
class(pospap.roots)
for (docN in seq.int (pospap.roots)) {
  print (paste (pospap.roots[[docN]]$meta$heading,
                " = ",
                sentiment (pospap.roots[[docN]]$content)))
}
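Instead of just printing the scores, it can be convenient to collect them in a data frame. A minimal sketch, assuming the pospap.roots corpus built above (collapsing the content into one string is a precaution in case a document's content spans several elements):
# Collect titles and sentiment scores in a data frame (sketch based on the
# pospap.roots corpus built above)
scores <- data.frame(
  title = sapply(pospap.roots, function(d) d$meta$heading),
  score = sapply(pospap.roots, function(d) sentiment(paste(d$content, collapse=" "))),
  stringsAsFactors = FALSE)
# Sort from most negative to most positive tone
scores[order(scores$score), ]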
Results
Raw results: there is some variance across the articles; some have a more positive tone than others (a sketch for plotting these scores follows the list).
- "Achievement = 0.311111111111111"
- "Active Learning = 1.05882352941176"
- "Assessments = 0.75"
- "At-risk Learners = -0.573770491803279"
- "Career and Guidance = 0.96875"
- "Classroom management = 0.307692307692308"
- "Collaboration = 1.17721518987342"
- "Comprehension = 0.833333333333333"
- "Constructivist learning = 1.24590163934426"
- "Critical Thinking = 0.623188405797101"
- "Differentiated learning = 0.761904761904762"
- "Early Childhood Education = 0.933333333333333"
- "English Language Arts = 0.3125"
- "ESL = 0.12280701754386"
- "Hearing Imparied Learner = 0.523809523809524"
- "Inclusive learning = 0.644067796610169"
- "Interaction = 1.03703703703704"
- "Learning in Rural Contexts = -0.5"
- "Lifelong Learning = 0.921875"
- "Literacy = 0.652777777777778"
- "Mathematics = 0.643835616438356"
- "Medicine = 0.0181818181818182"
- "Music = 0.647058823529412"
- "Non-Formal Learning = 0.786885245901639"
- "Open Educational Resources = 0.51063829787234"
- "Parents = 0.482758620689655"
- "Personalized learning = 0.46"
- "Portfolio = 1.07272727272727"
- "Professional Learning Communities = 0.795918367346939"
- "Reading = 1.01030927835052"
- "Reflecting = 0.753846153846154"
- "Science = 0.615384615384615"
- "Second Language Assessment = 1"
- "Vocational Learning = 0.528301886792453"
- "Workplace Learning = 0.403508771929825"
- "Writing = 1.01612903225806"