StatMediaWiki: Difference between revisions

Revision as of 15:21, 13 June 2018

Introduction

StatMediaWiki was a project that created tools to collect and aggregate the information available in a MediaWiki installation. StatMediaWiki is free software under the GPL v3 or later. There are two versions of this software: Classic (stable) and Interactive (beta).

As of May 2018 (MediaWiki 1.27+), this tool no longer works: the code breaks with the following error message.

_mysql_exceptions.OperationalError: (1054, "Unknown column 'page_counter' in 'field list'")

As far as I know, MediaWiki did indeed remove the page counters, and this breaks the tool. It's probably not too hard to fix this properly; I just removed the offending code (see "Missing page counter" below).


See also:

Classic StatMediaWiki

Results are static HTML pages including tables and graphics that can help to analyze the status and development of a wiki. The tool seems to be well suited for summarizing student contributions, in particular when used over a limited time range (e.g. six months).

Interactive StatMediaWiki

This version is currently under development. It is an interactive application with several menus, which generate analysis, graphs and tables according to user instructions.

Installation

(under Ubuntu/Debian)

Get the software

This will retrieve the whole archive

svn checkout https://forja.rediris.es/svn/statmediawiki

Other software needed

(for now, we assume that you already have Python installed)

You may have to install some or all of the following:

apt-get install python-gnuplot
apt-get install python-MySQLdb
apt-get install python-NumPy
apt-get install python-SciPy
apt-get install python-Matplotlib

In addition (optional) you may need Graphviz

Create a database user with read-only access to the wiki database

Add a user to the MySQL server
  • E.g. user="analysis" password="xxx" with the SELECT privilege for database "MyWiki"
Add a .my.cnf configuration file to your home directory and specify the following four lines.
[client]
user = analysis
password = xxx
host = localhost
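
As a quick sanity check of such a file, the following minimal sketch (Python 3, standard library only) parses the four lines above and verifies that the expected keys are present. The function name read_client_settings and the inline sample are hypothetical and only serve as illustration; this is not part of StatMediaWiki.

```python
import configparser

# Inline copy of the four lines shown above (illustrative values).
SAMPLE_MY_CNF = """\
[client]
user = analysis
password = xxx
host = localhost
"""

def read_client_settings(text):
    """Return the [client] section as a dict, or raise if a key is missing."""
    # allow_no_value lets the parser accept bare options that real
    # my.cnf files sometimes contain.
    parser = configparser.ConfigParser(allow_no_value=True)
    parser.read_string(text)
    client = parser["client"]
    missing = [key for key in ("user", "password", "host") if key not in client]
    if missing:
        raise ValueError("missing keys in [client]: %s" % ", ".join(missing))
    return dict(client)

settings = read_client_settings(SAMPLE_MY_CNF)
print(settings["user"], "@", settings["host"])
```

To check a real file, read its contents (e.g. from ~/.my.cnf) and pass the text to the function instead of the sample.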
Running on another machine ?

If you don't want to run the analysis scripts on the MediaWiki server, you should add privileges for remote MySQL access (not tested). Our small Sun Fire X4150 two-CPU MediaWiki server coped with an analysis that ran for several days at a typical load average of 1.3, i.e. Python simply takes over one of the CPUs and typical CPU usage is around 13%.

Usage of classic

Basically, you can launch a global analysis with the smw.py command line script. This will generate a website that includes the following statistics:

  • Global usage
  • Data per user (content evolution, activity, top pages, uploads, words cloud)
  • Data per page (content evolution, activity, work distribution, top users, words cloud)
  • Data per category
  • A tags cloud

All pages will be analysed (i.e. wiki pages, talk pages, user pages, user talk pages and so forth). I don't know if this is configurable.

Plot data are rendered as PNG, but also can be exported as CSV.

Performance

Depending on the size of your wiki and the time period, you will have to wait a few minutes, hours, days or weeks. E.g. the analysis of the following wiki took about 180 minutes on the server machine mentioned above.

 
Report period:	2006-03-10T00:00:00 – 2012-01-26T17:29:01
Total users:	69
Total pages:	516
Total edits:	7304
Total bytes:	3354641
Total files:	17
Total visits:	37944
Generated in:	2012-01-26T17:29:09.000042

The following configuration takes much more time to complete, i.e. the time taken seems to increase steeply with users * pages * edits. It took more than a week (7 or 8 days) to complete:

Report period:	2006-08-21T00:00:00 – 2012-01-27T18:37:25
Total users:	529
Total pages:	834 
Total edits:	51957 
Total bytes:	8394153 
Total files:	134
Total visits:	256024 
Generated in:	2012-01-27T18:39:22.652603
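
As a rough sanity check of this estimate, the figures from the two reports above can be compared directly. The sketch below (Python 3) is only arithmetic on the published numbers; taking users * pages * edits as the "workload" follows the guess made above, and 7 days is used as the lower bound for the second run. Two data points cannot establish a growth law.

```python
# Figures copied from the two reports above; run times in minutes.
small = {"users": 69, "pages": 516, "edits": 7304, "minutes": 180}
large = {"users": 529, "pages": 834, "edits": 51957, "minutes": 7 * 24 * 60}

def workload(report):
    """Product users * pages * edits, used here as a crude size measure."""
    return report["users"] * report["pages"] * report["edits"]

workload_ratio = workload(large) / workload(small)
time_ratio = large["minutes"] / small["minutes"]

print("workload ratio: %.1f" % workload_ratio)  # prints 88.1
print("time ratio:     %.1f" % time_ratio)      # prints 56.0
```

With 8 days instead of 7 the time ratio becomes 64. In both cases the run time grew by a smaller factor than the product, so for these two wikis the growth looks closer to proportional than to anything faster.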

Here are the top lines of top after 6 days:

top - 09:29:46 up 8 days,  7:55,  2 users,  load average: 1.08, 1.15, 1.15
Tasks: 272 total,   2 running, 267 sleeping,   0 stopped,   3 zombie
Cpu(s): 12.6%us,  0.1%sy,  0.0%ni, 87.3%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   8193748k total,  7726028k used,   467720k free,   490068k buffers
Swap: 23213884k total,    36644k used, 23177240k free,  3271960k cached
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                         
28173 schneider  20   0 3009m 2.8g 3852 R  100 35.4   8079:32 python                    

The plots shown are not suited to such long periods. However, it can be interesting to analyse a teaching wiki over yearly periods. The same analysis restricted to the period between August 29, 2011 and June 1, 2012 took much less time, i.e. only a few minutes, and the plots were absolutely appropriate.

Site:	BioRousso
Report period:	2011-08-29T00:00:00 – 2012-06-01T00:00:00
Total users:	52
Total pages:	50
Total edits:	2791
Total bytes:	433501
Total files:	2
Total visits:	55515
Generated in:	2012-03-01T12:09:18.149519

smw.py command line parameters

--outputdir: absolute path to the directory where the HTML report site will be generated.
--index: name of the main (initial) file of the report (by default, index.php)
--sitename: name of the wiki that will be shown on the title of the report
--siteurl: URL of the wiki
--subdir: path that has to be added to the URL to get to the wiki (by default /index.php)
--dbname: name of the database of the wiki
--tableprefix: prefix of the tables in the database (only required if you indicated one when installing MediaWiki)
--anonymous: replaces usernames by hashes (salted MD5). Use this if you plan to publish results.
--startdate: start of analysis. Example: --startdate=2010-01-01
--enddate: end of analysis
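
When the same analysis has to be launched for several wikis or periods, the parameters above can be assembled programmatically. The following is a minimal sketch in Python 3; build_smw_command is a hypothetical helper (not part of StatMediaWiki), the default script path is an assumption taken from the examples on this page, and only options shown in those examples are used.

```python
import shlex

def build_smw_command(outputdir, sitename, siteurl, subdir, dbname,
                      tableprefix=None, startdate=None, enddate=None,
                      script="statmediawiki/trunk/smw.py"):
    """Return the smw.py command line as a list of arguments."""
    args = [
        "python", script,
        "--outputdir=" + outputdir,
        "--sitename=" + sitename,
        "--siteurl=" + siteurl,
        "--subdir=" + subdir,
        "--dbname=" + dbname,
    ]
    # Optional parameters are only added when a value was given.
    if tableprefix:
        args.append("--tableprefix=" + tableprefix)
    if startdate:
        args.append("--startdate=" + startdate)
    if enddate:
        args.append("--enddate=" + enddate)
    return args

cmd = build_smw_command(
    outputdir="/web/analysis/biorousso-2011-12",
    sitename="BioRousso",
    siteurl="http://edutechwiki.unige.ch",
    subdir="/biorousso/",
    dbname="xxxx",
    startdate="2011-08-29",
    enddate="2012-06-01",
)
# Print a shell-safe version of the command for copy and paste.
print(" ".join(shlex.quote(arg) for arg in cmd))
```

The list form can also be handed to subprocess.run(cmd) directly, which avoids shell quoting problems altogether.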

smw.py command line examples

Simple

python statmediawiki/trunk/smw.py --outputdir="/web/analysis/dewiki" --sitename=DeWiki --siteurl=http://edutechwiki.unige.ch --subdir="/dewiki/" --dbname=dewiki

You should then see something like:

/export/home/schneide/statmediawiki/trunk/smwget.py:19: DeprecationWarning: the md5 module is deprecated; use hashlib instead
  import md5
---------------------------------------------------------------------------
Welcome to StatMediaWiki 1.1. Web: http://statmediawiki.forja.rediris.es
---------------------------------------------------------------------------
Loaded 14 categories
.....

And remember, the process can take quite a long time even for a small wiki.
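
The DeprecationWarning above points to hashlib as the replacement for the old md5 module. For illustration, here is how a salted MD5 hash of a username (the idea behind the --anonymous option) could be computed with hashlib in Python 3. This is a sketch only: the function name anonymize, the salt value and the way salt and name are combined are assumptions, and StatMediaWiki's actual scheme may differ.

```python
import hashlib

def anonymize(username, salt):
    """Return a hex digest that replaces the username in published reports."""
    # Prepending the salt is one common convention; it is assumed here.
    data = (salt + username).encode("utf-8")
    return hashlib.md5(data).hexdigest()

# The same salt must be reused within one report so that a user keeps
# the same pseudonym on every page.
salt = "some-secret-salt"  # hypothetical value
print(anonymize("Daniel K. Schneider", salt))
```

MD5 is adequate for pseudonyms in a report, but it should not be considered a secure way to protect identities if the salt leaks.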

Date limited examples

python statmediawiki/trunk/smw.py --outputdir="/web/analysis/biorousso-2011-12" --sitename=BioRousso --siteurl=http://edutechwiki.unige.ch --subdir="/biorousso/" --dbname=xxxx --startdate=2011-08-29 --enddate=2012-06-01
python statmediawiki/trunk/smw.py --outputdir="/web/analysis/edutechwiki_fr" --sitename=EduTechWiki_fr --siteurl=http://edutechwiki.unige.ch --subdir="/fr/" --dbname=XXXXX --startdate=2012-02-01 --enddate=2012-07-01


With a database prefix example:

python statmediawiki/trunk/smw.py --outputdir="/web/analysis/edutechwiki_fr" --sitename=EduTechWiki_fr --siteurl=http://edutechwiki.unige.ch --subdir="/fr/" --dbname=XXXXX --tableprefix=mw_

Date limited with database prefix example:

python statmediawiki/trunk/smw.py --outputdir="/web/analysis/edutechwiki_fr" --sitename=EduTechWiki_fr --siteurl=http://edutechwiki.unige.ch --subdir="/fr/" --dbname=XXXXX --tableprefix=mw_ --startdate=2011-09-01 --enddate=2012-04-01

Screenshots

Statmediawiki analytics (EdutechWiki (fr)) - Top users (anonymized)
Statmediawiki analytics (EdutechWiki (fr)) - Most edited pages
Statmediawiki analytics (EdutechWiki (fr)) - User contribution to a page (users not shown)
Statmediawiki analytics (EdutechWiki (fr)) - Tracking if teacher works at night
Statmediawiki analytics (EdutechWiki (fr)) - My top contributions

Usage of interactive

The interactive version includes the following features:

  • It will use the MediaWiki api to download all the contents into a (huge) xml file
  • In other words, you can then run this dump on your own computer. We ran it on the server (works easily if you have a Linux desktop)
  • You then can do various sorts of analysis, but we found the information contained in the static version more interesting.

Below are the steps that we figured out.

Run the GUI

In order to run the interactive version, go to the statmediawiki directory, then type:

  • python branches/interactive/smwgui.py &

This will bring up the GUI:

StatMediaWiki Interactive v. 0.1.7

Create a local dump file

  • Menu Downloader->My Wiki
  • Enter your api URL: E.g. something like http://mywiki.ch/w/api.php
  • To find this URL on your wiki: click on edit for any page, then replace index.php with api.php in the URL. You should see the auto-generated MediaWiki API documentation page.

That operation may take some time (between a few minutes and some hours)
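
The URL substitution described above can be sketched in a few lines of Python 3. The helper name api_url_from_page_url is hypothetical, and the sketch assumes that the wiki uses index.php-style URLs with api.php in the same directory; it will not work for wikis with rewritten short URLs.

```python
from urllib.parse import urlsplit, urlunsplit

def api_url_from_page_url(page_url):
    """Replace the script name in a page URL by api.php and drop the query."""
    parts = urlsplit(page_url)
    # Keep the directory part of the path, e.g. "/w" from "/w/index.php".
    directory = parts.path.rsplit("/", 1)[0]
    return urlunsplit((parts.scheme, parts.netloc, directory + "/api.php", "", ""))

api_url = api_url_from_page_url(
    "http://mywiki.ch/w/index.php?title=Main_Page&action=edit"
)
print(api_url)  # prints http://mywiki.ch/w/api.php
```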

Load a dump file

  • Menu Preprocessor
  • Select the dump you just created (or another one)
  • E.g. a file like edutechwikiunigech_mediawiki-20120323-history.xml

The script will then save the dump file into an SQLite database file; this operation will also take some time. It seems that this database file can't be loaded again later, unless we simply didn't find the feature.

Analyse

  • Play with all the items in Menu: Analyser

Many items don't work, but that's normal since this is alpha software :) Also, some types of analysis can take time. Be patient.

Bugs static version 1.1 as of Jan 2012

These bugs are mostly related to either unexpected database entries or the French language.

Missing page counter

This affects more recent versions, e.g. MW 1.27+

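In file smwload.py, remove the page counters at line 177 and later (I did not do this properly): drop page_counter from the SELECT statement, comment out the line that reads it, and set the value to 0.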

 cursor.execute("SELECT page_id, page_namespace, page_title, page_is_redirect FROM %spage WHERE page_id IN (SELECT DISTINCT rev_page FROM %srevision WHERE rev_timestamp>='%s' and rev_timestamp<='%s')" % (smwconfig.preferences["tablePrefix"], smwconfig.preferences["tablePrefix"], smwconfig.preferences["startDateMW"], smwconfig.preferences["endDateMW"]))
 # page_counter = int(row[4])
 "page_counter": 0, #visits
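
A slightly more defensive variant of this workaround would test whether the column exists and fall back to 0 only when it is missing. The sketch below illustrates the idea in Python 3 with the standard sqlite3 module as a stand-in for MySQL, so that it can be run without a wiki database. The function load_pages and the reduced table schema are hypothetical, and on MySQL the column check would have to be done differently (e.g. with SHOW COLUMNS).

```python
import sqlite3

def load_pages(conn):
    """Load pages into a dict, using 0 visits if page_counter is absent."""
    # PRAGMA table_info is SQLite-specific; the column name is in field 1.
    columns = [row[1] for row in conn.execute("PRAGMA table_info(page)")]
    has_counter = "page_counter" in columns
    fields = "page_id, page_namespace, page_title, page_is_redirect"
    if has_counter:
        fields += ", page_counter"
    pages = {}
    for row in conn.execute("SELECT %s FROM page" % fields):
        pages[row[0]] = {
            "page_namespace": row[1],
            "page_title": row[2],
            "page_is_redirect": row[3],
            "page_counter": int(row[4]) if has_counter else 0,  # visits
        }
    return pages

# Simulate a recent MediaWiki schema without the page_counter column.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page (page_id INTEGER, page_namespace INTEGER, "
             "page_title TEXT, page_is_redirect INTEGER)")
conn.execute("INSERT INTO page VALUES (1, 0, 'Main_Page', 0)")
pages = load_pages(conn)
print(pages[1]["page_counter"])  # prints 0
```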

Namespace bug

You will have to manually edit the Python code (see below) if you use extra namespaces in your wiki. The fix works, but you can run into other problems (see the next section) - Daniel K. Schneider 14:41, 21 march 2012 (CET).

In the trace below, KeyError: 102 seems to refer to an extra namespace used by the wiki (according to Erkan Yilmaz, 2011-08-18).

Welcome to StatMediaWiki 1.1. Web: http://statmediawiki.forja.rediris.es
---------------------------------------------------------------------------
Loaded 105 categories
Loaded 1070 images
Loaded 2186 pages
Loaded 20644 revisions
Loaded 334 users
Traceback (most recent call last):
  File "statmediawiki/trunk/smw.py", line 55, in <module>
    main()
  File "statmediawiki/trunk/smw.py", line 43, in main
    smwload.load()
  File "/export/home/schneide/statmediawiki/trunk/smwload.py", line 42, in load
    fillFullpagetitles()
  File "/export/home/schneide/statmediawiki/trunk/smwload.py", line 68, in fillFullpagetitles
    smwconfig.pages[page_id]["full_page_title"] = page_props["page_namespace"] == 0 and page_props["page_title"] or '%s:%s' % (smwconfig.namespaces[page_props["page_namespace"]], page_props["page_title"])
KeyError: 102

Workaround according to Emilio José Rodríguez Posada (emijrp)

You need to add your non-canonical namespaces to "line 159" in the smwload.py file. Look at the example below (new namespaces at end of line):

smwconfig.namespaces = {-2: "Media", -1: "Special", 0: "Main", 1: "Talk", 2: "User", 3: "User talk", 4: "Project", 5: "Project talk",
6: "File", 7: "File talk", 8: "MediaWiki", 9: "MediaWiki talk", 10: "Template", 11: "Template talk", 
12: "Help", 13: "Help talk", 14: "Category", 15: "Category talk", 102: "your namespace", 103: "other namespace"}

Of course, this list can get quite long and every single namespace must be declared.

    smwconfig.namespaces = {-2: "Media", -1: "Special", 0: "Main", 1: "Talk", 2: "User", 3: "User talk", 4: "Project", 5: "Project talk", 6: "File", 7: "File talk", 8: "MediaWiki", 9: "MediaWiki talk", 10: "Template", 11: "Template talk", 12: "Help", 13: "Help talk", 14: "Category", 15: "Category talk", 102: "STIC", 103: "STIC Discussion", 104: "Blog", 105: "Blog talk", 108: "Attribut", 109: "Discussion attribut", 112: "Formulaire", 113: "Discussion formulaire",  114: "Concept", 115: "Discussion concept", 118: "Filter", 119: "Filter talk", 420: "Layer", 421: "Layer talk", 710: "TimedText", 711: "TimedText talk", 6601: "JUNK", 6602:"REPORTING", 6603:"REPORTING talk"}
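
An alternative to declaring every namespace by hand would be to fall back to a generic label when a namespace number is unknown, so that the KeyError cannot occur. The following Python 3 sketch shows the idea on a standalone function; full_page_title and the "NS<number>" label are hypothetical and not part of StatMediaWiki, where the corresponding expression lives in fillFullpagetitles in smwload.py.

```python
def full_page_title(namespaces, page_namespace, page_title):
    """Build 'Namespace:Title', tolerating undeclared namespace numbers."""
    if page_namespace == 0:
        return page_title
    # dict.get avoids the KeyError and returns a placeholder instead.
    prefix = namespaces.get(page_namespace, "NS%d" % page_namespace)
    return "%s:%s" % (prefix, page_title)

namespaces = {1: "Talk", 2: "User", 14: "Category"}
print(full_page_title(namespaces, 14, "Wikis"))   # prints Category:Wikis
print(full_page_title(namespaces, 102, "Flash"))  # prints NS102:Flash
```

The reports would then show placeholder prefixes for undeclared namespaces instead of aborting.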

You can see all your namespaces in the Special:AllPages (see the pull-down menu) or LocalSettings.php. However, this won't help much if you don't know what numbers your namespaces have. Use the API for that. Example:

http://edutechwiki.unige.ch/fmediawiki/api.php?action=query&meta=siteinfo&siprop=namespaces
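
If &format=json is appended to such a URL, the answer can be turned into the dictionary needed in smwload.py. The sketch below (Python 3) parses an inline sample instead of fetching anything, so the sample data are illustrative only; the function name namespaces_from_siteinfo is hypothetical. It assumes the usual layout in which each namespace entry carries its local name under the "*" key (or under "name" in the newer format version).

```python
import json

# Shortened, illustrative sample of a siteinfo answer in JSON format.
SAMPLE_RESPONSE = """
{"query": {"namespaces": {
  "-1":  {"id": -1,  "*": "Special"},
  "0":   {"id": 0,   "*": ""},
  "1":   {"id": 1,   "*": "Talk"},
  "102": {"id": 102, "*": "STIC"}
}}}
"""

def namespaces_from_siteinfo(response_text):
    """Return {namespace_id: name} in the form used by smwconfig.namespaces."""
    data = json.loads(response_text)
    result = {}
    for entry in data["query"]["namespaces"].values():
        name = entry.get("*", entry.get("name", ""))
        # The main namespace has an empty name; smwload.py calls it "Main".
        result[entry["id"]] = name or "Main"
    return result

namespaces = namespaces_from_siteinfo(SAMPLE_RESPONSE)
print(namespaces)
```

The printed dictionary can be pasted into the smwconfig.namespaces line shown above.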

Orphan revisions handling

The script will cycle through the revision SQL table and break when a revision doesn't have a parent page. I don't know exactly why this could happen, but it did. You will get an error similar to the one above, i.e. "KeyError: ...".

Fix:

  • remove orphaned revisions with a maintenance script
cd your_path/maintenance
php deleteOrphanedRevisions.php 

I had to learn some python in order to figure this one out. Here is a modified getTotalRevisionsByNamespace function in file smwget.py

Original:

def getTotalRevisionsByNamespace(namespace=0):
    return len([rev_id for rev_id, rev_props in smwconfig.revisions.items() if smwconfig.pages[rev_props["rev_page"]]["page_namespace"] == namespace])

Debug version. After using it, put the old version back. I don't know Python. By the way, indentation is crucial: move the last statement to the right if you want to break it :)

def getTotalRevisionsByNamespace(namespace=0):
    # Debug version (Python 2): print every revision so that the one that breaks is visible
    print "namespace = " + repr(namespace)
    revisions = []
    for rev_id, rev_props in smwconfig.revisions.items():
        print "rev_id = " + repr(rev_id) + " rev_page = " + repr(rev_props["rev_page"])
        # A KeyError is raised here when rev_page has no entry in smwconfig.pages
        if smwconfig.pages[rev_props["rev_page"]]["page_namespace"] == namespace:
            revisions.append(rev_id)
    return len(revisions)

Instead of cleaning up the database, you could also insert a test in the function above. Anyhow, this debug version does not fix anything; it just prints out the revision numbers so that you can see at which one the code breaks.
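
Such a test could look like the following minimal sketch (Python 3). To keep it self-contained, the function takes the revisions and pages dictionaries as arguments instead of reading smwconfig.revisions and smwconfig.pages; the snake_case name and the sample data are hypothetical.

```python
def get_total_revisions_by_namespace(revisions, pages, namespace=0):
    """Count revisions in a namespace, skipping revisions without a page."""
    total = 0
    for rev_id, rev_props in revisions.items():
        page = pages.get(rev_props["rev_page"])
        if page is None:
            # Orphaned revision: its page is not in the pages dictionary.
            continue
        if page["page_namespace"] == namespace:
            total += 1
    return total

pages = {1: {"page_namespace": 0}, 2: {"page_namespace": 1}}
revisions = {
    10: {"rev_page": 1},
    11: {"rev_page": 1},
    12: {"rev_page": 2},
    13: {"rev_page": 99},  # orphan, would raise KeyError in the original
}
print(get_total_revisions_by_namespace(revisions, pages, 0))  # prints 2
```

Other functions that look up smwconfig.pages by rev_page would presumably need the same guard, which is why cleaning the database is the simpler route.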

Database prefix bug

If you are unlucky and decided to use database table prefixes, there will be another bug:

  File "/export/home/schneide/statmediawiki/trunk/smwanal.py", line 145, in generateTimeActivity
   cursor.execute("SELECT %s(rev_timestamp) AS time, COUNT(rev_id) AS count FROM %srevision, %spage WHERE rev_page=page_id and 
   rev_timestamp>='%s' and rev_timestamp<='%s' AND %s GROUP BY time ORDER BY time" % (timesplit, smwconfig.preferences["tablePrefix"], 
   smwconfig.preferences["tablePrefix"], smwconfig.preferences["startDateMW"], smwconfig.preferences["endDateMW"], cond))

No Fix:

  • The following won't work, or maybe it did and I got confused by the escaping single quotes bug below ...
  • Edit file smwanal.py around line 255
 conds.append("%s and rev_page in (select cl_from from %scategorylinks where cl_to='%s')" % (cond, smwconfig.preferences["tablePrefix"], category_props["category_title_"].encode(smwconfig.preferences['codification']))) # fix: beware of category names containing '

_mysql_exceptions.ProgrammingError: (1064, "You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'apprentissage') GROUP BY time ORDER BY time' at line 1")

Dumb fix doesn't work either (I should learn Python)

  • Hard code the table prefix
   conds.append("%s and rev_page in (select cl_from from mw_categorylinks where cl_to='%s')" % (cond, category_props["category_title_"].encode(smwconfig.preferences['codification']))) # fix: beware of category names containing '

Good fix :)

  • Get rid of the db prefix (it was a dumb decision some years ago as far as I am concerned). There is a Mediawiki maintenance script for this.
cd your_path/maintenance
php renameDbPrefix.php --old=mw_ --new=0

Escaping quotes bugs

Titles that include either single or double quotes will fail in two scripts (at least). Of course, one should use simple titles in a wiki, but try to teach this to education students (...)

Gnuplot

Gnuplot can't cope with the straight double or single quotes in titles that you would typically find in a French-speaking wiki.

gnuplot> set title "Accumulative work distribution in Proposition de quelques outils d'awareness de groupe "originaux" pour faciliter la 
collaboration et le travail/l'apprentissage collaboratif dans un contexte à distance"
                                                                                                          ^
        line 0: ';' expected
gnuplot> plot "/tmp/tmp4caTry.gnuplot/fifo" title "Edits in "L'oiseau et le cachot, naissance de l'éducation correctionnelle en suisse romande 1800-1913" (all users)" with boxes, "/tmp/tmpC60A5f.gnuplot/fifo" title "Edits in "L'oiseau et le cachot, naissance de l'éducation correctionnelle en suisse romande 1800-1913" (only anonymous users)" with boxes, "/tmp/tmpgnJsqX.gnuplot/fifo" title "Edits in "L'oiseau et le cachot, naissance de l'éducation correctionnelle en suisse romande 1800-1913" (only registered users)" with boxes
                                                                                                  ^
        line 0: invalid character 


gnuplot> set title "Accumulative work distribution in "L'oiseau et le cachot, naissance de l'éducation correctionnelle en suisse romande 1800-1913""

                                                                                            ^
        line 0: invalid character 


Stupid fix:

  • Edit smwload.py and add the two lines marked DKS below in two places: around line 96 and again around line 180.
        page_title = re.sub('_', ' ', unicode(row[2], smwconfig.preferences['codification']))
        page_title_ = re.sub(' ', '_', unicode(row[2], smwconfig.preferences['codification']))
        # DKS - change straight quotes within titles
        page_title  = page_title.replace("'", "\\'").replace("\"", "\\\"")
        page_title_ = page_title_.replace("'", "\\'").replace("\"", "\\\"")

Also do the same for cl_to and cl_to_ around line 80

    #DKS
        cl_to  = cl_to.replace("'", "\\'").replace("\"", "\\\"")
        cl_to_ = cl_to_.replace("'", "\\'").replace("\"", "\\\"")

SQL bug in smwanal

Same issue as above

  File "/export/home/schneide/statmediawiki/trunk/smwanal.py", line 145, in generateTimeActivity
    cursor.execute("SELECT %s(rev_timestamp) AS time, COUNT(rev_id) AS count FROM %srevision, %spage WHERE rev_page=page_id and rev_timestamp>='%s' and rev_timestamp<='%s' AND %s GROUP BY time ORDER BY time" % (timesplit, smwconfig.preferences["tablePrefix"], smwconfig.preferences["tablePrefix"], smwconfig.preferences["startDateMW"], smwconfig.preferences["endDateMW"], cond))
  File "/usr/lib/pymodules/python2.6/MySQLdb/cursors.py", line 166, in execute
    self.errorhandler(self, exc, value)
  File "/usr/lib/pymodules/python2.6/MySQLdb/connections.py", line 35, in defaulterrorhandler
    raise errorclass, errorvalue
_mysql_exceptions.ProgrammingError: (1064, "You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'apprentissage') GROUP BY time ORDER BY time' at line 1")

This error is due to quoting problems (something that French-speaking users have to suffer all the time in applications made by others ...). For debugging, I modified the code:

   for cond in conds:
        # todo: instead of an SQL query, use the revisions dict and datetime.datetime.dow, etc.
        sqlStatement = "SELECT %s(rev_timestamp) AS time, COUNT(rev_id) AS count FROM %srevision, %spage WHERE rev_page=page_id and rev_timestamp>='%s' and rev_timestamp<='%s' AND %s GROUP BY time ORDER BY time" % (timesplit, smwconfig.preferences["tablePrefix"], smwconfig.preferences["tablePrefix"], smwconfig.preferences["startDateMW"], smwconfig.preferences["endDateMW"], cond)
        print sqlStatement

Result:

SELECT hour(rev_timestamp) AS time, COUNT(rev_id) AS count FROM revision, page WHERE rev_page=page_id and rev_timestamp>='20110901000000' and rev_timestamp<='20120401000000' AND 1 and rev_page in (select cl_from from categorylinks where cl_to='Théories_d'apprentissage') GROUP BY time ORDER BY time

Fix:

  • See the Gnuplot bug above
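
A more robust way to avoid this class of SQL errors is to let the database driver do the quoting by passing values as query parameters instead of formatting them into the statement. The sketch below demonstrates this in Python 3 with the standard sqlite3 module as a stand-in, so that it runs without a MySQL server; sqlite3 uses ? as placeholder, whereas MySQLdb uses %s with the values given as second argument of cursor.execute. The table content is made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE categorylinks (cl_from INTEGER, cl_to TEXT)")

# A category title containing a single quote, as in the failing query above.
category = "Théories_d'apprentissage"
conn.execute("INSERT INTO categorylinks VALUES (?, ?)", (1, category))

# The value is passed separately, so the quote cannot break the SQL syntax.
rows = conn.execute(
    "SELECT cl_from FROM categorylinks WHERE cl_to = ?", (category,)
).fetchall()
print(rows)  # prints [(1,)]
```

Applying this to smwanal.py would require restructuring how the cond strings are built, since they are currently assembled as SQL text before the query is executed.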

Missing

  • Stop words. The list of words for each article is a good idea, but there should be stop words. E.g. Each time I edit I increase "external", because I use an external editor.
  • Collaboration diagrams

Links

Official
Other

Bibliography

  • Rodríguez-Posada, Emilio J.; Juan Manuel Dodero, Manuel Palomo-Duarte, Inmaculada Medina-Bulo (2011). Learning-Oriented Assessment of Wiki Contributions: How to Assess Wiki Contributions in a Higher Education Learning Setting. Proceedings of CSEDU 2011, 3rd International Conference on Computer Supported Education. Noordwijkerhout, The Netherlands.
  • See also: Publicaciones (list of publications, mostly in Spanish at StatMediaWiki)