- 1 Introduction
- 2 Learn who your spammers are and block whole domains
- 3 Fight mediawiki spamming
- 4 Close self registration
- 5 Clean up wiki spamming
- 6 Links
- 7 Bibliography
This article is mainly concerned with wiki spamming, which has been increasing over the years for two reasons:
- Its authors believe that inserting spam links improves Google rankings. This is actually not the case with a default installation, since all external links include a rel="nofollow" HTML attribute.
- Popular wiki pages may be spammed with "sneaky" links that will be followed by some readers. A typical example in this wiki is the frequent attempt to insert links to cheating services, i.e. web sites that offer to write student papers for a fee. Note to students about cheating services: do not use them, because quality is most often low, the paper doesn't match what your teacher expects from you, and the contents often include plagiarized sections. In other words: better turn in a bad assignment that you wrote the "night before". You'll get the same bad grade without having to pay for it, and you don't risk being punished...
Below we collect some essential strategies and links for MediaWiki administrators who manage somewhat closed wikis (i.e. only registered users can edit). If you manage an open wiki, you will likely need extra strategies that are described in various MediaWiki manual pages.
2 Learn who your spammers are and block whole domains
This strategy may allow you to block whole domains (e.g. in the httpd.conf file or at the system level). Even when you use a difficult user account procedure as in this wiki, wikis can be spammed manually (typically by underpaid third-world workers hired by a first-world company). Blocking out whole (sub)domains can help a bit...
If your MediaWiki is spammed, you will first have to either go through your web server logs (e.g. search for "submitlogin") or install an extension that shows the IP addresses of users. We recommend the latter strategy:
2.1 The CheckUser extension
- Allows you to figure out where spammers connect from and may help you decide whether you should block a whole IP range or ranges (e.g. a whole country). You can enter either user names or IP numbers. You can then both trace and block a user.
- Installed on EduTechWiki and also on some Wikipedia/Wikimedia sites.
Alternatively, dig through web server access logs and then consult one of these:
- Easywhois.com (alternative to whois.net)
- ipgp.net (find domain names for IP numbers)
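If you go the log-file route, the analysis can be sketched as a small shell pipeline. The sample log lines and file name below are assumptions (real logs usually live somewhere like /var/log/apache2/access.log; adapt the path to your installation):

```shell
# Count login submissions per IP address.
# Two sample lines are generated here so the pipeline can be tried anywhere;
# in practice you would point grep at your real access log.
printf '%s\n' \
  '203.177.10.5 - - [21/Apr/2011:15:02:00] "POST /index.php?title=Special:UserLogin&action=submitlogin HTTP/1.1" 200 -' \
  '203.177.10.5 - - [21/Apr/2011:15:03:00] "POST /index.php?title=Special:UserLogin&action=submitlogin HTTP/1.1" 200 -' \
  '192.0.2.44 - - [21/Apr/2011:15:04:00] "GET /index.php/Main_Page HTTP/1.1" 200 -' \
  > sample_access.log

# IP addresses sorted by number of "submitlogin" hits, busiest first
grep submitlogin sample_access.log | awk '{print $1}' | sort | uniq -c | sort -rn
```

The busiest addresses can then be looked up with whois or one of the sites above before deciding whether to block a range.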
2.2 Block a domain at the web server level
Edit an Apache configuration file, e.g. /etc/apache2/apache2.conf, and use patterns like the following for blocking out whole domains (many thousands of users) or sub-domains:
<Location "/">
  Deny from 192.168.87.
  Deny from 203.177.
  Deny from 180.190.
  ...
</Location>
If you use virtual hosts, you will have to edit those configuration files too, or else put the Deny rules in a shared file that each virtual host includes.
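A minimal sketch of the shared-file approach (the file name is hypothetical): put the <Location> block with its Deny rules into one file and pull it into each virtual host with Apache's Include directive:

```apacheconf
# Inside each <VirtualHost> block (or once in apache2.conf);
# /etc/apache2/banned-hosts.conf is an assumed file name
Include /etc/apache2/banned-hosts.conf
```

This way a new spammy range only has to be added in one place.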
If your spammers always show up from the same country, a last resort is to block everyone from that country (not very nice). Retrieve address lists for the country you want to block from a site like IPDeny.com.
Another alternative is to install a blacklist extension, as explained further down. We do both :)
2.3 More blocking
In addition, you could edit /etc/hosts.deny in order to block attempts to connect to your server via ssh and other services. But since your httpd is an independent daemon, banning hosts in /etc/hosts.deny won't ban them from your web server (remember that).
ALL: 192.168.
ALL: .example.com
You can also put several IP ranges on a single line like this:
ALL: 203.177., 222.127., 192.168.
If you don't have access to your server machine, you can also block IPs from within MediaWiki, but this is more resource intensive (PHP can't be as fast as the OS or the web server) and it will not protect other wikis that run on the same server from spamming. See Combating spam (Mediawiki.org).
Finally, to block access at the web server level, there also exist Apache modules (none tested) and firewall programs. Installing a firewall program is a good option if you have full access to the server.
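As a sketch, blocking at the firewall level with iptables could look like this (the ranges are the illustrative examples used above; you need root, and the rules must be persisted with your distribution's own mechanism):

```shell
# Drop all packets from two whole /16 ranges before they reach httpd.
# The addresses are illustrative; use lists from IPDeny for whole countries.
iptables -A INPUT -s 203.177.0.0/16 -j DROP
iptables -A INPUT -s 180.190.0.0/16 -j DROP
```

Firewall rules have the advantage of also protecting ssh and every other service on the machine at once.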
3 Fight mediawiki spamming
There exist several strategies to fight spamming:
3.1 Registered users
To fight spamming, only registered users should be able to edit (implemented in EduTechWiki).
Edit LocalSettings.php and change:
$wgGroupPermissions['*']['edit'] = false;
$wgGroupPermissions['*']['createaccount'] = true;
$wgGroupPermissions['*']['read'] = true;
- Light-weight user creation that requires some math
This can defeat some scripts.
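A minimal sketch with ConfirmEdit's default SimpleCaptcha class, which asks a small arithmetic question (check the file path against the ConfirmEdit version you install):

```php
require_once( "$IP/extensions/ConfirmEdit/ConfirmEdit.php" );
# SimpleCaptcha is ConfirmEdit's default class: it shows a simple sum to solve
$wgCaptchaClass = 'SimpleCaptcha';
# Only trigger it on account creation
$wgCaptchaTriggers['createaccount'] = true;
```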
3.2 Using captcha
Make account creation and (optionally) page editing more difficult with a captcha, i.e. users will have to type in a code that is generated by the wiki. This can defeat more scripts.
- Light-weight solution
- Making user creation even more difficult with recaptcha. This is implemented in EduTechWiki
- Also contributes to a digitization project....
- MediaWiki ReCaptcha extension at Google (I don't use this one anymore - Daniel K. Schneider 15:02, 21 April 2011 (CEST))
- Extension:ConfirmEdit. For MediaWiki 1.16.4 I use this one (ReCaptcha is included).
- Learn more about the project
In EduTechWiki we roughly use the following setup. However, at some point we may remove the captcha from page editing and install the revision system (see below) instead. Warning: depending on the exact version and sub-version you use, you will have to use a different configuration! Pay a lot of attention to the changes that happened over the last few months - 15:02, 21 April 2011 (CEST).
The following works with MediaWiki 1.16.4:
# Anti Spam ConfirmEdit/ReCaptcha for MediaWiki 1.16.4
# http://wiki.recaptcha.net/index.php/Main_Page
require_once( "$IP/extensions/ConfirmEdit/ConfirmEdit.php" );
require_once( "$IP/extensions/ConfirmEdit/ReCaptcha.php" );
$wgCaptchaClass = 'ReCaptcha';
$recaptcha_public_key = '................';
$recaptcha_private_key = '................';
# Users must be registered; once they are in, they still must fill in captchas (at least over the summer)
$wgCaptchaTriggers['edit'] = true;
$wgCaptchaTriggers['addurl'] = false;
$wgCaptchaTriggers['create'] = true;
$wgCaptchaTriggers['createaccount'] = true;
The following works with MediaWiki 1.17.x (parameter names have changed!):
# ReCaptcha
# See the docs in extensions/ConfirmEdit/ConfirmEdit.php
# http://wiki.recaptcha.net/index.php/Main_Page
require_once( "$IP/extensions/ConfirmEdit/ConfirmEdit.php" );
require_once( "$IP/extensions/ConfirmEdit/ReCaptcha.php" );
$wgCaptchaClass = 'ReCaptcha';
$wgReCaptchaPublicKey = '...';
$wgReCaptchaPrivateKey = '....';
# Our users must be registered; once they are in, they still must fill in captchas.
$wgCaptchaTriggers['edit'] = true;
$wgCaptchaTriggers['addurl'] = false;
$wgCaptchaTriggers['create'] = true;
$wgCaptchaTriggers['createaccount'] = true;
# Skip ReCaptcha for confirmed authors (they will inherit normal user rights)
$wgGroupPermissions['authors']['skipcaptcha'] = true;
According to Wikipedia, “CAPTCHA is vulnerable to a relay attack that uses humans to solve the puzzles. One approach involves relaying the puzzles to a group of human operators who can solve CAPTCHAs. In this scheme, a computer fills out a form and when it reaches a CAPTCHA, it gives the CAPTCHA to the human operator to solve. Spammers pay about $0.80 to $1.20 for each 1,000 solved CAPTCHAs to companies employing human solvers in Bangladesh, China, India, and many other developing nations”.
This is why each edit requires a normal user to solve a captcha. Of course, for authors we know, we can remove this requirement. Technically speaking, we put them into an "authors" user group that has skipcaptcha = true:
$wgGroupPermissions['authors']['skipcaptcha'] = true;
To make this work, add the user to this group using user rights management in the special pages. You don't need to create the group; it is created implicitly when you assign a permission in the $wgGroupPermissions array in the LocalSettings.php file.
In addition, you may create an unfriendly message in MediaWiki:Recaptcha-createaccount. It probably won't deter many spammers since - again - these are most often dramatically underpaid people who have to accept that kind of job by necessity.
3.3 Filtering edits and page names
Prevent creation of pages with bad words in the title and/or the text.
- The built-in $wgSpamRegex variable
MediaWiki includes a $wgSpamRegex variable. The goal is to prevent three things: (a) bad words, (b) links to bad web sites and (c) CSS tricks to hide content.
Insert in LocalSettings.php something like:
$wgSpamRegex = "/badword1|badword2|abcdefghi-website\.com|display_remove_:none|overflow_remove_:\s*auto;\s*height:\s*[0-4]px;/i";
I will not show ours here since I can't include it in this page ;)
Don't forget to edit MediaWiki:Spamprotectiontext, the message shown when an edit is blocked.
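To see what such a pattern catches, you can try candidate edit texts against an equivalent regular expression with grep -E. The pattern below is a hypothetical stand-in for a real blacklist (with the _remove_ markers taken out):

```shell
# A toy spam pattern in POSIX extended regex form
pattern='badword1|badword2|abcdefghi-website\.com|display:none'

# A spammy edit matches (grep prints the matching line) ...
printf '%s\n' 'Buy essays at abcdefghi-website.com today' | grep -E -i "$pattern"

# ... while an innocent edit does not (grep prints nothing)
printf '%s\n' 'Wiki spam is a growing problem' | grep -E -i "$pattern" || true
```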
3.4 Spam blacklists extension
The SpamBlacklist extension prevents edits that contain URL hosts that match regular expression patterns defined in specified files or wiki pages.
- The default source for SpamBlacklist's list of forbidden URLs is the Wikimedia spam blacklist on Meta-Wiki, at http://meta.wikimedia.org/wiki/Spam_blacklist. By default, the extension uses this list, and reloads it once every 10-15 minutes. For many wikis, using this list will be enough to block most spamming attempts.
- In addition, you can add your own files and also use the local pages MediaWiki:Spam-blacklist and MediaWiki:Spam-whitelist. However, make sure to protect these pages from editing.
- Saving pages may incur a performance hit, and you may have to install a bytecode cache.
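A sketch of the corresponding LocalSettings.php entries (the local file path is hypothetical, and variable handling differs between SpamBlacklist versions, so check the extension's own documentation):

```php
require_once( "$IP/extensions/SpamBlacklist/SpamBlacklist.php" );
# Keep the default Wikimedia list and add a local pattern file of our own
# (one regex fragment per line; the path is an assumption)
$wgSpamBlacklistFiles[] = "$IP/extensions/SpamBlacklist/local-blacklist.txt";
```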
3.5 rel = "nofollow"
Wiki spammers aim at two things:
- Insert well-placed links in articles that deal somewhat with the spam content's subject area, so that people will actually see and follow them (same principle as Google ads). This requires understanding the article's content. Since most paid wiki spammers are poorly trained workers from poor, non-English-speaking countries, this strategy most often fails.
- Get a better Google ranking. This second purpose will not work in this wiki, since under the default configuration, MediaWiki adds rel="nofollow" to external links in wiki pages, to indicate that these are user-supplied, might contain spam, and should therefore not be used to influence page ranking algorithms. Popular search engines such as Google honour this attribute (Manual:Combating spam). Most wiki spammers abusing EduTechWiki are too stupid to know about this (some labour really must come cheap ....)
To some companies, wiki spamming may seem to be a good strategy, but most often it is not...
3.6 Flagged Revisions
Article validation allows Editor and Reviewer users to rate revisions of articles and set those revisions as the default revision to show upon normal page view. These revisions will remain the same even if included templates are changed or images are overwritten. This allows MediaWiki to act more like a Content Management System (CMS). I will probably install this some time soon - Daniel K. Schneider 11:32, 30 July 2010 (UTC).
- Install Extension:FlaggedRevs
The description above is quoted from the FlaggedRevs help page (retrieved July 31, 2010).
3.7 Block Proxy servers
Wiki spammers often use open proxies to cover their tracks. Blocking the ISP doesn't help very much since they will just switch to another proxy.
Proxy blocking, however, may affect many legitimate users, since some organizations use proxies for valid reasons (e.g. our own university uses an output cache to improve internal network traffic). However, as far as we understand, open proxies can be configured in different ways; e.g. they should at least include an XFF header that shows the origin of the sender.
- Extension:AutoProxyBlock, a recent extension as of Jan 2012.
- Extension:RudeProxyBlock blocks all the open proxies that Wikipedia had blocked.
Currently (Jan 2012) we are testing AutoProxyBlock with:
$wgProxyCanPerform = array('read');
$wgTagProxyActions = true;
$wgAutoProxyBlockLog = true;
$wgAutoProxyBlockSources['api'] = 'http://en.wikipedia.org/w/api.php';
Read about Wikipedia's proxy blocking (Meta). As of Jan 2012, we are not sure that this article reflects current proxy blocking strategies.
4 Close self registration
This is a measure you may have to take when your wiki gets really popular. In October 2011, EduTechWiki had about one fishy account creation per day and about two spam edits per week. This was still manageable ...
By February 2012, two new spam pages per day were being created, despite heavy anti-spam measures (all of the above).
Therefore I installed the ConfirmAccount extension. It requires:
- Biographic information
- E-mail confirmation
- Approval of the account by an administrator
In order to fight lying spammers, captchas and other anti-spam measures remain in place for new users. I may relax that in the future.
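A sketch of the ConfirmAccount setup in LocalSettings.php (check the setting names against the version you install; $wgMakeUserPageFromBio is one of the extension's documented options):

```php
require_once( "$IP/extensions/ConfirmAccount/ConfirmAccount.php" );
# Turn the biography supplied with the account request into the new user's page
$wgMakeUserPageFromBio = true;
```

Account requests then show up on a special page, where an administrator approves or rejects them.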
5 Clean up wiki spamming
5.1 Block user and revert changes
5.2 Mass deletion and manipulation scripts
There are several command-line scripts that allow for some simple surgery (but read the code/comments at the top before you use them!):
- Mass deletion and other scripts, e.g. deleteBatch.php and deleteRevision.php.
- Nuke is an extension that makes it possible for sysops to mass delete pages. If you have command-line access you can also use deleteBatch.php.
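For example, assuming a file spam-pages.txt with one page title per line, a cleanup with deleteBatch.php could look like this (the user name and file name are hypothetical; run it from your wiki root):

```shell
# Delete every page listed in spam-pages.txt, attributing the deletions
# to the WikiSysop account and logging a reason in the deletion log
php maintenance/deleteBatch.php -u WikiSysop -r "spam cleanup" spam-pages.txt
```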
5.3 Hide revisions with inappropriate content
Core MediaWiki has a feature (disabled by default) that adds a special page called Special:RevisionDelete. Deleted revisions and events will still appear in the page history and logs, but parts of their content will be inaccessible to the public. This is useful if you believe that some wiki spammers will link to older revisions in your wiki. Enable it for sysops with:
$wgGroupPermissions['sysop']['deleterevision'] = true;
In addition, you may rename bad user names. This is not often needed for fighting spam IMHO, but it can be useful, for example, to rename inappropriate logins created by students.
6 Links
- Spam (electronic) (Wikipedia)
- Spamdexing (Wikipedia)
- Six Apart Guide to Comment Spam (good reading for web log owners)
- Fight Comment Spam, Ban IP's, a large list of banned IP addresses by Chieh Cheng (others exist).
- Report of the OECD Task Force on Spam: Anti-Spam Toolkit of Recommended Policies and Measures (2006), PDF.
- Wikis: The Next Frontier for Spammers? (Netcraft, 2004).
6.2 Legal issues and official policy
- Wiki spamming is worse than e-mail spamming, because it relates to vandalism and therefore additional laws can apply.
- Official EU and OECD websites are often unstable (link decay); e.g. the www.oecd-antispam.org official website, which is linked to from many places, is dead ...
- USA (main direct or indirect source of spamming)
- Spam Laws: The United States CAN-SPAM Act
- The CAN-SPAM Act: A Compliance Guide for Business
- CAN-SPAM Act of 2003 (Wikipedia)
- The European Coalition Against Unsolicited Commercial Email (EuroCAUCE).
- Unsolicited communications - Fighting Spam (EU Information Society portal, retrieved 11:07, 16 July 2010 (UTC)).
- Communication from the Commission to the European Parliament, the Council, the European Economic and Social Committee and the Committee of the Regions on fighting spam, spyware and malicious software, retrieved 11:07, 16 July 2010 (UTC).
- Spam Law Summary (Scotch Spam)
- Privacy and Electronic Communications (EC Directive) Regulations 2003 (information commissioner's office)
6.3 General wiki spamming
- Examples from content guidelines - what is spam ?
- Wiki Spam (from the original wiki)
There are several MediaWiki pages dealing with spam. As of July 2010 they are not all up-to-date and coordinated.
- Combating spam (Mediawiki Manual)
- Combating vandalism (Mediawiki Manual)
- Anti spam features (Mediawiki)
- Spam Filter (a development page of MediaWiki; it includes extra information, e.g. cleanup scripts).
- Help:Spam (Wikia). Wikia is a commercial wiki hosting service with many user-managed subwikis that have their own aims and content policies.
- Proxy blocking (Meta)
6.6 Lists of bad words
7 Bibliography
- West, Andrew G., Sampath Kannan and Insup Lee (2010). Detecting Wikipedia Vandalism via Spatio-Temporal Analysis of Revision Metadata, Department of Computer & Information Science, Technical Reports (CIS), University of Pennsylvania. PDF. See also: STiki (Spatio-Temporal analysis over Wikipedia).
- Potthast, Martin; Benno Stein and Robert Gerling (2008). Automatic Vandalism Detection in Wikipedia. In Craig Macdonald, Iadh Ounis, Vassilis Plachouras, Ian Ruthven, and Ryen W. White, editors, Advances in Information Retrieval: Proceedings of the 30th European Conference on IR Research (ECIR 2008), Glasgow, UK, volume 4956 of Lecture Notes in Computer Science, pages 663-668. Springer. ISBN 978-3-540-78645-0. DOI:10.1007/978-3-540-78646-7_75, PDF Reprint.
- Potthast, Martin; Benno Stein, and Teresa Holfeld (2010). Overview of the 1st International Competition on Wikipedia Vandalism Detection, in Martin Braschler and Donna Harman (Eds.): Notebook Papers of CLEF 2010 LABs and Workshops, 22-23 September, Padua, Italy. ISBN 978-88-904810-0-0. 2010, PDF Reprint.
- See also other wiki-related papers in the publication list of Uni Weimar's Media Systems / Web Technology & Information Systems group.