RapidMiner Studio: Difference between revisions

The educational technology and digital learning wiki
(82 intermediate revisions by 2 users not shown)
Line 8: Line 8:
|field_plugin_of=
|field_plugin_of=
|field_language=
|field_language=
|field_license_type=Free&Open source
|field_license_type=Commercial&Open source
|field_free_software_licence=Affero General Public License version 1
|field_free_software_licence=
|field_last_release=2014/02/26
|field_last_release=2014/02/26
|field_last_version=5.3.015
|field_last_version=5.3.015
Line 16: Line 16:
In a few words, RapidMiner Studio is a "downloadable GUI for machine learning, data mining, text mining, predictive analytics and business analytics". It can also be used (for most purposes) in batch mode (command line mode).
In a few words, RapidMiner Studio is a "downloadable GUI for machine learning, data mining, text mining, predictive analytics and business analytics". It can also be used (for most purposes) in batch mode (command line mode).


[[User:Camacab0|Camacab0]] ([[User talk:Camacab0|talk]]) 15:54, 10 November 2014 (CET)
[[User:Camacab0|Camacab0]] ([[User talk:Camacab0|talk]])
|field_analysis_orientation=General analysis
|field_analysis_orientation=General analysis
|field_data_analysis_objective=
|field_data_analysis_objective=
Line 28: Line 28:
|field_data_transformation_capabilities=Simple data format conversion, Simple data transformation operations, Advanced data transformation operations, Mathematical transformation of data for analysis
|field_data_transformation_capabilities=Simple data format conversion, Simple data transformation operations, Advanced data transformation operations, Mathematical transformation of data for analysis
|field_analysis_type=Basic statistics and data summarization, Data mining methods and algorithms
|field_analysis_type=Basic statistics and data summarization, Data mining methods and algorithms
|field_visualisation_type=Sequential Graphic, Chart/Diagram
|field_visualisation_type=Sequential Graphic, Chart/Diagram, Tag Cloud
|field_tool_usability=rather easy to use
|field_tool_usability=rather easy to use
|field_end_user_type=Students/Learners/Consumers, Teachers/Tutors/Managers, Developers/Designers, Researchers, Organisations/Institutions/Firms, Others
|field_end_user_type=Students/Learners/Consumers, Teachers/Tutors/Managers, Developers/Designers, Researchers, Organisations/Institutions/Firms, Others
Line 35: Line 35:
|field_system_engineering_level=N/A
|field_system_engineering_level=N/A
|field_data_mining_models_level=Medium
|field_data_mining_models_level=Medium
|field_completion_level=Medium
|field_completion_level=High
|field_last_edition=2014/11/09
|field_last_edition=2014/11/10
}}
}}
{{stub}}
{{stub}}
Line 57: Line 57:
First of all, it is important to say that RapidMiner Studio - and RapidMiner Server, which works with it - form a complete set of tools rather than a single-purpose piece of software. The [https://rapidminer.com/ RapidMiner website] says that "RapidMiner lets you easily sort through and run more than 1500 operations".
First of all, it is important to say that RapidMiner Studio - and RapidMiner Server, which works with it - form a complete set of tools rather than a single-purpose piece of software. The [https://rapidminer.com/ RapidMiner website] says that "RapidMiner lets you easily sort through and run more than 1500 operations".


Because of its complexity, I will only describe some of RapidMiner Studio's functions. However, I will show below an example of using RapidMiner Studio as a basic text miner. RapidMiner Studio's highlights are :
Because of its complexity, I will only describe some of RapidMiner Studio's functions. However, I will show below an example of using RapidMiner Studio as a basic text miner. Then, I will show you how to use RapidMiner to extract, transform and analyze tweets.
 
RapidMiner Studio's highlights are :


* A visual - code-free - environment, so no programming needed
* A visual - code-free - environment, so no programming needed
Line 73: Line 75:
* RapidMiner allows you to work with different types and sizes of data sources
* RapidMiner allows you to work with different types and sizes of data sources


= Use example : text mining =
= Use examples =
 
As we can do almost anything with RapidMiner Studio, I chose to explore two different activities that can help you later build a text-mining and analysis project.
First, I will show you how to use RapidMiner as a basic text-mining tool. We will see how to extract, transform and analyze text from files on your computer.
Secondly, I will explain how you can analyze tweets for free with RapidMiner Studio and a third-party website for the Twitter extraction (which is otherwise a premium feature of RapidMiner Studio).
 
== Basic text mining ==


As described before, RapidMiner can be used as text mining software. I will describe here an example of a text mining process, where we will :
As described before, RapidMiner can be used as text mining software. I will describe here an example of a text mining process, where we will :
Line 79: Line 87:
# Ignore some words that are not wanted (stoplist)
# Ignore some words that are not wanted (stoplist)
# Generate the results
# Generate the results
# View results in two different ways
# View results
 
=== Launch RapidMiner Studio and load data ===
 
[[File:RapidMiner_Studio_Tutorial1_B.PNG|100px|thumb|left|Fig.1 : Text Mining Extension]]
[[File:RapidMiner_Studio_Tutorial1_C.PNG|300px|thumb|right|Fig.2 : The workspace]]
 
Once you have launched RapidMiner Studio (v. 6.1.1000), you will need to install the Text Mining extension. RapidMiner works with extensions that plug into the core system.
The Text Mining extension can be found in the RapidMiner Marketplace, which can be accessed from Help > Updates and Extensions (Marketplace), as shown in Fig. 1.
 
After restarting the software, we can start working with it. First of all, '''create a New Process'''. You will now see the main window of RapidMiner Studio, and I will briefly describe the main zones of the workspace :
* In '''blue''' we have the main toolbar
* In '''orange''' we can see all the operators that we can use in our processes
* In '''green''' we have the repositories
* In '''purple''' we have the main process window, where we will be able to see process results and progress
* In '''black''' we have the parameters of each element of our process, and the help
 
From here, we will first find the operator '''Process Documents from Files''' [http://edutechwiki.unige.ch/mediawiki/images/f/f0/RapidMiner_Studio_Tutorial1_D.PNG (screenshot here)] and drag it into the '''Process''' zone, in the center. At this point we have our operator in our process, and we need to set its '''parameters'''. Click on the operator in the main process area, and see which parameters you can set on the right side. The first parameter is '''text directories''', which we will set right away.
 
[[File:RapidMiner_Studio_Tutorial1_E.PNG|300px|thumb|left|Fig.3 : Text Directories]]
 
Note : On the right side of your toolbar you can see a four-element menu that allows you to switch between '''Design''' and '''Results''' (also with F8 and F9 keys) that will be very useful. If your results aren't what you were expecting, or you made a mistake when designing your process, you can easily return from the results to the design area.
 
* In my case, I have a directory on my Desktop named "data"
* In /data/, I have /litterature/text1.txt and /photographie/text2.txt
* I will set up my text directories as suggested in Fig. 3 and give each a different name, to be able to show results by text directory
 
In the next section we will talk about operators, and we will come back to the '''Process Documents from Files parameters''' to choose which vector we want RapidMiner to create.
 
=== Tokenize & define StopWords ===
 
Now that we have our Process Documents from Files operator in our '''Main Process area''' and our text directories set up correctly, we need to connect our operator '''Process Documents from Files''' on the left (from inp to wor) and on the right (from exa to res, and wor to res). This will allow the data to be processed.
 
[[File:RapidMiner_Studio_Tutorial1_F.PNG|300px|thumb|right|Fig. 5 : Tokenize and Stopwords operators]]
 
We will now define which steps (or processes) should be executed inside our '''Process Documents from Files''' operator. By double-clicking on it, we can look inside it. We will now add a '''Tokenize''' operator, which can be found in the operators area (in Tokenization) on the left. Tokenize separates the words, making them independent values. One of RapidMiner's big strengths is its graphical user interface, which allows you to build processes quite naturally. We will also add '''Filter Stopwords (french)''' - because my text files are in French - to our main '''Process Documents from Files''' operator, also by dragging it. You should see something like Fig. 5 above.
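
To make clearer what these two operators do, here is a minimal Python sketch of the same idea. It is only an illustration and not RapidMiner's implementation: the regular expression and the tiny French stop list are assumptions made for this example (RapidMiner ships its own, much larger list).

```python
import re

# Hypothetical, very small French stop list, used only for this example.
FRENCH_STOPWORDS = {"le", "la", "les", "de", "des", "du", "un", "une", "et", "est"}

def tokenize(text):
    """Split text into word tokens (runs of letters, accented letters included)."""
    return re.findall(r"[^\W\d_]+", text)

def filter_stopwords(tokens, stopwords=FRENCH_STOPWORDS):
    """Drop every token whose lowercase form appears in the stop list."""
    return [t for t in tokens if t.lower() not in stopwords]

tokens = filter_stopwords(tokenize("La photographie est un art et une technique."))
print(tokens)  # ['photographie', 'art', 'technique']
```

The order matters in the same way as in the RapidMiner process: the text must be cut into tokens first, otherwise there is nothing to compare against the stop list.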
 
=== View result ===
 
If your main operator is connected (input - output) and, inside it, your Tokenize operator and your Stopwords operator are also connected to each other and to the input and output, as shown in the figure above, you should be ready to launch the process, which will generate your results.
 
Before clicking on the [http://edutechwiki.unige.ch/en/File:RapidMiner_Studio_Tutorial1_G.PNG launch button], I want to point out that we didn't change the '''Vector Creation''' parameter of our '''Process Documents from Files'''. That parameter sets how RapidMiner computes the word vector (the weight given to each word in each document) from the data given.
 
If you launch the process leaving the default value (TF-IDF), RapidMiner will present the results to you in different ways. First you have two tabs, '''WordList''' and '''ExampleSet'''.
 
Note : TF-IDF is "short for term frequency–inverse document frequency", which is "a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus." [http://fr.wikipedia.org/wiki/TF-IDF Wikipedia]
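
As an illustration of the statistic, here is a small Python sketch of one common TF-IDF formulation (tf = count / document length, idf = log(N / document frequency)). This formula is an assumption chosen for the example; RapidMiner may weight and normalise the values differently.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: weight} dict per document (a list of token lists).

    Uses tf = count / document length and idf = log(N / document frequency).
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))  # each document counts once per term
    weights = []
    for tokens in documents:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

docs = [["roman", "auteur", "texte"], ["photo", "image", "texte"]]
for doc_weights in tf_idf(docs):
    print(doc_weights)
```

In this toy corpus "texte" appears in every document, so its weight is 0: a word found everywhere does not help to tell the documents apart.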
 
==== WordList View ====
 
[[File:RapidMiner_Studio_Tutorial1_H.PNG|300px|thumb|right|Fig. 6 : WordList View]]
In the WordList view tab (Fig. 6) we have an occurrence analysis.
* The first column presents the words found in the documents
* The second column presents the attributes of the words (which in my case are equal to the words themselves)
* The third column shows '''Total Occurences''' (how many times we can find the word in all documents)
* The fourth column shows '''Document Occurences''' (in how many documents we can find the word)
* The fifth and sixth columns show '''Text Directory Occurences''' (how many times we can find the word in each text directory)
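
The three kinds of counts in this table can be reproduced with a short Python sketch. The function name, the data layout and the sample tokens are all hypothetical; the point is only to show how total, per-document and per-directory occurrences differ.

```python
from collections import Counter

def word_list(documents):
    """documents: list of (directory_label, tokens) pairs.

    Returns {word: {"total": int, "documents": int, "by_label": Counter}}.
    """
    table = {}
    for label, tokens in documents:
        counts = Counter(tokens)
        for word, count in counts.items():
            row = table.setdefault(
                word, {"total": 0, "documents": 0, "by_label": Counter()}
            )
            row["total"] += count            # every occurrence counts
            row["documents"] += 1            # one per document containing the word
            row["by_label"][label] += count  # occurrences per text directory
    return table

docs = [
    ("litterature", ["texte", "roman", "texte"]),
    ("photographie", ["texte", "image"]),
]
print(word_list(docs)["texte"])
```

Here "texte" has 3 total occurrences but only 2 document occurrences, because it appears twice in the first document.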
 
==== ExampleSet View ====
 
[[File:RapidMiner_Studio_Tutorial1_I.PNG|300px|thumb|right|Fig. 7 : ExampleSet View]]
 
[[File:RapidMiner_Studio_Tutorial1_J.PNG|150px|thumb|left|Fig. 8 : Charts view types]]
 
In the ExampleSet view tab (Fig. 7) we have a left menu with five tabs. I will try to present them :
* The first tab is an '''Overview''' of the process. There we can find each '''text directory''', each '''document processed''' and some other information.
* The second tab is '''Statistics''', and allows you to see statistics about the data given.
* The third and fourth tabs are '''Charts''' and '''Advanced Charts'''. They allow the user to visualize some default charts, or to build advanced and customized charts fed by the data and analysis results of our text mining process.
* The fifth tab allows the user to save annotations into the process.
 
Note : Fig. 8 shows you some of the charts view types that RapidMiner proposes.
 
=== Export results ===
 
When it comes to exporting results from RapidMiner Studio, each extension and each RapidMiner Studio function offers different options.
For example, after a text mining process, the data will be available in different forms :
* The WordList view allows you to export the spreadsheet as an image (png, svg, jpg, eps, pdf), to print it, or even to copy/paste the spreadsheet data into Microsoft Excel or Google Drive Spreadsheets.
* The ExampleSet view also allows the user to copy/paste the data from the software, to print it or to export it as an image.
 
Note : The export as an image function seems to export the whole main area of the software (in the center), and does not seem to allow exporting an individual image.
 
== Tweets mining and analysis ==
 
=== Introduction ===
 
RapidMiner Studio allows you to extract, transform and analyse data from A to Z with its core functionalities and free plugins. Unfortunately, some Cloud extensions and functionalities are premium, and pricey. I will explain here how you can extract and analyse tweets using only the free version of RapidMiner Studio and a third-party service for the tweet extraction.
 
=== Tweets extraction ===
 
[[File:TweetsExtractionWithRapidminer-Figure1.png|thumbnail|right|Zapier's GoogleDrive and Twitter connection]]
[[File:TweetsExtractionWithRapidminer-Figure2.png|thumbnail|right|Twitter search parameters on Zapier]]
 
First of all you need to get the data that you want to feed into RapidMiner. In our case, we need the tweets that we want to process. As said before, some third-party services allow you to extract tweets automatically from Twitter : I will present [https://zapier.com Zapier], which "''connects the web apps you use to easily move your data and automate tedious tasks''". A zap is a connection between two services that you can set up to automate tasks.
 
For our task, I connected Twitter and Google Drive, and specified that I want Zapier to look for a hashtag (#edtech) and to save each tweet containing that value in a new text file, in a Google Drive folder.
Once you have a sufficient number of tweets, you can save your Google Drive folder to a local directory on your computer, which you will then specify to RapidMiner. I got nearly 8,000 tweets in a few days. Your data is now ready to be used with RapidMiner.
 
=== Data transformation ===
 
Once all our tweets are in a directory on the computer, we can proceed with RapidMiner. We need to build a process that takes our directory as input and outputs data that can be analysed and visualised. The figure below shows my three processes, which I will explain below.
 
[[File:TweetsProcessingWithRapidminer-Figure3.png|thumbnail|left|RapidMiner process, containing the three sub-processes]]
 
Let's first focus on the orange process, the '''Tweets processing''' :
* First, the "module" Process documents from files, named '''Process tweets''', allows us, like in the previous tutorial, to specify a directory where the text files are. We need to specify, inside this "module", which actions will be triggered.
** As we want to extract only the hashtags from all tweets, we need to tell RapidMiner to first '''Tokenize''' all words (by cutting them where the white spaces are).
** Then we need it to '''filter the Tokens''' created, to keep only the hashtags. That is done with a regular expression that selects only words starting with the # symbol and followed by letters or numbers.
* When the Process Tweets "module" is finished, it outputs a WordList that can be converted to an ExampleSet by the '''Tweets->Data''' "module". That allows us to treat these words as data and to use them later.
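
The two steps above (cut on white space, then keep only tokens that match a hashtag pattern) can be sketched in Python as follows. The exact regular expression used in the RapidMiner process is not given in this page, so the pattern below is an assumption that follows the description: a # symbol followed only by letters or numbers.

```python
import re
from collections import Counter

# Assumed pattern: '#' followed by letters or digits, and nothing else.
HASHTAG = re.compile(r"#[^\W_]+")

def hashtag_document_counts(tweets):
    """Count, for each hashtag, the number of tweets that contain it."""
    counts = Counter()
    for tweet in tweets:
        tokens = tweet.split()  # cut on white space, like the Tokenize step
        hashtags = {t for t in tokens if HASHTAG.fullmatch(t)}
        counts.update(hashtags)  # a set, so each tweet counts once per hashtag
    return counts

tweets = [
    "Great #edtech ideas for #edchat tonight",
    "#edtech #edtech tools",
    "No tag here, but #ipaded!",
]
print(hashtag_document_counts(tweets))
```

Note that with such a strict filter a token like "#ipaded!" is dropped, because the trailing punctuation is still attached to it after cutting on white space. The counts correspond to the "in documents" value: #edtech is counted once for the second tweet even though it appears twice.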
 
If we look more closely at the '''URL processing''', it is built just like the Tweets processing.
* We have a Process documents from files "module", named '''Process URL's''', that loops over all the files in the directory and executes two operations on each of them :
** '''Tokenize''', explained before.
** '''Filter tokens''', which this time keeps only the tweeted links. We use a regular expression to keep only "words" starting with "http://".
* Finally we convert this WordList into an ExampleSet again, to be able to connect it to the Result output point.
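
The link filter can be sketched the same way in Python. The function name and the `prefixes` parameter are hypothetical; the default follows the text ("http://" only), and passing "https://" as well is an optional extension, not something the original process did.

```python
def extract_links(tweet, prefixes=("http://",)):
    """Return the white-space separated tokens that start with one of the prefixes."""
    return [token for token in tweet.split() if token.startswith(prefixes)]

tweet = "Read http://t.co/abc and https://t.co/xyz now"
print(extract_links(tweet))                                     # ['http://t.co/abc']
print(extract_links(tweet, prefixes=("http://", "https://")))   # both links
```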
 
=== Data analysis ===
 
[[File:TweetsAnalysisWithRapidminer-Figure1.png|thumbnail|right|Figure 1 - Hashtags (sorted by "in documents" count, and alphabetically)]]
 
Once the process shown before is complete and valid, you can run it to see whether the output data is what you expected. My process gives me three ExampleSets, as I had three output points connected. I will now present two of these ExampleSets and then talk about the third one, the Read Excel process.
 
[[File:TweetsAnalysisWithRapidminer-Figure2.svg|thumbnail|right|Figure 2 - Most represented hashtags in a graph (equivalent hashtags and #edchat filtered)]]
 
'''My first process''' aimed to show which hashtags were most often combined with the #edtech hashtag. The "Tweets->DATA" ExampleSet shows us that. You can see it in a data view (table), which can be sorted, and in other ways such as charts.
* Figure 1 shows the data view, where we can see all hashtags and the number of documents (tweets) in which they appeared.
* Figure 2 shows a graph of the most represented hashtags.
 
'''My last process''', Read Excel, is the easiest way I found to filter tokens depending on the "In documents" value. I needed to filter my final data for two reasons: because I had not normalised letter case, variants such as #EdTech, #edTech and #Edtech all appeared among the most used hashtags, and the graph was unreadable because of the huge number of different hashtags. I looked for a way to do this inside RapidMiner and tried several approaches, but did not manage it. Instead, I exported the data resulting from my "Tweets->Data" process to a Microsoft Excel file. I then deleted all unwanted lines (equivalent hashtags and less represented hashtags) to keep only the most used hashtags. Finally, I created a process in RapidMiner that reads that file and outputs its data, which gives me filtered data that can be displayed.
* The Figure 2 graph is the result of the Read Excel process. It contains only the most used hashtags, with the "equivalent" hashtags filtered out. Note that the most used hashtag (#edchat) has also been removed so that the other hashtags are easier to see.
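
For comparison, the manual clean-up done in Excel can be sketched in Python. This is not what was done in the process above, only an illustration of the same filtering; the function names and the counts in the usage example are invented.

```python
from collections import Counter

def merge_case_variants(counts):
    """Sum the counts of hashtags that differ only by letter case."""
    merged = Counter()
    for tag, n in counts.items():
        merged[tag.lower()] += n
    return merged

def keep_frequent(counts, min_documents, exclude=()):
    """Keep hashtags seen in at least `min_documents` tweets, minus `exclude`."""
    return {tag: n for tag, n in counts.items()
            if n >= min_documents and tag not in exclude}

# Invented counts, only to show the calls.
raw = {"#EdTech": 5, "#edtech": 7, "#edchat": 20, "#rare": 1}
merged = merge_case_variants(raw)
print(keep_frequent(merged, min_documents=5, exclude={"#edchat"}))  # {'#edtech': 12}
```

One caveat: summing "in documents" values across case variants counts a tweet twice if it contains two variants of the same hashtag, so lowercasing the tokens before counting would be more accurate.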
 
Finally, '''my second process''' extracts links from the tweets, to see which kind of content could be behind the most tweeted links.
 
=== URL analysis ===
 
[[File:TweetsAnalysisWithRapidminer-Figure3.png|thumbnail|right|URL data sorted by "In documents"]]
 
As I said before, I used RapidMiner to process my tweets and extract only the links. All links on Twitter are shortened with a URL shortener, and I could not find a functionality in RapidMiner that requests a URL and returns the real URL behind it (to check, for example, which domains are most represented), so I did it manually.
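
Outside RapidMiner, this manual step could be automated with a few lines of Python from the standard library. This is a sketch under assumptions: `resolve` needs network access and simply follows HTTP redirects, which works only for shorteners that redirect in that way, and both function names are hypothetical.

```python
from collections import Counter
from urllib.parse import urlparse
from urllib.request import urlopen

def resolve(url, timeout=10):
    """Follow redirects and return the final URL (needs network access)."""
    with urlopen(url, timeout=timeout) as response:
        return response.geturl()

def domain_counts(urls):
    """Count how often each host name appears in a list of resolved URLs."""
    return Counter(urlparse(u).netloc.lower() for u in urls)

# Example with already resolved (made-up) URLs, so no network is needed:
resolved = [
    "http://www.edtechmagazine.com/k12/a",
    "https://twitter.com/x/status/1",
    "https://Twitter.com/y/status/2",
]
print(domain_counts(resolved).most_common(1))  # [('twitter.com', 2)]
```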
 
I kept only the five most tweeted URLs and checked them. Here they are :
 
* [https://twitter.com/K12Launch/status/514981047667527680/photo/1 Humorous picture] about the generational technology gap (Twitter.com, in 35 tweets)
* [http://www.brilliant-insane.com/2014/12/9-traits-good-digital-citizens.html?utm_source=twitter.com&utm_medium=social&utm_campaign=buffer&utm_content=buffer8643a 9 traits of good digital citizens] (Brilliant-Insane.com, in 35 tweets)
* [https://twitter.com/markbarnes19/status/543762746496806912/photo/1 Infographic] about digital citizenship (Twitter.com, in 35 tweets)
* [http://www.insightsed.com/ Insightsed], which was unavailable (resource limit reached) on 17.12.2014 @ 15:30 UTC+1 (in 29 tweets)
* [http://www.edtechmagazine.com/k12/article/2014/09/5-strategies-reach-risk-students-technology 5 Strategies to Reach At-Risk Students with Technology] (EdtechMagazine.com, in 23 tweets)
 
=== Results and comments ===
 
==== Process tweets results ====
 
* First, I was able to see that capital letters matter when counting hashtags. We chose the #edtech hashtag, but other spellings were used, like #EdTech, #Edtech or #edTech.
* Secondly, '''the most used hashtag was clearly #edchat, in nearly 700 tweets'''. #education was second (132) and #ipaded third (117).
 
==== Process URL results ====
 
* We can see that two of the top 5 links point to Twitter statuses with images. One is an infographic about digital citizenship, and the other one is a funny picture.
* We can see that the three other links are website articles on subjects at the intersection of education and technology, which is what our hashtag is used for.
 
==== Comments ====


== Launch RapidMiner Studio and load data ==
The main objective of this process is to show how we work with data in RapidMiner. Of course I only explored a very small part of its functionalities and strengths. I think that the process that handles tweets could be much better : it could analyse which hashtags appear together in a tweet, or how many hashtags are used, on average, in each tweet. I could also compare the hashtags found in #edtech tweets with the ones found in #edchat tweets, for example.
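
As an illustration of two of these ideas (hashtags that appear together, and the average number of hashtags per tweet), here is a minimal Python sketch. It is not part of the RapidMiner process described above; the function names and the hashtag pattern are assumptions.

```python
import re
from collections import Counter
from itertools import combinations

# Assumed pattern: '#' followed by letters or digits.
HASHTAG = re.compile(r"#[^\W_]+")

def hashtags_in(tweet):
    """Return the set of lowercased hashtags found in one tweet."""
    return {tag.lower() for tag in HASHTAG.findall(tweet)}

def cooccurrences(tweets):
    """Count each unordered pair of hashtags that appear in the same tweet."""
    pairs = Counter()
    for tweet in tweets:
        pairs.update(combinations(sorted(hashtags_in(tweet)), 2))
    return pairs

def average_hashtags(tweets):
    """Average number of distinct hashtags per tweet (0.0 for no tweets)."""
    if not tweets:
        return 0.0
    return sum(len(hashtags_in(t)) for t in tweets) / len(tweets)

tweets = ["#edtech and #EdChat", "#edtech #edchat #ipaded", "nothing"]
print(cooccurrences(tweets)[("#edchat", "#edtech")])  # 2
print(average_hashtags(tweets))
```

Sorting the hashtags before pairing them guarantees that a pair is always stored in the same order, so (#edchat, #edtech) and (#edtech, #edchat) are counted together.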


As you launched RapidMiner Studio (v. 6.1.1000)
As said before, the process that handles links could be more automated : it could resolve the "real domains" automatically, and we would then be able to count or measure which articles or even domain names (websites) are most represented.


== Define StopWords and operations ==
Finally, it was sometimes pleasant to work with RapidMiner, sometimes not. Its structure is fairly easy to use once you understand it, and the visual input-output points and the built-in documentation, which tells you what data can enter and exit a "module", help a lot when you're beginning. RapidMiner also lets you use the full version of the software for a limited time, which is very positive.


== View result ==
Unfortunately some actions are not easy to find (such as the Zoom out action, which can only be accessed by clicking on a graph and dragging the mouse towards the upper left), and it's rather difficult to navigate the built-in "modules" and find the one you need for an operation.
== Export results ==


= Links =
= Links =

Latest revision as of 15:58, 27 January 2015



RapidMiner Studio 5.3.015 (2014/02/26)


Developed by: RapidMiner
License: Commercial&Open source
Web page : Tool homepage
Tool type : Framework/Library/API

The last edition of this page was on: 2014/11/10

The Completion level of this page is : High


SHORT DESCRIPTION

RapidMiner is a world-leading open-source system for data mining. It is available as a stand-alone application for data analysis and as a data mining engine for the integration into own products. RapidMiner is now RapidMiner Studio and RapidAnalytics is now called RapidMiner Server.

In a few words, RapidMiner Studio is a "downloadable GUI for machine learning, data mining, text mining, predictive analytics and business analytics". It can also be used (for most purposes) in batch mode (command line mode).

Camacab0 (talk)


TOOL CHARACTERISTICS

Usability

Authors of this page consider that this tool is rather easy to use.

Tool orientation

This tool is designed for general purpose analysis.

Data mining type

This tool is made for Structured data mining, Text mining, Image mining, Audio mining, Video mining, Data gathering, Social network analysis.

Manipulation type

This tool is designed for Data extraction, Data transformation, Data analysis, Data visualisation, Data conversion, Data cleaning.

IMPORT FORMAT : SQL, TXT, XLS, XML, a lot more

EXPORT FORMAT : CSV, XML, XSL, a lot more


Tool objective(s) in the field of Learning Sciences

Analysis & Visualisation of data
Predicting student performance
Student modelling
Social Network Analysis (SNA)
Constructing courseware

Providing feedback for supporting instructors:
Recommendations for students
Grouping students:
Developing concept maps:
Planning/scheduling/monitoring
Experimentation/observation

Tool can perform:

  • Data extraction of type: Web crawler, Flat file database/Logfile extractor, Structured database extractor
  • Transformation of type: Simple data format conversion, Simple data transformation operations, Advanced data transformation operations, Mathematical transformation of data for analysis
  • Data analysis of type: Basic statistics and data summarization, Data mining methods and algorithms
  • Data visualisation of type: Sequential Graphic, Chart/Diagram, Tag Cloud (These visualisations can be interactive and updated in "real time")



ABOUT USERS

Tool is suitable for:

Students/Learners/Consumers
Teachers/Tutors/Managers
Researchers
Developers/Designers
Organisations/Institutions/Firms
Others

Required skills:

STATISTICS: Basic

PROGRAMMING: None

SYSTEM ADMINISTRATION: N/A

DATA MINING MODELS: Medium



FREE TEXT


Tool version : RapidMiner Studio 5.3.015 2014/02/26

Developed by : RapidMiner

Tool Web page : http://sourceforge.net/projects/rapidminer/#resources

Tool type : Framework/Library/API

License : Commercial&Open source


SHORT DESCRIPTION


RapidMiner is a world-leading open-source system for data mining. It is available as a stand-alone application for data analysis and as a data mining engine for the integration into own products. RapidMiner is now RapidMiner Studio and RapidAnalytics is now called RapidMiner Server.

In a few words, RapidMiner Studio is a "downloadable GUI for machine learning, data mining, text mining, predictive analytics and business analytics". It can also be used (for most purposes) in batch mode (command line mode).

Camacab0 (talk)

TOOL CHARACTERISTICS


Tool orientation: This tool is designed for general purpose analysis.
Data mining type: This tool is designed for Structured data mining, Text mining, Image mining, Audio mining, Video mining, Data gathering, Social network analysis.
Usability: Authors of this page consider that this tool is rather easy to use.
Data import format: SQL, TXT, XLS, XML, a lot more.
Data export format: CSV, XML, XSL, a lot more.
Tool objective(s) in the field of Learning Sciences

☑ Analysis & Visualisation of data
☑ Predicting student performance
☑ Student modelling
☑ Social Network Analysis (SNA)
☑ Constructing courseware

☑ Providing feedback for supporting instructors:
☑ Recommendations for students
☑ Grouping students:
☑ Developing concept maps:
☑ Planning/scheduling/monitoring
Experimentation/observation


Can perform data extraction of type:
Web crawler, Flat file database/Logfile extractor, Structured database extractor


Can perform data transformation of type:
Simple data format conversion, Simple data transformation operations, Advanced data transformation operations, Mathematical transformation of data for analysis


Can perform data analysis of type:
Basic statistics and data summarization, Data mining methods and algorithms


Can perform data visualisation of type:
Sequential Graphic, Chart/Diagram, Tag Cloud (These visualisations can be interactive and updated in "real time")


ABOUT USER


Tool is suitable for:
Students/Learners/Consumers: ☑
Teachers/Tutors/Managers: ☑
Researchers: ☑
Organisations/Institutions/Firms: ☑
Others: ☑
Required skills:
Statistics: BASIC
Programming: NONE
System administration: N/A
Data mining models: MEDIUM


Draft

Introduction

RapidMiner is both a free open source product and a commercial product for text mining (content analysis).

“RapidMiner provides data mining and machine learning procedures including: data loading and transformation (ETL), data preprocessing and visualization, modelling, evaluation, and deployment. The data mining processes can be made up of arbitrarily nestable operators, described in XML files and created in RapidMiner's graphical user interface (GUI). RapidMiner is written in the Java programming language. It also integrates learning schemes and attribute evaluators of the Weka machine learning environment and statistical modelling schemes of the R-Project.” (Wikipedia, retrieved 20:37, 13 March 2012 (CET))

Installation

  • Installation of RapidMiner Studio is very easy on Windows (tested on Windows 7 and Windows 8.1, both 64 bits), when using the installer provided on your RapidMiner Account page.
  • Installation can be difficult on Mac OS X, depending on the Java version. In 10.10, RapidMiner asks for Java 1.7 or above, even if you've got 1.8.X installed.

Note : RapidMiner is now commercial software, so you can only use the product for 14 days, after requesting a trial license.

A complete set of tools

First of all, it is important to say that RapidMiner Studio - and RapidMiner Server, which works with it - form a complete set of tools rather than a single-purpose piece of software. The RapidMiner website says that "RapidMiner lets you easily sort through and run more than 1500 operations".

Because of its complexity, I will only describe some of RapidMiner Studio's functions. However, I will show below an example of using RapidMiner Studio as a basic text miner. Then, I will show you how to use RapidMiner to extract, transform and analyze tweets.

RapidMiner Studio's highlights are :

  • A visual - code-free - environment, so no programming needed
  • Available on all major operating systems and platforms
  • Main function : Design of analysis processes
  • Predictive analytics (with pre-made templates)
  • Data loading
  • Data transformation
  • Data modeling
  • Data visualization (with lots of visualizations)
  • Extension API
  • Lots of data sources : Excel, Access, Oracle, IBM DB2, Microsoft SQL, Sybase, Ingres, MySQL, Postgres, SPSS, dBase, Text files, and more
  • RapidMiner allows you to work with different types and sizes of data sources

Use examples

As we can do almost anything with RapidMiner Studio, I chose to explore two different activities that can help you later build a text-mining and analysis project. First, I will show you how to use RapidMiner as a basic text-mining tool. We will see how to extract, transform and analyze text from files on your computer. Secondly, I will explain how you can analyze tweets for free with RapidMiner Studio and a third-party website for the Twitter extraction (which is otherwise a premium feature of RapidMiner Studio).

Basic text mining

As described before, RapidMiner can be used as text mining software. I will describe here an example of a text mining process, where we will :

  1. Load and extract words from (the text files in) two directories
  2. Ignore some words that are not wanted (stoplist)
  3. Generate the results
  4. View results

Launch RapidMiner Studio and load data

Fig.1 : Text Mining Extension
Fig.2 : The workspace

Once you have launched RapidMiner Studio (v. 6.1.1000), you will need to install the Text Mining extension. RapidMiner works with extensions that plug into the core system. The Text Mining extension can be found in the RapidMiner Marketplace, which can be accessed from Help > Updates and Extensions (Marketplace), as shown in Fig. 1.

After restarting the software, we can start working with it. First of all, create a New Process. You will now see the main window of RapidMiner Studio, and I will briefly describe the main zones of the workspace :

  • In blue we have the main toolbar
  • In orange we can see all the operators that we can use in our processes
  • In green we have the repositories
  • In purple we have the main process window, where we will be able to see process results and progress
  • In black we have the parameters of each element of our process, and the help

From here, we first find the operator Process Documents from Files (screenshot here) and drag it into the Process zone, in the center. At this point we have the operator in our process, and we need to set its parameters. Click on the operator in the main process area, and see which parameters you can set on the right side. The first parameter is text directories, which we will set right away.

Fig.3 : Text Directories

Note : On the right side of your toolbar you can see a four-element menu that allows you to switch between Design and Results (also with F8 and F9 keys) that will be very useful. If your results aren't what you were expecting, or you made a mistake when designing your process, you can easily return from the results to the design area.

  • In my case, I have a directory on my Desktop named "data"
  • In /data/, I have /litterature/text1.txt and /photographie/text2.txt
  • I will set up my text directories as suggested in Fig. 3 and give each a different name, to be able to show results by text directory

In the next section we will talk about operators, and we will come back to the Process Documents from Files parameters to choose which vector we want RapidMiner to create.
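
For readers who want to see the same idea outside the GUI, here is a minimal Python sketch of what the text directories parameter describes: each name maps to a directory, and every .txt file inside is read as one document carrying that name. The function name load_labeled_documents and the UTF-8 encoding are assumptions made for illustration; this is not how RapidMiner itself is implemented.

```python
from pathlib import Path


def load_labeled_documents(text_directories):
    """Read every .txt file in each directory and tag it with the directory's name.

    text_directories maps a name you choose (hypothetical, like the class
    name in Fig. 3) to a directory path.
    """
    documents = []
    for class_name, directory in text_directories.items():
        for path in sorted(Path(directory).glob("*.txt")):
            # Assumption: the text files are UTF-8 encoded.
            text = path.read_text(encoding="utf-8")
            documents.append({"label": class_name, "file": path.name, "text": text})
    return documents
```

With the layout described above, a call could look like load_labeled_documents({"litterature": "data/litterature", "photographie": "data/photographie"}).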

Tokenize & define StopWords

Now that we have our Process Documents from Files operator in our Main Process area and our text directories set up correctly, we need to connect our operator Process Documents from Files on the left (from inp to wor) and on the right (from exa to res, and wor to res). This will allow the data to be processed.

Fig. 5 : Tokenize and Stopwords operators

We will now define which steps (or processes) should be executed inside our Process Documents from Files operator. By double-clicking on it, we can see its inside. We will now add a Tokenize operator, which can be found in the operators area (in Tokenization) on the left. Tokenize splits the text into words, making each one an independent value. One of RapidMiner's big strengths is its graphical user interface, which allows you to build processes quite naturally. We will also add Filter Stopwords (French), because my text files are in French, into our main Process Documents from Files operator, again by dragging it. You should see something like Fig. 5 above.
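
To make the two steps concrete, here is a small Python sketch of tokenizing followed by stopword filtering. It assumes tokens are runs of letters, and it uses a tiny hand-written French stoplist; the stoplist used by the real Filter Stopwords (French) operator is much longer and is not reproduced here. The names tokenize, filter_stopwords and FRENCH_STOPWORDS are hypothetical.

```python
import re

# Tiny illustrative stoplist (assumption); the real operator ships its own list.
FRENCH_STOPWORDS = {"le", "la", "les", "un", "une", "des", "de", "du", "et", "est", "en"}


def tokenize(text):
    """Return the runs of letters in text, dropping digits and punctuation."""
    # [^\W\d_] matches any word character that is not a digit or underscore,
    # which means a letter (accented letters included).
    return re.findall(r"[^\W\d_]+", text)


def filter_stopwords(tokens, stopwords=FRENCH_STOPWORDS):
    """Drop tokens found in the stoplist, ignoring case."""
    return [tok for tok in tokens if tok.lower() not in stopwords]
```

For example, filter_stopwords(tokenize("La photographie est un art")) keeps only "photographie" and "art".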

View result

If your main operator is connected (input - output) and, inside it, your Tokenize operator and your Stopwords operator are also connected to each other and to the input and output as the figure above suggests, you should be ready to launch the process, which should generate your results.

Before clicking on the launch button, I want to point out that we didn't change the Vector Creation parameter of our Process Documents from Files operator. That parameter sets how the word vectors are computed from the given data (TF-IDF by default).

If you launch the process leaving the default value (TF-IDF), RapidMiner will present the results in different ways. First you have two tabs, WordList and ExampleSet.

Note : TF-IDF is "short for term frequency–inverse document frequency", which is "a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus" (Wikipedia).
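
As an illustration of that definition, the sketch below computes a common textbook variant of TF-IDF: term frequency (count divided by document length) multiplied by the logarithm of the number of documents divided by the number of documents containing the term. This is an assumption for illustration only; RapidMiner's exact formula and normalization may differ.

```python
import math
from collections import Counter


def tf_idf(documents):
    """documents: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(documents)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        weights.append({
            # tf = count / document length, idf = log(N / df)
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights
```

With this variant, a word that appears in every document gets a weight of zero, while a word that is frequent in one document and absent from the others gets a high weight in that document.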

WordList View

Fig. 6 : WordList View

In the WordList view tab (Fig. 6) we have an occurrence analysis.

  • The first column presents the words found in the documents
  • The second column presents the attributes of the words (which in my case are equal to the words themselves)
  • The third column shows Total Occurences (how many times the word appears in all documents)
  • The fourth column shows Document Occurences (in how many documents the word appears)
  • The fifth and sixth columns show Text Directory Occurences (how many times the word appears in each text directory)
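
The three kinds of counts in this table can be sketched in a few lines of Python. The function name build_word_list and the output layout are hypothetical; the sketch only mirrors the columns described above.

```python
from collections import Counter, defaultdict


def build_word_list(documents):
    """documents: list of (directory_label, tokens) pairs.

    Returns {word: {"total": int, "documents": int, "by_directory": {label: int}}}.
    """
    total = Counter()                  # occurrences over all documents
    in_documents = Counter()           # number of documents containing the word
    by_directory = defaultdict(Counter)  # occurrences per text directory
    for label, tokens in documents:
        total.update(tokens)
        in_documents.update(set(tokens))  # a set counts each word once per document
        by_directory[label].update(tokens)
    return {
        word: {
            "total": total[word],
            "documents": in_documents[word],
            "by_directory": {label: counts[word] for label, counts in by_directory.items()},
        }
        for word in total
    }
```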

ExampleSet View

Fig. 7 : ExampleSet View
Fig. 8 : Charts view types

In the ExampleSet view tab (Fig. 7) we have a left menu with five tabs. I will try to present them :

  • The first tab is an Overview of the process. We can find there each text directory, each document processed and some other information.
  • The second tab is Statistics, and allows you to see statistics about the data given.
  • The third and fourth tabs are Charts and Advanced Charts. They allow the user to visualize some default charts, or to build advanced and customized charts fed by the data and analysis results of our text mining process.
  • Fifth tab allows the user to save annotations into the process.

Note : Fig. 8 shows you some of the charts view types that RapidMiner proposes.

Export results

When it comes to exporting results in RapidMiner Studio, each extension and each RapidMiner Studio function allows different things. For example, after a text mining process, data will be available in different forms :

  • WordList view allows you to export the spreadsheet as an image (png, svg, jpg, eps, pdf), to print it, or even to copy/paste the spreadsheet data into Microsoft Excel or Google Drive Spreadsheets.
  • ExampleSet view also allows the user to copy/paste the data from the software, to print it or to export it as an image.

Note : The export as an image function seems to export the whole main area of the software (in the center), not an individual image.

Tweets mining and analysis

Introduction

RapidMiner Studio allows you to extract, transform and analyse data from A to Z with its core functionalities and free plugins. Unfortunately, some Cloud extensions and functionalities are premium, and pricey. I will explain here how you can extract and analyse tweets using only the free version of RapidMiner Studio and a third-party service for the tweet extraction.

Tweets extraction

Zapier's GoogleDrive and Twitter connection
Twitter search parameters on Zapier

First of all you need to get the data that you want to input into RapidMiner. In our case, we need the tweets that we want to process. As said before, some third-party services allow you to extract tweets automatically from Twitter : I will present Zapier, which "connects the web apps you use to easily move your data and automate tedious tasks". A zap is a connection between two services, which you can set up to automate tasks.

For our task, I connected Twitter and Google Drive, and specified that I want Zapier to look for a hashtag (#edtech) and to save each tweet containing that value in a new text file, in a Google Drive folder. Once you have enough tweets, you can save your Google Drive folder to a local directory on your computer, which you will then specify in RapidMiner. I got nearly 8,000 tweets in a few days. You now have your data ready to use with RapidMiner.

Data transformation

After having all our tweets in a directory on the computer, we can proceed with RapidMiner. We need to make a process that takes our directory as input and outputs data that can be analysed and visualised. The figure below shows my three processes, which I will explain below.

RapidMiner process, containing the three sub-processes

Let's first focus on the orange process, the Tweets processing :

  • First, the "module" Process documents from files, named Process tweets, allows us, like in the previous tutorial, to specify a directory where the text files are. We need to specify, inside this "module", which actions will be triggered.
    • As we want to extract only the hashtags from all tweets, we first need to tell RapidMiner to Tokenize all words (by cutting them at white spaces).
    • Then we need it to filter the tokens created to keep only the hashtags. That is done with a regular expression that selects only words starting with the # symbol, followed by letters or numbers.
  • When the Process Tweets "module" is finished, it outputs a WordList that can be converted to an ExampleSet by the Tweets->Data "module". That will allow us to treat these words as data and to use them later.
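
The two inner steps (tokenize on white space, then keep only hashtags) can be sketched in Python as follows. The regular expression is an assumption reconstructed from the description above, since the expression actually used in the process is not shown; the names HASHTAG and extract_hashtags are hypothetical.

```python
import re

# Assumed pattern: "#" followed by letters or numbers only.
HASHTAG = re.compile(r"#[A-Za-z0-9]+")


def extract_hashtags(tweet):
    """Tokenize on white space and keep only the tokens that are hashtags."""
    tokens = tweet.split()
    # fullmatch requires the whole token to be a hashtag, so a token with
    # trailing punctuation such as "#edtech," is dropped by this sketch.
    return [tok for tok in tokens if HASHTAG.fullmatch(tok)]
```

For example, extract_hashtags("Great #edtech tools for #edchat http://t.co/abc") keeps "#edtech" and "#edchat" and drops everything else.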

If we look closer at the URL processing, it is built just like the Tweets processing.

  • We have a Process documents from files "module", named Process URL's, that loops over all files in the directory and executes two operations for each of them :
    • Tokenize, explained before.
    • Filter tokens, which this time keeps only the tweeted links. We use a regular expression to keep only "words" starting with "http://".
  • Finally we convert this WordList into an ExampleSet again, to be able to connect it to the Result output point.
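
The link filter can be sketched the same way. The pattern below is an assumption based on the description (tokens starting with "http://"); the exact expression used in the process is not shown, and the names LINK and extract_links are hypothetical.

```python
import re

# Assumed pattern: "http://" followed by at least one non-space character.
LINK = re.compile(r"http://\S+")


def extract_links(tweet):
    """Tokenize on white space and keep only the tokens that start with http://."""
    return [tok for tok in tweet.split() if LINK.fullmatch(tok)]
```

Note that this pattern does not match links starting with "https://"; the expression would need to be widened (for example to https?://) to keep those too.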

Data analysis

Figure 1 - Hashtags (sorted by "in documents" count, and alphabetically)

Once the process shown before is complete and valid, you can run it to see whether the data output is what you were expecting. My process gives me three ExampleSets, as I had three output points connected. I will now present two of these ExampleSets and then talk about the third one, the Read Excel process.

Figure 2 - Most represented hashtags in a graph (equivalent hashtags and #edchat filtered)

The objective of my first process was to show which hashtags were most often combined with the #edtech hashtag. The "Tweets->DATA" ExampleSet shows us that. You can see it in a data view (table), which can be sorted, and in other ways such as charts.

  • Figure 1 shows the data view, where we can see all hashtags and the number of documents (tweets) in which they appear.
  • Figure 2 shows a chart of the most represented hashtags.

My last process, Read Excel, is the easiest way I found to filter tokens depending on the "In documents" value. I needed to filter my final data for three reasons : variants like #EdTech, #edTech and #Edtech were among the most used hashtags, I had not added a step to convert all tokens to lower case, and the graph wasn't readable due to the huge number of different hashtags. I looked for a way to do it inside RapidMiner and tried different approaches, but didn't manage to. Instead, I exported the data resulting from my "Tweets->Data" process to a Microsoft Excel file. I then deleted all unwanted lines (equivalent hashtags and less represented hashtags) to keep only the most used hashtags. I created a process in RapidMiner that reads that file and outputs its data : I then have filtered data that can be shown.

  • The figure 2 graphic is the result of the Read Excel process. It only contains the most used hashtags, with the "equivalent" hashtags filtered out. It is also important to say that the most used hashtag (#edchat) has been removed to give a better view of the other hashtags.
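
The manual Excel filtering described here (merge case variants, keep only frequent hashtags, drop #edchat) can also be expressed as a short Python sketch. This is an alternative outside RapidMiner, not the method used in this tutorial; the function name top_hashtags and its parameters are hypothetical.

```python
from collections import Counter


def top_hashtags(tweets_hashtags, min_documents=2, exclude=()):
    """tweets_hashtags: one list of hashtags per tweet.

    Returns (hashtag, number of tweets) pairs, most frequent first.
    """
    excluded = {tag.lower() for tag in exclude}
    in_documents = Counter()
    for tags in tweets_hashtags:
        # Lowercasing merges #EdTech, #edTech and #Edtech into one hashtag;
        # the set makes a hashtag count once per tweet ("In documents").
        in_documents.update({tag.lower() for tag in tags})
    return [
        (tag, count)
        for tag, count in in_documents.most_common()
        if count >= min_documents and tag not in excluded
    ]
```

A call such as top_hashtags(tweets, min_documents=50, exclude=["#edchat"]) would then give data comparable to what was kept by hand; the threshold of 50 is only an example value.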

Finally, my second process extracts links from the tweets, to see which kind of content could be behind the most tweeted links.

URL analysis

URL data sorted by "In documents"

As I said before, I used RapidMiner to process my tweets and extract only the links. All links on Twitter are shortened with a URL shortener, and I could not find a functionality in RapidMiner that requests a URL and returns its real URL (to be able, for example, to check which domains are most represented), so I did it manually.

I kept only the five most tweeted URLs and checked them. Here they are :

Results and comments

Process tweets results

  • First, I was able to see that capital letters are taken into consideration in tweets. We chose the #edtech hashtag, but variants such as #EdTech, #Edtech or #edTech were also used.
  • Secondly, the most used hashtag was clearly #edchat, in nearly 700 tweets. #education (132) was second and #ipaded (117) was third.

Process URL results

  • We can see that in the top 5 links, two of them point to Twitter statuses with images. One is an infographic about digital citizenship, and the other one is a funny picture.
  • The three other links are website articles on subjects between education and technology, which is what our hashtag is used for.

Comments

The main objective of this process is to show how we work with data in RapidMiner. Of course I only explored a very small part of its functionalities and strengths. I think that the process that processes tweets could be much better : it could analyse which hashtags appear together in a tweet, or how many hashtags are used, on average, in each tweet. I could also cross the hashtags represented in #edtech tweets with the ones represented in #edchat tweets, for example.

As said before, the process treating links could be more automated : it could resolve "real domains" automatically, and we would then be able to count or measure which articles or even domain names (websites) are most represented.

Finally, it was sometimes pleasant to work with RapidMiner, sometimes not. Its structure is fairly easy to use once you understand it, and the visual input-output points and the built-in documentation, which gives you information about the data that can enter and exit a "module", help a lot when you're beginning. RapidMiner also allows you to use the full version of the software for a limited time, which is very positive.

Unfortunately some actions are not easy to find (such as the Zoom out action, which can only be accessed by clicking on a graph and dragging the mouse towards the upper left), and it's kind of difficult to navigate the built-in "modules" and find the one you need for an operation.
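
To give an idea of what such an improved analysis could compute, here is a minimal Python sketch of the two measures mentioned: the average number of hashtags per tweet, and how often two hashtags appear together. It was not part of the process described in this tutorial; the function name hashtag_stats is hypothetical, and lowercasing the hashtags is an assumption.

```python
from collections import Counter
from itertools import combinations


def hashtag_stats(tweets_hashtags):
    """tweets_hashtags: one list of hashtags per tweet.

    Returns (average hashtags per tweet, Counter of hashtag pairs seen together).
    """
    pairs = Counter()
    total = 0
    for tags in tweets_hashtags:
        # Lowercase and deduplicate, then sort so each pair has one fixed order.
        unique = sorted({tag.lower() for tag in tags})
        total += len(unique)
        pairs.update(combinations(unique, 2))
    average = total / len(tweets_hashtags) if tweets_hashtags else 0.0
    return average, pairs
```

Calling pairs.most_common(10) on the result would list the ten pairs of hashtags that most often appear in the same tweet.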
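
As a rough idea of how this could be automated outside RapidMiner, the Python sketch below follows the redirects of a shortened link with the standard library and then counts domains. It was not used in this tutorial; resolve_url needs network access, the timeout value is arbitrary, and both function names are hypothetical.

```python
from collections import Counter
from urllib.parse import urlparse
from urllib.request import urlopen


def resolve_url(short_url, timeout=10):
    """Follow redirects and return the final URL (requires network access)."""
    with urlopen(short_url, timeout=timeout) as response:
        return response.geturl()


def count_domains(urls):
    """Count how many of the given (already resolved) URLs point to each domain."""
    return Counter(urlparse(url).netloc.lower() for url in urls)
```

One possible use is count_domains(resolve_url(link) for link in links), where links is the list of shortened links extracted from the tweets; in practice, failed requests would also need to be handled.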

Links

Official

Get RapidMiner

Documentation / Tutorials