Orange Textable
</gallery>
*In general, the first section differs according to the widget that you are using.
**Preprocessing (''Preprocess widget''): This widget takes a segmentation as input and outputs a segmentation covering the modified text. The possible modifications are replacing accented characters with their non-accented equivalents and converting lower case characters to upper case or vice versa. Note that ''Preprocess'' creates a copy of each modified segment, which increases the program’s memory footprint. Finally, because it creates new strings and not only new segmentations, it won’t work if combined with segmentations that refer to different strings. In the sequence depicted in the image below, the frequency table will remain empty.[[File:PreprocesEx.png|frame|Image taken from Orange Textable documentation]]
**Substitutions (''Recode widget''): This widget takes as input a segmentation covering the text that should be recoded and outputs a segmentation covering the recoded text. It “captures” and substitutes parts of the input text using regular expressions. The text to be “captured” is specified in the “Regex” field and the text that replaces it in the “Replacement string” field. If the “Replacement string” field is empty, the “captured” text is deleted. Note that it creates new strings and not only new segmentations, so it is subject to the same limitations as the ''Preprocess widget''.
**Ordering (''Merge widget''): This widget takes two or more segmentations as input and outputs a merged segmentation. You can reorder the input segmentations by selecting them and then clicking the move up/move down buttons.
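
The capture-and-replace behaviour described for the substitution widget above can be approximated in plain Python. This is only a sketch using the standard <code>re</code> module, not the widget's own code; the function name <code>recode</code> and the sample string are hypothetical and chosen for illustration.

```python
import re

def recode(text, regex, replacement=""):
    """Replace every match of `regex` in `text`.

    An empty replacement string deletes the matched text, which mirrors
    leaving the "Replacement string" field empty.
    """
    return re.sub(regex, replacement, text)

sample = "colour and flavour"  # hypothetical input text

# Substitute: the captured "our" at the end of a word becomes "or".
substituted = recode(sample, r"our\b", "or")

# Delete: with no replacement, the captured text is simply removed.
deleted = recode(sample, r" and")

print(substituted)
print(deleted)
```

Like the widget, <code>re.sub</code> returns a new string rather than a view on the original one, which is why positions computed on the original text no longer apply to the result.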
Once the table is configured, the ''Convert widget'' takes the data table and emits Example tables that can be visualized using the ''Orange Data Table widget'' or another widget of the Orange canvas.
==Example with one url==
[[File:SchemaNew.png|800x600px|center|frame|Figure 1: Complete scheme]]
In this example we examine which technologies one student used, and how often, during the course Sciences et Technologies de l’Information et de la Communication I (STIC I), Master of Science in Learning and Teaching Technologies, University of Geneva, in 2003. <br />
During this course the students created a web page in XML that gathers all the exercises submitted for this course, along with the associated technologies, as well as exercises submitted for other courses of their master's programme. We chose to focus on one student for two reasons. Firstly, we have the student’s agreement; secondly, our purpose is to illustrate the functionalities of the Orange Textable tool, not to provide an in-depth analysis of the problem in question. <br />
We will create a concordance data table based on the student’s web page that associates the submitted exercises with the technologies used for them, and we will measure the frequency of each technology for a given exercise.
By viewing the student page http://tecfaetu.unige.ch/etu-maltt/tetris/karanis0/ we can see that it contains information about the student as well as the exercises that the student submitted for three courses. We are interested in the exercises that are part of the course Stic I.
'''Step 1'''
First we have to import the URL into Orange Textable. To do so, we select the URLs widget and paste the above URL into the URL field. Then we specify the encoding as shown in the following picture.
[[File:One.png|center|frame|Figure 2: Interface of the URLs widget]]
Using the Display widget, we can view the XML tags as well as the text enclosed in them. The tags that we are interested in are “course” and “exercise”.
'''Step 2'''
Now we have to isolate the exercises that are part of the course Stic I. To do that, we first extract the content of the “course” tag: we use the Extract XML widget and type “course” in the XML element field, as shown in the following picture.
[[File:Three.png|center|frame|Figure 3: Interface of the Extract XML widget]]
The “course” tag of our page encloses all the information associated with a given course. As the above picture shows, we have created three segments, one per course.
To isolate the content of the Stic courses, we link the Select widget to the Extract XML widget and type the regular expression shown in the following picture.
[[File:Four.png|center|frame|Figure 4: Interface of the Select widget]]
Note that this selects every whole segment that contains the string “Stic”, not just the string itself.
So far we have created two segments, which correspond to the content of the courses Stic I and Stic II. Finally, we have to isolate each exercise. To do so, we segment the above segmentation using the Extract XML widget and the tag “exercise”.
[[File:Five.png|center|frame|Figure 5: Interface of the Extract XML widget]]
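
The extract, select, extract sequence described above can be sketched in plain Python to make the logic concrete. This is an approximation with the standard <code>re</code> module, not the widgets' implementation. The helper names <code>extract_xml</code> and <code>select</code> and the sample markup are hypothetical; the real student page has a richer structure.

```python
import re

def extract_xml(text, element):
    """Return the content of every <element>...</element> pair in `text`.

    Regex-based and therefore not a real XML parser; nested elements
    of the same name are not handled.
    """
    name = re.escape(element)
    pattern = r"<{0}(?:\s[^>]*)?>([\s\S]*?)</{0}>".format(name)
    return re.findall(pattern, text)

def select(segments, regex):
    """Keep each whole segment in which `regex` matches at least once."""
    return [segment for segment in segments if re.search(regex, segment)]

# Hypothetical, heavily simplified stand-in for the student's XML page.
page = (
    "<course><title>Stic I</title>"
    "<exercise><exercise-number>1</exercise-number>HTML, CSS</exercise>"
    "<exercise><exercise-number>2</exercise-number>PHP</exercise></course>"
    "<course><title>Vip</title><exercise>Game analysis</exercise></course>"
)

courses = extract_xml(page, "course")          # one segment per course
stic = select(courses, r"Stic")                # whole segments containing "Stic"
exercises = [e for c in stic for e in extract_xml(c, "exercise")]

print(len(courses), len(stic), exercises)
```

As in the Select widget, the regular expression only decides whether a segment is kept; the segment itself is returned unchanged.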
Now we have 25 segments, each corresponding to an exercise of either Stic I or Stic II. To isolate the desired exercises, we use the Select widget six times, each time specifying the string that identifies the desired segment, that is <exercise-number>1</exercise-number>, <exercise-number>2</exercise-number>, <exercise-number>3</exercise-number>, etc.
[[File:Six.png|center|frame|Figure 6: Interface of the Select widget]]
At this point it is important to specify a label for each segment, as the labels will allow us to distinguish the segments during the merging process that follows.
'''Step 3'''
Our next goal is to identify the technologies associated with these exercises. First we need one segmentation consisting of the six segments created in step 2. To obtain it, we merge these segments using the Merge widget. In the Merge widget window we leave the default options and check the box “Import labels with key”. We now have a single labelled segmentation that contains all the segments of step 2 (all exercises for Stic I).
[[File:Seven.png|center|frame|Figure 7: Interface of the Merge widget]]
'''Step 4'''
At this point we need to remove the URLs included in the text, so that a technology is not counted twice when it is cited next to a link whose address also contains its name. To do so we use the Recode widget and the regular expression shown below.
[[File:PictureUrl.png|center|frame|Figure 8: Interface of the Recode widget]]
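
To see why removing URLs avoids double counting, consider the following sketch. The regular expression actually used appears only in the picture above, so the pattern <code>https?://\S+</code> and the sample sentence are assumptions made for illustration.

```python
import re

# Assumed URL pattern; the regex used in the tutorial is only visible
# in the screenshot and may differ.
URL_REGEX = r"https?://\S+"

# Hypothetical exercise text in which the technology name also occurs
# inside a link.
text = "Exercise built with CSS, see http://example.org/css/demo.html for details"

def count_css(s):
    """Count whole-word occurrences of "css", ignoring case."""
    return len(re.findall(r"\bcss\b", s, flags=re.IGNORECASE))

without_urls = re.sub(URL_REGEX, "", text)

css_before = count_css(text)          # counts the citation and the link
css_after = count_css(without_urls)   # counts the citation only

print(css_before, css_after)
```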
'''Step 5'''
Now we are ready to identify the technologies cited in our segmentation, which consists of the six merged segments. To accomplish that, we use the Segment widget and specify the strings that we are searching for (technology names) as well as the annotation value of each resulting segment.
[[File:Eight.png|center|frame|Figure 9: Interface of the Segment widget]]
To illustrate how the regular expressions (regexes) are specified, we take the string css as an example. We select the mode Tokenize (t) and type \b(css)\b in the Regex field. Then we type “type” in the Annotation key field and css in the Annotation value field. Finally, we check the boxes Ignore case (i) and Unicode dependent (u). The annotation key is important because we will use it to refer to these segments when building the table.
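
The effect of the regex \b(css)\b with the Ignore case and Unicode dependent options can be checked in plain Python. The sketch below only mimics the idea of tokenizing with an annotation; the function <code>tokenize</code>, the dictionary layout of a segment and the sample sentence are hypothetical, not Textable's data model.

```python
import re

def tokenize(text, regex, annotation_key, annotation_value):
    """Create one annotated segment per match of `regex` in `text`."""
    # re.UNICODE is already the default for str patterns in Python 3;
    # it is spelled out here to mirror the "Unicode dependent" box.
    flags = re.IGNORECASE | re.UNICODE
    return [
        {
            "start": match.start(),
            "end": match.end(),
            "content": match.group(0),
            annotation_key: annotation_value,
        }
        for match in re.finditer(regex, text, flags)
    ]

sample = "CSS for layout, css3 for effects, plain css again"  # hypothetical
segments = tokenize(sample, r"\b(css)\b", "type", "css")

for segment in segments:
    print(segment)
```

The word boundaries (\b) keep "css3" from being counted, while the ignore-case flag lets "CSS" and "css" receive the same annotation value.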
'''Step 6'''
To count the technologies associated with the given exercises, as well as the number of times each technology is cited for a given exercise, we use the Count widget. In the Units field we select the annotation key “type” (labels of the technologies). In the Contexts field we select Mode: Containing segmentation and Annotation key: component_labels (labels of the exercises), then click the “Compute” button as shown in the following picture.
[[File:Nine.png|center|frame|Figure 10: Interface of the Count widget]]
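
Conceptually, the table computed in this step has one row per context (exercise label) and one column per unit (technology). A minimal sketch of that idea in plain Python follows; the function <code>count_units</code>, the labels and the exercise texts are invented for illustration and are not the student's data.

```python
import re

# Hypothetical contexts: exercise label -> exercise text.
exercises = {
    "exercise_1": "HTML page styled with CSS; the CSS file is external.",
    "exercise_2": "PHP script that outputs HTML.",
}

# Hypothetical units: annotation value -> regex that finds it.
technologies = {
    "css": r"\bcss\b",
    "html": r"\bhtml\b",
    "php": r"\bphp\b",
}

def count_units(contexts, units):
    """Return {context label: {unit: frequency}} as nested dictionaries."""
    table = {}
    for label, text in contexts.items():
        table[label] = {
            unit: len(re.findall(regex, text, flags=re.IGNORECASE))
            for unit, regex in units.items()
        }
    return table

table = count_units(exercises, technologies)

for label, row in table.items():
    print(label, row)
```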
'''Step 7'''
Finally, to visualize the resulting table, we use the Convert widget and the Data Table widget (from the Data window), leaving the default options.
[[File:ResultNew.png|center|frame|Figure 11: Result]]
You can download the example by clicking [http://tecfaetu.unige.ch/etu-maltt/tetris/karanis0/SticIII/ here]
==Example with more urls==
In this mini-research we study whether the students of the course Jeux vidéos pédagogiques (VIP) focus more on the player of a game or on the game itself in their analyses of various games, and how this focus changes over time.
For this purpose we focus on the game analyses written by the students for the course VIP during the years 2012-2014. More specifically, we examine how many times in total the words “player” and “game” appear in the analyses of each year. These analyses can be found in EduTechWiki at the following URLs:
Analyses 2012: http://edutechwiki.unige.ch/fr/Cat%C3%A9gorie:Maltt_VIP_Stella<br />
Analyses 2013: http://edutechwiki.unige.ch/fr/Cat%C3%A9gorie:Maltt_VIP_Tetris<br />
Analyses 2014: http://edutechwiki.unige.ch/fr/Cat%C3%A9gorie:Maltt_VIP_Utopia
'''Step 1'''
First, we import the URLs into Orange Textable, defining for each URL the encoding, the annotation key and the annotation value. Note that the annotation value must correspond to the year in which each analysis was written, so that we can later present our results by year. Once all the desired URLs are imported, we define the output segmentation label.
'''Step 2'''
Next, we strip our HTML pages of all the unwanted style and script tags along with their contents, and then of all the remaining HTML tags, so that we are left with the analyses in plain text, which is easier to study. We do that using the ''Recode widget'' and the regular expressions shown in the picture below. The regular expression ''<script [^>]*> [\s\S]*?</script>'' removes the script tags along with their contents, the regular expression ''<style [^>]*> [\s\S]*?</style>'' removes the style tags along with their contents, and finally the regular expression ''<.*?>'' removes the remaining HTML tags.
[[File:Picture1Textable.png|center|frame|Figure 1: Interface of the Recode widget]]
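
The three substitutions can be tried outside Textable with Python's <code>re</code> module. In this sketch the patterns are written without the spaces that appear in the text above, on the assumption that a tag such as <code><script></code> with no attributes should also match; the function name <code>clean_html</code> and the sample page are hypothetical.

```python
import re

# Applied in order: scripts first, then styles, then any remaining tag.
# Patterns are assumed, space-free variants of the ones quoted above.
CLEANUP_STEPS = [
    (r"<script[^>]*>[\s\S]*?</script>", ""),
    (r"<style[^>]*>[\s\S]*?</style>", ""),
    (r"<.*?>", ""),
]

def clean_html(html):
    """Remove script and style blocks with their contents, then all tags."""
    for regex, replacement in CLEANUP_STEPS:
        html = re.sub(regex, replacement, html)
    return html

# Hypothetical page fragment.
page = (
    '<html><style type="text/css">p {color: red}</style>'
    "<script>var x = 1;</script>"
    "<p>The <b>player</b> wins</p></html>"
)

plain = clean_html(page)
print(plain)
```

The order matters: if the generic tag pattern ran first, the script and style tags would disappear but their contents would stay in the text. Note also that <code>.</code> does not match a newline by default, so a tag broken across two lines would survive the last step.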
'''Step 3'''
Once the texts are free of the unwanted elements, we search them for the two words of interest. To do so, we use the ''Segment widget'' and the regular expressions shown below.
[[File:Picture2Textable.png|center|frame|Figure 2: Interface of the Segment widget]]
'''Step 4'''
We then count how many times the words “player” and “game” are cited in the analyses of each year, using the ''Count widget''. In the Units field we select the annotation key “type” (labels of the words “player” and “game”). In the Contexts field we select Mode: Containing segmentation and Annotation key: years (labels of the URLs), then click the “Compute” button as shown in the following picture.
[[File:Picture3Textable.png|center|frame|Figure 3: Interface of the Count widget]]
'''Step 5'''
Finally, to visualize the resulting table, we use the ''Convert widget'' and the ''Data Table widget'' (from the Data window), leaving their default options.
Our final scheme is shown in the following picture.
[[File:PictureSchemaTextable.png|center|frame|Figure 4: Complete scheme]]
'''Results'''
As the data table shows, in the first year we examine (2012) the students use the word "game" more often than "player". In the second year (2013), "game" is still more frequent than "player", but the latter starts gaining ground. In the third year (2014), "player" is used more often than "game". This mini-research suggests that, over the years, the students' tendency to refer more to the game than to the player is shifting in favour of the latter.
[[File:PictureResultsTextable.png|center|frame|Figure 5: Results]]
You can download the example by clicking [http://tecfaetu.unige.ch/etu-maltt/tetris/karanis0/SticIII/ here]
==Conclusion==
Orange Textable allows its users to create data tables from text data through a flexible and intuitive interface. The quality of the results depends mostly on the segmentation and annotation performed by the user. Creating adequate segments and annotating them ensures that the concordances and collocation lists, as well as the computed quantitative indices (frequency measures, etc.), refer to the desired segments. We must stress that the segmentation process relies heavily on regular expressions. For this reason, even though Orange Textable is easy to learn, a good knowledge of regular expressions may be required.
== References ==
[https://orange-textable.readthedocs.org/en/latest/ Orange Textable Documentation]
[http://orange.biolab.si/download/ Orange package download page]