Orange Textable: Difference between revisions
Once the table is configured, the ''Convert widget'' takes the data table and emits Example tables that can be visualized using the ''Orange Data table widget'' or any other widget of the Orange canvas.
==Example==
In this example we will examine which technologies were used, and how frequently, according to one of the students of the course Sciences et Technologies de l’Information et de la Communication I (STIC I), Master of Science in Learning and Teaching Technologies, University of Geneva, in 2003.
During this course the students created a web page in XML that gathers all the exercises submitted for the course, along with the associated technologies, as well as exercises submitted for other courses of their master's programme. We have chosen to focus on one student for two reasons. Firstly, we have the student’s agreement; secondly, our purpose is to illustrate the functionalities of the Orange Textable tool, not to provide an in-depth analysis of the problem in question.
We will create a concordance data table based on the student’s web page that associates the submitted exercises with the technologies used for them, and we will measure the frequency of each technology for a given exercise.
The student page http://tecfaetu.unige.ch/etu-maltt/tetris/karanis0/ contains information about the student as well as the exercises that the student submitted for three courses. We are interested in the exercises that are part of the course Stic I.
Step 1
First we have to import the URL into Orange Textable. To do so, we select the URLs widget and copy and paste the above URL into the URL field. Then we specify the encoding as shown in the following picture. (image one?)
By using the Display widget we can visualize the XML tags as well as the text enclosed in these tags. The tags that we are interested in are “course” and “exercise”.
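Outside the widget, the role of the encoding setting can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, the default encoding "utf-8" is an assumption (the encoding actually chosen is only visible in the picture), and the download function is defined but not called.

```python
from urllib.request import urlopen


def decode_page(raw_bytes, encoding="utf-8"):
    """Turn the raw bytes of a page into text using the chosen encoding."""
    return raw_bytes.decode(encoding)


def fetch_page(url, encoding="utf-8"):
    """Download a page and decode it (needs network access; not called here)."""
    with urlopen(url) as response:
        return decode_page(response.read(), encoding)


# The same two bytes, read with the right and with the wrong encoding.
raw = "\u00e9".encode("utf-8")        # the letter e with an acute accent
right = decode_page(raw, "utf-8")     # one character, as intended
wrong = decode_page(raw, "latin-1")   # two unrelated characters
```

Choosing the wrong encoding does not raise an error here; it silently produces different characters, which is why the setting deserves attention before any segmentation is done.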
Step 2
Now we have to isolate the exercises that are part of the course Stic I. To do that, we first have to extract the content of the “course” tag. We use the Extract XML widget and type “course” in the XML element field as shown in the following picture. (image three)
“Course” is the XML tag of our page that encloses all the information associated with a given course. As we can see in the following picture, we have created three segments, one for the content of each course.
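The effect of this extraction can be approximated in plain Python. The sketch below is not how the Extract XML widget is implemented; the sample page and the function name are hypothetical, and the regular expression assumes that an element is never nested inside itself.

```python
import re

# Hypothetical stand-in for the student page; the real markup may differ.
PAGE = """<student>
<course><title>Stic I</title><exercise>ex 1</exercise></course>
<course><title>Stic II</title><exercise>ex 1</exercise></course>
<course><title>Other course</title><exercise>ex 1</exercise></course>
</student>"""


def extract_xml(text, element):
    """Return the content of each <element>...</element> pair as one segment.

    Assumes the element is not nested inside itself.
    """
    name = re.escape(element)
    # The opening tag may carry attributes; the content is matched lazily
    # so that each segment stops at the first closing tag.
    pattern = re.compile(r"<%s(?:\s[^>]*)?>(.*?)</%s>" % (name, name), re.DOTALL)
    return [match.group(1) for match in pattern.finditer(text)]


courses = extract_xml(PAGE, "course")
print(len(courses))  # one segment per course element
```

Each returned string keeps its inner tags, so the same function can be applied again to a course segment, for instance with "exercise" as the element name.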
In order to isolate the content of the Stic courses, we link the Select widget to the Extract XML widget and type the regular expression shown in the following picture. (image four)
Note that this selects each whole segment that contains the string “Stic”, not just the matched string itself.
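This behaviour, keeping the whole segment rather than the matched string, can be sketched as a simple filter. The function name, the sample segments and the regex "Stic" are assumptions made for illustration; the expression actually typed in the widget is only visible in the picture.

```python
import re


def select(segments, regex):
    """Keep every segment in which the regex matches at least once."""
    pattern = re.compile(regex)
    return [segment for segment in segments if pattern.search(segment)]


# Hypothetical course segments, as produced by the previous extraction.
course_segments = [
    "<title>Stic I</title><exercise>ex 1</exercise>",
    "<title>Stic II</title><exercise>ex 1</exercise>",
    "<title>Other course</title><exercise>ex 1</exercise>",
]
stic_segments = select(course_segments, r"Stic")
print(len(stic_segments))  # number of segments that contain the string
```

The regex only decides whether a segment is kept; the content of the selected segments is left untouched.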
So far we have created two segments that correspond to the content of the courses Stic I and Stic II. Finally we have to isolate each exercise of the course Stic I. To do so, we segment the above segmentation further by using the Extract XML widget with the tag “exercise”. (image five)
Now we have 25 segments, each of which corresponds to an exercise of either Stic I or Stic II. To isolate the desired exercises we use the Select widget six times, each time specifying the string that identifies the desired segment, that is <exercise-number>1</exercise-number>, <exercise-number>2</exercise-number>, <exercise-number>3</exercise-number>, etc. (image six)
At this point it is important to specify a label for each of these segments, as the labels will allow us to distinguish them during the merging process that follows.
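The repeated selection with labels can be sketched as follows. The label format "exercise_N", the function name and the sample segments are hypothetical. Searching for the complete tag, rather than for the bare number, keeps exercise 1 from also matching exercise 12.

```python
import re

# Hypothetical exercise segments; the real ones come from the student page.
exercise_segments = [
    "<exercise-number>1</exercise-number> HTML page",
    "<exercise-number>2</exercise-number> CSS layout",
    "<exercise-number>12</exercise-number> PHP form",
]


def select_labelled(segments, number):
    """Return (label, matching segments) for the exercise with this number."""
    pattern = re.compile(r"<exercise-number>%d</exercise-number>" % number)
    label = "exercise_%d" % number
    return label, [s for s in segments if pattern.search(s)]


# One selection per wanted exercise, each stored under its own label.
labelled = dict(select_labelled(exercise_segments, n) for n in (1, 2))
```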
Step 3
Our next goal is to identify the technologies associated with these exercises. First we need one segmentation consisting of the six segments that we created in step 2. To do so, we merge these segments using the Merge widget. In the Merge widget window we leave the default options and check the box import labels with key. We now have one segmentation with one label that contains all the segments of step 2 (all exercises for Stic I).
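What the merge produces can be pictured with a small data structure. This is a sketch under assumptions, not the widget's code: segments are represented as hypothetical dictionaries, and the key name component_labels is taken from the Count step of this example.

```python
def merge(labelled_segmentations, label_key="component_labels"):
    """Concatenate several labelled segmentations into a single one.

    The label of the segmentation each segment came from is stored as an
    annotation, so the origin of a segment is not lost by merging.
    """
    merged = []
    for label, segments in labelled_segmentations.items():
        for content in segments:
            merged.append({"content": content, "annotations": {label_key: label}})
    return merged


# Hypothetical input: two labelled segmentations of one segment each.
inputs = {
    "exercise_1": ["HTML page with some CSS"],
    "exercise_2": ["CSS layout"],
}
merged = merge(inputs)
```

Because every merged segment remembers its original label, a later step can still group results by exercise.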
Step 4
Now we are ready to identify the technologies cited in our segmentation, which consists of the six merged segments. To accomplish that, we use the Segment widget and specify the strings that we are searching for (technology names) as well as the annotation value of each resulting segment. (image eight)
To illustrate how the regular expressions (regexes) are specified, we take the example of the css string. We select Tokenize Mode (t) and type \b(css)\b in the Regex field. Then we type “type” in the Annotation key field and css in the Annotation value field. Finally we check the boxes Ignore case (i) and Unicode dependent (u). The annotation key is important, as we will use it to “call” these segments during the construction of the table.
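The same regex and options can be tried with Python's re module, which helps to check an expression before typing it into the widget. The function name, the dictionary layout and the sample sentence are hypothetical; only the regex, the flags and the annotation key and value come from the description above.

```python
import re


def tokenize(text, regex, annotation_key, annotation_value):
    """Create one annotated segment per match of the regex in the text."""
    # Ignore case (i) and Unicode dependent (u), as ticked in the widget.
    pattern = re.compile(regex, re.IGNORECASE | re.UNICODE)
    return [
        {
            "content": match.group(0),
            "start": match.start(),
            "end": match.end(),
            "annotations": {annotation_key: annotation_value},
        }
        for match in pattern.finditer(text)
    ]


# Hypothetical exercise text.
text = "Styled with CSS, then css again; the scss file is ignored."
tokens = tokenize(text, r"\b(css)\b", "type", "css")
print([token["content"] for token in tokens])
```

The word boundaries (\b) keep "scss" from being counted as an occurrence of css, while Ignore case lets "CSS" and "css" fall under the same annotation value.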
Step 5
In order to count the technologies associated with the given exercises, as well as the number of times that each technology has been cited for a given exercise, we use the Count widget. In the Units field we select annotation key: “type” (labels of the technologies). In the Contexts field we select Mode: Containing segmentation and Annotation key: component_labels (labels of the exercises), and click the “Compute” button as shown in the following picture. (image nine)
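The table computed in this step has one row per exercise and one column per technology. A minimal sketch of that computation, with hypothetical exercise texts, labels and technology list, and counting directly with regexes instead of going through annotated segments:

```python
import re

# Hypothetical exercise texts, keyed by the exercise label.
exercises = {
    "exercise_1": "HTML page styled with CSS, plus a second css file",
    "exercise_2": "PHP form that prints html",
}
technologies = ["css", "html", "php"]


def count(contexts, units):
    """Count, for each context, how many times each unit occurs in it."""
    table = {}
    for label, text in contexts.items():
        table[label] = {
            unit: len(re.findall(r"\b%s\b" % re.escape(unit), text, re.IGNORECASE))
            for unit in units
        }
    return table


table = count(exercises, technologies)
for label, row in table.items():
    print(label, row)
```

Technologies that never occur in an exercise still get a cell, with a count of zero, so every row has the same columns.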
Step 6
Finally, in order to visualize the result, that is, the table that we have created, we use the Convert widget and the Data Table widget (from the Data window), leaving the default options. (image ten)
Conclusion
Orange Textable allows its users to create data tables on the basis of text data through a flexible and intuitive interface. The quality of the results depends mostly on the segmentation and annotation carried out by the user. Creating adequate segments and annotating them ensures that the concordances and collocation lists, as well as the computed quantitative indices (frequency measures, etc.), refer to the desired segments. We have to stress that the segmentation process is highly dependent on regular expressions. For this reason, even though Orange Textable is a tool that one can easily learn, a good knowledge of regular expressions might be required.
== References ==
[https://orange-textable.readthedocs.org/en/latest/ Orange Textable Documentation]
[http://orange.biolab.si/download/ Orange package download page]