User:Daniel K. Schneider


About myself


Daniel K. Schneider is a senior lecturer and researcher at TECFA, a research and teaching unit in the Faculty of Psychology and Education, University of Geneva. Holding a PhD in political science, he has been working in educational technology since 1988 and has participated in various innovative pedagogical and technological projects. He has been a prime mover in the introduction of creative pedagogical strategies and ICT. His current R&D interests focus on modular, flexible and open Internet architectures supporting rich and effective educational designs. Within TECFA's "blended" master program in educational technology (MALTT), he teaches educational information & communication systems, virtual environments and research methodology.

  • DataMelt is a software environment for numeric calculations, statistics and data analysis


    DataMelt, or DMelt, is an environment for numeric computation, data analysis and data visualization. DMelt is designed for analysis of large data volumes ("big data"), data mining, statistical analyses and math computations. The program can be used in many areas, such as natural sciences, engineering, modeling and analysis of financial markets.

    DMelt is a computational platform: it can be used with different programming languages on different operating systems. Unlike many statistical programs, DataMelt is not limited to a single programming language. Data analysis and statistical computations can be done using high-level scripting languages (Python/Jython, Groovy, etc.) as well as a lower-level language such as Java. It incorporates many open-source Java packages into a coherent interface using the concept of dynamic scripting (a short scripting sketch follows this description).

    DMelt creates high-quality vector-graphics images (SVG, EPS, PDF etc.) that can be included in LaTeX and other text-processing systems.

    The program runs on Windows, Linux, Mac OS.
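    As a small illustration of the dynamic scripting mentioned above, here is a hedged Python/Jython sketch; the jhplot class and method names are taken from DMelt's documentation as I recall it and may differ between versions, and the script only runs inside a DMelt/Jython environment:

      # Minimal DMelt/Jython sketch (assumed API: jhplot.HPlot and jhplot.P1D)
      from jhplot import HPlot, P1D

      c = HPlot("Example canvas")    # interactive plotting canvas
      c.visible()                    # show the window
      c.setAutoRange()               # auto-scale the axes

      p = P1D("sample data")         # simple container for (x, y) points
      for x in range(10):
          p.add(x, x * x)            # fill with (x, x^2) points

      c.draw(p)                      # render the points
      c.export("example.svg")        # vector-graphics export, e.g. SVG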
  • Quotes from the official TANAGRA home page (10/2014):
    • TANAGRA is a free DATA MINING software for academic and research purposes. It proposes several data mining methods from exploratory data analysis, statistical learning, machine learning and databases area.
    • The main purpose of Tanagra project is to give researchers and students an easy-to-use data mining software, conforming to the present norms of the software development in this domain (especially in the design of its GUI and the way to use it), and allowing to analyse either real or synthetic data.
    • The second purpose of TANAGRA is to propose to researchers an architecture allowing them to easily add their own data mining methods, to compare their performances. TANAGRA acts more as an experimental platform in order to let them go to the essential of their work, dispensing them to deal with the unpleasant part in the programmation of this kind of tools : the data management.
    • The third and last purpose, in direction of novice developers, consists in diffusing a possible methodology for building this kind of software. They should take advantage of free access to source code, to look how this sort of software is built, the problems to avoid, the main steps of the project, and which tools and code libraries to use for. In this way, Tanagra can be considered as a pedagogical tool for learning programming techniques.
    According to its author, Tanagra can be compared to Weka: in comparison it has an easier-to-use interface, but less functionality.
  • “Pajek (Slovene word for Spider) is a program, for Windows, for analysis and visualization of large networks. It is freely available, for noncommercial use, at its download page. See also a reference manual for Pajek (in PDF). The development of Pajek is traced in its History. See also an overview of Pajek's background and development.” (Pajek, Sept. 22, 2014) Pajek includes six data structures (e.g. network, permutation, cluster, ...) and about 15 algorithms using these structures (e.g. partitions, decompositions, paths, flows, ...).
  • “The open-source LightSide platform, including the machine-learning and feature-extraction core as well as the researcher's workbench UI, has been and continues to be funded in part through Carnegie Mellon University, in particular by grants from the National Science Foundation and the Office of Naval Research.” (LightSide home page, sept. 2014).
  • According to the official help page (3/2014), DocuBurst is an online document visualization tool, and can be used for:
    • Uploading your own text documents
    • Creating interactive visual summaries of documents
    • Exploring keywords to uncover document themes or topics
    • Investigating intra-document word patterns, such as character relationships
    • Comparing documents
    • Commenting, annotating and sharing visualizations with others
  • According to the Apache Mahout home page (Oct 1, 2014):
    The Apache Mahout™ project's goal is to build a scalable machine learning library. Quote: “Currently Mahout supports mainly three use cases: Recommendation mining takes users' behavior and from that tries to find items users might like. Clustering takes e.g. text documents and groups them into groups of topically related documents. Classification learns from existing categorized documents what documents of a specific category look like and is able to assign unlabelled documents to the (hopefully) correct category.”
  • AntConc is a freeware concordance program for Windows, Macintosh OS X, and Linux. The software includes seven tools:
    • Concordance Tool: shows search results in a 'KWIC' (KeyWord In Context) format.
    • Concordance Plot Tool: shows search results plotted in a 'barcode' format, so you can see the positions where search results appear in the target texts.
    • File View Tool: shows the text of individual files, allowing you to investigate in more detail the results generated by the other tools of AntConc.
    • Clusters/N-Grams: shows clusters based on the search condition, in effect summarizing the results generated in the Concordance Tool or Concordance Plot Tool. The N-Grams Tool, on the other hand, scans the entire corpus for clusters of length 'N' (e.g. 1 word, 2 words, ...), which allows you to find common expressions in a corpus.
    • Collocates: shows the collocates of a search term, allowing you to investigate non-sequential patterns in language.
    • Word List: counts all the words in the corpus and presents them in an ordered list, so you can quickly find which words are the most frequent in the corpus.
    • Keyword List: shows which words are unusually frequent (or infrequent) in the corpus in comparison with a reference corpus, allowing you to identify characteristic words, for example as part of a genre or ESP study.
  • Many Eyes is a website where you can visualise data such as numbers, text and geographic information. You can create a range of visualisations, including unusual ones such as “treemaps” and “phrase nets”. All the charts made in Many Eyes are interactive, so you can change what data is shown and how it is displayed. Many Eyes is also an online community where users can create topical groups to organise, share and discuss data visualisations. You can sign up to receive notifications when there are new visualisations or data on topics you are interested in. Note that you can only use Many Eyes if your data and the visualisations can be shared publicly on the Internet.
  • Commercial software for extracting specific information from websites. Using a point-and-click interface, Mozenda enables you to extract specific information and images from websites. Mozenda is composed of an "Agent Builder" and a web console. The Mozenda Web Console runs the Agents created in the Agent Builder and lets you organize, manage, view, export and publish the extracted information. All agents are run on highly optimized harvesting servers in Mozenda's data centers.
  • Cytoscape is an open source software platform for visualizing molecular interaction networks and biological pathways and integrating these networks with annotations, gene expression profiles and other state data. Although Cytoscape was originally designed for biological research, it is now a general platform for complex network analysis and visualization. The Cytoscape core distribution provides a basic set of features for data integration, analysis, and visualization. Additional features are available as Apps (formerly called Plugins). Apps are available for network and molecular profiling analyses, new layouts, additional file format support, scripting, and connection with databases. They may be developed by anyone using the Cytoscape open API based on Java™ technology, and App community development is encouraged. Most of the Apps are freely available from the Cytoscape App Store.
  • Superset features:
    • A rich set of data visualizations
    • An easy-to-use interface for exploring and visualizing data
    • Create and share dashboards
    • Enterprise-ready authentication with integration with major authentication providers (database, OpenID, LDAP, OAuth & REMOTE_USER through Flask AppBuilder)
    • An extensible, high-granularity security/permission model allowing intricate rules on who can access individual features and the dataset
    • A simple semantic layer, allowing users to control how data sources are displayed in the UI by defining which fields should show up in which drop-down and which aggregation and function metrics are made available to the user
    • Integration with most SQL-speaking RDBMS through SQLAlchemy (see the sketch after this list)
    • Deep integration with Druid.io
    This project was originally named Panoramix, was renamed to Caravel in March 2016, and is currently named Superset as of November 2016
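    As announced in the feature list, data sources are registered with standard SQLAlchemy connection URIs; the following sketch merely checks such a URI outside Superset (the URI, credentials and database name are placeholders, not taken from the text above):

      # Check a SQLAlchemy connection URI of the kind Superset accepts (placeholder values)
      from sqlalchemy import create_engine, text

      uri = "postgresql://scott:tiger@localhost:5432/analytics"   # hypothetical database
      engine = create_engine(uri)

      with engine.connect() as conn:
          # any SQL-speaking RDBMS reachable through SQLAlchemy can be queried this way
          print(conn.execute(text("SELECT 1")).scalar())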
  • Features:
    • text tokenization, including deep semantic features like parse trees
    • inverted and forward indexes with compression and various caching strategies
    • a collection of ranking functions for searching the indexes
    • topic models
    • classification algorithms
    • graph algorithms
    • language models
    • CRF implementation (POS-tagging, shallow parsing)
    • wrappers for liblinear and libsvm (including libsvm dataset parsers)
    • UTF8 support for analysis on various languages
    • multithreaded algorithms
  • GATE is over 15 years old and is in active use for all types of computational tasks involving human language. GATE excels at text analysis of all shapes and sizes. From large corporations to small startups, from €multi-million research consortia to undergraduate projects, our user community is the largest and most diverse of any system of this type, and is spread across all but one of the continents. (GATE: a full-lifecycle open source solution for text processing, http://gate.ac.uk/overview.html)
  • GISMO is a graphical interactive monitoring tool that provides instructors with visualizations of students' activities in online courses. Instructors can examine various aspects of distance students, such as attendance in courses, reading of materials, and submission of assignments. Users of Moodle may benefit from GISMO for their teaching activities. With respect to the standard reports provided by Moodle (which basically allow teachers to see if an individual student has viewed a specific resource or participated in a specific activity on a specific day), GISMO provides comprehensive visualizations that give an overview of the whole class, not only a specific student or a particular resource. GISMO is available for Moodle 1.9.x and Moodle 2.x.
  • IRaMuTeQ stands for "Interface de R pour les Analyses Multidimensionnelles de Textes et de Questionnaires", in English "R interface for multi-dimensional analysis of texts and questionnaires". IRaMuTeQ is built on top of R. As of October 2014, there is only a French interface, but the software can deal with English texts.
  • KEEL (Knowledge Extraction based on Evolutionary Learning) is an open source (GPLv3) Java software tool which empowers the user to assess the behavior of evolutionary learning and Soft Computing based techniques for different kinds of DM problems: regression, classification, clustering, pattern mining and so on. See the complete description on the KEEL website.
  • KH Coder is an application for quantitative content analysis, text mining or corpus linguistics. It can handle Japanese, English, French, German, Italian, Portuguese and Spanish language data. By inputting raw texts, you can use its search and statistical analysis functionality, such as KWIC, collocation statistics, co-occurrence networks, self-organizing maps, multidimensional scaling, cluster analysis and correspondence analysis.
  • KNIME is a user-friendly graphical workbench for the entire analysis process: data access, data transformation, initial investigation, powerful predictive analytics, visualisation and reporting. The open integration platform provides over 1000 modules (nodes). The open source version claims to implement a very rich platform: “The KNIME Analytics Platform incorporates hundreds of processing nodes for data I/O, preprocessing and cleansing, modeling, analysis and data mining as well as various interactive views, such as scatter plots, parallel coordinates and others. It integrates all of the analysis modules of the well known Weka data mining environment and additional plugins allow R-scripts to be run, offering access to a vast library of statistical routines.”
  • LingPipe is a tool kit for processing text using computational linguistics. LingPipe is used to do tasks like:
    • Find the names of people, organizations or locations in news
    • Automatically classify Twitter search results into categories
    • Suggest correct spellings of queries
    The free, open-source version requires that processed data and linked software be freely available; there are other versions.
  • LOCO-Analyst is an educational tool aimed at providing teachers with feedback on the relevant aspects of the learning process taking place in a web-based learning environment, and thus helps them improve the content and the structure of their web-based courses. LOCO-Analyst aims at providing teachers with feedback regarding:
    • all kinds of activities their students performed and/or took part in during the learning process,
    • the usage and the comprehensibility of the learning content they had prepared and deployed in the LCMS,
    • contextualized social interactions among students (i.e., social networking) in the virtual learning environment.
  • Log Parser is a flexible command line utility that was initially written by Gabriele Giuseppini, a Microsoft employee, to automate tests for IIS logging. It was intended for use with the Windows operating system, and was included with the IIS 6.0 Resource Kit Tools. The default behavior of logparser works like a "data processing pipeline", by taking an SQL expression on the command line, and outputting the lines containing matches for the SQL expression. (From Wikipedia)
    Microsoft describes Logparser as a powerful, versatile tool that provides universal query access to text-based data such as log files, XML files and CSV files, as well as key data sources on the Windows operating system such as the Event Log, the Registry, the file system, and Active Directory. The results of the input query can be custom-formatted in text based output, or they can be persisted to more specialty targets like SQL, SYSLOG, or a chart.
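    To make the "SQL expression on the command line" idea concrete, here is a hedged sketch that drives Log Parser 2.2 from Python; the query, the IIS log file pattern and the output format are illustrative placeholders, and the example assumes LogParser.exe is installed and on the PATH of a Windows machine:

      # Hypothetical Log Parser 2.2 invocation from Python (Windows only)
      import subprocess

      query = ("SELECT TOP 10 cs-uri-stem, COUNT(*) AS Hits "
               "FROM ex*.log GROUP BY cs-uri-stem ORDER BY Hits DESC")

      subprocess.run(
          ["LogParser.exe", query, "-i:IISW3C", "-o:CSV"],  # input: IIS W3C logs; output: CSV
          check=True,
      )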
  • Log Parser Studio is a graphical user interface (GUI) that functions as a front-end to Log Parser 2.2 and provides a 'Query Library' to manage all the queries and scripts that one builds up over time. Log Parser Studio (LPS) can house all queries in a central location and allows you to edit, create and save queries. You can search for queries using free-text search, as well as export and import both libraries and queries in different formats, allowing for easy collaboration as well as storing multiple types of separate libraries for different protocols.
  • MAXQDA is a mixed methods research tool. There are two versions:
    • MAXQDA includes the more classical QDA functionality (e.g. the kind found in ATLAS.ti or NVivo) plus data management/import tools
    • MAXQDAplus additionally contains the quantitative MAXDictio tool.
    According to Wikipedia (oct 2013), “MAXQDA is a software program designed for computer-assisted qualitative and mixed methods data, text and multimedia analysis in academic, scientific, and business institutions. It is the successor of winMAX, which was first made available in 1989.”
  • NaCTeM has developed a number of high-quality text mining tools for the UK academic community. However, at least some seem to be available to all for non-commercial purposes ([1]). NaCTeM's tools and services offer benefits to a wide range of users, e.g. reduction in time and effort for finding and linking pertinent information from large-scale textual resources, and customised solutions in semantic data analysis (Our Aims and Objectives, retrieved March 2014). NaCTeM tools are available in different ways: for basic tools, web services exist; others require download and sometimes configuration/installation.
  • NetDraw is a free Windows program for visualizing social network data. NetDraw is also included in UCINET, a fairly cheap commercial SNA program developed by the same company.
  • NetMiner is a software application for exploratory analysis and visualization of large network data based on SNA (Social Network Analysis). It can be used for general research and teaching in social networks. This tool allows researchers to explore their network data visually and interactively, and helps them to detect underlying patterns and structures of the network. It features data transformation, network analysis, statistics, visualization of network data, charts, and a programming language based on the Python scripting language.
  • Neural Designer is a data mining application intended for professional data scientists. It uses neural networks, which are mathematical models of brain function that can be trained to perform tasks such as function regression, pattern recognition, time series prediction or auto-association. The software provides a graphical user interface using a wizard approach consisting of a sequence of pages. It allows you to run the tasks and to obtain comprehensive results as a report in an easy way. Neural Designer stands out in terms of performance: it is developed in C++, has been subjected to code optimization techniques and makes use of parallel processing, so it can analyze bigger data sets in less time.
  • Online tools to assist in the conversion of JSON to CSV.
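    The same conversion is easy to script locally; here is a minimal sketch using Python's standard json and csv modules (the file names are placeholders and the input is assumed to be a flat list of JSON objects):

      # Flatten a list of JSON objects into a CSV file (assumes a flat list of dicts)
      import csv, json

      with open("records.json") as f:          # placeholder input file
          rows = json.load(f)                  # expects e.g. [{"name": "a", "score": 1}, ...]

      fieldnames = sorted({key for row in rows for key in row})

      with open("records.csv", "w", newline="") as f:   # placeholder output file
          writer = csv.DictWriter(f, fieldnames=fieldnames)
          writer.writeheader()
          writer.writerows(rows)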
  • OpenSesame is a graphical, open-source experiment builder for the social sciences. It sports a modern and intuitive user interface that allows you to build complex experiments with a minimum of effort. With OpenSesame you can create a wide range of experiments. The plug-in framework and Python scripting allow you to incorporate external devices, such as eye trackers, response boxes, and parallel port devices, into your experiment. OpenSesame is freely available under the General Public Licence.
  • Orange is an open-source data visualization and analysis tool for novices and experts. Data mining is done through visual programming or Python scripting. It offers components for machine learning and add-ons for bioinformatics and text mining, and is packed with features for data analytics. Various add-ons, like Orange Textable, expand the functionality of this software.
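    As a hedged illustration of the Python scripting route (the snippet assumes the Orange 3 scripting API and its bundled "iris" example dataset, neither of which is mentioned in the text above):

      # Minimal Orange scripting sketch: train a classification tree on a bundled dataset
      import Orange

      data = Orange.data.Table("iris")                   # built-in example dataset
      tree = Orange.classification.TreeLearner()(data)   # fit a classification tree

      print(tree(data[:3]))   # predicted class indices for the first three rows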
  • Piwik is an open source web analytics platform. Piwik displays reports regarding the geographic location of visits, the source of visits (i.e. whether they came from a website, directly, or something else), the technical capabilities of visitors (browser, screen size, operating system, etc.), what the visitors did (pages they viewed, actions they took, how they left), the time of visits and more. In addition to these reports, Piwik provides some other features that can help users analyze the data Piwik accumulates (a small sketch of pulling such data programmatically follows the list), such as:
    • Annotations — the ability to save notes (such as one's analysis of data) and attach them to dates in the past.
    • Transitions — a feature similar to Click path-like features that allows one to see how visitors navigate a website, but different in that it only displays navigation information for one page at a time.
    • Goals — the ability to set goals for actions it is desired for visitors to take (such as visiting a page or buying a product). Piwik will track how many visits result in those actions being taken.
    • E-commerce — the ability to track if and how much people spend on a website.
    • Page Overlay — a feature that displays analytics data overlaid on top of a website.
    • Row Evolution — a feature that displays how metrics change over time within a report.
    • Custom Variables — the ability to attach data, like a user name, to visit data.
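    The sketch below assumes Piwik's HTTP reporting API as documented by the project; the instance URL, site id and authentication token are placeholders, and the API itself is not described in the text above:

      # Query a Piwik instance's reporting API for yesterday's visit summary (placeholders throughout)
      import requests

      params = {
          "module": "API",
          "method": "VisitsSummary.get",
          "idSite": 1,
          "period": "day",
          "date": "yesterday",
          "format": "JSON",
          "token_auth": "YOUR_TOKEN",
      }
      response = requests.get("https://piwik.example.org/index.php", params=params)
      print(response.json())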
  • QDA Miner is a qualitative "mixed methods" data analysis package. There are two versions:
    • A free QDA Miner Lite Version
    • An expensive commercial version
    Quote from the official product page: “QDA Miner is an easy-to-use qualitative data analysis software package for coding, annotating, retrieving and analyzing small and large collections of documents and images. QDA Miner qualitative data analysis tool may be used to analyze interview or focus group transcripts, legal documents, journal articles, speeches, even entire books, as well as drawings, photographs, paintings, and other types of visual documents. Its seamless integration with SimStat, a statistical data analysis tool, and WordStat, a quantitative content analysis and text mining module, gives you unprecedented flexibility for analyzing text and relating its content to structured information including numerical and categorical data.”
  • Quotation from the getting started page (11/2014): “ WordSmith Tools is an integrated suite of programs for looking at how words behave in texts. You will be able to use the tools to find out how words are used in your own texts, or those of others. The WordList tool lets you see a list of all the words or word-clusters in a text, set out in alphabetical or frequency order. The concordancer, Concord, gives you a chance to see any word or phrase in context -- so that you can see what sort of company it keeps. With KeyWords you can find the key words in a text. The tools have been used by Oxford University Press for their own lexicographic work in preparing dictionaries, by language teachers and students, and by researchers investigating language patterns in lots of different languages in many countries world-wide.”
  • Quote: “koRpus is an R package i originally wrote to measure similarities/differences between texts. over time it grew into what it is now, a hopefully versatile tool to analyze text material in various ways, with an emphasis on scientific research, including readability and lexical diversity features.”
  • Quote: OpenRefine (formerly Google Refine) is a powerful tool for working with messy data: cleaning it; transforming it from one format into another; extending it with web services; and linking it to databases like Freebase. ([2], oct 2. 2014)
  • Quote: The Stanford NLP Group makes parts of our Natural Language Processing software available to everyone. These are statistical NLP toolkits for various major computational linguistics problems. They can be incorporated into applications with human language technology needs. ([3])
  • Quote: We provide a tokenizer, a part-of-speech tagger, hierarchical word clusters, and a dependency parser for tweets, along with annotated corpora and web-based annotation tools.
  • Quoted from the tOko homepage (oct 2014)
    • tOKo is an open source tool for text analysis and browsing a corpus of documents. It implements a wide variety of text analysis and browsing functions in an interactive user interface.
    • An important application area of tOKo is ontology development. It supports both ontology construction from a corpus, as well as relating the ontology back to a corpus (for example by highlighting concepts from the ontology in a document).
    • Another application area is community research. Here the objective is to analyse the exchange of information, for example in a community forum or through a collection of interconnected weblogs.
  • Quote from the home page: “Textalytics is a text analysis engine that extracts meaningful elements from any type of content and structures it, so that you can easily process and manage it. Textalytics features a set of high-level web services — adaptable to the characteristics of every type of business — which can be flexibly integrated into your processes and applications.”
  • Quote from the home page: “This web-based tool enables you to "scrub" (clean) your unicode text(s), cut a text(s) into various size chunks, manage chunks and chunk sets, tokenize with character- or word- Ngrams or TF-IDF weighting, and choose from a suite of analysis tools for investigating those texts. Functionality includes building dendrograms, making graphs of rolling averages of word frequencies or ratios of words or letters, and playing with visualizations of word frequencies including word clouds and bubble visualizations. To facilitate subsequent text mining analyses beyond the scope of this site, users can also transpose and download their matrices of word counts or relative proportions as comma- or tab-separated files (.csv, .tsv).”
  • Quote from the software home page (11/2014):

    The Knowledge Network Organizing Tool (KNOT) is built around the Pathfinder network generation algorithm. There are also several other components (see below). Pathfinder algorithms take estimates of the proximities between pairs of items as input and define a network representation of the items. The network (a PFNET) consists of the items as nodes and a set of links (which may be either directed or undirected for symmetrical or non-symmetrical proximity estimates) connecting pairs of the nodes. The set of links is determined by patterns of proximities in the data and parameters of Pathfinder algorithms. For details on the method and its applications see R. Schvaneveldt (Editor), Pathfinder Associative Networks: Studies in Knowledge Organization. Norwood, NJ: Ablex, 1990.

    The Pathfinder software includes several programs and utilities to facilitate Pathfinder network analyses of proximity data. The system is oriented around producing pictures of the solutions, but representations of networks and other information are also available in the form of text files which can be used with other software. The positions of nodes for displays are computed using an algorithm described by Kamada and Kawai (1989, Information Processing Letters, 31, 7-15).
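    To make the idea concrete, here is a rough Python sketch of the special case PFNET(r = ∞, q = n − 1), in which a path's "length" is the largest link weight along the path; this only illustrates the pruning principle and is not the KNOT software's actual implementation:

      # Rough sketch of the special case PFNET(r = infinity, q = n - 1):
      # a direct link survives only if no indirect path has a smaller maximum step.
      import math

      def pfnet_inf(dist):
          # dist: symmetric n x n matrix of proximities expressed as distances
          # (math.inf where no direct estimate exists); returns the retained edges
          n = len(dist)
          minimax = [row[:] for row in dist]
          for k in range(n):              # Floyd-Warshall variant using max instead of +
              for i in range(n):
                  for j in range(n):
                      via_k = max(minimax[i][k], minimax[k][j])
                      if via_k < minimax[i][j]:
                          minimax[i][j] = via_k
          return {(i, j) for i in range(n) for j in range(i + 1, n)
                  if dist[i][j] < math.inf and dist[i][j] <= minimax[i][j]}

      # The weak direct link 0-2 (weight 5) is pruned: path 0-1-2 never needs a step above 3.
      d = [[0, 2, 5],
           [2, 0, 3],
           [5, 3, 0]]
      print(pfnet_inf(d))   # {(0, 1), (1, 2)}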
  • Quote from the Orange Textable home page (Oct. 2, 2014):

    Orange Textable is an open-source software tool for building data tables on the basis of raw text sources. Orange Textable offers the following features:

    • text data import from keyboard, files, or urls
    • systematic recoding
    • segmentation and annotation of various text units
    • extract and exploit XML-encoded annotations
    • automatic, random, and arbitrary selection of unit subsets
    • unit context examination using concordance and collocation tables
    • frequency and complexity measures
    • recoded text data and table export
  • Quote from the about page (12/2016): Gensim started off as a collection of various Python scripts for the Czech Digital Mathematics Library dml.cz in 2008, where it served to generate a short list of the most similar articles to a given article (gensim = “generate similar”). I also wanted to try these fancy “Latent Semantic Methods”, but the libraries that realized the necessary computation were not much fun to work with.
    By now, gensim is—to my knowledge—the most robust, efficient and hassle-free piece of software to realize unsupervised semantic modelling from plain text. It stands in contrast to brittle homework-assignment-implementations that do not scale on one hand, and robust java-esque projects that take forever just to run “hello world”.
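    As a small, hedged illustration of this kind of unsupervised semantic modelling (the three-document corpus is invented for the example; the calls shown are gensim's long-standing Dictionary/LsiModel/similarity interface):

      # Tiny latent semantic indexing example with gensim (toy corpus)
      from gensim import corpora, models, similarities

      texts = [["human", "computer", "interaction"],
               ["user", "interface", "computer"],
               ["graph", "trees", "minors"]]

      dictionary = corpora.Dictionary(texts)                 # word <-> id mapping
      corpus = [dictionary.doc2bow(t) for t in texts]        # bag-of-words vectors

      lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=2)  # latent semantic space
      index = similarities.MatrixSimilarity(lsi[corpus])

      query = lsi[dictionary.doc2bow(["computer", "user"])]
      print(list(index[query]))   # similarity of the query to each document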
  • Quote from the home page: “Welcome to the online text analysis tool, the detailed statistics of your text, perfect for translators (quoting), for webmasters (ranking) or for normal users, to know the subject of a text. Now with new features as the analysis of words groups, finding out the keyword density, analyse the prominence of word or expressions.”
  • Quote from the home page (11/2014): Bitext provides B2B multilingual semantic engines with “documentably” the highest accuracy in the market. Bitext works for companies in two main markets: Text Analytics (Concept and Entity Extraction, Sentiment Analysis) for Social CRM, Enterprise Feedback Management or Voice of the Customer; and in Natural Language Interfaces for Search Engines.
  • Quote from the About Page (11/2014): “Juxta is an open-source tool for comparing and collating multiple witnesses to a single textual work. Originally designed to aid scholars and editors examine the history of a text from manuscript to print versions, Juxta offers a number of possibilities for humanities computing and textual scholarship. [...] As a standalone desktop application, Juxta allows users to complete many of the necessary operations of textual criticism on digital texts (TXT and XML). With this software, you can add or remove witnesses to a comparison set, switch the base text at will. Once you’ve collated a comparison, Juxta also offers several kinds of analytic visualizations. By default, it displays a heat map of all textual variants and allows the user to locate — at the level of any textual unit — all witness variations from the base text. Users can switch to a side by side collation view, which gives a split frame comparison of a base text with a witness text. A histogram of Juxta collations is particularly useful for long documents; this visualization displays the density of all variation from the base text and serves as a useful finding aid for specific variants.”
  • Quote from the software home page (11/2014): Here is a software tool that can translate written text summaries directly into proximity files (prx) that can be analyzed by Pathfinder KNOT. It also generates text proposition files that can be imported by CMAP Tools to automatically form concept maps from the text. It should be of use to researchers who want to visualize "text" for various instructional and research-related reasons. Also it should work with different languages. ALA-Reader contains a rudimentary scoring system. Essentially, this tool converts the written summary into a cognitive map and then scores the cognitive map using an approach that we developed for scoring concept maps. The "score" produced is percent agreement with an expert referent. As I narrow down what algorithms work, then I plan to release updated versions periodically.
  • Quote from the home page: “Lexico3 is the 2001 edition of the Lexico software, first published in 1990. Functions present from the first version (segmentation, concordances, breakdown in graphic form, characteristic elements and factorial analyses of repeated forms and segments) were maintained and for the most part significantly improved. The Lexico series is unique in that it allows the user to maintain control over the entire lexicometric process from initial segmentation to the publication of final results. Beyond identification of graphic forms, the software allows for study of the identification of more complex units composed of form sequences: repeated segments, pairs of forms in co-occurrences, etc which are less ambiguous than the graphic forms that make them up.” A free version for "personal work" is available at the bottom of the home page.
  • Quote from the home page: “WordCruncher is a free eBook reader with research tools to help students and scholars study important texts.
    • You can look for specific references, search for words or phrases, follow cross-reference hyperlinks, and enlarge images.
    • You can copy and paste text, add bookmarks, highlight text, and make searchable notes.
    • Additional study aids include complex searches, word frequencies, word frequency distributions, synchronized windows to compare translations, word tags, and various text analysis reports (e.g., collocation, vocabulary dispersion, vocabulary usage). ”
  • Quote from the home page (11/2014): Netlytic is a cloud-based text and social networks analyzer that can automatically summarize and discover social networks from online conversations on social media sites.
  • Quotes from the FAQ: Redash is an open source tool for teams to query, visualize and collaborate. Redash is quick to setup and works with any data source you might need so you can query from anywhere in no time. [..] Redash was built to allow fast and easy access to billions of records, that we process and collect using Amazon Redshift (“petabyte scale data warehouse” that “speaks” PostgreSQL). Today Redash has support for querying multiple databases, including: Redshift, Google BigQuery, Google Spreadsheets, PostgreSQL, MySQL, Graphite, Axibase Time Series Database and custom scripts.
    Main features:
    • Query editor - enjoy all the latest standards like auto-complete and snippets. Share both your results and queries to support an open and data driven approach within the organization.
    • Visualization - once you have your dataset, select one of our 9 types of visualizations for your query. You can also export or embed it anywhere.
    • Dashboard - combine several visualizations into a topic targeted dashboard.
    • Alerts - get notified via email, Slack, Hipchat or a webhook when your query's results need attention.
    " API - anything you can do with the UI, you can do with the API. Easily connect results to other systems or automate your workflows.
  • RapidAnalytics is an open source server for data mining and business analytics. It is based on the data mining solution RapidMiner and includes ETL, data mining, reporting, dashboards in a single server solution.
  • RapidMiner is a world-leading open-source system for data mining. It is available as a stand-alone application for data analysis and as a data mining engine for integration into your own products. RapidMiner is now RapidMiner Studio and RapidAnalytics is now called RapidMiner Server. In a few words, RapidMiner Studio is a "downloadable GUI for machine learning, data mining, text mining, predictive analytics and business analytics". It can also be used (for most purposes) in batch mode (command line mode).
  • Real-time system that automatically collects student engagement and attendance data and provides analytics tools and dashboards for students, teachers and management. See: 5 pressing educational problems Beestar’s Location Intelligence Platform solves.
  • R is a language and environment for statistical computing and graphics. It is a GNU project which is similar to the S language and environment which was developed at Bell Laboratories (formerly AT&T, now Lucent Technologies) by John Chambers and colleagues. R can be considered as a different implementation of S. There are some important differences, but much code written for S runs unaltered under R. R provides a wide variety of statistical (linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering, ...) and graphical techniques, and is highly extensible. R is available as Free Software for data manipulation, calculation and graphical display. It includes
    • an effective data handling and storage facility,
    • a suite of operators for calculations on arrays, in particular matrices,
    • a large, coherent, integrated collection of intermediate tools for data analysis,
    • graphical facilities for data analysis and display either on-screen or on hardcopy, and
    • a well-developed, simple and effective programming language which includes conditionals, loops, user-defined recursive functions and input and output facilities.
    R can be considered as an environment within which statistical techniques are implemented. R can be extended via packages. For example, try:
    • RQDA
    • CRAN Task View: Natural Language Processing
  • Run SQL queries on APIs, JSON / XML / RSS feeds, Web pages (tables), EVERYTHING!
  • SAM includes a set of visualizations of learner activities to increase awareness and to support self-reflection. These are implemented as widgets in the ROLE project.
  • SATO is a multi-purpose text mining tool: it includes, for example, concordancing, lexical inventorying, annotation and categorization. It allows you to mark up text with variables for further analysis.
    SATO is a web-based text analysis tool using a command line language. So far, only a French interface exists. A commercial version exists, i.e. you can buy a license to install the same system on your own server.
  • Scrapy is a fast, high-level screen scraping and web crawling framework, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing. Features (a minimal spider sketch follows the list):
    • Scrapy was designed with simplicity in mind, by providing the features you need without getting in your way
    • Just write the rules to extract the data from web pages and let Scrapy crawl the entire web site for you
    • Scrapy is used in production crawlers to completely scrape more than 500 retailer sites daily, all in one server
    • Scrapy was designed with extensibility in mind and so it provides several mechanisms to plug new code without having to touch the framework core
    • Scrapy is completely written in Python and runs on Linux, Windows, Mac and BSD
    • Scrapy comes with lots of functionality built in. Check this section of the documentation for a list of them.
    • Scrapy is extensively documented and has a comprehensive test suite with very good code coverage
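    The minimal spider sketch announced above (the start URL and CSS selectors are placeholders; response.follow assumes a reasonably recent Scrapy release):

      # Minimal Scrapy spider: crawl a (placeholder) site and extract page titles
      import scrapy

      class TitleSpider(scrapy.Spider):
          name = "titles"
          start_urls = ["https://example.org/"]      # placeholder start page

          def parse(self, response):
              # extract structured data from the current page
              yield {"url": response.url,
                     "title": response.css("title::text").get()}
              # follow internal links so Scrapy crawls the rest of the site
              for href in response.css("a::attr(href)").getall():
                  yield response.follow(href, callback=self.parse)

    Such a spider can be run with, e.g., scrapy runspider title_spider.py -o titles.csv (file names assumed).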
  • SNAPP essentially serves as a diagnostic instrument, allowing teaching staff to evaluate student behavioural patterns against learning activity design objectives and intervene as required in a timely manner.
  • Tableau software helps people communicate data through an innovation called VizQL, a visual query language that converts drag-and-drop actions into data queries, allowing users to quickly find and share insights in their data. With Tableau, “data workers” first connect to data stored in files, cubes, databases, warehouses, Hadoop technologies, and even some cloud sources like Google Analytics. They then interact with the Tableau user interface to simultaneously query the data and view the results in charts, graphs, and maps that can be arranged together on dashboards. (Jones, 2014: 15)
    Basically, one has to install a desktop application (Win/Mac) and create a visualization. The result then can be published either on their public server or on your own server (commercial).
  • Tabula is a free, open source tool that allows you to easily take data out of PDF files and into Excel, database programs, and web applications. Tabula allows users to upload their documents, indicate the position of the tables they want and extract the data right into a Comma-Separated Values (CSV) or Tab-Separated Values (TSV) file, or just copy the text as CSV to the clipboard. Tabula can repeat the operation on several pages or documents.
  • TAPoRware is a set of text analysis tools that enables users to perform text analysis on HTML, XML and plain text files, using documents from the users' machine or on the web. There are five families of tools: for HTML, XML, Text, Other and Beta. A list is included below in the free text section.
  • TextSTAT is a simple text analysis program; its main functionality is concordancing. Quote from the home page (11/2014): “TextSTAT is a simple programme for the analysis of texts. It reads plain text files (in different encodings) and HTML files (directly from the internet) and it produces word frequency lists and concordances from these files. This version includes a web-spider which reads as many pages as you want from a particular website and puts them in a TextSTAT-corpus. The new news-reader, too, puts news messages in a TextSTAT-readable corpus file. TextSTAT reads MS Word and OpenOffice files. No conversion needed, just add the files to your corpus...”
  • The Apache OpenNLP library is a machine learning based toolkit for the processing of natural language text. It supports the most common NLP tasks, such as tokenization, sentence segmentation, part-of-speech tagging, named entity extraction, chunking, parsing, and coreference resolution. These tasks are usually required to build more advanced text processing services. OpenNLP also includes maximum entropy and perceptron based machine learning.
  • The Dragon Toolkit is a Java-based development package for academic use in information retrieval (IR) and text mining (TM, including text classification, text clustering, text summarization, and topic modeling). It is tailored for researchers who work on large-scale IR and TM and prefer Java programming. Moreover, different from Lucene and Lemur, it provides built-in support for semantic-based IR and TM. The Dragon Toolkit seamlessly integrates a set of NLP tools, which enable the toolkit to index text collections with various representation schemes including words, phrases, ontology-based concepts and relationships. (dragon.ischool.drexel.edu, retrieved March 2014)
  • The goal of the SEMantic simILARity software toolkit (SEMILAR; pronounced the same way as the word 'similar') is to promote productive, fair, and rigorous research advancements in the area of semantic similarity. The kit is available as application software or as a Java API. As of March 2014, the GUI-based SEMILAR application is only available to a limited number of users who commit to help improve the usability of the interface. The Java library (API), however, can be downloaded. SEMILAR comes with various similarity methods based on WordNet, Latent Semantic Analysis (LSA), Latent Dirichlet Allocation (LDA), BLEU, Meteor, Pointwise Mutual Information (PMI), dependency-based methods, optimized methods based on Quadratic Assignment, etc. The similarity methods work at different granularities - word to word, sentence to sentence, or bigger texts. Some methods have their own variations, which, coupled with parameter settings and your selection of preprocessing steps, could result in a huge space of possible instances of the same basic method.
  • The Learning Analytics Enriched Rubric (LA e-Rubric) is an advanced grading method used for criteria-based assessment. As a rubric, it consists of a set of criteria. For each criterion, several descriptive levels are provided. A numerical grade is assigned to each of these levels. An enriched rubric contains some criteria and related grading levels that are associated to data from the analysis of learners’ interaction and learning behavior in a Moodle course, such as number of post messages, times of accessing learning material, assignments grades and so on. Using learning analytics from log data that concern collaborative interactions, past grading performance and inquiries of course resources, the LA e-Rubric can automatically calculate the score of the various levels per criterion. The total rubric score is calculated as a sum of the scores per each criterion.
  • The tm package provides a framework for text mining applications within R. The tm package offers functionality for managing text documents, abstracts the process of document manipulation and eases the usage of heterogeneous text formats in R. The package provides native support for reading several classic file formats such as plain text, PDF, or XML files. There is also a plug-in mechanism to handle additional file formats. The data structures and algorithms can be extended to fit custom demands.
  • Tropes is a free text-analysis (text-mining) program. Its features include the ability to carry out stylistic, syntactic and semantic analyses and to present the results in graph and table form. Tropes can yield information about a text such as stylistic/rhetorical analyses (argumentative, enunciative, descriptive or narrative style). It can also identify different word categories (verbs, connectors, personal pronouns, modalities, qualifying adjectives), conduct thematic analyses (reference fields), and detect discursive/chronological structures.
  • Unstructured Information Management Architecture (UIMA) is a component framework for analyzing unstructured content such as text, audio and video. It was originally developed by IBM. UIMA enables applications to be decomposed into components, for example “language identification” => “language specific segmentation” => “sentence boundary detection”, and so on. Each component implements interfaces defined by the framework and provides self-describing metadata via XML descriptor files. The framework also provides capabilities to wrap components as network services, and can scale to very large volumes by replicating processing pipelines over a cluster of networked nodes.
  • Voyeur is a web-based text analysis environment. It is designed to be user-friendly, flexible and powerful. Voyeur is part of Hermeneuti.ca, a collaborative project to develop and theorize text analysis tools and text analysis rhetoric; see Voyeur Tools: See Through Your Texts (retrieved 3/2014). In Voyeur, you can
    • use texts in a variety of formats including plain text, HTML, XML, PDF, RTF and MS Word
    • use texts from different locations, including URLs and uploaded files
    • perform lexical analysis including the study of frequency and distribution data; in particular export data into other tools (as XML, tab separated values, etc.)
    • embed live tools into remote web sites that can accompany or complement your own content
  • Web-Harvest is an open-source Web data extraction tool written in Java. It offers a way to collect desired Web pages and extract useful data from them. In order to do that, it leverages well-established techniques and technologies for text/XML manipulation such as XSLT, XQuery and regular expressions. Web-Harvest mainly focuses on HTML/XML-based web sites, which still make up the vast majority of the Web content. On the other hand, it could easily be supplemented by custom Java libraries in order to augment its extraction capabilities.
  • Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code. Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. It is also well-suited for developing new machine learning schemes.
    Weka 3.7 (still beta in Oct. 2014) includes a package system that allows adding functionality without recompiling the system. As of summer 2014, most people seem to use this developer version. Weka is a very popular free data mining tool that includes advanced text mining features.
  • Gephi is open-source software for visualizing and analysing large network graphs. Gephi uses a 3D render engine to display graphs in real time and speed up exploration. You can use it to explore, analyse, spatialise, filter, cluster, manipulate and export all types of graphs.
  • Wordstat is a commercial text-mining and content analysis software. It is integrated with the QDA Miner and SimStat products from the same company. Quote from the official page: “WordStat is a flexible and easy-to-use text analysis software – whether you need text mining tools for fast extraction of themes and trends, or careful and precise measurement with state-of-the-art quantitative content analysis tools. WordStat‘s seamless integration with SimStat – our statistical data analysis tool – and QDA Miner – our qualitative data analysis software – gives you unprecedented flexibility for analyzing text and relating its content to structured information, including numerical and categorical data.”

Blog

Last 10 posts from Blog:DKS, my wikilog:


Other

Needed a new desktop-replacement laptop

  • It should be 3D enabled (both CAD and gaming 3D) and have both good OpenGL and ActiveX support
  • Fast CPU, e.g. i7-2630QM (quad-core) or better
  • Biggest possible display
  • An SSD (for quick installs and fast project startup)

I finally got a DELL M6600 with a M4000 graphics card (DELL gives universities huge bargains for some models). See an owner's review at notebookreview.com. So far, I am happy with it, except for its low "HD" 1920x1080 screen resolution and the fact that I cannot connect it to our HD projector via HDMI. I'll have to investigate the latter - 18:32, 29 March 2012 (CEST).

Alternatives I considered:

Gaming/multimedia laptops (good enough for the little CAD I do)
  • Alienware M17x (too heavy, good gaming 3D Radeon HD 5870 , good screen)
  • Asus G73 series, e.g. G73JW (cheap, only hdtv 1080p, quite heavy, ok screen, good gaming 3D NVIDIA GeForce GTX 460M)
  • Acer Aspire Ethos (cheap, slim, slow 3D, bad screen)
  • Apple MacBook Pro (slim, don't trust win drivers, slow 3D Radeon 6750M)
  • Clevo (also sold as Sager or XMG Schenker, plus other brands) 17 in. and above series. Flexible configurations. Probably the fastest laptops, various i7 chips, GeForce GTX 560M/580M for the high-end models, 18.4 in FHD (1920x1080). No better resolution? Cheaper than comparable "brands". Resellers in Germany: Schenker Notebook, Deviltech, Notebookguru.de
  • Samsung Series 7 Gamer notebooks, in various variants. The 700G7A with Radeon HD 6970M, 4GB, costs about EUR 2000 and weighs 4 kg or more.
  • Sony Vaio, VPC-F22S1E or similar.
CAD laptops (certified, also better OpenGL support)
  • DELL Precision M6600, Nvidia Quadro 4000M or 5010M (too expensive!), HD (1920x1080), CPU: i7-2720QM (or better).
  • HP EliteBook 8740W, 8760W (UWVA display, starts at 3.5 kg, various Nvidia Quadro GPUs (e.g. Quadro 4000M), various CPUs, e.g. i7-2630QM)
  • Xirios W series from Schenker. Various configurations. An almost top W701 mobile workstation model with a GTX 580M, Intel i7-2760QM, 8GB RAM, 300GB SSD etc. is about EUR 2500 (or about 2800 CHF). Schenker used to sell much more expensive Quadro-based models, e.g. the W710 (can't find them anymore).

None has what I would call a decent screen resolution. The current trend is in fact towards less (i.e. HDTV 1920x1080 or 1600x900). There seems to be a fair market for custom-built laptops that may include larger screens, but I won't trust any vendor that is not local (in case I have to return the unit for a quick repair).