Methodology tutorial - quantitative data analysis: Difference between revisions

The educational technology and digital learning wiki
 
m (Text replacement - "<pageby nominor="false" comments="false"/>" to "<!-- <pageby nominor="false" comments="false"/> -->")
 
(29 intermediate revisions by 2 users not shown)
Line 1: Line 1:
{{Incomplete}}
{{Incomplete}}
{{under construction}}


<pageby nominor="false" comments="false"/>
<!-- <pageby nominor="false" comments="false"/> -->


== Quantitative data analysis ==
This is part of the [[methodology tutorial]] (see its table of contents).
 
== Introduction ==
 
This tutorial is a short introduction to simple (rather confirmatory) statistics for beginners.
 
<div class="tut_goals">
; Learning goals
* Understand the importance of data assumptions, e.g. understand the bad influence of "extreme cases"
* Understand the "find structure" principle of statistical data analysis
* Be able to identify the major stages of (simple) statistical analysis
* Know the difference between the four kinds of statistical coefficients
* Be able to select a procedure for bi-variate analysis according to data types
* Understand crosstabulation
* Understand analysis of variance
* Understand simple regression analysis
 
; Prerequisites
* [[Methodology tutorial - empirical research principles]]
* [[Methodology tutorial - theory-driven research designs]]
* [[Methodology tutorial - descriptive statistics and scales]]
 
; Moving on
* [[Methodology tutorial - exploratory data analysis]]
; Level and target population
* Beginners
; Quality
* to be improved, but usable


This is part of the [[methodology tutorial]] (see its table of contents).
</div>


== Scales and "data assumptions" ==
== Scales and "data assumptions" ==
Line 12: Line 38:
=== Types of quantitative measures (scales) ===
=== Types of quantitative measures (scales) ===


{| border="1"
Quantitative data come in different '''types''' or forms, as we have seen in the [[Methodology tutorial - descriptive statistics and scales|descriptive statistics and scales tutorial]].
! rowspan="1" colspan="1" |
 
Types of measures
Let's recall the three data types:
! rowspan="1" colspan="1" |
* nominal, i.e. categorized observations (e.g. country names)
Description
* ordinal, i.e. rankings
! rowspan="1" colspan="1" |
* interval, i.e. quantitative scaled observations (e.g. a score)
Examples
|-
| rowspan="1" colspan="1" |
nominal<br /> or category
| rowspan="1" colspan="1" |
enumeration of categories
| rowspan="1" colspan="1" |
male, female<br /> district A, district B<br /> software widget A, widget B
|-
| rowspan="1" colspan="1" |
ordinal
| rowspan="1" colspan="1" |
ordered scales
| rowspan="1" colspan="1" |
1st, 2nd, 3rd
|-
| rowspan="1" colspan="1" |
interval<br /> or quantitative<br /> or "scale" (in SPSS)
| rowspan="1" colspan="1" |
measure with an interval
| rowspan="1" colspan="1" |
1, 10, 5, 6 (on a scale from 1-10)<br /> 180cm, 160cm, 170cm
|}


* For each type of measure or combinations of types of measure you will have to use
For each type of measure or combinations of types of measures, you will have to use particular analysis techniques. In other words, most statistical procedures only work with certain kinds of data types.
different analysis techniques.
* For interval variables you have a bigger choice of statistical techniques.
** Therefore scales like (1) strongly agree, (2) agree, (3) somewhat agree, etc. usually
are treated as interval variables.


There is a bigger choice of statistical techniques for quantitative (interval) variables. Therefore scales like (1) strongly agree, (2) agree, (3) somewhat agree, etc. are usually treated as interval variables, although strictly speaking they are ordinal.


Data types are not the only technical constraints for the selection of a statistical procedure; sample size and data assumptions are others.


=== Data assumptions ===
=== Data assumptions ===


* not only you have to adapt your analysis techniques to types of measures but they also
In addition to their data types, many statistical analysis types only work for given sets of data distributions and relations between variables.
(roughly) should respect other data assumptions.


In practical terms this means that not only do you have to adapt your analysis techniques to types of measures, but you should also (roughly) respect other data assumptions.


; Linearity


=== Linearity ===
The most frequent assumption about relations between variables is that the relationships are linear.


* Example: Most popular statistical methods for interval data assume '' linear
In the following example the relationship is non-linear: students that show weak daily
relationships'' :
** In the following example the relationship is non-linear: students that show weak daily
computer use have bad grades, but so do the ones that show very strong use.
computer use have bad grades, but so do the ones that show very strong use.
** Popular measures like the Pearson’s r will "not work", i.e. you will have a very weak
correlation and therefore miss this non-linear relationship


[[Image:book-research-design-192.png]]
Popular measures like Pearson’s r correlation will "not work", i.e. you will get a very weak correlation and therefore miss this non-linear relationship.
 
[[Image:non-linear-relation.png]]
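The following minimal sketch illustrates why Pearson’s r misses such a relationship. The data are hypothetical (invented for illustration, not taken from the figure): grades follow an inverted U as daily computer use increases. The helper <code>pearson_r</code> is our own small function, not part of any statistics package.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient computed from its definition."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical data: hours of daily computer use and grades.
# Grades are best for moderate use and drop for both weak and very strong use.
hours = [0, 1, 2, 3, 4, 5, 6, 7, 8]
grades = [6 - 0.25 * (h - 4) ** 2 for h in hours]

r = pearson_r(hours, grades)
print(f"Pearson's r = {r:.3f}")  # practically zero despite a perfect (curved) relation
```

Grades are completely determined by hours of use in this example, yet the linear coefficient is practically zero. A scatter plot would reveal the structure immediately.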


=== Normal distribution ===
; Normal distribution


* Most methods for interval data also require "'' normal distribution'' "
Most methods for interval data also require a so-called ''normal distribution'' (see [[Methodology tutorial - descriptive statistics and scales]]).
* If you have data with "extreme cases" and/or data that is skewed, some individuals will
have much more "weight" than the others.
* Hypothetical example:
** The "red" student who uses the computer for very long hours will determine a positive
correlation and a positive regression slope, whereas the "black" ones suggest a nonexistent
correlation. Mean use of computers does not represent "typical" usage.
** The "green" student however, will not have a major impact on the result, since the
other data are well distributed along the two axes. In this second case the "mean"
represents a "typical" student.


[[Image:book-research-design-193.png]]
If you have data with "extreme cases" and/or data that is skewed (asymmetrical), some individuals will have much more "weight" than the others.
 
Hypothetical example:
* The "red" student who uses the computer for very long hours will lead to a positive correlation and a positive regression slope, whereas the "black" ones alone suggest a nonexistent correlation. Mean use of computers does not represent "typical" usage in this case, since the "red" one "pulls the mean upwards".
* The "green" student however, will not have a major impact on the result, since the other data are well distributed along the two axes. In this second case the "mean" represents a "typical" student.
 
[[Image:non-normal-distribution.png]]
 
In addition, you should also understand that extreme values already have more weight with variance-based analysis methods (e.g. regression analysis, Anova, factor analysis, etc.) since distances are computed as squares.
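The weight of a single extreme case can be checked with a small sketch. All numbers are hypothetical: nine "black" students whose computer use and grades are unrelated, plus one "red" student with extreme values on both variables. The helper <code>pearson_r</code> is our own function.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient computed from its definition."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical "black" students: hours of use and grades are unrelated.
hours_black = [1, 2, 3, 1, 2, 3, 1, 2, 3]
grades_black = [4, 4, 4, 5, 5, 5, 6, 6, 6]

# One hypothetical "red" student with extreme values on both variables.
hours_all = hours_black + [12]
grades_all = grades_black + [10]

r_black = pearson_r(hours_black, grades_black)
r_all = pearson_r(hours_all, grades_all)
mean_black = sum(hours_black) / len(hours_black)
mean_all = sum(hours_all) / len(hours_all)

print(f"r without the extreme case: {r_black:.2f}")
print(f"r with the extreme case:    {r_all:.2f}")
print(f"mean hours: {mean_black:.1f} -> {mean_all:.1f}")
```

One case out of ten is enough to turn a nonexistent correlation into a strong one and to pull the mean upwards, which is why distributions should be inspected before coefficients are interpreted.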


== The principle of statistical analysis ==
== The principle of statistical analysis ==


* The goal of statistical analysis is quite simple: find structure in the data
=== Finding structure ===
 
The goal of statistical analysis is quite simple: find structure in the data. We can express this principle with two synonymous formulas:


DATA = STRUCTURE + NON-STRUCTURE
DATA = STRUCTURE + NON-STRUCTURE


DATA = EXPLAINED VARIANCE + NOT EXPLAINED VARIANCE
DATA = EXPLAINED VARIANCE + NOT EXPLAINED VARIANCE


Example: Simple regression analysis
Example: Simple regression analysis


* DATA = '' predicted'' regression line + '' residuals''
* DATA = ''predicted'' regression line + ''residuals'' (unexplained noise)
* in other words: regression analysis tries to find a line that will maximize prediction
 
and minimize residuals
In other words: regression analysis tries to find a line that will maximize prediction and minimize residuals.


[[Image:book-research-design-194.png]]
[[Image:statistical-structure.png]]
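The decomposition DATA = STRUCTURE + NON-STRUCTURE can be made concrete with a minimal sketch of a simple least-squares regression. The data points are hypothetical and the function name <code>fit_line</code> is our own.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical observations
xs = [1, 2, 3, 4, 5]
ys = [2.0, 4.1, 5.9, 8.2, 9.8]

intercept, slope = fit_line(xs, ys)
predicted = [intercept + slope * x for x in xs]       # STRUCTURE
residuals = [y - p for y, p in zip(ys, predicted)]    # NON-STRUCTURE

mean_y = sum(ys) / len(ys)
ss_total = sum((y - mean_y) ** 2 for y in ys)
ss_explained = sum((p - mean_y) ** 2 for p in predicted)
ss_residual = sum(e ** 2 for e in residuals)
r_squared = ss_explained / ss_total

print(f"y = {intercept:.2f} + {slope:.2f} * x")
print(f"explained: {ss_explained:.3f}, not explained: {ss_residual:.3f}")
print(f"share of variance explained: {r_squared:.3f}")
```

Each observation equals its predicted value plus its residual, and the total sum of squares equals the explained plus the unexplained sum of squares. This is the second formula applied to a regression line.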


== Stages of statistical analysis ==
=== Stages of statistical analysis ===


Note: With statistical data analysis programs you easily can do several steps in one
Let's have a look at what we mean by statistical analysis and what you typically have to do. We shall come back to most stages throughout this tutorial page:
operation.


# Clean your data
# Clean your data
#* Make very sure that your data are correct (e.g. check data transcription)
#* Make very sure that your data are correct (e.g. check data transcription)
#* Make very sure that missing values (e.g. not answered questions in a survey) are
#* Make very sure that missing values (e.g. not answered questions in a survey) are clearly identified as missing data
clearly identified as missing data
# Gain knowledge about your data
# Gain knowledge about your data
#* Make lists of data (for small data sets only !)
#* Make lists of data (for small data sets only !)
#* Produce descriptive statistics, e.g. means, standard-deviations, minima, maxima for
#* Produce descriptive statistics, e.g. means, standard-deviations, minima, maxima for each variable
each variable
#* Produce graphics, e.g. histograms or box plots that show the distribution
#* Produce graphics, e.g. histograms or box plots that show the distribution
# Produce composed scales
# Produce composed scales
#* E.g. create a single variable from a set of questions
#* E.g. create a single variable from a set of questions
# Make graphics or tables that show relationships
# Make graphics or tables that show relationships
#* E.g. Scatter plots for interval data (as in our previous examples) or crosstabulations
#* E.g. Scatter plots for interval data (as in our previous examples) or cross-tabulations
# Calculate coefficients that measure the strength and the structure of a relation
# Calculate coefficients that measure the strength and the structure of a relation
#* Strength examples: Cramer’s V for crosstabulations, or Pearson’s r for interval data
#* Strength examples: Cramer’s V for cross-tabulations, or Pearson’s r for interval data
#* Structure examples: regression coefficient, tables of means in analysis of variance
#* Structure examples: regression coefficient, tables of means in analysis of variance
# Calculate coefficients that describe the percentage of variance explained
# Calculate coefficients that describe the percentage of variance explained
#* E.g. R<sup>2</sup> in a regression analysis
#* E.g. R<sup>2</sup> in a regression analysis
# Compute significance level, i.e. find out if you have the right to interpret the relation
# Compute significance level, i.e. find out if you have the right to interpret the relation
#* E.g. Chi-2 for crosstabs, Fisher’s F in regression analysis
#* E.g. Chi-2 for crosstabs, Fisher’s F in regression analysis


Note: With statistical data analysis programs you easily can do several steps in one
operation.


=== Types of statistical coefficients ===


== Data preparation and composite scale making ==
All statistical analyses produce various kinds (lots) of coefficients, i.e. numbers that summarize certain kinds of information.
 
 
 
=== Statistics programs and data preparation ===
 
Statistics programs
 
* If available, plan to use a real statistics program like SPSS or Statistica
* Good freeware: WinIDAMS (statistical analysis requires the use of a command language)
 
''
http://portal.unesco.org/ci/en/ev.php-URL_ID=2070&amp;URL_DO=DO_TOPIC&amp;URL_SECTION=201.html''
 
* Freeware for advanced statistics and data visualization: R (needs good IT skills !)
 
'' http://lib.stat.cmu.edu/R/CRAN/''
 
* Using programs like Excel will make you lose time
** only use such programs for simple descriptive statistics
** ok if the main thrust of your thesis does not involve any kind of serious data analysis
 
Data preparation
 
* Enter the data
** Assign a number to each response item (planned when you design the questionnaire)
** Enter a clear code for missing values (no response), e.g. -1
* Make sure that your data set is complete and free of errors
** Some simple descriptive statistics (minima, maxima, missing values, etc.) can help
* Learn how to document the data in your statistics program
** Enter labels for variables, labels for responses items, display instructions (e.g.
decimal points to show)
** Define data-types (interval, ordinal or nominal)
 
 
 
=== Composite scales (indicators) ===
 
Basics:
 
* Most scales are made by simply adding the values from different items (sometimes called
"Likert scales")
* Eliminate items that have a high number of non responses
* Make sure to take into account missing values (non responses) when you add up the
responses from the different items
** A real statistics program (SPSS) does that for you
* Make sure when you create your questionnaire that all items use the same range of
response options, else you will need to standardize !!
 
Quality of a scale:
 
* Again: use a published set of items to measure a variable (if available)
** if you do, you can avoid making long justifications !
* Sensitivity: questionnaire scores discriminate
** e.g. if exploratory research has shown a higher degree of presence in one kind of
learning environment than in another one, the results of a presence questionnaire should
demonstrate this.
* Reliability: internal consistency is high
** Intercorrelation between items (alpha) is high
* Validity: results obtained with the questionnaire can be tied to other measures
** e.g. were similar to results obtained by other tools (e.g. in depth interviews),
** e.g. results are correlated with similar variables.
 
 
 
=== The COLLES surveys ===
 
'' http://surveylearning.moodle.com/colles/''
 
* The Constructivist On-Line Learning Environment Surveys include one to measure preferred
(or ideal) experience in a teaching unit. It includes 24 statements measuring 6
dimensions.
* We only show the first two (4 questions concerning relevance and 4 questions concerning
reflection).
* Note that in the real questionnaire you do not show labels like "Items concerning
relevance" or "response codes".
 
{| border="1"
! rowspan="1" colspan="1" |
Statements
! rowspan="1" colspan="1" |
Almost Never
! rowspan="1" colspan="1" |
Seldom
! rowspan="1" colspan="1" |
Some-times
! rowspan="1" colspan="1" |
Often
! rowspan="1" colspan="1" |
Almost Always
|-
! rowspan="1" colspan="1" |
response codes
! rowspan="1" colspan="1" |
1
! rowspan="1" colspan="1" |
2
! rowspan="1" colspan="1" |
3
! rowspan="1" colspan="1" |
4
! rowspan="1" colspan="1" |
5
|-
| rowspan="1" colspan="6" |
Items concerning relevance
|-
| rowspan="1" colspan="1" |
a. my learning focuses on issues that interest me.
| rowspan="1" colspan="1" |
O
| rowspan="1" colspan="1" |
O
| rowspan="1" colspan="1" |
O
| rowspan="1" colspan="1" |
O
| rowspan="1" colspan="1" |
O
|-
| rowspan="1" colspan="1" |
b. what I learn is important for my prof. practice as a trainer.
| rowspan="1" colspan="1" |
O
| rowspan="1" colspan="1" |
O
| rowspan="1" colspan="1" |
O
| rowspan="1" colspan="1" |
O
| rowspan="1" colspan="1" |
O
|-
| rowspan="1" colspan="1" |
c. I learn how to improve my professional practice as a trainer.
| rowspan="1" colspan="1" |
O
| rowspan="1" colspan="1" |
O
| rowspan="1" colspan="1" |
O
| rowspan="1" colspan="1" |
O
| rowspan="1" colspan="1" |
O
|-
| rowspan="1" colspan="1" |
d. what I learn connects well with my prof. practice as a trainer.
| rowspan="1" colspan="1" |
O
| rowspan="1" colspan="1" |
O
| rowspan="1" colspan="1" |
O
| rowspan="1" colspan="1" |
O
| rowspan="1" colspan="1" |
O
|-
| rowspan="1" colspan="6" |
Items concerning Reflection
|-
| rowspan="1" colspan="1" |
... I think critically about how I learn.
| rowspan="1" colspan="1" |
O
| rowspan="1" colspan="1" |
O
| rowspan="1" colspan="1" |
O
| rowspan="1" colspan="1" |
O
| rowspan="1" colspan="1" |
O
|-
| rowspan="1" colspan="1" |
... I think critically about my own ideas.
| rowspan="1" colspan="1" |
O
| rowspan="1" colspan="1" |
O
| rowspan="1" colspan="1" |
O
| rowspan="1" colspan="1" |
O
| rowspan="1" colspan="1" |
O
|-
| rowspan="1" colspan="1" |
... I think critically about other students' ideas.
| rowspan="1" colspan="1" |
O
| rowspan="1" colspan="1" |
O
| rowspan="1" colspan="1" |
O
| rowspan="1" colspan="1" |
O
| rowspan="1" colspan="1" |
O
|-
| rowspan="1" colspan="1" |
... I think critically about ideas in the readings.
| rowspan="1" colspan="1" |
O
| rowspan="1" colspan="1" |
O
| rowspan="1" colspan="1" |
O
| rowspan="1" colspan="1" |
O
| rowspan="1" colspan="1" |
O
|}
 
 
 
=== Algorithm to compute each scale: ===
 
for each individual add response codes and divide by number of items
 
or use a "means" function in your software package:
 
relevance = mean (a, b, c, d)
 
Examples:
 
 
 
=== Individual A ===
 
who answered a=sometimes, b=often, c=almost always, d= often gives:
 
(3 + 4 + 5 + 4 ) / 4 = 4
 
Missing values (again)
 
* Make sure that you do not add "missing values"
 
 
 
=== Individual B ===
 
who answered a=sometimes, b=often, c=almost always, d=missing gives:
 
(3 + 4 + 5) / 3 = 4
 
 
 
=== and certainly NOT: ===
 
(3 + 4 + 5 + 0) / 4 or (3 + 4 + 5 -1) / 4 !!
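The rule for individuals A and B can be sketched as a small function. It assumes that missing values are coded as -1, as suggested in the data preparation section. The name <code>scale_mean</code> is our own and does not refer to a function of any statistics package.

```python
MISSING = -1  # assumed code for "no response"

def scale_mean(responses, missing=MISSING):
    """Mean of the valid response codes; missing values are left out entirely."""
    valid = [code for code in responses if code != missing]
    if not valid:
        return None  # no valid answer at all: the scale value is itself missing
    return sum(valid) / len(valid)

# Response codes for items a, b, c, d of the relevance scale
individual_a = [3, 4, 5, 4]        # sometimes, often, almost always, often
individual_b = [3, 4, 5, MISSING]  # item d not answered

relevance_a = scale_mean(individual_a)
relevance_b = scale_mean(individual_b)
print(relevance_a)  # 4.0
print(relevance_b)  # 4.0, since (3 + 4 + 5) / 3
```

Dividing by the number of valid answers rather than by the number of items is what keeps the missing-value code out of the result.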
 
 


== Overview on statistical methods and coefficients ==
Always make sure to use only coefficients that are appropriate for your data.


There are four big kinds of coefficients and you find these in most analysis methods:


; (1) Strength of a relation
* Coefficients usually range from '' -1'' (total negative relationship) to '' +1'' (total positive relationship). '' 0'' means no relationship.


=== Descriptive statistics ===
; (2) Structure (tendency) of a relation
* Summarizes a trend


* Descriptive statistics are not very interesting in most cases<br /> (unless they are
; (3) Percentage of variance explained
used to compare different cases in comparative systems designs)
* Tells how much structure is in your model
* Therefore, do not fill up pages of your thesis with tons of Excel diagrams !!


Some popular summary statistics for interval variables
; (4) Significance level of your model
* Gives the chance that you are in fact gambling, i.e. that the relation you found is due to chance
* Typically in the social sciences a significance level lower than 5% (0.05) is acceptable. Do not interpret results that are above this threshold!


* Mean
These four types are mathematically connected: e.g. the significance level depends not only on the size of your sample, but also on the strength of a relation.
* Median: the data point that is in the middle of "low" and "high" values
* Standard deviation: roughly the typical deviation from the mean, i.e. how far a typical data point
is away from the mean.
* High and Low value: extremes at both ends
* Quartiles: same thing as median for 1/4 intervals
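These summary statistics can be computed with any statistics program. As a minimal sketch, Python's standard <code>statistics</code> module is used below on hypothetical scores. Note that quartile values depend on the interpolation method, so other programs may report slightly different numbers.

```python
import statistics

# Hypothetical scores of six respondents on an interval variable
scores = [2, 4, 4, 5, 7, 9]

summary = {
    "mean": statistics.mean(scores),
    "median": statistics.median(scores),
    "standard deviation": statistics.stdev(scores),  # sample standard deviation
    "low": min(scores),
    "high": max(scores),
    "quartiles": statistics.quantiles(scores, n=4),  # three cut points
}

for name, value in summary.items():
    print(f"{name}: {value}")
```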


=== Overview of statistical methods ===


Statistical data analysis methods can be categorized according to data types we introduced in the beginning of this tutorial module.


=== Which data analysis for which data types? ===
The following table shows a few popular simple bi-variate analysis methods for a given independent (explaining) variable X and a dependent (to be explained) variable Y.
 
Popular bi-variate analysis


{| border="1"
{| border="1"
Line 414: Line 162:
|-
|-
! rowspan="1" colspan="1" |
! rowspan="1" colspan="1" |
Quantitative(interval)
Quantitative (interval)
! rowspan="1" colspan="1" |
! rowspan="1" colspan="1" |
Qualitative(nominal or ordinal)
Qualitative (nominal or ordinal)
|-
|-
| rowspan="2" colspan="1" |
! rowspan="2" colspan="1" |
Independent(explaining) <br /> variable X
Independent (explaining) <br /> variable X
| rowspan="1" colspan="1" |
! rowspan="1" colspan="1" |
Quantitative
Quantitative
| rowspan="1" colspan="1" |
| rowspan="1" colspan="1" |
Correlation and Regression
Correlation and Regression
| rowspan="1" colspan="1" |
| rowspan="1" colspan="1" |
Transform X into a qualitative variable and see below
Logistic regression
|-
|-
| rowspan="1" colspan="1" |
! rowspan="1" colspan="1" |
Qualitative
Qualitative
| rowspan="1" colspan="1" |
| rowspan="1" colspan="1" |
Line 435: Line 183:
|}
|}


Popular multi-variate analysis
Popular multi-variate analysis


{| border="1"
{| border="1"
Line 444: Line 192:
|-
|-
! rowspan="1" colspan="1" |
! rowspan="1" colspan="1" |
Quantitative(interval)
Quantitative (interval)
! rowspan="1" colspan="1" |
! rowspan="1" colspan="1" |
Qualitative(nominal or ordinal)
Qualitative (nominal or ordinal)
|-
|-
| rowspan="2" colspan="1" |
! rowspan="2" colspan="1" |
Independent(explaining) <br /> variable X
Independent (explaining) <br /> variable X
| rowspan="1" colspan="1" |
! rowspan="1" colspan="1" |
Quantitative
Quantitative
| rowspan="1" colspan="1" |
| rowspan="1" colspan="1" |
Factor Analysis, <br /> multiple regression, SEM, Cluster Analysis,
Factor Analysis, multiple regression, SEM, Cluster Analysis,
| rowspan="1" colspan="1" |
| rowspan="1" colspan="1" |
Transform X into a qualitative variable and see below
Logit. Alternatively, transform X into a qualitative variable and see below, or split a variable into several dichotomous (yes/no) variables and see to the left.
|-
|-
| rowspan="1" colspan="1" |
! rowspan="1" colspan="1" |
Qualitative
Qualitative
| rowspan="1" colspan="1" |
| rowspan="1" colspan="1" |
Line 466: Line 214:




== Crosstabulation ==


=== Types of statistical coefficients: ===
Crosstabulation is a popular technique to study relationships between nominal (categorical) or ordinal variables.


* First of all make sure that the coefficient you use is more or less appropriate for you
=== The principle of cross tabulation analysis ===
data


The big four:
Crosstabulation is simple, but beginners nevertheless often get it wrong. You have to remember the basic objective of simple data analysis: explain variable Y with variable X.


# Strength of a relation
; Computing the percentages (probabilities)
#* Coefficients usually range from '' -1'' (total negative relationship) to '' +1'' (total
positive relationship). '' 0'' means no relationship.
# Structure (tendency) of a relation
# Percentage of variance explained
# Significance level of your model
#* Gives the chance that you are in fact gambling
#* Typically in the social sciences a sig. level lower than 5% (0.05) is acceptable
#* Do not interpret results that are above this threshold !


These four are mathematically connected:
Since you want to know the probability (percentage) that a value of X leads to a value of Y, you will have to compute percentages in order to be able to "talk about probabilities".


E.g. Significance is not just dependent on the size of your sample, but also on the
In a tabulation, the X variable is usually put on top (i.e. its values show in columns) but you can do it the other way round. Just make sure that you get the percentages right !
strength of a relation.


; Steps:
* Compute percentages across each item of X (i.e. "what is the probability that a value of X leads to a value of Y")
* Then compare (interpret) percentages across each item of the dependent (to be explained) variable
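The first step can be sketched in a few lines. The counts are hypothetical and the function name <code>percent_within_x</code> is our own. Percentages are computed within each category of X, so that each X category adds up to 100%.

```python
def percent_within_x(counts):
    """Turn counts[y][x] into percentages computed within each category of X."""
    x_labels = list(next(iter(counts.values())).keys())
    x_totals = {x: sum(row[x] for row in counts.values()) for x in x_labels}
    return {
        y: {x: 100.0 * row[x] / x_totals[x] for x in x_labels}
        for y, row in counts.items()
    }

# Hypothetical counts: Y (effect or not) by X (treatment or not)
counts = {
    "effect":    {"treatment": 30, "no treatment": 10},
    "no effect": {"treatment": 20, "no treatment": 40},
}

percentages = percent_within_x(counts)
for y, row in percentages.items():
    print(y, {x: f"{p:.1f}%" for x, p in row.items()})
```

The second step is then a comparison across X for each value of Y: here 60% of the treated group shows the effect against 20% of the non-treated group.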


Let's recall the simple experimentation paradigm in which most statistical analysis is grounded since research is basically about comparison. Note: X is put to the left (not on top):


== Crosstabulation ==
{| border="1"
! rowspan="1" colspan="1" |Treatment
! rowspan="1" colspan="1" |effect (O)
! rowspan="1" colspan="1" |non-effect (O)
! rowspan="1" colspan="1" |Total effect<br/>for a group
|-
| rowspan="1" colspan="1" |treatment: (group X)
| rowspan="1" colspan="1" |bigger
| rowspan="1" colspan="1" |smaller
| rowspan="1" colspan="1" |100 %
|-
| rowspan="1" colspan="1" |non-treatment: (group non-X)
| rowspan="1" colspan="1" |smaller
| rowspan="1" colspan="1" |bigger
| rowspan="1" colspan="1" |100 %
|}
 
You have to interpret this table in the following way: The chance that a treatment (X) leads to a given effect (Y) is higher than the chance that a non-treatment will have this effect.
 
Anyhow, a "real" statistical crosstabulation example will be presented below. Let's first discuss a few coefficients that can summarize some important information.


* Crosstabulation is a popular technique to study relationships between nominal
; Statistical association coefficients (there are many!)
(categorical) or ordinal variables


Computing the percentages (probabilities)
* Phi is a chi-square based measure of association and is usually used for 2x2 tables
* The Contingency Coefficient (Pearson's C). The contingency coefficient is an adjustment to phi, intended to adapt it to tables larger than 2-by-2.
* Somers' d is a popular coefficient for ordinal measures (both X and Y). There exist two variants: "Symmetric" and "Y dependent on X".


* See the example on the next slides
; Statistical significance tests
* For each value of the explaining (independent) variable compute the percentages
** Usually the X variable is put on top (i.e. its values show in columns). If you don’t
you have to compute percentages across lines !
** Remember this: you want to know the probability (percentage) that a value of X leads to
a value of Y
* Compare (interpret) percentages across the dependent (to be explained) variable


Statistical association coefficients (there are many!)
Pearson's chi-square is by far the most common. If simply "chi-square" is mentioned, it is probably Pearson's chi-square. This statistic is used to test the hypothesis of no association of columns and rows in tabular data. It can be used with nominal data.


* Phi is a chi-square based measure of association and is usually used for 2x2 tables
; In SPSS
* The Contingency Coefficient (Pearson's C). The contingency coefficient is an adjustment
to phi, intended to adapt it to tables larger than 2-by-2.
* Somers' d is a popular coefficient for ordinal measures (both X and Y). Two variants:
symmetric and Y dependent on X (but less the other way round).


Statistical significance tests
* You will find crosstabs under the menu: ''Analyze->Descriptive statistics->Crosstabs''
* You then must select percentages in "Cells" and coefficients in "Statistics". This will make the analysis "inferential", not just "descriptive".


* Pearson's chi-square is by far the most common. If simply "chi-square" is mentioned, it
=== Crosstabulation - Example 1 ===
is probably Pearson's chi-square. This statistic is used to test the hypothesis of no
association of columns and rows in tabular data. It can be used with nominal data.


We want to know if ICT training will explain use of presentation software in the classroom.


There are two survey questions:
# Did you receive some formal ICT training ?
# Do you use a computer to prepare slides for classroom presentations ?


=== Crosstabulation: Did you receive some computer training? * Create documents to display in class ===
Now let's examine the results:


{| border="1"
{| border="1"
| rowspan="2" colspan="3" |
| rowspan="2" colspan="3" |
 
| rowspan="1" colspan="2" |X= Did you receive some formal ICT training ?
| rowspan="1" colspan="2" |
| rowspan="1" colspan="1" |Total
X= Did you receive some computer training ?
| rowspan="1" colspan="1" |
Total
|-
|-
| rowspan="1" colspan="1" |No
| rowspan="1" colspan="1" |Yes
| rowspan="1" colspan="1" |
| rowspan="1" colspan="1" |
No
| rowspan="1" colspan="1" |
Yes
| rowspan="1" colspan="1" |
|-
|-
| rowspan="6" colspan="1" |
| rowspan="8" colspan="1" |Y= Do you use a computer<br/> to prepare slides for<br/> classroom presentations ?
Y= Do you use the computer to create documents to display in class ?
| rowspan="2" colspan="1" |Regularly
| rowspan="1" colspan="1" |
| rowspan="1" colspan="1" |Count
Regularly
| rowspan="1" colspan="1" |4
| rowspan="1" colspan="1" |
| rowspan="1" colspan="1" |45
Count
| rowspan="1" colspan="1" |49
| rowspan="1" colspan="1" |
4
| rowspan="1" colspan="1" |
45
| rowspan="1" colspan="1" |
49
|-
|-
| rowspan="1" colspan="1" |
! rowspan="1" colspan="1" |% within X
 
! rowspan="1" colspan="1" |44.4%
| rowspan="1" colspan="1" |
! rowspan="1" colspan="1" |58.4%
% within X
| rowspan="1" colspan="1" |57.0%
| rowspan="1" colspan="1" |
44.4%
| rowspan="1" colspan="1" |
58.4%
| rowspan="1" colspan="1" |
57.0%
|-
|-
| rowspan="1" colspan="1" |
| rowspan="2" colspan="1" |Occasionally
Occasionally
| rowspan="1" colspan="1" |Count
| rowspan="1" colspan="1" |
| rowspan="1" colspan="1" |4
Count
| rowspan="1" colspan="1" |21
| rowspan="1" colspan="1" |
| rowspan="1" colspan="1" |25
4
| rowspan="1" colspan="1" |
21
| rowspan="1" colspan="1" |
25
|-
|-
| rowspan="1" colspan="1" |
! rowspan="1" colspan="1" |% within X
 
! rowspan="1" colspan="1" |44.4%
| rowspan="1" colspan="1" |
! rowspan="1" colspan="1" |27.3%
% within X
| rowspan="1" colspan="1" |29.1%
| rowspan="1" colspan="1" |
44.4%
| rowspan="1" colspan="1" |
27.3%
| rowspan="1" colspan="1" |
29.1%
|-
|-
| rowspan="1" colspan="1" |
| rowspan="2" colspan="1" |2 Never
2 Never
| rowspan="1" colspan="1" |Count
| rowspan="1" colspan="1" |
| rowspan="1" colspan="1" |1
Count
| rowspan="1" colspan="1" |11
| rowspan="1" colspan="1" |
| rowspan="1" colspan="1" |12
1
| rowspan="1" colspan="1" |
11
| rowspan="1" colspan="1" |
12
|-
|-
| rowspan="1" colspan="1" |
! rowspan="1" colspan="1" |% within X
 
! rowspan="1" colspan="1" |11.1%
| rowspan="1" colspan="1" |
! rowspan="1" colspan="1" |14.3%
% within X
| rowspan="1" colspan="1" |14.0%
| rowspan="1" colspan="1" |
11.1%
| rowspan="1" colspan="1" |
14.3%
| rowspan="1" colspan="1" |
14.0%
|-
|-
| rowspan="1" colspan="1" |
| rowspan="2" colspan="1" |Total
Total
| rowspan="1" colspan="1" |Count
| rowspan="1" colspan="1" |
| rowspan="1" colspan="1" |9
 
| rowspan="1" colspan="1" |77
| rowspan="1" colspan="1" |
| rowspan="1" colspan="1" |86
Count
| rowspan="1" colspan="1" |
9
| rowspan="1" colspan="1" |
77
| rowspan="1" colspan="1" |
86
|-
|-
| rowspan="1" colspan="1" |
! rowspan="1" colspan="1" |% within X
! rowspan="1" colspan="1" |100.0%
! rowspan="1" colspan="1" |100.0%
| rowspan="1" colspan="1" |100.0%
|}


| rowspan="1" colspan="1" |
The probability that computer training ("Yes") leads to superior usage of the computer to prepare documents is very weak (you can see this by comparing the % line by line).


| rowspan="1" colspan="1" |
The statistics tell the same story:
% within X
| rowspan="1" colspan="1" |
100.0%
| rowspan="1" colspan="1" |
100.0%
| rowspan="1" colspan="1" |
100.0%
|}


* The probability that computer training ("yes") leads to superior usage of the computer
* Pearson Chi-Square = 1.15 with a signification= .562
to prepare documents is very weak (you can see this by comparing the % line by line).
** This means that the probability of obtaining a result at least this strong by chance alone (i.e. if there were no relationship) is &gt; 50%, so you cannot conclude that there is a relationship
* Contingency coefficient = 0.115, significance = .562. (same result)


Statistics:
Therefore: Not only is the relationship very weak, but it can '''not''' be interpreted. In other words: There is absolutely no way to assert that ICT training leads to more frequent use of presentation software in our case.
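The reported coefficients can be recomputed from the counts in the table with a short sketch. It uses the standard formulas for Pearson's chi-square and for the contingency coefficient. The closed form used for the significance level is valid only because this table has 2 degrees of freedom; for other table sizes a statistics program or library is needed.

```python
from math import exp, sqrt

# Counts from the table above.
# Rows = Y (Regularly, Occasionally, Never), columns = X (No, Yes)
observed = [
    [4, 45],
    [4, 21],
    [1, 11],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Pearson's chi-square: compare observed counts with the counts
# expected if X and Y were unrelated.
chi_square = 0.0
for i, row in enumerate(observed):
    for j, count in enumerate(row):
        expected = row_totals[i] * col_totals[j] / n
        chi_square += (count - expected) ** 2 / expected

degrees_of_freedom = (len(observed) - 1) * (len(observed[0]) - 1)
assert degrees_of_freedom == 2  # the shortcut below holds for 2 degrees of freedom only
significance = exp(-chi_square / 2)

# Contingency coefficient (Pearson's C)
contingency_c = sqrt(chi_square / (chi_square + n))

print(f"chi-square = {chi_square:.2f}")                # 1.15
print(f"significance = {significance:.3f}")            # 0.562
print(f"contingency coefficient = {contingency_c:.3f}")  # 0.115
```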


* Pearson Chi-Square = 1.15 with a signification= .562
=== Crosstabulation - Example 2 ===
** This means that the probability of obtaining such a result by chance alone is &gt; 50% and you have to
reject the idea of a relationship
* Contingency coefficient = 0.115, significance = .562
** Not only is the relationship very weak, but it also can’t be interpreted


We want to know if the teacher's belief that students will gain autonomy when using Internet resources has an influence on classroom practice, i.e. whether the teacher organizes activities where learners have to search for information on the Internet.


(translation needed)


=== Crosstabulation: For the student, using network resources fosters autonomy in learning * Search for information on the Internet ===
* X = Teacher's belief: Learners will gain autonomy through using Internet resources
* Y = Classroom activities: Search information on the Internet


{| border="1"
{| border="1"
| rowspan="2" colspan="3" |
| rowspan="2" colspan="3" |
 
| rowspan="1" colspan="4" |X= Learners will gain autonomy through using Internet resources (teacher belief)
| rowspan="1" colspan="4" |
X= Pour l'élève, le recours aux ressources de réseau favorise l'autonomie dans
l'apprentissage
| rowspan="1" colspan="1" |
| rowspan="1" colspan="1" |
|-
|-
| rowspan="1" colspan="1" |
| rowspan="1" colspan="1" |0 Fully disagree
0 Tout à fait en désaccord
| rowspan="1" colspan="1" |1 Rather disagree
| rowspan="1" colspan="1" |
| rowspan="1" colspan="1" |2 Rather agree
1 Plutôt en désaccord
| rowspan="1" colspan="1" |3 Fully agree
| rowspan="1" colspan="1" |
| rowspan="1" colspan="1" |Total
2 Plutôt en accord
| rowspan="1" colspan="1" |
3 Tout à fait en accord
| rowspan="1" colspan="1" |
Total
|-
|-
| rowspan="6" colspan="1" |
| rowspan="6" colspan="1" |Y= Search information<br/> on the Internet
Y= Rechercher des informations sur Internet
| rowspan="2" colspan="1" |0 Regularly
| rowspan="2" colspan="1" |
| rowspan="1" colspan="1" |Count
0 Régulièrement
| rowspan="1" colspan="1" |0
| rowspan="1" colspan="1" |
| rowspan="1" colspan="1" |2
Count
| rowspan="1" colspan="1" |9
| rowspan="1" colspan="1" |
| rowspan="1" colspan="1" |11
0
| rowspan="1" colspan="1" |22
| rowspan="1" colspan="1" |
2
| rowspan="1" colspan="1" |
9
| rowspan="1" colspan="1" |
11
| rowspan="1" colspan="1" |
22
|-
|-
| rowspan="1" colspan="1" |
! rowspan="1" colspan="1" |% within X
% within X
! rowspan="1" colspan="1" |.0%
| rowspan="1" colspan="1" |
! rowspan="1" colspan="1" |18.2%
.0%
! rowspan="1" colspan="1" |19.6%
| rowspan="1" colspan="1" |
! rowspan="1" colspan="1" |42.3%
18.2%
! rowspan="1" colspan="1" |25.6%
| rowspan="1" colspan="1" |
19.6%
| rowspan="1" colspan="1" |
42.3%
| rowspan="1" colspan="1" |
25.6%
|-
|-
| rowspan="2" colspan="1" |
| rowspan="2" colspan="1" |1 Occasionally
1 Occasionnellement
| rowspan="1" colspan="1" |Count
| rowspan="1" colspan="1" |
| rowspan="1" colspan="1" |1
Count
| rowspan="1" colspan="1" |7
| rowspan="1" colspan="1" |
| rowspan="1" colspan="1" |23
1
| rowspan="1" colspan="1" |11
| rowspan="1" colspan="1" |
| rowspan="1" colspan="1" |42
7
| rowspan="1" colspan="1" |
23
| rowspan="1" colspan="1" |
11
| rowspan="1" colspan="1" |
42
|-
|-
| rowspan="1" colspan="1" |
! rowspan="1" colspan="1" |% within X
% within X
! rowspan="1" colspan="1" |33.3%
| rowspan="1" colspan="1" |
! rowspan="1" colspan="1" |63.6%
33.3%
! rowspan="1" colspan="1" |50.0%
| rowspan="1" colspan="1" |
! rowspan="1" colspan="1" |42.3%
63.6%
! rowspan="1" colspan="1" |48.8%
| rowspan="1" colspan="1" |
50.0%
| rowspan="1" colspan="1" |
42.3%
| rowspan="1" colspan="1" |
48.8%
|-
|-
| rowspan="2" colspan="1" |
| rowspan="2" colspan="1" |2 Never
2 Jamais
| rowspan="1" colspan="1" |Count
| rowspan="1" colspan="1" |
| rowspan="1" colspan="1" |2
Count
| rowspan="1" colspan="1" |2
| rowspan="1" colspan="1" |
| rowspan="1" colspan="1" |14
2
| rowspan="1" colspan="1" |4
| rowspan="1" colspan="1" |
| rowspan="1" colspan="1" |22
2
| rowspan="1" colspan="1" |
14
| rowspan="1" colspan="1" |
4
| rowspan="1" colspan="1" |
22
|-
|-
| rowspan="1" colspan="1" |
! rowspan="1" colspan="1" |% within X
% within X
! rowspan="1" colspan="1" |66.7%
| rowspan="1" colspan="1" |
! rowspan="1" colspan="1" |18.2%
66.7%
! rowspan="1" colspan="1" |30.4%
| rowspan="1" colspan="1" |
! rowspan="1" colspan="1" |15.4%
18.2%
! rowspan="1" colspan="1" |25.6%
| rowspan="1" colspan="1" |
30.4%
| rowspan="1" colspan="1" |
15.4%
| rowspan="1" colspan="1" |
25.6%
|-
|-
| rowspan="2" colspan="1" |
| rowspan="2" colspan="1" |
 
| rowspan="2" colspan="1" |Total
| rowspan="2" colspan="1" |
| rowspan="1" colspan="1" |Count
Total
| rowspan="1" colspan="1" |3
| rowspan="1" colspan="1" |
| rowspan="1" colspan="1" |11
Count
| rowspan="1" colspan="1" |46
| rowspan="1" colspan="1" |
| rowspan="1" colspan="1" |26
3
| rowspan="1" colspan="1" |86
| rowspan="1" colspan="1" |
11
| rowspan="1" colspan="1" |
46
| rowspan="1" colspan="1" |
26
| rowspan="1" colspan="1" |
86
|-
|-
| rowspan="1" colspan="1" |
! rowspan="1" colspan="1" |% within X
% within X
! rowspan="1" colspan="1" |100.0%
| rowspan="1" colspan="1" |
! rowspan="1" colspan="1" |100.0%
100.0%
! rowspan="1" colspan="1" |100.0%
| rowspan="1" colspan="1" |
! rowspan="1" colspan="1" |100.0%
100.0%
! rowspan="1" colspan="1" |100.0%
| rowspan="1" colspan="1" |
100.0%
| rowspan="1" colspan="1" |
100.0%
| rowspan="1" colspan="1" |
100.0%
|}
|}


* We have a weak significant relationship: the more teachers agree that students will
* We have a weak but significant relationship: the more teachers agree that students will increase their learning autonomy by using Internet resources, the more likely they are to let students do so.
increase learning autonomy from using Internet resources, the more they will let students
do so.


Statistics: Directional Ordinal by Ordinal Measures with Somer’s D
The statistical coefficient we use is Somers' D, a directional measure for ordinal-by-ordinal tables:


{| border="1"
{| border="1"
! rowspan="1" colspan="1" |
! rowspan="1" colspan="1" |Values
Values
! rowspan="1" colspan="1" |Somers' D
! rowspan="1" colspan="1" |
! rowspan="1" colspan="1" |Significance
Somer’s D
! rowspan="1" colspan="1" |
Significance
|-
|-
| rowspan="1" colspan="1" |
| rowspan="1" colspan="1" |Symmetric
Symmetric
| rowspan="1" colspan="1" |-.210
| rowspan="1" colspan="1" |
| rowspan="1" colspan="1" |.025
-.210
| rowspan="1" colspan="1" |
.025
|-
|-
| rowspan="1" colspan="1" |
| rowspan="1" colspan="1" |Y = Search information on the Internet - Dependent
Y = Rechercher des informations sur Internet Dependent
| rowspan="1" colspan="1" |-.215
| rowspan="1" colspan="1" |
| rowspan="1" colspan="1" |.025
-.215
| rowspan="1" colspan="1" |
.025
|}
|}
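To see where these two values come from, here is a minimal Python sketch that recomputes Somers' D from the counts of the crosstabulation above. The function name <code>somers_d</code> is only illustrative, and the formula used is the usual pair-counting definition (concordant minus discordant pairs, divided by the pairs that are not tied on the independent variable); it is not taken from the statistics package that produced the table.

```python
from itertools import product

# Counts from the crosstabulation above.
# Rows: Y = 0 Regularly, 1 Occasionally, 2 Never
# Columns: X = 0 Fully disagree, 1 Rather disagree, 2 Rather agree, 3 Fully agree
table = [
    [0, 2, 9, 11],
    [1, 7, 23, 11],
    [2, 2, 14, 4],
]

def somers_d(table):
    """Return (d with Y dependent, symmetric d) for a table of counts."""
    n_rows, n_cols = len(table), len(table[0])
    concordant = discordant = 0
    # Compare every cell with every cell in a lower row.
    for i, j in product(range(n_rows), range(n_cols)):
        for k, l in product(range(i + 1, n_rows), range(n_cols)):
            if l > j:        # higher on Y and higher on X
                concordant += table[i][j] * table[k][l]
            elif l < j:      # higher on Y but lower on X
                discordant += table[i][j] * table[k][l]
    n = sum(map(sum, table))
    total_pairs = n * (n - 1) // 2
    # Pairs tied on X (same column) and pairs tied on Y (same row)
    tied_x = sum(c * (c - 1) // 2 for c in (sum(col) for col in zip(*table)))
    tied_y = sum(r * (r - 1) // 2 for r in (sum(row) for row in table))
    untied_x = total_pairs - tied_x
    untied_y = total_pairs - tied_y
    d_y_dependent = (concordant - discordant) / untied_x
    d_symmetric = 2 * (concordant - discordant) / (untied_x + untied_y)
    return d_y_dependent, d_symmetric

d_y_dependent, d_symmetric = somers_d(table)
print(round(d_y_dependent, 3), round(d_symmetric, 3))
```

The sign is negative only because Y is coded from 0 (regularly) to 2 (never): stronger agreement goes with lower Y codes, i.e. with more frequent Internet searches.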


Therefore, teachers' beliefs do explain something, but the relationship is very weak.
== Simple analysis of variance ==
Analysis of variance (Anova) and its multivariate variant (Manova) are the favorite tools of experimentalists. They are also popular in quasi-experimental research and survey research, as the following example shows.
=== The principle of analysis of variance ===
X is an experimental condition (therefore a nominal variable) and Y usually is an interval variable.
Example: Does presence or absence of ICT usage influence grades ?
* You can show that X has an influence on Y if means achieved by different groups (e.g. ICT vs. non-ICT users) are significantly different.


Significance improves when:
* Means of the X groups are different (the further apart the better)
* Variance inside X groups is low (certainly lower than the overall variance)


== Simple analysis of variance ==
; Coefficients
 
* '''Standard deviation''' is a measure of dispersion (the square root of the variance). Roughly, it is the "typical deviation from the mean", i.e. how far the "typical" individual is from the central point.
 
* '''Eta''' is a correlation coefficient
 
* '''Eta square''' measures the explained variance


* Analysis of variance (and it’s multi-variate variant Anova) are the favorite tools of
; In SPSS
the experimentalists.
* X is an experimental condition (therefore a nominal variable) and Y usually is an
interval variable.
** E.g. Does presence or absence of ICT usage influence grades ?
* You can show that X has an influence on Y if means achieved by different groups (e.g.
ICT vs. non-ICT users) are significantly different.
* Significance improves when:
** means of the X groups are different (the further apart the better)
** variance inside X groups is low (certainly lower than the overall variance)


Analysis of variance can be found in two different locations:


* Analyze->Compare Means
* General linear models (avoid this if you are a beginner)


=== Differences between teachers and teacher students ===
=== Differences between teachers and teacher students ===
In this example we want to know if teacher trainees (e.g. primary teacher students) are different from "real" teachers regarding three kinds of variables:
* Frequency of different kinds of learner activities
* Frequency of exploratory activities outside the classroom
* Frequency of individual student work
COP1, COP2, COP3 are indices (composite variables) that range from  0 (little) to 2 (a lot)
Therefore we compare the average (mean) of the populations for each variable.


{| border="1"
{| border="1"
! rowspan="1" colspan="1" |Population
! rowspan="1" colspan="1" |
! rowspan="1" colspan="1" |
Population
! rowspan="1" colspan="1" |COP1 Frequency of different kinds of learner activities
! rowspan="1" colspan="1" |
! rowspan="1" colspan="1" |COP2 Frequency of exploratory activities outside the classroom
 
! rowspan="1" colspan="1" |COP3 Frequency of individual student work
! rowspan="1" colspan="1" |
COP1 Fréquence de différentes manières de travailler des élèves
! rowspan="1" colspan="1" |
COP2 Fréquence des activités d'exploration à l'extérieur de la classe
! rowspan="1" colspan="1" |
COP3 Fréquence des travaux individuels des élèves
|-
|-
| rowspan="3" colspan="1" |
| rowspan="3" colspan="1" |1 Teacher trainee
1 Etudiant(e) LME
! rowspan="1" colspan="1" |Mean
| rowspan="1" colspan="1" |
| rowspan="1" colspan="1" |1.528
Mean
| rowspan="1" colspan="1" |1.042
| rowspan="1" colspan="1" |
| rowspan="1" colspan="1" |.885
1.528
| rowspan="1" colspan="1" |
1.042
| rowspan="1" colspan="1" |
.885
|-
|-
| rowspan="1" colspan="1" |
| rowspan="1" colspan="1" |N
N
| rowspan="1" colspan="1" |48
| rowspan="1" colspan="1" |
| rowspan="1" colspan="1" |48
48
| rowspan="1" colspan="1" |48
| rowspan="1" colspan="1" |
48
| rowspan="1" colspan="1" |
48
|-
|-
| rowspan="1" colspan="1" |
| rowspan="1" colspan="1" |Std. Deviation
Std. Deviation
| rowspan="1" colspan="1" |.6258
| rowspan="1" colspan="1" |
| rowspan="1" colspan="1" |.6260
.6258
| rowspan="1" colspan="1" |.5765
| rowspan="1" colspan="1" |
.6260
| rowspan="1" colspan="1" |
.5765
|-
|-
| rowspan="3" colspan="1" |
| rowspan="3" colspan="1" |2 Regular teacher
2 Enseignant(e) du primaire
! rowspan="1" colspan="1" |Mean
| rowspan="1" colspan="1" |
| rowspan="1" colspan="1" |1.816
Mean
| rowspan="1" colspan="1" |1.224
| rowspan="1" colspan="1" |
| rowspan="1" colspan="1" |1.224
1.816
| rowspan="1" colspan="1" |
1.224
| rowspan="1" colspan="1" |
1.224
|-
|-
| rowspan="1" colspan="1" |
| rowspan="1" colspan="1" |N
N
| rowspan="1" colspan="1" |38
| rowspan="1" colspan="1" |
| rowspan="1" colspan="1" |38
38
| rowspan="1" colspan="1" |38
| rowspan="1" colspan="1" |
38
| rowspan="1" colspan="1" |
38
|-
|-
| rowspan="1" colspan="1" |
| rowspan="1" colspan="1" |Std. Deviation
Std. Deviation
| rowspan="1" colspan="1" |.3440
| rowspan="1" colspan="1" |
| rowspan="1" colspan="1" |.4302
.3440
| rowspan="1" colspan="1" |.5893
| rowspan="1" colspan="1" |
.4302
| rowspan="1" colspan="1" |
.5893
|-
|-
| rowspan="3" colspan="1" |
| rowspan="3" colspan="1" |Total
Total
! rowspan="1" colspan="1" |Mean
| rowspan="1" colspan="1" |
| rowspan="1" colspan="1" |1.655
Mean
| rowspan="1" colspan="1" |1.122
| rowspan="1" colspan="1" |
| rowspan="1" colspan="1" |1.035
1.655
| rowspan="1" colspan="1" |
1.122
| rowspan="1" colspan="1" |
1.035
|-
|-
| rowspan="1" colspan="1" |
| rowspan="1" colspan="1" |N
N
| rowspan="1" colspan="1" |86
| rowspan="1" colspan="1" |
| rowspan="1" colspan="1" |86
86
| rowspan="1" colspan="1" |86
| rowspan="1" colspan="1" |
86
| rowspan="1" colspan="1" |
86
|-
|-
| rowspan="1" colspan="1" |
| rowspan="1" colspan="1" |Std. Deviation
Std. Deviation
| rowspan="1" colspan="1" |.5374
| rowspan="1" colspan="1" |
| rowspan="1" colspan="1" |.5527
.5374
| rowspan="1" colspan="1" |.6029
| rowspan="1" colspan="1" |
.5527
| rowspan="1" colspan="1" |
.6029
|}
|}


* COP1, COP2, COP3 are composite indicators ranging from 0 (little) to 2 (a lot)
Standard deviations within groups are rather high (in particular for students), which is a bad thing: it means that students differ a lot among themselves.
* The difference for COP2 is not significant (see next slide)
* Standard deviations within groups are rather high (in particular for students), which is
a bad thing: it means that among students they are highly different.


=== Anova Table and measures of associations ===


 
At this stage, all you have to do is look at the ''sig.'' level, which should be below 0.05. In other words, you accept at most a 5% chance that the relationship is random.
=== Anova Table and measures of associations ===


{| border="1"
{| border="1"
! rowspan="1" colspan="1" |Variables (Y) explained by population (X)
! rowspan="1" colspan="1" |
! rowspan="1" colspan="1" |
 
! rowspan="1" colspan="1" |Sum of Squares
! rowspan="1" colspan="1" |
! rowspan="1" colspan="1" |df
 
! rowspan="1" colspan="1" |Mean Square
! rowspan="1" colspan="1" |
! rowspan="1" colspan="1" |F
Sum of Squares
! rowspan="1" colspan="1" |Sig.
! rowspan="1" colspan="1" |
df
! rowspan="1" colspan="1" |
Mean Square
! rowspan="1" colspan="1" |
F
! rowspan="1" colspan="1" |
Sig.
|-
|-
| rowspan="3" colspan="1" |
| rowspan="3" colspan="1" |COP1 Frequency of different kinds of learner activities<br/> * Population
Var_COP1 Fréquence de différentes manières de travailler des élèves * Population_bis
| rowspan="1" colspan="1" |Between Groups
Population
| rowspan="1" colspan="1" |1.759
| rowspan="1" colspan="1" |
| rowspan="1" colspan="1" |1
Between Groups
| rowspan="1" colspan="1" |1.759
| rowspan="1" colspan="1" |
| rowspan="1" colspan="1" |6.486
1.759
| rowspan="1" colspan="1" |.013
| rowspan="1" colspan="1" |
1
| rowspan="1" colspan="1" |
1.759
| rowspan="1" colspan="1" |
6.486
| rowspan="1" colspan="1" |
.013
|-
|-
| rowspan="1" colspan="1" |Within Groups
| rowspan="1" colspan="1" |22.785
| rowspan="1" colspan="1" |84
| rowspan="1" colspan="1" |.271
| rowspan="1" colspan="1" |
| rowspan="1" colspan="1" |
Within Groups
| rowspan="1" colspan="1" |
22.785
| rowspan="1" colspan="1" |
84
| rowspan="1" colspan="1" |
.271
| rowspan="1" colspan="1" |
| rowspan="1" colspan="1" |
| rowspan="1" colspan="1" |
|-
|-
| rowspan="1" colspan="1" |Total
| rowspan="1" colspan="1" |24.544
| rowspan="1" colspan="1" |85
| rowspan="1" colspan="1" |
| rowspan="1" colspan="1" |
Total
| rowspan="1" colspan="1" |
24.544
| rowspan="1" colspan="1" |
85
| rowspan="1" colspan="1" |
| rowspan="1" colspan="1" |
| rowspan="1" colspan="1" |
| rowspan="1" colspan="1" |
| rowspan="1" colspan="1" |
|-
|-
| rowspan="3" colspan="1" |
| rowspan="3" colspan="1" |COP2 Frequency of exploratory activities outside the classroom<br/> * Population
Var_COP2 Fréquence des activités d'exploration à l'extérieur de la classe * Population_bis
| rowspan="1" colspan="1" |Between Groups
Population
| rowspan="1" colspan="1" |.703
| rowspan="1" colspan="1" |
| rowspan="1" colspan="1" |1
Between Groups
| rowspan="1" colspan="1" |.703
| rowspan="1" colspan="1" |
| rowspan="1" colspan="1" |2.336
.703
| rowspan="1" colspan="1" |.130
| rowspan="1" colspan="1" |
1
| rowspan="1" colspan="1" |
.703
| rowspan="1" colspan="1" |
2.336
| rowspan="1" colspan="1" |
.130
|-
|-
| rowspan="1" colspan="1" |Within Groups
| rowspan="1" colspan="1" |25.265
| rowspan="1" colspan="1" |84
| rowspan="1" colspan="1" |.301
| rowspan="1" colspan="1" |
| rowspan="1" colspan="1" |
Within Groups
| rowspan="1" colspan="1" |
25.265
| rowspan="1" colspan="1" |
84
| rowspan="1" colspan="1" |
.301
| rowspan="1" colspan="1" |
| rowspan="1" colspan="1" |
| rowspan="1" colspan="1" |
|-
|-
| rowspan="1" colspan="1" |Total
| rowspan="1" colspan="1" |25.968
| rowspan="1" colspan="1" |85
| rowspan="1" colspan="1" |
| rowspan="1" colspan="1" |
Total
| rowspan="1" colspan="1" |
25.968
| rowspan="1" colspan="1" |
| rowspan="1" colspan="1" |
85
| rowspan="1" colspan="1" |
| rowspan="1" colspan="1" |
| rowspan="1" colspan="1" |
| rowspan="1" colspan="1" |


|-
|-
| rowspan="3" colspan="1" |
| rowspan="3" colspan="1" |COP3 Frequency of individual student work<br/> * Population
Var_COP3 Fréquence des travaux individuels des élèves * Population_bis Population
| rowspan="1" colspan="1" |Between Groups
| rowspan="1" colspan="1" |
| rowspan="1" colspan="1" |2.427
Between Groups
| rowspan="1" colspan="1" |1
| rowspan="1" colspan="1" |
| rowspan="1" colspan="1" |2.427
2.427
| rowspan="1" colspan="1" |7.161
| rowspan="1" colspan="1" |
| rowspan="1" colspan="1" |.009
1
| rowspan="1" colspan="1" |
2.427
| rowspan="1" colspan="1" |
7.161
| rowspan="1" colspan="1" |
.009
|-
|-
| rowspan="1" colspan="1" |Within Groups
| rowspan="1" colspan="1" |28.468
| rowspan="1" colspan="1" |84
| rowspan="1" colspan="1" |.339
| rowspan="1" colspan="1" |
| rowspan="1" colspan="1" |
Within Groups
| rowspan="1" colspan="1" |
28.468
| rowspan="1" colspan="1" |
84
| rowspan="1" colspan="1" |
.339
| rowspan="1" colspan="1" |
| rowspan="1" colspan="1" |
| rowspan="1" colspan="1" |
|-
|-
| rowspan="1" colspan="1" |Total
| rowspan="1" colspan="1" |30.895
| rowspan="1" colspan="1" |85
| rowspan="1" colspan="1" |
| rowspan="1" colspan="1" |
Total
| rowspan="1" colspan="1" |
| rowspan="1" colspan="1" |
30.895
| rowspan="1" colspan="1" |
| rowspan="1" colspan="1" |
85
| rowspan="1" colspan="1" |
| rowspan="1" colspan="1" |
| rowspan="1" colspan="1" |
|}
|}


|-
|-
| rowspan="1" colspan="1" |
| rowspan="1" colspan="1" |
Var_COP1 Fréquence de différentes manières de travailler des élèves * Population
Var_COP1 Frequency of different kinds of learner activities * Population
| rowspan="1" colspan="1" |
| rowspan="1" colspan="1" |
.268
.268
|-
|-
| rowspan="1" colspan="1" |
| rowspan="1" colspan="1" |
Var_COP2 Fréquence des activités d'exploration à l'extérieur de la classe * Population
Var_COP2 Frequency of exploratory activities outside the classroom * Population
| rowspan="1" colspan="1" |
| rowspan="1" colspan="1" |
.164
.164
|-
|-
| rowspan="1" colspan="1" |
| rowspan="1" colspan="1" |
Var_COP3 Fréquence des travaux individuels des élèves * Population
Var_COP3 Frequency of individual student work * Population
| rowspan="1" colspan="1" |
| rowspan="1" colspan="1" |
.280
.280
|}
|}


* associations are weak and explained variance very weak
Result: Associations are weak and explained variance is very weak. The "COP2" relation is not significant.
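For readers who want to see how the Anova figures relate to the table of means, here is a minimal Python sketch that rebuilds the COP1 line from the group means, standard deviations and group sizes shown earlier. It only uses the standard one-way decomposition (between-groups and within-groups sums of squares); the variable names are illustrative.

```python
import math

# (mean, standard deviation, N) for COP1, taken from the table of means above
groups = {
    "teacher trainee": (1.528, 0.6258, 48),
    "regular teacher": (1.816, 0.3440, 38),
}

n_total = sum(n for _, _, n in groups.values())
grand_mean = sum(mean * n for mean, _, n in groups.values()) / n_total

# Between groups: how far each group mean is from the overall mean
ss_between = sum(n * (mean - grand_mean) ** 2 for mean, _, n in groups.values())
# Within groups: spread of individuals around their own group mean
ss_within = sum((n - 1) * sd ** 2 for _, sd, n in groups.values())

df_between = len(groups) - 1
df_within = n_total - len(groups)

f_value = (ss_between / df_between) / (ss_within / df_within)
eta_squared = ss_between / (ss_between + ss_within)   # explained variance
eta = math.sqrt(eta_squared)                          # strength of association

print(round(f_value, 2), round(eta, 3), round(eta_squared, 3))
```

Up to rounding, this gives the F value and the Eta reported for COP1, and it shows why significance improves when group means are far apart and the variance inside groups is low.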
 
== Regression Analysis and Pearson Correlations ==
 
We already introduced the principle of linear regression above. It is used to compute a trend between an explanatory variable X and an explained variable Y. Both must be quantitative variables.
 
=== The principle of regression analysis ===
 
Let's recall the principle: Regression analysis tries to find a line that will maximize prediction and minimize residuals.
 
* DATA = ''predicted'' regression line + ''residuals'' (unexplained noise)
 
; Regression coefficients:
 
We have two parameters that summarize the model:
: B = the slope of the line
: A (constant) = the intercept, i.e. the predicted value of Y when X is 0


The Pearson correlation ('''r''') summarizes the strength of the relation


R square represents the variance explained.


== Regression Analysis and Pearson Correlations ==
[[Image:statistical-structure.png]]




=== Linear bi-variate regression example ===


=== Does teacher age explain exploratory activities outside the classroom ? ===
The question: Does teacher age explain exploratory activities outside the classroom ?  


* Independant variable: AGE
* Independent variable X: Age of the teacher
* Dependent variable: Fréquence des activités d'exploration à l'extérieur de la classe
* Dependent variable Y: Frequency of exploratory activities outside the classroom


Model Summary
; Regression Model Summary


{| border="1"
{| border="1"
! rowspan="1" colspan="1" |
! rowspan="1" colspan="1" |R
R
! rowspan="1" colspan="1" |R Square
! rowspan="1" colspan="1" |
! rowspan="1" colspan="1" |Adjusted R Square
R Square
! rowspan="1" colspan="1" |Std. Error of the Estimate
! rowspan="1" colspan="1" |
! rowspan="1" colspan="1" |Pearson Correlation
Adjusted R Square
! rowspan="1" colspan="1" |Sig. (1-tailed)
! rowspan="1" colspan="1" |
! rowspan="1" colspan="1" |N
Std. Error of the Estimate
! rowspan="1" colspan="1" |
Pearson Correlation
! rowspan="1" colspan="1" |
Sig. (1-tailed)
! rowspan="1" colspan="1" |
N
|-
|-
| rowspan="1" colspan="1" |
| rowspan="1" colspan="1" |.316
.316
| rowspan="1" colspan="1" |.100
| rowspan="1" colspan="1" |
| rowspan="1" colspan="1" |.075
.100
| rowspan="1" colspan="1" |.4138
| rowspan="1" colspan="1" |
| rowspan="1" colspan="1" |.316
.075
| rowspan="1" colspan="1" |.027
| rowspan="1" colspan="1" |
| rowspan="1" colspan="1" |38
.4138
| rowspan="1" colspan="1" |
.316
| rowspan="1" colspan="1" |
.027
| rowspan="1" colspan="1" |
38
|}
|}


Model Coefficients
;Model Coefficients


{| border="1"
{| border="1"
! rowspan="1" colspan="1" |
! rowspan="1" colspan="1" |
 
! rowspan="1" colspan="1" |Coefficients
! rowspan="1" colspan="1" |
Coefficients
! rowspan="1" colspan="1" |
 
! rowspan="1" colspan="1" |
Stand. coeff.
! rowspan="1" colspan="1" |
t
! rowspan="1" colspan="1" |
Sig.
! rowspan="1" colspan="1" |
! rowspan="1" colspan="1" |
Correlations
! rowspan="1" colspan="1" |Stand. coeff.
! rowspan="1" colspan="1" |t
! rowspan="1" colspan="1" |Sig.
! rowspan="1" colspan="1" |Correlations
|-
|-
| rowspan="1" colspan="1" |
| rowspan="1" colspan="1" |
 
| rowspan="1" colspan="1" |B
| rowspan="1" colspan="1" |
| rowspan="1" colspan="1" |Std. Error
B
| rowspan="1" colspan="1" |Beta
| rowspan="1" colspan="1" |
Std. Error
| rowspan="1" colspan="1" |
Beta
| rowspan="1" colspan="1" |
 
| rowspan="1" colspan="1" |
| rowspan="1" colspan="1" |
| rowspan="1" colspan="1" |
| rowspan="1" colspan="1" |
Zero-order
| rowspan="1" colspan="1" |Zero-order
|-
|-
| rowspan="1" colspan="1" |(Constant)
| rowspan="1" colspan="1" |.706
| rowspan="1" colspan="1" |.268
| rowspan="1" colspan="1" |
| rowspan="1" colspan="1" |
(Constant)
| rowspan="1" colspan="1" |2.639
| rowspan="1" colspan="1" |.012
| rowspan="1" colspan="1" |
| rowspan="1" colspan="1" |
.706
| rowspan="1" colspan="1" |
.268
| rowspan="1" colspan="1" |
| rowspan="1" colspan="1" |
2.639
| rowspan="1" colspan="1" |
.012
| rowspan="1" colspan="1" |
|-
|-
| rowspan="1" colspan="1" |
| rowspan="1" colspan="1" |AGE Age
AGE Age
| rowspan="1" colspan="1" |.013
| rowspan="1" colspan="1" |
| rowspan="1" colspan="1" |.006
.013
| rowspan="1" colspan="1" |.316
| rowspan="1" colspan="1" |
| rowspan="1" colspan="1" |1.999
.006
| rowspan="1" colspan="1" |.053
| rowspan="1" colspan="1" |
| rowspan="1" colspan="1" |.316
.316
| rowspan="1" colspan="1" |
1.999
| rowspan="1" colspan="1" |
.053
| rowspan="1" colspan="1" |
.316
|-
|-
| rowspan="1" colspan="7" |
| rowspan="1" colspan="7" |Dependent Variable: Var_COP2 Frequency of exploratory activities outside the classroom
Dependent Variable: Var_COP2 Fréquence des activités d'exploration à l'extérieur de la
classe
|}
|}


All this means:
All this means:


* We have a weak relation (.316) between age and exploratory activities. It is significant
* We have a weak relation (.316) between age and exploratory activities. It is significant with a one-tailed test (.027); the two-tailed significance of the slope is .053
(.027)
* Formally the relation is:
 
exploration scale = .706 + 0.013 * AGE
 
(roughly: only people over 99 are predicted a top score of 2)
 
Here is a scatter plot of this relation
 
* No need for statistical coefficients to see that the relation is rather weak and why the
prediction states that it takes about 100 years ... :)
 
[[Image:book-research-design-195.png]]
 
== Exploratory Multi-variate Analysis ==
 
There are many techniques. Here we just introduce cluster analysis; others, e.g. factor analysis
(principal components) or discriminant analysis, are not covered here
 


Formally speaking, the relation is:


=== Cluster Analysis ===
exploration scale = .706 + 0.013 * AGE


* Cluster analysis or classification refers to a set of multivariate methods for grouping
It also can be interpreted as: "only people over 99 are predicted a top score of 2" :)
elements (subjects or variables) from some finite set into clusters of similar elements
(subjects or variables).
* There are two different kinds: hierarchical cluster analysis and K-means clustering.
* Typical examples: Classify teachers into 4 to 6 different groups regarding ICT usage


Here is a scatter plot of this relation:


[[Image:linear-regression-example.png]]


=== Gonzalez classification of teachers ===
There is no need for statistical coefficients to see that the relation is rather weak, and why the prediction states that it takes about 100 years to get there... :)
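The figures of this example can be tied together with a short Python sketch. It takes the constant and slope from the coefficients table and the correlation and N from the model summary; the function name <code>predict</code> is only illustrative, and the t formula is the usual test statistic for a single Pearson correlation.

```python
import math

a = 0.706   # constant (intercept) from the coefficients table
b = 0.013   # slope for AGE
r = 0.316   # Pearson correlation
n = 38      # number of teachers in the analysis

def predict(age):
    """Predicted exploration score for a teacher of the given age."""
    return a + b * age

# Age at which the regression line reaches the top score of 2
age_for_top_score = (2 - a) / b

# Share of variance explained by age
r_squared = r ** 2

# t statistic for the slope in a bi-variate regression
t_value = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

print(round(predict(40), 2), round(age_for_top_score, 1),
      round(r_squared, 3), round(t_value, 2))
```

The age needed for a top score lies between 99 and 100, and R square and t agree with the tables up to rounding: age explains only about 10% of the variance.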


* A hierarchical analysis allows us to identify 6 major types of teachers
== Links ==
* Type 1: the convinced teacher
* Type 2: active teachers
* Type 3: motivated teachers who lack a favourable environment
* Type 4: willing teachers who are weak in the field of technology
* Type 5: the technically strong teacher who is not very active with ICT
* Type 6: the teacher who is at ease despite an average level of mastery


Dendrogram (tree diagram of the population)
; Online handbooks
There are excellent statistics resources on the web. For starters we recommend:


[[Image:book-research-design-196.png]]
* [http://www.statsoft.com/textbook/stathome.html StatSoft Electronic Statistics Textbook]. '''Very good and fairly suitable for beginners''', from StatSoft the makers of Statistica.


Statistics of a subset of the 36 variables used for analysis:
* [http://www2.chass.ncsu.edu/garson/pa765/statnote.htm PA 765 Statnotes: An Online Textbook] by G. David Garson, NC State University. '''Very good Hypertext with many detailed "chapters", not always suitable for total beginners'''. Also makes references to SPSS procedures . In this tutorial, we referred to several pages.


[[Image:book-research-design-197.png]]
; Online pages
* [http://en.wikipedia.org/wiki/Normalization_(statistics) Normalization_(statistics)] (Wikipedia)
* [http://en.wikipedia.org/wiki/Standard_score Standard score] (Wikipedia)
* [http://en.wikipedia.org/wiki/Skewness Skewness] (Wikipedia)


* Final note: confirmatory multivariate analysis (e.g. structural equation modelling) is
; More
not even mentioned in this document
See [[Research_methodology_resources#Statistics|Research methodology resources]] for more pointers.


== To do ==


* Translate examples
* Review everything and add some more explanations
* Add logistic regression ?


[[Category: research methodologies]]
[[Category: research methodologies]]
[[Category: tutorials]]
[[Category:Research methodology tutorials]]

Latest revision as of 19:06, 22 August 2016


This is part of the methodology tutorial (see its table of contents).

Introduction

This tutorial is a short introduction to simple (rather confirmatory) statistics for beginners.

Learning goals
  • Understand the importance of data assumptions, e.g. understand the bad influence of "extreme cases"
  • Understand the "find structure" principle of statistical data analysis
  • Be able to identify the major stages of (simple) statistical analysis
  • Know the difference between the four kinds of statistical coefficients
  • Be able to select a procedure for bi-variate analysis according to data types
  • Understand crosstabulation
  • Understand analysis of variance
  • Understand simple regression analysis
Prerequisites
Moving on
Level and target population
  • Beginners
Quality
  • to be improved, but usable

Scales and "data assumptions"

Types of quantitative measures (scales)

Quantitative data come in different types or forms as we have seen in descriptive statistics and scales tutorial.

Let's recall the three data types:

  • nominal, i.e. categorized observations (e.g. country names)
  • ordinal, i.e. rankings
  • interval, i.e. quantitative scaled observations (e.g. a score)

For each type of measure or combinations of types of measures, you will have to use particular analysis techniques. In other words, most statistical procedures only work with certain kinds of data types.

There is a bigger choice of statistical techniques for quantitative (interval) variables. Therefore scales like (1) strongly agree, (2) agree, (3) somewhat agree, etc. usually are treated as interval variables, although it's not totally correct to do so.

Data types are not the only technical constraints for the selection of a statistical procedure; sample size and data assumptions are others.

Data assumptions

In addition to their data types, many statistical analysis types only work for given sets of data distributions and relations between variables.

In practical terms this means that you not only have to adapt your analysis techniques to the types of measures, but you should also (roughly) respect other data assumptions.

Linearity

The most frequent assumption about relations between variables is that the relationships are linear.

In the following example the relationship is non-linear: students that show weak daily computer use have bad grades, but so do the ones that show very strong use.

Popular measures like the Pearson’s r correlation will "not work", i.e. you will have a very weak correlation and therefore miss this non-linear relationship.
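A small Python sketch illustrates the point. The data are hypothetical (an inverted U: grades are best for medium computer use), and <code>pearson_r</code> is a plain textbook implementation written here for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation of two equally long lists of numbers."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# HYPOTHETICAL data: hours of daily computer use and grades
hours = list(range(9))                               # 0 .. 8 hours
grades = [5 - 0.25 * (h - 4) ** 2 for h in hours]    # best grade at 4 hours

r = pearson_r(hours, grades)
print(f"r = {r:.3f}")
```

Although grades depend entirely on computer use in this constructed example, the linear correlation is practically zero, which is why a scatter plot should always be inspected first.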

Non-linear-relation.png

Normal distribution

Most methods for interval data also require a so-called normal distribution (see the Methodology tutorial - descriptive statistics and scales)

If you have data with "extreme cases" and/or data that is skewed (asymmetrical), some individuals will have much more "weight" than the others.

Hypothetical example:

  • The "red" student who uses the computer for very long hours will lead to a positive correlation and a positive regression slope, whereas the "black" ones alone suggest a non-existent correlation. Mean use of computers does not represent "typical" usage in this case, since the "red" one "pulls the mean upwards".
  • The "green" student, however, will not have a major impact on the result, since the other data are well distributed along the two axes. In this second case the "mean" represents a "typical" student.
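The effect of one extreme case on the mean can be sketched in Python as follows. The numbers are hypothetical and only meant to mimic the "red" student of the example.

```python
from statistics import mean, median

# HYPOTHETICAL hours of daily computer use
hours_typical = [1, 2, 2, 3, 3, 4]
hours_with_extreme = hours_typical + [20]   # one "red" student added

mean_typical = mean(hours_typical)
mean_with_extreme = mean(hours_with_extreme)
median_typical = median(hours_typical)
median_with_extreme = median(hours_with_extreme)

print(mean_typical, mean_with_extreme)      # the mean is pulled upwards
print(median_typical, median_with_extreme)  # the median hardly moves
```

A single extreme case doubles the mean here, while the median stays close to the "typical" student.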

Non-normal-distribution.png

In addition, you should understand that extreme values already carry more weight in variance-based analysis methods (i.e. regression analysis, Anova, factor analysis, etc.), since distances are computed as squares.

The principle of statistical analysis

Finding structure

The goal of statistical analysis is quite simple: find structure in the data. We can express this principle with two synonymous formulas:

DATA = STRUCTURE + NON-STRUCTURE
DATA = EXPLAINED VARIANCE + NOT EXPLAINED VARIANCE

Example: Simple regression analysis

  • DATA = predicted regression line + residuals (unexplained noise)

In other words: regression analysis tries to find a line that will maximize prediction and minimize residuals.
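The formula can be made concrete with a minimal least-squares sketch in Python. The data points are hypothetical; the slope and intercept are computed with the standard least-squares formulas.

```python
# HYPOTHETICAL data points
xs = [1, 2, 3, 4, 5, 6]
ys = [2.1, 2.9, 4.2, 4.8, 6.1, 6.9]

mean_x = sum(xs) / len(xs)
mean_y = sum(ys) / len(ys)

# Least-squares slope (b) and intercept (a)
b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
     / sum((x - mean_x) ** 2 for x in xs))
a = mean_y - b * mean_x

predicted = [a + b * x for x in xs]                 # STRUCTURE
residuals = [y - p for y, p in zip(ys, predicted)]  # NON-STRUCTURE

ss_total = sum((y - mean_y) ** 2 for y in ys)
ss_explained = sum((p - mean_y) ** 2 for p in predicted)
ss_residual = sum(e ** 2 for e in residuals)

# Share of variance explained by the line
r_squared = ss_explained / ss_total
print(round(b, 3), round(a, 3), round(r_squared, 3))
```

Each observation equals its predicted value plus its residual, and the total sum of squares equals the explained part plus the unexplained part.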

Statistical-structure.png

Stages of statistical analysis

Let's have a look at what we mean by statistical analysis and what you typically have to do. We shall come back to most stages throughout this tutorial page:

  1. Clean your data
    • Make very sure that your data are correct (e.g. check data transcription)
    • Make very sure that missing values (e.g. not answered questions in a survey) are clearly identified as missing data
  2. Gain knowledge about your data
    • Make lists of data (for small data sets only !)
    • Produce descriptive statistics, e.g. means, standard-deviations, minima, maxima for each variable
    • Produce graphics, e.g. histograms or box plot that show the distribution
  3. Produce composed scales
    • E.g. create a single variable from a set of questions
  4. Make graphics or tables that show relationships
    • E.g. Scatter plots for interval data (as in our previous examples) or cross-tabulations
  5. Calculate coefficients that measure the strength and the structure of a relation
    • Strength examples: Cramer’s V for cross-tabulations, or Pearson’s R for interval data
    • Structure examples: regression coefficient, tables of means in analysis of variance
  6. Calculate coefficients that describe the percentage of variance explained
    • E.g. R2 in a regression analysis
  7. Compute the significance level, i.e. find out if you have the right to interpret the relation
    • E.g. Chi-2 for crosstabs, Fisher’s F in regression analysis

Note: With statistical data analysis programs you easily can do several steps in one operation.
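
As an illustration of stages 1 and 2, here is a minimal sketch using only the Python standard library. The `answers` list is hypothetical (invented for this example). It assumes that missing answers were coded as `None` during data entry, so that they can be excluded explicitly before any statistic is computed.

```python
import statistics

# Hypothetical answers to one survey question (scale 0 to 2); None marks a missing value
answers = [2, 1, None, 0, 2, 1, 1, None, 2, 0]

# Stage 1: identify missing values explicitly instead of treating them as zeros
valid = [a for a in answers if a is not None]
n_missing = len(answers) - len(valid)

# Stage 2: descriptive statistics on the valid answers only
summary = {
    "n_valid": len(valid),
    "n_missing": n_missing,
    "mean": statistics.mean(valid),
    "std_dev": statistics.stdev(valid),  # sample standard deviation (divides by n - 1)
    "min": min(valid),
    "max": max(valid),
}

for name, value in summary.items():
    print(name, value)
```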

Types of statistical coefficients

All statistical analyses produce many kinds of coefficients, i.e. numbers that summarize certain kinds of information.

Always make sure to use only coefficients that are appropriate for your data

There are four big kinds of coefficients and you find these in most analysis methods:

(1) Strength of a relation
  • Coefficients usually range from -1 (total negative relationship) to +1 (total positive relationship). 0 means no relationship.
(2) Structure (tendency) of a relation
  • Summarizes a trend
(3) Percentage of variance explained
  • Tells how much structure is in your model
(4) Significance level of your model
  • Gives the chance that you are in fact gambling, i.e. that a relation as strong as the one observed shows up by chance alone
  • Typically in the social sciences a sig. level lower than 5% (0.05) is acceptable. Do not interpret results that are above this threshold!

These four types are mathematically connected: e.g. the significance level depends not just on the size of your sample, but also on the strength of the relation.

Overview of statistical methods

Statistical data analysis methods can be categorized according to data types we introduced in the beginning of this tutorial module.

The following table shows a few popular simple bi-variate analysis methods for a given independent (explaining) variable X and a dependent (to be explained) variable Y.

Independent (explaining) variable X | Dependent variable Y: Quantitative (interval) | Dependent variable Y: Qualitative (nominal or ordinal)
Quantitative | Correlation and regression | Logistic regression
Qualitative | Analysis of variance | Crosstabulations

Popular multi-variate analysis

Independent (explaining) variable X | Dependent variable Y: Quantitative (interval) | Dependent variable Y: Qualitative (nominal or ordinal)
Quantitative | Factor analysis, multiple regression, SEM, cluster analysis | Logit. Alternatively, transform X into a qualitative variable and see below, or split a variable into several dichotomous (yes/no) variables and see to the left.
Qualitative | Anova | Multidimensional scaling etc.


Crosstabulation

Crosstabulation is a popular technique to study relationships between nominal (categorical) or ordinal variables.

The principle of cross tabulation analysis

Crosstabulation is simple, but beginners nevertheless often get it wrong. You have to remember the basic objective of simple data analysis: explain variable Y with variable X.

Computing the percentages (probabilities)

Since you want to know the probability (percentage) that a value of X leads to a value of Y, you will have to compute percentages in order to be able to "talk about probabilities".

In a tabulation, the X variable is usually put on top (i.e. its values show in columns) but you can do it the other way round. Just make sure that you get the percentages right !

Steps
  • Compute percentages across each item of X (i.e. "what is the probability that a value of X leads to a value of Y")
  • Then compare (interpret) percentages across each item of the dependent (to be explained) variable

Let's recall the simple experimentation paradigm in which most statistical analysis is grounded since research is basically about comparison. Note: X is put to the left (not on top):

Treatment | effect (O) | non-effect (O) | Total effect for a group
treatment: (group X) | bigger | smaller | 100 %
non-treatment: (group non-X) | smaller | bigger | 100 %

You have to interpret this table in the following way: The chance that a treatment (X) leads to a given effect (Y) is higher than the chance that a non-treatment will have this effect.
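
The percentage step can be sketched in a few lines of Python. The counts below are hypothetical (invented for illustration), and `percentages_within_x` is a helper name introduced here, not a function of any statistics package. As in the table above, X is put to the left, so percentages are computed across each row.

```python
# Hypothetical counts, invented for illustration: rows are X groups, columns are Y outcomes
counts = {
    "treatment": {"effect": 30, "non-effect": 10},
    "non-treatment": {"effect": 12, "non-effect": 28},
}


def percentages_within_x(table):
    """For each X group, return the percentage of its cases that fall in each Y value."""
    result = {}
    for group, outcomes in table.items():
        group_total = sum(outcomes.values())
        result[group] = {y: 100.0 * n / group_total for y, n in outcomes.items()}
    return result


pct = percentages_within_x(counts)
for group, row in pct.items():
    print(group, {y: f"{p:.1f}%" for y, p in row.items()})
```

Each X group adds up to 100%. The interpretation then compares the same Y value across the groups (here: "effect" in the treatment group versus "effect" in the non-treatment group).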

Anyhow, a "real" statistical crosstabulation example will be presented below. Let's first discuss a few coefficients that can summarize some important information.

Statistical association coefficients (there are many!)
  • Phi is a chi-square based measure of association and is usually used for 2x2 tables
  • The Contingency Coefficient (Pearson's C). The contingency coefficient is an adjustment to phi, intended to adapt it to tables larger than 2-by-2.
  • Somers' d is a popular coefficient for ordinal measures (both X and Y). There exist two variants: "Symmetric" and "Y dependent on X".
Statistical significance tests

Pearson's chi-square is by far the most common. If simply "chi-square" is mentioned, it is probably Pearson's chi-square. This statistic is used to test the hypothesis of no association between columns and rows in tabular data. It can be used with nominal data.

In SPSS
  • You will find crosstabs under menu: Analyze->Descriptive statistics->Crosstabs
  • You then must select percentages in "Cells" and coefficients in "Statistics". This will make it "inferential", not just "descriptive".

Crosstabulation - Example 1

We want to know if ICT training will explain use of presentation software in the classroom.

There are two survey questions:

  1. Did you receive some formal ICT training ?
  2. Do you use a computer to prepare slides for classroom presentations ?

Now let's examine the results

X = Did you receive some formal ICT training? (columns)
Y = Do you use a computer to prepare slides for classroom presentations? (rows)

Y | | X: No | X: Yes | Total
Regularly | Count | 4 | 45 | 49
Regularly | % within X | 44.4% | 58.4% | 57.0%
Occasionally | Count | 4 | 21 | 25
Occasionally | % within X | 44.4% | 27.3% | 29.1%
Never | Count | 1 | 11 | 12
Never | % within X | 11.1% | 14.3% | 14.0%
Total | Count | 9 | 77 | 86
Total | % within X | 100.0% | 100.0% | 100.0%

The evidence that ICT training ("Yes") leads to more frequent use of a computer to prepare slides is very weak (you can see this by comparing the percentages line by line).

The statistics tell the same story:

  • Pearson Chi-Square = 1.15 with a significance = .562
    • This means that, if there were no relationship at all, a result like this one would still turn up by chance in more than 50% of cases, so you cannot claim a relationship
  • Contingency coefficient = 0.115, significance = .562 (same result)

Therefore: Not only is the relationship very weak, it also cannot be interpreted. In other words: in our case there is no way to assert that ICT training leads to more frequent use of presentation software.
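
The coefficients of Example 1 can be recomputed from the counts in the table. This is a minimal sketch in plain Python, not the SPSS procedure. It relies on one shortcut that is an assumption of this example: for exactly 2 degrees of freedom (a 3x2 table), the chi-square upper-tail probability equals exp(-x/2). For any other table size a statistics library would be needed.

```python
import math

# Observed counts from Example 1: rows are Y (Regularly, Occasionally, Never),
# columns are X (No, Yes)
observed = [
    [4, 45],
    [4, 21],
    [1, 11],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Pearson chi-square: sum over all cells of (observed - expected)^2 / expected,
# where expected is the count we would see if X and Y were unrelated
chi_square = 0.0
for i, row in enumerate(observed):
    for j, count in enumerate(row):
        expected = row_totals[i] * col_totals[j] / n
        chi_square += (count - expected) ** 2 / expected

df = (len(observed) - 1) * (len(observed[0]) - 1)

# Closed form valid for df = 2 only (see the assumption stated above)
assert df == 2
p_value = math.exp(-chi_square / 2)

# Contingency coefficient (Pearson's C)
contingency_c = math.sqrt(chi_square / (chi_square + n))

print(f"chi-square = {chi_square:.2f}, df = {df}, p = {p_value:.3f}")
print(f"contingency coefficient = {contingency_c:.3f}")
```

The results should agree with the reported values (1.15, .562 and 0.115) up to rounding.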

Crosstabulation - Example 2

We want to know if the teacher's belief that students will gain autonomy when using Internet resources has an influence on classroom practice, i.e. on organizing activities where learners have to search for information on the Internet.

(translation needed)

  • X = Teacher's belief: Learners will gain autonomy through using Internet resources
  • Y = Classroom activities: Search information on the Internet
X = Learners will gain autonomy through using Internet resources (teacher belief) (columns)
Y = Search information on the Internet (rows)

Y | | 0 Fully disagree | 1 Rather disagree | 2 Rather agree | 3 Fully agree | Total
0 Regularly | Count | 0 | 2 | 9 | 11 | 22
0 Regularly | % within X | .0% | 18.2% | 19.6% | 42.3% | 25.6%
1 Occasionally | Count | 1 | 7 | 23 | 11 | 42
1 Occasionally | % within X | 33.3% | 63.6% | 50.0% | 42.3% | 48.8%
2 Never | Count | 2 | 2 | 14 | 4 | 22
2 Never | % within X | 66.7% | 18.2% | 30.4% | 15.4% | 25.6%
Total | Count | 3 | 11 | 46 | 26 | 86
Total | % within X | 100.0% | 100.0% | 100.0% | 100.0% | 100.0%

  • We have a weak but significant relationship: the more teachers agree that students will increase learning autonomy by using Internet resources, the more likely it is that they will let students do so.

The statistical coefficient we use is "Directional Ordinal by Ordinal Measures with Somers' d":

Values | Somers' d | Significance
Symmetric | -.210 | .025
Y = Search information on the Internet - Dependent | -.215 | .025

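Therefore, teachers' beliefs explain things, but the relationship is very weak. The coefficient is negative only because Y is coded from 0 (regularly) to 2 (never), so stronger agreement goes with a lower Y code.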
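
For readers who want to see where these two values come from, here is a minimal sketch in plain Python that recomputes Somers' d from the counts of Example 2 by counting concordant and discordant pairs. It is an illustration of the usual textbook definition, not the SPSS implementation, and it does not compute the significance.

```python
# Counts from Example 2: rows are Y (0 Regularly, 1 Occasionally, 2 Never),
# columns are X (0 Fully disagree, 1 Rather disagree, 2 Rather agree, 3 Fully agree)
table = [
    [0, 2, 9, 11],
    [1, 7, 23, 11],
    [2, 2, 14, 4],
]

n_rows, n_cols = len(table), len(table[0])

# A pair of cases is concordant if the case with the higher X also has the higher Y,
# and discordant if the case with the higher X has the lower Y
concordant = 0
discordant = 0
for i in range(n_rows):
    for j in range(n_cols):
        for k in range(i + 1, n_rows):
            for l in range(n_cols):
                if l > j:
                    concordant += table[i][j] * table[k][l]
                elif l < j:
                    discordant += table[i][j] * table[k][l]

n = sum(sum(row) for row in table)
row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]

# Number of pairs that are not tied on X (columns) and not tied on Y (rows)
pairs_untied_x = (n * n - sum(c * c for c in col_totals)) // 2
pairs_untied_y = (n * n - sum(r * r for r in row_totals)) // 2

# "Y dependent on X" variant and symmetric variant
d_y_given_x = (concordant - discordant) / pairs_untied_x
d_symmetric = (concordant - discordant) / ((pairs_untied_x + pairs_untied_y) / 2)

print(f"Y dependent: {d_y_given_x:.3f}")
print(f"Symmetric:   {d_symmetric:.3f}")
```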

Simple analysis of variance

Analysis of variance (Anova) and its multi-variate variants are the favorite tools of experimentalists. They are also popular in quasi-experimental research and survey research, as the following example shows.

The principle of analysis of variance

X is an experimental condition (therefore a nominal variable) and Y usually is an interval variable.

Example: Does presence or absence of ICT usage influence grades ?

  • You can show that X has an influence on Y if means achieved by different groups (e.g. ICT vs. non-ICT users) are significantly different.

Significance improves when:

  • Means of the X groups are different (the further apart the better)
  • Variance inside X groups is low (certainly lower than the overall variance)
Coefficients
  • Standard deviation is a measure of dispersion (the square root of the variance). Roughly, it means "the mean deviation from the mean", i.e. how far from the central point the "typical" individual is.
  • Eta is a correlation coefficient
  • Eta square measures the explained variance
In SPSS

Analysis of variance can be found in two different locations:

  • Analyze->Compare Means
  • General linear models (avoid this if you are a beginner)

Differences between teachers and teacher students

In this example we want to know if teacher trainees (e.g. primary teacher students) are different from "real" teachers regarding three kinds of variables:

  • Frequency of different kinds of learner activities
  • Frequency of exploratory activities outside the classroom
  • Frequency of individual student work

COP1, COP2, COP3 are indices (composite variables) that range from 0 (little) to 2 (a lot)

Therefore we compare the average (mean) of the populations for each variable.

Population | | COP1 Frequency of different kinds of learner activities | COP2 Frequency of exploratory activities outside the classroom | COP3 Frequency of individual student work
1 Teacher trainee | Mean | 1.528 | 1.042 | .885
1 Teacher trainee | N | 48 | 48 | 48
1 Teacher trainee | Std. Deviation | .6258 | .6260 | .5765
2 Regular teacher | Mean | 1.816 | 1.224 | 1.224
2 Regular teacher | N | 38 | 38 | 38
2 Regular teacher | Std. Deviation | .3440 | .4302 | .5893
Total | Mean | 1.655 | 1.122 | 1.035
Total | N | 86 | 86 | 86
Total | Std. Deviation | .5374 | .5527 | .6029

Standard deviations within groups are rather high (in particular for teacher trainees), which is a bad thing: it means that individuals within a group differ a lot from each other.

Anova Table and measures of associations

At this stage, all you have to do is look at the sig. level, which should be below 0.05: you only accept a chance of less than 5% that the relationship is random.

Variables (Y) explained by population (X) | | Sum of Squares | df | Mean Square | F | Sig.
COP1 Frequency of different kinds of learner activities * Population | Between Groups | 1.759 | 1 | 1.759 | 6.486 | .013
COP1 | Within Groups | 22.785 | 84 | .271 | |
COP1 | Total | 24.544 | 85 | | |
COP2 Frequency of exploratory activities outside the classroom * Population | Between Groups | .703 | 1 | .703 | 2.336 | .130
COP2 | Within Groups | 25.265 | 84 | .301 | |
COP2 | Total | 25.968 | 85 | | |
COP3 Frequency of individual student work * Population | Between Groups | 2.427 | 1 | 2.427 | 7.161 | .009
COP3 | Within Groups | 28.468 | 84 | .339 | |
COP3 | Total | 30.895 | 85 | | |

Measures of Association

Variable | Eta | Eta Squared
Var_COP1 Frequency of different kinds of learner activities * Population | .268 | .072
Var_COP2 Frequency of exploratory activities outside the classroom * Population | .164 | .027
Var_COP3 Frequency of individual student work * Population | .280 | .079

Result: Associations are weak and explained variance is very weak. The "COP2" relation is not significant.
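
To make the link between the table of means and the Anova table concrete, the following minimal sketch in plain Python rebuilds the COP1 sums of squares, F and Eta squared from the group sizes, means and standard deviations reported above. Because the inputs are rounded, the results only match the SPSS output approximately. The significance level is not computed here, since it requires the F distribution from a statistics library.

```python
# Summary statistics for COP1 from the table of means: (N, mean, standard deviation)
groups = {
    "Teacher trainee": (48, 1.528, 0.6258),
    "Regular teacher": (38, 1.816, 0.3440),
}

n_total = sum(n for n, _, _ in groups.values())
grand_mean = sum(n * mean for n, mean, _ in groups.values()) / n_total

# Between-groups sum of squares: how far each group mean lies from the grand mean
ss_between = sum(n * (mean - grand_mean) ** 2 for n, mean, _ in groups.values())

# Within-groups sum of squares, rebuilt from each group's sample standard deviation
ss_within = sum((n - 1) * sd ** 2 for n, _, sd in groups.values())

df_between = len(groups) - 1
df_within = n_total - len(groups)

# F compares the variance between groups with the variance inside groups
f_ratio = (ss_between / df_between) / (ss_within / df_within)

# Eta squared is the share of the total variance explained by group membership
eta_squared = ss_between / (ss_between + ss_within)
eta = eta_squared ** 0.5

print(f"SS between = {ss_between:.3f}, SS within = {ss_within:.3f}")
print(f"F = {f_ratio:.3f}, Eta = {eta:.3f}, Eta squared = {eta_squared:.3f}")
```

This also shows why significance improves when group means are further apart (larger sum of squares between groups) and when variance inside groups is low (smaller sum of squares within groups).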

Regression Analysis and Pearson Correlations

We already introduced the principle of linear regression above. It is used to compute a trend between an explaining variable X and an explained variable Y. Both must be quantitative variables.

The principle of regression analysis

Let's recall the principle: Regression analysis tries to find a line that will maximize prediction and minimize residuals.

  • DATA = predicted regression line + residuals (unexplained noise)
Regression coefficients

We have two parameters that summarize the model:

B = the slope of the line
A (constant) = the intercept, i.e. the predicted value of Y when X is 0

The Pearson correlation (r) summarizes the strength of the relation

R square represents the variance explained.
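
As a reminder, and assuming the usual least-squares definitions, these coefficients can be written as follows. The notation (x_i, y_i for the individual observations, bars for means, e_i for the residuals) is introduced here for illustration only and is not used elsewhere in this tutorial.

```latex
\begin{aligned}
y_i &= A + B\,x_i + e_i \\
B &= \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2} \\
A &= \bar{y} - B\,\bar{x} \\
r &= \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}
          {\sqrt{\sum_i (x_i - \bar{x})^2 \, \sum_i (y_i - \bar{y})^2}} \\
R^2 &= r^2 \quad \text{(simple regression with a single X)}
\end{aligned}
```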

Statistical-structure.png


Linear bi-variate regression example

The question: Does teacher age explain exploratory activities outside the classroom ?

  • Independent variable X: Age of the teacher
  • Dependent variable Y: Frequency of exploratory activities organized outside the classroom
Regression Model Summary
R | R Square | Adjusted R Square | Std. Error of the Estimate | Pearson Correlation | Sig. (1-tailed) | N
.316 | .100 | .075 | .4138 | .316 | .027 | 38
Model Coefficients
 | B | Std. Error | Beta (stand. coeff.) | t | Sig. | Zero-order correlation
(Constant) | .706 | .268 | | 2.639 | .012 |
AGE Age | .013 | .006 | .316 | 1.999 | .053 | .316
Dependent Variable: Var_COP2 Frequency of exploratory activities outside the classroom

All this means:

  • We have a weak relation (.316) between age and exploratory activities. It is significant with a one-tailed test (.027); the two-tailed test on the slope (.053) falls just above the 5% threshold.

Formally speaking, the relation is:

exploration scale = .706 + 0.013 * AGE

It also can be interpreted as: "only people over 99 are predicted a top score of 2" :)
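
The arithmetic behind these statements can be sketched in a few lines of Python. The numbers are the rounded values from the tables above. The function name `predicted_score` is introduced here for illustration, and the adjusted R square formula assumes a model with a single predictor.

```python
# Rounded coefficients and summary values as reported in the tables above
constant = 0.706  # A
slope = 0.013     # B for AGE
r = 0.316         # Pearson correlation
n = 38            # number of teachers


def predicted_score(age):
    """Predicted value on the exploration scale for a teacher of the given age."""
    return constant + slope * age


# Age at which the predicted score reaches the top of the scale (2)
age_for_top_score = (2 - constant) / slope

# Variance explained, and its adjustment for sample size with one predictor
r_square = r ** 2
adjusted_r_square = 1 - (1 - r_square) * (n - 1) / (n - 2)

for age in (25, 40, 60):
    print(age, round(predicted_score(age), 3))
print("age for top score:", round(age_for_top_score, 1))
print("R square:", round(r_square, 3), "adjusted:", round(adjusted_r_square, 3))
```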

Here is a scatter plot of this relation:

Linear-regression-example.png

There is no need for statistical coefficients to see that the relation is rather weak and why the prediction states that it takes about 100 years to get there... :)

Links

Online handbooks

There are excellent statistics resources on the web. For starters we recommend:

  • PA 765 Statnotes: An Online Textbook by G. David Garson, NC State University. Very good hypertext with many detailed "chapters", not always suitable for total beginners. Also makes references to SPSS procedures. In this tutorial, we referred to several pages.
Online pages
More

See Research methodology resources for more pointers.

To do

  • Translate examples
  • Review everything and add some more explanations
  • Add logistic regression ?