Statistics FAQ

Compiled by Chris Fife-Schaw

This set of web pages is intended as a resource for psychology students and staff at the University of Surrey, though it is available to all. It deals with a range of questions that we get asked a lot but that are not routinely covered in basic statistics books.

The idea is to give practical advice, some stats utilities and to point people in the direction of authoritative sources. It is not intended to give extensive statistical arguments in support of the advice given – that will be available via references and web-links.

It is an evolving resource and I’ll add new features and update the entries as received wisdom on these matters changes over time. While every effort will be made to ensure that the answers to these FAQs are correct, it is inevitable that there will be occasions where the advice given is the subject of debate or is possibly just plain wrong. Please let me know if you find such things and I will change the entry accordingly (c.fife-Schaw@surrey.ac.uk). Similarly, where there are web links I cannot guarantee their accuracy and continued availability – if you notice any problems, again e-mail me at c.fife-Schaw@surrey.ac.uk and I will try to fix the problem.

Where I am suggesting solutions that are based on my own views or preferences this will be indicated by being indented and prefaced by
‘Chris says:’ – treat such advice with considerable caution!

On Correlations

How do I disattenuate a correlation (‘correct’ for measurement (un)reliability)?

When we correlate two variables, the value of the correlation is depressed if one or both of the variables contains some random measurement error (i.e. is not perfectly reliable). Since random error by definition cannot be correlated with anything, the more random error there is in your variables, the lower the maximum possible correlation between them can be (remember that the square of a Pearson correlation coefficient is the proportion of variance that the two variables share). This is particularly the case with the kinds of measures popular in psychology that involve composite scales made up of responses to a number of questionnaire items – these rarely have less than 5% error variance, and 15%+ is probably the norm. Since we are usually interested in the true correlation between the conceptual variables rather than the correlation between the observed measures, it would be handy to be able to estimate what the correlation would be after taking measurement (un)reliability into account.

The following formula (Spearman’s classic correction for attenuation) does this:

r*12 = r12 / √(r11 × r22)

where r*12 is the disattenuated correlation, r12 the observed correlation, and r11 and r22 the reliabilities of the two variables.
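By way of illustration (this little function is mine, not part of Chris’s Calculator), the correction is a one-liner in Python:

```python
import math

def disattenuate(r12, r11, r22):
    """Spearman's correction for attenuation.

    r12      -- observed correlation between the two measures
    r11, r22 -- reliability estimates (e.g. Cronbach's alpha) for each measure
    """
    return r12 / math.sqrt(r11 * r22)

# Example: an observed r of .30 between two scales with reliabilities .70 and .80
print(round(disattenuate(0.30, 0.70, 0.80), 3))  # 0.401
```

Note that if your reliability estimates are too low the ‘corrected’ value can exceed 1, which is a useful warning sign that something is wrong with the reliability estimates rather than a meaningful result.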

Since the ‘true’, disattenuated correlation is usually what readers need in order to assess the magnitude of any relationship found, you should always report the reliabilities of your measures even if you are not going to report the disattenuated values yourself.

It is similarly possible to correct a partial correlation and/or adjust a regression weight to reflect unreliability in measurement (Osborne, 2003). Here the effect of measurement error can be to artificially increase the apparent importance of a relationship. The formula for correcting a partial correlation coefficient is:

r*12.3 = (r12 × r33 − r13 × r23) / √[(r11 × r33 − r13²)(r22 × r33 − r23²)]

where r*12.3 is the corrected partial coefficient, r12, r13 and r23 the zero-order observed correlations, and r11, r22 and r33 the variables’ respective reliabilities. See Osborne (2003) or Aiken and West (1991) for discussions of the impact of measurement error on regression coefficients.
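Again purely as an illustrative sketch (my own function names, implementing the formula above), this might look like:

```python
import math

def disattenuate_partial(r12, r13, r23, r11, r22, r33):
    """Correct the partial correlation r12.3 for unreliability in all three
    measures, using the formula given in the text (see Osborne, 2003).

    r12, r13, r23 -- zero-order observed correlations
    r11, r22, r33 -- reliability estimates for the three variables
    """
    num = r12 * r33 - r13 * r23
    den = math.sqrt((r11 * r33 - r13**2) * (r22 * r33 - r23**2))
    return num / den
```

A reassuring sanity check: with perfect reliabilities (r11 = r22 = r33 = 1) the function reduces to the ordinary formula for a partial correlation, (r12 − r13 × r23) / √[(1 − r13²)(1 − r23²)].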

Chris’s Calculator [download a copy] has a utility for doing these calculations.

Assumptions:
All the usual ones for doing Pearson product moment correlations plus:

1. The errors in each measure are assumed to be random.
2. Your estimates of the reliabilities are unbiased.

Aiken, L. S., & West, S. G. (1991). Multiple regression: Testing and interpreting interactions. Thousand Oaks: Sage.

Fan, X. (2003). Two approaches for correcting correlation attenuation caused by measurement error: Implications for research practice. Educational and Psychological Measurement, 63, 915-930. Download a copy from: http://epm.sagepub.com/cgi/content/abstract/63/6/915

Osborne, J. W. (2003). Effect sizes and the disattenuation of correlation and regression coefficients: lessons from educational psychology. Practical Assessment, Research & Evaluation, 8(11). – web paper at http://pareonline.net/getvn.asp?v=8&n=11

Chris says:
Correcting for attenuation often makes your results look better and is seen by some as appealing for this reason alone. The lower the reliabilities of your measures, the more the corrected correlation will exceed the observed one, but this is not a good reason to be happy using poor quality, low reliability measures in the knowledge that the correlations you observe will be boosted substantially when you correct them later.  First, the true correlation between the variables is not changed by the quality of the tool used to measure them.  Secondly, experience shows that as the quality of measures decreases, so does the likelihood that the source of the errors is random.  This violates a key assumption of the procedure and renders the result ambiguous.

Disattenuating a correlation is not a way of turning a not-quite-significant observed correlation into a significant one either.  Most texts on this do not go into the significance of the disattenuated correlation, preferring instead to emphasise its estimated size, since this is the important bit of information being conveyed (note that the formulae contain no information on sample size, which would be necessary in order to draw up confidence limits).  Don’t try to use disattenuation to argue that your findings are now ‘significant’ when they weren’t pre-correction – this will be seen through.  Always indicate when you have disattenuated a correlation too.

This method for correcting for attenuation is the best known and is easy to use. In more complex modelling situations it is probably easier to adopt an SEM approach, assessing relationships between variables with measurement errors ‘removed’, than to try to apply this formula to many relationships simultaneously.  Fan (2003) shows that the SEM approach (at least in the CFA context) produces results equivalent to the application of the formula above.

How do I work out if two correlation coefficients are significantly different?

The key here is whether the correlations concerned are independent – i.e. from different samples – or from the same sample. This page deals with three prototypical cases:

Case 1: Testing the difference between correlations from different samples
Case 2: Testing the difference between two dependent correlations – i.e. the correlation between X and Z and Y and Z
Case 3: Testing the difference between two correlations from the same sample – i.e. the correlation between A and B and C and D

http://www.quantitativeskills.com/sisa/statistics/corrhlp.htm - this contains an on-line calculator for doing the calculations for Cases 1 and 2.

Steiger, J.H. (1980) Tests for comparing elements of a correlation matrix.  Psychological Bulletin, 87, 245–251. (download from http://www.statpower.net/publications_and_papers.htm)

Hittner, J.J., May, K. & Silver, N.C. (2003) A Monte Carlo evaluation of tests for comparing dependent correlations. Journal of General Psychology 130, 149-168.

Chris says:
Not a lot.  James Steiger said a lot about this 25 years ago and while there have been more recent useful Monte Carlo studies (e.g. Hittner et al, 2003) the broad conclusions remain the same.  Let me know if you know different.

Case 1: Testing the difference between correlations from different samples

The most widely used procedure starts by converting each correlation via Fisher’s r-to-z transformation:

Zr = 0.5 × ln[(1 + r) / (1 − r)]

You calculate a Z for each correlation, which I’ll call Zr1 and Zr2. Pearson correlation coefficients cannot readily be added to or subtracted from each other, but these transformed values can be.  Confusingly, the procedure uses ‘z’ in two different ways, as we’ll see.  Next you need to calculate the standard error of the difference between the two correlations, which is given by:

SEdiff = √[1/(n1 − 3) + 1/(n2 − 3)]

where n1 and n2 are the sample sizes for the first and second correlations respectively. Next divide the difference between the two transformed correlations by this standard error:

z = (Zr1 − Zr2) / SEdiff

This time ‘z’ is the ordinary normal deviate. You can look up the probability of having observed a ‘z’ as big as this (either positive or negative), if the two correlations did not differ from each other in the population, in the normal ‘z’ tables in the back of a statistics book. As a guide, if the z value is 1.96 or higher (ignoring the sign), the difference in the correlations is significant at the .05 level (2-tailed). Use a 2.58 cut-off for significance at the .01 level (2-tailed).
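The whole Case 1 procedure fits in a few lines of Python, offered here as an illustrative sketch (function names are my own; the p-value uses the standard normal distribution via the error function):

```python
import math

def fisher_z(r):
    """Fisher's r-to-z transformation."""
    return 0.5 * math.log((1 + r) / (1 - r))

def independent_r_test(r1, n1, r2, n2):
    """z-test for the difference between two correlations from
    independent samples, as described in the text.
    Returns (z, two-tailed p)."""
    se_diff = math.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    z = (fisher_z(r1) - fisher_z(r2)) / se_diff
    # two-tailed p from the standard normal CDF
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p

# Example: r = .60 (n = 100) versus r = .30 (n = 120)
z, p = independent_r_test(0.60, 100, 0.30, 120)
```

For these example figures z comes out at about 2.79, beyond the 2.58 cut-off, so the two correlations differ at the .01 level (2-tailed).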

Assumptions
1.The two correlations are from independent samples/groups of subjects.
2.The scores in each group are sampled randomly and independently
3.The distribution of the two variables involved in each correlation is bivariate normal.

Chris’s Calculator [download a copy] has a utility for doing these calculations.

Case 2: Testing the difference between dependent correlations – r12 vs r13

In this case you want to test the null hypothesis that two correlations are equal when they share one variable in common. James Steiger (1980) recommends Williams’ formula (T2) as the best general purpose test for this (there are lots out there!):

T2 = (r12 − r13) × √[ (N − 1)(1 + r23) / ( 2|R|(N − 1)/(N − 3) + r̄²(1 − r23)³ ) ]

where r̄ = (r12 + r13)/2 and |R| = 1 − r12² − r13² − r23² + 2 × r12 × r13 × r23 (the determinant of the correlation matrix).

This has a t-distribution with df = N – 3 and can be looked up in tables in the normal way.
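For illustration, a minimal Python sketch of Williams’ T2 (my own function, implementing the formula above; refer the result to a t-table with N − 3 df):

```python
import math

def williams_t2(r12, r13, r23, n):
    """Williams' T2 for comparing two dependent correlations that share
    one variable (r12 vs r13), as recommended by Steiger (1980).
    r23 is the correlation between the two non-shared variables."""
    r_bar = (r12 + r13) / 2
    # determinant of the 3x3 correlation matrix
    det_r = 1 - r12**2 - r13**2 - r23**2 + 2 * r12 * r13 * r23
    num = (r12 - r13) * math.sqrt((n - 1) * (1 + r23))
    den = math.sqrt(2 * det_r * (n - 1) / (n - 3)
                    + r_bar**2 * (1 - r23)**3)
    return num / den
```

As a sanity check, equal correlations give T2 = 0, and the sign of T2 follows the sign of r12 − r13.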

Case 3: Testing the difference between dependent correlations – r12 vs r34

In this situation you want to test the null hypothesis that two separate correlations from the same sample are the same. As an example, you might have looked at the correlation between IQ and reading ability at age seven and want to ask whether the strength of this correlation has changed by the time the same individuals have reached eleven years old.  In this situation you need to use James Steiger’s (1980) Z̄*2 statistic:

Z̄*2 = √(N − 3) × (z12 − z34) / √(2 − 2s̄12,34)

where z12 and z34 are the Fisher r-to-z transformed correlations that you want to compare, and s̄12,34 is the estimated covariance between the two transformed correlations, computed from the six intercorrelations among the four variables (see Steiger, 1980, for the covariance formula).

Chris’s Calculator [download a copy] has a utility for doing these calculations.


On Regression

How can I tell whether a variable mediates the relationship between two other variables?

Chris says:
Since first compiling these pages and spelling out the classic Baron and Kenny procedures, things have moved on, and the work of Andrew Hayes and Kristopher Preacher, among others, has made doing this kind of analysis much easier, especially via their PROCESS add-in for SPSS.  They have done such a good job on this that there’s little point in me repeating it, so check out their links below. The 4th edition of Andy Field’s Discovering Statistics Using IBM SPSS Statistics now covers this topic, though it only scratches the surface of what Preacher and Hayes have done.

Preacher, K. J., & Hayes, A. F. (2004). SPSS and SAS procedures for estimating indirect effects in simple mediation models. Behavior Research Methods, Instruments, & Computers, 36(4), 717-731.

Preacher, K., & Hayes, A. (2008). Asymptotic and resampling strategies for assessing and comparing indirect effects in multiple mediator models. Behavior Research Methods, 879-891.

http://www.afhayes.com/macrofaq.html

http://quantpsy.org/medn.htm

How do I test whether a variable moderates the relationship between two variables?

Chris says:
Since first compiling these pages and spelling out the classic Baron and Kenny procedures, things have moved on, and the work of Andrew Hayes and Kristopher Preacher, among others, has made doing this kind of analysis much easier, especially via their PROCESS add-in for SPSS.  They have done such a good job on this that there’s little point in me repeating it, so check out their links below. The 4th edition of Andy Field’s Discovering Statistics Using IBM SPSS Statistics now covers this topic, though it only scratches the surface of what Preacher and Hayes have done.

Aiken, L. S., & West, S. G. (1991). Multiple regression: Testing and interpreting interactions. Thousand Oaks: Sage.

http://www.afhayes.com/macrofaq.html

http://quantpsy.org/medn.htm

How do I work out if two regression coefficients differ significantly?

Case 1: Regression coefficients come from separate samples.

It is not possible to propose a single formula to deal with this as there are many competing formulae and approaches in the literature.  I have gone with the method suggested by Raymond Paternoster and colleagues for the calculator.  It is fairly conservative and follows on from more extensive work by Clogg et al. (1995).

The reason that finding a single test for this is difficult is that there are a number of different situations in which you might want to test the equality of two regression coefficients.  A common application is to see whether a grouping variable such as gender moderates a relationship between two variables and this is dealt with under a separate heading in these FAQ pages (this is effectively testing the null hypothesis that the regression coefficients are the same in both groups).

The test statistic is:

Z = (b1 − b2) / √(SEb1² + SEb2²)

where b1 and b2 are the two regression coefficients and SEb1 and SEb2 their standard errors. This Z-test assumes that the regressions are done on two separate groups and that the variables in the regression model are the same in each group.  It is not appropriate when there are different sets of variables, because the meaning of a regression coefficient depends on the other variables in the regression equation.  Recall that the coefficient indicates the expected change in the outcome for a unit change in the predictor, assuming the other variables in the model are held constant. If different sets of variables are involved, the effects of holding them constant may well be different.
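As a quick illustrative sketch of the Paternoster et al. (1998) test (my own function names; you supply the coefficients and standard errors from the two regression outputs):

```python
import math

def compare_slopes(b1, se1, b2, se2):
    """Z-test for the equality of two regression coefficients estimated
    in separate samples (Paternoster et al., 1998).
    Returns (z, two-tailed p)."""
    z = (b1 - b2) / math.sqrt(se1**2 + se2**2)
    # two-tailed p from the standard normal CDF
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p

# Example: b = 0.50 (SE 0.10) in group 1 versus b = 0.20 (SE 0.10) in group 2
z, p = compare_slopes(0.50, 0.10, 0.20, 0.10)
```

Remember the assumptions noted below: both regressions must contain the same predictors, and the samples should be fairly large.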

For those keen on testing the equivalence of regression coefficients more rigorously, a slightly better way would be to take a multi-group structural equation modelling approach. Here you would initially test your regression model constraining all coefficients to be equivalent in both groups.  If this equivalence model is not rejected then it may be reasonable to accept (or rather not reject) the hypothesis that the coefficients are the same in both groups.  If allowing just the coefficients of interest to differ between the two groups, while everything else is constrained to be identical, substantially improves the fit of the model, then it would be reasonable to conclude that the coefficients differ.  This all assumes you are comfortable with using SEM packages though.

Chris’s Calculator [download a copy] has a utility for doing this calculation.

Assumptions:
All the usual ones for conducting parametric tests and an assumption that the sample is fairly large. Cohen (1983) offers alternatives when samples are small.  The Z test assumes that the regression models for both groups have the same predictor variables in them.

Clogg, C.C., Petkova, E. and Haritou, A. (1995). Statistical methods for comparing regression coefficients between models. American Journal of Sociology 100:1261-1293.

Cohen, A. (1983). Comparing regression coefficients across subsamples: A study of the statistical test. Sociological Methods and Research 12:77-94.

Paternoster, R., Brame, R., Mazerolle, P., & Piquero, A. (1998). Using the correct statistical test for the equality of regression coefficients. Criminology, 36(4), 859-866.

Miscellany

Comparing two samples – how do I show that they are equivalent? A salutary tale…

This is a common thing to want to do and tends to occur most often in the following two situations:

Case 1:
You have collected longitudinal data (e.g. before and after an intervention) and your response rate at the follow-up is a bit disappointing so you want to try and show that the ‘lost’ sample members are not systematically different from those who stayed with the study.  Essentially you are trying to avoid criticisms of non-random sampling bias.

Case 2:
You have run a between-subjects experiment (or quasi-experiment) and, for whatever reason, you were not able to match the samples (groups) in advance and now you want to show that the samples were indeed pretty well matched on key variables anyway.

By far the most commonly seen approach to these problems is to test for differences between the samples on key (usually demographic) variables. If you get non-significant results in these tests you conclude that the samples were essentially the same.

Chris says:
This is undoubtedly WRONG and is a practice that should not be encouraged at all even if you do see it in respectable journals.

The reason it is wrong is fairly simple to understand if you think about the logic of traditional null-hypothesis testing. If your test reaches the conclusion of “no significant difference”, that simply means that the evidence you have is not strong enough to allow you to claim that the two samples were different. This is not the same thing as saying that the two samples were the ‘same’. The null-hypothesis that you are testing with traditional tests is, in some senses, a hypothesis that you want to prove is ‘true’ in the present situation but, of course, we can only reject hypotheses; we never ‘prove’ them to be true.

The traditional p-value that comes out of something like a t-test is the probability of having observed a difference between your samples as large as you have done if the null-hypothesis were indeed true.  Conventionally if that probability is very small, less than a one in 20 chance (p<0.05), we conclude that the null-hypothesis is probably wrong, reject it and accept an alternative hypothesis.  Imagine a situation where the p-value resulting from your t-test was 0.0667.  This would be ‘non-significant’ in the traditional sense but it means that there’s only a one in fifteen chance of seeing a difference as big as you have observed if the two samples really were identical on the variable concerned – would you really want to claim that the samples were ‘the same’ in this case? This is all made much worse if you have relatively small sample sizes and thus low power to correctly reject a false traditional null-hypothesis in the first place – even quite large differences between your samples won’t be flagged up by this process.

The problem here is that we are testing the wrong null-hypothesis and we need to be clearer about what ‘the same’ really means.  How different would the two samples have to be before we concluded that the bias would undermine our study’s conclusions?  Unfortunately we psychologists are not very good at being clear about what differences on our variables mean substantively and really we ought to focus on this much more clearly.

So, what should I do then?  By far the simplest approach is to produce a table of the two samples’ scores on the key variables along with their respective 95% confidence intervals, and leave it to the reader of your report to decide whether your claim that the samples are essentially equivalent is justified.  Stem and leaf plots, assuming the journal will give you the space to print them, are another alternative. The approach here is to be open about the differences between your samples and leave it to the judgement of your peers whether any differences are substantively (not statistically) significant or important.  If they are substantively significant you should not be trying to hide this from the scientific community.
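Producing such a table is straightforward. A minimal sketch in Python (the variable and data here are entirely hypothetical, and the interval uses the normal approximation with z = 1.96, which is fine for reasonably large samples; small samples would want a t-based critical value):

```python
import math

def mean_ci(xs, z=1.96):
    """Mean and approximate 95% confidence interval for a sample
    (normal approximation)."""
    n = len(xs)
    m = sum(xs) / n
    sd = math.sqrt(sum((x - m) ** 2 for x in xs) / (n - 1))
    half = z * sd / math.sqrt(n)
    return m, m - half, m + half

# One row of the suggested table, for a hypothetical 'age' variable
# among those who stayed in the study:
stayers = [34, 29, 41, 38, 30, 35, 44, 28, 33, 39]
m, lo, hi = mean_ci(stayers)
print(f"Age (stayers): M = {m:.1f}, 95% CI [{lo:.1f}, {hi:.1f}]")
```

You would report one such row per key variable for each sample, side by side, and let readers judge the overlap for themselves.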

In the experimental case 2 above an alternative strategy is to statistically control for a background variable that the groups might differ on.  Say your study was an experiment about the effectiveness of two teaching methods on children’s reading scores and you were worried that differences in IQ between the two groups might account for any differences you found.  Here you could control for the possible influence of IQ on reading scores by conducting an analysis of covariance (ANCOVA) which would remove the influence of IQ differentials from your test of the teaching methods – you would not then have to show that the two groups had equal IQs before you started.

More specialised tests are available based on what are called ‘tests of biological equivalence’. They were not originally designed to address the problem cases I listed here but arose in the context of wanting to show that two biomedical treatments were effectively equally good – an important thing to want to do if one ‘treatment’ is considerably cheaper than the other.  The logic of these tests, though, is equally applicable in our cases - see the Wellek (2003) reference below for more on this.

Useful references:

Wellek, S. (2003). Testing Statistical Hypotheses of Equivalence. Chapman and Hall/CRC Press, Boca Raton, FL.

How do I test the normality of a variable’s distribution?

We are all told that before deciding whether to use parametric or non-parametric tests we should check whether our variables are normally distributed (i.e. follow a Gaussian distribution).  I won’t go over why this is important here as it is covered in all good stats textbooks, but how you make this decision is, unfortunately, an area of a great deal of ambiguity.

You thought this was going to be quick and simple but…..

Chris says:
My attempts to find a clear, unambiguous way to determine the normality of a variable have thus far failed – indeed I do not think clear, practical and unambiguous criteria for making this decision exist at the moment.  I am coming to the view that the reason for this is a series of tensions between statistical purity, somewhat imprecise measurement practices and real-world research pressures.

There are three broad issues that plague this area.

The first is both philosophical and of the sort that keeps sociologists of science in business. Parametric procedures require that you accept the assumption that in the population the variable concerned is normally distributed.  As our data are a sample from this population it is unlikely that our data will be exactly normally distributed so we move into an area where we ask whether the distribution of our sample data deviates so much from the normal distribution that we question this assumption.  Should we do this on the basis of our single sample though? If the variable concerned has appeared to be normally distributed in other people’s samples then why should we reject this assumption just because it doesn’t look normally distributed in our data?

This has led to some ‘interesting’ thinking where common practice is to assume that if 10 published papers in the area have all reported parametric tests on the key variable then we should do the same in order not to break with tradition. Who are we to challenge the current orthodoxy? The problem, of course, is that many (most?) psychology papers do not report the results of normality tests, one just sees the ANOVA results, say, and one has to take it on faith that the data met the assumptions for this procedure.

Fortunately or unfortunately, depending on your viewpoint, many readily available parametric procedures offer the potential to answer more interesting questions than can be addressed with the limited range of non-parametric alternatives, so there is a subtle pressure to assume your variables are normally distributed so that you can engage with the interesting research questions and take part in the published literature.  Alternative procedures tend not to be implemented in the readily available stats packages (or indeed very well understood), so many people are prepared to ignore problems with data assumptions just to get their work done.  In the absence of widespread open reporting of normality tests (or even skew and kurtosis figures) we may well be unwittingly building a body of research on very shaky ground.  Micceri (1989) suggests that genuinely normally distributed psychological data are actually rather rare.

The second issue concerns the nature and quality of many psychological measures.  I won’t restate the orthodoxy on levels of measurement but many of the more interesting variables psychologists work with are fundamentally latent constructs with unknown units of measurement (e.g. depression, satisfaction, attitude, extroversion etc.).  As a result we tend to measure these indirectly often using psychometric scales and the like.  These have scales of measurement where the interval between scale points cannot be claimed to be equal and thus these measures are strictly speaking ordinal.  The mean and variance of an ordinal measure do not have the same meaning as they do for interval or ratio-scale measures and so it is NOT strictly possible to determine whether such measures are normally distributed.

HOWEVER, many decades worth of psychological research has ignored this for at least two reasons: 1) as before, not being able to do parametric tests limits the kinds of questions that can easily be answered and many researchers would argue that by using them we have revealed some useful truths that have stood the test of time – violating these assumptions has been worth it in some sense.  2) There is an argument that many ordinal measures actually contain more information than merely order and that some of the better measures lie in a region somewhere between ordinal and interval level measurement (see Minium et al, 1993).

Take a simple example of a seven-point response scale for an attitude item. At one level this allows you to rank order people relative to their agreement with the attitude statement. It is also likely that a two-point difference in scores for two individuals reflects more of a difference than if they had only differed by one point. The possibility that you might be able to rank order the magnitude of differences, while not implying interval level measurement, suggests that the measure contains more than merely information on how to rank order respondents. The argument then runs that it would be rash to throw away this additional useful information and unnecessarily limit the possibility of revealing greater theoretical insights via the use of more elaborate statistical procedures.

Needless to say there are dissenters (e.g. Stine, 1989) who take a more traditional and strict view: using sophisticated techniques designed for one level of measurement on data of a less sophisticated level simply produces nonsense. Computer outputs will provide you with sensible-looking figures, but these will still be nonsense and should not be used to draw inferences about anything. This line of argument rejects the claim that using parametric tests with ordinal data will lead to the same conclusion most of the time, on the grounds that you will not know when you have stumbled across an exception to this ‘rule’.

The third issue concerns the lack of consensus on how best to decide whether a variable is normally distributed or not.  There are a number of options available including:

1) The Mk1 Eyeball Test – look at a histogram of your variable and superimpose a normal curve over the top.  SPSS and most stats packages will do this for you, and some will produce P-P plots as well (where all the data points should lie on the diagonal of the diagram if the variable is normally distributed).  The advantage is that this is easy to do, but the obvious disadvantage is that there are no agreed criteria for how far your data can deviate from normality before it is unsafe to proceed with parametric tests.

2a) Skew and Kurtosis.  SPSS will readily give you these figures for any variable along with their standard errors.  Both skew and kurtosis should be zero for a perfectly normally distributed variable (strictly speaking the ideal kurtosis value is 3, but most packages including SPSS subtract 3 from the value so that the ideal values of both skew and kurtosis are zero).  Some texts suggest that you divide the skew and kurtosis values by their standard errors to get z-scores, with rules of thumb for what counts as a significant deviation from normality.  Unfortunately the standard errors shrink as sample size grows, so in big samples most variables will fail these tests even though they may not deviate from normality by enough to make any real difference.  Conversely, in small samples we are likely to conclude that variables are normally distributed even when there are quite substantial deviations from normality.

2b) A related rule of thumb approach is to set boundaries for the skew and kurtosis values.  Some researchers are happy to accept variables with skew and kurtosis values in the range +2 to -2 as near enough normally distributed for most purposes (note the vague ‘most purposes’ which is common in this field and rarely clearly defined).  Both skew and kurtosis have to be in this range – if either one is outside it then the variable fails the normality test. Others are slightly more conservative and use a +1 to -1 range.
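The rule of thumb above is mechanical enough to write down directly. This hypothetical helper (the function name and default band are my choices, not the FAQ's) returns a pass only when BOTH skew and excess kurtosis fall inside the chosen band:

```python
def near_enough_normal(skew, kurtosis, bound=2.0):
    """Rule-of-thumb screen: both skew and excess kurtosis must lie
    within +/-bound (use bound=1.0 for the more conservative version)."""
    return abs(skew) <= bound and abs(kurtosis) <= bound
```

So `near_enough_normal(0.4, 1.1)` passes, but `near_enough_normal(0.4, 2.5)` fails even though the skew on its own looks fine.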

3) Formal tests of normality.  A range of these exist, the best known being the Kolmogorov-Smirnov test. Others include the K-S Lilliefors test, the Shapiro-Wilk test, the Jarque-Bera LM test and the D’Agostino-Pearson K2 omnibus test.  While each has its following, they all suffer (to varying degrees) from problems like those for the z-scores noted above – as the sample size increases so does the likelihood of rejecting a variable that deviates only slightly from normality.  Of these tests the D’Agostino-Pearson omnibus test is claimed to be the most powerful but unfortunately it is not yet readily available in SPSS (but see links below).
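Several of these tests are available outside SPSS; for instance, in Python's scipy (my assumption, not something the FAQ covers), `scipy.stats.normaltest` is an implementation of the D'Agostino-Pearson K2 omnibus test:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
normal_data = rng.normal(size=200)       # genuinely normal sample
skewed_data = rng.exponential(size=500)  # heavily skewed sample

sw_stat, sw_p = stats.shapiro(normal_data)      # Shapiro-Wilk
jb_stat, jb_p = stats.jarque_bera(normal_data)  # Jarque-Bera LM
k2_stat, k2_p = stats.normaltest(normal_data)   # D'Agostino-Pearson K2

# A heavily skewed variable should be rejected decisively:
_, k2_p_skewed = stats.normaltest(skewed_data)
```

In each case a small p-value means the test is rejecting normality, so for the skewed sample `k2_p_skewed` comes out far below any conventional alpha.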

These formal tests tend to be quite conservative and can reject as non-normal variables that have ‘passed’ the eyeball and rule-of-thumb checks above.  They indicate significant deviation from normality but do not tell you whether your variable is so non-normal that it violates the assumptions of your parametric test badly enough to give you the wrong answer (see the section Variable Robustness of Tests below).

4) A great range of alternative indices exist to assess degree of asymmetry, ‘weight’ in the tails of distributions and degree of multi-modality (Micceri, 1989 discusses many of these).  Few are readily available in standard packages and a good degree of vagueness exists about what the appropriate cut-off values are for these so I won’t expand on them here.

Variable Robustness of Tests

A related issue rarely given much of an airing in introductory stats texts concerns the variable ‘robustness’ of the various parametric tests out there.  You might be forgiven for thinking that your variables are either normal enough to do parametric tests or they are not, but really we ought to be asking ‘normal enough for what purpose?’.  Some tests like the humble t-test or ANOVA are said to be ‘robust’ to violations of the normality assumption, meaning that your sample data might deviate quite a bit from normality but the test will still lead you to the right conclusion about your null hypothesis.  Other parametric procedures, for example t-tests with unequal group sizes, are less robust and comparatively small deviations from normality can have important consequences for drawing appropriate conclusions from the data.  Excessive kurtosis tends to affect procedures based on variances and covariances while excessive skew biases tests on means. How your data deviate from normality matters (see DeCarlo, 1997).

You won’t be surprised that, as with the normality tests themselves, no clear consensus has emerged on how much deviation from normality is a problem for which tests.  This is not to say that the impact of varying degrees of non-normality cannot be assessed.  In fact it is relatively easy to do this using Monte Carlo simulation studies and many articles exist summarising these effects (see West, Finch and Curran, 1995; Micceri, 1989); however, studies on real data are still relatively rare and there is debate about whether simulated data really mimic real data.
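A toy version of such a Monte Carlo study is easy to sketch (the set-up below is illustrative, not taken from the papers cited). Both groups are drawn from the same heavily skewed exponential population, so the null hypothesis is true; if the t-test is robust here, the proportion of significant results should sit near the nominal 5%:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_sims, n_per_group, alpha = 2000, 40, 0.05

rejections = 0
for _ in range(n_sims):
    # Two groups from the SAME skewed population: the null is true
    a = rng.exponential(scale=1.0, size=n_per_group)
    b = rng.exponential(scale=1.0, size=n_per_group)
    _, p = stats.ttest_ind(a, b)
    rejections += p < alpha

empirical_alpha = rejections / n_sims   # should land near 0.05
```

Varying the population shape, group sizes and their (in)equality in such a loop is exactly how the simulation literature maps out which violations matter for which tests.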

OK, but what do I do then?

Ultimately the best way forward is to strive for clarity and openness so:

1) Always report skew and kurtosis values (with their SEs) when describing your data – if the journal asks you to take them out to save space later then that is fine – it is their decision and the reviewers will have had a chance to see the figures when making up their minds about your analyses.

2) Always look at the histogram with normal curve superimposed over it.  You can get good skew and kurtosis figures yet still have horrible looking distributions especially in small samples (n<50) – strive to understand why your data are the way they are.  Remember that true normality is relatively rare in psychology (Micceri, 1989).

3) Pick a normality test or criterion.

Chris says:
Which one you use rather depends upon your sample size(s).   If your sample is small (n < 100 in a correlational study or n < 50 in each group if comparing means) then calculate z-scores for skew and kurtosis and reject as non-normal those variables with either z-score greater than an absolute value of 1.96.  If your sample is of medium size (100 < n < 300 for correlational studies or 50 < n < 150 in each group if comparing means) then calculate z-scores for skew and kurtosis and reject as non-normal those variables with either z-score greater than an absolute value of 3.29.  With sample sizes greater than 300 (or 150 per group) be guided by the histogram and the absolute sizes of the skew and kurtosis values.  Absolute values above 2 are likely to indicate substantial non-normality.

If you are keen, run a Shapiro-Wilk test in SPSS (which should be non-significant if your variables are normally distributed) but do not rely on this if your sample is large (n>300).

Getting z-scores is simple: divide the skew (or kurtosis) value by its standard error.
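In Python the calculation might look like the sketch below (an assumption on my part; SPSS reports the same quantities). The standard errors here are the common large-sample approximations, the square roots of 6/n and 24/n; SPSS uses slightly more exact formulas, so its figures will differ a little:

```python
import math
from scipy import stats

def normality_z_scores(data):
    """z-scores for skew and (excess) kurtosis; both are 0 for a
    perfectly normal variable."""
    n = len(data)
    se_skew = math.sqrt(6.0 / n)    # large-sample approximation
    se_kurt = math.sqrt(24.0 / n)   # large-sample approximation
    return stats.skew(data) / se_skew, stats.kurtosis(data) / se_kurt

# A perfectly symmetric (uniform) variable: skew is exactly 0, but the
# flat shape shows up as clearly negative kurtosis.
z_skew, z_kurt = normality_z_scores(list(range(1, 101)))
```

Note that `scipy.stats.kurtosis` already subtracts 3, matching the SPSS convention described earlier.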

4) Be clear in your text which normality test or criteria you have used. This can be in a footnote if you like. As with (1) above if the journal or your supervisor asks you to take this out later then that’s OK.

5) If using ordinal scaled measures know that you are violating the assumptions of parametric procedures and that you are doing this in order to engage with the literature on its own, possibly misguided, terms. Where possible conduct both the parametric test and the more appropriate non-parametric equivalent and see if you arrive at the same conclusion about the null hypothesis. If the conclusions differ, err on the side of caution and use the more appropriate non-parametric technique.

Outliers

Although it may seem obvious, whichever way you choose to decide on the normality of your variables you need to do this after having screened for outliers and removed them (if that is appropriate), not before!  If you have outliers that should not really be in your data (typos, invalid responses) these can inflate both skew and kurtosis values.
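As a hypothetical illustration of the point, here a single data-entry error (500 typed for 50) inflates the skew figure dramatically, and a simple 3-SD screen (one common convention among several, not a rule from this FAQ) removes it:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
# 49 plausible scores plus one typo: 500.0 entered instead of 50.0
scores = np.append(rng.normal(50, 5, size=49), 500.0)

skew_before = stats.skew(scores)   # grossly inflated by the one typo

z = np.abs(stats.zscore(scores))   # distance from the mean in SDs
clean = scores[z < 3]              # drop cases more than 3 SDs out
skew_after = stats.skew(clean)     # back to an unremarkable value
```

One caution worth knowing: with very small samples a z-score screen can never flag anything (the largest possible z in a sample of n is (n-1)/sqrt(n)), so with, say, n = 10 a 3-SD cut-off is mathematically unreachable and a different screen is needed.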


Slightly more involved is Lawrence DeCarlo’s SPSS macro, which can be found at: http://www.columbia.edu/~ld208/. This will calculate the D’Agostino-Pearson test, but it also does a lot of other screening, some of which may cause errors depending on your data, so don’t be fazed by the error messages.

DeCarlo, L.T. (1997). On the meaning and use of kurtosis. Psychological Methods, 2, 292-307.

Micceri, T. (1989). The unicorn, the normal curve, and other improbable creatures. Psychological Bulletin, 105, 156-166.

Minium, E.W., King, B.M. and Bear, G. (1993). Statistical Reasoning in Psychology and Education. New York: John Wiley and Sons.

Stine, W.W. (1989). Meaningful inference: The role of measurement in statistics. Psychological Bulletin, 105, 147-155.

West, S.G., Finch, J.F. and Curran, P.J. (1995). Structural equation models with non-normal variables: problems and remedies. In R.H. Hoyle (ed.) Structural Equation Modeling: Concepts, Issues and Applications. Thousand Oaks, CA: Sage Publications.


Thanks

I am grateful to all those academics around the globe who have made their own thoughts on these issues freely available. This site wouldn’t work without you.

Legal backside covering bit:
The pages and links are provided on an "as is" basis, and the user must accept responsibility for the results produced. The author, the Department, and the University accept no liability or responsibility for any damage, injury, loss of money, time or reputation, or anything else anyone can think of, caused by or related to the use of these programs, downloading these pages, or in the course of being connected to our site.

Page Owner: pss1ab
Page Created: Tuesday 3 November 2009 11:39:55 by pss1ab