Statistics FAQ
Compiled by Chris Fife-Schaw
This set of web pages is intended as a resource for psychology students and staff at the University of Surrey though it is available to all. It deals with a range of questions that we get asked a lot but are not routinely covered in basic statistics books.
The idea is to give practical advice, some stats utilities and to point people in the direction of authoritative sources. It is not intended to give extensive statistical arguments in support of the advice given – that will be available via references and web-links.
It is an evolving resource and I’ll add new features and update the entries as received wisdom on these matters changes over time. While every effort will be made to ensure that the answers to these FAQs are correct, it is inevitable that there will be occasions where the advice given is the subject of debate or is possibly just plain wrong. Please let me know if you find such things and I will change the entry accordingly (c.fife-Schaw@surrey.ac.uk). Similarly, where there are web links I cannot guarantee their accuracy and continued availability – if you notice any problems, again e-mail me at c.fife-Schaw@surrey.ac.uk and I will try to fix the problem.
Where I am suggesting solutions that are based on my own views or preferences, this will be indicated by indenting the advice and prefacing it with ‘Chris says:’ – treat such advice with considerable caution!
On Correlations
- How do I disattenuate a correlation (‘correct’ for measurement (un)reliability)?
When we correlate two variables, the value of the correlation is depressed if one or both of the variables contains some random measurement error (i.e. is not perfectly reliable). As random error by definition cannot be correlated with anything, the more random error there is in your variables the lower the maximum possible correlation between them can be (remember that the square of a Pearson’s correlation coefficient is the proportion of variance that the two variables share). This is particularly the case with those kinds of measures popular in psychology that involve composite scales made up of responses to a number of questionnaire items – these rarely have less than 5% error variance and 15%+ is probably the norm. Since we are usually interested in the true correlation between the conceptual variables rather than the correlation between the observed measures, it would be handy to be able to estimate what the correlation would be after taking measurement (un)reliability into account.
The following formula does this:

$$r^*_{12} = \frac{r_{12}}{\sqrt{r_{11}\,r_{22}}}$$

where r*12 is the disattenuated correlation, r12 the observed correlation and r11 and r22 the reliabilities of the two variables.
Since gauging the ‘true’, disattenuated correlation is usually important for readers wanting to assess the magnitude of any relationship found, you should always report the reliabilities of your measures even if you are not going to report the disattenuated values yourself.
It is similarly possible to correct a partial correlation and/or adjust a regression weight to reflect unreliability in measurement (Osborne, 2003). Here the effect of measurement error can be to artificially increase the apparent importance of a relationship. The formula for correcting a partial correlation coefficient is:

$$r^*_{12.3} = \frac{r_{33}\,r_{12} - r_{13}\,r_{23}}{\sqrt{(r_{11}r_{33} - r_{13}^2)(r_{22}r_{33} - r_{23}^2)}}$$

where r*12.3 is the corrected partial coefficient, r12, r13 and r23 the zero-order observed correlations and r11, r22 and r33 the variables’ respective reliabilities. See Osborne (2003) or Aiken and West (1991) for discussions of the impact of measurement errors on regression coefficients.
Chris’s Calculator [download a copy] has a utility for doing these calculations.
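If you would rather script the calculation than use the Calculator, here is a minimal Python sketch of the two corrections above (my own illustration, not the Calculator’s code; the function names are made up):

```python
import math

def disattenuate(r12, r11, r22):
    """Correlation corrected for unreliability in both measures."""
    return r12 / math.sqrt(r11 * r22)

def disattenuate_partial(r12, r13, r23, r11, r22, r33):
    """Partial correlation r12.3 corrected for unreliability."""
    num = r33 * r12 - r13 * r23
    den = math.sqrt((r11 * r33 - r13 ** 2) * (r22 * r33 - r23 ** 2))
    return num / den

# Example: an observed r of .40 with reliabilities of .80 and .70
print(disattenuate(0.40, 0.80, 0.70))  # about .53
```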
Assumptions:
All the usual ones for doing Pearson product moment correlations plus:
1. The errors in each measure are assumed to be random.
2. Your estimates of the reliabilities are unbiased.
Useful web links and references:
Aiken, L. S., & West, S. G. (1991). Multiple regression: Testing and interpreting interactions. Thousand Oaks: Sage.
Fan, X. (2003). Two approaches for correcting correlation attenuation caused by measurement error: Implications for research practice. Educational and Psychological Measurement, 63, 915-930. Download a copy from: http://epm.sagepub.com/cgi/content/abstract/63/6/915
Osborne, J. W. (2003). Effect sizes and the disattenuation of correlation and regression coefficients: lessons from educational psychology. Practical Assessment, Research & Evaluation, 8(11). – web paper at http://pareonline.net/getvn.asp?v=8&n=11
Chris says:
Correcting for attenuation often makes your results look better and is seen by some as appealing for this reason alone. The lower the reliabilities of the measures, the more the corrected correlation increases over the observed one, but this is not a good reason to be happy to use poor quality, low reliability measures knowing that the correlations you observe will be increased more substantially when you correct them later. First, the true correlation between the variables is not changed by the quality of the tool used to measure them. Secondly, experience shows that as the quality of measures decreases so does the likelihood that the source of the errors is random. This violates a key assumption of the procedure and renders the result ambiguous. Disattenuating a correlation is not a way of turning a not-quite-significant observed correlation into a significant one either. Most texts on this do not go into the significance of the disattenuated correlation, preferring instead to emphasise its estimated size since this is the important bit of information being conveyed (note that the formulae contain no information on sample sizes, which would be necessary in order to draw up confidence limits). Don’t try to use disattenuation to argue that your findings are now ‘significant’ when they weren’t pre-correction – this will be seen through. Always indicate when you have disattenuated a correlation too.
This method for correcting for attenuation is the best known and is easy to use. In more complex modelling situations it is probably easier to adopt a SEM approach to assessing relationships between variables with measurement errors ‘removed’ than to try to apply this formula on many relationships simultaneously. Fan (2003) shows that the SEM approach (at least in the CFA context) produces equivalent results to the application of the formula above.
- How do I work out if two correlation coefficients are significantly different?
The key here is whether the correlations concerned are independent – i.e. from different samples – or from the same sample. This page deals with three prototypical cases:
Case 1: Testing the difference between correlations from different samples
Case 2: Testing the difference between two dependent correlations – i.e. the correlation between X and Z and the correlation between Y and Z
Case 3: Testing the difference between two correlations from the same sample – i.e. the correlation between A and B and the correlation between C and D
Select the case that matches your problem for more advice.
Useful web links and references:
http://home.clara.net/sisa/corrhlp.htm - this contains an on-line calculator for doing the calculations for Cases 1 and 2.
Steiger, J.H. (1980) Tests for comparing elements of a correlation matrix. Psychological Bulletin, 87, 245–251. (download from http://www.statpower.net/publications_and_papers.htm)
Hittner, J.J., May, K. & Silver, N.C. (2003) A Monte Carlo evaluation of tests for comparing dependent correlations. Journal of General Psychology, 130, 149-168. (download from http://www.findarticles.com/p/articles/mi_m2405/is_2_130/ai_101939294)
Chris says:
Not a lot. James Steiger said a lot about this 25 years ago and while there have been more recent useful Monte Carlo studies (e.g. Hittner et al, 2003) the broad conclusions remain the same. Let me know if you know different.
- Case 1: Testing the difference between correlations from different samples
The most widely used procedure starts by first converting the two correlations via Fisher’s r-z transformation using:
$$z_r = \frac{1}{2}\ln\!\left(\frac{1+r}{1-r}\right)$$
You calculate a Z for each correlation, which I’ll call Zr1 and Zr2. Pearson’s correlation coefficients cannot readily be added or subtracted from each other but these transformed ones can be. Confusingly the procedure uses ‘z’ in two different ways as we’ll see. Next you need to calculate the standard error for the difference between the two correlations, which is given by:
$$SE_{z_{r1}-z_{r2}} = \sqrt{\frac{1}{n_1-3}+\frac{1}{n_2-3}}$$
where n1 and n2 are the sample sizes for the first and second correlations respectively. Next divide the difference between the two z-scores by the standard error as follows:
$$z = \frac{z_{r1}-z_{r2}}{SE_{z_{r1}-z_{r2}}}$$
This time ‘z’ is the ordinary normal deviate z. You can look up the probability of having observed a ‘z’ as big as this (either positive or negative) if the two correlations did not differ from each other in the population in normal ‘z’ tables in the back of a statistics book. As a guide, if the z value is 1.96 or higher (ignoring the sign), the difference in the correlations is significant at the .05 level (2-tailed). Use a 2.58 cut-off for significance at the .01 level (2-tailed).
Assumptions
1. The two correlations are from independent samples/groups of subjects.
2. The scores in each group are sampled randomly and independently.
3. The distribution of the two variables involved in each correlation is bivariate normal.
Chris’s Calculator [download a copy] has a utility for doing these calculations.
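For those who prefer to script this, here is a minimal Python sketch of the whole Case 1 procedure (my illustration; scipy is used only to turn the final z into a two-tailed p-value):

```python
import math
from scipy.stats import norm

def fisher_z(r):
    """Fisher's r-to-z transformation."""
    return 0.5 * math.log((1 + r) / (1 - r))

def compare_independent_r(r1, n1, r2, n2):
    """z test for the difference between two independent correlations."""
    se = math.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    z = (fisher_z(r1) - fisher_z(r2)) / se
    return z, 2 * norm.sf(abs(z))  # two-tailed p-value

print(compare_independent_r(0.50, 100, 0.30, 120))  # z ~ 1.75, p ~ .08
```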
- Case 2: Testing the difference between dependent correlations – r12 vs r13
In this case you want to test the null hypothesis that two correlations are equal when they share one variable in common. James Steiger (1980) recommends Williams’ formula (T2) as the best general purpose test for this (there are lots out there!):
$$T_2 = (r_{12}-r_{13})\sqrt{\frac{(N-1)(1+r_{23})}{2\left(\frac{N-1}{N-3}\right)|R| + \bar{r}^2(1-r_{23})^3}}$$

where $|R| = 1 - r_{12}^2 - r_{13}^2 - r_{23}^2 + 2r_{12}r_{13}r_{23}$ and $\bar{r} = (r_{12}+r_{13})/2$.
This has a t-distribution with df = N – 3 and can be looked up in tables in the normal way.
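A Python sketch of Williams’ T2 as given above (my own translation of the formula – check it against Steiger, 1980, before relying on it):

```python
import math
from scipy.stats import t as t_dist

def williams_t2(r12, r13, r23, n):
    """Williams' T2 for dependent correlations r12 vs r13 sharing variable 1."""
    det_r = 1 - r12**2 - r13**2 - r23**2 + 2 * r12 * r13 * r23  # |R|
    rbar = (r12 + r13) / 2
    t2 = (r12 - r13) * math.sqrt(
        (n - 1) * (1 + r23)
        / (2 * ((n - 1) / (n - 3)) * det_r + rbar**2 * (1 - r23) ** 3)
    )
    return t2, 2 * t_dist.sf(abs(t2), df=n - 3)  # two-tailed p

print(williams_t2(0.50, 0.30, 0.40, 103))
```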
- Case 3: Testing the difference between dependent correlations – r12 vs r34
In this situation you want to test the null hypothesis that two separate correlations from the same population are the same. As an example you might have looked at the correlation between IQ and reading ability at age seven and want to ask whether the strength of this correlation has changed by the time the same individuals have reached eleven years old. In this situation you need to use James Steiger’s (1980) Zbar*2:

$$\bar{Z}^*_2 = \frac{(z_{12}-z_{34})\sqrt{N-3}}{\sqrt{2-2\bar{s}_{12,34}}}$$
where z12 and z34 are the Fisher’s r-z transformed correlations that you want to compare and s̄12,34 is the estimated covariance between the two transformed correlations. Following Steiger’s suggestion the two correlations are pooled, $\bar{r} = (r_{12}+r_{34})/2$, and the covariance is estimated from the Pearson–Filon expression:

$$\bar{s}_{12,34} = \frac{\tfrac{1}{2}\bar{r}^2(r_{13}^2+r_{14}^2+r_{23}^2+r_{24}^2) + r_{13}r_{24} + r_{14}r_{23} - \bar{r}(r_{13}r_{14}+r_{23}r_{24}+r_{13}r_{23}+r_{14}r_{24})}{(1-\bar{r}^2)^2}$$
Chris’s Calculator [download a copy] has a utility for doing these calculations.
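And a Python sketch of Zbar*2. The covariance term here follows my reading of Steiger (1980) and the Pearson–Filon result, so treat it as an unverified illustration and check it against the paper (or against Chris’s Calculator) before using it in earnest:

```python
import math
from scipy.stats import norm

def steiger_zbar2(r12, r34, r13, r14, r23, r24, n):
    """Steiger's Zbar*2 for comparing r12 vs r34 (no variable in common)."""
    z12, z34 = math.atanh(r12), math.atanh(r34)  # Fisher r-to-z
    rbar = (r12 + r34) / 2  # pooled correlation under the null hypothesis
    # Pearson-Filon covariance term with rbar in place of r12 and r34
    psi = (0.5 * rbar**2 * (r13**2 + r14**2 + r23**2 + r24**2)
           + r13 * r24 + r14 * r23
           - rbar * (r13 * r14 + r23 * r24 + r13 * r23 + r14 * r24))
    sbar = psi / (1 - rbar**2) ** 2
    z = (z12 - z34) * math.sqrt(n - 3) / math.sqrt(2 - 2 * sbar)
    return z, 2 * norm.sf(abs(z))  # two-tailed p
```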
On Regression
- How can I tell whether a variable mediates the relationship between two other variables?
The classic text on this is Baron and Kenny’s (1986) paper and you ought to read this and MacKinnon et al. (2002) too before launching into mediation analyses. Mediation testing is an area where there is some agreement about the broad principles involved but some argument about the details (see later in this section).
I will deal with the Sobel and Aroian Tests here. First a (very) little background based on Baron and Kenny’s (1986) ideas. In the diagram below X is an IV, Y a DV and M is a potential mediator variable. For M to be a mediator of the relationship between X and Y several conditions have to be met.
- X must significantly predict Y. If it does not then there is no relationship to mediate (note that X can still have a direct effect on M and M can affect Y but this is an indirect effect, not mediation of a relationship between X and Y).
- X must significantly predict M.
- M must significantly predict Y after controlling for X. If, by adding M to the prediction of Y from X in a regression model, the effect of X falls close to zero (c ~ 0) then you have full or complete mediation. If the effect of introducing M is to reduce c by a non-trivial amount but not to zero, you have partial mediation. If c is not reduced there is no mediation.
Traditionally people check that the above conditions have been met and leave it at that, but a stronger test of mediation is that the indirect effect of X on Y via M is significantly different from zero, which is what the Sobel and Aroian tests do.
[Diagram: X → M, path a (sa); M → Y, path b (sb); X → Y, path c (sc).]
a, b, and c are path coefficients; the terms in brackets are the standard errors of these coefficients.
The Sobel and Aroian Tests look to see whether the indirect effect of X on Y (i.e. via M) is significantly different from zero. To do this, run a regression analysis with the IV (X) predicting the mediator (M). This will give a and sa. Then run a regression analysis with the IV (X) and mediator (M) both predicting the DV. This will give b and sb. Both a and b should be the unstandardised regression coefficients taken from the column headed ‘B’ in the SPSS regression output tables. When you have the figures plug them into the following equations:
Sobel: $$z = \frac{ab}{\sqrt{b^2s_a^2 + a^2s_b^2}}$$

Aroian: $$z = \frac{ab}{\sqrt{b^2s_a^2 + a^2s_b^2 + s_a^2s_b^2}}$$
The resulting z-values can be looked up in the normal way in statistical tables. Assuming a reasonable sized sample, absolute values greater than 1.96 will be significant at the 0.05 level (2-tailed). The Aroian test is recommended over the Sobel test because the Sobel test assumes the $s_a^2s_b^2$ term is so small as to be negligible, though this may not be true in practice.
Chris’s Calculator [download a copy] has a utility for doing these calculations.
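If you would rather compute the tests yourself, here is a minimal Python sketch of both equations (illustrative only; a, sa, b and sb come from the two regressions described above):

```python
import math
from scipy.stats import norm

def sobel_aroian(a, sa, b, sb):
    """Sobel and Aroian z tests of the indirect effect a*b."""
    core = b**2 * sa**2 + a**2 * sb**2
    z_sobel = a * b / math.sqrt(core)
    z_aroian = a * b / math.sqrt(core + sa**2 * sb**2)  # extra sa^2*sb^2 term
    two_tailed = lambda z: 2 * norm.sf(abs(z))
    return (z_sobel, two_tailed(z_sobel)), (z_aroian, two_tailed(z_aroian))

print(sobel_aroian(a=0.50, sa=0.10, b=0.40, sb=0.12))
```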
Assumptions:
All the usual ones for doing multiple regression plus:
- There should be no measurement error in the mediator
- The DV should not cause the mediator
- You have a reasonably large sample (at least 200 - any less and the power of the test is likely to be very low esp. if you have less-than-perfect measures of your variables)
Chris says:
It is tempting to try to demonstrate that several variables act as mediators either in the form of a chain of mediating relationships e.g. A → B → C → D or simultaneously as in A → (BC) → D. Trying to sort this out via multiple Sobel/Aroian testing is likely to be very confusing indeed. If you have such models it might be easier to test these within an SEM framework if your data meet the assumptions for that approach. The advantage of SEM is that problems with measurement errors and correlated measurement errors (another source of problems in mediation analysis) can be modelled directly and in some sense be accounted for. Preacher and Hayes (2008) also argue that the approaches I have outlined above are somewhat unreliable in small samples and that it would be better to use bootstrapping approaches, especially as these do not rely heavily on assumptions of normality. Kristopher Preacher’s web pages (http://quantpsy.org/medn.htm) contain a number of macros for doing this in both SPSS and in SEM packages.
Useful web links and references:
Baron, R. M., & Kenny, D. A. (1986). The moderator-mediator variable distinction in social psychological research: Conceptual, strategic, and statistical considerations. Journal of Personality and Social Psychology, 51, 1173-1182.
Frazier, P.A., Tix, A.P. & Barron, K.E. (2004). Testing moderator and mediator effects in counseling psychology research. Journal of Counseling Psychology, 51, 115-134. (available from http://psyphz.psych.wisc.edu/~shackman/frazier_barron_mediation_jcc.pdf)
MacKinnon, D. P., Lockwood, C. M., Hoffman, J. M., West, S. G., & Sheets, V. (2002). A comparison of methods to test mediation and other intervening variable effects. Psychological Methods, 7, 83-104.
Preacher, K. J., & Hayes, A. F. (2004). SPSS and SAS procedures for estimating indirect effects in simple mediation models. Behavior Research Methods, Instruments, & Computers, 36(4), 717-731.
Preacher, K., & Hayes, A. (2008). Asymptotic and resampling strategies for assessing and comparing indirect effects in multiple mediator models. Behavior Research Methods, 40, 879-891.
http://quantpsy.org/medn.htm - this leads to Kristopher Preacher’s pages on mediation testing and an on-line calculator for doing the analyses. There are also some useful SPSS macros there for bootstrapping alternatives to the Sobel test which don’t rely on making so many assumptions about your data. In fact these are to be preferred to the Sobel and Aroian Tests and should be used if you are familiar with SPSS syntax macros (which are worth learning about anyway).
http://www.public.asu.edu/~davidpm/ripl/q&a.htm - goes to Dave MacKinnon’s FAQ page on mediation which deals with common questions asked about mediation.
- How do I test whether a variable moderates the relationship between two variables?
The classic paper on this topic is Baron and Kenny’s (1986) JPSP paper and what follows is based largely on this article, but if you are really interested in this Frazier et al. (2004) say more about the analytic strategies available and include a discussion of SEM approaches to moderation. First some basic stuff:
To quote directly from Baron and Kenny (1986, p. 1174): ‘…a moderator is a qualitative (e.g. sex, race, class) or quantitative (e.g. level of reward) variable that affects the direction and/or strength of the relation between an independent or predictor variable and a dependent or criterion variable.’
[Diagram: predictor (path a) and moderator (path b) each predict the outcome, and the predictor × moderator interaction predicts the outcome via path c.]
Conceptually when running a test for moderation you test main effects of the predictor and moderator on the outcome variable AND the interaction between predictor and the moderator. If the interaction term is significant (i.e. c is significantly different from zero) then, with a few cautionary caveats, you can claim to have demonstrated moderation. Whether the main effects (a and b) are significant or not isn’t strictly relevant to whether you have demonstrated moderation.
From an analytic viewpoint there are four prototypical cases of moderation, depending on the levels of measurement you have.
Case 1: Both the predictor/IV and the moderator are dichotomous.
Here a standard 2 × 2 ANOVA is appropriate and the key test for moderation is that the interaction term is significant. The power of the test of the interaction term is greatly influenced by the sample size and the proportion of the sample in each level of the variables. As the dichotomies deviate from 50/50 the power goes down quite dramatically (see Frazier et al, 2004).
Case 2: The moderator is dichotomous but the predictor and outcome are both continuous.
The simplest analytical approach is to regress the outcome onto the predictor separately for each level of the moderator variable. So, if your moderator variable is ‘sex’ you regress your outcome onto the predictor separately for males and females. The unstandardised regression coefficients are then compared to see if they are different. It is the difference between them (i.e. the interaction) which is important for moderation rather than whether the regression coefficients are themselves significant.
Chris’s Calculator [download a copy] has a utility for doing these calculations.
See also the page on comparing regression coefficients. This approach assumes that measurement errors in the predictor are the same in both groups. If not, the results will be biased and it would be better to adopt a multigroup structural equation modelling (SEM) approach to this problem.
Case 3: The moderator and outcome are continuous but the predictor is a dichotomy.
In this case things get a little more complicated: the moderator has many levels, and you need to specify how the moderator is supposed to influence the relationship between the IV and DV before you can test whether it does.
Taking Baron and Kenny’s (1986) example we might have two health messages, a fear-arousing one and a rational one – with message type being our dichotomous IV. We are interested in how much attitude change this produces (the DV) and we think the relationship might be moderated by IQ with, perhaps, the high-IQ participants more influenced by the rational message and the low-IQ folk more persuaded by the fear-arousing message. We need to know how IQ moderates this relationship. Does the relationship between message type and persuasion change gradually as IQ increases or is there some critical threshold of IQ at which point there is a sudden, step-like change in the relationship and fear-arousing messages cease to be effective at all and rational ones suddenly become highly effective? If the relationship is a gradual one is it a linear relationship or a curvilinear one, perhaps of a quadratic sort, where the strength of the IV-DV relationship changes much more quickly at high levels of IQ?
If the suspected relationship is a linear one then the strategy is to regress the DV (Y below) on the IV (X below), the Moderator (M below) and the interaction of the moderator and the IV so the model would be:
$$Y = b_0 + b_1X + b_2M + b_3(X \times M) + e$$
For there to be evidence of moderation the b3 term must be significantly different from zero. Practically, when creating the X*M term, you first need to ‘centre’ the variables to reduce the likelihood of multicollinearity problems. To centre a variable you simply compute a new variable in SPSS which is the original variable minus its mean, so that the mean of this new variable is now zero. You then compute a new variable which is your centred X multiplied by your centred M. Run the regression using the centred variables rather than the originals.
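The same centring-and-interaction routine in Python, for anyone not using SPSS. This is a sketch assuming a data frame with columns X, M and Y (the column names are mine):

```python
import statsmodels.formula.api as smf

def moderation_test(df):
    """Regress Y on centred X, centred M and their product."""
    d = df.copy()
    d["Xc"] = d["X"] - d["X"].mean()  # centre the predictor
    d["Mc"] = d["M"] - d["M"].mean()  # centre the moderator
    d["XM"] = d["Xc"] * d["Mc"]       # interaction of the centred variables
    model = smf.ols("Y ~ Xc + Mc + XM", data=d).fit()
    return model  # the t-test on the XM coefficient is the test of moderation

# usage: print(moderation_test(my_data).summary())
```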
If the suspected relationship is a quadratic one then the strategy is to regress the DV (Y below) on the IV (X below), the Moderator (M below), the interaction of the moderator and the IV, the square of the moderator and the interaction of the squared moderator and the IV so (using centred variables as above) the model would be:
$$Y = b_0 + b_1X + b_2M + b_3(X \times M) + b_4M^2 + b_5(X \times M^2) + e$$
If the anticipated relationship is a step-like one then the problem becomes slightly more difficult but conceptually like Case 1. You dichotomise the moderator at the point at which the change occurs (making the variable just have ‘high’ and ‘low’ values) and proceed using an ANOVA approach. The difficulty lies in finding out where on the moderator’s scale the effect occurs. Unfortunately (or not, depending on your position on these things) the ANOVA won’t detect this kind of moderation if the moderator is dichotomised at the wrong point on its scale – you need to do some data exploration first.
Case 4: All three variables are continuous.
This is very much like Case 3’s linear and quadratic example and the same approaches can be used dependent on the kind of moderating relationship you expect.
If you also want to ‘control’ for the effects of other covariates in your analysis, these need to be included in the regression equation and should also be centred even if they are not involved in the computation of any interaction terms (this applies to Cases 2 and 3 as well; in Case 1 covariates can be handled by conducting ANCOVA in place of ANOVA).
Simple slopes analysis:
In the above four cases the presence of statistically significant moderation is indicated by the tests of interaction terms (or the difference in regression coefficients in Case 2) being significant. This is all well and good but it is usual to want to go a step further and work out what effect the moderator has on the nature of the relationship between the IV and DV. To do this it is usual to plot out graphs of the relationship between Y and X at different values of M. It is helpful to plot these on the same graph, with the relationship between Y and X plotted for the mean level of M and (usually) M at one standard deviation above its mean and one standard deviation below.
See the link to Jason Newsom’s web pages below – the SPSS macros found there will do all this for you, including the original regressions. The graph below is from the output; I haven’t re-labelled the X and Y variables, which you would do were these your own data (though I did re-label the moderator, as Jason uses ‘Z’ rather than ‘M’).
[Figure: example simple slopes plot showing the regression of Y on X at the mean of the moderator and at 1 SD above and below the mean.]
As you can see in this example when the moderator is high (1SD above its mean) the relationship between Y and X is strong and positive. When the moderator is low (1SD below its mean) there is virtually no relationship between Y and X.
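If you just want the numbers rather than the graph, the simple slopes can be read straight off the fitted coefficients. A sketch using the b1 (X) and b3 (interaction) estimates from the centred regression above (the values here are made up):

```python
def simple_slopes(b1, b3, m_sd):
    """Slope of Y on X at M = -1 SD, the mean, and +1 SD (centred metric)."""
    return {label: b1 + b3 * m
            for label, m in [("-1 SD", -m_sd), ("mean", 0.0), ("+1 SD", m_sd)]}

print(simple_slopes(b1=0.25, b3=0.15, m_sd=1.8))
# roughly {'-1 SD': -0.02, 'mean': 0.25, '+1 SD': 0.52}
```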
Assumptions:
All the usual ones for doing multiple regression and ANOVA plus:
- Ideally the moderator should not be correlated with either the predictor or the outcome.
Useful web links and references:
Aiken, L. S., & West, S. G. (1991). Multiple regression: Testing and interpreting interactions. Thousand Oaks: Sage.
Baron, R. M., & Kenny, D. A. (1986). The moderator-mediator variable distinction in social psychological research: Conceptual, strategic, and statistical considerations. Journal of Personality and Social Psychology, 51, 1173-1182.
Frazier, P.A., Tix, A.P. & Barron, K.E. (2004). Testing moderator and mediator effects in counseling psychology research. Journal of Counseling Psychology, 51, 115-134. (available from http://psyphz.psych.wisc.edu/~shackman/frazier_barron_mediation_jcc.pdf)
http://www.psych.ku.edu/preacher/ - yet again Kristopher Preacher’s pages are very helpful and clear about moderation and they explain simple slopes analysis too.
http://www.upa.pdx.edu/IOA/newsom/macros.htm - Jason Newsom has written useful SPSS macros that deal with Cases 2 and 4 and produce simple slopes graphs for you - magic!
http://www.math.montana.edu/~rjboik/power.html - is an on-line power calculator provided by Robert J. Boik that will help you assess the likely sample size you will need to detect a moderation effect of a given size.
Chris says:
The procedures suggested above are all parametric ones that rely on your DV being measured at at least interval level and being normally distributed. Despite this, most questionnaire-based research reported in psychology journals will have been based on DVs that are probably ‘only’ ordinal in nature. It is tempting to get sucked into flashy technicalities associated with moderator analysis while losing sight of this fundamental violation of the techniques’ assumptions. Frazier et al. (2004) also note that the number of response options in the DV should be greater than the product of the number of response options in the moderator and IV – e.g. if your IV and moderator were responded to on 5-point scales the DV should be responded to on a 25-point scale. Simply having a composite DV made up of lots of 5-point items doesn’t get over this problem and they suggest allowing respondents to use a mark on a continuous line as an alternative. I’m not sure how important this is or how often it is met in practice but it is something to be aware of in case you are asked at your viva…
Matters only get worse when attempting to test multiple moderating relationships at the same time. In some texts like Aiken and West (1991) you will see examples of higher order (e.g. 3-way and 4-way) interactions being tested. This is clearly technically possible to do but my experience suggests that detecting 2-way interactions, let alone higher order ones, is difficult enough in practice. The tests for interactions are relatively low powered so you need to be looking at strong and clear moderating relationships before these will be readily found to be ‘significant’. This should make you wary of attempting to find moderation in small data sets…
- How do I work out if two regression coefficients differ significantly?
Case 1: Regression coefficients come from separate samples.
It is not possible to propose a single formula to deal with this as there are many competing formulae and approaches in the literature. For the calculator I have gone with the method suggested by Raymond Paternoster and colleagues (1998). It is fairly conservative and follows on from more extensive work by Clogg et al. (1995).
$$Z = \frac{b_1 - b_2}{\sqrt{SE_{b_1}^2 + SE_{b_2}^2}}$$
The reason that finding a single test for this is difficult is that there are a number of different situations in which you might want to test the equality of two regression coefficients. A common application is to see whether a grouping variable such as gender moderates a relationship between two variables and this is dealt with under a separate heading in these FAQ pages (this is effectively testing the null hypothesis that the regression coefficients are the same in both groups).
The Z-test above assumes that the regressions are done on two separate groups and that the variables in the regression model in each group are the same. It is not appropriate if/when there are different sets of variables, because the meaning of a regression coefficient is dependent on the other variables in the regression equation. If you recall, the coefficient indicates the expected change in the outcome for a unit change in the predictor assuming the other variables in the model are held constant. If there are different sets of variables involved, the effects of holding them constant may well be different.
For those keen on testing equivalence of regression coefficients more rigorously, a slightly better way would be to take a multi-group structural equation modelling approach. Here you would initially test your regression model constraining all coefficients to be equivalent in both groups. If this equivalence model is not rejected then it may be reasonable to accept (or rather not reject) the hypothesis that the coefficients are the same in both groups. If allowing just the coefficients of interest to differ in the two groups, while everything else is kept constrained to be identical, substantially improves the fit of the model then it would be reasonable to conclude that the coefficients differed. This all assumes you are comfortable with using SEM packages though.
Chris’s Calculator [download a copy] has a utility for doing this calculation.
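The calculation itself is trivial to script. A minimal Python sketch of the Paternoster et al. (1998) Z test (illustrative names; the b’s and SE’s come from the two separate regressions):

```python
import math
from scipy.stats import norm

def compare_b(b1, se1, b2, se2):
    """Z test for the equality of two regression coefficients."""
    z = (b1 - b2) / math.sqrt(se1**2 + se2**2)
    return z, 2 * norm.sf(abs(z))  # two-tailed p

print(compare_b(0.75, 0.20, 0.25, 0.15))  # z = 2.0, p ~ .046
```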
Assumptions:
All the usual ones for conducting parametric tests and an assumption that the sample is fairly large. Cohen (1983) offers alternatives when samples are small. The Z test assumes that the regression models for both groups have the same predictor variables in them.
Useful web links and references:
Clogg, C.C., Petkova, E. and Haritou, A. (1995). Statistical methods for comparing regression coefficients between models. American Journal of Sociology 100:1261-1293.
Cohen, A. (1983). Comparing regression coefficients across subsamples: A study of the statistical test. Sociological Methods and Research 12:77-94.
Paternoster, R., Brame, R., Mazerolle, P., & Piquero, A. (1998). Using the correct statistical test for the equality of regression coefficients. Criminology, 36(4), 859-866.
Miscellany
- Comparing two samples – how do I show that they are equivalent? A salutary tale…
This is a common thing to want to do and tends to occur most often in the following two situations:
Case 1:
You have collected longitudinal data (e.g. before and after an intervention) and your response rate at the follow-up is a bit disappointing, so you want to try and show that the ‘lost’ sample members are not systematically different from those who stayed with the study. Essentially you are trying to avoid criticisms of non-random sampling bias.
Case 2:
You have run a between-subjects experiment (or quasi-experiment) and, for whatever reason, you were not able to match the samples (groups) in advance and now you want to show that the samples were indeed pretty well matched on key variables anyway.
By far the most commonly seen approach to these problems is to test for differences between the samples on key (usually demographic) variables. If you get non-significant results in these tests you conclude that the samples were essentially the same.
Chris says:
This is undoubtedly WRONG and is a practice that should not be encouraged at all, even if you do see it in respectable journals. The reason it is wrong is fairly simple to understand if you think about the logic of traditional null-hypothesis testing. If your test reaches the conclusion of “no significant difference”, that simply means that the evidence you have is not strong enough to allow you to claim that the two samples were different. This is not the same thing as saying that the two samples were the ‘same’. The null-hypothesis that you are testing with traditional tests is in some senses a hypothesis that you want to prove is ‘true’ in the present situation but, of course, we can only reject hypotheses, we never ‘prove’ them to be true.
The traditional p-value that comes out of something like a t-test is the probability of having observed a difference between your samples as large as you have done if the null-hypothesis were indeed true. Conventionally if that probability is very small, less than a one in 20 chance (p<0.05), we conclude that the null-hypothesis is probably wrong, reject it and accept an alternative hypothesis. Imagine a situation where the p-value resulting from your t-test was 0.0667. This would be ‘non-significant’ in the traditional sense but it means that there’s only a one in fifteen chance of seeing a difference as big as you have observed if the two samples really were identical on the variable concerned – would you really want to claim that the samples were ‘the same’ in this case? This is all made much worse if you have relatively small sample sizes and thus low power to correctly reject a false traditional null-hypothesis in the first place – even quite large differences between your samples won’t be flagged up by this process.
The problem here is that we are testing the wrong null-hypothesis and we need to be clearer about what ‘the same’ really means. How different would the two samples have to be before we concluded that the bias would undermine our study’s conclusions? Unfortunately we psychologists are not very good at being clear about what differences on our variables mean substantively and really we ought to focus on this much more clearly.
So, what should I do then? By far the simplest approach is to produce a table of the two samples’ scores on the key variables along with their respective 95% confidence intervals and leave it to the reader of your report to decide whether your claim that the samples are essentially equivalent is justified. Stem and leaf plots, assuming the journal will give you the space to print them, are another alternative. The approach here is to be open about the differences in your samples and leave it to the judgement of your peers as to whether any differences between your samples are substantively (not statistically) significant or important. If they are substantively significant you should not be trying to hide this from the scientific community.
In the experimental case 2 above an alternative strategy is to statistically control for a background variable that the groups might differ on. Say your study was an experiment about the effectiveness of two teaching methods on children’s reading scores and you were worried that differences in IQ between the two groups might account for any differences you found. Here you could control for the possible influence of IQ on reading scores by conducting an analysis of covariance (ANCOVA) which would remove the influence of IQ differentials from your test of the teaching methods – you would not then have to show that the two groups had equal IQs before you started.
More specialised tests are available based on what are called ‘tests of biological equivalence’. They were not originally designed to address the problem cases I listed here but arose in the context of wanting to show that two biomedical treatments were effectively equally good – an important thing to want to do if one ‘treatment’ is considerably cheaper than the other. The logic of these tests, though, is equally applicable in our cases - see the Wellek (2003) reference below for more on this.
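For the two-means situation, one readily available implementation of this equivalence logic is the ‘two one-sided tests’ (TOST) procedure in the Python statsmodels package. A sketch, assuming we have decided in advance that age differences within ±2 years are substantively trivial (the data here are simulated):

```python
import numpy as np
from statsmodels.stats.weightstats import ttost_ind

rng = np.random.default_rng(1)
stayers = rng.normal(40, 10, 150)   # simulated ages of those who stayed
dropouts = rng.normal(41, 10, 60)   # simulated ages of those who dropped out

# Two one-sided tests against the bounds -2 and +2
p_value, lower, upper = ttost_ind(stayers, dropouts, low=-2.0, upp=2.0)
print(p_value)  # p < .05 supports equivalence within the chosen bounds
```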
Useful web links and references:
The logic of the above is explained in more detail at: http://www.graphpad.com/library/BiostatsSpecial/article_182.htm
Wellek, S. (2003). Testing Statistical Hypotheses of Equivalence. Chapman and Hall/CRC Press, Boca Raton, FL.
- How do I test the normality of a variable’s distribution?
We are all told that before deciding whether to use parametric or non-parametric tests we should check to see whether our variables are normally distributed (i.e. follow a Gaussian distribution). I won’t go over why this is important here as it is covered in all good stats text books, but how you make this decision is unfortunately an area of a great deal of ambiguity.
You thought this was going to be quick and simple but…..
Chris says:
My attempts to find a clear, unambiguous way to determine the normality of a variable have thus far failed – indeed I do not think clear, practical and unambiguous criteria for making this decision exist at the moment. I am coming to the view that the reason for this is to do with a series of tensions between statistical purity, somewhat imprecise measurement practices and real world research pressures. There are three broad issues that plague this area.
The first is both philosophical and of the sort that keeps sociologists of science in business. Parametric procedures require that you accept the assumption that in the population the variable concerned is normally distributed. As our data are a sample from this population it is unlikely that our data will be exactly normally distributed so we move into an area where we ask whether the distribution of our sample data deviates so much from the normal distribution that we question this assumption. Should we do this on the basis of our single sample though? If the variable concerned has appeared to be normally distributed in other people’s samples then why should we reject this assumption just because it doesn’t look normally distributed in our data?
This has led to some ‘interesting’ thinking where common practice is to assume that if 10 published papers in the area have all reported parametric tests on the key variable then we should do the same in order not to break with tradition. Who are we to challenge the current orthodoxy? The problem, of course, is that many (most?) psychology papers do not report the results of normality tests, one just sees the ANOVA results, say, and one has to take it on faith that the data met the assumptions for this procedure.
Fortunately or unfortunately, depending on your viewpoint, many readily available parametric procedures offer the potential to answer more interesting questions than can be addressed with the limited range of non-parametric alternatives, so there is a subtle pressure to assume your variables are normally distributed so that you can engage with the interesting research questions and take part in the published literature. Alternative procedures tend not to be implemented in the readily available stats packages (or indeed very well understood) so many people are prepared to ignore problems with data assumptions just to get their work done. In the absence of widespread open reporting of normality testing (or even skew and kurtosis figures) we may well be unwittingly building a body of research on very shaky ground. Micceri (1989) suggests that truly normally distributed psychological data are actually rather rare.
The second issue concerns the nature and quality of many psychological measures. I won’t restate the orthodoxy on levels of measurement but many of the more interesting variables psychologists work with are fundamentally latent constructs with unknown units of measurement (e.g. depression, satisfaction, attitude, extroversion etc.). As a result we tend to measure these indirectly often using psychometric scales and the like. These have scales of measurement where the interval between scale points cannot be claimed to be equal and thus these measures are strictly speaking ordinal. The mean and variance of an ordinal measure do not have the same meaning as they do for interval or ratio-scale measures and so it is NOT strictly possible to determine whether such measures are normally distributed.
HOWEVER, many decades worth of psychological research has ignored this for at least two reasons: 1) as before, not being able to do parametric tests limits the kinds of questions that can easily be answered and many researchers would argue that by using them we have revealed some useful truths that have stood the test of time – violating these assumptions has been worth it in some sense. 2) There is an argument that many ordinal measures actually contain more information than merely order and that some of the better measures lie in a region somewhere between ordinal and interval level measurement (see Minium et al, 1993).
Take a simple example of a seven-point response scale for an attitude item. At one level this allows you to rank order people relative to their agreement with the attitude statement. It is also likely that a two-point difference in scores for two individuals reflects more of a difference than if they had only differed by one point. The possibility that you might be able to rank order the magnitude of differences, while not implying interval level measurement, suggests that the measure contains more than merely information on how to rank order respondents. The argument then runs that it would be rash to throw away this additional useful information and unnecessarily limit the possibility of revealing greater theoretical insights via the use of more elaborate statistical procedures.
Needless to say there are dissenters (e.g. Stine, 1989) who argue for a more traditional and strict view: using sophisticated techniques designed for one level of measurement on data of a less sophisticated level simply results in nonsense. Computer outputs will provide you with sensible-looking figures but these will still be nonsense and should not be used to draw inferences about anything. This line of argument rejects the claim that using parametric tests with ordinal data will lead to the same conclusion most of the time, on the grounds that you will not know when you have stumbled across an exception to this ‘rule’.
The third issue concerns the lack of consensus on how best to decide whether a variable is normally distributed or not. There are a number of options available including:
1) The Mk1 Eyeball Test – look at a histogram of your variable with a normal curve superimposed over the top. SPSS and most stats packages will do this for you and some will produce P-P plots as well (where all the data points should lie on the diagonal of the diagram if the variable is normally distributed). The advantage is that this is easy to do, but the obvious disadvantage is that there are no agreed criteria for determining how far your data can deviate from normality before it is unsafe to proceed with parametric tests.
2a) Skew and Kurtosis. SPSS will readily give you these figures for any variable along with their standard errors. Both skew and kurtosis should be zero for a perfectly normally distributed variable (strictly speaking the ideal kurtosis value is 3, but most packages including SPSS subtract 3 from the value so that the ideal values of both skew and kurtosis are zero). You will see some texts suggesting that you divide the skew and kurtosis values by their standard errors to get z-scores, and they suggest certain rules of thumb for what counts as a significant deviation from normality. Unfortunately the standard errors shrink as the sample size grows, so in big samples most variables will fail these tests even though they may not deviate from normality by enough to make any real difference. Conversely, in small samples we are likely to conclude that variables are normally distributed even when there are quite substantial deviations from normality.
2b) A related rule of thumb approach is to set boundaries for the skew and kurtosis values. Some researchers are happy to accept variables with skew and kurtosis values in the range +2 to -2 as near enough normally distributed for most purposes (note the vague ‘most purposes’ which is common in this field and rarely clearly defined). Both skew and kurtosis have to be in this range – if either one is outside it then the variable fails the normality test. Others are slightly more conservative and use a +1 to -1 range.
3) Formal tests of normality. A range of these exist, with the best known being the Kolmogorov-Smirnov test. Others include the K-S Lilliefors test, the Shapiro-Wilk test, the Jarque-Bera LM test and the D’Agostino-Pearson K2 omnibus test. While each has its following, they suffer (to varying degrees) from problems like those for the z-scores noted above – as the sample size increases so does the likelihood of rejecting a variable that deviates only slightly from normality. Of these tests the D’Agostino-Pearson omnibus test is claimed to be the most powerful but unfortunately it is not yet readily available in SPSS (but see links below).
These tend to be quite conservative and can reject as non-normal variables that have ‘passed’ the eye-ball and rule-of-thumb tests above. These formal tests indicate significant deviation from normality but do not indicate whether your variable is so non-normal that it will violate the assumptions of your parametric test to the point where it will not give you the right answer (see the section Variable Robustness of Tests below). Most of these tests can also be run easily outside SPSS – see the sketch after this list.
4) A great range of alternative indices exist to assess degree of asymmetry, ‘weight’ in the tails of distributions and degree of multi-modality (Micceri, 1989 discusses many of these). Few are readily available in standard packages and a good degree of vagueness exists about what the appropriate cut-off values are for these, so I won’t expand on them here.
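If you want to run the formal tests outside SPSS, the Python scipy library implements most of those listed in (3). A sketch (note that feeding estimated parameters into the plain K-S test is what the Lilliefors correction exists for, so treat that one as approximate):

```python
import numpy as np
from scipy import stats

x = np.random.default_rng(7).normal(size=120)  # toy data

print(stats.kstest(x, "norm", args=(x.mean(), x.std(ddof=1))))  # Kolmogorov-Smirnov
print(stats.shapiro(x))       # Shapiro-Wilk
print(stats.jarque_bera(x))   # Jarque-Bera LM test
print(stats.normaltest(x))    # D'Agostino-Pearson K2 omnibus test
```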
Variable Robustness of Tests
A related issue rarely given much of an airing in introductory stats texts concerns the variable ‘robustness’ of the various parametric tests out there. You might be forgiven for thinking that your variables are either normal enough to do parametric tests or they are not, but really we ought to be asking ‘normal enough for what purpose?’. Some tests like the humble t-test or ANOVA are said to be ‘robust’ to violations of the normality assumption, meaning that your sample data might deviate quite a bit from normality but the test will still lead you to the right conclusion about your null hypothesis. Other parametric procedures, for example t-tests with unequal group sizes, are less robust and comparatively small deviations from normality can have important consequences for drawing appropriate conclusions from the data. Excessive kurtosis tends to affect procedures based on variances and covariances while excessive skew biases tests on means. How your data deviate from normality matters (see DeCarlo, 1997).
You won’t be surprised that, like the normality tests themselves, no clear consensus has emerged on how much deviation from normality is a problem for which tests. This is not to say that the impact of varying degrees of non-normality cannot be assessed. In fact it is relatively easy to do this using Monte Carlo simulation studies and many articles do exist summarising these effects (see West, Finch and Curran, 1995; Micceri, 1989); however, studies on real data are still relatively rare and there is a debate about whether simulated data really mimic real data.
OK, but what do I do then?
Ultimately the best way forward is to strive for clarity and openness so:
1) Always report skew and kurtosis values (with their SEs) when describing your data – if the journal asks you to take them out to save space later then that is fine – it is their decision and the reviewers will have had a chance to see the figures when making up their minds about your analyses.
2) Always look at the histogram with normal curve superimposed over it. You can get good skew and kurtosis figures yet still have horrible looking distributions especially in small samples (n<50) – strive to understand why your data are the way they are. Remember that true normality is relatively rare in psychology (Micceri, 1989).
3) Pick a normality test or criterion.
Chris says:
Which one you use rather depends upon your sample size(s). If your sample is small (n < 100 in a correlational study or n < 50 in each group if comparing means) then calculate z-scores for skew and kurtosis and reject as non-normal those variables with either z-score greater than an absolute value of 1.96. If your sample is of medium size (100 < n < 300 for correlational studies or 50 < n < 150 in each group if comparing means) then calculate z-scores for skew and kurtosis and reject as non-normal those variables with either z-score greater than an absolute value of 3.29. With sample sizes greater than 300 (or 150 per group) be guided by the histogram and the absolute sizes of the skew and kurtosis values. Absolute values above 2 are likely to indicate substantial non-normality. If you are keen, run a Shapiro-Wilk test in SPSS (which should be non-significant if your variables are normally distributed) but do not rely on this if your sample is large (n > 300).
Getting z-scores is simple:
$$z_{\text{skew}} = \frac{\text{skew}}{SE_{\text{skew}}}, \qquad z_{\text{kurtosis}} = \frac{\text{kurtosis}}{SE_{\text{kurtosis}}}$$
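Or in Python, a sketch of the whole rule of thumb. The standard-error formulae below are the standard large-sample ones that SPSS uses for skew and (excess) kurtosis:

```python
import math
import numpy as np
from scipy.stats import kurtosis, skew

def normality_z(x):
    """z-scores for skew and excess kurtosis, SPSS-style standard errors."""
    x = np.asarray(x)
    n = len(x)
    se_skew = math.sqrt(6 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))
    se_kurt = 2 * se_skew * math.sqrt((n**2 - 1) / ((n - 3) * (n + 5)))
    g1 = skew(x, bias=False)                    # adjusted skew, as in SPSS
    g2 = kurtosis(x, fisher=True, bias=False)   # excess kurtosis, as in SPSS
    return g1 / se_skew, g2 / se_kurt

z_s, z_k = normality_z(np.random.default_rng(3).normal(size=80))
# n < 100, so apply the 1.96 cut-off from the rule above
print(abs(z_s) > 1.96 or abs(z_k) > 1.96)  # True means reject normality
```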
4) Be clear in your text which normality test or criteria you have used. This can be in a footnote if you like. As with (1) above if the journal or your supervisor asks you to take this out later then that’s OK.
5) If using ordinal scaled measures, know that you are violating the assumptions of parametric procedures and know that you are doing this in order to engage with the literature on its own, possibly misguided, terms. Where possible conduct both the parametric test and the more appropriate non-parametric equivalent and see if you arrive at the same conclusion about the null hypothesis. If the conclusions are different, err on the side of caution and use the more appropriate non-parametric technique.
Outliers
Although it may seem obvious, whatever way you choose to decide on the normality of your variables you need to do this after having screened for outliers and removed them (if that is appropriate), not before! If you have outliers that should not really be in your data (typos, invalid responses) these can inflate both skew and kurtosis values.
Useful web links and references:
Graphpad.com have a page discussing normality testing which can be found at:
http://www.graphpad.com/library/BiostatsSpecial/article_197.htm
David Moriarty’s web pages contain links to his StatCat.xls Excel suite of utilities which includes a calculator to run D’Agostino-Pearson tests. Go to the link and download StatCat – it will run tests on variables with up to 200 cases:
http://www.csupomona.edu/~djmoriarty/www/b211home.htm
Remember to acknowledge David if you use this facility.
Slightly more involved is Lawrence DeCarlo’s SPSS macro, which can be found at http://www.columbia.edu/~ld208/. This will calculate the D’Agostino-Pearson test, but it also does a lot of other screening, some of which may cause errors depending on your data, so don’t be fazed by the error messages.
DeCarlo, L.T. (1997). On the meaning and use of kurtosis. Psychological Methods, 2, 292-307.
Micceri, T. (1989). The unicorn, the normal curve, and other improbable creatures. Psychological Bulletin, 105, 156-166.
Minium, E.W., King, B.M. and Bear, G. (1993) Statistical Reasoning in Psychology and Education, New York: John Wiley and Sons.
Stine, W.W. (1989). Meaningful inference: The role of measurement in statistics. Psychological Bulletin, 105, 147-155.
West, S.G., Finch, J.F. and Curran, P.J. (1995) Structural equation models with non-normal variables: Problems and Remedies. In R.H. Hoyle (ed.) Structural Equation Modeling: Concepts, Issues and Applications. Sage Publications; Thousand Oaks CA.
Useful Links
General Statistics links page
Chris’s Calculator utility
Thanks
I am grateful to all those academics around the globe who have made their own thoughts on these issues freely available. This site wouldn’t work without you.
Legal backside covering bit:
The pages and links are provided on an "as is" basis, and the user must accept responsibility for the results produced. The author, the Department, and the University accept no liability or responsibility for any damage, injury, loss of money, time or reputation, or anything else anyone can think of, caused by or related to the use of these programs, downloading these pages, or in the course of being connected to our site.
