Correlation Testing

Correlation, also referred to as correlation analysis, is a term used to denote the association or relationship between two (or more) quantitative variables. The analysis rests on the assumption of a straight-line [linear] relationship between the variables. Like the measures of association for binary variables, it quantifies the “strength” or “extent” of the association as well as its direction. The end result of a correlation analysis is a correlation coefficient, whose value ranges from -1 to +1. A coefficient of +1 indicates that the two variables are perfectly related in a positive [linear] manner, a coefficient of -1 indicates that they are perfectly related in a negative [linear] manner, while a coefficient of zero indicates no linear relationship between the two variables being studied.
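As a quick illustration (a minimal sketch in Python; the helper `pearson_r` is ours, written from the textbook definition, and not part of the source article), the coefficient can be computed directly as the covariance divided by the product of the standard deviations:

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    # Covariance numerator and the two standard-deviation terms
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

# A perfect positive linear relationship gives +1 (up to float rounding),
# a perfect negative one gives -1:
print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))
print(pearson_r([1, 2, 3, 4], [8, 6, 4, 2]))
```

In practice one would reach for a library routine (e.g. `scipy.stats.pearsonr`), but the hand calculation makes the -1 to +1 range visible.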

Point: Correlation coefficients do not tell us whether one variable changes in response to the other; no attempt is made to designate one variable as “dependent” and the other as “independent”. We shall discuss the concept of independent and dependent variables in the next article, on regression analysis. Relationships identified using correlation coefficients should be interpreted for what they are: associations, not causal relationships.

Factors that Affect a Correlation Analysis:
i. Correlation analysis should not be used when the data consist of repeated measures of the same variable from the same individuals, whether taken at the same or at different time points.
ii. It is useful to draw a scatter plot as a prerequisite to any correlation analysis, since it helps to visually screen the data for outliers, non-linear relationships, and heteroscedasticity.
iii. An outlier is essentially a value that lies far outside the overall pattern of the data set. It is important to remember that even a single outlier can dramatically alter the correlation coefficient.
iv. If there is a non-linear relationship between the quantitative variables, correlation analysis should not be performed.
v. If the dataset contains two distinct subgroups of individuals whose values for one or both variables differ considerably from each other, a spurious correlation may be found even when none exists within either subgroup.
vi. The sample size should be appropriately calculated a priori. Small sample sizes may produce a false-positive relationship.
vii. If one data set forms part of the second data set, for example, height at age 12 (X-axis) and height at age 30 (Y-axis), we would expect to find a positive correlation between them because the second quantity “contains” the first quantity.
viii. Heteroscedasticity is a situation in which one variable has unequal variability across the range of values of the second variable.
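Point (iii) above is easy to demonstrate numerically. The sketch below uses made-up numbers (not data from the source) and the standard Pearson formula: appending one extreme point turns a weak correlation into an apparently strong one.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation: covariance over product of standard deviations."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / (sqrt(sum((a - mx) ** 2 for a in x))
                  * sqrt(sum((b - my) ** 2 for b in y)))

x = [1, 2, 3, 4, 5]
y = [3, 1, 4, 1, 5]                        # only weakly related values
r_clean = pearson_r(x, y)                  # roughly 0.35

# Add a single extreme point (20, 20) and recompute:
r_outlier = pearson_r(x + [20], y + [20])  # jumps to roughly 0.97
print(r_clean, r_outlier)
```

This is exactly why the scatter plot of point (ii) matters: the outlier is obvious on a plot but invisible in the coefficient alone.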

Conclusion: In summary, correlation coefficients are used to assess the strength and direction of linear relationships between pairs of continuous variables. When both variables are normally distributed, we use Pearson’s correlation coefficient “r”; otherwise, we use Spearman’s correlation coefficient rho (ρ), which is non-parametric and more robust to outliers than Pearson’s “r”. Correlation analysis is seldom used alone and is usually accompanied by regression analysis. The difference between correlation and regression lies in the fact that while a correlation analysis stops with the calculation of the correlation coefficient and perhaps a test of significance, a regression analysis goes on to express the relationship in the form of an equation and moves into the realm of prediction.
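The Pearson/Spearman contrast can be sketched as follows (illustrative numbers only; Spearman’s rho is computed here with the no-ties rank formula rather than a library call):

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / (sqrt(sum((a - mx) ** 2 for a in x))
                  * sqrt(sum((b - my) ** 2 for b in y)))

def spearman_rho(x, y):
    """Spearman's rho for data without ties: 1 - 6*sum(d^2) / (n*(n^2 - 1))."""
    n = len(x)
    rank = lambda v: {val: i for i, val in enumerate(sorted(v), start=1)}
    rx, ry = rank(x), rank(y)
    d2 = sum((rx[a] - ry[b]) ** 2 for a, b in zip(x, y))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# A monotonic but non-linear relationship (y = x**3):
x = [1, 2, 3, 4, 5, 6]
y = [v ** 3 for v in x]
print(pearson_r(x, y))     # below 1: the relationship is not a straight line
print(spearman_rho(x, y))  # exactly 1.0: the ranks agree perfectly
```

Because Spearman’s rho works on ranks, it is insensitive to how far the extreme values sit from the rest, which is the source of its robustness to outliers.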

Reference: Gogtay, Nithya J., and Urmila M. Thatte. “Principles of correlation analysis.” Journal of the Association of Physicians of India 65.3 (2017): 78-81.
