I recently came across a rather interesting publication (PLoS Comput Biol. 2015 Mar 16;11(3)). The authors highlight the perils of performing correlations between the type of composite data sets that I often generate for my clients. We often report changes in gene expression in a system as ‘relative’ expression levels; this is conventional practice, where changes in gene expression are reported as ‘relative changes’ with the expression levels normalized to a common housekeeping gene.
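As a minimal sketch of what such a housekeeping-gene normalization looks like in R (the gene values below are made up purely for illustration):

```r
# Hypothetical raw expression values for a target gene and a
# housekeeping (reference) gene measured in the same four samples.
target    <- c(120, 95, 60, 40)
reference <- c(1000, 980, 1020, 990)

# Relative expression: each target measurement divided by the
# reference measurement from the same sample.
relative <- target / reference
relative
```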

The authors in the publication use yeast gene expression data to illustrate how correlations between relative gene expression data can often be quite different from correlations obtained from absolute gene expression levels.

The table below shows the time-series expression data for two genes that were selected to highlight the difference in correlation between the relative and absolute expression data.

##              Rel.SPBC21C3.01c Rel.SPAC823.06 Abs.SPBC21C3.01c Abs.SPAC823.06
## timepoint0       4.270746e-05   3.439527e-05        1.4900000     1.20000000
## timepoint0.5     4.307769e-05   3.077703e-05        1.1513909     0.82261591
## timepoint1       4.187077e-05   3.154671e-05        1.0357257     0.78034723
## timepoint2       4.603428e-05   2.636756e-05        0.8163213     0.46757333
## timepoint3       5.777702e-05   2.504087e-05        0.7826243     0.33919362
## timepoint4       5.450042e-05   2.271070e-05        0.5108444     0.21287238
## timepoint6       5.466728e-05   2.081995e-05        0.4642510     0.17680928
## timepoint9       6.658633e-05   1.723500e-05        0.4654234     0.12046874
## timepoint12      6.190308e-05   2.018923e-05        0.3817052     0.12449033
## timepoint16      5.642907e-05   1.857965e-05        0.3892306     0.12815676
## timepoint20      5.725145e-05   1.982208e-05        0.3905759     0.13522849
## timepoint24      5.909757e-05   2.012737e-05        0.3839748     0.13077360
## timepoint48      6.913509e-05   1.208574e-05        0.4044706     0.07070690
## timepoint72      6.918934e-05   1.426925e-05        0.3638314     0.07503468
## timepoint96      7.055342e-05   1.119380e-05        0.3334455     0.05290346
## timepoint168     8.032341e-05   4.492412e-06        0.3091656     0.01729134

If we plot the pairwise values of these two time series, we can clearly see that the correlation between the two relative time series is at the opposite extreme to that between the two corresponding absolute time series.

pairs(plot_data, main = "simple correlation plot")

As we can see, the plot of Rel.SPBC21C3.01c against Rel.SPAC823.06 shows a clearly negative correlation between the relative expression data for the two genes SPBC21C3.01c and SPAC823.06. If we compare this to the plot of Abs.SPBC21C3.01c against Abs.SPAC823.06, we can see a positive correlation.

Let’s look at the associated correlation coefficients -

cor(plot_data)
##                  Rel.SPBC21C3.01c Rel.SPAC823.06 Abs.SPBC21C3.01c Abs.SPAC823.06
## Rel.SPBC21C3.01c        1.0000000     -0.9645486       -0.8010835     -0.8192958
## Rel.SPAC823.06         -0.9645486      1.0000000        0.8633485      0.8671468
## Abs.SPBC21C3.01c       -0.8010835      0.8633485        1.0000000      0.9899198
## Abs.SPAC823.06         -0.8192958      0.8671468        0.9899198      1.0000000

So the correlation coefficient between the relative gene expression values is -0.96, while that between the absolute gene expression values is 0.99.

The authors go on to propose a solution to the problem of spurious correlation: methods based on proportionality, implemented in the R package propr.
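While the full details are in the paper and the propr package documentation, the core idea can be sketched in a few lines of base R. The statistic φ defined by the authors measures how far two log-transformed series are from being directly proportional; the function below is my own simplified illustration of that statistic, not the propr API.

```r
# Sketch of the proportionality statistic phi from the paper:
#   phi(x, y) = var(log(x / y)) / var(log(x))
# Values near 0 indicate that x and y are close to directly
# proportional; this is an illustrative re-implementation only.
phi <- function(x, y) {
  var(log(x / y)) / var(log(x))
}
```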

In fact, the problem of spurious correlation is nothing new, and I found myself inundated with more information than I could handle upon simply googling the term. For those interested in taking a wander into the wonderful world of spurious correlations, I would highly recommend Tyler Vigen’s site, which gives meaning to the old adage that “correlation does not always equal causation.”

Conducting correlation analysis on our data sets certainly ranks high amongst the repertoire of analytical tools that we scientists possess. This type of analysis has its place within biological drug discovery, where it’s often used to show equivalence between different assays or perhaps to highlight a causal relationship between two factors.

What is perhaps not so familiar to the casual reader is that we biologists also often perform correlation analysis on ‘standardized’ data. This type of transformation is used to standardize our data against a variable that is of relatively little interest to the researcher. For instance, if we wish to compare the efficacy of drug A to drug B in an animal model, statisticians often standardize the effect of treatment to the body weight of the animals so that the effects of drug A and drug B are directly comparable.

The problem with this type of analysis is that if two variables that are not at all correlated with each other are each divided by a third, uncorrelated variable, the resulting ratios will show a correlation. This type of correlation was defined as ‘spurious correlation’ in 1897 by Pearson, who was the first to report the problem of using ratios in correlation analysis. Let’s try and illustrate this with some hypothetical data examples -

Let’s generate three variables x, y and z, each with 500 randomly generated values: x and y with a mean of 10 and an sd of 1, and z with a mean of 30 and an sd of 3.

set.seed(123)
data <- data.frame(x = rnorm(500, 10, 1), y = rnorm(500, 10, 1), z = rnorm(500, 30, 3))

Now let’s create a quick plot of the variables x and y to see how these points are correlated with each other -

library(ggplot2)

g <- ggplot(data, aes(x, y))
g <- g + geom_point(alpha = 0.3) + theme_bw() + labs(title = "Plot of y versus x for 500 observations with N(10, 1)")
g

The actual correlation between these variables is -

cor(data$x, data$y)
## [1] -0.05193691

As we can see, there is hardly any correlation between the two variables. Now let’s repeat the plot with a few ratios. Firstly, let’s plot x against x/y -

h <- ggplot(data, aes(x, x/y))
h <- h + geom_point(aes(color = x), alpha = 0.5) + theme_bw() +
  geom_smooth(method = "lm") +
  scale_color_gradientn(colours = c("red", "white", "blue")) +
  labs(title = "Plot of x/y versus x for 500 observations with x,y N(10, 1)")
h

We can see that forming the ratio has immediately inflated the correlation between the variables to -

cor(data$x, data$x/data$y)
## [1] 0.7029796
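A first-order (delta-method) approximation helps explain the size of this effect: for independent x and y, the correlation between x and x/y is approximately v_x / sqrt(v_x² + v_y²), where v denotes the coefficient of variation (sd/mean). This is my own quick check against the distributions used above, not part of the original analysis:

```r
# First-order approximation for cor(x, x/y) with independent x, y:
#   r ~= v_x / sqrt(v_x^2 + v_y^2), where v = sd / mean.
# For x, y ~ N(10, 1), both coefficients of variation are 0.1.
v_x <- 0.1
v_y <- 0.1
v_x / sqrt(v_x^2 + v_y^2)   # ~0.71, close to the observed 0.703
```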

Likewise, let’s plot the correlation of the ratios of x and y variables to a third uncorrelated variable z -

i <- ggplot(data, aes(x/z, y/z))
i <- i + geom_point(aes(color = z), alpha = 0.5) + theme_bw() +
  geom_smooth(method = "lm") +
  scale_color_gradientn(colours = c("red", "white", "blue")) +
  labs(title = "Plot of y/z versus x/z for 500 observations with x,y N(10, 1); z N(30, 3)")
i

The correlation between the ratios x/z and y/z is now -

cor(data$x / data$z, data$y / data$z)
## [1] 0.460591

In fact, the strength of the spurious correlation is highly dependent on the sd of the common z variable in the denominator. If we regenerate z with an sd of 6 instead of 3, we can see from the plot below that the correlation between the two ratios is even stronger -

data$z <- rnorm(500, 30, 6)
cor(data$x / data$z, data$y / data$z)
## [1] 0.8362875
j <- ggplot(data, aes(x/z, y/z))
j <- j + geom_point(aes(color = z), alpha = 0.5) + theme_bw() +
  geom_smooth(method = "lm") +
  scale_color_gradientn(colours = c("red", "white", "blue")) +
  labs(title = "Plot of y/z versus x/z for 500 observations with x,y N(10, 1); z N(30, 6)")
j

As we can see, the effect of the uncorrelated third variable on the ratios can be profound and lead to misleading interpretations.
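This dependence on the sd of z is exactly what Pearson’s 1897 approximation predicts. For mutually uncorrelated x, y and z, the spurious correlation between x/z and y/z is approximately v_z² / sqrt((v_x² + v_z²)(v_y² + v_z²)), where v again denotes the coefficient of variation. The check below against the simulations above is my own addition:

```r
# Pearson's (1897) approximation for the spurious correlation between
# the ratios x/z and y/z, with x, y and z mutually uncorrelated:
#   r ~= v_z^2 / sqrt((v_x^2 + v_z^2) * (v_y^2 + v_z^2))
spurious_r <- function(v_x, v_y, v_z) {
  v_z^2 / sqrt((v_x^2 + v_z^2) * (v_y^2 + v_z^2))
}

spurious_r(0.1, 0.1, 3/30)   # ~0.5, close to the observed 0.46
spurious_r(0.1, 0.1, 6/30)   # ~0.8, close to the observed 0.84
```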