Mean, Median, Standard Deviation and Correlation

In this short note, we introduce several useful functions in R: mean(), median(), max(), min(), sort(), var(), sd() and cor().

Mean, Median, Maximum, Minimum and Sorting

Let’s first create a numeric vector:

x <- c(1.4, 5.66, 7.13, 9.21)

The max() and min() functions return the maximum and minimum of a vector:

max(x)

[1] 9.21

min(x)

[1] 1.4

The mean() and median() functions return the mean and median:

mean(x)

[1] 5.85

median(x)

[1] 6.395

The sort() function sorts the vector in increasing order:

sort(x)

[1] 1.40 5.66 7.13 9.21

There are other optional parameters we can set in these functions. Use, e.g., ?sort in the R console to pull up a help page. For example, we can sort the vector in decreasing order using

sort(x, decreasing=TRUE)

[1] 9.21 7.13 5.66 1.40
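As a further illustration of optional parameters, the sketch below uses the na.rm argument, which mean(), median(), max() and min() all accept. The vector x_na is a made-up example, not data from this note.

```r
# Hypothetical vector with one missing value (illustration only)
x_na <- c(1.4, 5.66, NA, 9.21)

mean(x_na)                 # NA: the missing value propagates
mean(x_na, na.rm = TRUE)   # mean of the three non-missing values
max(x_na, na.rm = TRUE)    # maximum of the non-missing values
sort(x_na)                 # sort() drops NA by default (na.last = NA)
```

The help pages (?mean, ?sort) list the remaining arguments.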

Variance and Standard Deviation

The var() and sd() functions calculate the variance and standard deviation of a vector. Note that R defines them as \[ {\rm var}(X)= \frac{1}{N-1}\sum_{i=1}^N (X_i -\mu_X)^2 \ \ \ , \ \ \ {\rm sd}(X)=\sqrt{{\rm var}(X)} ,\] where \(N\) is the number of elements in \(X\) and \(\mu_X\) is the mean of \(X\). This is the estimate of the population variance based on a random sample. Recall from Stat 100 that the population variance \(\sigma^2\) is defined as \[ \sigma^2 = \frac{1}{N}\sum_{i=1}^N (X_i -\mu_X)^2\] with \(1/N\) in place of \(1/(N-1)\), so you have to be careful when using the var() and sd() functions in R. For example, to calculate the standard deviation \(\sigma\), the square root of \(\sigma^2\) above, for the vector x, we need to use the expression

stdx <- sd(x) * sqrt((length(x)-1)/length(x))

Here length(x) returns the number of elements in the vector x:

length(x)

[1] 4

The factor sqrt((length(x)-1)/length(x)), or \(\sqrt{(N-1)/N}\), accounts for the difference between the \(1/(N-1)\) used by the sd() function and the \(1/N\) in the expression for \(\sigma\). Another way of calculating \(\sigma\) is to use the fact that \(\sigma\) is the square root of the mean of \((X-\mu_X)^2\):

stdx2 <- sqrt( mean( (x-mean(x))^2 ) )

We can confirm that these two expressions give the same result:

c(stdx, stdx2, stdx-stdx2)

[1] 2.862106 2.862106 0.000000
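The two conventions can also be checked directly. The following is a minimal sketch: var_manual and the helper pop_sd() are names introduced here for illustration and are not built into R.

```r
x <- c(1.4, 5.66, 7.13, 9.21)
N <- length(x)

# Sample variance written out by hand (divides by N - 1), as var() does
var_manual <- sum((x - mean(x))^2) / (N - 1)
all.equal(var_manual, var(x))

# Hypothetical helper: population standard deviation (divides by N)
pop_sd <- function(v) sqrt(mean((v - mean(v))^2))
all.equal(pop_sd(x), sd(x) * sqrt((N - 1) / N))
```

Both comparisons should return TRUE. Wrapping the \(1/N\) formula in a small function like pop_sd() avoids retyping the correction factor for every vector.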

Correlation

Now that we have the mean and standard deviation, we can calculate the Z-scores, defined as \[ Z_i = \frac{X_i-\mu_X}{\sigma}\] This can be done easily with R’s vectorized operations:

Zx <- (x-mean(x))/stdx
Zx

[1] -1.55479923 -0.06638469  0.44722315  1.17396077
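As a quick sanity check, Z-scores computed with the \(1/N\) standard deviation should have mean 0 and population standard deviation 1. A minimal sketch, recomputing stdx and Zx so that it runs on its own:

```r
x <- c(1.4, 5.66, 7.13, 9.21)
stdx <- sqrt(mean((x - mean(x))^2))   # population standard deviation (1/N)
Zx <- (x - mean(x)) / stdx

mean(Zx)           # zero, up to floating-point rounding
sqrt(mean(Zx^2))   # one, up to floating-point rounding
```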

Consider another numeric vector y of length 4:

y <- c(1.53, -3.45, 6.7, 4.63)

We can easily convert it to Z-scores as well:

stdy <- sd(y) * sqrt((length(y)-1)/length(y))
Zy <- (y-mean(y))/stdy
Zy

[1] -0.2151968 -1.5181512  1.1374687  0.5958793

Recall that the correlation coefficient \(r\) of two variables \(X\) and \(Y\) is defined as the mean of the product of their Z-scores: \[ r = \frac{1}{N}\sum_{i=1}^N Z_{xi} Z_{yi} \] The correlation between the vectors x and y can be calculated using

mean(Zx*Zy)

[1] 0.4109028

Since the correlation coefficient is an important concept in statistics, R already has a built-in function, cor(), to compute it directly:

cor(x,y)

[1] 0.4109028
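The two results agree even though sd() uses \(1/(N-1)\) while the Z-scores above use \(1/N\). One way to see why: the correlation is a ratio of a covariance to a product of standard deviations, so the same factor appears in the numerator and denominator and cancels. A minimal sketch using the built-in cov() function (the variable name r_sample is introduced here for illustration):

```r
x <- c(1.4, 5.66, 7.13, 9.21)
y <- c(1.53, -3.45, 6.7, 4.63)

# cov() and sd() both use the N - 1 convention, so the factors cancel in the ratio
r_sample <- cov(x, y) / (sd(x) * sd(y))
all.equal(r_sample, cor(x, y))
```

The comparison should return TRUE, so no correction factor is needed when using cor().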

To demonstrate another use of the cor() function, we create a third vector:

z <- c(2.39,3.19,8.31,-4.67)

With 3 sets of data x, y, z, we can calculate the correlation coefficients between all pairs of the 3 variables:

cor(x,y)

[1] 0.4109028

cor(x,z)

[1] -0.3078786

cor(y,z)

[1] 0.07096528

There is a simpler way of calculating all 3 correlations by combining the cbind() function and the cor() function.

The cbind() function is mentioned on P.18 of the textbook. It combines vectors into a matrix whose columns are the input vectors. For example,

cbind(x,y,z)

        x     y     z
[1,] 1.40  1.53  2.39
[2,] 5.66 -3.45  3.19
[3,] 7.13  6.70  8.31
[4,] 9.21  4.63 -4.67

creates a 4×3 matrix. The first column is x, the second column is y and the third column is z.
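The structure of the combined matrix can be inspected with a few base R functions. This is a small sketch; the name m is chosen here for illustration.

```r
x <- c(1.4, 5.66, 7.13, 9.21)
y <- c(1.53, -3.45, 6.7, 4.63)
z <- c(2.39, 3.19, 8.31, -4.67)

m <- cbind(x, y, z)
dim(m)        # number of rows and columns: 4 and 3
colnames(m)   # column names are taken from the vector names
m[, "y"]      # extract the second column, which is the vector y
```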

The cor() function accepts a matrix argument. If m is a \(p\times q\) matrix, cor(m) returns a \(q \times q\) matrix whose \((i,j)\) element is the correlation coefficient between the \(i\)th and \(j\)th columns of m. For example, in

cor(cbind(x,y,z))

           x          y           z
x  1.0000000 0.41090275 -0.30787858
y  0.4109028 1.00000000  0.07096528
z -0.3078786 0.07096528  1.00000000

the (1,1) element is cor(x,x), which is 1; the (1,2) element is cor(x,y); the (1,3) element is cor(x,z); the (2,1) element is cor(y,x), which is equal to cor(x,y); the (2,2) element is cor(y,y), which is 1; the (2,3) element is cor(y,z), and so on. The returned \(3\times 3\) matrix is called the correlation matrix for the variables \(x\), \(y\), and \(z\).
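Individual entries of the correlation matrix can be pulled out by name or by position, and its symmetry and unit diagonal can be checked directly. A minimal sketch, with cm as an illustrative variable name:

```r
x <- c(1.4, 5.66, 7.13, 9.21)
y <- c(1.53, -3.45, 6.7, 4.63)
z <- c(2.39, 3.19, 8.31, -4.67)

cm <- cor(cbind(x, y, z))
cm["x", "y"]      # same value as cor(x, y)
cm[2, 3]          # same value as cor(y, z)
isSymmetric(cm)   # the (i,j) and (j,i) elements agree
diag(cm)          # each variable has correlation 1 with itself
```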