### Introduction

This guide is an introduction to **Spearman's rank correlation coefficient**, its mathematical calculation, and its computation via Python's `pandas` library. We'll construct various examples to gain a basic understanding of this coefficient and demonstrate how to visualize the **correlation matrix** via **heatmaps**.

### What Is the Spearman Rank Correlation Coefficient?

**Spearman rank correlation** is closely related to the **Pearson correlation**. Both are bounded values, ranging from `-1` to `1`, that denote a **correlation** between two variables.

If you'd like to read more about the alternative correlation coefficient - read our Guide to the Pearson Correlation Coefficient in Python.

The Pearson correlation coefficient is computed using raw data values, whereas the Spearman correlation is calculated from the *ranks* of individual values. While the Pearson correlation coefficient is a measure of the linear relation between two variables, the Spearman rank correlation coefficient measures the monotonic relation between a *pair of variables*. To understand the Spearman correlation, we need a basic understanding of **monotonic functions**.
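To make "ranks of individual values" concrete, here is a minimal sketch using `Series.rank()`. The sample values are made up purely for illustration. By default, pandas gives tied values the average of the ranks they would otherwise occupy.

```python
import pandas as pd

# Hypothetical sample values, chosen only to illustrate ranking
values = pd.Series([10.5, 3.2, 7.7, 3.2, 42.0])

# Rank 1 goes to the smallest value; tied values share the average rank
ranks = values.rank()

print(pd.DataFrame({"value": values, "rank": ranks}))
```

The Spearman coefficient only ever sees the `rank` column, which is why the actual magnitudes of the values (such as the outlying `42.0`) do not affect it.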

#### Monotonic Functions

There are monotonically increasing, monotonically decreasing, and non-monotonic functions.

For a monotonically increasing function, as X increases, Y also increases (and it doesn't have to be linear). For a monotonically decreasing function, as one variable increases, the other one decreases (also doesn't have to be linear). A non-monotonic function is where the increase in the value of one variable can sometimes lead to an increase and sometimes lead to a decrease in the value of the other variable.

The Spearman rank correlation coefficient measures **the monotonic relation between two variables**. Its values range from -1 to +1 and can be interpreted as:

- **+1:** Perfectly monotonically increasing relationship
- **+0.8:** Strong monotonically increasing relationship
- **+0.2:** Weak monotonically increasing relationship
- **0:** Non-monotonic relation
- **-0.2:** Weak monotonically decreasing relationship
- **-0.8:** Strong monotonically decreasing relationship
- **-1:** Perfectly monotonically decreasing relationship
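The three kinds of functions described above can be told apart by looking at the sign of consecutive differences. The sketch below assumes the inputs are already sorted by increasing X. The helper name `monotonic_direction` is hypothetical and not part of any library.

```python
import numpy as np

def monotonic_direction(y_sorted_by_x):
    """Classify a sequence (ordered by increasing X) in the strict sense."""
    steps = np.diff(np.asarray(y_sorted_by_x, dtype=float))
    if np.all(steps > 0):
        return "increasing"
    if np.all(steps < 0):
        return "decreasing"
    return "non-monotonic"

x = np.linspace(0, 10, 50)  # already sorted in increasing order

print(monotonic_direction(np.exp(x)))      # exponential growth
print(monotonic_direction(-x ** 3))        # negated cube
print(monotonic_direction((x - 5) ** 2))   # parabola with its minimum at X = 5
```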

#### Mathematical Expression

Suppose we have \(n\) observations of two random variables, \(X\) and \(Y\). We first rank all values of both variables as \(X_r\) and \(Y_r\) respectively. The Spearman rank correlation coefficient is denoted by \(r_s\) and is calculated by:

$$
r_s = \rho_{X_r,Y_r} = \frac{\text{COV}(X_r,Y_r)}{\text{STD}(X_r)\text{STD}(Y_r)} = \frac{n\sum\limits_{x_r\in X_r, y_r \in Y_r} x_r y_r - \sum\limits_{x_r\in X_r}x_r\sum\limits_{y_r\in Y_r}y_r}{\sqrt{\Big(n\sum\limits_{x_r \in X_r} x_r^2 -(\sum\limits_{x_r\in X_r}x_r)^2\Big)}\sqrt{\Big(n\sum\limits_{y_r \in Y_r} y_r^2 - (\sum\limits_{y_r\in Y_r}y_r)^2 \Big)}}
$$

Here, `COV()` is the covariance, and `STD()` is the standard deviation. Before we see Python's functions for computing this coefficient, let's do an example computation by hand to understand the expression and get to appreciate it.

#### Example Computation

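Suppose we are given some observations of the random variables \(X\) and \(Y\). The first step is to convert \(X\) and \(Y\) to \(X_r\) and \(Y_r\), which represent their corresponding ranks. A few intermediate values are also needed. For the five observations used in this example, they are shown below:

| \(X\) | \(Y\) | \(X_r\) | \(Y_r\) | \(X_r^2\) | \(Y_r^2\) | \(X_r Y_r\) |
|---|---|---|---|---|---|---|
| -2 | 4 | 1 | 5 | 1 | 25 | 5 |
| -1 | 1 | 2 | 2 | 4 | 4 | 4 |
| 0 | 3 | 3 | 4 | 9 | 16 | 12 |
| 1 | 2 | 4 | 3 | 16 | 9 | 12 |
| 2 | 0 | 5 | 1 | 25 | 1 | 5 |
| | **Sum** | 15 | 15 | 55 | 55 | 38 |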

Let's use the formula from before to compute the Spearman correlation:

$$
r_s = \frac{5 \ast 38 - (15)(15)}{\sqrt{(5 \ast 55 - 15^2)}\sqrt{(5 \ast 55 - 15^2)}} = \frac{-35}{50} = -0.7
$$

Great! Though, calculating this manually is time-consuming, and the best use of computers is to, well, compute things for us. Computing the Spearman correlation is really easy and straightforward with built-in functions in Pandas.
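The hand computation can be reproduced step by step in a few lines. This is only a sketch of the formula above, not how pandas implements it internally. It uses the same five observations that the `x_simple` example in the next section is built from.

```python
import numpy as np
import pandas as pd

# The five (X, Y) observations of the simple example
x = pd.Series([-2, -1, 0, 1, 2])
y = pd.Series([4, 1, 3, 2, 0])

n = len(x)
xr, yr = x.rank(), y.rank()  # convert raw values to ranks

# Numerator and denominator of the expression for r_s
numerator = n * (xr * yr).sum() - xr.sum() * yr.sum()
denominator = np.sqrt(n * (xr ** 2).sum() - xr.sum() ** 2) * np.sqrt(
    n * (yr ** 2).sum() - yr.sum() ** 2
)

r_s = numerator / denominator
print(round(r_s, 4))
```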

### Computing the Spearman Rank Correlation Coefficient Using *Pandas*

The various correlation coefficients, including Spearman, can be computed via the `corr()` method of the Pandas library.

As an input argument, the `corr()` method accepts the method to be used for computing correlation (`spearman` in our case). It is called on a `DataFrame`, say of size `m x n`, where each column represents the values of a random variable and `m` represents the total samples of each variable.

For `n` random variables, it returns an `n x n` square matrix `R`. `R(i,j)` indicates the Spearman rank correlation coefficient between the random variables `i` and `j`. As the correlation coefficient between a variable and itself is 1, all diagonal entries `(i,i)` are equal to unity. In short:

$$
R(i,j) = \begin{cases} r_s(i,j) & \text{if } i \neq j \\ 1 & \text{if } i = j \end{cases}
$$

Note that the correlation matrix is symmetric as correlation is symmetric, i.e., `R(i,j)=R(j,i)`. Let's take our simple example from the previous section and see how to use Pandas' `corr()` method:

```
import numpy as np
import pandas as pd
import seaborn as sns # For pairplots and heatmaps
import matplotlib.pyplot as plt
```

We'll be using Pandas for the computation itself, Matplotlib with Seaborn for visualization and NumPy for additional operations on the data.

The code below computes the Spearman correlation matrix on the DataFrame `x_simple`. Note the **ones** on the diagonal, indicating that the correlation coefficient of a variable with itself is, naturally, **one**:

```
x_simple = pd.DataFrame([(-2,4),(-1,1),(0,3),(1,2),(2,0)],
columns=["X","Y"])
my_r = x_simple.corr(method="spearman")
print(my_r)
```

```
     X    Y
X  1.0 -0.7
Y -0.7  1.0
```
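As a sanity check, the sketch below verifies three properties of this matrix: it is symmetric, its diagonal is all ones, and it equals the Pearson correlation computed on the ranks. The variable names other than `x_simple` are introduced here only for illustration.

```python
import numpy as np
import pandas as pd

x_simple = pd.DataFrame([(-2, 4), (-1, 1), (0, 3), (1, 2), (2, 0)],
                        columns=["X", "Y"])

spearman = x_simple.corr(method="spearman")
# Spearman is Pearson applied to the ranks of each column
pearson_on_ranks = x_simple.rank().corr(method="pearson")

print(np.allclose(spearman, pearson_on_ranks))  # same matrix either way
print(np.allclose(spearman, spearman.T))        # symmetric
print(np.allclose(np.diag(spearman), 1.0))      # ones on the diagonal
```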

### Visualizing the Correlation Coefficient

Given the table-like structure of bounded intensities, `[-1, 1]`, a natural and convenient way of *visualizing* the correlation coefficient is a **heatmap**.

If you'd like to read more about heatmaps in Seaborn, read our Ultimate Guide to Heatmaps in Seaborn with Python!

A heatmap is a grid of cells, where each cell is assigned a color according to its value, and this visual way of interpreting correlation matrices is much easier for us than parsing numbers. For small tables like the one printed above, reading the numbers is perfectly fine. But with *a lot* of variables, it's much harder to actually interpret what's going on.

Let's define a `display_correlation()` function that computes the correlation coefficient and displays it as a heatmap:

```
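def display_correlation(df):
    # Compute the Spearman correlation matrix once and reuse it for the plot
    r = df.corr(method="spearman")
    plt.figure(figsize=(10,6))
    # Plot r itself: df.corr() without arguments would default to Pearson
    heatmap = sns.heatmap(r, vmin=-1,
                          vmax=1, annot=True)
    plt.title("Spearman Correlation")
    return r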
```

Let's call `display_correlation()` on our `x_simple` DataFrame to visualize the Spearman correlation:

```
r_simple=display_correlation(x_simple)
```

### Understanding Spearman's Correlation Coefficient on Synthetic Examples

To understand the Spearman correlation coefficient, let's generate a few synthetic examples that accentuate how the coefficient works, before we dive into more natural examples. These examples will help us understand for what type of relationships this coefficient is +1, -1, or close to zero.

Before generating the examples, we'll create a new helper function, `plot_data_corr()`, that calls `display_correlation()` and plots the data against the `X` variable:

```
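def plot_data_corr(df, title, color="green"):
    # Heatmap of the Spearman matrix; r is reused for the subplot titles
    r = display_correlation(df)
    # squeeze=False keeps ax two-dimensional even with a single subplot
    fig, ax = plt.subplots(nrows=1, ncols=len(df.columns)-1,
                           figsize=(14,3), squeeze=False)
    for i in range(1, len(df.columns)):
        # Scatter column i against X and show its correlation with X
        ax[0, i-1].scatter(df["X"], df.values[:,i], color=color)
        ax[0, i-1].title.set_text(title[i] + '\n r = ' +
                                  "{:.2f}".format(r.values[0,i]))
        ax[0, i-1].set(xlabel=df.columns[0], ylabel=df.columns[i])
    fig.subplots_adjust(wspace=.7)
    plt.show()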
```

#### Monotonically Increasing Functions

Let's generate a few monotonically increasing functions, using NumPy, and take a peek at the `DataFrame` once filled with the synthetic data:

```
seed = 11
rand = np.random.RandomState(seed)
# Create a data frame using various monotonically increasing functions
x_incr = pd.DataFrame({"X":rand.uniform(0,10,100)})
x_incr["Line+"] = x_incr.X*2+1
x_incr["Sq+"] = x_incr.X**2
x_incr["Exp+"] = np.exp(x_incr.X)
x_incr["Cube+"] = (x_incr.X-5)**3
print(x_incr.head())
```

| | X | Line+ | Sq+ | Exp+ | Cube+ |
|---|---|---|---|---|---|
| 0 | 1.802697 | 4.605394 | 3.249716 | 6.065985 | -32.685221 |
| 1 | 0.194752 | 1.389505 | 0.037929 | 1.215010 | -110.955110 |
| 2 | 4.632185 | 10.264371 | 21.457140 | 102.738329 | -0.049761 |
| 3 | 7.249339 | 15.498679 | 52.552920 | 1407.174809 | 11.380593 |
| 4 | 4.202036 | 9.404072 | 17.657107 | 66.822246 | -0.508101 |

Now let's look at the Spearman correlation's heatmap and the plot of various functions against `X`:

```
plot_data_corr(x_incr,["X","2X+1","$X^2$","$e^X$","$(X-5)^3$"])
```

We can see that for all these examples, there is a perfectly monotonically increasing relationship between the variables. The Spearman correlation is +1, regardless of whether the variables have a linear or a non-linear relationship.

**Pearson** would've produced much different results here, since it's computed based on the **linear** relationship between the variables.

As long as Y increases as X increases, without fail, the Spearman rank correlation coefficient will be 1.
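To see the difference from Pearson in numbers, the sketch below compares both coefficients on two of the increasing functions. The seed and variable names are arbitrary choices made for this illustration, so the exact Pearson values will differ from any figure in this guide.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)  # arbitrary seed, for reproducibility only
df = pd.DataFrame({"X": rng.uniform(0, 10, 100)})
df["Sq+"] = df.X ** 2
df["Exp+"] = np.exp(df.X)

# Correlation of every column with X under both methods
spearman_with_x = df.corr(method="spearman")["X"]
pearson_with_x = df.corr(method="pearson")["X"]

print(pd.DataFrame({"spearman": spearman_with_x, "pearson": pearson_with_x}))
```

The Spearman column is 1 for every function, while the Pearson value for `Exp+` falls noticeably below 1 because the relationship is far from linear.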

#### Monotonically Decreasing Functions

Let's repeat the same examples on monotonically decreasing functions. We'll again generate synthetic data and compute the Spearman rank correlation. First, let's look at the first 5 rows of the `DataFrame`:

```
# Create a data matrix
x_decr = pd.DataFrame({"X":rand.uniform(0,10,100)})
x_decr["Line-"] = -x_decr.X*2+1
x_decr["Sq-"] = -x_decr.X**2
x_decr["Exp-"] = np.exp(-x_decr.X)
x_decr["Cube-"] = -(x_decr.X-5)**3
x_decr.head()
```

| | X | Line- | Sq- | Exp- | Cube- |
|---|---|---|---|---|---|
| 0 | 3.181872 | -5.363744 | -10.124309 | 0.041508 | 6.009985 |
| 1 | 2.180034 | -3.360068 | -4.752547 | 0.113038 | 22.424963 |
| 2 | 8.449385 | -15.898771 | -71.392112 | 0.000214 | -41.041680 |
| 3 | 3.021647 | -5.043294 | -9.130350 | 0.048721 | 7.743039 |
| 4 | 4.382207 | -7.764413 | -19.203736 | 0.012498 | 0.235792 |

The correlation matrix's heatmap and the plot of the variables are given below:

```
plot_data_corr(x_decr,["X","-2X+1","$-X^2$","$e^{-X}$","$-(X-5)^3$"],"blue")
```
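Because the coefficient depends only on ranks, applying a strictly increasing transformation to a variable leaves it unchanged, while negating a variable flips its sign. The sketch below illustrates this on made-up noisy data. The helper `spearman()` and the data are assumptions introduced for this example only.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)  # arbitrary seed, for reproducibility only
x = rng.uniform(0, 10, 100)
y = -x + rng.normal(0, 2, 100)  # noisy decreasing trend (made-up data)

def spearman(a, b):
    """Spearman coefficient of two 1-D arrays via DataFrame.corr()."""
    frame = pd.DataFrame({"a": a, "b": b})
    return frame.corr(method="spearman").loc["a", "b"]

original = spearman(x, y)
after_increasing = spearman(x, np.exp(y))  # exp keeps the order of y
after_negation = spearman(x, -y)           # negation reverses the order of y

print(original, after_increasing, after_negation)
```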

#### Non-monotonic Functions

The examples below are for various non-monotonic functions. The last column added to the `DataFrame` is that of an independent variable `Rand`, which has no association with `X`.

These examples should also clarify that Spearman correlation is a measure of the **monotonicity** of a relationship between two variables. A zero coefficient does not necessarily indicate no relationship, but it does indicate that there is no *monotonicity* between them.

Before generating synthetic data, we'll define yet another helper function, `display_corr_pairs()`, that calls `display_correlation()` to display the heatmap of the correlation matrix and then plots all pairs of variables in the `DataFrame` against each other using the Seaborn library.

On the diagonals, we'll display the histogram of each variable in yellow color using `map_diag()`. Below the diagonals, we'll make a scatter plot of all variable pairs. As the correlation matrix is symmetric, we don't need the plots above the diagonals.

Let's also display the Pearson correlation coefficient for comparison:

```
def display_corr_pairs(df, color="magenta"):
    # r is the Spearman matrix (also drawn as a heatmap), rho is Pearson
    r = display_correlation(df)
    rho = df.corr(method="pearson")
    g = sns.PairGrid(df, corner=True)
    g.map_diag(plt.hist, color="yellow")
    g.map_lower(sns.scatterplot, color=color)
    # With corner=True the axes above the diagonal are None, so skip them
    for i in range(g.axes.shape[0]):
        for j in range(i + 1):
            g.axes[i, j].set_title("r = {:.2f}\n $\\rho$ = {:.2f}".format(
                r.iloc[i, j], rho.iloc[i, j]))
    plt.subplots_adjust(hspace=0.6)
    plt.show()
```

We'll create a non-monotonic DataFrame, `x_non`, with these functions of `X`:

- Parabola: \( (X-5)^2 \)
- Sin: \( \sin (\frac{X}{10}2\pi) \)
- Frac: \( \frac{X-5}{(X-5)^2+1} \)
- Rand: Random numbers in the range [-1,1]

Below are the first 5 rows of `x_non`:

```
x_non = pd.DataFrame({"X":rand.uniform(0,10,100)})
x_non["Parabola"] = (x_non.X-5)**2
x_non["Sin"] = np.sin(x_non.X/10*2*np.pi)
x_non["Frac"] = (x_non.X-5)/((x_non.X-5)**2+1)
x_non["Rand"] = rand.uniform(-1,1,100)
print(x_non.head())
```

| | X | Parabola | Sin | Frac | Rand |
|---|---|---|---|---|---|
| 0 | 0.654466 | 18.883667 | 0.399722 | -0.218548 | 0.072827 |
| 1 | 5.746559 | 0.557351 | -0.452063 | 0.479378 | -0.818150 |
| 2 | 6.879362 | 3.532003 | -0.924925 | 0.414687 | -0.868501 |
| 3 | 5.683058 | 0.466569 | -0.416124 | 0.465753 | 0.337066 |
| 4 | 6.037265 | 1.075920 | -0.606565 | 0.499666 | 0.583229 |

The Spearman correlation coefficient between different data pairs is illustrated below:

```
display_corr_pairs(x_non)
```

These examples show for what type of data the Spearman correlation is close to zero and where it takes intermediate values. Note also that the Spearman and Pearson correlation coefficients are not always in agreement with each other, so a near-zero value of one doesn't imply a near-zero value of the other. They measure different facets of the relationship between variables and can't be used interchangeably.
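A compact way to see that a zero coefficient does not mean "no relationship" is a parabola sampled symmetrically around its minimum. In the sketch below, `Y` is completely determined by `X`, yet the increasing and decreasing halves cancel out. The integer grid is an assumption made so that the symmetry is exact.

```python
import numpy as np
import pandas as pd

# Symmetric integer grid around zero, so both halves of the parabola match exactly
x = np.arange(-50, 51)
df = pd.DataFrame({"X": x, "Y": x ** 2})

rho = df.corr(method="spearman").loc["X", "Y"]
print(abs(rho) < 1e-9)  # the coefficient is zero up to rounding error
```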

### Spearman Correlation Coefficient on *Linnerud* Dataset

Let's apply the Spearman correlation coefficient on an actual dataset. We have chosen the simple physical exercise dataset called `linnerud` from the `sklearn.datasets` package for demonstration:

```
from sklearn.datasets import load_linnerud
```

The code below loads the dataset and joins the target variables and attributes in one `DataFrame`. Let's look at the first 5 rows of the `linnerud` data:

```
d=load_linnerud()
dat = pd.DataFrame(d.data,columns=d.feature_names)
alldat=dat.join(pd.DataFrame(d.target,columns=d.target_names) )
alldat.head()
```

| | Chins | Situps | Jumps | Weight | Waist | Pulse |
|---|---|---|---|---|---|---|
| 0 | 5.0 | 162.0 | 60.0 | 191.0 | 36.0 | 50.0 |
| 1 | 2.0 | 110.0 | 60.0 | 189.0 | 37.0 | 52.0 |
| 2 | 12.0 | 101.0 | 101.0 | 193.0 | 38.0 | 58.0 |
| 3 | 12.0 | 105.0 | 37.0 | 162.0 | 35.0 | 62.0 |
| 4 | 13.0 | 155.0 | 58.0 | 189.0 | 35.0 | 46.0 |

Now, let's display the correlation pairs using our `display_corr_pairs()` function:

```
display_corr_pairs(alldat)
```

Looking at the Spearman correlation values, we can make interesting conclusions such as:

- Higher waist values tend to go with higher weight values (from **r = 0.81**)
- More situps tend to go with lower waist values (from **r = -0.72**)
- Chins, situps and jumps don't seem to have a monotonic relationship with pulse, as the corresponding r values are close to zero.
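With more variables, reading such conclusions off a heatmap gets tedious, so it can help to list the variable pairs sorted by the strength of their correlation. The following is a minimal sketch. The helper `strongest_pairs()` and the `demo` data are hypothetical and only illustrate the idea. The same call could be made on any `DataFrame`, such as `alldat`.

```python
import numpy as np
import pandas as pd

def strongest_pairs(df, method="spearman"):
    """Return each variable pair once, sorted by absolute correlation."""
    corr = df.corr(method=method)
    # Keep only entries strictly below the diagonal so each pair appears once
    below_diagonal = np.tril(np.ones(corr.shape, dtype=bool), k=-1)
    pairs = corr.where(below_diagonal).stack().dropna()
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False)

# Made-up data, for illustration only
demo = pd.DataFrame({
    "a": [1, 2, 3, 4, 5, 6],
    "b": [2, 4, 5, 9, 11, 20],  # always increases with a
    "c": [9, 7, 8, 3, 2, 1],    # mostly decreases with a
})

result = strongest_pairs(demo)
print(result)
```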

### Conclusions

In this guide, we discussed the Spearman rank correlation coefficient, its mathematical expression, and its computation via Python's `pandas` library.

We demonstrated this coefficient on various synthetic examples and also on the `linnerud` dataset. The Spearman correlation coefficient is an ideal measure for computing the monotonicity of the relationship between two variables. However, a close to zero value does not necessarily indicate that the variables have no association between them.

**Reference: stackabuse.com**