16 Regression

When you analyse experimental data you will want to see whether there is any relation between the sets of data. One way to do this is to calculate the Pearson product-moment correlation coefficient.

16.1 Pearson's Product-Moment Correlation Coefficient

Karl Pearson devised a coefficient to measure the linear correlation between two sets of data. The coefficient ranges from $-1$ to $1$. A value of $1$ means there is perfect positive correlation between the data sets, a value of $0$ means there is no linear correlation and a value of $-1$ means there is perfect negative correlation between the data sets.

Note: Two sets of data may be strongly correlated without one causing the other; correlation does not mean causality.

Fig 16.1 Correlation

Assume you have a set of independent data $x$ with corresponding dependent data $y$. Pearson's product-moment correlation coefficient is then given by

$r=\frac{n \sum xy- \sum x \sum y}{\sqrt{n\sum x^2-(\sum x)^2} \times \sqrt{n\sum y^2-(\sum y)^2}}$

where $n$ is the number of $(x, y)$ pairs and each $\sum$ is taken over all the pairs, so $\sum xy$ is the sum of the products $x \times y$ and $\sum x^2$ is the sum of the squares of $x$.

As a rule of thumb, if $|r| \lt 0.5$ the correlation is weak or non-existent; if $|r| \geq 0.5$ there is a correlation between the two sets of data.
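The formula translates almost line for line into code. The following is a minimal Python sketch, assuming the data arrive as two equal-length lists; the names `pearson_r` and `describe_correlation` and the sample data are illustrative and not part of the text.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson's r computed from the sums that appear in the formula."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    # The denominator is zero if all x (or all y) are equal; r is undefined then.
    denominator = sqrt(n * sum_x2 - sum_x ** 2) * sqrt(n * sum_y2 - sum_y ** 2)
    return numerator / denominator


def describe_correlation(r, threshold=0.5):
    """Apply the |r| >= 0.5 test quoted in the text."""
    if abs(r) < threshold:
        return "weak or non-existent"
    return "correlated"


# Made-up data: y rises exactly with x, then falls exactly with x.
rising = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])
falling = pearson_r([1, 2, 3, 4], [8, 6, 4, 2])
print(f"rising:  r = {rising:.3f} ({describe_correlation(rising)})")
print(f"falling: r = {falling:.3f} ({describe_correlation(falling)})")
```

The two made-up data sets lie exactly on straight lines, so they give values of $r$ at the two ends of the range.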




If $|r| \geq 0.5$ then you will want to find the gradient and y-intercept of the regression line $y = mx + c$. To do this we will use the method of least squares. Least squares minimises the sum of the squares of the vertical distances (the differences in $y$) between each data point and the regression line. The distances are squared because points below the line give a negative distance, which would otherwise cancel against the positive distances when the separate distances are summed. The gradient is given by $m = \frac{n \sum xy- \sum x \sum y}{n\sum x^2-(\sum x)^2}$ and the y-intercept is given by $c = \frac{\sum x^2 \sum y- \sum x \sum xy}{n\sum x^2-(\sum x)^2}$
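The two formulas share a denominator, which makes them easy to compute together. The following is a minimal Python sketch, assuming the data are two equal-length lists; the function name `least_squares_line` and the sample points are illustrative only.

```python
def least_squares_line(xs, ys):
    """Return (m, c) for the least-squares line y = m*x + c."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    denominator = n * sum_x2 - sum_x ** 2
    if denominator == 0:
        # All x values are equal, so no unique line exists.
        raise ValueError("x values must not all be equal")
    m = (n * sum_xy - sum_x * sum_y) / denominator
    c = (sum_x2 * sum_y - sum_x * sum_xy) / denominator
    return m, c


# Made-up points that lie exactly on y = 2x + 1.
m, c = least_squares_line([0, 1, 2, 3], [1, 3, 5, 7])
print(f"gradient m = {m}, y-intercept c = {c}")
```

Because the sample points lie exactly on a line, the sketch should recover that line's gradient and intercept.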




Example 16.1: Given the following data calculate the Pearson correlation coefficient.

| $x$ | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| $y$ | 8 | 14 | 13 | 20 | 19 |

To find $r$ we need $\sum x$, $\sum y$, $\sum xy$, $\sum x^2$, $\sum y^2$ and $n$.


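|  |  |  |  |  |  | Sum |
|---|---|---|---|---|---|---|
| $x$ | 0 | 1 | 2 | 3 | 4 | 10 |
| $y$ | 8 | 14 | 13 | 20 | 19 | 74 |
| $xy$ | 0 | 14 | 26 | 60 | 76 | 176 |
| $x^2$ | 0 | 1 | 4 | 9 | 16 | 30 |
| $y^2$ | 64 | 196 | 169 | 400 | 361 | 1190 |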

Putting these values into Pearson's equation

$r=\frac{n \sum xy- \sum x \sum y}{\sqrt{n\sum x^2-(\sum x)^2} \times \sqrt{n\sum y^2-(\sum y)^2}}$

$=\frac{5 \times 176 - 10 \times 74}{\sqrt{5 \times 30 - 10^2} \times \sqrt{5 \times 1190 - 74^2}}$

$=0.909$

A value of $r=0.909$ means there is good correlation for these data, so it is worth calculating the gradient and y-intercept of the regression line.

$m=\frac{n \sum xy- \sum x \sum y}{n\sum x^2-(\sum x)^2}$

$=\frac{5 \times 176 - 10 \times 74}{5 \times 30 - 10^2}$

$=2.80$

$c=\frac{ \sum x^2 \times \sum y - \sum x \sum xy}{n\sum x^2-(\sum x)^2} $

$=\frac{30 \times 74 - 10 \times 176}{5 \times 30 - 10^2}$

$=9.2$

Here is a plot of the data and the regression line $y = 2.80x + 9.2$.

Fig 16.2 Regression line through data.
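The fitted line from Example 16.1 can also be checked numerically. The short Python sketch below takes the data and the values $m = 2.80$ and $c = 9.2$ from the example; the helper name `sum_squared_residuals` and the step of $0.1$ are illustrative choices. It shows two properties of a least-squares line: the vertical distances from the points to the line sum to zero, and nudging the gradient or the intercept increases the sum of the squared distances.

```python
# Data from Example 16.1 and the line fitted there.
xs = [0, 1, 2, 3, 4]
ys = [8, 14, 13, 20, 19]
M_FIT, C_FIT = 2.8, 9.2


def sum_squared_residuals(m, c):
    """Sum of squared vertical distances between the data and y = m*x + c."""
    return sum((y - (m * x + c)) ** 2 for x, y in zip(xs, ys))


# Vertical distance of each point from the fitted line (negative = below).
residuals = [y - (M_FIT * x + C_FIT) for x, y in zip(xs, ys)]

best = sum_squared_residuals(M_FIT, C_FIT)
steeper = sum_squared_residuals(M_FIT + 0.1, C_FIT)  # nudge the gradient
higher = sum_squared_residuals(M_FIT, C_FIT + 0.1)   # nudge the intercept

print("residuals:", [round(e, 2) for e in residuals])
print(f"sum of residuals:            {sum(residuals):.2f}")
print(f"sum of squares, fitted line: {best:.2f}")
print(f"sum of squares, m + 0.1:     {steeper:.2f}")
print(f"sum of squares, c + 0.1:     {higher:.2f}")
```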

16.2 Non-linear Regression

There are several families of non-linear regression. In this section we will consider problems where a function of one of the variables can be used in place of the variable itself, so that the relation becomes a straight line. As an example consider a simple pendulum. The period of a simple pendulum is given by $t = 2 \pi \sqrt{\frac{l}{g}}$. If we plot $t$ for a range of values of $l$ we get a square-root curve. If, on the other hand, we plot $t^2$ for a range of values of $l$ we get a straight line and that simplifies the arithmetic.
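Squaring both sides of the pendulum formula shows why the plot of $t^2$ is a straight line (a short derivation, assuming the ideal simple-pendulum formula above holds):

$t^2 = 4 \pi^2 \frac{l}{g} = \frac{4 \pi^2}{g} \, l$

This has the form $y = mx$ with $y = t^2$, $x = l$ and gradient $m = \frac{4 \pi^2}{g}$, so a measured gradient gives $g = \frac{4 \pi^2}{m}$.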


Example 16.2: The following data is the altitude (km) of a rocket at given times (s) after launch.

| Time (s) | 0 | 40 | 80 | 120 | 160 |
|---|---|---|---|---|---|
| Altitude (km) | 0 | 4.7 | 16.2 | 28.9 | 49.1 |

Here is a plot of the data.

Fig 16.3 Altitude (km) vs time (s).

We know the curve is of the form $s = \frac{1}{2}at^2$, where $s$ is the altitude and $a$ is the acceleration, so instead of plotting altitude vs time we will plot altitude vs time$^2$. Here is a plot.

Fig 16.4 Altitude (km) vs time$^2$ (s$^2 \times 1000$).

To find Pearson's correlation coefficient $r$, take $x$ as time$^2$ in units of $1000$ s$^2$ and $y$ as the altitude in km. We need $\sum x$, $\sum y$, $\sum xy$, $\sum x^2$, $\sum y^2$ and $n$.


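|  |  |  |  |  |  | Sum |
|---|---|---|---|---|---|---|
| $x$ | 0 | 1.6 | 6.4 | 14.4 | 25.6 | 48 |
| $y$ | 0 | 4.7 | 16.2 | 28.9 | 49.1 | 98.9 |
| $xy$ | 0 | 7.52 | 103.7 | 416.2 | 1257 | 1784 |
| $x^2$ | 0 | 2.56 | 40.96 | 207.4 | 655.4 | 906.2 |
| $y^2$ | 0 | 22 | 262 | 835 | 2411 | 3530 |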

Putting these values into Pearson's equation

$r=\frac{n \sum xy- \sum x \sum y}{\sqrt{n\sum x^2-(\sum x)^2} \times \sqrt{n\sum y^2-(\sum y)^2}}$

$=\frac{5 \times 1784 - 48 \times 98.9}{\sqrt{5 \times 906.2 - 48^2} \times \sqrt{5 \times 3530 - 98.9^2}}$

$=0.9970$

A value of $r=0.9970$ means there is a high correlation for these data, so it is worth calculating the gradient.

$m=\frac{n \sum xy- \sum x \sum y}{n\sum x^2-(\sum x)^2}$

$=\frac{5 \times 1784 - 48 \times 98.9}{5 \times 906.2 - 48^2}$

$=1.87$

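Remember $s=\frac{1}{2}at^2$. We have plotted $s$ against $t^2$ so the gradient $=\frac{1}{2}a$. The gradient is in km per $1000$ s$^2$, which is the same as m/s$^2$. Using the unrounded gradient $1.874$, the acceleration is $3.75$ m/s$^2$.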
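The whole of Example 16.2 can be reproduced in a few lines. The Python sketch below uses the data from the example; the variable names are illustrative. The only non-linear step is the transformation of the time values, after which the gradient formula is applied unchanged.

```python
# Data from Example 16.2.
times_s = [0, 40, 80, 120, 160]
altitudes_km = [0, 4.7, 16.2, 28.9, 49.1]

# Linearise: use x = t^2 / 1000 (units of 1000 s^2) in place of t.
xs = [t ** 2 / 1000 for t in times_s]
ys = altitudes_km

n = len(xs)
sum_x = sum(xs)
sum_y = sum(ys)
sum_xy = sum(x * y for x, y in zip(xs, ys))
sum_x2 = sum(x * x for x in xs)

# Least-squares gradient of altitude against time squared.
gradient = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)

# km per 1000 s^2 equals m/s^2, and gradient = a / 2.
acceleration = 2 * gradient

print(f"gradient     = {gradient:.2f} km per 1000 s^2")
print(f"acceleration = {acceleration:.2f} m/s^2")
```

The sketch works from the unrounded sums, so its intermediate values can differ slightly in the last digit from those in the rounded table.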