Unit V
Correlation and Regression
The scatter diagram method is the simplest way to study the correlation between two variables: each pair of values (X, Y) is plotted on a graph as a dot, giving as many points as there are observations. The degree of correlation is then judged from the way the points are scattered.
The degree to which the variables are related depends on the manner in which the points are scattered over the chart. The more widely the points are scattered, the lower the degree of correlation between the variables. The more closely the points cluster around a straight line, the higher the degree of correlation. The degree of correlation is denoted by “r”.
The following types of scatter diagram show the degree of correlation between variable X and variable Y:
1. Perfect Positive Correlation (r = +1): The correlation is said to be perfectly positive when all the points lie on a straight line rising from the lower left-hand corner to the upper right-hand corner.
2. Perfect Negative Correlation (r = −1): When all the points lie on a straight line falling from the upper left-hand corner to the lower right-hand corner, the variables are said to be perfectly negatively correlated.
3. High Degree of +Ve Correlation (r = + High): The degree of correlation is high when the points fall within a narrow band, and it is positive when they show a rising tendency from the lower left-hand corner to the upper right-hand corner.
4. High Degree of –Ve Correlation (r = – High): The degree of negative correlation is high when the points fall within a narrow band and show a declining tendency from the upper left-hand corner to the lower right-hand corner.
5. Low Degree of +Ve Correlation (r = + Low): The correlation between the variables is said to be low but positive when the points are widely scattered over the graph and show a rising tendency from the lower left-hand corner to the upper right-hand corner.
6. Low Degree of –Ve Correlation (r = – Low): The degree of correlation is low and negative when the points are widely scattered over the graph and show a falling tendency from the upper left-hand corner to the lower right-hand corner.
7. No Correlation (r = 0): The variables are said to be unrelated when the points are haphazardly scattered over the graph and do not show any specific pattern. Here the correlation is absent and hence r = 0.
Thus, the scatter diagram method is the simplest device for studying the degree of relationship between two variables: one dot is plotted for each pair of values. The chart on which the dots are plotted is also called a dotogram.
Two-way tables organize data based on two categorical variables.
Two-way frequency tables
Two-way frequency tables show how many data points fit in each category. A two-way table (also called a contingency table) is a useful tool for examining relationships between categorical variables. The entries in the cells of a two-way table can be frequency counts or relative frequencies (just like a one-way table).
 Dance  Sports  TV  Total 
Men  2  10  8  20 
Women  16  6  8  30 
Total  18  16  16  50 
Above, a two-way table shows the favorite leisure activities for 50 adults: 20 men and 30 women. Because entries in the table are frequency counts, the table is a frequency table.
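As a minimal sketch of how such a table is tallied, the following Python snippet rebuilds the leisure-activity table from individual (gender, activity) observations and derives the row totals, column totals and relative frequencies. The function name `two_way_table` is illustrative only and not part of any library.

```python
from collections import Counter

# Cell counts copied from the leisure-activity table above.
counts = {
    ("Men", "Dance"): 2, ("Men", "Sports"): 10, ("Men", "TV"): 8,
    ("Women", "Dance"): 16, ("Women", "Sports"): 6, ("Women", "TV"): 8,
}
# Expand the counts into one (gender, activity) pair per adult.
observations = [pair for pair, n in counts.items() for _ in range(n)]


def two_way_table(pairs):
    """Tally (row, column) pairs into cell counts, row totals and column totals."""
    cells = Counter(pairs)
    row_totals = Counter(row for row, _ in pairs)
    col_totals = Counter(col for _, col in pairs)
    return cells, row_totals, col_totals


cells, row_totals, col_totals = two_way_table(observations)
grand_total = sum(row_totals.values())

# Relative frequencies: each cell count divided by the grand total.
relative = {pair: n / grand_total for pair, n in cells.items()}

print(dict(row_totals))   # totals for Men and Women
print(dict(col_totals))   # totals for Dance, Sports and TV
```

Dividing every cell by the grand total turns the frequency table into a relative frequency table whose entries sum to 1.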
Here's another example:
Preference  Male  Female 
Prefers dogs  36  22 
Prefers cats  8  26 
No preference  2  6 
The columns of the table tell us whether the student is a male or a female. The rows of the table tell us whether the student prefers dogs, cats, or doesn't have a preference.
Each cell tells us the number (or frequency) of students. For example, the 36 is in the male column and the prefers dogs row. This tells us that there are 36 males who preferred dogs in this dataset.
Notice that there are two variables, gender and preference; this is where the "two" in two-way frequency table comes from.
A conditional distribution is a probability distribution for a subpopulation. In other words, it shows the probability that a randomly selected item in a subpopulation has a characteristic you’re interested in. For example, if you are studying eye colors (the population) you might want to know how many people have blue eyes (the subpopulation). Conditional distributions are easier to find with the help of a table.
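To make the idea concrete, here is a small sketch that computes conditional distributions from the leisure-activity table given earlier in this unit (20 men, 30 women). Each row is treated as a subpopulation; the helper name `conditional_distribution` is an assumption made for this illustration.

```python
from fractions import Fraction

# Frequency counts from the leisure-activity two-way table.
table = {
    "Men": {"Dance": 2, "Sports": 10, "TV": 8},
    "Women": {"Dance": 16, "Sports": 6, "TV": 8},
}


def conditional_distribution(table, row):
    """Distribution of the column variable within one row (one subpopulation)."""
    total = sum(table[row].values())
    return {col: Fraction(n, total) for col, n in table[row].items()}


men = conditional_distribution(table, "Men")
women = conditional_distribution(table, "Women")

for activity in ("Dance", "Sports", "TV"):
    print(activity, men[activity], women[activity])
```

Each conditional distribution sums to 1 within its own subpopulation, which is what makes the men's and women's preferences directly comparable even though the group sizes differ.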
The following table shows how computer use differs across socioeconomic backgrounds. The table shows the “big picture” across all subjects for the entire sample. SES in the table stands for socioeconomic status.
Marginal Distribution
Definition of a marginal distribution: If X and Y are discrete random variables and f(x, y) is the value of their joint probability distribution at (x, y), the functions given by
g(x) = Σy f(x, y) and h(y) = Σx f(x, y)
are the marginal distributions of X and Y, respectively. Here Σy denotes the sum over all values of y, and Σx the sum over all values of x.
A marginal distribution gets its name because it appears in the margins of a probability distribution table.
Of course, it’s not quite as simple as that. You can’t just look at any old frequency distribution table and say that the last column (or row) is a “marginal distribution.” Marginal distributions follow a couple of rules:
 The distribution must be from bivariate data. Bivariate is just another way of saying “two variables,” like X and Y. In the table above, the random variables i and j are coming from the roll of two dice.
 A marginal distribution is where you are only interested in one of the random variables. In other words, either X or Y. If you look at the probability table above, the sum probabilities of one variable are listed in the bottom row and the other sum probabilities are listed in the right column. So this table has two marginal distributions.
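The definition g(x) = Σy f(x, y) and h(y) = Σx f(x, y) can be sketched in a few lines of Python. The joint distribution below assumes two fair dice, so every pair has probability 1/36; that fairness assumption and the function name `marginals` are introduced here purely for illustration.

```python
from collections import defaultdict
from fractions import Fraction

# Assumed joint distribution f(x, y) of two fair dice: each pair has probability 1/36.
joint = {(x, y): Fraction(1, 36) for x in range(1, 7) for y in range(1, 7)}


def marginals(joint):
    """Return g(x) = sum over y of f(x, y) and h(y) = sum over x of f(x, y)."""
    g = defaultdict(Fraction)  # marginal distribution of X
    h = defaultdict(Fraction)  # marginal distribution of Y
    for (x, y), p in joint.items():
        g[x] += p
        h[y] += p
    return dict(g), dict(h)


g, h = marginals(joint)
print(g[3], h[5])
```

In a probability table these two dictionaries correspond to the totals written in the right-hand column and the bottom row.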
Karl Pearson’s coefficient of correlation is a widely used mathematical method for calculating the degree and direction of the relationship between two linearly related variables. The coefficient of correlation is denoted by “r”.
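Direct method:
r = Σ(X − X̄)(Y − Ȳ) / [√Σ(X − X̄)² × √Σ(Y − Ȳ)²]
Shortcut method (dx and dy are the deviations of X and Y from their assumed means, N is the number of pairs):
r = [N Σdxdy − (Σdx)(Σdy)] / [√(N Σdx² − (Σdx)²) × √(N Σdy² − (Σdy)²)]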
The value of the coefficient of correlation (r) always lies between −1 and +1. In particular:
 r = +1, perfect positive correlation
 r = −1, perfect negative correlation
 r = 0, no correlation
Example 1 – Compute Pearson’s coefficient of correlation between advertisement cost and sales as per the data given below:
Advertisement cost  39  65  62  90  82  75  25  98  36  78 
Sales  47  53  58  86  62  68  60  91  51  84 
Solution
X̄ = 650/10 = 65
Ȳ = 660/10 = 66
X  Y  X − X̄  (X − X̄)²  Y − Ȳ  (Y − Ȳ)²  (X − X̄)(Y − Ȳ) 
39  47  −26  676  −19  361  494 
65  53  0  0  −13  169  0 
62  58  −3  9  −8  64  24 
90  86  25  625  20  400  500 
82  62  17  289  −4  16  −68 
75  68  10  100  2  4  20 
25  60  −40  1600  −6  36  240 
98  91  33  1089  25  625  825 
36  51  −29  841  −15  225  435 
78  84  13  169  18  324  234 
ΣX = 650  ΣY = 660  Σ(X − X̄)² = 5398  Σ(Y − Ȳ)² = 2224  Σ(X − X̄)(Y − Ȳ) = 2704 
r = 2704 / (√5398 × √2224) = 2704 / (73.47 × 47.16) = 0.78
Thus advertisement cost and sales are positively correlated.
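The direct method used in Example 1 can be checked with a short Python sketch. The function name `pearson_r` is chosen here for illustration; it computes the deviations from the actual means and applies the direct-method formula.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Karl Pearson's r by the direct method (deviations from the actual means)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    sum_dxdy = sum(a * b for a, b in zip(dx, dy))
    sum_dx2 = sum(a * a for a in dx)
    sum_dy2 = sum(b * b for b in dy)
    return sum_dxdy / (sqrt(sum_dx2) * sqrt(sum_dy2))


# Data of Example 1.
advertisement_cost = [39, 65, 62, 90, 82, 75, 25, 98, 36, 78]
sales = [47, 53, 58, 86, 62, 68, 60, 91, 51, 84]

r = pearson_r(advertisement_cost, sales)
print(round(r, 2))  # 0.78
```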
Example 2
Compute correlation coefficient from the following data
Hours of sleep (X)  Test scores (Y) 
8  81 
8  80 
6  75 
5  65 
7  91 
6  80 
X̄ = 40/6 = 6.7
Ȳ = 472/6 = 78.7
X  Y  X − X̄  (X − X̄)²  Y − Ȳ  (Y − Ȳ)²  (X − X̄)(Y − Ȳ) 
8  81  1.3  1.8  2.3  5.4  3.1 
8  80  1.3  1.8  1.3  1.8  1.8 
6  75  −0.7  0.4  −3.7  13.4  2.4 
5  65  −1.7  2.8  −13.7  186.8  22.8 
7  91  0.3  0.1  12.3  152.1  4.1 
6  80  −0.7  0.4  1.3  1.8  −0.9 
ΣX = 40  ΣY = 472  Σ(X − X̄)² = 7.3  Σ(Y − Ȳ)² = 361.3  Σ(X − X̄)(Y − Ȳ) = 33.3 
r = 33.3 / (√7.3 × √361.3) = 33.3 / (2.70 × 19.01) = 0.65
Thus hours of sleep and test scores are positively correlated.
Example 3
Calculate coefficient of correlation between X and Y series using Karl Pearson shortcut method
X  14  12  14  16  16  17  16  15 
Y  13  11  10  15  15  9  14  17 
Solution
Let assumed mean for X = 15, assumed mean for Y = 14
X  Y  dx  dx²  dy  dy²  dxdy 
14  13  −1  1  −1  1  1 
12  11  −3  9  −3  9  9 
14  10  −1  1  −4  16  4 
16  15  1  1  1  1  1 
16  15  1  1  1  1  1 
17  9  2  4  −5  25  −10 
16  14  1  1  0  0  0 
15  17  0  0  3  9  0 
ΣX = 120  ΣY = 104  Σdx = 0  Σdx² = 18  Σdy = −8  Σdy² = 62  Σdxdy = 6 
r = [8 × 6 − (0)(−8)] / [√(8 × 18 − (0)²) × √(8 × 62 − (−8)²)]
r = 48 / (√144 × √432) = 0.19
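The shortcut calculation of Example 3 can be sketched in Python as follows. The function name `pearson_r_shortcut` is illustrative; the deviations are taken from the assumed means 15 and 14 used in the example.

```python
from math import sqrt


def pearson_r_shortcut(xs, ys, assumed_x, assumed_y):
    """Karl Pearson's r by the shortcut method (deviations from assumed means)."""
    n = len(xs)
    dx = [x - assumed_x for x in xs]
    dy = [y - assumed_y for y in ys]
    sum_dx, sum_dy = sum(dx), sum(dy)
    sum_dx2 = sum(d * d for d in dx)
    sum_dy2 = sum(d * d for d in dy)
    sum_dxdy = sum(a * b for a, b in zip(dx, dy))
    numerator = n * sum_dxdy - sum_dx * sum_dy
    denominator = sqrt(n * sum_dx2 - sum_dx ** 2) * sqrt(n * sum_dy2 - sum_dy ** 2)
    return numerator / denominator


# Data of Example 3.
x = [14, 12, 14, 16, 16, 17, 16, 15]
y = [13, 11, 10, 15, 15, 9, 14, 17]

r = pearson_r_shortcut(x, y, assumed_x=15, assumed_y=14)
print(round(r, 2))  # 0.19
```

The result does not depend on which assumed means are chosen; a convenient choice only keeps the deviations small and the arithmetic easy.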
Example 4 – Calculate coefficient of correlation between X and Y series using Karl Pearson shortcut method
X  1800  1900  2000  2100  2200  2300  2400  2500  2600 
Y  5  5  6  9  7  8  6  8  9 
Solution
Let assumed mean for X = 2200, assumed mean for Y = 6
Here dx′ = dx/i with common factor i = 100.
X  Y  dx  dx′  dx′²  dy  dy²  dx′dy 
1800  5  −400  −4  16  −1  1  4 
1900  5  −300  −3  9  −1  1  3 
2000  6  −200  −2  4  0  0  0 
2100  9  −100  −1  1  3  9  −3 
2200  7  0  0  0  1  1  0 
2300  8  100  1  1  2  4  2 
2400  6  200  2  4  0  0  0 
2500  8  300  3  9  2  4  6 
2600  9  400  4  16  3  9  12 
Σdx′ = 0  Σdx′² = 60  Σdy = 9  Σdy² = 29  Σdx′dy = 24 
Note – dividing the deviations of X by the common factor 100 does not change the value of r.
r = [9 × 24 − (0)(9)] / [√(9 × 60 − (0)²) × √(9 × 29 − (9)²)]
r = 216 / (√540 × √180) = 0.69
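The note about dividing the X deviations by 100 can be verified numerically. The sketch below, using the data of Example 4 and an illustrative helper `r_from_deviations`, computes r once with the raw deviations and once with the step deviations; the two values should agree because r is unaffected by a change of origin or scale.

```python
from math import sqrt


def r_from_deviations(dx, dy):
    """Shortcut-method r computed from deviations taken about assumed means."""
    n = len(dx)
    numerator = n * sum(a * b for a, b in zip(dx, dy)) - sum(dx) * sum(dy)
    var_x = n * sum(a * a for a in dx) - sum(dx) ** 2
    var_y = n * sum(b * b for b in dy) - sum(dy) ** 2
    return numerator / sqrt(var_x * var_y)


# Data of Example 4 with assumed means 2200 and 6.
x = [1800, 1900, 2000, 2100, 2200, 2300, 2400, 2500, 2600]
y = [5, 5, 6, 9, 7, 8, 6, 8, 9]
dx = [v - 2200 for v in x]
dy = [v - 6 for v in y]
dx_step = [d / 100 for d in dx]  # step deviations with common factor 100

r_raw = r_from_deviations(dx, dy)
r_step = r_from_deviations(dx_step, dy)
print(round(r_raw, 2), round(r_step, 2))
```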
Example 5 – Compute Karl Pearson’s coefficient of correlation between X and Y from the following data:
X  28  45  40  38  35  33  40  32  36  33 
Y  23  34  33  34  30  26  28  31  36  35 
Solution
X̄ = 360/10 = 36
Ȳ = 310/10 = 31
X  Y  X − X̄  (X − X̄)²  Y − Ȳ  (Y − Ȳ)²  (X − X̄)(Y − Ȳ) 
28  23  −8  64  −8  64  64 
45  34  9  81  3  9  27 
40  33  4  16  2  4  8 
38  34  2  4  3  9  6 
35  30  −1  1  −1  1  1 
33  26  −3  9  −5  25  15 
40  28  4  16  −3  9  −12 
32  31  −4  16  0  0  0 
36  36  0  0  5  25  0 
33  35  −3  9  4  16  −12 
ΣX = 360  ΣY = 310  Σ(X − X̄) = 0  Σ(X − X̄)² = 216  Σ(Y − Ȳ) = 0  Σ(Y − Ȳ)² = 162  Σ(X − X̄)(Y − Ȳ) = 97 
r = 97 / (√216 × √162) = 97 / 187.06 = 0.52
The method of least squares is a mathematical method that fits a trend line to a set of data in such a manner that the following two conditions are satisfied.
 The sum of the deviations of the actual values of Y from the computed values of Y is zero.
 The sum of the squares of the deviations of the actual values from the computed values is the least.
This method gives the line of best fit. It can be used to fit either a straight-line trend or a parabolic trend.
The method of least squares, as studied in time series analysis, is used to find the trend line of best fit to time series data.
Secular Trend Line
The secular trend line (Y) is defined by the following equation:
Y = a + b X
Where, Y = predicted value of the dependent variable
a = Y-axis intercept, i.e. the height of the line above the origin (when X = 0, Y = a)
b = slope of the line (the rate of change in Y for a given change in X)
When b is positive the slope is upwards, when b is negative, the slope is downwards
X = independent variable (in this case it is time)
To estimate the constants a and b, the following two normal equations have to be solved simultaneously:
ΣY = na + bΣX
ΣXY = aΣX + bΣX²
To simplify the calculations, if the midpoint of the time series is taken as origin, then the negative values in the first half of the series balance out the positive values in the second half so that ΣX = 0. In this case, the above two normal equations reduce to:
ΣY = na
ΣXY = bΣX²
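A minimal sketch of the simplified case ΣX = 0 follows. The five yearly values are hypothetical figures invented for this illustration, and the function name `fit_trend_centered` is likewise an assumption; the sketch only handles an odd number of periods, where the middle period is the origin.

```python
def fit_trend_centered(values):
    """Fit Y = a + bX with X measured from the middle period, so that sum(X) = 0.

    With sum(X) = 0 the normal equations reduce to
    a = sum(Y) / n and b = sum(XY) / sum(X^2).
    """
    n = len(values)
    if n % 2 == 0:
        raise ValueError("this sketch handles an odd number of periods only")
    mid = n // 2
    xs = [i - mid for i in range(n)]  # e.g. -2, -1, 0, 1, 2 for five periods
    a = sum(values) / n
    b = sum(x * y for x, y in zip(xs, values)) / sum(x * x for x in xs)
    return a, b, xs


# Hypothetical yearly figures for five consecutive years (illustration only).
production = [12, 15, 17, 22, 24]

a, b, xs = fit_trend_centered(production)
trend = [a + b * x for x in xs]
residuals = [y - t for y, t in zip(production, trend)]
print(a, b)
print(sum(residuals))  # close to zero, up to floating-point rounding
```

The residual sum illustrates the first least-squares condition: the deviations of the actual values from the computed trend values add up to zero.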
Fitting the exponential curve y = ae^(bx) using logarithms
The equation is
y = ae^(bx).
Taking logarithms to the base e on both sides, we get
log y = log a + bx,
which can be written as Y = A + BX,
where Y = log y, A = log a, B = b and X = x.
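The transformation can be sketched in Python: fit a straight line to (x, log y) and then recover a from A = log a. The data below are generated from the assumed curve y = 2e^(0.5x) purely so that the recovered constants can be checked; the function names are illustrative.

```python
from math import exp, log


def fit_line(xs, ys):
    """Least-squares line Y = A + BX obtained from the two normal equations."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xx = sum(x * x for x in xs)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    slope = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)
    intercept = (sum_y - slope * sum_x) / n
    return intercept, slope


def fit_exponential(xs, ys):
    """Fit y = a * e^(b x) by fitting a line to (x, log y); requires all y > 0."""
    A, B = fit_line(xs, [log(y) for y in ys])
    return exp(A), B  # a = e^A, b = B


# Illustrative data generated from the assumed curve y = 2 e^(0.5 x).
xs = [0, 1, 2, 3, 4]
ys = [2 * exp(0.5 * x) for x in xs]

a, b = fit_exponential(xs, ys)
print(round(a, 4), round(b, 4))
```

Note that this minimises squared deviations in log y rather than in y itself, so on noisy data it can differ from a direct nonlinear fit.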
Q1. Fit the straight line to the following data.
x  1  2  3  4  5 
y  1  2  3  4  5 
For the straight line y = ax + b, the normal equations are:
Σy = aΣx + nb
and
Σxy = aΣx² + bΣx
Now,
x  y  x²  xy 
1  1  1  1 
2  2  4  4 
3  3  9  9 
4  4  16  16 
5  5  25  25 
Σx = 15  Σy = 15  Σx² = 55  Σxy = 55 
Substituting in the equations (n = 5),
15 = 15a + 5b and 55 = 55a + 15b
Solving these two equations,
we get a = 1 and b = 0.
Therefore the required straight-line equation is y = x.
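As a check on Q1, the two normal equations can be solved mechanically. The sketch below uses Cramer's rule on the 2 × 2 system; the function name `fit_straight_line` is an assumption made for this illustration, and a is the slope and b the intercept, as in the normal equations above.

```python
def fit_straight_line(xs, ys):
    """Solve  Σy = aΣx + nb  and  Σxy = aΣx² + bΣx  for the line y = ax + b."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_x2 = sum(x * x for x in xs)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    # Cramer's rule on the system [[Σx, n], [Σx², Σx]] · [a, b] = [Σy, Σxy].
    det = sum_x * sum_x - n * sum_x2
    if det == 0:
        raise ValueError("all x values are equal; the line is not determined")
    a = (sum_y * sum_x - n * sum_xy) / det
    b = (sum_x * sum_xy - sum_x2 * sum_y) / det
    return a, b


# Data of Q1.
x = [1, 2, 3, 4, 5]
y = [1, 2, 3, 4, 5]

a, b = fit_straight_line(x, y)
print(a, abs(b))  # slope 1.0 and intercept 0.0, i.e. y = x
```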
Q2. Fit a straight line to the following data.
x  75  80  93  65  87  71  98  68  84  77 
y  82  78  86  72  91  80  95  72  89  74 
First drawing the table,
x  y  x²  xy 
75  82  5625  6150 
80  78  6400  6240 
93  86  8649  7998 
65  72  4225  4680 
87  91  7569  7917 
71  80  5041  5680 
98  95  9604  9310 
68  72  4624  4896 
84  89  7056  7476 
77  74  5929  5698 
Σx = 798  Σy = 819  Σx² = 64722  Σxy = 66045 
For the straight line y = ax + b, the normal equations are:
Σy = aΣx + nb
and
Σxy = aΣx² + bΣx.
Substituting the values (n = 10), we get
819 = 798a + 10b
66045 = 64722a + 798b
Solving, we get
a = 0.6613 and b = 29.1290
Therefore, the straight-line equation is:
y = 0.6613x + 29.1290.
Q3. Fit a second-degree parabola to the following data.
x  1  2  3  4  5  6  7  8  9 
y  2  6  7  8  10  11  11  10  9 
Solution:
Here,
x  y  x²  x³  x⁴  xy  x²y 
1  2  1  1  1  2  2 
2  6  4  8  16  12  24 
3  7  9  27  81  21  63 
4  8  16  64  256  32  128 
5  10  25  125  625  50  250 
6  11  36  216  1296  66  396 
7  11  49  343  2401  77  539 
8  10  64  512  4096  80  640 
9  9  81  729  6561  81  729 
Σx = 45  Σy = 74  Σx² = 285  Σx³ = 2025  Σx⁴ = 15333  Σxy = 421  Σx²y = 2771 
For the parabola y = ax² + bx + c, the normal equations are:
Σy = aΣx² + bΣx + nc
Σxy = aΣx³ + bΣx² + cΣx
Σx²y = aΣx⁴ + bΣx³ + cΣx²
Substituting the values (n = 9), we get
74 = 285a + 45b + 9c
421 = 2025a + 285b + 45c
2771 = 15333a + 2025b + 285c
Solving them, we get a = −0.2673, b = 3.5232 and c = −0.9286, so the second-degree equation is
y = −0.2673x² + 3.5232x − 0.9286.
The coefficient of x² is negative because the values of y rise and then fall.
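The three normal equations of Q3 can be solved numerically as a cross-check. The sketch below builds the sums from the raw data and applies Cramer's rule to the 3 × 3 system; `det3` and `fit_parabola` are illustrative helper names, not library functions.

```python
def det3(m):
    """Determinant of a 3 x 3 matrix given as a list of three rows."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


def fit_parabola(xs, ys):
    """Least-squares fit of y = a x^2 + b x + c via the three normal equations."""
    n = len(xs)
    sx = sum(xs)
    sx2 = sum(x ** 2 for x in xs)
    sx3 = sum(x ** 3 for x in xs)
    sx4 = sum(x ** 4 for x in xs)
    sy = sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sx2y = sum(x * x * y for x, y in zip(xs, ys))
    # Coefficient matrix and right-hand side for the unknowns (a, b, c).
    matrix = [[sx2, sx, n], [sx3, sx2, sx], [sx4, sx3, sx2]]
    rhs = [sy, sxy, sx2y]
    d = det3(matrix)
    coeffs = []
    for k in range(3):
        # Cramer's rule: replace column k by the right-hand side.
        mk = [row[:] for row in matrix]
        for r in range(3):
            mk[r][k] = rhs[r]
        coeffs.append(det3(mk) / d)
    return tuple(coeffs)


# Data of Q3.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9]
y = [2, 6, 7, 8, 10, 11, 11, 10, 9]

a, b, c = fit_parabola(x, y)
print(round(a, 4), round(b, 4), round(c, 4))
```

The sign of a indicates the shape: a negative value means the fitted parabola opens downwards, which matches data that rise to a peak and then decline.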
Spearman’s Rank Correlation Coefficient – Spearman’s rank correlation coefficient is a nonparametric statistical measure used to study the strength of association between two ranked variables. This method is used for ordinal data, that is, values that can be arranged in order.
R = 1 − (6 ΣD²) / [N(N² − 1)]
Where, R = Rank coefficient of correlation
D = Difference of ranks
N = Number of Observations
Spearman’s rank correlation coefficient lies between −1 and +1.
 +1 indicates perfect positive association between the ranks
 0 indicates no association between the ranks
 −1 indicates perfect negative association between the ranks
When ranks are not given – rank the values by taking either the highest value or the lowest value as 1.
Equal ranks or tie in ranks – in this case ranks are assigned on an average basis. For example, if three students with the same score occupy the 5th, 6th and 7th positions, each of them is assigned the rank (5 + 6 + 7)/3 = 6.
If two individuals are ranked equal at the third position, then the rank of each is calculated as (3 + 4)/2 = 3.5.
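The averaging rule for tied ranks can be sketched as follows. The score lists are hypothetical and chosen only to reproduce the two situations described above (two values tied at third position, three values tied at 5th to 7th position); `average_ranks` is an illustrative name.

```python
def average_ranks(values, highest_first=True):
    """Rank the values, giving tied values the average of the positions they occupy."""
    ordered = sorted(values, reverse=highest_first)
    positions = {}
    for position, value in enumerate(ordered, start=1):
        positions.setdefault(value, []).append(position)
    return [sum(positions[v]) / len(positions[v]) for v in values]


# Hypothetical scores: two students tie for third position.
print(average_ranks([90, 85, 80, 80, 70]))

# Hypothetical scores: three students tie for 5th, 6th and 7th positions.
print(average_ranks([99, 98, 97, 96, 50, 50, 50, 40]))
```

Passing `highest_first=False` ranks the lowest value as 1 instead, which corresponds to the other ranking convention mentioned above.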
Example 1 –
Test 1  8  7  9  5  1 
Test 2  10  8  7  4  5 
Solution
Here, highest value is taken as 1
Test 1  Test 2  Rank T1  Rank T2  d  d² 
8  10  2  1  1  1 
7  8  3  2  1  1 
9  7  1  3  −2  4 
5  4  4  5  −1  1 
1  5  5  4  1  1 
Σd² = 8 
R = 1 − (6 × 8) / [5(5² − 1)] = 1 − 48/120 = 0.60
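The calculation in Example 1 can be reproduced with a short sketch. It ranks each series with the highest value as 1 and assumes there are no tied values; the function names are chosen for this illustration.

```python
def ranks_highest_first(values):
    """Rank 1 goes to the largest value; this sketch assumes no tied values."""
    ordered = sorted(values, reverse=True)
    return [ordered.index(v) + 1 for v in values]


def spearman_r(xs, ys):
    """Spearman's R = 1 - 6 * sum(d^2) / (n (n^2 - 1)) for untied data."""
    n = len(xs)
    rank_x = ranks_highest_first(xs)
    rank_y = ranks_highest_first(ys)
    sum_d2 = sum((rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))


# Data of Example 1.
test_1 = [8, 7, 9, 5, 1]
test_2 = [10, 8, 7, 4, 5]

r = spearman_r(test_1, test_2)
print(round(r, 2))  # 0.6
```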
Example 2 
Calculate the Spearman rank-order correlation
English  56  75  45  71  62  64  58  80  76  61 
Maths  66  70  40  60  65  56  59  77  67  63 
Solution
Rank by taking the highest value or the lowest value as 1.
Here, highest value is taken as 1
English  Maths  Rank (English)  Rank (Maths)  d  d² 
56  66  9  4  5  25 
75  70  3  2  1  1 
45  40  10  10  0  0 
71  60  4  7  −3  9 
62  65  6  5  1  1 
64  56  5  9  −4  16 
58  59  8  8  0  0 
80  77  1  1  0  0 
76  67  2  3  −1  1 
61  63  7  6  1  1 
Σd² = 54 
R = 1 − (6 × 54) / [10(10² − 1)] = 1 − 324/990 = 0.67
This indicates a fairly strong positive relationship between the ranks the individuals obtained in the Maths and English exams.
Example 3 –
Find Spearman's rank correlation coefficient between X and Y for this set of data:
X  13  20  22  18  19  11  10  15 
Y  17  19  23  16  20  10  11  18 
Solution
Here, the lowest value is taken as 1.
X  Y  Rank X  Rank Y  d  d² 
13  17  3  4  −1  1 
20  19  7  6  1  1 
22  23  8  8  0  0 
18  16  5  3  2  4 
19  20  6  7  −1  1 
11  10  2  1  1  1 
10  11  1  2  −1  1 
15  18  4  5  −1  1 
Σd² = 10 
R = 1 − (6 × 10) / [8(8² − 1)] = 1 − 60/504 = 0.88
Example 4 – Calculation of equal ranks or tie ranks
Find Spearman's rank correlation coefficient:
Commerce  15  20  28  12  40  60  20  80 
Science  40  30  50  30  20  10  30  60 
Solution
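Here, the lowest value is taken as 1 and tied values are given the average of their ranks.
C  S  Rank C  Rank S  d  d² 
15  40  2  6  −4  16 
20  30  3.5  4  −0.5  0.25 
28  50  5  7  −2  4 
12  30  1  4  −3  9 
40  20  6  2  4  16 
60  10  7  1  6  36 
20  30  3.5  4  −0.5  0.25 
80  60  8  8  0  0 
Σd² = 81.5 
R = 1 − (6 × 81.5) / [8(8² − 1)] = 1 − 489/504 = 0.03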
Example 5 –
X  10  15  11  14  16  20  10  8  7  9 
Y  16  16  24  18  22  24  14  10  12  14 
Solution
Here, the highest value is taken as 1 and tied values are given the average of their ranks.
X  Y  Rank X  Rank Y  d  d² 
10  16  6.5  5.5  1  1 
15  16  3  5.5  −2.5  6.25 
11  24  5  1.5  3.5  12.25 
14  18  4  4  0  0 
16  22  2  3  −1  1 
20  24  1  1.5  −0.5  0.25 
10  14  6.5  7.5  −1  1 
8  10  9  10  −1  1 
7  12  10  9  1  1 
9  14  8  7.5  0.5  0.25 
Σd² = 24 
R = 1 − (6 × 24) / [10(10² − 1)] = 1 − 144/990 = 0.85
The correlation between X and Y is positive and very high.