Degrees of Freedom, Actually Explained - The Geometry of Statistics | Ch. 1 (#SoME4)
Video ID: VDlnuO96p58
Transcript
When I first took statistics, I was
baffled by a concept called degrees of
freedom. This came up all the time in
probability distributions like the chi-square, t, and F distributions, and also
in just calculating the sample variance
where you divide by n minus one instead
of by n.
I wasn't the only one. None of the other
students got it, and I don't think the
teacher did either. This seems to be
true for everybody I've talked to. In
fact, it's almost a rite of passage to
make it through a statistics class
without understanding what degrees of
freedom is really all about.
Now, the way it's typically covered goes
something like this. Degrees of freedom
captures the amount of information that
is free to vary in some calculation. For
example, if you collect 10 data points,
that's 10 numbers that could all be well
different numbers. So, we have 10
degrees of freedom.
If you then use these data to calculate
some statistic like maybe the mean, then
you'll have used up a piece of
information and so have one fewer degree
of freedom. When you go to calculate
some other statistic that depends on the
mean like the variance, then you'll need
to take this into account and that's why
you divide by n minus one instead of n: here, 9 instead of 10. But this
explanation never felt rigorous or even
very intuitive to me. It just felt like
a lot of handwaving.
While most people might have just moved
on with their lives, this never stopped
bothering me. Then one day, I saw the
answer posted on Twitter.
"The rank of the projection matrix inside the quadratic form in the definition of a statistic." Gee, that's crystal clear,
don't you think?
Well, this answer raised far more questions than it resolved and sent me down some long and deep rabbit holes. The good news is
that this answer is actually really
quite satisfying.
Along the way, we'll learn a whole
framework for understanding
geometrically what's going on under the
hood when we do a lot of statistics.
Hence, while this series is about
degrees of freedom, maybe the
appropriate subtitle would be the
geometry of statistics.
In terms of background, I'm assuming
that you've taken at least an
introductory statistics course. And in
fact, I'm going to assume that you
absorbed that course pretty well and
that you're comfortable with concepts
like probability distributions and
expected value.
It would also be really great if you
have some linear algebra knowledge. If
you haven't studied that, there's an
excellent YouTube series entitled The
Essence of Linear Algebra by 3Blue1Brown, and watching that will give
you all the background you need for this
series. But that said, I'm going to do
my best to give brief recaps of the
linear algebra concepts you'll need to
know as they come up.
Speaking of which, for the next 2
minutes or so, I'm going to review
vectors and vector addition. Skip ahead
if you are very comfortable with these
concepts.
We'll think of vectors primarily as
arrows in space. A vector has a length
called its magnitude, and it also has a direction, basically what angle it's pointing.
But importantly, these are all it has.
If we move this vector elsewhere in
space, it's not a different vector. It's
still considered the same one because it
still has the same length and direction.
To represent vectors algebraically, we
can use their components in some
coordinate system. You get these by
moving the tail of the vector to the
origin and then reading off the
coordinates of where its tip lands. It's
customary to write them as a column vector, like this. And we can
assign this to a variable name if we
want like this. We can multiply vectors
by numbers which will change their
length but not their direction. So
multiplying by 1/2 for example will make
the vector half as long. We can multiply
by negative numbers too, which makes the
vector point the opposite way. Since
it's scaling the length of the vector,
this number we multiply by gets called a
scalar.
Finally, we can add and subtract
vectors. Geometrically, we can add two
vectors by moving the tail of the second
vector to the tip of the first and then
drawing the new vector that connects the
dots along both of them like this.
Subtraction just reverses that.
To get the components of the new vector
easily enough, all you have to do is add
the components of the two vectors. For
example, if I want to add this vector (2, 1) to this vector (-1, 1), we just add the components separately to get (1, 2).
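If you'd like to check these rules numerically, here's a minimal sketch in Python with NumPy (my own illustration, not something from the video):

```python
import numpy as np

v = np.array([2, 1])    # the vector (2, 1)
w = np.array([-1, 1])   # the vector (-1, 1)

# Scaling changes the length but not the direction
# (a negative scalar flips the vector around).
half_v = 0.5 * v        # -> [1.0, 0.5]
neg_v = -1 * v          # -> [-2, -1]

# Addition works component by component, matching the tip-to-tail picture.
s = v + w               # -> [1, 2]

print(half_v, neg_v, s)
```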
Okay, review over. Let's get into it.
Suppose we have a random variable X and
we measure some observations and we'll
call these X1, X2 and so on to XN. It
would be pretty standard to represent
this data as dots on a number line maybe
like this.
To start, let's narrow this down so that
we have just two data points.
With only two observations, we have
another option to visualize this which
is as some point in 2D space. We do that
by putting the different possible values
for x1 on the horizontal axis and the
different possible values for x2 on the
vertical axis. As an example, maybe when
we draw our random numbers, we end up
with two for x1 and one for x2, putting
us at this point here.
As you might have guessed, we're going
to represent this point as a vector. So
let's add an arrow and write the
components as this column vector.
Because X is a random variable, each
time we sample data points from X, we'll
get different values, which means we'll
get a different point, which means we'll
get a different vector. That's why we
can refer to this vector as a random
vector. For example, maybe on some
different samples, we get other points
like these.
When we talk about degrees of freedom,
what we're talking about is the number
of dimensions of the space that this
vector is free to land in, so to speak,
across these different samples.
For this vector here, although we could
take different samples to get different
vectors, if there are two observations,
then it will always live in this
two-dimensional space. Therefore, this
random vector has two degrees of
freedom. It's got two numbers that are
free to vary across different samples.
Where things get more interesting is
when we start decomposing this vector.
Take a look at this.
We're always free to both add a number
and subtract it since the end result is
just the same as where we started. So,
we're going to both add and subtract the
sample mean of our two observations,
which we'll call xbar.
Then, we can split these into two
different vectors.
The vector on the right just has the
sample mean for both components. So
let's call it the sample mean vector.
On the left is a vector of what we call
residuals. It's what's left over after
we subtract the sample mean from each
data point.
Finally, just to keep things straight,
I'll refer to the original vector as the
data vector.
To clean up a bit, we can factor the xbar
out of the mean vector. So we get the
mean of X times a vector of ones.
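As a quick numerical companion to this split (my own sketch, with made-up numbers rather than anything from the video), here the data vector (2, 1) is decomposed into a sample mean vector and a residual vector:

```python
import numpy as np

x = np.array([2.0, 1.0])           # data vector for two observations
xbar = x.mean()                    # sample mean = 1.5

mean_vec = xbar * np.ones_like(x)  # xbar times a vector of ones -> [1.5, 1.5]
residual_vec = x - mean_vec        # residuals -> [0.5, -0.5]

# The two pieces add back to the original data vector,
# and the residuals sum to zero.
print(mean_vec + residual_vec)     # [2. 1.]
print(residual_vec.sum())          # 0.0
```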
What we've done here is to decompose our
original random vector into two other
random vectors. Our original random
vector had two degrees of freedom
because it could land anywhere in the 2D
plane. But our new vectors here do not
have two degrees of freedom. In fact,
they have split the original two so that
each of these only has one degree of
freedom.
Why?
Let's go back to the plane.
If this is our random data vector X,
then what we've done is to split it into
two other vectors where this one is the
sample mean vector
and this one is the vector of residuals.
Recalling that adding vectors means
adding them tip to tail, we can recover
our original random vector as the sum of
these two.
First, let's look at the mean vector.
Although the mean could be any number
depending on our data, we saw that the
mean vector is a multiple of the (1, 1) vector. Therefore, no matter what data
points we encounter, the mean vector
must lie somewhere on this line. It is
not free to be anywhere in the plane.
We're now limited to this line. And a
line is only one-dimensional.
So that's why we say the sample mean
vector has only one degree of freedom.
As for the residual vector, it also is
constrained to only lie on a particular
line rather than pointing anywhere in
the plane. Why is that? It turns out
it's because the residuals always add to
zero no matter what the data points are.
Intuitively, because the mean lies in
the middle of the data, some of the
observations will be above it having a
positive residual and some will be below
the mean having a negative residual.
These cancel each other out.
If you'd like a more rigorous algebraic proof, it's on the screen now. This idea that residuals around the mean sum to zero is going to come back repeatedly during this video series, so it may be worth spending a minute to make sure you get it.
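The on-screen proof isn't spelled out in the transcript, but it amounts to the following (my reconstruction, using only the definition of the sample mean):

```latex
\sum_{i=1}^{n} (x_i - \bar{x})
  = \sum_{i=1}^{n} x_i - n\bar{x}
  = n\bar{x} - n\bar{x}
  = 0,
\qquad \text{since } \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i .
```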
Anyway, if the residuals always sum to zero, that means the residual vector
has to be at a point where its
components sum to zero. In 2D, that's
always going to be somewhere on this
line here, where the vertical coordinate
is the negative of the horizontal
coordinate. The possibilities here again
trace out a line which only has one
dimension. So, we say the residuals have
just one degree of freedom. In other
words, if we get told one of the
residuals, then we instantly know the
other since it must just be the negative
of the first one.
So these are the two subspaces we're
dealing with. By going some distance on
this one, we get the mean. And then by
adding in some distance in this residual
direction, we bring in the residuals to
get us to the actual data points. Even
if we draw different random numbers for
our random vector like maybe these ones
here, the mean and residual vector still
have to live somewhere on these
one-dimensional subspaces. And so we say
those vectors have one degree of
freedom.
One important thing to note is that
we're talking about the mean of the
observations,
usually called the sample mean, and
labeled xbar. And the degrees of freedom we just worked out do not apply to the mean of the distribution that the observations came from, usually called the expected value or population mean and labeled mu.
That's a bit perplexing at first because
you might think, can't we just do the
same decomposition we just did, but with
the population mean instead of the
sample mean and get the same answer?
Well, let's try it.
Starting with our random vector X, we
can make the same move of adding and
subtracting mu, then splitting the
vector in two.
To keep the naming clear, I'll call
these new vectors the expected value
vector and the error vector. And that's
because when you take a data point and
subtract off the population mean,
statisticians call that an error instead
of a residual.
But anyway, now it turns out that this
error vector has two degrees of freedom
while the expected value vector has
zero. How can that be?
It's because no matter what random
numbers we generate, if they all come
from the same distribution, then the
expected value vector always points to
the exact same spot. It can't land
elsewhere on the line across different
samples. It can only be on the same
single point right here.
Meanwhile, the error vector, which makes up the difference, is no longer confined to
a single line, but might point in any
direction in the plane, which you can
see more easily if we don't move it to
the origin.
Another
way to think about it is that unlike the
residuals we were dealing with earlier,
the errors do not have to add to zero
because the random numbers we happen to
draw might not be perfectly centered
around mu.
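To see the difference concretely, here is a small simulation (again my own sketch; the normal distribution, seed, and sample size are arbitrary choices, not from the video). Residuals around the sample mean always sum to zero, while errors around the population mean generally do not:

```python
import numpy as np

rng = np.random.default_rng(0)
mu = 5.0                                   # population mean (known here because we chose it)
x = rng.normal(loc=mu, scale=2.0, size=2)  # two observations

residuals = x - x.mean()   # deviations from the sample mean
errors = x - mu            # deviations from the population mean

print(residuals.sum())     # always 0 (up to floating-point rounding)
print(errors.sum())        # typically nonzero, and varies from sample to sample
```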
The next question to ask is how does
this scale up when we have more than two
data points? Let's add a third data
point bringing us to the third
dimension.
Now we'll have three observations of a
random variable and we use those three
numbers as the coordinates for a point
in 3D space. The arrow from the origin
to that point will be our random vector.
Because, with different random samples, this vector could land anywhere in this space, and this space now has three dimensions, we say this random vector has three degrees of freedom.
Using the same logic as before, we can
split our random vector into a sample
mean vector and a vector of residuals.
And as before, while our starting vector
has three degrees of freedom, these
component vectors will have fewer than
that. The mean vector will still have
just one degree of freedom. But now the
residual vector will have two degrees of
freedom.
On the graph, our random data vector now
looks like this.
And we again break that vector into the
tip-to-tail sum of the mean vector plus the residual vector.
Since the mean vector is just a multiple of the (1, 1, 1) vector, it must always
lie somewhere along this particular
line, no matter where our data vector
lands.
Here are a few examples.
Since it's limited to this
one-dimensional subspace, we say the
sample mean vector has one degree of
freedom.
For the residual vector, although the
vector now has three components since
we're in three dimensions, it still
obeys the constraint that all the
residuals must sum to zero. This is
again because the sample mean lies in
the middle of our observations. So the
negative residuals exactly cancel out
the positive residuals.
That means that the components of the
residual vector have to add to zero. And
that's not true for all points in 3D
space. It's only true for points that
lie inside this plane here. In standard
coordinate terms, you can think of this
as the plane given by x + y + z = 0.
So if we were told what two of the
residuals are, then we automatically
know what the third one is since it must
be whatever adds to zero.
Anyway, across different samples, the
residual vector might land anywhere
along this two-dimensional plane. So we
say it has two degrees of freedom.
Here are a few examples.
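And here is a quick check of that plane constraint with three observations (an illustrative sketch with made-up numbers, not taken from the video):

```python
import numpy as np

x = np.array([2.0, 1.0, 3.0])     # three observations
mean_vec = x.mean() * np.ones(3)  # [2., 2., 2.]
residual_vec = x - mean_vec       # [0., -1., 1.]

# The residual vector's components add to zero,
# so it lies in the plane x + y + z = 0.
print(residual_vec, residual_vec.sum())   # [ 0. -1.  1.]  0.0
```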
So that's our sample mean vector and the
residual vector. If we switch to the
expected value vector and the error
vector, then similarly to before, the
expected value is always in the same
spot across different samples and so has zero degrees of freedom, which means the
error vector could point in any
direction and so has three degrees of
freedom. That looks something like this.
And although we can't visualize it, the
same ideas carry over when we have more
observations.
The random vector showing all n data
points is free to point in any direction in n-dimensional space. So it has n degrees
of freedom.
The sample mean vector always has one
degree of freedom and so lies somewhere
on a line. And the residual vector
always has n minus one degrees of
freedom because of the fact that the
residuals are constrained to add to
zero.
That means if we are told what the first
n minus one residuals are, then we know
what the last one must be. And thus
there can only be n minus one
independent values. So the residual
vector must land somewhere in an n minus
one dimensional subspace.
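The same point in code (a sketch with an arbitrary sample of five numbers, chosen only for illustration): given the first n minus one residuals, the last one is forced by the sum-to-zero constraint.

```python
import numpy as np

x = np.array([4.0, 7.0, 1.0, 6.0, 2.0])   # n = 5 observations
residuals = x - x.mean()

# Knowing the first n - 1 residuals pins down the last one:
last_from_constraint = -residuals[:-1].sum()
print(residuals[-1], last_from_constraint)  # the two values agree
```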
Meanwhile, if we were to consider the
error vector and the expected value
vector, the expected value vector still
has zero degrees of freedom because it
always lands in exactly the same spot no
matter which random numbers we draw.
And so the error vector must have all n
degrees of freedom.
So now we've covered the basic idea of
degrees of freedom as capturing how many
different dimensions a random vector can
land in across different samples. But
remember the definition we're working towards: the rank of the projection matrix inside the quadratic form in the definition of a statistic. This
explanation is going to take a while and
so this is just the first video in a
series.
It will take until chapter 4 to fully
understand all that. Then afterwards,
we'll turn toward the applications of
degrees of freedom in classical
statistics. And finally, we'll look into
some extensions.
To be frank, the voyage we're about to
embark on is not a journey for the faint
of heart, but it offers great treasures
to those who persevere. My promise to
you is that if you follow along, you're
eventually going to really understand
what degrees of freedom is all about and
also where this N minus one stuff really
comes from. And lucky for you, that's
what we're going to turn to in the next
chapter on Bessel's correction. I'll see
you then.