
Statistical Decision Theory

In this post, we will develop a framework for tackling supervised learning problems. We will also see how this framework leads to one of the most common supervised learning algorithms: K-Nearest Neighbors (KNN).

A supervised learning problem involves predicting an output (y) given an input (x). Assuming we are given the joint probability distribution of x and y, our goal is to find an approximating function (f) that predicts the value of y given x.


In order to analyze the performance of a prediction function, we define a loss function (L). Typically, loss functions are real-valued functions bounded below by 0. For any two real numbers y and y' with $y \neq y'$, a loss function satisfies $L(y, y) \leq L(y, y')$: predicting the true value never incurs more loss than predicting a different value.
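For concreteness, here are a few loss functions commonly used in practice, all of which satisfy these properties (the squared loss is the one we will work with later in this post):
\begin{align} &L(y, y') = (y - y')^2 && \text{(squared loss)} \\ &L(y, y') = |y - y'| && \text{(absolute loss)} \\ &L(y, y') = I[y \neq y'] && \text{(0-1 loss, for classification)} \end{align}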


Our goal is to select the function f that minimizes the expected loss, i.e., $E[L(y, f(x))]$, under the joint distribution of x and y. Assuming the joint distribution is discrete, we get:

\begin{align} &E[L(y, f(x))] = \sum_x \sum_y L(y, f(x)) * P(x,y) \\ \implies &E[L(y, f(x))] = \sum_x \bigg( \sum_y L(y, f(x)) * P(y | x) \bigg) * P(x) \\ \implies &E[L(y, f(x))] = E_x\big[E_{y|x}[L(y, f(x))]\big] \\ \end{align}
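As a sanity check, here is a minimal Python sketch that computes the expected loss both directly and via the iterated expectation above. The toy joint distribution, the prediction function f, and the use of the squared loss are illustrative choices, not part of the derivation.

```python
# Sanity check of the decomposition above on a toy discrete joint distribution.
# The distribution P, the prediction function f, and the squared loss are
# illustrative choices, not part of the derivation.

# P(x, y): keys are (x, y) pairs, values are probabilities (they sum to 1).
P = {(0, 0): 0.2, (0, 1): 0.3, (1, 0): 0.1, (1, 1): 0.4}

def loss(y, y_hat):
    return (y - y_hat) ** 2              # squared loss (used later in the post)

f = {0: 0.6, 1: 0.8}                     # an arbitrary prediction function

# Direct expectation: sum over all (x, y) of L(y, f(x)) * P(x, y).
direct = sum(loss(y, f[x]) * p for (x, y), p in P.items())

# Iterated expectation: E_x[ E_{y|x}[ L(y, f(x)) ] ].
iterated = 0.0
for x in {xi for (xi, _) in P}:
    p_x = sum(p for (xi, _), p in P.items() if xi == x)        # P(x)
    inner = sum(loss(y, f[x]) * p / p_x                        # E_{y|x}[L(y, f(x))]
                for (xi, y), p in P.items() if xi == x)
    iterated += inner * p_x

print(direct, iterated)                  # both are approximately 0.2
```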
Let's first see how to choose a function $g(x)$ that minimizes $E[g(x)]$.
\begin{equation} E[g(x)] = \sum_x g(x) * P(x) \end{equation} Since $P(x) \geq 0$ for every x, minimizing $g(x)$ separately for each x minimizes $E[g(x)]$.

Similarly, to minimize $E_x\big[E_{y|x}[L(y, f(x))]\big]$ we minimize $E_{y|x}[L(y, f(x)) | x]$ for each x. Thus we reach a formal rule for selecting the prediction function f:
\begin{equation} f(x) = argmin_c \; E_{y|x}[L(y, c) | x] \end{equation}
where c ranges over candidate predictions at the point x.
Let's consider one of the most commonly used loss functions: the squared loss. We will derive the prediction function for this loss.

\begin{align} &\text{Loss Function : } L(y, f(x)) = (y - f(x))^2 \\\\ &f(x) = argmin_c \; E_{y|x}[(y - c)^2 | x] \\ \\ &\text{Differentiating w.r.t. c and setting the derivative to 0 :} \\ \\ &\sum_y 2 * (c - y) * P(y|x) = 0 \\ \implies &\sum_y c * P(y|x) = \sum_y y * P(y | x) \\ \\ &\because \text{c is constant w.r.t. y and } E[ y | x] = \sum_y y * P(y | x) \\ \implies &c * \sum_y P(y | x) = E[ y | x] \\ \because & \sum_y P(y | x) = 1 \\ \\ \implies & f(x) = c = E[ y | x] \\ \end{align}
Our result shows that the optimal prediction function f at every x is the conditional expectation of y given x.
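To make this concrete, here is a small Python sketch that grid-searches over candidate predictions and confirms the expected squared loss is minimized at the conditional mean. The conditional distribution P(y | x) used here is an arbitrary toy example.

```python
# A minimal sketch checking that the conditional mean minimizes the expected
# squared loss at a fixed x. The conditional distribution P(y | x) below is an
# arbitrary toy example, not part of the derivation.

P_y_given_x = {0: 0.1, 1: 0.3, 2: 0.6}       # P(y | x) at one fixed x

def expected_squared_loss(c):
    """E_{y|x}[(y - c)^2] under the toy conditional distribution."""
    return sum((y - c) ** 2 * p for y, p in P_y_given_x.items())

# Conditional expectation E[y | x].
cond_mean = sum(y * p for y, p in P_y_given_x.items())    # = 1.5

# Grid-search candidate predictions; the minimizer should match cond_mean.
candidates = [i / 100 for i in range(301)]                # 0.00, 0.01, ..., 3.00
best = min(candidates, key=expected_squared_loss)

print(cond_mean, best)                                    # 1.5 1.5
```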

In the framework we have just specified, the conditional distribution was assumed to be known. In real-life scenarios, however, we are only given a dataset and must approximate this distribution. The KNN algorithm uses the result derived above and approximates the conditional expectation from the data.

A Monte-Carlo approximation of the conditional expectation is as follows:
\begin{equation} E[y|x] \approx \frac{\sum_{i = 1}^{N} I[x_i == x] * y_i}{\sum_{i = 1}^{N}  I[x_i == x] }\\ \text{where N is the number of data cases and I is an indicator function} \end{equation} In practice, however, few (if any) training points satisfy $x_i == x$ exactly. The KNN algorithm therefore makes a further approximation: the K points nearest to a query point x are treated as if they were equal to x. Thus we get,

\begin{equation} E[y|x] \approx \frac{ \sum_{i = 1}^{N} I[x_i \in N_K(x)] * y_i } {\sum_{i = 1}^{N} I[x_i \in N_K(x)]} = \frac{1}{K} \sum_{x_i \in N_K(x)} y_i \end{equation}
where $N_K(x)$ denotes the set of the K training points nearest to x, so the denominator equals K.
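Putting it together, here is a minimal Python sketch of KNN regression based on this approximation: estimate $E[y|x]$ by averaging $y_i$ over the K training points closest to x. The tiny 1-D dataset and the choice K = 3 are illustrative.

```python
# A minimal sketch of K-Nearest Neighbor regression: estimate E[y | x] by
# averaging y_i over the k training points closest to the query point x.
# The tiny 1-D dataset and k = 3 are illustrative choices.

def knn_predict(x, xs, ys, k=3):
    """Approximate E[y | x] by the mean of y over the k nearest x_i."""
    # Sort training indices by distance to the query point x.
    order = sorted(range(len(xs)), key=lambda i: abs(xs[i] - x))
    neighbors = order[:k]                       # indices of the k nearest points
    return sum(ys[i] for i in neighbors) / k    # average of y over N_k(x)

# Toy 1-D training data (x_i, y_i); y is roughly 2x plus noise.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [0.1, 2.2, 3.9, 6.1, 8.0, 9.8]

print(knn_predict(2.5, xs, ys, k=3))   # averages y at x = 1, 2, 3 -> ~4.07
```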

