The Linear Model is one of the simplest models; we use this section as a starting point to introduce some common ideas
in Data Science.
Least Squares Problem
Suppose we have $n$ data points, $x_1,\cdots,x_n$, in $\mathbb{R}^p $. Denote by $$\mathbf{X} \in
\mathbb{R}^{n\times p }$$
the matrix with entries given by $$\mathbf{X}_{i,j}=(x_i)_j$$
For each point $x_i$ there is a corresponding output $y_i$; we will denote the output vector by
$$\mathbf{y} := \begin{pmatrix} y_1\\\vdots\\y_n \end{pmatrix}$$
Denote the design matrix
$$X:=[\mathbb{1}|\mathbf{X}] \in \mathbb{R}^{n\times (p+1) }$$
The goal of a least squares problem is to find the vector $b\in \mathbb{R}^{p+1}$
that minimizes the Sum of Squared Errors
$$\mathrm{SSE}(b):=\|X\cdot b - \mathbf{y}\|_2^2 $$
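As a quick illustration, here is a minimal NumPy sketch of this setup; the arrays `X_raw` and `y` below are purely hypothetical data chosen for the example.

```python
import numpy as np

# Hypothetical data: n = 5 points in R^p with p = 2 (values are illustrative).
X_raw = np.array([[0.0, 1.0],
                  [1.0, 0.5],
                  [2.0, 1.5],
                  [3.0, 2.0],
                  [4.0, 3.5]])             # the matrix X in R^{n x p}
y = np.array([1.0, 1.5, 3.0, 4.2, 6.1])    # the output vector y

# Design matrix: prepend a column of ones for the intercept term.
X = np.column_stack([np.ones(len(y)), X_raw])   # shape (n, p + 1)

def sse(b, X, y):
    """Sum of Squared Errors ||X b - y||_2^2."""
    r = X @ b - y
    return r @ r

print(sse(np.zeros(X.shape[1]), X, y))   # SSE at b = 0 equals ||y||_2^2
```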
Geometric Approach
Projection Matrix
We start with the geometric approach, the one that I find most natural. Recall that we have the following
minimization problem,
$$\min_{b} \|X\cdot b - \mathbf{y}\|_2^2$$
The object $$X\cdot b$$ lives in the column space of $X$; thus, to minimize the above squared distance,
we only need to require
$X\cdot b= \widehat{\mathbf{y}}$, where
$ \widehat{\mathbf{y}}$ is the projection of $\mathbf{y}$ onto the column space of $X$. In other words, recalling that
the projection can be carried out via the projection matrix $X( X^T X)^{-1}X^T$ (assuming $X^TX$ is invertible), we need to have
$$X\cdot b= \widehat{\mathbf{y}} =X( X^T X)^{-1}X^T \mathbf{y}$$
which implies
$$ b = ( X^T X)^{-1}X^T \mathbf{y}$$
Since $$\widehat{\mathbf{y}} = Xb =X( X^T X)^{-1}X^T \mathbf{y}, $$
the projection matrix
$$P = X( X^T X)^{-1}X^T$$
is also often called the hat matrix: it maps $\mathbf{y}$ to $\widehat{\mathbf{y}}$.
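These formulas are easy to check numerically. Below is a sketch using synthetic data (generated only for illustration); it solves the normal-equation form with `np.linalg.solve` rather than forming the inverse explicitly.

```python
import numpy as np

# Synthetic data, for illustration only.
rng = np.random.default_rng(0)
n, p = 50, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])          # design matrix
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.3, size=n)  # noisy outputs

# b = (X^T X)^{-1} X^T y, computed without an explicit inverse.
b_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Hat matrix P = X (X^T X)^{-1} X^T and the fitted values y_hat = P y.
P = X @ np.linalg.inv(X.T @ X) @ X.T
y_hat = P @ y

assert np.allclose(P @ P, P)          # P is a projection: P^2 = P
assert np.allclose(X @ b_hat, y_hat)  # X b equals the projection of y
```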
Correlations, R Squared and F-statistics
Correlations
Denote the $i$-th column of $X$ by $X_{i}$. The standard definition of correlation can be
rewritten as
$$
\begin{aligned}
\mathtt{Corr}\left(\mathbf{y}, X_{i}\right):&=\frac{\sum_j(\mathbf{y}_j-\overline{\mathbf{y}})
(X_{ji}-\overline{X_{i}})}{\sqrt{\sum_j(
X_{ji}-\overline{X_{i}})^2\sum_j(\mathbf{y}_j-\overline{\mathbf{y}})^2}}
\\ &=\frac{\left(\mathbf{y}-\overline{\mathbf{y}}\right)\cdot\left(
X_{i}-\overline{X_{i}}\right)}{\|X_i-\overline{X_{i}}\|\,\|\mathbf{y}-\overline{\mathbf{y}}\|}\\ &=\cos
\left(\angle\left(\mathbf{y}-\overline{\mathbf{y}},
X_{i}-\overline{X_{i}}\right)\right)
\end{aligned}
$$
where $\overline{\mathbf{y}}$ and $\overline{X_{i}}$ denote the corresponding means, subtracted entrywise.
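Here is a small sketch, with two illustrative vectors, confirming that the correlation is the cosine of the angle between the centered vectors.

```python
import numpy as np

def corr_as_cosine(u, v):
    """Correlation as the cosine of the angle between centered vectors."""
    u_c = u - u.mean()
    v_c = v - v.mean()
    return (u_c @ v_c) / (np.linalg.norm(u_c) * np.linalg.norm(v_c))

# Illustrative vectors playing the roles of y and a column X_i.
y  = np.array([1.0, 1.5, 3.0, 4.2, 6.1])
xi = np.array([0.0, 1.0, 2.0, 3.0, 4.0])

print(corr_as_cosine(y, xi))        # cosine form
print(np.corrcoef(y, xi)[0, 1])     # NumPy's built-in correlation, same value
```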
$R^2$
Similarly, writing $SST:=\|\mathbf{y}-\overline{\mathbf{y}}\|^2$, $SSE:=\|\mathbf{y}-\widehat{\mathbf{y}}\|^2$ and $SSR:=\|\widehat{\mathbf{y}}-\overline{\mathbf{y}}\|^2$, and using the decomposition
$$SST=SSE+SSR$$
we get
$$R^{2}:=\frac{SSR}{SST}=\cos ^{2}(\angle(\mathbf{y}-\overline{\mathbf{y}},
\widehat{\mathbf{y}}-\overline{\mathbf{y}}))$$
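The decomposition and the squared-cosine form of $R^2$ can be checked on the same kind of synthetic fit as before (data generated only for illustration):

```python
import numpy as np

# Synthetic fit, for illustration only.
rng = np.random.default_rng(0)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.3, size=n)
y_hat = X @ np.linalg.solve(X.T @ X, X.T @ y)      # fitted values

sst = np.sum((y - y.mean()) ** 2)                  # total sum of squares
ssr = np.sum((y_hat - y.mean()) ** 2)              # regression sum of squares
sse = np.sum((y - y_hat) ** 2)                     # residual sum of squares

cos_angle = ((y - y.mean()) @ (y_hat - y_hat.mean())) / (
    np.linalg.norm(y - y.mean()) * np.linalg.norm(y_hat - y_hat.mean()))

assert np.isclose(sst, sse + ssr)                  # SST = SSE + SSR
assert np.isclose(ssr / sst, cos_angle ** 2)       # R^2 = cos^2(angle)
```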
Alternatively, the minimizer can be found by calculus. The loss function can be expanded as
$$\mathrm{SSE}(b)=b^TX^TXb-2(Xb)^T\mathbf{y}+\mathbf{y}^T\mathbf{y} $$
Taking the gradient with respect to $b$ gives
$$\nabla \mathrm{SSE} = 2X^TX b - 2X^T \mathbf{y}$$
At a local minimum, we have
$$\nabla \mathrm{SSE} = 2X^TX b - 2X^T \mathbf{y}=0$$
which gives us the Normal Equation
$$X^TX b= X^T \mathbf{y}$$
When the inverse of $X^TX$ exists, we have, again, $$ b = ( X^T X)^{-1}X^T \mathbf{y}$$
For the ideas behind the Gradient Descent algorithm, see the page Optimization
Methods. In the example, we applied the Gradient Descent algorithm to minimize $\mathrm{SSE}(b)$.
The updating rule is given by
$$b_{j+1}=b_{j}-\alpha X^{T}(Xb_j-\mathbf{y})$$
where the factor of $2$ from the gradient has been absorbed into the learning rate $\alpha$.
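A sketch of this update rule on synthetic data (illustrative values; the learning rate $\alpha$ and the number of iterations are chosen ad hoc):

```python
import numpy as np

# Synthetic data, for illustration only.
rng = np.random.default_rng(0)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.3, size=n)

alpha = 0.005                     # learning rate (factor of 2 absorbed here)
b = np.zeros(X.shape[1])          # initial guess

for _ in range(5000):
    b = b - alpha * X.T @ (X @ b - y)     # b_{j+1} = b_j - alpha * X^T (X b_j - y)

print(b)
print(np.linalg.solve(X.T @ X, X.T @ y))  # closed-form solution, for comparison
```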
The Hessian of $\mathrm{SSE}$ is given by $$H_\mathrm{SSE}(b)=J_{\nabla \mathrm{SSE}}( b )=2X^TX$$
Thus, when $X$ has full column rank, $H_\mathrm{SSE} (b)$ is positive definite. Locally, by the Derivative
Test, any critical point of $\mathrm{SSE}(b)$ is a local minimum.
And globally, $H_\mathrm{SSE} (b)$ being positive definite for all $b$ implies that $\mathrm{SSE}(b)$ is convex, and
thus our local minimum is the unique global minimum.
Parametric Approach
Linear Model Assumptions
When one regards a Linear Regression problem as the problem of fitting the least squares line to the data
points, it is just a mathematical problem. But when one tries to do Statistical Inference, it becomes a
statistical problem, and assumptions are needed for meaningful interpretations.
A Linear Model in Regression is a model of the form $$\mathbf{y}=X b+\mathcal{E} $$
where the residual $\mathcal{E}$ is a random vector with distribution $\mathcal{E}\sim \mathcal{N}(0,\sigma^2 I)$.
Maximum Likelihood
Let's take a small quiz:
Let $\mathbf{x}$ be a Gaussian random variable with mean $1$, and suppose we have 4 realizations of
$\mathbf{x}$.
Which of the following choices is the more likely distribution for $\mathbf{x}$?
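To make the quiz concrete, here is a hedged sketch: given four hypothetical realizations, we compare the log-likelihood of the data under two candidate Gaussian densities, both with mean $1$ but with different spreads. All the numbers below are illustrative.

```python
import numpy as np

samples = np.array([0.8, 1.3, 0.9, 1.1])     # hypothetical realizations of x

def gaussian_logpdf(x, mean, sd):
    """Log-density of N(mean, sd^2) evaluated at x."""
    return -0.5 * np.log(2 * np.pi * sd**2) - (x - mean) ** 2 / (2 * sd**2)

# Two candidate distributions for x, both with mean 1.
for sd in (0.2, 2.0):
    log_lik = gaussian_logpdf(samples, 1.0, sd).sum()
    print(f"sd = {sd}: log-likelihood = {log_lik:.2f}")
# The candidate whose density concentrates near the observed samples
# attains the larger log-likelihood.
```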
The idea of Maximum Likelihood essentially formalizes the above example. Let $V$
be a random vector distributed according to the following density, a function of $v$,
$$ p(v|\theta)$$
Given a realization $v$ of $V$, the likelihood function is defined by the following, a function of
$\theta$,
$$L(\theta): = p(v|\theta)$$
And the maximum likelihood estimator for $\theta$ is given by
$$ \mathrm{argmax}\ L(\theta) $$
Since the $\ln$ function is monotonically increasing, and turns products into sums, it is often easier to work
with
the log-likelihood function
$$l(\theta)=\ln L(\theta)$$
And the corresponding optimization problem
$$ \mathrm{argmax}\ l(\theta) $$
In the linear regression model, it turns out that the MLE for $b$ does not depend on $\sigma$; thus, for
simplicity, we will assume $\sigma = 1$, so that
$$\mathbf{y}=X b+\mathcal{E}\sim \mathcal{N}(Xb,I) $$
Given $\mathbf{y}$, $X$ fixed, the Likelihood function for $b$ is given by
$$L(b)=(2 \pi)^{-\frac{n}{2}}
e^{-\frac{1}{2}(\mathbf{y} -Xb)^{T}(\mathbf{y} -Xb)}, $$
and the log-likelihood is given by $$l(b)=\text{constant}-\frac{1}{2}(\mathbf{y} -Xb)^{T}(\mathbf{y} -Xb)
$$
Note that from here we see that the maximization problem for MLE is equivalent to the minimization problem for
$\mathrm{SSE}(b)$.
Both the Derivative Test and Gradient Descent/Ascent can be used to find or estimate the maximum
likelihood estimator.
But let us use this opportunity to apply Newton's method.
Newton's Method
The idea behind Newton's Method is discussed on the page Optimization
Methods. We go straight to its application to maximizing the log-likelihood function.
To maximize $$l(b)=\text{constant}-\frac{1}{2}(\mathbf{y} -Xb)^{T}(\mathbf{y} -Xb) $$
We apply Newton's method to
$$f(b):=\nabla l(b) = -X^TX b + X^T \mathbf{y}$$
We have, the Jacobian of $f$ given by $$J_f(b) = H_l(b)=-X^TX$$
Note that $J_f(b)$ is negative definite (when $X$ has full column rank); thus locally, by the Derivative
Test, any critical point of $l(b)$ is a local maximum.
And globally, $J_f(b)$ being negative definite for all $b$ implies that $l(b)$ is concave, and thus our local
maximum is the unique global maximum.
Given any initial guess $b $, one step of Newton's Method gives
$$\begin{align}
\beta&=b-J_f(b)^{-1}f(b)\\
& =b+(X^TX)^{-1}(-X^TXb+X^T\mathbf{y})\\
&=(X^TX)^{-1}X^T\mathbf{y}
\end{align}$$
i.e. only one step is needed.
The application of Newton's method to Linear Regression requires only one step, since $f(b)$ is an affine map in
$b$, i.e. it is of the form $f(b) = Mb+v$.
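The one-step behavior is easy to verify numerically; here is a sketch on synthetic data (illustrative), starting from an arbitrary initial guess.

```python
import numpy as np

# Synthetic data, for illustration only.
rng = np.random.default_rng(0)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.3, size=n)

b0 = rng.normal(size=X.shape[1])            # arbitrary initial guess

f  = -X.T @ X @ b0 + X.T @ y                # f(b0) = gradient of l at b0
Jf = -X.T @ X                               # Jacobian of f (constant in b)
b1 = b0 - np.linalg.solve(Jf, f)            # one Newton step

b_closed = np.linalg.solve(X.T @ X, X.T @ y)
assert np.allclose(b1, b_closed)            # already at the MLE after one step
```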
A simple nontrivial application of Newton's Method can be found on the page Optimization
Methods.
Logistic Regression
A Linear Model assumes $$y\sim
\mathcal{N}(\mathbf{b}^T\mathbf{x},\sigma^2)$$ i.e. at any given point $\mathbf{x}$, it assumes that
$y= f(\mathbf{x})$ follows a Gaussian distribution; this prevents us from modeling the case where
$y=f(\mathbf{x})$ follows other distributions.
Consider, for example, a binary Classification problem with labels $0$ and $1$. At any point
$\mathbf{x}$, it is reasonable to assume that
$$f(\mathbf{x})\sim \mathtt{Bernoulli}(p(\mathbf{x}))$$
This leads us naturally to Logistic Regression.
When using linear regression for classification, we round the prediction to the nearest label.
In Logistic Regression, we assume $$y\sim
\mathtt{Bernoulli}(p_\theta{(\mathbf{x})})$$
Where by design $$p_\theta{(\mathbf{x})}= \frac{1}{1+\exp(-\theta\cdot
\mathbf{x})}=:\mathtt{sigmoid}(\theta\cdot \mathbf{x})$$
The goal of logistic regression is to estimate the probabilities $p_\theta(\mathbf{x})$.
One can check that the inverse of the sigmoid function is given by the logit function
$$\mathtt{logit}(p_\theta{(\mathbf{x})}):=\log{\left(\frac{p_\theta{(\mathbf{x})}}{1-p_\theta{(\mathbf{x})}}\right)}=\theta\cdot
\mathbf{x}$$
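A tiny sketch of the sigmoid and its inverse:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logit(p):
    return np.log(p / (1.0 - p))

z = np.linspace(-5.0, 5.0, 11)
assert np.allclose(logit(sigmoid(z)), z)   # the logit inverts the sigmoid
```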
Since the sigmoid function is not linear, it is more convenient to use MLE to estimate
$\theta$.
Maximizing the likelihood function is equivalent to minimizing the negative log-likelihood
$$\mathrm{NLL}(\theta)=-\sum_{i=1}^{n}\left[ y^{(i)} \log \left(p_\theta(\mathbf{x}^{(i)})\right)+\left(1-y^{(i)}\right) \log
\left(1-p_\theta(\mathbf{x}^{(i)})\right)\right]$$
The summands (up to the overall sign) are$$\begin{cases}\log \left(p_\theta(\mathbf{x}^{(i)})\right),\qquad &
y^{(i)}=1\\
\log \left(1-p_\theta(\mathbf{x}^{(i)})\right), & y^{(i)}=0 \end{cases}
$$
The above loss function is also known as the Binary Cross Entropy function.
Using Gradient Descent, we get the following updating rule $$\theta_{j+1}=\theta_{j}-\alpha
X^{T}(p_{\theta_j}(X)-\mathbf{y})$$
where $X$ is the design matrix and $p_{\theta}(X)$ denotes the vector with entries $p_{\theta}(\mathbf{x}^{(i)})$.
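A minimal sketch of this update rule on a toy one-dimensional classification data set (the arrays below are illustrative, and the learning rate and iteration count are ad hoc):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy 1-D data, for illustration only: label 1 tends to occur for larger x.
x = np.array([-2.0, -1.0, -0.5, 0.5, 1.0, 2.0])
y = np.array([0, 0, 0, 1, 1, 1])
X = np.column_stack([np.ones_like(x), x])    # design matrix with intercept

theta = np.zeros(2)
alpha = 0.1
for _ in range(2000):
    p = sigmoid(X @ theta)                   # predicted probabilities p_theta(x_i)
    theta = theta - alpha * X.T @ (p - y)    # gradient-descent update

print(theta)                                 # positive slope: P(y = 1) grows with x
print(sigmoid(X @ theta).round(2))           # fitted probabilities
```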
Generalized Linear Models
A Generalized Linear Model is of the form
$$\eta\left(\mathrm{E}\left(Y \mid \mathbf{x}\right)\right)= \theta \cdot \mathbf{x}',\qquad \mathbf{x}' = \begin{pmatrix}1\\
\mathbf{x} \end{pmatrix} $$
for some link function $\eta$, and we assume that
the entries of $Y$ are independently sampled from an Exponential Family.
Examples of GLM
The Linear Model is a GLM, where the link function is the identity function and $Y$ is from
the Gaussian Distribution (an exponential family).
The Logistic Regression Model is a GLM, where the link function is the logit function and $Y$
is from the Bernoulli Distribution (another exponential family).
If we let $y$ take on the following PMF
$$\mathtt{Pr}(y=k \mid\mathbf{x})= \begin{cases}\frac{\exp(\beta_k \cdot
\mathbf{x})}{1+\sum_{l=1}^{K-1}\exp(\beta_l\cdot \mathbf{x})} \quad & k = 1,2,\cdots,K-1 \\
\frac{1}{1+\sum_{l=1}^{K-1}\exp(\beta_l\cdot \mathbf{x})} & k=K \end{cases} $$
and $$\eta(p) = \ln\left(\frac{p}{\mathtt{Pr}(y=K\mid \mathbf{x})}\right) $$ we get Multinomial Logistic
Regression.
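Here is a sketch of this PMF, with class $K$ as the reference class; the coefficient vectors `betas` and the input `x` below are illustrative.

```python
import numpy as np

def multinomial_probs(betas, x):
    """Class probabilities with class K as the reference class.
    betas has shape (K - 1, d); returns an array of K probabilities."""
    scores = np.exp(betas @ x)                        # exp(beta_k . x), k = 1, ..., K-1
    denom = 1.0 + scores.sum()
    return np.append(scores / denom, 1.0 / denom)     # last entry is class K

betas = np.array([[0.5, -1.0],
                  [0.2,  0.3]])      # K - 1 = 2 coefficient vectors, so K = 3 classes
x = np.array([1.0, 2.0])             # input (can include an intercept entry)
p = multinomial_probs(betas, x)
print(p, p.sum())                     # the probabilities sum to 1
```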
If we let $Y\sim \mathtt{Poisson}(\lambda)$, and take $\eta = \log$, then we get the Poisson Regression
Model.
Mixed Effect Models
A Linear Mixed Effect Model is of the form $$\mathbf{y} = X\beta + U\gamma +
\mathbf{\epsilon}$$
where $X$ and $U$ are the design matrices for the fixed and random effects, $\beta$ is the vector of fixed effects, $\gamma$ is the vector of random effects (typically assumed to satisfy $\gamma\sim\mathcal{N}(0,G)$ for some covariance matrix $G$), and $\mathbf{\epsilon}$ is the noise vector.
If $U$ is the zero matrix, then the model reduces to the ordinary Linear Model.
Neural Networks
Logistic Regression and the Linear Model can each be viewed as a single-layer neural network: the former with a sigmoid
activation function and the Binary Cross Entropy loss, the latter with the identity activation function and the squared
error loss. We will discuss neural networks in more detail on a later page, Neural Networks.