Linear Models

Linear Model

Linear Model is one of the simplest model, we use this section as the starting point to introduce some common ideas in Data Science.

Least Square Problem

Suppose we have $n$ data points, $x_1,\cdots,x_n$, in $\mathbb{R}^p $, denote $$\mathbf{X} \in \mathbb{R}^{n\times p }$$ the matrix with entries given by $$\mathbf{X}_{i,j}=(x_i)_j$$ For each point, $x_i$, there is a corresponding output $y_i$, we will denote the output vector by $$\mathbf{y} := \begin{pmatrix} y_1\\\vdots\\y_n \end{pmatrix}$$ Denote the design matrix $$X:=[\mathbb{1}|\mathbf{X}] \in \mathbb{R}^{n\times (p+1) }$$ The goal of a least square problem is to find the vector $\mathbf{\beta}\in \mathbb{R}^{p+1}$ that minimizes the Sum of Squared Error $$\mathrm{SSE}(b):=\|X\cdot b - \mathbf{y}\|_2^2 $$

Geometric Approach

Projection Matrix

We start with the geometric way, the one that I found most natural. Recall that we have the following minimization problem, $$\min_{b} \|X\cdot b - \mathbf{y}\|_2^2$$ The object $$X\cdot b$$ lives in the column space of $X$, thus to minimizes the above squared distance, we only need to require $X\cdot b= \widehat{\mathbf{y}}$ where $ \widehat{\mathbf{y}}$ is the projection of $\mathbf{y}$ onto the column space of $X$. In other words, recall that the projection can be done via the projection matrix $X( X^T X)^{-1}X^T$, we need to have $$X\cdot b= \widehat{\mathbf{y}} =X( X^T X)^{-1}X^T \mathbf{y}$$ Which implies $$ b = ( X^T X)^{-1}X^T \mathbf{y}$$ Since $$\hat{\mathbf{y}} = Xb =X( X^T X)^{-1}X^T \mathbf{y} $$ The following projection matrix is also often called the hat matrix $$P = X( X^T X)^{-1}X^T$$

Correlations, R Squared and F-statistics

Correlations
Denoting the $i$-th column of $X$ by $X_i$. The standard definition of correlation can be rewritten as $$ \begin{aligned} \mathtt{Corr}\left(\mathbf{y}, X_{i}\right):&=\frac{\sum(\mathbf{y}_j-\overline{\mathbf{y}}) (X_{ji}-\overline{X_{i}})}{\sqrt{\sum( X_{ji}-\overline{X_{i}})^2\sum(\mathbf{y_j}-\overline{\mathbf{y}})^2}} \\ &=\frac{\left(\mathbf{y}-\overline{\mathbf{y}}\right)\cdot\left( X_{i}-\overline{X_{i}}\right)}{\|X_i\|\|\mathbf{y}\|}\\ &=\cos \left(\angle\left(\mathbf{y}-\overline{\mathbf{y}}, X_{i}-\overline{X_{i}}\right)\right) \end{aligned} $$
$R^2$
Similarly, using the decomposition $$SST=SSE+SSR$$ $$R^{2}:=\frac{SSR}{SST}=\cos ^{2}(\angle(\mathbf{y}-\overline{\mathbf{y}}, \widehat{\mathbf{y}}-\overline{\mathbf{y}}))$$
$F$ Statistics
The $F$-statistics for nested models $$ F=\frac{\frac{\mathrm{RSS}_{\text {Full }}-\mathrm{RSS}_{\text {Reduced }}}{\mathrm{df}_{\text {Reduced }}-\mathrm{df}_{\text {Full }}}}{\frac{\mathrm{RSS}_{\text {Full }}}{\mathrm{df}_{\text {Full }}}} \propto \cot ^{2}(\angle\left(\widehat{Y}_{\mathrm{F}}-\bar{Y}, \widehat{Y}_{\mathrm{R}}-\bar{Y}\right)) $$

Analytic Approaches

Normal Equation

The loss function can be expanded $$\mathrm{SSE}(b)=b^TX^TXb-2(Xb)^T\mathbf{y}+\mathbf{y}^T\mathbf{y} $$ Taking derivative (gradient) w.r.t $b$ gives $$\nabla \mathrm{SSE} = 2X^TX b - 2X^T \mathbf{y}$$ At the local minimals, we have $$\nabla \mathrm{SSE} = 2X^TX b - 2X^T \mathbf{y}=0$$ which gives us the, Normal Equation $$X^TX b= X^T \mathbf{y}$$ When the inverse of $X^TX$ exist, we have, again $$ b = ( X^T X)^{-1}X^T \mathbf{y}$$
$$ \begin{aligned} \nabla b^TX^TXb &= \left[\lim_{\epsilon \to 0} \frac{(b+\epsilon e_i)^TX^TX(b+\epsilon e_i)-b^TX^TXb}{\epsilon}\right] \\ &= \left[\lim_{\epsilon \to 0} \frac{b^TX^TXb+2\epsilon b^TX^TXe_i+\epsilon^2 e_i^TX^TXe_i-b^TX^TXb}{\epsilon} \right]\\ &=\left[ \lim_{\epsilon \to 0} \frac{2\epsilon b^TX^TXe_i+\epsilon^2 e_i^TX^TXe_i}{\epsilon}\right] \\ &= 2X^TXb \end{aligned} $$

Gradient Descend

For ideas behind the Gradient Descend Algorithm, see the page Optimization Methods. In the example, we applied the Gradient Descend algorithm to minimize $\mathrm{SSE}(b)$. The updating rule is given by $$b_{j+1}=b_{j}-\alpha X^{T}(Xb-Y)$$

The Hessian of $\mathrm{SSE}$ is given by $$H_\mathrm{SSE}(b)=J_{\nabla \mathrm{SSE}}( b )=2X^TX$$ Thus $H_\mathrm{SSE} (b)$ is positive definite, locally by the Derivative Test, any local optimal of $\mathrm{SSE}(b)$ will be a local minimum. And globally $H_\mathrm{SSE} (b) $ positive definite for all $b$ implies $\mathrm{SSE}(b)$ is convex , and thus our local minimum is the unique global minimum.

Parametric Approach

Linear Model Assumptions

When regard Linear Regression Problems as problems of fitting the least square line to the data points, it is just a mathematical problem. But when one tries to do Statistical Inference then it becomes a statistical problem. And assumptions are needed for meaningful interpretations.

A Linear Model in Regression is a model of the form $$\mathbf{y}=X b+\mathcal{E} $$ Where the residual is a random vector with distribution $\mathcal{E}\sim \mathcal{N}(0,\sigma I)$.

Maximum Likelihood

Let's take a small quiz:

Let $\mathbf{x}$ be a mean $1$ Gaussian Random variable, and we have 4 realizations of $\mathbf{x}$

Which of the following choice is a more likely distribution for $\mathbf{x}$ ?

The idea of maximum Likelihood essentially formalizes the above example. Let $V$ be a random vector distributed according to the following density, a function in $V$, $$ p(V|\theta)$$ Given a realization $v$ of $V$, the likelihood function is defined by the following, a function in $\theta$, $$L(\theta): = p(V|\theta)$$ And the maximum likelihood estimator for $\theta$ is given by $$ \mathrm{argmax}\ L(\theta) $$ Since the $\ln$ function is monotonically increasing, and turns products into sums, it is often easier to work with the log-likelihood function $$l(\theta)=\ln L(\theta)$$ And the corresponding optimization problem $$ \mathrm{argmax}\ l(\theta) $$

In the linear regression model, it turns out that the MLE for $b$ does not depend on $\sigma$, thus for simplicity we will assume $$\mathbf{y}=X b+\mathcal{E}\sim \mathcal{N}(Xb,I) $$ Given $\mathbf{y}$, $X$ fixed, the Likelihood function for $b$ is given by $$L(b)=(2 \pi)^{-\frac{n}{2}} e^{-\frac{1}{2}(\mathbf{y} -Xb)^{T}(\mathbf{y} -Xb)}, $$ and the log-likelihood is given by $$l(b)=\text{constant}-\frac{1}{2}(\mathbf{y} -Xb)^{T}(\mathbf{y} -Xb) $$ Note that from here we see that the maximization problem for MLE is equivalent to the minimization problem for $\mathrm{SSE}(b)$.

Both the Derivative Test, and Gradient Descend/Ascend can be used to find or estimate the maximum likelihood estimator. But let us use this opportunity to apply Newton's method.

Newton's Method

The idea behind Newton's Method is discussed on the page Optimization Methods. We go straight to its application to maximizing the log-likelihood function. To maximize $$l(b)=\text{constant}-\frac{1}{2}(\mathbf{y} -Xb)^{T}(\mathbf{y} -Xb) $$ We apply Newton's method to $$f(b):=\nabla l(b) = -X^TX b + X^T \mathbf{y}$$ We have, the Jacobian of $f$ given by $$J_f(b) = H_l(b)=-X^TX$$ Note that $J_f(b)$ is negative definite, thus locally by the Derivative Test, any local optimal of $l(b)$ will be a local maximum. And globally $J_f(b) $ negative definite for all $b$ implies $l(b)$ is concave , and thus our local maximum is the unique global maximum.

Given any initial guess $b $, one step of Newton's Method gives $$\begin{align} \beta&=b-J_f(b)^{-1}f(b)\\ & =b+(X^TX)^{-1}(-X^TXb+X^Ty)\\ &=(X^TX)^{-1}X^Ty \end{align}$$ i.e. only one step is needed. The application of Newton's method to Linear Regression requires only one step, since $f(b)$ is a affine map in $b$, i.e. it is of the form $f(b) = Mb+v$. A simple nontrivial application of Newton's Method can be found on the page Optimization Methods.

Logistic Regression

A Linear Model assumes $$y\sim \mathcal{N}(\mathbf{b}^T\mathbf{x},\sigma)$$ i.e. at any given point $\mathbf{x}$, it assumes $y= f(\mathbf{x})$ follows a Gaussian distribution, this excludes us from modeling the case where $y=f(\mathbf{x})$ follows other distributions. Consider for example, in a binary Classification problem with labels $0$ and $1$. Then at any point $\mathbf{x}$, it is reasonable to assume that $$f(\mathbf{x})\sim \mathtt{Bernoulli}(p(\mathbf{x}))$$ And this leads us naturally to the Logistic Regression.

In Logistic Regression, we assume $$y\sim \mathtt{Bernoulli}(p_\theta{(\mathbf{x})})$$ Where by design $$p_\theta{(\mathbf{x})}= \frac{1}{1+\exp(-\theta\cdot \mathbf{x})}=:\mathtt{sigmoid}(\theta\cdot \mathbf{x})$$ And the goal of logistic regression is try to estimate the probabilities $p_\theta(\mathbf{x})$.

One can check that the inverse of the sigmoid function is given by the logit function$$\mathtt{logit}(p_\theta{(\mathbf{x})}):=\log{(\frac{p_\theta{(\mathbf{x})}}{1-p_\theta{(\mathbf{x})}})}=\theta\cdot \mathbb{x}$$ Since the sigmoid function is not linear, it is more connivent to use MLE to estimate $\theta$. To maximize the likelihood function, is equivalent to minimize the negative of the log-likelihood function $$L(\theta)=-\sum_{i=1}^{n} y^{(i)} \log \left(p_\theta(\mathbf{x}^{(i)})\right)+\left(1-y^{(i)}\right) \log \left(1-p_\theta(\mathbf{x}^{(i)})\right)$$ The summands are$$\begin{cases}\log \left(p_\theta(\mathbf{x}^{(i)})\right),\qquad & y^{(i)}=1\\ \log \left(1-p_\theta(\mathbf{x}^{(i)})\right) & y^{(i)}=0 \end{cases} $$ The above loss function is also known as the Binary Cross Entropy function. Using Gradient Descend, we get the following updating rule $$\theta_{j+1}=\theta_{j}-\alpha X^{T}(p_\theta(X)-Y)$$ Where $X$ is the design matrix.

Generalized Linear Models

A Generalized Linear Model is of the form $$\eta\left(\mathrm{E}\left(Y \mid \bf x\right)\right)= \theta \cdot \bf x',\qquad \bf x' = \begin{pmatrix}1\\ \bf x \end{pmatrix} $$ for some link function $\eta$, and we assume that the entries of $ Y$ are independently sampled from a Exponential Family.

Examples of GLM

  1. The Linear Model is a GLM , where the link function is the identity function, and $Y$ is from the Gaussian Distribution (an exponential family).
  2. The Logistic Regression Model is a GLM, where the link function is the logit function, and $Y$ is from the Bernoulli Distribution (another example of exponential family).
  3. If we let $y$ takes on the following PMF $$\mathtt{Pr}(y=k \mid\mathbf{x})= \begin{cases}\frac{\exp(\beta_k \cdot \bf x)}{1+\sum_{l=1}^{K-1}\exp(\beta_l\cdot \mathbf{x})} \quad & k = 1,2,\cdots,K-1 \\ \frac{1}{1+\sum_{l=1}^{K-1}\exp(\beta_l\cdot \bf x)} & k=K \end{cases} $$ and $$\eta(p) = \ln(\frac{p}{\mathtt{Pr}(y=K| \mathbf{x})}) $$ we get the Multinomial Logistic Regression.
  4. If we let $Y\sim \mathtt{Poisson}(\lambda)$, and take $\eta = \log$, then we get the Poisson Regression Model.

Mixed Effect Models

A Linear Mixed Effect Model is of the form $$\mathbf{y} = X\beta + U\gamma + \mathbf{\epsilon}$$ Where If $U$ is the zero matrix, then the model is just a GLM.

Neural Networks

Logistic Regression or Linear Model can be viewed as a single layer neural network with sigmoid / identity activation function and Cross Entropy loss function. We will discuss the neural network in more detail in a later page Neural Networks.