Linear Models

Suppose we have $n$ data points $x_1,\cdots,x_n$ in $\mathbb{R}^p$. Denote by $$\mathbf{X} \in \mathbb{R}^{n\times p }$$ the matrix with entries $$\mathbf{X}_{i,j}=(x_i)_j$$ Each point $x_i$ has a corresponding output $y_i$, and we denote the output vector by $$\mathbf{y} := \begin{pmatrix} y_1\\\vdots\\y_n \end{pmatrix}$$ Appending a column of ones (for the intercept) gives the design matrix $$X:=[\mathbf{X}|\mathbb{1}] \in \mathbb{R}^{n\times (p+1) }$$ The goal of a least squares problem is to find the vector $\mathbf{\beta}\in \mathbb{R}^{p+1}$ that minimizes the Sum of Squared Errors $$\mathrm{SSE}(\mathbf{\beta}):=\|X \mathbf{\beta} - \mathbf{y}\|_2^2 $$

Geometric Approach

Projection Matrix

Correlations, R Squared and F-statistics

Correlations

Denote the $i$-th column of $X$ by $X_i$, and write $\overline{\mathbf{y}}$ and $\overline{X_i}$ for the means of the entries of $\mathbf{y}$ and $X_i$. Then $$ \begin{aligned} \mathtt{Corr}\left(\mathbf{y}, X_{i}\right):&=\frac{\sum_j(\mathbf{y}_j-\overline{\mathbf{y}}) (X_{ji}-\overline{X_{i}})}{\sqrt{\sum_j( X_{ji}-\overline{X_{i}})^2\sum_j(\mathbf{y}_j-\overline{\mathbf{y}})^2}} \\ &=\frac{\left(\mathbf{y}-\overline{\mathbf{y}}\right)\cdot\left( X_{i}-\overline{X_{i}}\right)}{\|X_i-\overline{X_{i}}\|\,\|\mathbf{y}-\overline{\mathbf{y}}\|}\\ &=\cos \left(\angle\left(\mathbf{y}-\overline{\mathbf{y}}, X_{i}-\overline{X_{i}}\right)\right) \end{aligned} $$
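The identity between correlation and the cosine of the angle between centered vectors can be checked numerically. The following is a minimal sketch on synthetic data; the helper name `centered_cosine` and the data are illustrative assumptions, not part of the original notes.

```python
import numpy as np

def centered_cosine(u, v):
    """Cosine of the angle between the mean-centered versions of u and v."""
    uc = u - u.mean()
    vc = v - v.mean()
    return float(uc @ vc / (np.linalg.norm(uc) * np.linalg.norm(vc)))

# Synthetic data, for illustration only
rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 2.0 * x + rng.normal(size=50)

corr_numpy = float(np.corrcoef(x, y)[0, 1])  # Pearson correlation computed by NumPy
corr_cosine = centered_cosine(x, y)          # cosine of the angle after centering
```

The two values should agree up to floating-point error, since both compute the same ratio.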

$R^2$

$$SST=SSE+SSR$$ $$R^{2}:=\frac{SSR}{SST}=\cos ^{2}(\angle(\mathbf{y}-\overline{\mathbf{y}}, \widehat{\mathbf{y}}-\overline{\mathbf{y}}))$$
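As a numerical sanity check, the sketch below fits a least squares model with an intercept on synthetic data and compares the two expressions for $R^2$. It assumes the usual definitions $SST=\|\mathbf{y}-\overline{\mathbf{y}}\|^2$, $SSE=\|\mathbf{y}-\widehat{\mathbf{y}}\|^2$ and $SSR=\|\widehat{\mathbf{y}}-\overline{\mathbf{y}}\|^2$; the data and variable names are illustrative.

```python
import numpy as np

# Synthetic data, for illustration only
rng = np.random.default_rng(1)
n, p = 100, 3
X_raw = rng.normal(size=(n, p))
y = X_raw @ np.array([1.0, -2.0, 0.5]) + 3.0 + rng.normal(size=n)

# Design matrix [X | 1], with the intercept column last
X = np.column_stack([X_raw, np.ones(n)])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta

sst = np.sum((y - y.mean()) ** 2)      # total sum of squares
sse = np.sum((y - y_hat) ** 2)         # sum of squared errors (residuals)
ssr = np.sum((y_hat - y.mean()) ** 2)  # sum of squares explained by the regression

r2_ratio = ssr / sst

# Squared cosine of the angle between the centered y and the centered fit
a, b = y - y.mean(), y_hat - y.mean()
r2_cos = (a @ b) ** 2 / ((a @ a) * (b @ b))
```

The decomposition $SST=SSE+SSR$ relies on the model containing an intercept, which is why the column of ones is included.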

$F$ Statistics

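The $F$-statistic for nested models is $$ F=\frac{\frac{\mathrm{RSS}_{\text {Reduced }}-\mathrm{RSS}_{\text {Full }}}{\mathrm{df}_{\text {Reduced }}-\mathrm{df}_{\text {Full }}}}{\frac{\mathrm{RSS}_{\text {Full }}}{\mathrm{df}_{\text {Full }}}} \propto \cot ^{2}(\angle\left(Y-\widehat{Y}_{\mathrm{R}}, \widehat{Y}_{\mathrm{F}}-\widehat{Y}_{\mathrm{R}}\right)) $$ Here $\mathrm{df}$ denotes the residual degrees of freedom (number of observations minus number of fitted parameters). The proportionality follows from the orthogonal decomposition $Y-\widehat{Y}_{\mathrm{R}}=(Y-\widehat{Y}_{\mathrm{F}})+(\widehat{Y}_{\mathrm{F}}-\widehat{Y}_{\mathrm{R}})$, which gives $\mathrm{RSS}_{\text{Reduced}}-\mathrm{RSS}_{\text{Full}}=\|\widehat{Y}_{\mathrm{F}}-\widehat{Y}_{\mathrm{R}}\|^2$ and $\mathrm{RSS}_{\text{Full}}=\|Y-\widehat{Y}_{\mathrm{F}}\|^2$.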
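A small numerical sketch of the nested-model $F$-statistic follows. It takes the numerator to be the drop in RSS from the reduced to the full model (so that $F\geq 0$), takes $\mathrm{df}$ to be the residual degrees of freedom, and uses the angle between the reduced-model residual $Y-\widehat{Y}_{\mathrm{R}}$ and the difference of fits $\widehat{Y}_{\mathrm{F}}-\widehat{Y}_{\mathrm{R}}$ for the geometric form. The data and the helper `rss_and_fit` are assumptions made for illustration.

```python
import numpy as np

def rss_and_fit(X, y):
    """Least squares fit; returns the residual sum of squares and the fitted values."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    y_hat = X @ beta
    return float(np.sum((y - y_hat) ** 2)), y_hat

# Synthetic data, for illustration only
rng = np.random.default_rng(2)
n = 80
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = 1.0 + 2.0 * x1 + 0.5 * x2 + rng.normal(size=n)

X_reduced = np.column_stack([x1, np.ones(n)])   # reduced model: x1 and intercept
X_full = np.column_stack([x1, x2, np.ones(n)])  # full model adds x2

rss_r, yhat_r = rss_and_fit(X_reduced, y)
rss_f, yhat_f = rss_and_fit(X_full, y)
df_r, df_f = n - X_reduced.shape[1], n - X_full.shape[1]  # residual degrees of freedom

F = ((rss_r - rss_f) / (df_r - df_f)) / (rss_f / df_f)

# Geometric form: cot^2 of the angle between (y - yhat_r) and (yhat_f - yhat_r)
u, v = y - yhat_r, yhat_f - yhat_r
cos2 = (u @ v) ** 2 / ((u @ u) * (v @ v))
cot2 = cos2 / (1.0 - cos2)
F_geometric = df_f / (df_r - df_f) * cot2
```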

Analytic Approaches

Normal Equation

The loss function can be expanded as $$\mathrm{SSE}(b)=b^TX^TXb-2(Xb)^T\mathbf{y}+\mathbf{y}^T\mathbf{y} $$ Taking the gradient with respect to $b$ gives $$\nabla \mathrm{SSE} = 2X^TX b - 2X^T \mathbf{y}$$ At a local minimum, we have $$\nabla \mathrm{SSE} = 2X^TX b - 2X^T \mathbf{y}=0$$ which gives the Normal Equation $$X^TX b= X^T \mathbf{y}$$ When the inverse of $X^TX$ exists, we have $$ b = ( X^T X)^{-1}X^T \mathbf{y}$$

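The gradient of the quadratic term can be computed entry by entry from the definition of the partial derivative. Here $e_i$ is the $i$-th standard basis vector and $[\,\cdot\,]_i$ denotes the vector whose $i$-th entry is the enclosed expression: $$ \begin{aligned} \nabla b^TX^TXb &= \left[\lim_{\epsilon \to 0} \frac{(b+\epsilon e_i)^TX^TX(b+\epsilon e_i)-b^TX^TXb}{\epsilon}\right]_i \\ &= \left[\lim_{\epsilon \to 0} \frac{b^TX^TXb+2\epsilon b^TX^TXe_i+\epsilon^2 e_i^TX^TXe_i-b^TX^TXb}{\epsilon} \right]_i\\ &=\left[ \lim_{\epsilon \to 0} \frac{2\epsilon b^TX^TXe_i+\epsilon^2 e_i^TX^TXe_i}{\epsilon}\right]_i \\ &= \left[2 b^TX^TXe_i\right]_i \\ &= 2X^TXb \end{aligned} $$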
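The Normal Equation can be solved directly with NumPy. The sketch below uses synthetic data and solves the linear system $X^TXb=X^T\mathbf{y}$ rather than forming the inverse explicitly, which is a common numerical choice and not something the derivation above requires. All names and data are illustrative.

```python
import numpy as np

# Synthetic data, for illustration only
rng = np.random.default_rng(3)
n, p = 200, 2
X_raw = rng.normal(size=(n, p))
y = X_raw @ np.array([1.5, -0.7]) + 4.0 + 0.1 * rng.normal(size=n)

# Design matrix [X | 1], with the intercept column last
X = np.column_stack([X_raw, np.ones(n)])

# Solve X^T X b = X^T y without forming (X^T X)^{-1}
b_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Reference solution from NumPy's least squares routine
b_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

# The gradient 2 X^T X b - 2 X^T y should vanish at the solution
grad_at_solution = 2 * X.T @ X @ b_normal - 2 * X.T @ y
```

The last entry of `b_normal` is the intercept, because the column of ones was appended last.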

Gradient Descent

For the ideas behind the Gradient Descent algorithm, see the page Optimization Methods. In the example, we apply Gradient Descent to minimize $\mathrm{SSE}(b)$. Since $\nabla \mathrm{SSE}(b) = 2X^T(Xb-\mathbf{y})$, absorbing the factor $2$ into the learning rate $\alpha$ gives the updating rule $$b_{j+1}=b_{j}-\alpha X^{T}(Xb_j-\mathbf{y})$$
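The updating rule can be sketched in a few lines of NumPy. The step size below is set to the reciprocal of the largest eigenvalue of $X^TX$, which is one safe choice for this quadratic loss and is an assumption of this sketch, as are the function name `gradient_descent` and the synthetic data.

```python
import numpy as np

def gradient_descent(X, y, alpha, steps):
    """Iterate b <- b - alpha * X^T (X b - y), starting from b = 0."""
    b = np.zeros(X.shape[1])
    for _ in range(steps):
        b = b - alpha * (X.T @ (X @ b - y))
    return b

# Synthetic data, for illustration only
rng = np.random.default_rng(4)
n, p = 100, 2
X_raw = rng.normal(size=(n, p))
y = X_raw @ np.array([2.0, -1.0]) + 0.5 + 0.1 * rng.normal(size=n)
X = np.column_stack([X_raw, np.ones(n)])

# Step size 1 / lambda_max(X^T X); the spectral norm of X is sqrt(lambda_max)
alpha = 1.0 / np.linalg.norm(X, 2) ** 2

b_gd = gradient_descent(X, y, alpha, steps=500)
b_exact, *_ = np.linalg.lstsq(X, y, rcond=None)
```

With a step size that is too large the iterates diverge, so in practice $\alpha$ is tuned or chosen from a bound like the one used here.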

Parametric Approach

Linear Model Assumptions

Maximum Likelihood

Let $\mathbf{x}$ be a Gaussian random variable with mean $1$, and suppose we have 4 realizations of $\mathbf{x}$.

Which of the following is the more likely distribution for $\mathbf{x}$?

$N(1,3)$ $N(1,10)$
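One way to compare the two candidates is to evaluate the Gaussian log-likelihood of the realizations under each. The sketch below uses four hypothetical sample values, since the actual realizations are not listed here, and it assumes that the second parameter of $N(\cdot,\cdot)$ is the variance.

```python
import math

def gaussian_log_likelihood(samples, mean, variance):
    """Log-likelihood of i.i.d. samples under a Gaussian with the given mean and variance."""
    n = len(samples)
    squared_error = sum((s - mean) ** 2 for s in samples)
    return -0.5 * n * math.log(2 * math.pi * variance) - squared_error / (2 * variance)

# Hypothetical realizations, chosen only to illustrate the computation
samples = [0.2, 1.5, 2.9, -0.6]

ll_var_3 = gaussian_log_likelihood(samples, mean=1.0, variance=3.0)
ll_var_10 = gaussian_log_likelihood(samples, mean=1.0, variance=10.0)

# The candidate with the larger log-likelihood is the more likely one for these samples
more_likely_variance = 3.0 if ll_var_3 > ll_var_10 else 10.0
```

Which candidate wins depends on how spread out the realizations are around the mean; widely spread samples favor the larger variance.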

Newton's Method

Regularizations

Ridge Regression & Lasso Regression

Lasso Regression uses an absolute-value (L1) penalty to promote coefficient sparsity: $$\text{Regularized Loss} = \text{Original Loss} + \lambda \sum_{i=1}^{p} |\beta_i|$$ Key properties:

Feature selection: Drives irrelevant coefficients to exactly zero.
Scale sensitivity: Requires feature standardization for proper shrinkage.
Sparse solutions: Ideal for "large p, small n" problems.

Best for:

Interpretable models with clear feature importance.
Scenarios needing automatic dimensionality reduction.

Ridge Regression adds a penalty proportional to the squares of the coefficients (L2) to the loss function: $$\text{Regularized Loss} = \text{Original Loss} + \lambda \sum_{i=1}^{p} \beta_i^2$$ Key properties:

Smooth shrinkage: Reduces coefficient magnitudes without eliminating features.
Multicollinearity handling: Distributes weights among correlated variables.
Closed-form solution: Computationally efficient with $$\hat{\beta} = (X^TX + \lambda I)^{-1}X^Ty$$

Best for:

Prediction-focused models with many small/moderate effects.
High-dimensional data with strong feature correlations.
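The closed-form Ridge solution can be sketched directly in NumPy. This sketch assumes the original loss is the sum of squared errors, uses no intercept column (so every coefficient is penalized), and runs on synthetic data; the function name `ridge_closed_form` is illustrative.

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Solve (X^T X + lam * I) beta = X^T y for the Ridge coefficients."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Synthetic data without an intercept, for illustration only
rng = np.random.default_rng(5)
n, p = 60, 4
X = rng.normal(size=(n, p))
y = X @ np.array([3.0, 0.0, -2.0, 1.0]) + rng.normal(size=n)

beta_ols = ridge_closed_form(X, y, lam=0.0)     # lam = 0 recovers ordinary least squares
beta_ridge = ridge_closed_form(X, y, lam=10.0)  # lam > 0 shrinks the coefficients
```

Increasing `lam` shrinks the coefficient vector toward zero in norm, but generally leaves every entry nonzero, in contrast to the Lasso.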

Elastic Net Regression

Elastic Net combines the L1 and L2 penalties through a convex combination: $$\text{Regularized Loss} = \text{Original Loss} + \lambda \left( \alpha \sum_{i=1}^{p} |\beta_i| + (1 - \alpha) \sum_{i=1}^{p} \beta_i^2 \right)$$ Key advantages:

Grouping effect: Selects correlated variables simultaneously.
Adaptive shrinkage: Balances sparsity and coefficient magnitude control.
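Behavior when $p > n$: Lasso selects at most $n$ variables in this regime, while Elastic Net can select more, which often helps when many correlated features are relevant.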

Implementation considerations:

$\alpha = 1$ gives Lasso, $\alpha = 0$ yields Ridge.
Requires tuning both $\lambda$ (strength) and $\alpha$ (mix ratio).
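The penalty term and its two limiting cases can be written as a short function. This is a minimal sketch of the formula above; the function name `elastic_net_penalty` is illustrative, and library implementations may scale the two terms differently.

```python
import numpy as np

def elastic_net_penalty(beta, lam, alpha):
    """lam * (alpha * sum|beta_i| + (1 - alpha) * sum beta_i^2)."""
    beta = np.asarray(beta, dtype=float)
    l1 = np.sum(np.abs(beta))  # Lasso part
    l2 = np.sum(beta ** 2)     # Ridge part
    return float(lam * (alpha * l1 + (1.0 - alpha) * l2))

beta = [1.0, -2.0, 0.5]

lasso_penalty = elastic_net_penalty(beta, lam=2.0, alpha=1.0)  # pure L1
ridge_penalty = elastic_net_penalty(beta, lam=2.0, alpha=0.0)  # pure L2
mixed_penalty = elastic_net_penalty(beta, lam=2.0, alpha=0.5)  # equal mix
```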

Logistic Regression

log-odds


When using linear regression for classification, we round the prediction to the nearest label.

In Logistic Regression, we assume that for some parameter $\theta$, $$y\sim \mathtt{Bernoulli}(p_\theta{(\mathbf{x})})$$ Recall that the odds of $y=1$ conditioned on $\mathbf{x}$ are given by $$\frac{\mathtt{Pr}(y=1\mid \mathbf{x})}{\mathtt{Pr}(y=0\mid \mathbf{x})}= \frac{p_\theta{(\mathbf{x})}}{1-p_\theta{(\mathbf{x})}}\geq 0$$ Applying the monotonic function $\log$ to the above, we get the log-odds of $y=1$, or equivalently the logit function of $p_\theta{(\mathbf{x})}$, i.e. $$\mathtt{logit}(p_\theta{(\mathbf{x})}):=\log\left(\frac{p_\theta{(\mathbf{x})}}{1-p_\theta{(\mathbf{x})}}\right)\in \mathbb{R}$$ A naive additional assumption is that the log-odds are a linear function of $\mathbf{x}$, i.e. $$\log\left(\frac{p_\theta{(\mathbf{x})}}{1-p_\theta{(\mathbf{x})}}\right)=\theta\cdot \mathbf{x}$$ Under these assumptions, we have $$p_\theta{(\mathbf{x})}= \frac{1}{1+\exp(-\theta\cdot \mathbf{x})}=:\mathtt{sigmoid}(\theta\cdot \mathbf{x})$$ where sigmoid is the inverse of the logit function.
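The claim that sigmoid inverts logit is easy to check numerically. A minimal sketch, with the two functions written out from the formulas above:

```python
import numpy as np

def logit(p):
    """Log-odds of a probability p in (0, 1)."""
    return np.log(p / (1.0 - p))

def sigmoid(z):
    """Inverse of logit: maps a real number to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

probabilities = np.array([0.1, 0.25, 0.5, 0.9])
log_odds = logit(probabilities)   # real-valued, zero at p = 0.5
recovered = sigmoid(log_odds)     # should reproduce the original probabilities
```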

The goal of logistic regression is to estimate the probabilities $p_\theta(\mathbf{x})$. Since the sigmoid function is not linear, it is more convenient to use MLE to estimate $\theta$. Maximizing the likelihood function is equivalent to minimizing the negative log-likelihood function $$L(\theta)=-\sum_{i=1}^{n} \left[ y^{(i)} \log \left(p_\theta(\mathbf{x}^{(i)})\right)+\left(1-y^{(i)}\right) \log \left(1-p_\theta(\mathbf{x}^{(i)})\right)\right]$$ The $i$-th summand is $$\begin{cases}\log \left(p_\theta(\mathbf{x}^{(i)})\right),\qquad & y^{(i)}=1\\ \log \left(1-p_\theta(\mathbf{x}^{(i)})\right) & y^{(i)}=0 \end{cases} $$ The above loss function is also known as the Binary Cross Entropy function. Using Gradient Descent, we get the following updating rule $$\theta_{j+1}=\theta_{j}-\alpha X^{T}(p_{\theta_j}(X)-Y)$$ where $X$ is the design matrix, $Y$ is the vector of labels, and $p_{\theta_j}(X)$ is the vector of predicted probabilities, one per row of $X$.
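The updating rule for logistic regression can be sketched as follows. The data are synthetic, the step size is a conservative choice assumed for this sketch, and the small constant `eps` inside the logarithms is only a numerical safeguard; none of these come from the text above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce_loss(theta, X, y, eps=1e-12):
    """Negative log-likelihood (binary cross entropy) summed over the data."""
    p = sigmoid(X @ theta)
    return float(-np.sum(y * np.log(p + eps) + (1.0 - y) * np.log(1.0 - p + eps)))

def fit_logistic(X, y, alpha, steps):
    """Iterate theta <- theta - alpha * X^T (sigmoid(X theta) - y), starting from 0."""
    theta = np.zeros(X.shape[1])
    for _ in range(steps):
        theta = theta - alpha * (X.T @ (sigmoid(X @ theta) - y))
    return theta

# Synthetic data, for illustration only: one feature plus an intercept column
rng = np.random.default_rng(6)
n = 200
x = rng.normal(size=n)
X = np.column_stack([x, np.ones(n)])
true_theta = np.array([2.0, -1.0])
y = (rng.random(n) < sigmoid(X @ true_theta)).astype(float)

alpha = 1.0 / np.linalg.norm(X, 2) ** 2  # conservative step size (assumption)
theta_hat = fit_logistic(X, y, alpha, steps=2000)

initial_loss = bce_loss(np.zeros(2), X, y)
final_loss = bce_loss(theta_hat, X, y)
```

Predicted probabilities for new inputs are then `sigmoid(X_new @ theta_hat)`, and thresholding them at 0.5 gives class labels.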

Model Coefficients

Generalized Linear Models

A Generalized Linear Model is of the form $$\eta\left(\mathrm{E}\left(Y \mid \bf x\right)\right)= \theta \cdot \bf x',\qquad \bf x' = \begin{pmatrix}1\\ \bf x \end{pmatrix} $$ for some link function $\eta$, where the entries of $Y$ are assumed to be sampled independently from a distribution in an exponential family.

Examples of GLM

Mixed Effect Models

A Linear Mixed Effect Model is of the form $$\mathbf{y} = X\beta + U\gamma + \mathbf{\epsilon}$$ Where

$X,U$ are fixed design matrices
$\beta$ is an unknown fixed vector
$\gamma\sim \mathscr{N}{(0,G)}$, $\epsilon\sim \mathscr{N}{(0,R)}$

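If $U$ is the zero matrix, the random-effects term vanishes and the model reduces to an ordinary linear model with Gaussian errors, which is a special case of a GLM (identity link) when the errors are independent with common variance.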


import statsmodels.api as sm
import statsmodels.formula.api as smf

# Example data
data = sm.datasets.get_rdataset("dietox", "geepack").data

# Fit model
model = smf.mixedlm("Weight ~ Time", data, groups=data["Pig"])
results = model.fit()

# Predict fixed-effects only
fixed_predictions = results.predict(data)

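# Predict with random effects.
# results.random_effects maps each group label to a pandas Series of its
# random effects; with a random intercept only, that Series has one entry.
random_intercepts = {
    group: effects.iloc[0] for group, effects in results.random_effects.items()
}
group_predictions = fixed_predictions + data["Pig"].map(random_intercepts)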