Gradient Descend

Gradient Descend

The idea behind Gradient Descend is simple, say we have a single valued function $$f:\mathbb{R}^n\to \mathbb{R}$$ and we are standing at a point $x $ in the domain $\mathbb{R}^n$, and we are about to take a step in the direction of some unit vector $v$ with step size $\delta$.

When we want to maximize our output, we go in the direction of $v=\nabla f(x)$.

And similarly, if we want to minimize our output, we go in the opposite direction of the gradient, $-\nabla f(x)$. For the mathematical justification, see my notes at 🖇️.
  1. Start with some initial point $x_0$
  2. At step $i$ : set $$ x_i = x_{i-1} - \alpha\cdot \nabla f(x_{i-1}) $$
  3. Repeat the previous part, until $\alpha \cdot \nabla f(x_{i-1})$ is sufficiently small.

Stochastic Gradient Descend

Optimization problems are often of the form $$\text{min}_\beta \sum_{x\in \mathscr{D}} f_i(x) $$ where the summation is typically done over a set of training data, $\beta$ is the tunnable vector of parameters. When the training set is huge, every step in the algorithm can be expensive. To make the computational cost more reasonable, Stochastic Gradient Descend only perform propagations and updates using a random subset $S\subset \mathscr{D}$ at each training step.

Proximal Gradient Descend

If $$f: \underbrace{\ms{H}}_{\text{Hilbert Space}} \to [-\infty,\infty]$$ the proximal operator associated with $f$ is defined as $$\text{prox}_{\lambda f}(x) =\argmin{y\in \ms{H}} \left\{ f(y) + \lambda \|x-y\|^2 \right\}$$ When $$ f(x) = \begin{cases} 0 & \text{if } x\in \mathscr{D} \\ \infty & \text{otherwise} \end{cases} $$ the proximal operator is the projection onto $\mathscr{D}$.
  1. Start with some initial point $x_0$
  2. At step $i$ : set $$ x_i = \text{prox}_{\lambda f}(x_{i-1} - \alpha\cdot \nabla f(x_{i-1})) $$
  3. Repeat the previous part, until termination conditions are met.

Analytical Methods

Derivative Test

When $f$ can be restricted to a bounded domain or $f$ is concave / convex, global optimum can be obtained via derivative tests.
Let $f(x):\mathbb{R}^n\to\mathbb{R} $, if $f\in C^2$, let $H_f$ be the Hessian matrix of $f$. Suppose $$\nabla f(x_0) =\mathbf{0}$$
  • if $H_f(x_0)$ is positive definite , then $x_0$ is a local minimum,
  • if $H_f(x_0)$ is negative definite, then $x$ is a local maximum,
  • otherwise, $x_0$ is point is saddle point.
See my notes at 🖇️.
  • If $f$ is convex, then $f$ has a unique global maximum.
  • If $f$ is concave, then $f$ has a unique global minimum.
For definitions and disscussion, see my notes at 🖇️.

Newton's Method

  • $f$ is colored blue
  • $f'$ is colored red
To find points critical points $x_0$, where $$\nabla f(x_0) =\mathbf{0}$$ one can use the Newton's Method. Let $g(\mathbf{x}):\mathbb{R}^n\to \mathbb{R}^m$. Newton's Method is an algorithm for finding a root of $g$ (if it exist). We go over two different views of the Newton's Method.

Linear Approximation

To derive Newton's Method, we start with the first-order linear approximation, $$\vec{0}\approx J_g(x_0)(x-x_0)+g(x_0)$$ where, $x$ is a root that we are seeking, $x_0$ is a point near the root, $J$ is the Jacobian $$J_g(x_0)=\left[\begin{array}{ccc} \frac{\partial g_{1}}{\partial x_{1}} & \cdots & \frac{\partial g_{1}}{\partial x_{n}} \\ \vdots & \ddots & \vdots \\ \frac{\partial g_{m}}{\partial x_{1}} & \cdots & \frac{\partial g_{m}}{\partial x_{n}} \end{array}\right]$$ The linear approximation implies $$x \approx x_0-J^{-1}_g(x_0)g(x_0)$$ And this gives us the Newton's Method
  1. Start with an initial guess $x_0$
  2. Repeat until convergence $$x_i = x_{i-1}- J^{-1}_g(x_{i-1})g(x_{i-1})$$
If $g(\mathbf{x}):\mathbb{R}^n\to \mathbb{R}^n$ is affine, i.e. it is of the form $$g(x)=Mx+b$$ then the first-order linear approximation will be exact, and Newton's method requires only one step to get the root $$x_1 = x_0 - M^{-1} (Mx_0+b) = -M^{-1}b $$

Fixed Point Equation

A fixed point equation is an equation of the form $$x=f(x)$$ when $f(x)$ is a contraction, a solution to the equation can be found efficiently through a iterative procedure
Let $(X, d)$ be a non-empty complete metric space with a contraction mapping $$T: X \rightarrow X$$ Then $T$ admits a unique fixed-point $x^{*}$ in $X$ (i.e. $T\left(x^{*}\right)=x^{*}$ ). Furthermore, $x^{*}$ can be found as follows:
  1. start with an arbitrary element $x_{0} \in X$ and define a sequence $\left(x_{n}\right)_{n \in \mathbb{N}}$ by $$x_{n}=T\left(x_{n-1}\right)$$ for $n \geq 1$.
  2. Then $$\lim _{n \rightarrow \infty} x_{n}=x^{*}$$
Newton's Method is a special case of Fixed Point Iteration where we have the following fixed point equation $$g(x)=x-J^{-1}_g(x)g(x)$$

Derivative-Free Methods

Random Search

  1. Initialize $x$ with a random position
  2. Until termination condition is met
    1. Generate a random direction $d$
    2. Move $x$ in the direction $d$ by a random step size $\alpha$
    3. If $f(x+\alpha d) < f(x)$, then set $x=x+\alpha d$

Genetic Algorithm

  1. Initialize a population of $N$ individuals $$P_0 = {x_0^1,\cdots,x_0^N}$$ With fitness $$\{f(x^i)\}$$
  2. Until termination condition is met
    1. Select $x_k^i,x_k^j\in P_k$, create an offspring $$d = D(x_k^i,x_k^j)$$
    2. With probability $p$, mutate $d$, $$d = M(d)$$
    3. add $d$ to the population $$P_{k+1} = P_k\cup\{d\}$$


Given a function $f:\mathbb{R}^n\to \mathbb{R}$,
  1. Initialize a simplex $S_0$ with $n+1$ vertices $$S_0 = \{x_0,\cdots,x_{n}\}$$ $\delta_{e}\in (1,\infty)$ $\delta_{\text{c}}\in (0,1) $ $\gamma\in (0,1) $
  2. Order $S_k$ by $f(x_0)\leq \cdots \leq f(x_{n})$ and set $f_{best}^k = f(x_0)$
  3. Set $$x_{\text{o}} = \frac{1}{n}\sum_{i=0}^{n-1} x_i$$ $$x_{\text{r}} = x_{\text{o}} + (x_{\text{o}}-x_n)$$ $$ f_{\text{r}} = f(x_{\text{r}})$$
    • Expansion : If $f_{\text{r}} < f_{\text{best}}^k$, then $$x_{\text{e}}=x_{\text{o}} + \delta_{e}(x_\text{r}-x_{\text{o}})$$ $$ f_{\text{e}}=f(x_{\text{e}})$$ If $f_{\text{e}} < f_{\text{r}}$, then $$x_n=x_{\text{e}}$$ else $$x_n=x_{\text{r}}$$
    • Reflection : If $f_{\text{best}}^k \le f_{\text{r}} \le f(x_{n-1}) $, then $$x_n = x_{\text{r}}$$
    • Contraction :
      • If $f(x_{n-1}) \le f_{\text{r}} \le f(x_n)$, then $$x_{\text{c}} = x_{\text{o}} + \delta_{\text{c}}(x_{\text{r}}-x_{\text{o}} )$$ $$ f_{\text{c}} = f(x_{\text{c}})$$ If $f_{\text{c}} < f_{\text{r}} $, then $$x_n=x_{\text{c}}$$ else go to step 4
      • If $ f(x_n)\le f({\text{r}}) $, then $$x_{\text{c}} = x_{\text{o}} + \delta_{\text{c}}(x_n-x_{\text{o}})$$ $$ f_{\text{c}} = f(x_{\text{c}})$$ If $f_{\text{c}} < f(x_n)$, then $$x_n=x_{\text{c}}$$ else go to step 4
    • Increase k by 1 and go to step 2
  4. Shrink Set $S_{k+1}= \{x_0, x_0+\gamma(x_1-x_0),\cdots,x_0+\gamma(x_n-x_0)\}$ and go to step 2
Consider the simple example $$f(x)=x^2$$ with $$\delta_{e}=1.5$$
  1. Case 1 : If we start with the initial simplex (ordered) $$S_0 = \{(-4),(-5) \}$$ Then $$f^0_{\text{best}} = f(-4) = 16$$ $$x_{\text{o}} = \frac{1}{2}(-5-4) = -4.5$$ $$x_{\text{r}} = -4.5 + (-4.5+5) = -3.5$$ $$f_{\text{r}} = f(-3.5) = 12.25$$ Since $f_{\text{r}} < f_{\text{best}}^0 $ $$x_{\text{e}}=-4.5 + 1.5(-4.5+5)=-2.25$$ $$f_{\text{e}}=f(-2.25)=5.0625 < f_{\text{r}} $$ So we set $$x_1=x_{\text{e}}=-2.25$$