Discussion 4

Grading policy for Module 1 (coding portion)

Loss function:

loss <- function(t, x) {
  if (t < x) {
    return((t-x)^2)
  } 
  if(t <= x+10)
    return(1000)
  if(t <= x+20)
    return(5000)
  if(t <= x+30)
    return(10000)
  if(t <= x+40)
    return(20000)
  if(t <= x+50)
    return(40000)
  return((x/2)^2)
}

How did we generate the 9-dim vector in the report:

The first three elements indicates whether your code throws error for the three dataset we use. If not, you get 1. Otherwise, you get 0. If your code doesn’t throw error for any of them, at least you will get 2 points for the coding portion.

The fourth through sixth elements are generated by

loss(your_function(y_t,t,maxcap), trueTime)

The last three elements represents the running time of your code, generated by the following code:

codetime = rep(0,10)
for(i in 1:10) {
  start=Sys.time()
  output = your_function(y_t,t,maxcap) 
  end=Sys.time()
  codetime[i] = as.numeric(end-start)
}

You will get more points for lower loss ($\le 10000$), quick running time (finishes within $10^{-3}s$) and good-looking scalability plot.

Also, the score for the coding portion only reflects how well your predictions are on those three specific datasets. Don’t be upset if you didn’t get a satisfied score, because

The second and thrid worth much more points than the first one
The grade distribution of this course based on the history is generous (nearly all the students got an A)

If you have any concerns for the grade of the coding portion, feel free to talk with me during my office hours.

Gradient descent

Most of the time, when training a machine learning model, we need to perform continuous nonlinear opttimization over an objective function. In this discussion, I will cover the gradient descent algorithm and its property.

Throughout the discussion, we consider the unconstrained minimization of a smooth convex function: \[\min_{x\in\mathbb R^n} f(x)\] Most of the following materials are from Nocedal and Wright (2006).

Foundations

Taylor’s Theorem

Theorem 1. Given a continuously differentiable function $f : \mathbb R^n\rightarrow \mathbb R$, and given $x, p \in \mathbb R^n$, we have \[f(x+p) = f(x) + \int_0^1\nabla f(x+\gamma p)^Tpd\gamma \] \[f(x+p) = f(x) + \nabla f(x+\gamma p)^Tp, \hspace{5mm} \text{some }\gamma\in[0,1] \]

We sometimes call the first equation the “integral form” and the second the “mean-value form” of Taylor’s theorem.

A crucial quantity in optimization is the Lipschitz constant $L$ for the gradient of $f$, which is defined to satisfy

Definition 1. A continuously differentiable function $f$ is called L-smooth or has L-Lipschitz gradients if \[||\nabla f(x) - \nabla f(y)||\le L||x-y|| \]

Descent direction

Definition 2. $d$ is a descent direction for $f$ at $x$ if $f(x + td) < f(x)$ for all $t > 0$ sufficiently small.

Proposition 1. If $f$ is continuously differentiable in a neighborhood of $x$, then any $d$ such that $d^T \nabla f(x) < 0$ is a descent direction.

Proof: By continuity of $\nabla f$, we can identify $\bar t > 0$ such that $\nabla f(x + td)^Td < 0$ for all $t \in [0, \bar t]$. Hence, for any $t\in [0, \bar t]$, by the “mean-value form” of Taylor’s theorem, we have \[f(x+td) = f(x) + t\nabla f(x+t\gamma d)^Td \hspace{5mm} \text{for some }\gamma\in[0,1]\] Since $t\gamma\in[0, \bar t]$, we have $\nabla f(x+t\gamma d)^Td < 0$. Therefore, $f(x+td) < f(x)$ for $t$ sufficiently small, which indicates $d$ is a descent direction. $\blacksquare$

Note that among all directions with unit norm, \[\inf_{||d||=1}d^T\nabla f(x) = -\nabla f(x) \hspace{5mm} \text{achieved when } d = -\dfrac{\nabla f(x)}{||\nabla f(x)||}.\]

For this reason, we refer to $-\nabla f(x)$ as the direction of steepest descent.

Since this direction always provides a descent direction, the simplest method for optimization of a smooth function has the iterations \[x_{k+1} = x_k - \alpha_k\nabla f(x). \] If $f$ is convex, we will get a global minimizer of $f$. This algorithm is called the the method of steepest descent. We will then analyze how many iterations are required to ﬁnd points where the gradient nearly vanishes.

The simplest protocol is to set $\alpha_k \equiv \alpha$ and simply iterate as follows: \[x_{k+1} = x_k - \alpha\nabla f(x).\] There is a need to choose the correct $\alpha$ to make the algorithm convergent efficiently. For example, if we choose $\alpha$ too big, the following situation may happen:

Different assumptions will lead to difference convergence rates. We will investigate them one by one.

Properties of steepest descent

1. General case (L-smooth):

Using the integral form of the Taylor Theorem, by setting $p = \alpha d$, we have \begin{equation} \begin{aligned} f(x+\alpha d) &= f(x) + \alpha\nabla f(x)^Td+\alpha\int_0^1[\nabla f(x+\gamma\alpha d) - \nabla f(x)]^Td \text{ d} \gamma\newline &\le f(x) + \alpha\nabla f(x)^Td+\alpha\int_0^1||\nabla f(x+\gamma\alpha d) - \nabla f(x)||\cdot||d||\text{ d}\gamma\newline &\le f(x) + \alpha\nabla f(x)^Td+\alpha\int_0^1L||\gamma\alpha d||\cdot||d||\text{ d}\gamma\newline &= f(x) + \alpha\nabla f(x)^Td+\alpha^2L||d||^2\int_0^1\gamma\text{ d}\gamma\newline &= f(x) + \alpha\nabla f(x)^Td+\alpha^2\dfrac{L}{2}||d||^2. \end{aligned} \end{equation}

For $x = x_k$ and $d = −\nabla f(x_k)$, the value of $\alpha$ that minimizes the expression on the right-hand side is $\alpha = 1/L$. By substituting these values, we obtain \[f(x_{k+1}) = f(x_k - \nabla f(x_k)/L)\le f(x_k) - \dfrac{1}{2L}||\nabla f(x_k)||^2. \] Therefore, for the $T$-th iteration, we have \[f(x_T) \le f(x_0) - \dfrac{1}{2L}\sum_{k = 0}^{T-1}||\nabla f(x_k)||^2. \]
By assuming $f(x) \ge \bar f$ for all $x$, we have \[\sum_{k = 0}^{T-1}||\nabla f(x_k)||^2 \le 2L[f(x_0) - \bar f] \] Notice that \[ \min_{0\le k\le T-1}||\nabla f(x_k)||^2\le \dfrac{1}{T}\sum_{k=0}^{T-1}||\nabla f(x_k)||^2, \] we have \[\min_{0\le k\le T-1}||\nabla f(x_k)||\le\sqrt{\dfrac{2L[f(x_0) - \bar f]}{T}} \] Thus, we have shown that after $T$ steps of steepest descent, we can ﬁnd a point $x$ satisfying \[||\nabla f(x_k)|| \le \sqrt{\dfrac{2L[f(x_0) - \bar f]}{T}}\]

2. Convex case:

Theorem 2. Suppose that $f$ is convex and $L$-smooth, and $x^* $ is a global minimizer of $f$. Then the steepest descent method with stepsize $\alpha \equiv 1/L$ generates a sequence {$x_k$} that satisfies \[f(x_T) - f(x^* ) \le \dfrac{L}{2T}||x_0 - x^* ||^2, \text{ for } T = 1,2,…\] You can try to prove this theorem by using the inequality $f(x^* ) \ge f(x_k) + \nabla f(x^k)^T(x^* - x_k)$.

3. Strongly convex case:

Definition 3. The smooth function $f: \mathbb R^d\rightarrow \mathbb R$ is strongly convex with modulus $m$ if there is a scalar $m > 0$ such that \[f(z)\geq f(x) + \nabla f(x)^T(z-x) + \dfrac{m}{2}||z-x||^2. \]

If $f$ is a $L$-smooth function, then \[f_\mu(x) = f(x) + \mu||x||^2\] is strongly convex for $\mu$ large enough.

Theorem 3. Suppose that $f$ is strongly convex with modulus $m$ and $L$-smooth, and $x^* $ is a global minimizer of $f$. Then the steepest descent method with stepsize $\alpha \equiv 1/L$ generates a sequence {$x_k$} that satisfies \[f(x_T) - f(x^* ) \le (1-\dfrac{m}{L})^T||f(x_0) - f(x^* ) ||^2, \text{ for } T = 1,2,…\]

4. Comparison between rates:

For the general case where we only assume $L$-smoothness, an iteration $k$ can be found such that $||\nabla f(x_k)||\leq \epsilon$ when \[k\geq \dfrac{2L[f(x_0) - \bar f]}{\epsilon^2}.\]

For the weakly convex case, we have $f(x_k) - f(x^* )\leq \epsilon$ when \[k\geq \dfrac{L||x_0 - x^* ||^2}{2\epsilon}.\]

For the strongly convex case, we have $f(x_k) - f(x^* )\leq \epsilon$ when \[k\geq \dfrac{L}{m}\log\dfrac{f(x_0) - f(x^* )}{\epsilon}\]

When $\epsilon$ is small (for example $\epsilon<10^{-6}$), the third case would convergence dramatically faster.

Exercise:

Write a code to apply various ﬁrst-order methods to the convex quadratic function $f(x) = (1/2)x^TAx$, where the positive definite matrix $A$ (of dimension $100\times 100$) is generate by the following code to have eigenvalues randomly distributed in a range of $[m, L]$, with $0< m < L$:

set.seed(0)
mu = 0.01; L = 1; kappa = L/mu
n = 100
D = runif(n); D = 10^D; Dmin = min(D); Dmax = max(D)
D = (D-Dmin) / (Dmax-Dmin)
D = mu + D*(L-mu)
A = diag(D) 
x0 = runif(n)
x_star = rep(0, 100)

It’s obvious that $x^* $ is obviously a global minimizer of $f$. In all cases, start from a point $x_0$ generated by the R command runif(n) and run until $f(x_k) - f(x^* ) \le 10^{-6}$.

Implement the following methods:

Steepest descent with $\alpha_k \equiv 1/L$, that is, \[x_{k+1} = x_{k} - (1/L)\nabla f(x_k).\]
Steepest descent with $\alpha_k \equiv 1/(5L)$.
Nesterov’s accelerated gradient descent, with the following updates: \[y_k = x_k + \beta_k(x_k - x_{k-1}),\] \[x_{k+1} = y^k - \alpha_k\nabla f(y_k),\] with optim choice of $\alpha_k = 1/L$ and $\beta_k = (\sqrt L - \sqrt m)/(\sqrt L + \sqrt m)$ and $x_{-1} = x_0$.
Newton’s method, that is, \[x_{k+1} = x_{k} - \nabla^2 f(x_k)^{-1}\nabla f(x_k).\] Draw a plot of the convergence behavior on the run, plotting the iteration number against $\log_{10}[f(x_k ) − f(x^* )]$. Use a single plot, with different colors for the different algorithms.

You can use the following code as a starting point:

f <- function(x) {
  # Write the function f here
}

df <- function(x) {
  # Write the fist order derivative here
}

GradientDescent <- function(x0, x_star, L, f, df, e = 1e-6) {
  iter = 0
  value = f(x0)
  x1 = x0
  while (f(x1) - f(x_star) > e) {
    # Write your updates here:
    # x1 = 
    iter = iter + 1
    value = c(value, f(x1))
  }
  return(list(x0, iter, value))
}

Nesterov <- function(x0, x_star, L, m, f, df, e = 1e-6) {
  # Write your function here
  return(list(x0, iter, value))
}

Also, you can utilize the following codes for plotting:

# Get optimization results:
out_GD_1 <- GradientDescent(x0, x_star, L = 1, f, df)
out_GD_2 <- GradientDescent(x0, x_star, L = 5, f, df)
out_Nest <- Nesterov(x0, x_star, L = 1, m = 0.01, f, df)
out_Newton <- Newton(x0, x_star, f, df)

# Plots:
plot(0, type="n", xlab="", ylab="", xlim=c(0, 500), ylim=c(-7, 1))
lines(log(out_GD_1[[3]], 10))
lines(log(out_GD_2[[3]], 10), col = 'blue')
lines(log(out_Nest[[3]], 10), col = 'red')
lines(log(out_Newton[[3]], 10), col = 'purple')
legend("topright", legend = c("GD1", "GD2", "Nest", "Newton"), lty = rep(1,4), col = c("black", "blue", "red", "purple"))

If you are interested in the convergence property of the Nesterov’s method, you can refer to page 32-40 of this book.

My solution can be found here: optimization.R