Show that for function f(x) = f(x1, x2) = ax2 1 + x2 2 for x ∈ R2 and a > 0, if we use
Adagrad to optimize f with learning rate η = 1, derive the ratio (as a function of a) of the
pre-conditioner: given a fixed initial point x0 = (c, c) where c ̸= 0.

2 Convex optimization but the minimizer is at infinity

Li Hua claims that he has spotted a huge bug in the convergence proof of gradient descent
convex functions in the lecture. He claims that the proof can not be applied to logistic
regression, where the optimizer is at infinity. Hence gradient descent might take forever to
converge.

Now, you want to show him some math:

1. (5 points) For logistic function h(x) = log(1 + ex) as a function h : R → R, show that
the minimizer of h is indeed at infinity.
Solution

2. (5 points) However, show that for every ϵ > 0, there is an x such that |x| = O(log(1/ϵ))
with h(x) ≤ ϵ.
Solution

3. (30 points) Now, for general function f : Rd → R, suppose f is L-smooth convex
function, and for an ϵ > 0, there is an x with ∥x∥2 ≤ R such that f(x) ≤ ϵ.
Now, show that starting from x0 = 0, gradient descent update with η ≤ L1 : converges to f(xT ) ≤ 2ϵ in T ≤ poly(R, L, 1/η, 1/ϵ) many iterations. (Hint: Apply
Mirror Descent Lemma on x)

3 Random Query for Convex Optimization (20 points)

This problem studies the efficiency of the following optimization algorithm for a d-dimensional
convex function f:

1. Fix a probability distribution D, which does not depend on f. You don’t know f, but
let’s say you somehow know that the optimum x∗ has nonzero probability density over
D.

2. Sample x1, x2, …xN ∼ D

3. Return xm = argminx1,…xN f(x)

Let D be the uniform distribution over the cube Q = {x | −1 ≤ xi ≤ 1} = {x | ∥x∥∞ ≤ 1}.
Show that for every fixed x0 ∈ Rd, if we sample N = dO(1) points x1, x2, · · · , xN uniformly
from Q, then with probability lower bounded by 1−e−Ω(d), the following holds for all i ∈ [N]: EasyDue™ 支持PayPal, AliPay, WechatPay, Taobao等各种付款方式!

E-mail: easydue@outlook.com  微信:easydue

EasyDue™是一个服务全球中国留学生的专业代写公司