ST3901: Statistical Learning and Decision Making
Exercise 5

The data x ∈ R² are generated according to a Gaussian mixture model

    p_{X|Y}(x | i) = Σ_{j=1}^{nClusters} c_j · N(x; μ_{i,j}, I₂),   i = 0, 1

where Σ_j c_j = 1. A typical plot of the generated data, with nClusters = 10:
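As a concrete illustration, data of this form could be drawn as follows. This is a minimal sketch: the cluster centers mu0, mu1, the uniform mixing weights, and the sample sizes are made-up stand-ins, not the values used in the project.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_class(n, mu, c):
    """Draw n points from sum_j c[j] * N(x; mu[j], I_2).

    mu has shape (nClusters, 2); the mixing weights c sum to 1.
    """
    # pick a cluster index for each sample according to the mixing weights c
    clusters = rng.choice(len(c), size=n, p=c)
    # add standard-normal noise (identity covariance I_2) around each center
    return mu[clusters] + rng.standard_normal((n, 2))

nClusters = 10
# hypothetical cluster centers and uniform mixing weights, for illustration only
mu0 = rng.uniform(-5, 5, size=(nClusters, 2))
mu1 = rng.uniform(-5, 5, size=(nClusters, 2))
c = np.full(nClusters, 1.0 / nClusters)

x0 = sample_class(500, mu0, c)   # samples with label y = 0
x1 = sample_class(500, mu1, c)   # samples with label y = 1
```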

We model the posterior as

    P_{Y|X}(i | x; θ) = e^{F_i(x)} / Σ_{y∈Y} e^{F_y(x)},   i ∈ Y    (1)

For binary labels, taking F_1(x) = F(x) and F_0(x) = −F(x), this becomes

    P_{Y|X}(1 | x) = e^{F(x)} / (e^{F(x)} + e^{−F(x)}),   P_{Y|X}(0 | x) = e^{−F(x)} / (e^{F(x)} + e^{−F(x)})    (2)
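Since e^{F(x)} / (e^{F(x)} + e^{−F(x)}) = 1 / (1 + e^{−2F(x)}), the model probabilities in (2) can be computed in a numerically stable way; a minimal sketch:

```python
import numpy as np

def p_y_given_x(F):
    """P(Y=1|x) and P(Y=0|x) from (2), for an array of F(x) values.

    Uses e^F / (e^F + e^-F) = 1 / (1 + e^{-2F}) to avoid overflow for large F.
    """
    p1 = 1.0 / (1.0 + np.exp(-2.0 * F))
    return p1, 1.0 - p1
```

For example, F(x) = 0 gives P(Y=1|x) = 1/2, and a large positive F(x) pushes P(Y=1|x) toward 1.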

with

    F(x) = Σ_{j=0}^{k−1} weights[j] · b(x, center[j, :]),   (center[j, :] ∈ R², j = 0, . . . , k−1)    (3)

The maximum likelihood estimate of θ is

    θ̂_ML = arg max_θ (1/n) Σ_{i=0}^{n−1} log P_{Y|X}(y_i | x_i; θ)    (4)

The base function b is a truncated paraboloid:

    b(x, center) = r² − ‖x − center‖²   if ‖x − center‖² < r²,   and 0 otherwise.

We take r² = 20, so that most samples x fall inside the support of at least a few base functions.
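A direct vectorized implementation of b, and of F from its definition as a weighted sum of base functions, might look like this (a sketch; the array shapes and the name R2 for r² are assumptions):

```python
import numpy as np

R2 = 20.0  # r^2, the squared radius of each base function's support

def b(x, center):
    """Truncated-paraboloid base function:
    r^2 - ||x - center||^2 where ||x - center||^2 < r^2, else 0.

    x: (n, 2) array of points; center: (2,) array.
    """
    d2 = np.sum((x - center) ** 2, axis=-1)
    return np.where(d2 < R2, R2 - d2, 0.0)

def F(x, centers, weights):
    """F(x) = sum_j weights[j] * b(x, centers[j])."""
    return sum(w * b(x, ctr) for w, ctr in zip(weights, centers))
```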

plt.figure()
plt.rcParams['axes.facecolor'] = 'gray'
plt.scatter(coef[:, 0], coef[:, 1], s=coef[:, -1] * 4000, c=weights[:],
            cmap='bwr', alpha=0.5)
plt.scatter(xtrain[:, 0], xtrain[:, 1], c=ytrain_flat,
            cmap='bwr', alpha=0.2)
plt.axis([xmin, xmax, ymin, ymax])
plt.show()

• A scatter plot is probably the right choice for viewing all the samples together with the centers.

• I use color to indicate which class each sample belongs to, and which class (0 or 1) each weight corresponds to. The colormap 'bwr' is used so that the two classes show up in blue and red.

• I made the plot somewhat transparent via the alpha parameter, so that overlapping points remain visible.

• Fixing the plot axes is helpful.

Without loss of generality, let's say that we would like to update θ₀ = (center[0, :] ∈ R², weight). For convenience, let's write

    F^old(x) = Σ_{j≠0} weights[j] · b(x, center[j, :])
    f(x; θ₀) = weight · b(x, center[0, :])

and our goal is to find a θ₀ such that when we form F(x) = F^old(x) + f(x; θ₀), and after that P_{Y|X} from (2), the updated P_{Y|X} fits the data better.
Let's consider the log likelihood of the training data, where we use P̂_XY as the joint empirical distribution:

    E_{P̂XY}[log P_{Y|X}(Y|X)] = E_{P̂XY}[ log( P_{Y|X}(1|X)^Y · P_{Y|X}(0|X)^{1−Y} ) ]
      = E_{P̂XY}[ Y log( e^{F(X)} / (e^{F(X)} + e^{−F(X)}) ) + (1 − Y) log( e^{−F(X)} / (e^{F(X)} + e^{−F(X)}) ) ]
      = E_{P̂XY}[ 2Y F(X) − log(1 + e^{2F(X)}) ]
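The last equality uses (2Y − 1)F − log(e^F + e^{−F}) = 2YF − log(1 + e^{2F}); a quick numeric sanity check of this simplification:

```python
import numpy as np

rng = np.random.default_rng(1)
F = rng.normal(size=1000)            # arbitrary F(x_i) values
Y = rng.integers(0, 2, size=1000)    # binary labels

# left-hand side: Y log P(1|X) + (1-Y) log P(0|X), from (2)
p1 = np.exp(F) / (np.exp(F) + np.exp(-F))
lhs = Y * np.log(p1) + (1 - Y) * np.log(1 - p1)

# right-hand side: the simplified form
rhs = 2 * Y * F - np.log(1 + np.exp(2 * F))

assert np.allclose(lhs, rhs)
```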
With the definition F(x) = F^old(x) + f(x) to fix all but one of the base functions in F(x), we solve in each iteration

    f^update = arg max_f E_{P̂XY}[ 2Y(F^old(X) + f(X)) − log(1 + e^{2(F^old(X) + f(X))}) ]    (5)

and then make the update F^new ← F^old + f^update.
The key is to solve (5). For convenience, let's define

    l^{(F^old + f)}(x, y) ≜ 2y(F^old(x) + f(x)) − log(1 + e^{2(F^old(x) + f(x))})

so that (5) can be written in a shorter form as maximizing

    E_{P̂XY}[ l^{(F^old + f)}(X, Y) ] = (1/n) Σ_{i=0}^{n−1} l^{(F^old + f)}(x_i, y_i)
When we don't have any other constraint, and are allowed to choose any function f(·), then we can choose the value f(x) for each value of x separately, by solving

    f^update(x) = arg max_a Σ_i 1_{x_i = x} · [ 2y_i(F^old(x_i) + a) − log(1 + e^{2(F^old(x_i) + a)}) ]
This is actually simpler than it looks. Since the x_i's take real values, we can safely assume that no two samples in our data set have exactly the same x_i, which means we can just do the above optimization for each sample:

    f^update(x_i) = arg max_a [ 2y_i(F^old(x_i) + a) − log(1 + e^{2(F^old(x_i) + a)}) ]
In reality, we have the constraint that f^update must be a scaled version of a single base function from the set of base functions we have chosen. This means we cannot arbitrarily choose f(x) for every x as we wish, but instead have to compromise between the ideal f(x) values at different x, which is really a fitting problem.
Fitting a Single Base Function
We start by taking the derivative of the objective function w.r.t. a particular f(x_i), at the point where the entire function f(·) = 0. (Notice that we should take the derivative w.r.t. f(x) for every x, but since the objective function is evaluated only at the samples, only the derivatives w.r.t. f(x_i) for x_i in the sample set matter.)

    ∂/∂f(x_i) [ (1/n) Σ_{j=0}^{n−1} l^{(F^old + f)}(x_j, y_j) ] |_{f(·)=0} = (2/n) ( y_i − e^{F^old(x_i)} / (e^{F^old(x_i)} + e^{−F^old(x_i)}) )
A gradient ascent algorithm (ascent rather than descent in this case, as we are maximizing the objective function) would now pick

    f^update(x_i) ∝ (1/n) ( y_i − e^{F^old(x_i)} / (e^{F^old(x_i)} + e^{−F^old(x_i)}) )

and use that for the update F^new = F^old + f^update.
We should stop here and take a look at the gradient we just calculated, and make some sense of it. The term e^{F^old(x_i)} / (e^{F^old(x_i)} + e^{−F^old(x_i)}) looks familiar! By definition (2), it is the model P_{Y|X}(1|x_i) based on our old function F^old. The conditioning on x_i means that based on F^old, we use our model to predict the probability that Y_i = 1 as this value. Then y_i minus that value is, in a sense, the error of our current prediction on the pair (x_i, y_i). Clearly, we would like to use f^update to correct these errors. For convenience, we define a Z-value

    Z(x, y) ≜ y − e^{F^old(x)} / (e^{F^old(x)} + e^{−F^old(x)})

and we hope the update function f^update resembles the Z-values:

    f^update(x_i) ∝ Z(x_i, y_i)    (6)
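Computing the Z-values is then just taking the residual between the labels and the current model's predictions; a sketch, assuming F_old holds the values F^old(x_i) for the training samples:

```python
import numpy as np

def z_values(F_old, y):
    """Z(x_i, y_i) = y_i - e^{F_old(x_i)} / (e^{F_old(x_i)} + e^{-F_old(x_i)}),
    i.e. the residual y_i - P_{Y|X}(1|x_i) under the current model."""
    p1 = 1.0 / (1.0 + np.exp(-2.0 * F_old))  # current prediction of P(Y=1|x_i)
    return y - p1
```

For example, at F^old(x_i) = 0 the model predicts 1/2 for both classes, so the Z-value is +1/2 for a label-1 sample and −1/2 for a label-0 sample.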
We need to do that under the constraint that f^update can only be chosen as one of our base functions. So we cannot make the above choice exactly, but have to do it approximately. Let's try to find an f^update, which is a function defined for all x but within the family of functions that we can choose from, and hope that it is as close as possible to the desired values at all the x_i's. For that, we can change the problem into minimizing the mean square error

    f^update = arg min_f (1/n) Σ_{i=0}^{n−1} [ Z(x_i, y_i) − f(x_i) ]²

where the optimization is constrained: f must be chosen from the allowed base functions.
This is the step where we are being a little bit hand-wavy. Why should we use the MSE to judge whether the chosen f is "close" to the ideal gradient in (6)? Could we use some other error measure for this fitting? These are really good questions that we can actually answer, but they require some more analysis. You are encouraged to explore along this line. But to make this project work, implementing this MSE fitting should suffice.
Task 3: Design an algorithm to do this!
First, assume we have already found the correct base function b(x, center[0, :]) for f. Then the optimal weight for the least squares problem should be (try to convince yourself)

    weight = E_{P̂XY}[ Z(X, Y) · b(X, center[0, :]) ] / E_{P̂XY}[ b(X, center[0, :])² ]
Now let’s consider how to find the new base function (find its center). One can do a brute
force search for this center. That is, one can try to put a base function centered at every
possible point on the 2-D space, and see how much the square error will be if this center is
chosen.
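The brute-force search described above can be sketched as follows: for each candidate center, the optimal weight has the closed form given earlier, and we keep the center with the smallest resulting mean square error. The function names and the way candidate centers are supplied are assumptions; in practice the candidates could be a grid over the 2-D plane (e.g. built with np.meshgrid).

```python
import numpy as np

R2 = 20.0  # r^2, the squared radius of each base function's support

def b(x, center):
    """Truncated-paraboloid base function, as defined earlier."""
    d2 = np.sum((x - center) ** 2, axis=-1)
    return np.where(d2 < R2, R2 - d2, 0.0)

def fit_base_function(x, z, candidate_centers):
    """Brute-force search: for each candidate center, compute the optimal
    least-squares weight in closed form and the resulting mean square error
    against the Z-values; return the best (center, weight) pair."""
    best = (None, 0.0, np.inf)
    for ctr in candidate_centers:
        bx = b(x, ctr)
        denom = np.mean(bx ** 2)
        if denom == 0.0:                      # base function covers no samples
            continue
        w = np.mean(z * bx) / denom           # weight = E[Z b] / E[b^2]
        err = np.mean((z - w * bx) ** 2)      # resulting MSE
        if err < best[2]:
            best = (ctr, w, err)
    return best[0], best[1]
```

If the Z-values happen to be an exact multiple of one candidate's base function, the search recovers that center and scale exactly.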
This is a simple project designed for you to practice using iterative algorithms to adjust one small part of your model at a time. If successful, you should observe that a neural network on the same data set achieves roughly the same level of performance as your own algorithm. In a sense, neural networks, as well as any other learning algorithms that work on this data, must be doing roughly the same thing. Your final presentation should contain not only the performance of your code, but also some reflections on what you have done in this project: what creative steps did you take? what were you thinking, and what connections did you make in the process? how would you propose to improve if you had more time? We do not want to just read your code at the end; we would like you to start learning how to be a thoughtful and creative engineer: thoughtful means you are equipped with the right mathematical tools and good intuitions about the overall procedure; creative means you can come up with good solutions that surprise others.
The following figure is what we should see from the learning process.