This assignment is about building a statistical model of the data and performing decision analysis.
ST3901: Statistical Learning and Decision Making
Exercise 5
This is the final project of the course. Our task is to write our own learning and classification routine for the synthetic data set used in Exercise 4. As a reminder, the data samples x ∈ R^2 are generated according to a Gaussian mixture model

    p_{X|Y}(x | i) = \sum_{j=1}^{nClusters} c_j \cdot N(x; \mu_{i,j}, I_2),   i = 0, 1

where the \mu_{i,j}'s are some randomly chosen cluster centers, I_2 is the 2 × 2 identity covariance matrix, and the c_j's satisfy \sum_j c_j = 1. A typical plot of the generated data, with nClusters = 10, is shown in the figure below. We showed in Exercise 4 that this data set can be learned and classified by a simple neural network. The goal of this project is to write our own learning algorithm, based on iterative updates of the parameters, that can accomplish the same task, so that we can understand what the key ingredients inside a neural network are.
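For concreteness, the following is a minimal sketch of how such a data set could be generated, assuming NumPy; the function name make_data, the range of the random centers, and the use of a Dirichlet draw for the mixture weights c_j are my own choices, not part of the handout.

import numpy as np

def make_data(n_samples=1000, n_clusters=10, seed=0):
    # Sample (x, y) pairs from the two-class Gaussian mixture described above.
    rng = np.random.default_rng(seed)
    # randomly chosen cluster centers mu_{i,j}, one set of nClusters centers per class i = 0, 1
    centers = rng.uniform(-10.0, 10.0, size=(2, n_clusters, 2))
    # mixture weights c_j with sum_j c_j = 1 (here shared by both classes)
    c = rng.dirichlet(np.ones(n_clusters))
    y = rng.integers(0, 2, size=n_samples)                    # class label for each sample
    j = rng.choice(n_clusters, size=n_samples, p=c)           # cluster index for each sample
    x = centers[y, j] + rng.standard_normal((n_samples, 2))   # draw from N(mu_{y,j}, I_2)
    return x, y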
Model Family and Parameters

First, we define a family of models. Of course, we could define the family to be mixtures of Gaussians, but that feels like cheating. A commonly used model family is given as follows:

    P_{Y|X}(i | x; θ) = e^{F_i(x)} / \sum_{i'} e^{F_{i'}(x)},   y = i ∈ Y     (1)

for a set of functions F_i(x), one for each value i that Y can take. A quick observation: adding a constant to every F_i(x) does not change this ratio, so w.l.o.g. we often impose the convenient constraint

    \sum_i F_i(x) = 0   for all x.
In this project, we only consider the simple case Y = {0, 1}, i.e., we only have two classes, so we have F_1(x) = −F_0(x) = F(x). We can therefore work with the simpler model

    P_{Y|X}(1 | x) = e^{F(x)} / (e^{F(x)} + e^{−F(x)}),   P_{Y|X}(0 | x) = e^{−F(x)} / (e^{F(x)} + e^{−F(x)})     (2)

and, where it causes no confusion, I will drop the vector notation on X in the following.
We assume that

    F(x) = \sum_{j=0}^{k−1} weights[j] · b(x, center[j])     (3)

i.e., a linear combination of some base function b(x, center) that is easy to compute and easy to fit; we will give an example later. What we need to worry about is finding the weight coefficients and the centers of these base functions. We allow k of them in total, as the complexity of the model. We write θ = (weights[j] ∈ R, center[j] ∈ R^2, j = 0, ..., k−1) for the parameters of the distribution. θ is what we need to learn from the data. When we are given a set of training samples (x[i], y[i]), i = 0, ..., n−1, we need to find the maximum likelihood estimate of θ:

    \hat{θ}_{ML} = \arg\max_θ ( (1/n) \sum_{i=0}^{n−1} \log P_{Y|X}(y[i] | x[i]) )     (4)

where P_{Y|X} depends on the parameters θ as defined in (2) and (3).
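As a sanity check while debugging, the objective in (4) is easy to evaluate once the values F(x[i]) have been computed via (3). A minimal sketch, assuming NumPy; the helper name log_likelihood is mine, and the one-line simplification of the log probability used in the comment is the same one derived in the "Updating the Parameters" section below.

import numpy as np

def log_likelihood(F, y):
    # Average log-likelihood (4) under the model (2).
    # F : array of F(x[i]) values at the training samples, computed via (3)
    # y : array of labels in {0, 1}
    # Uses log P_{Y|X}(y|x) = 2*y*F(x) - log(1 + e^{2 F(x)}), in a numerically stable form.
    return np.mean(2 * y * F - np.logaddexp(0.0, 2 * F))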
Quick Discussion 1. You might find the form in (2) a bit odd. Why don't we write the distribution directly as a linear combination of simple functions? It turns out that this particular form, where the linear combination is taken in the exponent, is convenient when we have many samples and want to add up their log likelihoods. This form is closely related to the notion of the "exponential family", which is quite important, but we will not go into the details in this course. You are of course encouraged to try other forms of distribution families.
The base function b(x, center) can be chosen as a single-peaked function that is easy to compute. We want the function value to be larger when x is close to the center, and smaller otherwise. For example, a simple choice could be

    b(x, center) = { 1   if ‖x − center‖^2 < r^2
                   { 0   otherwise

with r as a global parameter. My sensible choice is r^2 = 20, so that most samples x generated from N(center, I_2) have b(x, center) = 1.
A choice that emphasizes the samples close to the center, with a gradually decreasing base function value, is the truncated quadratic function

    b(x, center) = { r^2 − ‖x − center‖^2   if ‖x − center‖^2 < r^2
                   { 0                       otherwise
This should be compared with a neural network, where the outputs of the last layer of nodes can be thought of as such base functions, except that they are chosen from the class of functions that the earlier layers, with their activation functions, can generate. The combination in the last layer corresponds to using the weights to form a new linear combination. It turns out that, when we use the right loss function, the optimization is exactly the same as fitting the distribution by using a linear combination in the exponent.
Task 1. You need to implement an efficient way to evaluate the base functions. In particular, with the input x being an array of samples of shape [nSamples, 2] and center being a 2-dimensional array, the function should return the array of base function values for all the samples together.
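A minimal sketch of such an evaluation, assuming NumPy and the truncated quadratic base function above; the function name base_function and the module-level constant r2 are my own choices.

import numpy as np

r2 = 20.0  # the global parameter r^2 suggested above

def base_function(x, center):
    # x: array of shape [nSamples, 2]; center: array of shape [2]
    # returns an array of shape [nSamples] with the base function value of every sample
    d2 = np.sum((x - center) ** 2, axis=1)    # squared distance of each sample to the center
    return np.where(d2 < r2, r2 - d2, 0.0)    # r^2 - ||x - center||^2 inside the radius, 0 outside

The indicator choice is obtained by replacing the last line with np.where(d2 < r2, 1.0, 0.0).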
Task 2. To allow us to debug later and to monitor whether the correct set of parameters is being learned, we should have a way to visualize the chosen functions. This requires some plotting tricks, and you should do it in whatever way you choose. Sample code is given below.
import matplotlib.pyplot as plt

plt.figure()
plt.rcParams['axes.facecolor'] = 'gray'
plt.scatter(coef[:, 0], coef[:, 1], s=coef[:, -1] * 4000, c=weights[:],
            cmap='bwr', alpha=0.5)
plt.scatter(x_train[:, 0], x_train[:, 1], c=y_train_flat,
            cmap='bwr', alpha=0.2)
plt.axis([xmin, xmax, ymin, ymax])
plt.show()
When designing your own visualization, you need a few components:
• A scatter plot is probably the right choice for viewing all the samples, together with the centers of the base functions, which in the code above are stored in the coef array (of size k × 3), i.e., coef[:, 0] and coef[:, 1]. coef[:, -1] is used to set the size of the scatter markers.
• I use color to indicate which class a sample belongs to, and how the weights correspond to class 0 and class 1. The colormap 'bwr' is used, so that more blue and more red represent the two different classes. In this colormap, "white" means "not sure which class", which is why I had to change the background to gray.
• I make the plots somewhat transparent via the alpha parameter, so that when the samples and the functions are plotted together I can still see both.
• It is helpful to fix the plot axes.
Updating the Parameters

Obviously, we start by choosing all the parameters randomly. We will use an iterative approach to find the maximum likelihood estimates of these parameters. That is, in each step, we fix all the parameters except one, and update that parameter to increase the likelihood.
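To fix ideas, here is a rough sketch of such an outer loop, assuming NumPy and the base_function helper sketched under Task 1. The names learn and fit_single_base_function are mine; the latter is a hypothetical helper that refits one (center, weight) pair given the contribution of all the other base functions, and one possible implementation is sketched under Task 3 below.

import numpy as np

def learn(x_train, y_train, k, n_sweeps=20):
    # Coordinate-wise iterative updates: refit one base function at a time.
    rng = np.random.default_rng(0)
    centers = rng.uniform(x_train.min(), x_train.max(), size=(k, 2))  # random initial centers
    weights = np.zeros(k)                                             # initial weights

    for _ in range(n_sweeps):
        for j in range(k):
            # F_old: contribution of all base functions except the j-th one
            F_old = sum(weights[m] * base_function(x_train, centers[m])
                        for m in range(k) if m != j)
            # update the j-th (center, weight) pair to increase the likelihood
            centers[j], weights[j] = fit_single_base_function(x_train, y_train, F_old)
    return centers, weights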
Without loss of generality, let's say that we would like to update θ_0 = (center[0, :] ∈ R^2, weights[0]). For convenience, let's write

    F^{old}(x) = \sum_{j ≠ 0} weights[j] · b(x, center[j, :])
    f(x; θ_0) = weights[0] · b(x, center[0, :])

and our goal is to find a θ_0 such that when we form F(x) = F^{old}(x) + f(x; θ_0), and after that P_{Y|X} from (2), the updated P_{Y|X} fits the data better.
Let's consider the log likelihood of the training data, where we use \hat{P}_{XY} as the joint empirical distribution:

    E_{\hat{P}_{XY}}[\log P_{Y|X}(Y|X)]
      = E_{\hat{P}_{XY}}[ \log ( P_{Y|X}(1|X)^Y · P_{Y|X}(0|X)^{1−Y} ) ]
      = E_{\hat{P}_{XY}}[ Y \log ( e^{F(X)} / (e^{F(X)} + e^{−F(X)}) ) + (1 − Y) \log ( e^{−F(X)} / (e^{F(X)} + e^{−F(X)}) ) ]
      = E_{\hat{P}_{XY}}[ 2Y F(X) − \log(1 + e^{2F(X)}) ]
With the definition F(x) = F^{old}(x) + f(x), which fixes all but one of the base functions in F(x), we solve in each iteration

    f^{update} = \arg\max_f E_{\hat{P}_{XY}}[ 2Y (F^{old}(X) + f(X)) − \log(1 + e^{2(F^{old}(X) + f(X))}) ]     (5)

and then make the update F^{new} ← F^{old} + f^{update}.
The key is to solve (5). For convenience, let's define

    l^{(F^{old}+f)}(x, y) ≜ 2y (F^{old}(x) + f(x)) − \log(1 + e^{2(F^{old}(x) + f(x))})

so that (5) can be written in a shorter form as maximizing

    E_{\hat{P}_{XY}}[ l^{(F^{old}+f)}(X, Y) ] = (1/n) \sum_{i=0}^{n−1} l^{(F^{old}+f)}(x_i, y_i)
When we don't have any other constraint, and are allowed to choose any function f(·), then we can choose the value f(x) for each value of x separately, by solving

    f^{update}(x) = \arg\max_a \sum_i 1_{x_i = x} · [ 2y_i (F^{old}(x_i) + a) − \log(1 + e^{2(F^{old}(x_i) + a)}) ]
This is actually simpler than it looks. Since the x_i's take real values, we can safely assume that no two samples in our data set have exactly the same x_i, which means we can just do the above optimization for each sample:

    f^{update}(x_i) = \arg\max_a [ 2y_i (F^{old}(x_i) + a) − \log(1 + e^{2(F^{old}(x_i) + a)}) ]
In reality, we have the constraint that f^{update} has to be a scaled version of a single base function from the set of base functions we have chosen. This means we cannot arbitrarily choose f(x) for every x as we wish, but instead have to compromise between the ideal f(x) values at different x, which is really a fitting problem.
Fitting a Single Base Function

We start by taking the derivative of the objective function w.r.t. a particular f(x_i), at the point where the entire function f(·) = 0. (Notice that we should take the derivative w.r.t. f(x) for every x, but since the objective function is only evaluated at the set of samples, only the derivatives w.r.t. f(x_i) for x_i in the sample set matter.)

    ∂/∂f(x_i) [ (1/n) \sum_{i=0}^{n−1} l^{(F^{old}+f)}(x_i, y_i) ] |_{f(·)=0} = (2/n) ( y_i − e^{F^{old}(x_i)} / (e^{F^{old}(x_i)} + e^{−F^{old}(x_i)}) )
A gradient descent algorithm (ascent in this case, as we are maximizing the objective function) would now pick

    f^{update}(x_i) ∝ (1/n) ( y_i − e^{F^{old}(x_i)} / (e^{F^{old}(x_i)} + e^{−F^{old}(x_i)}) )

and use that for the update F^{new} = F^{old} + f^{update}.
We should stop here, take a look at the gradient we just calculated, and make some sense out of it. The term e^{F^{old}(x_i)} / (e^{F^{old}(x_i)} + e^{−F^{old}(x_i)}) looks familiar! It is, by definition (2), the model P_{Y|X}(1|x_i) based on our old function F^{old}. The conditioning on x_i means that, based on F^{old}, we use our model and predict the probability that Y_i = 1 as this value. Then y_i minus that value is, in a sense, the error of our current prediction on this pair (x_i, y_i). Clearly, we would like to use f^{update} to correct these errors. For convenience, we define a Z-value

    Z(x, y) ≜ y − e^{F^{old}(x)} / (e^{F^{old}(x)} + e^{−F^{old}(x)})

and we hope the update function f^{update} resembles the empirical Z-values:

    f^{update}(x_i) ∝ Z(x_i, y_i)     (6)
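A minimal sketch of the Z-value computation, assuming NumPy; the helper name z_values is mine, and the identity e^{F} / (e^{F} + e^{−F}) = 1 / (1 + e^{−2F}) is used for numerical stability.

import numpy as np

def z_values(F_old, y):
    # Z(x_i, y_i) = y_i - P_{Y|X}(1 | x_i), with P_{Y|X}(1|x) taken from (2) under F_old.
    # F_old : array of F^{old}(x_i) values at the training samples
    # y     : array of labels in {0, 1}
    p1 = 1.0 / (1.0 + np.exp(-2.0 * F_old))   # model prediction of P(Y = 1 | x_i)
    return y - p1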
We need to do that with the constraint that f^{update} can only be chosen as one of our base functions. So we cannot make the above choice exactly, but have to do it approximately. Let's try to find an f^{update}, defined for all x but restricted to the family of functions that we can choose from, and hope that it is as close as possible to the desired values at all the x_i's. For that, we can change the problem into minimizing the mean square error

    f^{update} = \arg\min_f (1/n) \sum_{i=1}^{n} [ Z(x_i, y_i) − f(x_i) ]^2

where the optimization is a constrained one, with f chosen from the allowed base functions.
This is the step where we are being a little bit hand-wavy. Why should we use the MSE as a way to judge whether the chosen f is "close" to the ideal gradient in (6)? Could we use some other error measure for this fitting? These are really good questions that we actually can answer, but with some more analysis. You are encouraged to explore along this line. To make this project work, however, implementing this MSE fitting should suffice.
Task 3: Design an algorithm to do this!

First, assume that we have already found the correct base function b(x, center[0, :]) for f. Then the optimal weight for the least squares problem should be (try to convince yourself)

    weights[0] = E_{\hat{P}_{XY}}[ Z(x, y) · b(x, center[0, :]) ] / E_{\hat{P}_{XY}}[ b(x, center[0, :])^2 ].

Now let's consider how to find the new base function, i.e., how to find its center. One can do a brute force search for this center. That is, one can try to put a base function centered at every possible point in the 2-D space, and see how large the square error would be if this center were chosen.
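One possible implementation of this search, sketched under my own assumptions: NumPy, the base_function and z_values helpers sketched earlier, and a candidate grid restricted to the training samples themselves rather than every point in the plane (a shortcut, not something the handout prescribes).

import numpy as np

def fit_single_base_function(x_train, y_train, F_old):
    # Brute-force search over candidate centers, with the closed-form optimal weight for each.
    # Returns the (center, weight) pair minimizing the mean square error to the Z-values.
    z = z_values(F_old, y_train)
    best_center, best_weight, best_err = x_train[0].copy(), 0.0, np.inf

    for candidate in x_train:                       # candidate centers: the training samples
        b = base_function(x_train, candidate)       # base function values at all samples
        denom = np.mean(b ** 2)
        if denom == 0.0:                            # no sample falls within this candidate's radius
            continue
        w = np.mean(z * b) / denom                  # optimal weight for this candidate center
        err = np.mean((z - w * b) ** 2)             # resulting mean square error
        if err < best_err:
            best_center, best_weight, best_err = candidate.copy(), w, err

    return best_center, best_weight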
Task 5: Think!
This is a simple project designed for you to practice using iterative algorithms to adjust one small part of your model at a time. If successful, you should observe that a neural network trained on the same data set has roughly the same level of performance as your own algorithm. In a sense, neural networks, as well as any other learning algorithms that work on this data, must be doing roughly the same thing. Your final presentation should contain not only the performance of your code, but also some reflections on what you have done in this project: What creative steps did you take? What were you thinking about, and what connections did you make, along the way? How would you propose to improve it if you had more time? We do not want to just read your code at the end; we would like you to start learning how to be a thoughtful and creative engineer: thoughtful means you are equipped with the right mathematical tools and good intuitions about the overall procedure, and creative means you can come up with good solutions that surprise others.
The following figure is what we should see from the learning process.