林軒田機器學習基石筆記 - 第七講

發表於 2019-10-01 更新於 2021-08-30 分類於課程筆記 Course ，林軒田機器學習基石閱讀次數： Disqus：

本文為一系列課程之筆記，建議從" 機器學習基石筆記-1 "開始閱讀

本文討論內容請參考: 機器學習基石第七講 : The VC Dimension

本篇所有圖片部分由筆者製作，其它均為機器學習基石課程內容講義

我們先來看一下上一篇的結論 : 存在 \(Break\ point\) ( 好的\(\mathbb{H}\) ) 且 \(N\) 夠大的情況下 ( 好的\(\mathbb{D}\) )，即使 \(infinite\ \mathbb{H}\)，演算法從\(E_{in}\)很小的條件下挑出一個 \(g\) ( 好的\(algorithm\) )，都一定會有一個有限上界可以確保\(E_{in}\)跟\(E_{out}\)夠接近。

About \(\mathbb{H}\)

在我們往下討論這一講的內容前，我覺得我們必須花一點時間了解 \(\mathbb{H}\)。在機器學習基石筆記-1的時候，我們曾經有說，在探討 \(\mathbb{H}\) 時，我們會把重點放在 \(\mathbb{W}\)的討論上，但是我們會什麼要探討 \(\mathbb{H}\) 呢? \(\mathbb{H}\) 對於我們的學習有什麼意義?

\(\mathbb{H}\) 是一個 hypothesis set，但是這些假設都是怎麼來的呢?

從一個二元分類問題來看，我們曾經有說過，一個平面中，任何一條「直線」都是一個 hypothesis \(h\in\mathbb{H}\)，所以 \(\mathbb{H}\) 裡面就是裝著無限多條「直線」。因為我們預先假定了這個分類器是「直線」，所以 \(\mathbb{H}\) 裡面不會有非線性的hypothesis在裡面。

所以，在我們進行 Machine Learning 時，我們會決定我們要用什麼樣子的分類器，而這也決定了我們的 \(\mathbb{H}\) 長什麼樣子，當然， \(\mathbb{W}\) 也在這時候決定了。

\(Algorithm\ (model)\longleftrightarrow \mathbb{H}\longleftrightarrow \mathbb{W}\)

所以這就是為什麼往後的課程都會討論 \(\mathbb{H}\) ，因為這樣的一個 hypothesis set 也足以代表了你的分類器。

VC dimension

OK，了解了 \(\mathbb{H}\) 所代表的意義後，我們需要有一個衡量這個 \(\mathbb{H}\) 的方法，於是有了以下 VC dimension 的定義 :

Vapnik-Chervonenkis Dimension \(d_{vc}(\mathbb{H})\ or\ VC(\mathbb{H}):\) \(The\ largest\ value\ of\ N\ s.t.\ m_\mathbb{H}(N)=2^N\) \(d_{vc}(\mathbb{H})=\) 能被shatter的最大N值 \(=\ break point\ k-1\)

根據這樣的定義我們可以知道以下幾個特性 :

\(N\leq d_{vc}\ \implies\ For\ some\ detaset\ \mathbb{D}\ with\ size\ N\ can\ be\ shatter\ by\ \mathbb{H}\)
\(N>d_{vc}\implies\ \forall\ dataset\ \mathbb{D}\ with\ size\ N\ can't\ be\ shatter\ by\ \mathbb{H}\)
\(N\geq 2\ , d_{vc}\geq 2\implies\ m_{\mathbb{H}}(N)\leq 2^{d_{vc}}\)

資料空間 &分類型態	Break Point	\(m_\mathbb{H}(N)\)	O(N)	\(d_{vc}\)
1-D Positive ray	2	N+1	\(O(N)\)	1
1-D Positive interval	3	\(\frac{1}{2}N^2+\frac{1}{2}N+1\)	\(O(N^2)\)	2
Convex Set	\(\infty\)	\(2^N\)	--	\(\infty\)
2-D Perceptron	4	\(<2^N\) (in some cases)	\(O(N^3)\)	3

休息一下，我們整理一下至今的幾個概念 : \(m_{\mathbb{H}}(N)=\) N筆資料可以被切出幾個 \(dichotomies\) \(B(N,K)=\) N筆資料，且break point=K 的情況下，最大的\(m_{\mathbb{H}}(N)\) \(d_{vc}(\mathbb{H})=\ hypotesis\ \mathbb{H}\) 可以shatter的最大N值

前兩個概念，基本上是一個過渡概念，主要為了證明出 VC Bound ，真正的著眼點應該在\(d_{vc}(\mathbb{H})\)，因為它可以確實對 \(\mathbb{H}\)做出量化測量。一個真正夠好的 \(hypothesis\ \mathbb{H}\) \(\implies d_{vc}(\mathbb{H})\ is\ finite\implies K\ exists\implies VC\ Bound\ exists\implies E_{in}\approx E_{out}\)

在這裡，VC Dimension與 Algorithm , Distribution 或是 target function都無關

不管是直觀¹或經由繁雜的證明²，我們都不難確認 \(d_{vc}(\mathbb{H})= d+1\) ( \(d\) 為資料 \(\mathbb{X}\) 的維度 )

總的來說， \(h\in \mathbb{H}\ ,\ h(\mathbb{X})=Sign(\sum\limits_{i=1}^{d}w_ix_i+w_0)\) * \(h\) 是由 \((w_1,w_2,w_3,...,w_d)\) 所決定，我們可以將 \(d_{vc}(\mathbb{H})=d+1\)視為自由度 * \(d_{vc}(\mathbb{H})\) 即為 \(\mathbb{H}\) 的能力指標 * \(\mid\mathbb{H}\mid=M\ is\ finite\implies\) 可以利用 \(M\) 來控制VC bound \(\mid\mathbb{H}\mid=M\ is\ infinite\implies\) 可以利用 \((2N)^{d_{vc}}\) 取代 \(M\) 來控制 VC bound

[ Remark ] 讓我們來重新審視一下 VC Bound

\(\mathbb{P}_{\mathbb{D}}[\\![\mid E_{in}(g)-E_{out}(g)\mid>\epsilon]\\!]\leq 4(2N)^{d_{vc}}e^{-\frac{1}{8}\epsilon^2N}\)

\(\Longleftrightarrow\mathbb{P}_{\mathbb{D}}[\\![\mid E_{in}(g)-E_{out}(g)\mid\leq\epsilon]\\!]\geq 1-\delta\) , where \(\delta=4(2N)^{d_{vc}}e^{-\frac{1}{8}\epsilon^2N}\)

有很高的機率 ( \(\geq1-\delta\) )，\(E_{in}\) 與 \(E_{out}\) 會很接近 ( \(\mid E_{in}(g)-E_{out}(g)\mid\leq\epsilon\) )

\(\because\delta=4(2N)^{d_{vc}}e^{-\frac{1}{8}\epsilon^2N}\) \(\therefore\epsilon=\sqrt{\frac{8}{N}\ln(\frac{4(2N)^{d_vc}}{\delta})}\)

\(\therefore\ \mid E_{in}(g)-E_{out}(g)\mid\leq\epsilon=\sqrt{\frac{8}{N}\ln(\frac{4(2N)^{d_vc}}{\delta})}\overset{defn}{=}\Omega(N,\mathbb{H},\delta)=penalty\ for\ model\)

\(\Longrightarrow \begin{matrix} \underbrace{ E_{in}(g)-\sqrt{\frac{8}{N}\ln(\frac{4(2N)^{d_vc}}{\delta})} } \\ We\ don't\ care\ this\ part\end{matrix}\leq E_{out}(g)\leq E_{in}(g)+\sqrt{\frac{8}{N}\ln(\frac{4(2N)^{d_vc}}{\delta})}\)

\(\overset{High\ Prob.}{\Longrightarrow}E_{out}(g)\leq E_{in}(g)+\sqrt{\frac{8}{N}\ln(\frac{4(2N)^{d_vc}}{\delta})}\)

從上面推導出的不等式，配合 \(Error - d_{vc}\) 圖來看，我們會發現最好的解出現在中間。

註釋

\(\forall\ \mathbb{Y}=(y_0,y_1,...,y_d)\ where\ y_i\in\left\{+1,-1\right\}\) \(\exists\ \mathbb{W}=(y_1-y_0,y_2-y_0,...,y_d-y_0)\) \(s.t.\ h(\mathbb{X}_i)=Sign(\mathbb{W}^T \mathbb{X}_i + y_0)=Sign(y_i-y_0+y_0)=y_i\ ,\ \forall i=0,1,...,d\) \(\implies\)無論這個 \(\mathbb{D}\) 怎麼分類 ( 意即不管 \(\mathbb{X}_i\) 對應到怎樣的 \(y_i\))，我們都能找到一個 \(h\) \(\implies\) \(\mathbb{D}\)可以被\(\mathbb{H}\) shatter \(d_{vc}(\mathbb{H})\geq d+1\)

\(Claim:d_{vc}\leq d+1\implies\mathbb{D}\ with\ size-d+2\ cannot\ be\ shatter\) \(Suppose\ to\ the\ contrary\ that\) \(\exists \mathbb{D}=\left\{\mathbb{X}_0,\mathbb{X}_1,...,\mathbb{X}_d,\mathbb{X}_{d+1},\mathbb{X}_{d+2}\right\}\ can\ be\ shattered\)

\(\because\mathbb{D}\subseteq\mathbb{R}^d\) \(\therefore \mathbb{X}_i=\sum\limits_{j\neq i}a_j\mathbb{X}_j\) \(\implies\mathbb{W}^T \mathbb{X}_i=\sum\limits_{j\neq i}a_j\mathbb{W}^T\mathbb{X}_j\)

\(\because \mathbb{D}\ can\ be\ shattered\) \(\therefore \forall\ \mathbb{Y}=\left\{y_1,...,y_{d+2}\right\}\subseteq\left\{+1,-1\right\}\) (不管 \(\mathbb{X}_k\) 怎麼被分類) \(\exists\ \mathbb{W}\ s.t.\ y_k=Sign(\mathbb{W}^T\mathbb{X}_k)\) (我們都可以找到 \(\mathbb{W}\) 來滿足)

那我們指定一個特殊的分類 : \(\mathbb{X}_i=y_i=-1\) 且 \(\mathbb{X}_j=y_j=Sign(a_j)\ ,\ \forall j\neq i\) \(\implies \mathbb{W}^T\mathbb{X}<0\ and\ (\mathbb{W}^T\mathbb{X}_j)\cdot(Sign(a_j))>0\) (兩者同號) \(\implies\mathbb{W}^T\mathbb{X}=\sum\limits _{j\neq i}a_j\mathbb{W}^T\mathbb{X}_j>0\) (Contradiction!)

\(\therefore\forall\mathbb{D}\ with\ size-d+2\ cannot\ be\ shattered\) \(\implies d_{vc}(\mathbb{H})\leq d+1\)

從 Linear Algebra 的角度來看，我們可以如下定義出「dimension」: \(V\ is\ a\ vector\ space\ ,dim(V)=the\ cardinal\ number\ of\ B\) \(where\ B\ is\ a\ basis\ of\ V\ i.e.\ V=span(B)\ and\ B\ is\ a\ linear\ independent\ set.\) 對比到我們的hypothesis set \(\mathbb{H}\)，之前我們才說過，\(\mathbb{H}\) 可以唯一由 \(\left\{w_0,w_1,...,w_d\right\}\) 決定，且 \(w_i\)之間互相獨立，如若我們把 VC dimension 看成是 \(\mathbb{H}\) 這個空間的維度，那麼也可以把 \(\left\{w_0,w_1,...,w_d\right\}\) 看成是他的一組基底，那麼 VC dimension很明顯的就是 \(d+1\)。其實意義上跟課程的這個圖是一樣的。↩︎
試證明 \(VC\ dimension=d+1\) <p.f> \(Claim:d_{vc}\geq d+1\) \(Assume\ \mathbb{D}=\left\{\mathbb{X}_0,\mathbb{X}_1,...,\mathbb{X}_d\right\}\) \(where\ \mathbb{X}_0=(0,0,...,0)\) \(\mathbb{X}_1=(1,0,...,0)\) \(\vdots\) \(\mathbb{X}_d=(0,0,...,1)\)↩︎