Self-supervised learning: The dark matter of intelligence

May 11, 2021

Source: Facebook AI Blog
Authors: Yann LeCun, Ishan Misra

Introduction

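Although the piece covered here is not a paper, its first author is a Turing Award winner. He surveys how self-supervised learning (SSL) has developed across the subfields of deep learning and which directions look promising. The survey covers SSL-based NLP work (for example the Collobert-Weston 2008 model, Word2Vec, GloVe, fastText, and more recently the very popular BERT, RoBERTa, and XLM-R), the bottlenecks SSL faces on vision tasks, and the authors' latest research, SEER, which uses SwAV and other methods to pretrain a large network on one billion random unlabeled images.

The article also explains why SSL matters: it is a research direction that could make models less of a black box while improving how well they generalize. The core problem SSL addresses is that supervised learning depends on large amounts of labeled data to push up the score on a given task. That dependence limits the progress of AI, because we cannot label all the data for every task; doing so would consume far too many resources. The hope is that future AI can generalize on its own from a few examples and truly learn what the data means (common sense). This is the essence of SSL and what brings it closer to human intelligence, and common sense is the "dark matter" in the article's title.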

Self-supervised learning

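The main goal of self-supervised learning is to use the information contained in the data itself to predict hidden content or to understand higher-level semantics. In NLP tasks it is used to predict hidden words or sentences; in video, a model can learn from the current frames to predict hidden past frames. The figure at this point in the original post illustrates this: the frames in the green blocks are used to predict the hidden frames in the gray blocks.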
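To make "the labels come from the data itself" concrete, here is a minimal sketch of how a masked-word training example could be built from a raw sentence. The function name `make_masked_examples`, the `[MASK]` token, and the sample sentence are illustrative assumptions, not taken from the article.

```python
import random

def make_masked_examples(tokens, mask_token="[MASK]", n_masks=2, seed=0):
    """Hide n_masks tokens; return (masked_tokens, targets).

    targets maps each hidden position to the word that was there, so the
    training label is derived from the sentence itself, not from a human.
    """
    rng = random.Random(seed)
    positions = sorted(rng.sample(range(len(tokens)), n_masks))
    masked = list(tokens)
    targets = {}
    for pos in positions:
        targets[pos] = masked[pos]
        masked[pos] = mask_token
    return masked, targets

# Hypothetical example sentence
sentence = "the cheetah chases its prey across the savanna".split()
masked, targets = make_masked_examples(sentence)
print(masked)   # the observed part (input)
print(targets)  # the hidden part (what the model must predict)
```

A model trained on such pairs never sees a human-written label; the hidden words play that role.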

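One point needs clarifying here: self-supervised learning is not unsupervised learning. The name "unsupervised learning" is itself misleading, because it suggests learning with no supervision at all. SSL is still supervised; the supervisory signal simply comes from the feedback contained in the data itself rather than from labels provided by people.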

Self-supervised learning for language versus vision

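Predicting unknown information from known information is the classic SSL setting, and NLP can already do it. In vision the bottleneck is this: with text, the surrounding context is enough to assign a probability to each candidate for the missing piece, but assigning a probability to a candidate video frame first requires understanding what the image shows (is the animal a cheetah or a tiger?), and the predicted frame must also look consistent with the frames before and after it. This is the complexity barrier that images run into. Facebook AI Research (FAIR) recently proposed RegNets, a family of models whose size can be scaled so that a model can be designed for the parameter budget you need, and has also proposed SSL training methods such as SwAV.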
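The reason this works for text is that the vocabulary is finite, so a model can score every candidate and normalize the scores into probabilities. A minimal sketch follows; the candidate words and their scores are made-up numbers, standing in for what a trained model might output for "The [MASK] chases its prey".

```python
import math

def softmax(scores):
    """Turn a dict of raw scores into a probability distribution."""
    m = max(scores.values())  # subtract the max for numerical stability
    exps = {w: math.exp(s - m) for w, s in scores.items()}
    total = sum(exps.values())
    return {w: e / total for w, e in exps.items()}

# Hypothetical scores over a tiny vocabulary
scores = {"cheetah": 2.0, "lion": 1.5, "cat": 0.5, "table": -3.0}
probs = softmax(scores)
for word, p in sorted(probs.items(), key=lambda kv: -kv[1]):
    print(f"{word}: {p:.3f}")
```

No comparable enumeration exists for video frames: the set of possible images is far too large to list and normalize over, which is the bottleneck described above.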

A unified view of self-supervised methods

An energy-based model (EBM) measures the compatibility between an observation x and a proposed prediction y. If x and y are compatible, the energy is a small number; if they are incompatible, the energy is a larger number.

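An EBM (energy-based model) is a framework that can describe SSL. It compares x (the known observation) with y (the prediction) to see whether the two are compatible: if they are, the output F(x, y) is small; otherwise it is large. Following the description above, training an EBM means satisfying two conditions:
(1) Compatible pairs (data of the same kind) should receive low energy.
(2) Incompatible pairs (data of different kinds) should receive high energy.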

For example, in image recognition, an image of a car and a rotated copy of that same image should still belong to the same class, so the pair should receive low energy.
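The car example can be sketched with a toy energy function. Everything here is an assumption made for illustration: the "images" are tiny pixel grids, and the feature (the sorted pixel values) is hand-picked to be rotation-invariant rather than learned, which is not how a real EBM would obtain it.

```python
def rotate90(img):
    """Rotate a 2-D list of pixels by 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]

def features(img):
    """Hand-picked rotation-invariant feature: the sorted pixel values."""
    return sorted(p for row in img for p in row)

def energy(x, y):
    """F(x, y): squared distance between features (low means compatible)."""
    return sum((a - b) ** 2 for a, b in zip(features(x), features(y)))

car = [[0, 9, 9],
       [9, 9, 9],
       [3, 0, 3]]
other = [[5, 5, 5],
         [1, 1, 1],
         [5, 5, 5]]

print("car vs rotated car:", energy(car, rotate90(car)))  # compatible pair
print("car vs other image:", energy(car, other))          # incompatible pair
```

The rotated copy gets the lowest possible energy, while the unrelated image gets a strictly higher one, which is exactly conditions (1) and (2).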

Joint embedding, Siamese networks

Joint embedding architecture. The function C at the top produces a scalar energy that measures the distance between the representation vectors (embeddings) produced by two identical twin networks sharing the same parameters (w). When x and y are slightly different versions of the same image, the system is trained to produce a low energy, which forces the model to produce similar embedding vectors for the two images. The difficult part is to train the model so that it produces high energy (i.e., different embeddings) for images that are different.

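Another architecture that has recently been revived is joint embedding, an idea that goes back to work from Geoff Hinton's lab and Yann LeCun's group around the 1990s and 2000s. The model is split into two branches: x is the input image and y is x after some rotation or scaling. When the two images are each passed through the encoder, the resulting embedding features should be similar. The hard part is that this goal is trivially satisfied if the network ignores its input and always outputs the same embedding, whether or not the input images are similar. To make sure different images produce high energy, two strategies can be used:
(1) Contrastive methods
(2) Regularization methods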
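The trivial solution mentioned above (often called collapse) is easy to reproduce in a toy setting. In this sketch the twin networks are a single shared linear layer and the energy is the squared distance between the two embeddings; the weights and inputs are made-up values chosen only to show the effect.

```python
def encode(w, x):
    """Shared encoder: one linear layer, embedding = w @ x."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def energy(w, x, y):
    """Squared Euclidean distance between the embeddings of x and y.

    Both branches use the same weights w, as in a Siamese network.
    """
    hx, hy = encode(w, x), encode(w, y)
    return sum((a - b) ** 2 for a, b in zip(hx, hy))

w_ok = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]         # keeps some information
w_collapsed = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]  # ignores the input

x = [1.0, 2.0, 3.0]
x_aug = [1.1, 1.9, 3.0]   # slightly distorted version of x
other = [-4.0, 0.5, 2.0]  # a different input

for name, w in [("ok", w_ok), ("collapsed", w_collapsed)]:
    print(name, energy(w, x, x_aug), energy(w, x, other))
```

The collapsed encoder gives low energy to the similar pair, as required, but also to the different pair, so training only on "make similar pairs close" cannot rule it out.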

Contrastive energy-based SSL

Training an EBM with a contrastive method consists in simultaneously pushing down on the energy of compatible (x,y) pairs from the training set, indicated by the blue dots, and pushing up on the energy of well chosen (x,y) pairs that are incompatible, symbolized by the green dots. In this simple example, x and y are both scalars, but in real situations, x and y could be an image or a video with millions of dimensions. Coming up with incompatible pairs that will shape the energy in suitable ways is challenging and expensive computationally.

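Contrastive methods use incompatible pairs to update the parameters so that the features of different kinds of data are pushed further apart; in the figure, the two kinds are the green dots and the blue dots. This works well on NLP tasks. For example, we can delete a few words from a sentence (the corrupted input) and require that the corrupted input and the original text have high energy. At the same time, we train a model to restore the corrupted input to the original sentence (the reconstructed input), and the reconstructed input must closely resemble the original text.

In practice, though, this approach is really only suited to NLP tasks. For images, a GAN-style task needs a large number of parameters to achieve this, because we cannot enumerate all incompatible images: there are far too many ways in which two images can differ, unlike a sentence where we can simply delete a few key words. Requiring, on top of that, that the tampered incompatible images become compatible again after reconstruction makes the problem even harder.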
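The "push down on compatible pairs, push up on incompatible pairs" idea can be written as a loss. The sketch below uses one common margin-based (hinge) form, which is an assumption on my part and not necessarily the loss used in the article; the scalar pairs and the two candidate energy functions are also made up, mirroring the scalar x and y of the figure.

```python
def contrastive_loss(energy, positives, negatives, margin=1.0):
    """Sum of energies on compatible pairs plus a hinge on incompatible ones.

    Compatible pairs are pushed toward zero energy; incompatible pairs are
    pushed up until their energy reaches the margin.
    """
    pos_term = sum(energy(x, y) for x, y in positives)
    neg_term = sum(max(0.0, margin - energy(x, y)) for x, y in negatives)
    return pos_term + neg_term

def good_energy(x, y):
    """Low energy only near the line y = 2x."""
    return (y - 2.0 * x) ** 2

def flat_energy(x, y):
    """Collapsed model: zero energy everywhere."""
    return 0.0

positives = [(0.0, 0.0), (0.5, 1.0), (1.0, 2.0)]   # points on y = 2x
negatives = [(0.0, 2.0), (0.5, -1.0), (1.0, 0.0)]  # chosen off the line

print("good:", contrastive_loss(good_energy, positives, negatives))
print("flat:", contrastive_loss(flat_energy, positives, negatives))
```

The flat energy function is penalized once per negative pair, so the contrastive term is what rules out the collapsed solution. The cost is that someone has to supply those negative pairs, which is the part that becomes hard for images.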

A latent-variable predictive architecture. Given an observation x, the model must be able to produce a set of multiple compatible predictions symbolized by an S-shaped ribbon in the diagram. As the latent variable z varies within a set, symbolized by a gray square, the output varies over the set of plausible predictions.

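Hard as it looks, people are actively working on this. In the figure above, z is a latent variable within the gray domain; it can be understood as a random variable in some variable space. x is the input. After x passes through the model to produce the embedding features h, we combine h with the regularized z from the latent space to obtain the output y. The similarity to a GAN is that a GAN can also be given a random variable z to generate different y; the difference is that here x is given as well. Because SSL aims to learn higher-level semantic features, the h trained this way can represent such features to some extent. Combining h with the random variable z then generates a variety of outputs y from the same domain, the twisted region in the figure, and we can use cross-entropy to measure the difference from the ground truth and obtain a loss. In short, Dec is like the generator of a GAN and C is like its discriminator.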
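Here is a toy sketch of the latent-variable idea with scalars. The encoder, the additive decoder, and the restriction of z to a small grid are all illustrative assumptions; in particular, limiting z to a fixed range stands in for the regularization of the latent variable, and the squared error is used instead of cross-entropy to keep the example short.

```python
def encoder(x):
    """Hypothetical predictor: maps the observation x to a representation h."""
    return 2.0 * x

def decoder(h, z):
    """Hypothetical Dec: combines h with the latent variable z."""
    return h + z

# z may only vary inside a limited set (the "gray square" in the figure).
Z_SET = [i / 10.0 for i in range(-10, 11)]  # z in [-1, 1]

def predictions(x):
    """All outputs y reachable from x as z varies over its set."""
    h = encoder(x)
    return [decoder(h, z) for z in Z_SET]

def energy(x, y):
    """F(x, y) = min over z of the squared error between y and Dec(h, z)."""
    h = encoder(x)
    return min((y - decoder(h, z)) ** 2 for z in Z_SET)

x = 1.0
print(min(predictions(x)), max(predictions(x)))  # range of plausible y
print(energy(x, 2.5))  # y inside the plausible set
print(energy(x, 5.0))  # y outside the plausible set
```

Because z can only move within a limited set, x still determines where the plausible outputs lie. If z were unrestricted, every y would reach zero energy and the model would say nothing about x.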

Non-contrastive energy-based SSL

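Using non-contrastive methods in the joint embedding architecture is probably the hottest approach for applying SSL to vision tasks right now. Well-known methods include DeeperCluster, ClusterFit, MoCo-v2, SwAV, SimSiam, Barlow Twins, and BYOL, each of which takes a different route. Methods such as DeeperCluster, SwAV, and SimSiam do not need the negative samples that contrastive methods require for training. Non-contrastive methods instead compute the similarity between the embedding features of same-class data: the higher the similarity, the more likely the inputs belong to the same class. The underlying principle is roughly what the figure above expresses: rather than letting a latent space alone decide what the output is, as in a GAN, an extra R (regularizer) is used to regularize the information in that space, so that x is what drives Dec. This serves SSL's goal of obtaining a general-purpose feature h.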
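As one example of avoiding collapse without negative samples, below is a pure-Python sketch of the Barlow Twins objective as I understand it from its paper: the cross-correlation matrix between the two views' embeddings is pushed toward the identity. The tiny batches and the weight `lam` are illustrative values, and a real implementation would use a tensor library rather than nested lists.

```python
import math

def standardize(batch, eps=1e-12):
    """Zero-mean, unit-variance per embedding dimension across the batch.

    Returns a list of d columns, each of length n.
    """
    n, d = len(batch), len(batch[0])
    cols = []
    for j in range(d):
        col = [row[j] for row in batch]
        mean = sum(col) / n
        std = math.sqrt(sum((v - mean) ** 2 for v in col) / n)
        cols.append([(v - mean) / (std + eps) for v in col])
    return cols

def barlow_twins_loss(za, zb, lam=0.005):
    """Diagonal of the cross-correlation -> 1, off-diagonal -> 0."""
    ca, cb = standardize(za), standardize(zb)
    n, d = len(za), len(ca)
    loss = 0.0
    for i in range(d):
        for j in range(d):
            c_ij = sum(a * b for a, b in zip(ca[i], cb[j])) / n
            loss += (1.0 - c_ij) ** 2 if i == j else lam * c_ij ** 2
    return loss

# Embeddings of two views of the same four images (made-up numbers).
informative = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]
collapsed = [[0.5, 0.5]] * 4  # every image maps to the same vector

print("informative:", barlow_twins_loss(informative, informative))
print("collapsed:  ", barlow_twins_loss(collapsed, collapsed))
```

The collapsed embeddings have no variance across the batch, so their correlations are zero and the diagonal term penalizes them. Only the two views of the same images are needed; no negative pairs appear anywhere.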

Conclusion

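SSL is an approach that genuinely learns background knowledge and, through it, the common sense contained in the data. Among all these methods, I think that if vision is to see further breakthroughs, a promising direction may be to use the latent variable to produce an effect similar to data augmentation on the input x, so that Dec can compute a loss over images from the whole domain and h ends up carrying better feature information. Facebook AI has also been working hard on this area and has recently integrated many of these methods in the following GitHub repository: https://github.com/facebookresearch/vissl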

References

https://ai.facebook.com/blog/self-supervised-learning-the-dark-matter-of-intelligence/
https://www.youtube.com/watch?v=Ag1bw8MfHGQ