
DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter

Abstract

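Transfer learning from large-scale pre-trained models is now used across NLP, but training and running inference with such large models under constrained resources remains a challenge. This paper proposes DistilBERT, a small yet powerful language representation model.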

Most prior work on distillation limited its scope to task-specific models. This work instead applies knowledge distillation during pre-training, reducing model size by 40% relative to BERT while retaining 97% of its performance and running 60% faster.

To make good use of the inductive biases learned by the large model during pre-training, it uses a triple loss that combines language modeling, distillation, and cosine-distance losses.
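As a rough illustration of how the three terms might fit together, the sketch below computes a cosine-distance term from one pair of hidden-state vectors and adds it to the other two terms as a weighted sum. The function names, the per-vector formulation, and the equal default weights are assumptions made for this example; the notes above do not state the weights used in the paper.

```python
import math

def cosine_distance_loss(student_hidden, teacher_hidden):
    """1 - cos(student, teacher): 0 when the two vectors point the same way."""
    dot = sum(s * t for s, t in zip(student_hidden, teacher_hidden))
    norm_s = math.sqrt(sum(s * s for s in student_hidden))
    norm_t = math.sqrt(sum(t * t for t in teacher_hidden))
    return 1.0 - dot / (norm_s * norm_t)

def triple_loss(mlm_loss, distill_loss, cos_loss,
                w_mlm=1.0, w_distill=1.0, w_cos=1.0):
    """Weighted sum of the three terms. The weights are hypothetical defaults."""
    return w_mlm * mlm_loss + w_distill * distill_loss + w_cos * cos_loss

# Example with made-up scalar values for the language modeling and
# distillation terms, and a cosine term computed from toy hidden states.
cos_term = cosine_distance_loss([1.0, 0.0], [0.0, 1.0])
total = triple_loss(mlm_loss=2.0, distill_loss=1.5, cos_loss=cos_term)
```

The cosine term pushes the student's hidden states to point in the same direction as the teacher's, which complements the two output-level terms.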

Because DistilBERT is small, fast, and strong, it is practical to pre-train directly and can run on-device.

Introduction


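Over the past two years, transfer learning from large-scale pre-trained models has taken off in NLP and has become the default approach. As the graph on the left also shows, ever-larger models have driven ever-better performance, and large-scale pre-trained models have made considerable progress.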

This recent trend, which can be summed up as "The Larger, The Better", yields strong performance, but its computational cost is so high that these models are hard to deploy broadly in real-time services.

This paper shows that a relatively small model built through knowledge distillation still performs well on downstream tasks. With this model, performance close to that of existing large models is available on a variety of tasks at low computational cost.

We also show that our compressed models are small enough to run on the edge, e.g. on mobile devices. Using a triple loss, we show that a 40% smaller Transformer pre-trained through distillation via the supervision of a bigger Transformer language model can achieve similar performance on a variety of downstream tasks, while being 60% faster at inference time. Further ablation studies indicate that all the components of the triple loss are important for best performances. We have made the trained weights available along with the training code in the Transformers library from HuggingFace.

Knowledge Distillation


Knowledge distillation is a model compression technique in which a small model (the student) is trained to reproduce the outputs of a large model (the teacher), much as a student receives knowledge from a teacher.
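To make "learning the teacher's outputs" concrete, here is a minimal sketch of one common formulation: the student is penalized by the cross-entropy between the teacher's and the student's temperature-softened output distributions. The function names and the temperature value of 2.0 are assumptions chosen for illustration, not settings taken from these notes.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax with temperature; a higher temperature gives a flatter distribution."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """Cross-entropy of the student's soft predictions against the teacher's."""
    teacher_probs = softmax(teacher_logits, temperature)
    student_probs = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(teacher_probs, student_probs))

# Toy logits over a three-word vocabulary (made-up values).
teacher = [2.0, 1.0, 0.1]
agreeing_student = [2.0, 1.0, 0.1]
disagreeing_student = [0.1, 1.0, 2.0]

loss_agree = distillation_loss(teacher, agreeing_student)
loss_disagree = distillation_loss(teacher, disagreeing_student)
```

The loss is smallest when the student's distribution matches the teacher's, so `loss_agree` comes out lower than `loss_disagree`. Unlike a one-hot label, the soft targets also tell the student how the teacher ranks the incorrect classes.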