This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
The Optimal BERT Surgeon: Scalable and Accurate Second-Order Pruning for Large Language Models
Citations: 2
Authors: 8
Year: 2022
Abstract
Transformer-based language models have become a key building block for natural language processing. While these models are extremely accurate, they can be too large and computationally intensive to run on standard deployments. A variety of compression methods, including distillation, quantization, and structured and unstructured pruning, are known to decrease model size and increase inference speed, with low accuracy loss. In this context, this paper's contributions are two-fold. First, we perform an in-depth study of the accuracy-compression trade-off for unstructured weight pruning of BERT models. We introduce Optimal BERT Surgeon (oBERT), an efficient and accurate weight pruning method based on approximate second-order information, which we show to yield state-of-the-art results in both stages of language tasks: pre-training and fine-tuning. Specifically, oBERT extends existing work on unstructured second-order pruning by allowing for pruning blocks of weights, and by being applicable at the BERT scale. Second, we investigate the impact of this pruning method when compounding compression approaches to obtain highly compressed but accurate models for deployment on edge devices. These models significantly push the boundaries of the current state-of-the-art sparse BERT models with respect to all metrics: model size, inference speed and task accuracy. For example, relative to the dense BERT-base, we obtain 10x model size compression (in MB) with < 1% accuracy drop, 10x CPU-inference speedup with < 2% accuracy drop, and 29x CPU-inference speedup with < 7.5% accuracy drop. Our code, fully integrated with Transformers and SparseML, is available at https://github.com/neuralmagic/sparseml/tree/main/research/optimal_BERT_surgeon_oBERT.
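To make "second-order pruning" concrete, the following is a minimal sketch of the classical Optimal Brain Surgeon step that this line of work builds on. It is not the oBERT implementation. The abstract states only that oBERT uses approximate second-order information, supports pruning blocks of weights, and scales to BERT. The function name `obs_prune_one` and the small dense inverse-Hessian matrix `h_inv` are hypothetical, chosen for illustration, and the sketch assumes an inverse-Hessian approximation is already available.

```python
def obs_prune_one(weights, h_inv):
    """Zero out the single weight with the lowest OBS saliency.

    weights: list of floats (a flattened weight vector).
    h_inv:   assumed inverse Hessian approximation, as a symmetric
             positive-definite list of lists (hypothetical dense form).
    Returns (pruned_index, updated_weights).
    """
    n = len(weights)
    # Saliency rho_i = w_i^2 / (2 * [H^-1]_ii): the estimated loss
    # increase from removing weight i under a quadratic loss model.
    saliency = [weights[i] ** 2 / (2.0 * h_inv[i][i]) for i in range(n)]
    q = min(range(n), key=lambda i: saliency[i])
    # Compensating update for the remaining weights:
    # delta_w = -(w_q / [H^-1]_qq) * H^-1[:, q]
    scale = weights[q] / h_inv[q][q]
    updated = [weights[j] - scale * h_inv[j][q] for j in range(n)]
    updated[q] = 0.0  # force an exact zero despite rounding
    return q, updated


# Toy example with three weights and a hand-written inverse Hessian.
w = [1.0, 0.1, -0.5]
h_inv = [
    [2.0, 0.5, 0.0],
    [0.5, 1.0, 0.0],
    [0.0, 0.0, 4.0],
]
pruned_index, new_w = obs_prune_one(w, h_inv)
print(pruned_index, new_w)
```

In this toy case the second weight has the lowest saliency, so it is removed, and the first weight is adjusted because the two are coupled through the off-diagonal entry of `h_inv`. A dense inverse Hessian is infeasible at BERT scale, which is why the abstract emphasizes approximate second-order information.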