Automatic Document Selection for Efficient Encoder Pretraining

Yukun Feng, Patrick Xia, Benjamin Van Durme, Joao Sedoc

Artificial Intelligence And Data Science · Non-peer-reviewed Preprint

Published 2022-10-20

Abstract

Building pretrained language models is considered expensive and data-intensive, but must we increase dataset size to achieve better performance? We propose an alternative to larger training sets by automatically identifying smaller yet domain-representative subsets. We extend Cynical Data Selection, a statistical sentence scoring method that conditions on a representative target domain corpus. As an example, we treat the OntoNotes corpus as a target domain and pretrain a RoBERTa-like encoder from a cynically selected subset of the Pile. On both perplexity and across several downstream tasks in the target domain, it consistently outperforms random selection with 20x less data, 3x fewer training iterations, and 2x less estimated cloud compute cost, validating the recipe of automatic document selection for LM pretraining.
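To make the idea of selecting a domain-representative subset concrete, here is a minimal sketch of greedy, cross-entropy-driven document selection. It is not the authors' implementation or the exact Cynical Data Selection algorithm. It assumes whitespace tokenization, a unigram model with add-alpha smoothing, and a brute-force search that rescoring every candidate at each step. The function names `cross_entropy` and `greedy_select` and the toy corpora are hypothetical, chosen only for illustration.

```python
import math
from collections import Counter


def cross_entropy(target_counts, selected_counts, vocab_size, alpha=1.0):
    """Cross-entropy (nats per token) of the target corpus under an
    add-alpha smoothed unigram model estimated from the selected text."""
    total_selected = sum(selected_counts.values())
    total_target = sum(target_counts.values())
    denom = total_selected + alpha * vocab_size
    h = 0.0
    for word, count in target_counts.items():
        prob = (selected_counts.get(word, 0) + alpha) / denom
        h -= (count / total_target) * math.log(prob)
    return h


def greedy_select(target_docs, pool_docs, budget):
    """Greedily pick up to `budget` pool documents, each time adding the one
    that yields the lowest cross-entropy of the target corpus.
    Returns the chosen pool indices in selection order."""
    target_counts = Counter(w for d in target_docs for w in d.split())
    pool_counts = [Counter(d.split()) for d in pool_docs]

    # Shared vocabulary over target and pool (assumption: no OOV handling).
    vocab = set(target_counts)
    for counts in pool_counts:
        vocab.update(counts)
    vocab_size = len(vocab)

    selected_counts = Counter()
    remaining = set(range(len(pool_docs)))
    chosen = []
    while remaining and len(chosen) < budget:
        best_idx, best_h = None, math.inf
        for i in sorted(remaining):  # sorted for deterministic tie-breaking
            h = cross_entropy(
                target_counts, selected_counts + pool_counts[i], vocab_size
            )
            if h < best_h:
                best_idx, best_h = i, h
        chosen.append(best_idx)
        selected_counts += pool_counts[best_idx]
        remaining.remove(best_idx)
    return chosen


# Toy data (hypothetical): a news-like target and a mixed pool.
target = ["the senator spoke to reporters", "reporters asked the senator"]
pool = [
    "the senator met reporters today",
    "buy cheap pills online now",
    "def foo return bar",
]
first_pick = greedy_select(target, pool, budget=1)
```

On this toy data the first document chosen is the news-like one, since it shares the most vocabulary with the target. The brute-force loop is quadratic in the pool size, so a corpus on the scale of the Pile would need an incremental scoring scheme rather than this direct recomputation.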

Keywords

Artificial Intelligence & Data Science

Paper Details

Authors Yukun Feng, Patrick Xia, Benjamin Van Durme, Joao Sedoc
Published 2022-10-20
Category Artificial Intelligence And Data Science
Status Non-peer-reviewed Preprint
Language English
Word Count 119
