CS336 Notes: Lecture 14 - Data 2
Data filtering and deduplication at scale: n-gram language models, fastText classifiers, importance sampling, MinHash, LSH, and Bloom filters for efficient web-scale processing.
Training data for LLMs: Common Crawl processing, quality filtering, the evolution of data pipelines from BERT to modern models, and the critical role of copyright and licensing.