- OpenWebText2 - Read the Docs
OpenWebText2 is an enhanced version of the original OpenWebTextCorpus covering all Reddit submissions from 2005 up until April 2020, with further months becoming available after the corresponding PushShift dump files are released.
- EleutherAI openwebtext2 - GitHub
Very briefly, OpenWebText2 is a large filtered dataset of text documents scraped from URLs found on Reddit submissions. The plug and play version of OpenWebText2 contains: 17,103,059 documents; 65.86 GB uncompressed text.
- Skylion007 openwebtext · Datasets at Hugging Face
The viewer is disabled because this dataset repo requires arbitrary Python code execution. Please consider removing the loading script and relying on automated data support (you can use convert_to_parquet from the datasets library). If this is not possible, please open a discussion for direct help.
- Download - OpenWebTextCorpus
Download Summary: Today we’re announcing the release of a beta version of Open WebText, an open source effort to reproduce OpenAI’s WebText dataset. This distribution was created by Aaron Gokaslan and Vanya Cohen of Brown University.
- OpenWebText2 - Eleuther AI site
OpenWebText2 is an enhanced version of the original OpenWebTextCorpus covering all Reddit submissions from 2005 up until April 2020, with further months becoming available after the corresponding PushShift dump files are released.
- OpenWebText2 - EleutherAI
OpenWebText2 is an enhanced version of the original OpenWebTextCorpus, covering all Reddit submissions from 2005 up until April 2020. It was developed primarily to be included in the Pile.
- Papers with Code - OWT2 Dataset
(5) OpenWebText2, EleutherAI, https://www.eleuther.ai/artifacts/openwebtext2. OpenWebText2 is an enhanced version of the original OpenWebTextCorpus. It encompasses all Reddit submissions from 2005 up until April 2020, with additional months becoming available after the corresponding PushShift dump files are released.
- WebText Background - OpenWebText2 - Read the Docs
OpenWebText2 Motivation. Our primary goals for the corpus are: more data (coverage of the original OpenWebTextCorpus ended at December 2017); include all languages, providing metadata for easy filtering; provide several versions of the generated corpus for differing user requirements. Both versions will be broken up by month and frozen, with …