tokenizers · PyPI: Train new vocabularies and tokenize using four pre-made tokenizers (BERT WordPiece and the three most common BPE variants). Extremely fast for both training and tokenization, thanks to the Rust implementation.
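A minimal training sketch under stated assumptions: the tokenizers Python package is installed, and "corpus.txt" is a placeholder for your own plain-text corpus (the file name and hyperparameter values are illustrative, not prescribed by the library).

```python
from tokenizers import BertWordPieceTokenizer

# One of the four pre-made tokenizers; the others cover the common BPE
# variants (ByteLevelBPETokenizer, CharBPETokenizer, SentencePieceBPETokenizer).
tokenizer = BertWordPieceTokenizer(lowercase=True)

# "corpus.txt" is a placeholder path; vocab_size/min_frequency are example values.
tokenizer.train(files=["corpus.txt"], vocab_size=30_000, min_frequency=2)

# Serialize the whole pipeline (vocab, normalizer, pre-tokenizer) to one JSON file.
tokenizer.save("tokenizer.json")

print(tokenizer.encode("Training new vocabularies is fast.").tokens)
```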
The Tokenizers toolkit - Hugging Face: The fast tokenizers' single-file serialization format is incompatible with "slow" tokenizers (those not powered by the tokenizers library), so a tokenizer saved in it cannot be loaded by the corresponding "slow" tokenizer class.
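A sketch of the practical consequence, assuming the transformers library is installed and the bert-base-uncased checkpoint is downloadable; the output directory name is illustrative (all assumptions beyond the quoted docs):

```python
from transformers import AutoTokenizer

# The same checkpoint can back a fast (Rust) or slow (pure-Python) tokenizer.
fast = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=True)
slow = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=False)
print(type(fast).__name__)  # BertTokenizerFast
print(type(slow).__name__)  # BertTokenizer

# With legacy_format=False, only the fast single-file tokenizer.json is
# written; a directory saved this way cannot be reloaded with use_fast=False.
fast.save_pretrained("./bert-fast-only", legacy_format=False)
```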
GitHub - huggingface/tokenizers: Fast State-of-the-Art Tokenizers: Takes less than 20 seconds to tokenize a gigabyte of text on a server's CPU. Easy to use, but also extremely versatile; designed for both research and production. Normalization comes with alignment tracking: it is always possible to recover the part of the original sentence that corresponds to a given token.
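A short sketch of the alignment tracking, assuming tokenizers can fetch the bert-base-uncased tokenizer from the Hugging Face Hub (the checkpoint name and network access are assumptions):

```python
from tokenizers import Tokenizer

tok = Tokenizer.from_pretrained("bert-base-uncased")
text = "Héllo, tokenizers!"
enc = tok.encode(text)

# offsets are (start, end) character spans into the ORIGINAL string, even
# though normalization lowercased the text and stripped the accent;
# special tokens like [CLS]/[SEP] carry the empty span (0, 0).
for token, (start, end) in zip(enc.tokens, enc.offsets):
    print(f"{token!r} -> {text[start:end]!r}")
```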