SlovakBERT: Slovak Masked Language Model - ACL Anthology
Samples were limited to a maximum of 512 tokens, and for each sample we fit as many full sentences as possible. We used the Adam optimization algorithm (Kingma and Ba, 2015) with a 5 × 10^-4 learning rate and 10k warmup steps.
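The sample construction described in this snippet (a 512-token cap, filled with as many full sentences as possible) can be read as a greedy packing step. The following is a minimal sketch under stated assumptions, not the authors' code: the function name `pack_sentences` is hypothetical, token counts come from whitespace splitting as a stand-in for the model's real tokenizer, and sentences longer than the cap are simply skipped.

```python
from typing import Callable, List


def pack_sentences(
    sentences: List[str],
    max_tokens: int = 512,
    count_tokens: Callable[[str], int] = lambda s: len(s.split()),
) -> List[str]:
    """Greedily pack whole sentences into samples of at most max_tokens tokens."""
    samples: List[str] = []
    current: List[str] = []
    used = 0
    for sentence in sentences:
        n = count_tokens(sentence)
        if n > max_tokens:
            # Assumption: an over-long sentence is skipped rather than split.
            continue
        if current and used + n > max_tokens:
            # The next sentence would overflow the sample, so close it first.
            samples.append(" ".join(current))
            current, used = [], 0
        current.append(sentence)
        used += n
    if current:
        samples.append(" ".join(current))
    return samples
```

Passing a real subword tokenizer's length function as `count_tokens` would make the cap apply to model tokens rather than whitespace-separated words.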
SlovakBERT: Slovak Masked Language Model - arXiv.org
We also show how many unique tokens were used (effective vocabulary) for the tokenization of this particular dataset. Multilingual LMs have a smaller portion of their vocabulary used, since they contain many tokens useful mainly for other languages, but not for Slovak.
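The effective vocabulary mentioned in this snippet is the set of distinct tokens a tokenizer actually produces on a given dataset. A minimal sketch of how it could be measured follows; the function names and the `tokenize` callable are illustrative assumptions, not part of the paper.

```python
from typing import Callable, Iterable, List, Set


def effective_vocabulary(
    texts: Iterable[str], tokenize: Callable[[str], List[str]]
) -> Set[str]:
    """Collect every distinct token the tokenizer emits on the dataset."""
    used: Set[str] = set()
    for text in texts:
        used.update(tokenize(text))
    return used


def vocabulary_usage(
    texts: Iterable[str], tokenize: Callable[[str], List[str]], vocab_size: int
) -> float:
    """Fraction of the full vocabulary that is actually used on the dataset."""
    return len(effective_vocabulary(texts, tokenize)) / vocab_size
```

Under this measure, a multilingual model with a large shared vocabulary would be expected to show a lower usage fraction on Slovak text than a Slovak-only model, which is the effect the snippet describes.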
arXiv:2109.15254v1 [cs.CL] 30 Sep 2021
In this section we describe our own Slovak masked language model: what data were used for training, what the architecture of the model is, and how it was trained.
Slovak morphological tokenizer using the Byte-Pair Encoding algorithm
This study introduces a new approach to text tokenization, SlovaK Morphological Tokenizer (SKMT), which integrates the morphology of the Slovak language into the training process using the Byte-Pair Encoding (BPE) algorithm.
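For orientation, the plain BPE training loop that this entry builds on can be sketched as follows. This shows only standard character-level BPE merge learning; it does not reproduce SKMT's morphological integration, which the snippet does not detail. The function name `learn_bpe` and the toy word frequencies are assumptions for illustration.

```python
from collections import Counter
from typing import Dict, List, Tuple


def learn_bpe(word_freqs: Dict[str, int], num_merges: int) -> List[Tuple[str, str]]:
    """Learn BPE merge rules by repeatedly merging the most frequent symbol pair."""
    # Each word starts as a sequence of single characters.
    corpus: Dict[Tuple[str, ...], int] = {tuple(w): f for w, f in word_freqs.items()}
    merges: List[Tuple[str, str]] = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs: Counter = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite every word, replacing occurrences of the best pair.
        merged: Dict[Tuple[str, ...], int] = {}
        for symbols, freq in corpus.items():
            out: List[str] = []
            i = 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            key = tuple(out)
            merged[key] = merged.get(key, 0) + freq
        corpus = merged
    return merges
```

A morphology-aware variant would presumably constrain or pre-segment the input so that merges respect morpheme boundaries; how SKMT does this is not stated in the snippet.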
Tokenization Impacts Multilingual Language Modeling: Assessing . . .
Our study offers a deeper understanding of the role of tokenizers in multilingual language models and guidelines for future model developers to choose the most suitable tokenizer for their specific application before undertaking costly model pre-training.
skLEP: A Slovak General Language Understanding Benchmark
Pre-trained on 104 Wikipedias (including Slovak) using a self-supervised strategy that combines Masked Language Modeling (MLM) and Next Sentence Prediction (NSP), it employs WordPiece tokenization with a shared vocabulary of 110,000 tokens.
Slovak National Corpus, Ľ. Štúr Institute of Linguistics, Slovak . . .
The corpora used in the analysis were released by the Department of the SNC in 2013–2020 and are accessible for all registered users. The first one, the reference corpus prim-7.0-frk [7], amounting to more than 253 million tokens, is composed of an even share of journalistic, specialised,
in Contemporary Slovak Language Based on the Slovak National Corpus
. . . whose latest main corpus – prim-4.0 – made available in early 2009, counts about 550 million tokens. For example, the five lemmas (a, v, na, sa, byť) are consistently the top five, highest-ranking lemmas across all its specializations, even in the spoken corpus.
SLOVAK DEPENDENCY TREEBANK IN UNIVERSAL DEPENDENCIES
The best results for Slovak were achieved by the team from Stanford: 83.86% content-word labeled attachment score (CLAS), 86.04% labeled attachment score (LAS) and 89.58% unlabeled attachment score (UAS).
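The attachment scores named in this snippet can be illustrated with a small sketch. It assumes gold-standard tokenization, so that gold and predicted tokens align one to one; shared-task evaluation scripts handle tokenization mismatches and report F1 instead. CLAS, which restricts scoring to content words, is left out. The function name and the (head, label) tuple representation are illustrative choices.

```python
from typing import List, Tuple

Arc = Tuple[int, str]  # (head index, dependency label) for one token


def attachment_scores(gold: List[Arc], predicted: List[Arc]) -> Tuple[float, float]:
    """Return (UAS, LAS) as fractions over tokens with identical tokenization."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must cover the same tokens")
    if not gold:
        raise ValueError("cannot score an empty sentence")
    # UAS: the predicted head matches the gold head.
    correct_heads = sum(g[0] == p[0] for g, p in zip(gold, predicted))
    # LAS: both the head and the dependency label match.
    correct_arcs = sum(g == p for g, p in zip(gold, predicted))
    n = len(gold)
    return correct_heads / n, correct_arcs / n
```

Because LAS requires the label to match as well as the head, it can never exceed UAS, which is consistent with the 86.04% LAS and 89.58% UAS reported above.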