{bs,hr,sr}WaC - Web Corpora of Bosnian, Croatian and Serbian

Published 2014 in WaC@EACL

ABSTRACT

In this paper we present the construction process of top-level-domain web corpora of Bosnian, Croatian and Serbian. To construct the corpora we use the SpiderLing crawler with its associated tools, adapted for simultaneous crawling and processing of text written in two scripts, Latin and Cyrillic. In addition to the modified collection process, we focus on two sources of noise in the resulting corpora: (1) they contain documents written in the other, closely related languages that cannot be identified with standard language identification methods, and (2) like most web corpora, they partially contain low-quality data not suitable for specific research and application objectives. We approach both problems by using language modeling on the crawled data only, removing the need for manually validated language samples for training. On the task of discriminating between closely related languages we outperform the state-of-the-art Blacklist classifier, reducing its error to a fourth.
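
The idea of discriminating closely related languages with language models trained on crawled text can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that the top-level domain a document came from serves as a noisy language label, it uses add-one-smoothed unigram models, and the function names and the tiny training samples are hypothetical.

```python
import math
from collections import Counter

def train_unigram(tokens):
    """Build a unigram model: token counts plus the total token count."""
    counts = Counter(tokens)
    return counts, sum(counts.values())

def log_prob(tokens, model, vocab_size):
    """Log-probability of the tokens under an add-one-smoothed unigram model."""
    counts, total = model
    return sum(math.log((counts[t] + 1) / (total + vocab_size)) for t in tokens)

def classify(text, models):
    """Return the label of the model that assigns the text the highest log-probability."""
    tokens = text.lower().split()
    # Shared vocabulary: every token seen in training plus the tokens of the input.
    vocab = set(tokens)
    for counts, _ in models.values():
        vocab.update(counts)
    return max(models, key=lambda lang: log_prob(tokens, models[lang], len(vocab)))

# Hypothetical toy "crawled" samples, labeled only by their assumed source domain.
models = {
    "hr": train_unigram("tko je što rekao tjedan kruh vlak".split()),
    "sr": train_unigram("ko je šta rekao nedelja hleb voz".split()),
}

print(classify("tko je kupio kruh", models))
print(classify("ko je kupio hleb", models))
```

In this toy setup the decision rests on a handful of lexical differences (for example tko/ko and kruh/hleb); a realistic system would train on far more data and would likely need to handle both Latin and Cyrillic input.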

PUBLICATION RECORD

Publication year
2014
Venue
WaC@EACL
Publication date
2014-04-01
Fields of study
Linguistics, Computer Science
Identifiers
DOI 10.3115/v1/W14-0405
Source metadata
Semantic Scholar

REFERENCES

Automatic Detection and Language Identification of Multilingual Documents
2014cited by this paper
The SETimes.HR Linguistically Annotated Corpus of Croatian
2014cited by this paper
The Good, the Bad, and the Hazy: Design Decisions in Web Corpus Construction
2013cited by this paper
Parsing Croatian and Serbian by Using Croatian Dependency Treebanks
2013cited by this paper
Lemmatization and Morphosyntactic Tagging of Croatian and Serbian
2013cited by this paper
Efficient Web Crawling for Large Text Corpora
2012cited by this paper
Building Large Corpora from the Web Using a New Efficient Tool Chain
2012cited by this paper
Efficient Discrimination Between Closely Related Languages
2012cited by this paper
langid.py: An Off-the-shelf Language Identification Tool
2012cited by this paper
hrWaC and slWaC: Compiling Web Corpora for Croatian and Slovene
2011cited by this paper
Boilerplate detection using shallow text features
2010cited by this paper
The WaCky wide web: a collection of very large linguistically processed web-crawled corpora
2009cited by this paper
Automatic training of lemmatization rules that handle morphological changes in pre-, in- and suffixes alike
2009cited by this paper
Cleaneval: a Competition for Cleaning Web Pages
2008cited by this paper
HunPos: an open source trigram tagger
2007cited by this paper
of the Association for Computational Linguistics
year unknowncited by this paper

CITED BY

The Growing Gains and Pains of Iterative Web Corpora Crawling: Insights from South Slavic CLASSLA-web 2.0 Corpora
2026cites this paper
Hyperraising and copy raising are structurally different: experimental evidence from Serbian
2025cites this paper
Competing Strategies in Morphological Approximation: Exploring Prefixoids kvazi(-), nadri(-), nazovi(-), and pseudo(-) in Croatian
2025cites this paper
Acquisition of Morphological Variation: An Elicitation Experiment on Children's Production of Parallel Forms in Croatian and Estonian.
2025cites this paper
Bleating, growling, barking, and spitting: Metaphorical extensions and valency patterns of verbs of speaking
2025cites this paper
Beyond the Suffix: Examining Imperfectivization strategies in L2 and Heritage BCMS in Italy
2025cites this paper
BERT-Based Implementation of the Serbian Language POS Tagger
2025cites this paper
Diachronic change in meaning and argument coding of mental verbs in the Croatian language
2025cites this paper
SENSORY RATINGS AND SENSITIVITY TO PERCEPTUAL VARIABLES: NOVEL APPROACH TO EVALUATING SEMANTIC MEMORY IN MILD COGNITIVE IMPAIRMENT.
2024cites this paper
Language Models on a Diet: Cost-Efficient Development of Encoders for Closely-Related Languages via Additional Pretraining
2024cites this paper
Theme-vowel minimal pairs show argument structure alternations
2024cites this paper
CLASSLA-web: Comparable Web Corpora of South Slavic Languages Enriched with Linguistic and Genre Annotation
2024cites this paper
Novi jezički modeli za srpski jezik
2024cites this paper
A corpus-based study of maximizer–adjective patterns in Croatian
2024cites this paper
Two Faces of One Suffix: Some Thoughts on Using Corpora in Usage-Based Studies of Word-Formation
2024cites this paper
Putting languages into perspective: A comprehensive database of English words and their Croatian equivalents
2024cites this paper
What do we do with (to) laws and what do laws do to us
2024cites this paper
How do we feel about borrowed words? Affective and lexico-semantic norms for most frequent unadapted English loanwords in Croatian (ENGRI CROWD)
2024cites this paper
An insight into the Croatian degree modifier paradigm and its clustering profiles
2024influential citation
New Textual Corpora for Serbian Language Modeling
2024cites this paper
Implementation of the Serbian Language POS Taggers Using the NLTK Library
2023cites this paper
Annotated Lexicon for Sentiment Analysis in the Bosnian Language
2023influential citation
Personality adjectives in Serbian Tweets: An opening
2023cites this paper
Avoiding stress on non-lexical material in nouns and verbs: predictable verb prosody in Serbo-Croatian stress standard varieties
2023cites this paper
Weather domain in Croatian: a corpus–based overview of precipitation and non–precipitation expressions
2023influential citation
A Serbian Question Answering Dataset Created by Using the Web Scraping Technique
2023cites this paper
POSSESSIVE, KIND AND NOT SO KIND: THE DIFFERENT USES OF THE ADJECTIVAL -OV IN SERBO-CROATIAN
2023cites this paper
BENCHić-lang: A Benchmark for Discriminating between Bosnian, Croatian, Montenegrin and Serbian
2023cites this paper
A Survey of Resources and Methods for Natural Language Processing of Serbian Language
2023cites this paper
Sa ili bez istog padeža
2023cites this paper
Kvantifikatori u značenju skupova životinja u hrvatskome i ruskome jeziku
2023cites this paper
Students’ Strategies for Translating Most Frequent English Loanwords in Croatian
2022cites this paper
CroaTPAS: USPOREĐIVANJE ZNAČENJA VIDSKIH PARNJAKA S NAMJEROM ISTRAŽIVANJA ODNOSA IZMEĐU VIDA, AKTIONSARTA I GLAGOLSKE POLISEMIJE U HRVATSKOM
2022cites this paper
Izgradnja jezičnog korpusa govora mržnje na hrvatskom medijskom prostoru društvenih mreža
2022cites this paper
Inventory of modifiers and sources of grammaticalization of compound indefinite pronouns in Slavic languages
2022cites this paper
Germanizmi u medijskom prostoru
2022cites this paper
OcWikiDisc: a Corpus of Wikipedia Talk Pages in Occitan
2022cites this paper
CroaTPAS: A Survey-based Evaluation
2022influential citation
Corpus linguistics for low-density varieties. Minority languages and corpus-based morphological investigations
2022cites this paper
Computational intelligence in processing of speech acoustics: a survey
2022cites this paper
Zbornik radova Fakulteta tehničkih nauka, Novi Sad
2022cites this paper
KONSTRUKCIJSKI PRISTUP KAO TEMELJ ZA POUČAVANJE HRVATSKOGA KAO INOGA JEZIKA: PRILOZI RJEČNIKU KONSTRUKCIJA
2022cites this paper
Comparative Error Analysis in Neural and Finite-state Models for Unsupervised Character-level Transduction
2021influential citation
Crowdsourcing for the Russian Morphological Lexicon
2021cites this paper
Representing variation in a spoken corpus of an endangered dialect: the case of Torlak
2021cites this paper
Factors contributing to prefixation of biaspectual verbs in Croatian
2021cites this paper
Migration Discourse in Croatian News Media
2021cites this paper
BERTić - The Transformer Language Model for Bosnian, Croatian, Montenegrin and Serbian
2021cites this paper
Actionality and affixation of biaspectual verbs in Croatian in the light of formal–functional theory of verbal aspect
2021cites this paper
Reading Predictors in Croatian
2021cites this paper
Zapažanja o hrvatskim adverbijalima položaja u vremenu
2021cites this paper
Annotating Croatian Semantic Type Coercions in CROATPAS
2020influential citation
A corpus-based approach to reevaluation of Croatian verb classification
2020cites this paper
Metafora na razmeđu koncepata, jezika i diskursa
2020cites this paper
Considering foreignization and domestication in EU legal translation: a corpus-based study
2020cites this paper
A versatile framework for resource-limited sentiment articulation, annotation, and analysis of short texts
2020cites this paper
The Design of Croderiv 2.0
2020cites this paper
Neural Machine Translation for translating into Croatian and Serbian
2020cites this paper
Word Embedding Based on Large-Scale Web Corpora as a Powerful Lexicographic Tool
2020cites this paper
Varijantni frazemi u e-rječniku
2020cites this paper
Universal Derivations 1.0, A Growing Collection of Harmonised Word-Formation Resources
2020cites this paper
Findings of the 2020 Conference on Machine Translation (WMT20)
2020cites this paper
Neural Machine Translation between similar South-Slavic languages
2020cites this paper
XML-Encoding of a spoken Serbian corpus targeting forms of address
2020cites this paper
Metaphoricity of Lexemes from the Lexical Field Family in English and Porodica in Serbian
2020cites this paper
Otvoreni resursi i tehnologije za obradu srpskog jezika
2020cites this paper
Corpora and Processing Tools for Non-standard Contemporary and Diachronic Balkan Slavic
2019cites this paper
High Quality ELMo Embeddings for Seven Less-Resourced Languages
2019cites this paper
Analogical classification in formal grammar
2019cites this paper
Discriminating Between Similar Languages Using Large Web Corpora
2019cites this paper
Folia Linguistica
2019cites this paper
Clitic Climbing, the Raising-Control Dichotomy and Diaphasic Variation in Croatian
2019cites this paper
State-of-the-art on monolingual lexicography for Croatia (Croatian)
2019cites this paper
Računalna obrada književnih tekstova na primjeru analize korpusa ruskih romantičara i realista
2019cites this paper
Redesign of the Croatian derivational lexicon
2019cites this paper
Normalization and parsing algorithms for uncertain input
2019cites this paper
Chapter 5. MetaNet.HR
2019cites this paper
Chapter 9. Metaphor repositories and cross-linguistic comparison
2019cites this paper
Multilingual and Cross-Lingual Graded Lexical Entailment
2019cites this paper
Corpus-Supported Foreign Language Teaching of Less Commonly Taught Languages
2019influential citation
Dynamic N-Gram System Based on an Online Croatian Spellchecking Service
2019cites this paper
De la constitution d’un corpus arboré à l’analyse syntaxique du serbe [From the constitution of a treebank to the syntactic analysis of the Serbian language]
2018cites this paper
Comparing CRF and LSTM performance on the task of morphosyntactic tagging of non-standard varieties of South Slavic languages
2018cites this paper
Corpus analysis of Croatian constructions with the verb doći ‘to come’
2018cites this paper
Designing a Croatian Aspectual Derivatives Dictionary: Preliminary Stages
2018influential citation
Impersonal constructions in Slavic languages and the agentivity of the verb
2018cites this paper
Language Identification and Morphosyntactic Tagging: The Second VarDial Evaluation Campaign
2018cites this paper
Using Neural Transfer Learning for Morpho-syntactic Tagging of South-Slavic Languages Tweets
2018cites this paper
Sub-label dependencies for Neural Morphological Tagging – The Joint Submission of University of Colorado and University of Helsinki for VarDial 2018
2018cites this paper
Untying the Gordian knot: interpreting extended term-forming patterns
2018cites this paper
hr500k – A Reference Training Corpus of Croatian.
2018cites this paper
Automatic Language Identification in Texts: A Survey
2018cites this paper
Challenges in the Management of Large
2018cites this paper
Challenges in the Management of Large Corpora and
2018cites this paper
Quantitative fine-grained human evaluation of machine translation systems: a case study on English to Croatian
2018cites this paper
Clitic Climbing and Stacked Infinitives in Bosnian, Croatian and Serbian – A Corpus-Driven Study
2018cites this paper
Fine-grained Semantic Textual Similarity for Serbian
2018cites this paper
Web corpora - the best possible solution for tracking rare phenomena in underresourced languages: clitics in Bosnian, Croatian and Serbian
2017cites this paper