We’re pleased to announce that along with the release of the machine learning inference ingest processor, we are releasing language identification in Elasticsearch 7.6. With this release, we wanted to take the opportunity to describe some use cases and strategies for searching in multilingual corpora, and how language identification plays a part. We’ve covered some of these topics in the past, and we’ll build on these in some of the examples that follow.

In today’s highly interconnected world, we find that documents and other sources of information come in a variety of languages. This poses a problem for many search applications: we need to understand the language of these documents as best we can in order to analyze them properly and provide the best search experience possible. Language identification is used to improve the overall search relevance for these multilingual corpora.

Given a set of documents where we do not yet know the language(s) they contain, we want to search over them efficiently. The documents may contain a single language or multiple languages. The former is common in domains such as computer science, where English is the predominant language of communication, while the latter is commonly found in biological and medical text, where Latin terminology is frequently interspersed with English.

By applying language-specific analysis, we can improve relevance (both precision and recall) by ensuring that document terms are understood, indexed and searched over appropriately. Using a suite of language-specific analyzers in Elasticsearch (both built-in and through additional plugins), we can provide improved tokenization, token filtering and term filtering:

- Word form normalization: stemming and lemmatization
- Decompounding (e.g. German, Dutch, Korean)

For similar reasons, we find language identification in more general natural language processing (NLP) pipelines as one of the first processing steps, used to select highly precise, language-specific algorithms and models. For example, pre-trained NLP models such as Google’s BERT and ALBERT or OpenAI’s GPT-2 are commonly trained on per-language corpora, or corpora with a predominant language, and fine-tuned for tasks such as document classification, sentiment analysis, named entity recognition (NER), etc.

For the following examples and strategies, unless otherwise specified, we will assume that documents contain either a single or a predominant language.

To help motivate this further, let’s have a quick look at a few benefits of language-specific analyzers.

Decompounding: In the German language, nouns are often built by compounding other nouns together to create beautifully long and hard-to-read compound words. A simple example is combining “Jahr” (“year”) into other words like “Jahrhunderts” (“century”), “Jahreskalender” (“annual calendar”) or “Schuljahr” (“school year”). Without a custom analyzer that can decompound these words, we wouldn’t be able to search for “jahr” and get back documents about school years, “Schuljahr”. Furthermore, German has different rules than other Latin languages for plural and dative forms, meaning that searching for “jahr” should also match “Jahre” (plural) and “Jahren” (plural dative).

Common terms: Some languages also make use of common or domain-specific terminology. For example, “computer” is a word that is frequently used in other languages as-is, so if we want to search for “computer”, we might also be interested in non-English documents.
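To make the decompounding and stemming benefits concrete, here is a minimal sketch of a custom German analyzer built from Elasticsearch’s `dictionary_decompounder`, `german_normalization` and `stemmer` token filters. The index name, the tiny `word_list` dictionary and the filter names are illustrative assumptions, not part of the original post:

```json
PUT decompound-example
{
  "settings": {
    "analysis": {
      "filter": {
        "german_decompounder": {
          "type": "dictionary_decompounder",
          "word_list": ["jahr", "schul", "kalender"]
        },
        "german_stemmer": {
          "type": "stemmer",
          "language": "light_german"
        }
      },
      "analyzer": {
        "german_decompound": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "german_decompounder",
            "german_normalization",
            "german_stemmer"
          ]
        }
      }
    }
  }
}

GET decompound-example/_analyze
{
  "analyzer": "german_decompound",
  "text": "Schuljahr"
}
```

Because the decompounder emits subwords from the dictionary alongside the original token, a search for “jahr” can now match documents containing “Schuljahr”, and the light German stemmer reduces forms such as “Jahre” and “Jahren” toward the same stem. In practice you would use a full dictionary (or the `hyphenation_decompounder`) rather than a handful of hand-picked words.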
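Language identification itself can be applied at ingest time. The following is a minimal sketch of a hypothetical ingest pipeline using the built-in `lang_ident_model_1` model with the new inference processor; the pipeline name and the `contents` field are assumptions for illustration:

```json
PUT _ingest/pipeline/lang-ident
{
  "description": "Detect the predominant language of the contents field",
  "processors": [
    {
      "inference": {
        "model_id": "lang_ident_model_1",
        "inference_config": { "classification": {} },
        "field_mappings": { "contents": "text" },
        "target_field": "_ml.lang_ident"
      }
    }
  ]
}
```

Documents indexed through such a pipeline receive a predicted language code (e.g. “de” for German) under the configured `target_field`, which can then be used to route text into the appropriate language-specific analyzed field.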