As noted by the Austrian linguist and philosopher Ludwig Wittgenstein, "The limits of my language mean the limits of my world." This is especially true today, when the language we speak can change how we engage with technology, and the limits of our online vernacular can constrain the full and fair use of existing and emerging technologies.

Prompt-based generative artificial intelligence (AI) tools are quickly being deployed for a range of use cases, from writing emails and compiling legal cases to personalizing research essays across educational, professional, and vocational disciplines. But language is not monolithic, and opportunities may be missed in developing generative AI tools for non-standard languages and dialects. Current applications often are not optimized for certain populations or communities and, in some instances, may exacerbate social and economic divisions. As it stands now, the majority of the world's speakers are being left behind if they do not speak one of the world's dominant languages, such as English, French, German, Spanish, Chinese, or Russian.

Language differences start with the digital divide

There are over 7,000 languages spoken worldwide, yet a plurality of content on the internet is written in English, with the largest remaining online shares claimed by Asian and European languages like Mandarin and Spanish. Moreover, in the English language alone, there are over 150 dialects beyond "standard" U.S. English.

Among sociologists, anthropologists, and linguists, language is a source of power, one that significantly influences the development and dissemination of new tools that depend on learned linguistic capabilities. Depending on where one sits within socio-ethnic contexts, native language can internally strengthen communities while also amplifying and replicating inequalities when coopted by incumbent power structures to restrict immigrant and historically marginalized communities. For example, during the transatlantic slave trade, literacy was a weapon used by white supremacists to reinforce the dependence of enslaved Black people on slave masters, which resulted in anti-literacy laws being passed in most Confederate states in the 1800s. Because of this historical artifact, and of other movements that have banned bilingual communications in preference for English-only rules and laws, it is important to consider the implications of constructing the same linguistic frameworks in the digital world, where they can exacerbate the digital divide in autonomous and generative systems. Consequently, the large language models (LLMs) behind AI tools such as generative AI rely on internet data that serves to increase the gap between standard and non-standard speakers, widening the digital language divide.

The resource disparities that exist across languages perpetuate further disparities in technologies such as generative AI systems and LLMs because of their link to the digital divide. Most language-based systems are trained on internet data that researchers can scrape at scale, yet only a few hundred languages are represented online, with English taking up the largest proportion. Even before generative AI, most natural language processing (NLP) systems were designed and tested in "high-resource" languages like English. Of all the active languages worldwide, only 20 are considered "high-resource," a categorization that refers to the amount of data available in a given language to effectively train language-based systems. As such, English has become one of the most data-rich languages, and the mass availability of English data has led to the creation of English-centric datasets and models.

One reason for this extreme asymmetry is that speakers of under-resourced languages have limited access to digital services, which means they have a significantly smaller digital footprint and are therefore less likely to be included in web-scraped training data. Without enough data to train usable language-based systems, most of the world's AI applications will under-represent billions of people around the world.

Not only are speakers of under-resourced languages at risk, but so are speakers of regional dialects of high-resource languages. A plurality of online content, including books, blogs, news articles, advertisements, and social media posts, is written in "standard" U.S. English, which then becomes web-scraped training data for NLP systems and generative AI tools. In fact, ChatGPT was trained on 300 billion words; imagine how many of those words might have belonged to a non-standard English dialect.