How AI and Wikipedia have plunged at-risk languages into a downward spiral

When Kenneth Wehr took over the Greenlandic-language Wikipedia four years ago, his first move was to delete nearly everything on it. He believed it was the only way the project had any chance of surviving.

Wehr, 26, is not from Greenland; he grew up in Germany, but a teenage visit to the autonomous Danish territory sparked a lasting fascination. He spent years writing obscure Wikipedia articles about Greenland in his native German, and eventually moved to Copenhagen to study Greenlandic, a language spoken by about 57,000 people, mostly Indigenous Inuit, scattered across numerous remote Arctic settlements.

The Greenlandic Wikipedia was created around 2003, shortly after the English version launched. By the time Wehr took charge nearly 20 years later, contributors had collectively produced about 1,500 articles running to tens of thousands of words, a testament, it seemed, to the crowdsourcing model that made Wikipedia a primary online reference source even in the most unexpected places.

However, there was one crucial problem: the Greenlandic Wikipedia was an illusion.

Nearly every article had been written by people who didn’t actually speak the language. Wehr, who now teaches Greenlandic in Denmark, estimates that only one or two actual Greenlanders had ever contributed. His biggest concern, though, was the flood of articles that appeared to have been produced with machine translation. These entries were riddled with basic errors, from grammatical mistakes to nonsensical phrasing to major inaccuracies, such as a claim that Canada’s population was just 41. Some articles contained random strings of letters spat out by machines unable to find suitable Greenlandic words.

“To the authors, it might have looked like Greenlandic, but they had no way of knowing,” Wehr says.

“Sentences often made no sense or contained clear mistakes,” he states. “AI translators struggle significantly with Greenlandic.”

What Wehr describes is not a problem confined to the Greenlandic edition.

Wikipedia is the most ambitious multilingual project after the Bible: it has editions in more than 340 languages, with a further 400 more obscure ones in development and testing. Many of the smaller editions have been flooded with automatically translated content as AI tools have become widely available. Volunteers working on four African languages estimated to MIT Technology Review that 40% to 60% of articles in their respective Wikipedia editions were unedited machine translations. And an MIT Technology Review audit of the Inuktitut Wikipedia, a language closely related to Greenlandic and spoken in Canada, concluded that more than two-thirds of pages containing multiple sentences included passages generated this way.

This is creating a thorny problem. AI systems, from Google Translate to ChatGPT, learn to “speak” new languages by analyzing huge amounts of text harvested from the internet. Because Wikipedia is often the largest repository of online text for languages with few speakers, errors on its pages, grammatical or otherwise, can poison the wells of data that AI systems draw from. That leads to worse translations, and a cycle of linguistic decline sets in: more badly translated Wikipedia pages get added, and AI models keep learning from the faulty translations. The problem boils down to a simple concept: garbage in, garbage out.
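
In schematic terms, the loop looks something like the toy simulation below. Everything in it is invented for illustration: the function names, the starting numbers, and the assumption that machine-translated pages come out somewhat worse than the model that produced them. It is a sketch of the feedback dynamic, not a description of any real training pipeline.

    import random

    def scrape_wikipedia(edition):
        """Stand-in for harvesting every page of a small Wikipedia edition."""
        return list(edition)  # all pages, good or bad, become training text

    def train_translator(corpus):
        """Stand-in for training: 'quality' is just the share of clean pages."""
        clean = sum(1 for page in corpus if page["is_clean"])
        return clean / max(len(corpus), 1)

    def machine_translate(quality, n_new):
        """New machine-translated pages inherit the model's flaws, plus some
        extra translation error (the 0.8 factor is an assumption for effect)."""
        return [{"is_clean": random.random() < quality * 0.8} for _ in range(n_new)]

    # A small edition: 100 pages, 40 of them written by fluent speakers.
    edition = [{"is_clean": i < 40} for i in range(100)]

    for cycle in range(5):
        quality = train_translator(scrape_wikipedia(edition))
        # Each cycle, enthusiastic outsiders add 200 machine-translated pages.
        edition += machine_translate(quality, 200)
        print(f"cycle {cycle}: model quality ~ {quality:.2f}, pages: {len(edition)}")

    # The share of clean text (and with it the model's quality) drifts downward
    # as the corpus becomes increasingly dominated by the model's own errors.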

“These systems depend on raw data,” explains Kevin Scannell, a former computer science professor at Saint Louis University who now builds software for languages at risk of extinction. “They learn about a language from the ground up without any other input. There are no grammar guides, dictionaries, or anything but the input text.”

While exact data on this issue’s scale is limited—partly due to the confidentiality of much AI training data and the rapidly evolving nature of the field—back in 2020, Wikipedia was estimated to comprise over half of the training data for AI models translating languages spoken by millions in Africa, including Malagasy, Yoruba, and Shona. A research team from Germany in 2022 found that Wikipedia constituted the sole accessible source of online linguistic data for 27 under-resourced languages.

Where an edition is riddled with bad writing, the consequences could be serious, potentially pushing some of the world’s most vulnerable languages further toward the brink as younger generations abandon them.

“AI models will reflect the state of Wikipedia for these languages,” says Trond Trosterud, a computational linguist at the University of Tromsø in Norway, who has been warning for years about the damage that badly maintained Wikipedia editions can do. “I find it hard to conceive of it not having repercussions. And, naturally, the greater Wikipedia’s dominance, the more severe the consequences.”

Use with caution

Automation has been part of Wikipedia since its earliest days. Bots maintain the platform: they fix broken links, clean up sloppy formatting, and even correct spelling errors, repetitive tasks that lend themselves to automation. An army of bots also churns out brief articles on rivers, cities, or animals by slotting names into predefined phrases. On the whole, they have made the platform better.

However, AI operates differently. Anyone can deploy it to create substantial harm with just a few clicks.

Wikipedia has navigated the start of the AI era better than many other platforms. Unlike social media, it hasn’t been swamped by AI bots or misinformation. It largely retains the innocence of an earlier internet era: open and free for everyone to use, edit, and draw from, and governed by the community it serves. It is transparent and easy to use. But community-run platforms live or die by the size of their communities. English has flourished, while Greenlandic has faltered.

“What we need are dedicated Wikipedians. This is a critical need that people often overlook. It isn’t just magic,” remarks Amir Aharoni, a member of the volunteer Language Committee overseeing Wikipedia edition requests. “Machine translation can be effective and beneficial when used responsibly. Unfortunately, not everyone can be trusted to use it ethically.”

Trosterud has studied the behavior of users on smaller Wikipedia editions and says AI has empowered a group he calls “Wikipedia hijackers.” These users range from naive youths creating pages about their hometowns or favorite YouTubers to well-intentioned Wikipedians who believe that writing articles in minority languages somehow helps those communities, without realizing the damage they can do.

“The issue with them today is that they are equipped with Google Translate,” Trosterud states, pointing out that this facilitates the creation of lengthier and more credible-looking entries than previously possible: “In earlier times, they were limited to dictionaries.”

This has effectively streamlined the destruction, and it falls hardest on vulnerable languages, which tend to receive the worst AI translations. There are several reasons for this, but a big one is the limited amount of source text available online. Sometimes models also struggle to recognize a language because it resembles others, or because languages such as Greenlandic and many Native American languages have structures that fit poorly with how most machine translation systems work. (Wehr notes that Greenlandic is heavily agglutinative, meaning words are built by attaching prefixes and suffixes to stems. As a result, many words are highly context-dependent, expressing meanings that would take whole sentences in other languages.)
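
A rough way to see why that structure starves translation systems of data is to count surface forms. The morphemes in the sketch below are invented placeholders, not real Greenlandic; the point is only that a handful of stems and suffix slots multiply into far more distinct “words” than a small corpus can show a model often enough for it to learn them.

    from itertools import product

    # Invented placeholder morphemes (not real Greenlandic), just to show how
    # agglutination multiplies word forms: stem + derivational suffix +
    # inflectional ending all fuse into a single long word.
    stems = ["stemA", "stemB", "stemC"]                 # e.g. 'house', 'eat', 'big'
    derivation = ["", "-want", "-cause", "-almost"]     # 'want to X', 'make X', ...
    inflection = ["-1sg", "-2sg", "-3pl", "-past-3sg"]  # person/number/tense

    forms = {s + d + i for s, d, i in product(stems, derivation, inflection)}
    print(f"{len(stems)} stems -> {len(forms)} distinct word forms")
    # 3 stems -> 48 forms. Real agglutinative languages chain many more suffixes,
    # so most possible word forms appear rarely or never in a small corpus --
    # exactly the sparsity that trips up statistical translation systems.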

Research conducted by Google before a major upgrade to Google Translate three years ago found that translation systems for low-resource languages generally deliver worse results than those for resource-rich ones. The researchers observed, for instance, that their model often mistranslated basic nouns across languages, including animal names and colors. (In a statement to MIT Technology Review, Google said it is committed to maintaining a high quality standard for all 249 languages it supports “by rigorously testing and enhancing [its] systems, especially for languages with limited public text resources online.”)

Wikipedia itself offers a built-in editing tool called Content Translate, which lets users automatically convert articles from one language to another, saving time by preserving the original references and intricate formatting. But it relies on external machine translation systems, which makes it prone to the same flaws as other machine translators, flaws the Wikimedia Foundation acknowledges are hard to fix. Each edition’s community decides whether the tool is permitted, and some have opted against it. (Notably, the English-language Wikipedia has largely banned its use, citing that around 95% of articles created with Content Translate failed to meet acceptable standards without significant additional editing.) It is at least easy to tell when the tool has been used: Content Translate appends a tag to the edit in the Wikipedia backend.
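
That tag means anyone can, in principle, see how heavily a given edition leans on the tool through the standard MediaWiki API. The sketch below is illustrative only: it assumes the edition sits at an ordinary wikipedia.org address and that the change tag is named “contenttranslation,” the name used on many wikis, though neither detail is guaranteed for every edition.

    import requests

    # Rough sketch: list recent edits on one edition that carry the Content
    # Translate change tag. The endpoint and parameters are standard MediaWiki
    # API features; the tag name "contenttranslation" is an assumption that
    # holds on many wikis but should be checked per edition.
    API = "https://ig.wikipedia.org/w/api.php"  # Igbo Wikipedia, as an example

    params = {
        "action": "query",
        "list": "recentchanges",
        "rctag": "contenttranslation",   # only edits made with the tool
        "rcprop": "title|timestamp|user",
        "rclimit": 50,
        "format": "json",
    }

    data = requests.get(API, params=params, timeout=30).json()
    for change in data.get("query", {}).get("recentchanges", []):
        print(change["timestamp"], change["title"], "by", change["user"])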

Other AI programs are notably harder to track. Nevertheless, many Wikipedia editors I’ve spoken to reported observing a noticeable increase in poorly translated articles once their languages were incorporated into major online translation platforms.

Some Wikipedians who use AI for translations have at times admitted that they don’t speak the target languages. They may see themselves as giving smaller communities a rough draft to refine, essentially copying the model that has worked well for the more active Wikipedia editions.

However, once incorrect entries are created in lesser-used languages, there typically isn’t a legion of proficient speakers ready to correct them. These editions often have very few readers and sometimes no regular editors at all.

Yuet Man Lee, a Canadian educator in his 20s, recounts using a combination of Google Translate and ChatGPT to translate several articles he had authored for the English Wikipedia into Inuktitut, hoping to assist a smaller Wikipedia community. He appended a note on one article indicating that it was merely a rough translation. “I didn’t expect anyone would notice [the article],” he reflects. “When you post something on the smaller Wikipedias—usually, no one pays attention.”

Yet, he adds, he still hoped “someone might come across it and improve it”—expressing his uncertainty about whether the AI-generated Inuktitut translation was grammatically sound. No one has edited the article since its creation.

Lee, who teaches social sciences in Vancouver and began editing English Wikipedia entries a decade ago, says users familiar with busier Wikipedia editions can fall prey to what he calls “bigger-Wikipedia arrogance”: when they contribute to smaller editions, they assume others will come along and fix their errors. Sometimes that works. Lee notes that he previously added several articles to the Wikipedia in Tatar, a language with several million speakers mainly in Russia, and one of them was eventually polished up. But the Inuktitut Wikipedia, he points out, is by comparison a “barren wasteland.”

He says his intentions were good: he wanted to increase the number of entries in an Indigenous Canadian Wikipedia. “I now contemplate that it might have been misguided. I didn’t take into account that I might be contributing to a recursive cycle,” he reflects. “I was driven by the desire to share content, fueled by curiosity and enjoyment, without adequately considering the repercussions.”

“Totally, completely no future”

Wikipedia runs on unbridled optimism. Editing can be a thankless task, involving long stretches spent quarreling with anonymous or pseudonymous strangers, yet dedicated contributors put in hours of unpaid work for the sake of a noble cause. That dedication is what motivates many of the regular small-language editors I spoke to. All of them dread what will happen if subpar content keeps piling up on their pages.

Abdulkadir Abdulkadir, a 26-year-old agricultural planner I spoke with over a crackly phone line from a busy roadside in northern Nigeria, said he spends three hours a day improving entries in his native Fulfulde, a language spoken mainly by pastoralists and farmers across the Sahel. “However, the workload is overwhelming,” he said.

Abdulkadir perceives an urgent necessity for the Fulfulde Wikipedia to function effectively. He has been advocating for it as one of the limited online resources available to farmers in remote locales, potentially supplying details on which seeds or crops are best suited for their fields in a language they can easily grasp. Providing them with a machine-translated article, Abdulkadir told me, could “easily mislead them,” given that the information will likely not be accurately translated into Fulfulde.

For instance, Google Translate states that the Fulfulde term for January means June, while ChatGPT claims it’s August or September. The systems also suggest that the Fulfulde word for “harvest” translates to “fever” or “well-being,” among other alternatives.

Abdulkadir recounted that he was compelled to amend an entry about cowpeas, a crucial cash crop throughout much of Africa, after realizing it was largely nonsensical.

If anyone wants to create content on the Fulfulde Wikipedia, Abdulkadir argues, the translations must be done manually. Otherwise, “whoever reads your articles will [not] be able to obtain even basic information,” he advises these Wikipedians. Even so, he estimates that roughly 60% of articles are still unedited machine translations. Unless there are major changes in how AI systems learn and are used, he predicted, the future for Fulfulde looks grim. “It’s going to be dreadful, honestly,” he said. “It is headed toward total, complete extinction.”

On the other side of Nigeria from Abdulkadir, Lucy Iwuala contributes to Wikipedia in Igbo, a language spoken by millions in southeastern Nigeria. “The damage has already been inflicted,” she said, scrolling through the two most recently created articles. Both had been machine-translated with Wikipedia’s Content Translate and contained so many errors that, she said, reading on would have given her a headache. “There are certain terms that remain untranslated. They are still in English,” she pointed out. She recognized the username behind the entries as a repeat offender. “This one even incorporates letters that aren’t present in the Igbo alphabet,” she noted.

Iwuala began contributing to Wikipedia three years ago out of concern that English was encroaching upon Igbo. Many engaged in smaller Wikipedia editions share this fear. “This is my culture. This is my identity,” she asserted. “That underpins the whole endeavor: to ensure that you are not erased.”

Iwuala, now a professional translator working between English and Igbo, said the people causing the most disruption are inexperienced editors who see AI translation as a shortcut to quickly raising the profile of the Igbo Wikipedia. At the online edit-a-thons she organizes, and in emails to errant editors, she often finds herself explaining that the result can be counterproductive and drive readers away: “You will feel disheartened, and you won’t want to return. You will just leave and revert to the English Wikipedia.”

These concerns resonate with Noah Ha‘alilio Solomon, an assistant professor of Hawaiian language at the University of Hawai‘i. He reported that some 35% of words on some Hawaiian Wikipedia pages are incomprehensible. “If this is the Hawaiian that will be available online, it will cause more harm than benefit,” he asserts.

Hawaiian, which was perilously close to extinction several decades ago, has been experiencing revitalization efforts spearheaded by Indigenous advocates and academics. Witnessing such poor-quality Hawaiian on a widely utilized platform like Wikipedia distresses Ha‘alilio Solomon.

“It’s painful, as it harkens back to all the times our culture and language were appropriated,” he laments. “We have been fervently fighting for language revitalization. It’s a challenging path, and this can present additional hurdles. People will incorrectly perceive this as an accurate portrayal of the Hawaiian language.”

The consequences of these errors on Wikipedia can materialize quickly. AI translators that have undoubtedly ingested these inaccurate pages as training data are now helping to produce, for example, error-ridden AI-generated manuals aimed at learners of languages such as Inuktitut and Cree, Indigenous languages in Canada, as well as Manx, a small Celtic language spoken on the Isle of Man. Several of these have recently appeared for sale on Amazon. “It was nothing but complete absurdity,” says Richard Compton, a linguist at the University of Quebec in Montreal, of one volume he reviewed that was marketed as an introductory phrasebook for Inuktitut.

Rather than making minority languages more accessible, AI is building an ever-growing labyrinth for their learners and speakers. “It feels like a slap in the face,” Compton says. He worries that younger generations in Canada who are eager to learn their languages, which communities have fought so hard to preserve against marginalization, might turn to online resources like ChatGPT or phrasebooks on Amazon and make the situation even worse. “It’s a form of deception,” he argues.

A race against time

UNESCO reports that a language goes extinct every two weeks. But whether the Wikimedia Foundation, the organization behind Wikipedia, has any duty toward the languages represented on its platform is unclear. Runa Bhattacharjee, a senior director at the foundation, told me that responsibility lies with individual communities to decide what content they want on their Wikipedia pages. “Ultimately, it falls upon the community to ensure there is no vandalism or unwarranted content, whether arising from machine translation or other sources,” she said. Generally, Bhattacharjee noted, editions are considered for closure only when specific complaints are raised about them.

But if there’s no engaged community, how can an edition be rectified or a complaint lodged?

Bhattacharjee clarified that the Wikimedia Foundation views its role in such circumstances as maintaining the Wikipedia infrastructure in case someone emerges to revive it: “We provide the environment for them to flourish and evolve. That’s our position.”

Inari Saami, spoken in a single isolated community in northern Finland, shows how people can make Wikipedia work for a small language. Four decades ago the language was on the brink of extinction; only four children spoke it. Their parents founded the Inari Saami Language Association in a last-ditch effort to keep it alive. The effort paid off. Today there are hundreds of speakers, schools that use Inari Saami as the language of instruction, and 6,400 Wikipedia entries in the language, each carefully edited by fluent speakers.

This achievement underscores how Wikipedia can indeed serve as a unique vehicle for small and committed communities to promote the preservation of their languages. “We prioritize quality over quantity,” asserts Fabrizio Brecciaroli, a member of the Inari Saami Language Association. “We plan to use Wikipedia as a repository for the written language. It’s crucial for younger generations to engage with Inari Saami in a digital format.”

This initiative has been so effective that Wikipedia has been incorporated into the curriculum at schools teaching Inari Saami, Brecciaroli notes. He frequently receives calls from teachers requesting simple articles on subjects ranging from tornadoes to Saami folklore. Wikipedia has even enabled the creation of new vocabulary in Inari Saami. “We have to constantly generate new words,” Brecciaroli explains. “Young people require them to converse about sports, politics, and video games. If they are uncertain how to articulate something, they now refer to Wikipedia.”

Wikipedia represents an enormous intellectual undertaking. The situation with Inari Saami indicates that, with diligent effort, success is achievable for smaller languages. “Our primary goal is to secure the survival of Inari Saami,” Brecciaroli concludes. “It might be advantageous that there isn’t a Google Translate for Inari Saami.”

This may hold true, although large language models like ChatGPT can be prompted to translate into languages that traditional machine translation systems don’t support. Brecciaroli acknowledged that ChatGPT isn’t particularly good at Inari Saami, but its performance varies a great deal with how it is asked: pose a question in the language, and the reply is often laced with Finnish terms or invented words; ask in English, Finnish, or Italian and request the answer in Inari Saami, and the results tend to be better.
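
For what it’s worth, the kind of comparison Brecciaroli describes is easy to reproduce with the OpenAI Python client. The sketch below is hypothetical: the model name is a placeholder, the prompts are illustrative, and the quality of whatever comes back in Inari Saami is precisely the open question.

    from openai import OpenAI

    client = OpenAI()  # assumes an OPENAI_API_KEY in the environment

    def ask(prompt: str) -> str:
        """Send one prompt and return the model's reply."""
        response = client.chat.completions.create(
            model="gpt-4o",  # placeholder model name
            messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content

    # 1) Ask directly in the small language: per Brecciaroli, replies tend to
    #    drift into Finnish or invented words. (Placeholder text below, not
    #    actual Inari Saami.)
    direct = ask("<a question written in Inari Saami>")

    # 2) Ask in a well-resourced language and request the answer in Inari Saami:
    #    reportedly the more reliable route.
    routed = ask("Explain in two sentences what Wikipedia is. "
                 "Answer only in Inari Saami.")

    print(direct, routed, sep="\n---\n")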

Given all this, the task of producing high-quality online content becomes a race against time. “ChatGPT thrives on a wealth of text,” Brecciaroli says. “As long as we continue inputting quality material, we will eventually yield fruitful outcomes. That’s the optimism.” Several linguists I spoke to echoed the idea that it may still be possible to break the “garbage in, garbage out” cycle. (OpenAI, the maker of ChatGPT, did not respond to a request for comment.)

Still, the broader challenge is likely to grow, because many languages are not as fortunate as Inari Saami, and their AI translators will keep being trained on ever more flawed data. Wehr, for his part, is far less hopeful about the prospects for his beloved Greenlandic.

Since his initial deletion of much of the Greenlandic-language Wikipedia content, Wehr has spent years attempting to enlist speakers to aid in its revival. He has made appearances in Greenlandic media and issued appeals on social media. Yet, he reports receiving little response; he describes the experience as disheartening.

“There seems to be no one in Greenland interested in this, or willing to contribute,” he states. “It seems completely futile, which is why it should be closed.”

Late last year, he asked the Language Committee to shut down the Greenlandic-language Wikipedia. The request set off months of contentious discussion among Wikipedia administrators, some of whom were surprised that an edition that looked superficially healthy could be plagued by such deep problems.

Earlier this month, Wehr’s request was granted: the Greenlandic Wikipedia will be closed, with any remaining articles moved to the Wikipedia Incubator, where new language editions are nurtured and developed. Among the reasons the Language Committee cited was the reliance on AI tools, which “have frequently produced nonsensical content that could misrepresent the language.”

Nonetheless, it may already be too late—errors in Greenlandic seem to have already seeped into machine translation systems. When prompted, both Google Translate and ChatGPT fail to accurately count to 10 in proper Greenlandic.

Jacob Judah is an investigative journalist based in London.
