Around half the world’s population still lacks access to the internet. Companies like Facebook, SpaceX, and Amazon want to change that by launching constellations of satellites into the sky, which will beam internet back down to Earth. But even if these projects succeed, tech giants may face a more fundamental problem in bridging the digital divide: language.
There are thousands of different tongues spoken around the world, but most of the content on the web is only available in a select few, primarily English. More than 10 percent of Wikipedia is written in English, for example, and almost half the site’s articles are in European languages. Getting one billion more people online is often held up as the next major milestone, but when they log on for the first time, those users may find the internet has little to offer in the primary languages they speak.
“Approximately 5 percent of the world speaks English at home,” said Juan Ortiz Freuler, a fellow at the World Wide Web Foundation, during a panel at the RightsCon conference in Tunisia Wednesday, but around “50 percent of the web is in English.” Freuler argued the internet has facilitated “cultural homogenization,” now that the majority of its users rely on Facebook and Google, and communicate in the same dominant languages. But the problem “is not because of changes in technology,” said Kristen Tcherneshoff, the community director of Wikitongues, an organization that promotes language diversity. Corporations and governments largely haven’t provided the resources and support necessary to bring smaller languages online.
Many of the biggest online platforms were founded in Silicon Valley, and started with primarily English-speaking user bases. As they’ve expanded around the world and to different languages, they’ve been playing catch-up. Facebook has faced criticism for not employing enough native speakers to monitor content in countries where it has millions of users. In Myanmar, for example, the company for years had only a handful of Burmese speakers as hate speech proliferated. Facebook has admitted that it did not do enough to prevent its platform from being used to incite violence in the country.
Another part of the problem stems from the fact that relatively few datasets have been created in these languages that are suitable for training artificial intelligence tools. Take Sinhala, also known as Sinhalese, which is spoken by around 17 million people in Sri Lanka and can be written in four different ways. Facebook’s algorithms—trained primarily on English and other European languages—don’t map well to it. That makes it difficult for the social network to automatically identify things like hate speech in the country, or stop the flow of misinformation after a terrorist attack.
But Tcherneshoff says language diversity is about more than just practicality; it’s also about expression. Jokes, emotions, and art are often difficult, if not impossible, to translate from one language to another. She pointed to projects like the Mother Language Meme Challenge, which invited people to make memes in their native tongue for UNESCO’s International Mother Language Day in 2018. The idea, in part, was to demonstrate how humor is often intimately tied to language.
Mozilla is one organization working to crowdsource language datasets that any developer can use for free. One of them is Common Voice, which Mozilla claims is “the world’s most diverse voice dataset.” It includes recordings from over 42,000 people in dominant languages like English and German, but also in Welsh and Kabyle. The project is designed to give engineers the tools they need to build things like speech-to-text programs in different tongues. Mark Surman, the executive director of the Mozilla Foundation, believes open source datasets like Common Voice are one of the few viable ways to ensure more language diversity in emerging tech. At for-profit companies, the issue “falls very low on the economic ladder,” he said during the RightsCon panel.
Bringing more languages online may ultimately be an exercise in cultural preservation, rather than utility. Despite advocates’ best efforts, it’s unlikely there will ever be as many websites in Yoruba, say, as there are in French or Arabic. New internet users may simply opt to browse in their second or third language instead of their native tongue.
At the same time, corporations like Google have built programs, such as Google Translate, that make it easier to access online content in different languages. Google also gave some of its tools to Wikipedia to help translate articles, although the translations still require careful review by native speakers; Wikipedia editors have complained that the Google tools sometimes produce shoddy results. For the time being, promoting language diversity online still requires the concerted effort of humans.