The Platform
Digital Development Needs a Language Layer: Multilateral Development Banks Should Lead
Multilateral banks have financed the digital “rails.” Now they need to put up the signs in every language people use.
Multilateral development banks (MDBs), such as the World Bank, the Asian Development Bank, and the African Development Bank, have spent decades investing in the fundamentals of growth: roads, power, schools, and hospitals. In the past few years, they’ve added a newer class of essential assets—digital public infrastructure (DPI): national ID systems, real-time payment rails, and secure data-exchange platforms that let governments deliver services at scale.
Here’s the missing layer: language. If the rails are in place but the signage is written only in dominant tongues, whole populations will watch the digital state speed by. The most elegant platforms won’t matter if citizens can’t understand the interface or make themselves understood.
This piece makes a straightforward case: MDBs should fund and steward multilingual datasets—especially for low-resource and Indigenous languages—as core development infrastructure. These assets power speech recognition, machine translation, voice assistants, and service chatbots that actually work for the people who need them most. The price tag is modest by development standards; the payoff is reach, equity, and an ecosystem that local innovators can build on.
The Missing Language Layer
DPI has become a mantra in development circles: interoperable systems that let people prove who they are, move money, and share data safely. Yet too often those systems are designed first—or only—in the languages of elites and majorities. Speakers of minority and low-resource languages are left with partial access at best.
The failure is basic comprehension. Without high-quality, openly available language data—parallel text corpora, speech recordings, and lexicons—AI-enabled services inside DPI (such as a benefits hotline, a triage chatbot, or a voice-enabled payment app) either misfire or don’t exist. For billions, the linguistic barrier is as real as a lack of bandwidth.
Treat Language Data as a Digital Public Good
A useful frame already exists. The UN-backed Digital Public Goods Alliance defines digital public goods (DPGs) as openly licensed software, data, models, and standards that meet best practices on privacy, equity, and security and advance the SDGs. DPI provides the rails; DPGs supply the building blocks that run on them.
Open, well-governed multilingual datasets are archetypal DPGs. They are non-rivalrous by nature (my use doesn’t diminish yours) but often excludable unless policy and procurement push toward openness. Left to the market, high-quality language resources cluster where profits are most likely, and low-resource communities are often ignored. Meanwhile, valuable data sits in private silos or in scattered, poorly standardized collections. The result is duplication, bias, and waste.
Why MDBs, Not Just Markets
MDBs are designed to fill exactly this kind of gap. National governments face domestic politics and capacity constraints; private firms chase near-term returns; philanthropy is catalytic but finite. MDBs operate across borders, finance at scale, and have a mandate to maximize public benefit. They already bankroll the identity systems and payment networks that require language capability to be equitable. Extending that portfolio to multilingual datasets is a natural, low-cost multiplier.
But data alone doesn’t deliver public value. Poorly curated or proprietary datasets can entrench error, bake in bias, or exclude the very communities MDBs aim to serve. If language data is going to function as infrastructure, MDB projects must go beyond “collection” and insist on the full stack: open licensing, clear quality standards, transparent governance, community oversight, and privacy by design.
Cost, Speed, and Leverage
Language data is labor-intensive but cheap compared with big-ticket infrastructure, and that is what makes it powerful. India’s Bhashini reportedly spent on the order of $6–7 million to assemble datasets across 22 scheduled languages, unlocking applications from voice-based digital payments to farmer helplines. By contrast, a single national digital-ID rollout can cost hundreds of millions; the World Bank’s commitment to Ethiopia’s digital ID was about $350 million. Set against sums of that size, a language-data budget is close to a rounding error, yet it can significantly increase the ROI of the larger digital programs it supports.
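To make the scale concrete, the short calculation below sets the two figures quoted above side by side. It is a back-of-the-envelope sketch only: the numbers are the reported ones from the text, they come from different countries and programs, and the function name `share_of_budget` is a hypothetical helper introduced for illustration.

```python
# Back-of-the-envelope comparison using only the figures quoted in the text.
# Assumption: the two programs are in different countries, so this illustrates
# relative scale, not the actual cost structure of any single project.

def share_of_budget(cost_usd: float, total_usd: float) -> float:
    """Return cost as a percentage of total, rounded to one decimal place."""
    return round(cost_usd / total_usd * 100, 1)

BHASHINI_LOW_USD = 6_000_000         # lower end of the reported $6-7 million
BHASHINI_HIGH_USD = 7_000_000        # upper end of the reported $6-7 million
ETHIOPIA_ID_USD = 350_000_000        # World Bank commitment cited in the text

low_share = share_of_budget(BHASHINI_LOW_USD, ETHIOPIA_ID_USD)
high_share = share_of_budget(BHASHINI_HIGH_USD, ETHIOPIA_ID_USD)

print(f"Language-data spend is about {low_share}% to {high_share}% of the ID commitment")
```

On these figures, a language-data effort of Bhashini's reported size would amount to roughly 2 percent of a digital-ID commitment of Ethiopia's size.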
The leverage doesn’t stop at cost. Once created, high-quality datasets are reusable across sectors: health, education, agriculture, and social protection. They disproportionately benefit people who are less literate, less connected, or less fluent in dominant languages—precisely the groups most at risk of being left behind. Open language data has documented downstream impacts on education, financial inclusion, healthcare, agriculture, communication, and disaster response.
Proof Points on the Ground
There’s no need to start from scratch. Consider what’s already working. India’s Bhashini platform shows how a public repository can knit together contributions from government agencies, research labs, startups, and volunteers through shared tooling and evaluation tracks, and it is already powering practical applications from payments to helplines. Building on that ecosystem, AI4Bharat demonstrates how releasing open corpora and models across dozens of Indian languages can catalyze a developer community that builds in local languages first.
In West Africa, the YorùLect consortium assembled a high-quality corpus covering four Yoruba dialects in a matter of months, grounding its work in community engagement to capture real dialectal range. At a continental scale, the philanthropy-backed Lacuna Fund has seeded open language datasets for more than two dozen African languages, demonstrating that public-good infrastructure can be built even in resource-constrained contexts. Taken together, these efforts outline an operational playbook that MDBs can scale: fund toolkits for community collection, align data standards and benchmarks, publish templates for consent and privacy, and create regional knowledge-sharing platforms so countries can reuse methods rather than reinvent them. Treat the resulting datasets as evolving public assets, not one-off inputs, and their value compounds.
The Case for MDB Leadership
The case for MDB leadership rests on four planks. First is inclusion and rights: equitable access to digital services now defines development, and language inclusion is the foundation for participation by rural, Indigenous, and low-literacy communities. Second is the provision of regional public goods: open, interoperable datasets cross borders and unlock gains in education, health, agriculture, and governance, serving whole populations rather than narrow client bases. Third is better value from sunk costs: adding language support extends the reach and increases the ROI of identity, payments, and data-exchange systems that banks already fund, making it a cheap force multiplier. And fourth is correcting market failure: for low-resource languages, the costs are front-loaded and the returns diffuse, which is precisely where MDBs aggregate demand, de-risk investment, and crowd in public, private, and philanthropic partners.
How to Operationalize—Without Reinventing the Wheel
A practical program fits easily within existing digital portfolios. Create dedicated financing windows for multilingual data, with priority for low-resource languages. Make language a day-one requirement in DPI, so that ID budgets cover translation and speech while health and education projects plan for local-language content, triage chatbots, and transcription. Build capacity in universities, statistical offices, labs, and civic-tech groups with open tools, governance templates, and training. Broker global–local partnerships (e.g., Masakhane, Mozilla) to keep projects state-of-the-art and grounded. Link disbursements to open releases, standards compliance, and measurable coverage gains. And fund South–South exchange and shared benchmarks so that Kiswahili, Hausa, and Bengali datasets interoperate rather than fragment.
A Practical Rollout Pattern
The fastest way to integrate multilingual data into digital infrastructure is to bundle it directly into existing projects, rather than treat it as a separate initiative. For example, when a bank funds a new national ID or payment system, a small portion of that budget can be set aside for collecting and releasing language datasets at the same time. The work can follow a simple sequence: at the outset, identify which local languages matter most for the service; midway through the project, release partial datasets and pilot tools so that progress is visible and immediately useful; and by the project’s close, ensure that an open dataset is publicly available, accompanied by basic documentation, benchmarks of quality, and a plan for ongoing maintenance by a designated steward. If funding is limited, banks can pool resources across countries with similar linguistic needs or tie disbursements to the verified release of high-quality datasets through results-based financing.
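The sequence above, combined with results-based financing, could be sketched roughly as follows. This is an illustration under stated assumptions, not a description of any bank's actual disbursement rules: the `Milestone` class, the `releasable_amount` function, the 20/30/50 split, and the $1 million set-aside are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Milestone:
    """One stage of the rollout; all names and shares here are hypothetical."""
    name: str
    share_pct: int          # percent of the language-data set-aside tied to this stage
    verified: bool = False  # True once an independent check confirms the stage

def releasable_amount(budget_usd: int, milestones: list[Milestone]) -> int:
    """Sum the tranches whose milestones have been verified."""
    if sum(m.share_pct for m in milestones) != 100:
        raise ValueError("milestone shares must add up to 100 percent")
    return sum(budget_usd * m.share_pct // 100 for m in milestones if m.verified)

# The three stages mirror the outset / midway / close sequence in the text.
milestones = [
    Milestone("priority local languages identified", 20, verified=True),
    Milestone("partial datasets and pilot tools released", 30, verified=True),
    Milestone("open dataset published with documentation, benchmarks, and steward", 50),
]

SET_ASIDE_USD = 1_000_000  # illustrative figure, not taken from any real project
released = releasable_amount(SET_ASIDE_USD, milestones)
print(f"Released so far: ${released:,} of ${SET_ASIDE_USD:,}")
```

Tying the largest tranche to the final open release reflects the point that the dataset, with its documentation and maintenance plan, is the deliverable that matters most; a real program would set the split to suit its own risk appetite.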
Anticipating the Pushback
Large institutions don’t change course easily. Multilateral banks already have entrenched budgets and routines, and adding a new requirement—like building language data into every digital project—will meet resistance. The way to overcome that resistance is to show how language fits inside what they already do. First, tie language objectives directly to existing development frameworks so they are seen as part of the mandate, not an extra. Second, prove the idea with small pilots that deliver visible user benefits—for example, a voice-based enrollment tool in a widely spoken minority language. Third, publish short evaluation reports so that progress is transparent, comparable across projects, and difficult to dismiss. These steps lower the political and bureaucratic cost of adopting multilingual data as infrastructure.
What Success Looks Like
A farmer in Madhya Pradesh asks a voice assistant in Bundeli about crop price supports. An elder in rural Ghana renews health insurance in Twi. A social-protection hotline in northern Nigeria automatically transcribes and translates calls across Hausa dialects so caseworkers can respond faster. A Kenyan startup taps an open Kiswahili corpus to build a safety filter that finally catches abuse without over-blocking everyday speech. None of this requires frontier AI. It requires trustworthy, open, multilingual data.
The bottom line: Language is not ornamentation. It’s the interface of rights, services, and opportunity. MDBs have the mandate and the means to close the gap. Treating multilingual datasets as infrastructure is a small budget shift with outsized returns—and a clear signal that inclusion is more than a slogan.
Abeer Sharma is a lawyer and researcher focused on international dispute resolution and the governance of emerging technologies. His legal work spans arbitration, financial crimes, constitutional law, and complex regulation, grounded in comparative and international practice. He has worked across the blockchain ecosystem on governance and institutional design, including DAOs and decision-making systems. His current research centers on AI governance—how policy and regulatory frameworks can keep pace with fast-moving technologies. Trained in law, economics, and philosophy, he brings an interdisciplinary lens to technology, governance, and global risk, emphasizing responsible AI adoption and pragmatic governance innovations.