Iranian Languages | SilkRoadNLP


Scripts and Orthographies

Orthographic systems vary widely across the family and often reflect regional history and policy:

Persian (Iran): Perso-Arabic script; Tajik (Tajikistan): Cyrillic (with historical use of Perso-Arabic and Latin); Dari (Afghanistan): Perso-Arabic.
Kurdish: commonly Latin (Hawar/Bedirxan) in Turkey/Syria and Kurdo-Arabic in Iraq/Iran; historical use of Cyrillic in the former USSR; multiple standardization efforts exist by region/variety.
Pashto: extended Perso-Arabic script with additional letters standardized over centuries.

These cross-script realities pose practical challenges for NLP (OCR, normalization, transliteration, tokenization) and opportunities for cross-script tools and benchmarks.
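As one illustration of the normalization problem, the sketch below unifies a few common Perso-Arabic codepoint variants for Persian text. The mapping, the diacritic range, and the function name `normalize_persian` are assumptions chosen for illustration, not a standard. The mapping is Persian-specific: Pashto, Kurdish, and Arabic keep some of these letters distinct, so it should not be applied to them unchanged.

```python
import unicodedata

# Hypothetical, Persian-oriented mapping of visually similar codepoints.
PERSIAN_CHAR_MAP = {
    "\u064A": "\u06CC",  # ARABIC LETTER YEH          -> ARABIC LETTER FARSI YEH
    "\u0649": "\u06CC",  # ARABIC LETTER ALEF MAKSURA -> ARABIC LETTER FARSI YEH
    "\u0643": "\u06A9",  # ARABIC LETTER KAF          -> ARABIC LETTER KEHEH
    "\u0640": None,      # ARABIC TATWEEL (elongation) is dropped
}
_TABLE = str.maketrans(PERSIAN_CHAR_MAP)

# Short-vowel and related marks (fathatan .. sukun); an assumed range.
_HARAKAT = {chr(cp) for cp in range(0x064B, 0x0653)}


def normalize_persian(text: str, strip_diacritics: bool = True) -> str:
    """Return text with codepoint variants unified for Persian."""
    # NFC first, so canonically equivalent sequences compare equal.
    text = unicodedata.normalize("NFC", text)
    text = text.translate(_TABLE)
    if strip_diacritics:
        text = "".join(ch for ch in text if ch not in _HARAKAT)
    # The zero-width non-joiner (U+200C) is deliberately left untouched,
    # since it is orthographically meaningful in Persian.
    return text
```

A call such as `normalize_persian("\u0643\u062A\u0627\u0628")` rewrites the Arabic kaf as Persian keheh, so two spellings of the same word compare equal after normalization.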


Iranian Languages: An Overview

Iranian linguistic family

The Iranian (Iranic) linguistic family is one of the two major branches of Indo-Iranian, itself a branch of Indo-European. It comprises dozens of modern and historical languages and varieties spoken across Iran, Afghanistan, Tajikistan, Pakistan, Iraq, Turkey, the Caucasus, and parts of Central and South Asia. Major contemporary languages include Persian (Farsi/Dari/Tajik), Kurdish, Pashto, Balochi, Luri, Gilaki, Mazandarani, and Ossetic. Estimates of the number of Iranian languages vary by source; SIL notes ~86 Iranian languages, with the largest communities in Persian, Pashto, and the Kurdish continuum.

Iranian Language Chart: List of languages accepted at SilkRoadNLP 2026

Historical lineage and classification

Within Indo-Iranian, the Iranian branch developed through three broadly attested stages—Old Iranian, Middle Iranian, and New (Modern) Iranian—with Old Persian (Achaemenid inscriptions) and Avestan as the best-attested Old Iranian languages. Today’s varieties are typically grouped into Western and Eastern Iranian, with Western often further divided into Southwestern (e.g., Persian) and Northwestern (e.g., Kurdish, Balochi, Gilaki/Mazandarani), and Eastern including Pashto, Ossetic (or Ossetian), and others.

Western Iranian:
- Southwestern: Persian (Farsi/Dari/Tajik) and close relatives (e.g., Luri).
- Northwestern: Kurdish (Sorani/Kurmanji/Laki), Balochi, Gilaki, Mazandarani, Central Iranian varieties, etc.
Eastern Iranian: Pashto, Ossetic, Pamir languages, and several smaller or endangered varieties.

Iranian Language Family (ethnologue)

Geographic distribution

Iranian languages are transnational, spanning the Iranian plateau, the Caucasus (e.g., Ossetic), Central Asia (e.g., Tajik), Mesopotamia and Anatolia (e.g., Kurdish varieties), and South Asia (e.g., Pashto in Afghanistan/Pakistan). Read more at The Iranian Language Family.

Linguistic characteristics (high-level)

Across the family, many languages display SOV (Subject-Object-Verb) word order, rich derivational morphology, and productive light-verb and compound-predicate constructions (prominent in Persian and Kurdish). Variation is substantial: for instance, some Eastern Iranian languages (e.g., Pashto) retain case-marking contrasts that interact with alignment patterns, such as split ergativity in past-tense clauses, while several Western Iranian varieties have reduced nominal case and developed robust linking (ezafe) constructions. Details vary by language and variety.

Low-resource, endangerment, and policy context

Many Iranian languages are low-resource in the digital sphere, lacking open corpora, standardized orthographies, and NLP tools. Several are endangered or at risk, with limited documentation and few community resources, a situation repeatedly highlighted in Iranian linguistics and language-policy literature. Regional language policies, choices about the language of instruction in education, migration, and script reforms (e.g., the adoption of Cyrillic for Tajik) have shaped usage and intergenerational transmission, affecting both visibility and technology support.

Why this matters for NLP and LLMs

In the era of LLMs, data imbalances risk amplifying exclusion: languages with sparse digital presence are less likely to be represented accurately, and their scripts, dialect continua, and cultural registers can be mis-modeled. For the Iranian linguistic family, priority areas include:

Data creation & curation: balanced corpora across regions, genres, and scripts.
Benchmarks & evaluation: cross-script, cross-variety, and dialect-aware tasks.
Script technology: OCR, transliteration, normalization across Perso-Arabic/Cyrillic/Latin.
Documentation & revitalization: community-led projects for endangered and heritage varieties.
Ethical & culturally grounded NLP: respecting local naming, identity, and sociolinguistic realities.
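For the data-curation and script-technology items above, a first practical step is often simply knowing which script each document is written in. The sketch below is one rough way to tally scripts using only the Python standard library. It relies on the Unicode character name prefix as a heuristic, and the names `script_of`, `script_profile`, and `dominant_script` are hypothetical helpers, not part of any existing toolkit.

```python
import unicodedata
from collections import Counter
from typing import Optional

# Assumed labels for the three script families discussed in this page.
SCRIPT_PREFIXES = {
    "ARABIC": "Perso-Arabic",
    "CYRILLIC": "Cyrillic",
    "LATIN": "Latin",
}


def script_of(ch: str) -> Optional[str]:
    """Return a coarse script label for a letter, or None for non-letters."""
    if not ch.isalpha():
        return None
    # Unicode names start with the script, e.g. "CYRILLIC SMALL LETTER TE".
    prefix = unicodedata.name(ch, "").split(" ", 1)[0]
    return SCRIPT_PREFIXES.get(prefix, "Other")


def script_profile(text: str) -> Counter:
    """Count letters per script in a piece of text."""
    return Counter(s for s in map(script_of, text) if s is not None)


def dominant_script(text: str) -> Optional[str]:
    """Return the most frequent script label, or None if there are no letters."""
    counts = script_profile(text)
    return counts.most_common(1)[0][0] if counts else None
```

Running `script_profile` over each document in a corpus gives a quick view of how evenly Perso-Arabic, Cyrillic, and Latin material is represented, and flags mixed-script texts for closer inspection. The name-prefix heuristic is coarse and does not distinguish languages that share a script.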
