
Scripts and Orthographies

Orthographic systems vary widely across the family and often reflect regional history and policy:
 

  • Persian: Perso-Arabic script in Iran (Persian/Farsi) and Afghanistan (Dari); Cyrillic in Tajikistan (Tajik), with historical use of Perso-Arabic and Latin.

  • Kurdish: commonly Latin (Hawar/Bedirxan) in Turkey/Syria and Kurdo-Arabic in Iraq/Iran; historical use of Cyrillic in the former USSR; multiple standardization efforts exist by region/variety. 

  • Pashto: extended Perso-Arabic script with additional letters standardized over centuries. 
     

These cross-script realities pose practical challenges for NLP (OCR, normalization, transliteration, tokenization) and opportunities for cross-script tools and benchmarks.
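As one illustration of the normalization challenge, the same Persian word can be typed with Arabic-keyboard code points (Arabic Yeh, Arabic Kaf) or with their Persian counterparts, and the two spellings will not match as strings. The sketch below is a minimal, Persian-oriented normalizer. The function name `normalize_persian` and the choice of mappings are assumptions made for this example, not a standard. The mapping should not be applied blindly to other languages in the family: Pashto, for instance, treats several yeh-like letters as distinct.

```python
import unicodedata

# Assumed, Persian-specific folding table (illustrative, not exhaustive).
_PERSIAN_CHAR_MAP = {
    "\u064A": "\u06CC",  # ARABIC LETTER YEH          -> ARABIC LETTER FARSI YEH
    "\u0649": "\u06CC",  # ARABIC LETTER ALEF MAKSURA -> ARABIC LETTER FARSI YEH
    "\u0643": "\u06A9",  # ARABIC LETTER KAF          -> ARABIC LETTER KEHEH
    "\u0640": None,      # ARABIC TATWEEL (elongation mark) is dropped
}
_TABLE = str.maketrans(_PERSIAN_CHAR_MAP)


def normalize_persian(text: str) -> str:
    """Apply Unicode NFC, then fold Arabic-keyboard variants to Persian letters."""
    composed = unicodedata.normalize("NFC", text)
    return composed.translate(_TABLE)


if __name__ == "__main__":
    arabic_keyboard = "\u0643\u062A\u0627\u0628"   # "book" typed with Arabic Kaf
    persian_keyboard = "\u06A9\u062A\u0627\u0628"  # "book" typed with Keheh
    print(normalize_persian(arabic_keyboard) == persian_keyboard)
```

Running the normalizer over both the corpus and the queries before comparison is the usual design choice, so that both sides of a match go through the same folding.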


Iranian Languages: An Overview

Figure: Distribution of Iranian languages

Iranian linguistic family

The Iranian (Iranic) linguistic family is one of the two major branches of Indo-Iranian, itself a branch of Indo-European. It comprises dozens of modern and historical languages and varieties spoken across Iran, Afghanistan, Tajikistan, Pakistan, Iraq, Turkey, the Caucasus, and parts of Central and South Asia. Major contemporary languages include Persian (Farsi/Dari/Tajik), Kurdish, Pashto, Balochi, Luri, Gilaki, Mazandarani, and Ossetian. Estimates of the number of Iranian languages vary by source; SIL lists roughly 86, with the largest speaker communities being those of Persian, Pashto, and the Kurdish continuum.

Historical lineage and classification

Within Indo-Iranian, the Iranian branch developed through three broadly attested stages—Old Iranian, Middle Iranian, and New (Modern) Iranian—with Old Persian (Achaemenid inscriptions) and Avestan as the best-attested Old Iranian languages. Today’s varieties are typically grouped into Western and Eastern Iranian, with Western often further divided into Southwestern (e.g., Persian) and Northwestern (e.g., Kurdish, Balochi, Gilaki/Mazandarani), and Eastern including Pashto, Ossetian, and others. 

  • Western Iranian

    • Southwestern: Persian (Farsi/Dari/Tajik) and close relatives such as Luri.

    • Northwestern: Kurdish (Sorani/Kurmanji/Laki), Balochi, Gilaki, Mazandarani, Central Iranian varieties, etc.

  • Eastern Iranian: Pashto, Ossetian, Pamir languages, and several smaller or endangered varieties.

Geographic distribution

Iranian languages are transnational, spanning the Iranian plateau, the Caucasus (e.g., Ossetian), Central Asia (e.g., Tajik), Mesopotamia and Anatolia (e.g., Kurdish varieties), and South Asia (e.g., Pashto in Afghanistan/Pakistan). Read more at The Iranian Language Family.

Linguistic characteristics (high-level)

Across the family, many languages display SOV (Subject-Object-Verb) word order, rich derivational morphology, and productive light-verb and compound-predicate constructions (prominent in Persian and Kurdish). Variation is substantial: for instance, some Eastern Iranian languages (e.g., Pashto) retain case-marking contrasts that interact with alignment patterns, while several Western Iranian varieties have reduced nominal case and developed robust linking (ezafe) constructions. Details vary by language and variety.

Low-resource, endangerment, and policy context

Many Iranian languages are low-resource in the digital sphere, lacking open corpora, standardized orthographies, and NLP tools. Several are endangered or at risk, with limited documentation and few community resources, a situation repeatedly highlighted in Iranian linguistics and language-policy literature. Regional language policies, the choice of language of instruction in education, migration, and script reforms (e.g., the adoption of Cyrillic for Tajik) have shaped usage and intergenerational transmission, affecting both visibility and technology support.

Why this matters for NLP and LLMs

In the era of LLMs, data imbalances risk amplifying exclusion: languages with sparse digital presence are less likely to be represented accurately, and their scripts, dialect continua, and cultural registers can be mis-modeled. For the Iranian linguistic family, priority areas include:
 

  • Data creation & curation: balanced corpora across regions, genres, and scripts.

  • Benchmarks & evaluation: cross-script, cross-variety, and dialect-aware tasks.

  • Script technology: OCR, transliteration, normalization across Perso-Arabic/Cyrillic/Latin.

  • Documentation & revitalization: community-led projects for endangered and heritage varieties.

  • Ethical & culturally grounded NLP: respecting local naming, identity, and sociolinguistic realities.
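For the data-curation and script-technology items above, a common first step is simply measuring how much text a corpus contains in each script. The sketch below tallies alphabetic characters by script using the prefix of each character's Unicode name. The helper names (`script_of`, `script_counts`, `dominant_script`) are hypothetical, and the three-script list is an assumption that matches the scripts discussed on this page.

```python
import unicodedata
from collections import Counter
from typing import Optional

# Scripts of interest here; anything else alphabetic is bucketed as "OTHER".
SCRIPTS = ("ARABIC", "CYRILLIC", "LATIN")


def script_of(ch: str) -> Optional[str]:
    """Return the script bucket of one character, or None for non-letters."""
    if not ch.isalpha():
        return None  # digits, punctuation, spaces, combining marks
    name = unicodedata.name(ch, "")
    for script in SCRIPTS:
        if name.startswith(script):
            return script
    return "OTHER"


def script_counts(text: str) -> Counter:
    """Count alphabetic characters per script bucket."""
    return Counter(s for s in map(script_of, text) if s is not None)


def dominant_script(text: str) -> Optional[str]:
    """Return the most frequent script bucket, or None if there are no letters."""
    counts = script_counts(text)
    return counts.most_common(1)[0][0] if counts else None


if __name__ == "__main__":
    samples = ["\u0633\u0644\u0627\u0645", "\u0421\u0430\u043B\u043E\u043C", "Silav"]
    for sample in samples:
        print(sample, dominant_script(sample))
```

Per-document tallies like these can be aggregated by region or genre to check whether a corpus is skewed toward one script. The approach identifies scripts only: it cannot tell Persian from Pashto or Sorani Kurdish, which all use Perso-Arabic letters.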
     


Further reading & reference resources

  • SIL Eurasia: Iranian language family overview (language counts, macro view).

  • Ethnologue: Indo-Iranian, Western Iranian, and Eastern Iranian listings (subgroups and language entries).

  • Wikipedia: Iranian languages.

  • Fanoos Map of Languages of Iran: interactive map.

  • Gholami (2020), “Endangered Iranian Languages”: survey of endangerment dynamics.

  • Script specifics: Tajik (Cyrillic; historical Arabic/Latin); Kurdish (Latin & Kurdo-Arabic; historical Cyrillic/Yezidi); Pashto (extended Perso-Arabic).
