
Scripts and Orthographies
Orthographic systems vary widely across the family and often reflect regional history and policy:
- Persian (Iran): Perso-Arabic script; Tajik (Tajikistan): Cyrillic (with historical use of Perso-Arabic and Latin); Dari (Afghanistan): Perso-Arabic.
- Kurdish: commonly Latin (Hawar/Bedirxan) in Turkey/Syria and Kurdo-Arabic in Iraq/Iran; historical use of Cyrillic in the former USSR; multiple standardization efforts exist by region/variety.
- Pashto: extended Perso-Arabic script with additional letters standardized over centuries.
These cross-script realities pose practical challenges for NLP (OCR, normalization, transliteration, tokenization) and opportunities for cross-script tools and benchmarks.
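As one illustration of the normalization challenge, the sketch below folds a few Arabic code points into the forms conventionally used in Persian text. The mapping table and the function name `normalize_persian` are assumptions made for this example, not an established standard. The mapping applies to Persian only: Pashto, for instance, treats several yeh variants as distinct letters, so the same folding would destroy information there.

```python
import unicodedata

# Hypothetical, Persian-only mapping (an assumption for illustration).
# Do not apply to Pashto or Kurdish text without checking their orthographies.
PERSIAN_CHAR_MAP = {
    "\u064A": "\u06CC",  # ARABIC LETTER YEH          -> ARABIC LETTER FARSI YEH
    "\u0649": "\u06CC",  # ARABIC LETTER ALEF MAKSURA -> ARABIC LETTER FARSI YEH
    "\u0643": "\u06A9",  # ARABIC LETTER KAF          -> ARABIC LETTER KEHEH
    "\u0640": None,      # ARABIC TATWEEL (elongation mark) is removed
}
_TABLE = str.maketrans(PERSIAN_CHAR_MAP)


def normalize_persian(text: str) -> str:
    """Apply Unicode NFC, then fold Arabic look-alike letters to Persian forms."""
    text = unicodedata.normalize("NFC", text)
    return text.translate(_TABLE)


if __name__ == "__main__":
    # The same word typed with an Arabic keyboard layout and a Persian one
    # compares equal only after normalization.
    arabic_typed = "\u0643\u062A\u0627\u0628"   # begins with ARABIC LETTER KAF
    persian_typed = "\u06A9\u062A\u0627\u0628"  # begins with ARABIC LETTER KEHEH
    print(arabic_typed == persian_typed)
    print(normalize_persian(arabic_typed) == persian_typed)
```

Keeping the table per language, rather than sharing one table across the family, is the design choice that matters here: a letter that is noise in one orthography can be contrastive in another.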

Iranian Languages: An Overview

Iranian linguistic family
The Iranian (Iranic) linguistic family is one of the two major branches of Indo-Iranian, itself a branch of Indo-European. It comprises dozens of modern and historical languages and varieties spoken across Iran, Afghanistan, Tajikistan, Pakistan, Iraq, Turkey, the Caucasus, and parts of Central and South Asia. Major contemporary languages include Persian (Farsi/Dari/Tajik), Kurdish, Pashto, Balochi, Luri, Gilaki, Mazandarani, and Ossetian. Estimates of the number of Iranian languages vary by source; SIL notes ~86 Iranian languages, with the largest communities in Persian, Pashto, and the Kurdish continuum.
Historical lineage and classification
Within Indo-Iranian, the Iranian branch developed through three broadly attested stages—Old Iranian, Middle Iranian, and New (Modern) Iranian—with Old Persian (Achaemenid inscriptions) and Avestan as the best-attested Old Iranian languages. Today’s varieties are typically grouped into Western and Eastern Iranian, with Western often further divided into Southwestern (e.g., Persian) and Northwestern (e.g., Kurdish, Balochi, Gilaki/Mazandarani), and Eastern including Pashto, Ossetian, and others.
- Western Iranian
  - Southwestern: Persian (Farsi/Dari/Tajik) and close relatives such as Luri.
  - Northwestern: Kurdish (Sorani/Kurmanji/Laki), Balochi, Gilaki, Mazandarani, Central Iranian varieties, etc.
- Eastern Iranian: Pashto, Ossetian, Pamir languages, and several smaller or endangered varieties.
Geographic distribution
Iranian languages are transnational, spanning the Iranian plateau, the Caucasus (e.g., Ossetian), Central Asia (e.g., Tajik), Mesopotamia and Anatolia (e.g., Kurdish varieties), and South Asia (e.g., Pashto in Afghanistan/Pakistan). Read more at The Iranian Language Family.
Linguistic characteristics (high-level)
Across the family, many languages display SOV (Subject-Object-Verb) word order, rich derivational morphology, and productive light-verb and compound-predicate constructions (prominent in Persian and Kurdish). Variation is substantial: for instance, some Eastern Iranian languages (e.g., Pashto) retain case-marking contrasts that interact with alignment patterns, while several Western Iranian varieties have reduced nominal case and developed robust linking (ezafe) constructions. Details vary by language and variety.
Low-resource, endangerment, and policy context
Many Iranian languages are low-resource in the digital sphere, lacking open corpora, standardized orthographies, and NLP tools. Several are endangered or at risk, with limited documentation and few community resources—a situation repeatedly highlighted in Iranian linguistics and language-policy literature. Regional language policies, education media, migration, and script reforms (e.g., Tajik Cyrillic adoption) have shaped usage and intergenerational transmission, impacting both visibility and technology support.
Why this matters for NLP and LLMs
In the era of LLMs, data imbalances risk amplifying exclusion: languages with sparse digital presence are less likely to be represented accurately, and their scripts, dialect continua, and cultural registers can be mis-modeled. For the Iranian linguistic family, priority areas include:
- Data creation & curation: balanced corpora across regions, genres, and scripts.
- Benchmarks & evaluation: cross-script, cross-variety, and dialect-aware tasks.
- Script technology: OCR, transliteration, normalization across Perso-Arabic/Cyrillic/Latin.
- Documentation & revitalization: community-led projects for endangered and heritage varieties.
- Ethical & culturally grounded NLP: respecting local naming, identity, and sociolinguistic realities.
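
A first practical step toward script-balanced corpora is knowing which script each text is written in. The sketch below is a minimal heuristic, assuming that Unicode character names are a good enough proxy for script. The function name `dominant_script` and the three-script list are hypothetical choices for this example; a real pipeline would likely need language identification on top, since script alone does not separate, say, Persian from Sorani Kurdish.

```python
import unicodedata
from collections import Counter

# Scripts of interest for this illustration (an assumption, not a full list).
SCRIPTS = ("ARABIC", "CYRILLIC", "LATIN")


def dominant_script(text: str) -> str:
    """Return the script whose letters occur most often in `text`.

    Each alphabetic character is classified by the prefix of its Unicode
    name (e.g. "ARABIC LETTER BEH"). Returns "UNKNOWN" when no letter
    belongs to one of the scripts of interest.
    """
    counts = Counter()
    for ch in text:
        if not ch.isalpha():
            continue  # skip digits, punctuation, whitespace
        name = unicodedata.name(ch, "")
        for script in SCRIPTS:
            if name.startswith(script):
                counts[script] += 1
                break
    if not counts:
        return "UNKNOWN"
    return counts.most_common(1)[0][0]


if __name__ == "__main__":
    samples = {
        "Persian (Perso-Arabic)": "\u0633\u0644\u0627\u0645",
        "Tajik (Cyrillic)": "\u0441\u0430\u043b\u043e\u043c",
        "Kurmanji (Latin)": "silav",
    }
    for label, sample in samples.items():
        print(label, "->", dominant_script(sample))
```

Counting documents per detected script gives a quick view of how skewed a collection is before any balancing or transliteration is attempted.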

Further reading & reference resources
- SIL Eurasia – Iranian language family overview (counts, macro view).
- Ethnologue – Indo-Iranian/Western/Eastern Iranian listings (subgroups & language entries).
- Wikipedia – Iranian languages.
- Fanoos – Map of Languages of Iran: interactive map.
- Gholami (2020), “Endangered Iranian Languages” – survey of endangerment dynamics.
- Script specifics: Tajik (Cyrillic; historical Arabic/Latin); Kurdish (Latin & Kurdo-Arabic; historical Cyrillic/Yezidi); Pashto (extended Perso-Arabic).