the Local Language Speech Technology Initiative
 


Language Complexity Scoring

These are some of the languages we have scored. The scores are approximate, but they give a good indication of how difficult it is to produce a usable TTS system for each language:
Language             Basic Intelligibility   Full Intelligibility
Pashto               9                       9.5
Arabic (classical)   7                       8.5
Russian              6                       9
Tibetan              6                       7.5
isiZulu              6                       8
Ibibio               5                       7
Thai                 5                       8
English              4                       6
Hindi                2                       4
Welsh                1                       4
Kiswahili            0                       4
Tamil                0                       2.5

The basis for scoring is a combination of script features and language features. The script features are as follows:
Capitalization
  Score: 0 – present (e.g., English)
         +1 – absent (e.g., Burmese)
  TTS modules based on the feature: Sentence extraction, Tokenisation, Text Normalisation, Proper names processing, Acronym processing, Abbreviations processing

Grapheme-to-phoneme correspondence
  Score: 0 – direct (e.g., Lao)
         +1 – direct with exceptions (e.g., Malayalam)
         +2 – not direct (e.g., English)
         +3 – direct with optional vowel marking (Arabic and Hebrew scripts)
  TTS modules based on the feature: Grapheme-to-phoneme conversion

Symbols for loan words
  Score: 0 – absent (e.g., Armenian)
         –0.5 – present (e.g., Bengali)
  TTS modules based on the feature: Loan words processing

Symbols for tones or stress (scored in combination with the corresponding fields in the language features table)
  Score, tone languages only:
         0 – present (e.g., Lao)
         +1 – absent (e.g., Chinese)
  Score, languages with free word stress only:
         0 – absent (e.g., Cyrillic)
         –1 – present (e.g., Greek)
  TTS modules based on the feature: Tone and stress assignment

Punctuation marks
  Score: 0 – European (or close)
         +1 – absent or very limited (e.g., Thai)
  TTS modules based on the feature: Sentence extraction, Tokenisation, Text Normalisation

Spaces between words
  Score: 0 – present (e.g., Latin)
         +1 – absent (e.g., Chinese)
  TTS modules based on the feature: Sentence extraction, Tokenisation

Homographs
  Score: 0 – absent (e.g., Hindi)
         +1 – present (e.g., Chinese)
  TTS modules based on the feature: Homograph disambiguation

Other characteristics
  Score: e.g., –0.5 – different initial, medial, final and isolated letter forms (Arabic)
  TTS modules based on the feature: Morphological decomposition
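
The script-feature values above can be read as a lookup from a feature value to a score adjustment. The following Python sketch encodes them and sums the adjustments for one script profile. The additive rule, the dictionary key names, the function name `script_score`, and the example profile are all assumptions made for illustration; the page does not state how individual adjustments are combined into the Basic and Full Intelligibility scores.

```python
# Score adjustments copied from the script-feature table.
# The key names are hypothetical labels chosen for this sketch.
SCRIPT_SCORES = {
    "capitalization": {"present": 0.0, "absent": 1.0},
    "g2p_correspondence": {
        "direct": 0.0,
        "direct_with_exceptions": 1.0,
        "not_direct": 2.0,
        "optional_vowel_marking": 3.0,
    },
    "loan_word_symbols": {"absent": 0.0, "present": -0.5},
    "tone_symbols": {"present": 0.0, "absent": 1.0},      # tone languages only
    "stress_symbols": {"absent": 0.0, "present": -1.0},   # free word stress only
    "punctuation": {"european": 0.0, "absent_or_limited": 1.0},
    "word_spaces": {"present": 0.0, "absent": 1.0},
    "homographs": {"absent": 0.0, "present": 1.0},
}


def script_score(profile):
    """Sum the adjustments for every feature named in the profile.

    Assumption: adjustments are simply added together. Features that do
    not apply (for example tone symbols in a non-tonal language) are left
    out of the profile and contribute nothing.
    """
    total = 0.0
    for feature, value in profile.items():
        total += SCRIPT_SCORES[feature][value]
    return total


# Hypothetical profile, not a scored language from the table.
example_profile = {
    "capitalization": "present",
    "g2p_correspondence": "not_direct",
    "punctuation": "european",
    "word_spaces": "present",
    "homographs": "present",
}

print(script_score(example_profile))  # prints 3.0
```

Keeping the values in a plain dictionary makes it easy to compare two scripts feature by feature before trusting any total.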

And the language features are:
Tones (cues for tone assignment), tone languages only
  Score: 0 – present (e.g., Panjabi)
         +1 – absent (e.g., Khmer)
         +2 – grammatical and terraced tones
  TTS modules based on the feature: Tone assignment

Lexical stress (cues for lexical stress assignment)
  Score: 0 – fixed (or no stress at all)
         +1 – rule-based (e.g., Arabic)
         +2 – free (e.g., Russian)
  TTS modules based on the feature: Stress assignment

Secondary stress or rhythm
  Score: 0 – no secondary stress (e.g., Bengali)
         +1 – rule-based (e.g., Estonian)
         +2 – not rule-based (e.g., German)
  TTS modules based on the feature: Stress assignment

Intonation patterns
  Score: more or less language-independent
  TTS modules based on the feature: Phrasing

Other phonetic characteristics
  Score: +1 – palatalization
         +1 – reduction
  TTS modules based on the feature: Grapheme-to-phoneme conversion

Morpho-syntactic characteristics
  Score: 0 – analytical (e.g., Germanic languages)
         +1 – agglutinating (e.g., Turkish); isolating (e.g., Thai)
         +1 – inflecting (e.g., Hindi)
         +2 – highly inflecting (e.g., Russian)
         +2 – root inflecting (e.g., Arabic)
  TTS modules based on the feature: Morphological analysis

Proper syntactic characteristics (general organisation of a sentence)
  Score: 0 – fixed word order (e.g., English)
         +1 – free word order (e.g., Russian)
         +1 – isolating languages (e.g., Chinese)
  TTS modules based on the feature: Phrasing
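
The language-feature values can be tallied in the same spirit. The Python sketch below encodes the values listed above and adds them up for a partial profile. The additive rule, the key names, and the function name `language_score` are assumptions made for illustration. The example uses only the three values the table itself attributes to Russian, so its result is a partial tally and is not meant to reproduce the published Russian scores.

```python
# Score values copied from the language-feature table.
# The key names are hypothetical labels chosen for this sketch.
LANGUAGE_SCORES = {
    "tones": {"cues_present": 0, "cues_absent": 1, "grammatical_or_terraced": 2},
    "lexical_stress": {"fixed_or_none": 0, "rule_based": 1, "free": 2},
    "secondary_stress": {"none": 0, "rule_based": 1, "not_rule_based": 2},
    "morphology": {
        "analytical": 0,
        "agglutinating": 1,
        "isolating": 1,
        "inflecting": 1,
        "highly_inflecting": 2,
        "root_inflecting": 2,
    },
    "syntax": {"fixed_word_order": 0, "free_word_order": 1, "isolating": 1},
}

# "Other phonetic characteristics" are independent flags worth +1 each.
PHONETIC_FLAGS = {"palatalization": 1, "reduction": 1}


def language_score(profile, phonetic_flags=()):
    """Add up the values for a profile plus any phonetic flags.

    Assumption: the values are simply summed; the source page does not
    spell out the combination rule.
    """
    total = sum(LANGUAGE_SCORES[feature][value] for feature, value in profile.items())
    total += sum(PHONETIC_FLAGS[flag] for flag in phonetic_flags)
    return total


# Only the values the table gives with Russian as the example.
russian_partial = {
    "lexical_stress": "free",
    "morphology": "highly_inflecting",
    "syntax": "free_word_order",
}

print(language_score(russian_partial))  # prints 5
```

Intonation patterns are left out of the lookup because the table treats them as more or less language-independent rather than assigning a numeric value.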



© Local Language Speech Technology Initiative. All Rights Reserved.