Language Complexity Scoring
These are some of the languages we have scored.
The scores are approximate, but they give a good indication of how
difficult it is to produce a usable TTS system in each language:
| Language | Basic Intelligibility | Full Intelligibility |
| Pashto | 9 | 9.5 |
| Arabic (classical) | 7 | 8.5 |
| Russian | 6 | 9 |
| Tibetan | 6 | 7.5 |
| isiZulu | 6 | 8 |
| Ibibio | 5 | 7 |
| Thai | 5 | 8 |
| English | 4 | 6 |
| Hindi | 2 | 4 |
| Welsh | 1 | 4 |
| Kiswahili | 0 | 4 |
| Tamil | 0 | 2.5 |
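For readers who want to work with these figures programmatically, the table can be held as a small mapping. The sketch below is only an illustration: the names `SCORES`, `gap` and `by_full` are hypothetical, and the values are copied from the table above.

```python
# Scores copied from the table above: (basic intelligibility, full intelligibility).
SCORES = {
    "Pashto": (9, 9.5),
    "Arabic (classical)": (7, 8.5),
    "Russian": (6, 9),
    "Tibetan": (6, 7.5),
    "isiZulu": (6, 8),
    "Ibibio": (5, 7),
    "Thai": (5, 8),
    "English": (4, 6),
    "Hindi": (2, 4),
    "Welsh": (1, 4),
    "Kiswahili": (0, 4),
    "Tamil": (0, 2.5),
}


def gap(language):
    """Return how much the score rises from basic to full intelligibility."""
    basic, full = SCORES[language]
    return full - basic


# Languages ordered by full-intelligibility score, highest first.
by_full = sorted(SCORES, key=lambda name: SCORES[name][1], reverse=True)

if __name__ == "__main__":
    for name in by_full:
        basic, full = SCORES[name]
        print(f"{name:20} basic={basic:<4} full={full:<4} gap={gap(name)}")
```

Sorting by the second column instead of the first shows, for example, that Russian ranks just below Pashto for full intelligibility even though its basic score is lower than that of classical Arabic.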
The scoring is based on a combination of script and language features. The script features are as follows:
| Script Feature | Score | TTS Modules based on the feature |
| Capitalization | 0 – present (e.g., English); +1 – absent (e.g., Burmese) | Sentence extraction, Tokenisation, Text Normalisation, Proper names processing, Acronym processing, Abbreviations processing |
| Grapheme-to-phoneme correspondence | 0 – direct (e.g., Lao); +1 – direct with exceptions (e.g., Malayalam); +2 – not direct (e.g., English); +3 – direct with optional vowel marking (Arabic and Hebrew scripts) | Grapheme-to-phoneme conversion |
| Symbols for loan words | 0 – absent (e.g., Armenian); –0.5 – present (e.g., Bengali) | Loan words processing |
| Symbols for tones or stress (in combination with the field ones in the table Languages) | For tone languages only: 0 – present (e.g., Lao); +1 – absent (e.g., Chinese). For languages with free word stress only: 0 – absent (e.g., Cyrillic); –1 – present (e.g., Greek) | Tone and stress assignment |
| Punctuation marks | 0 – European (or close); +1 – absent or very limited (e.g., Thai) | Sentence extraction, Tokenisation, Text Normalisation |
| Spaces between words | 0 – present (e.g., Latin); +1 – absent (e.g., Chinese) | Sentence extraction, Tokenisation |
| Homographs | 0 – absent (e.g., Hindi); +1 – present (e.g., Chinese) | Homograph disambiguation |
| Other characteristics | E.g., –0.5 – different initial, medial, final and isolated letter forms (Arabic) | Morphological decomposition |
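The script-feature table reads naturally as a lookup from feature values to score adjustments. A minimal sketch follows, assuming the adjustments are simply added together; the text says only that the score is "a combination" of features, so the additive rule is an assumption. The names `SCRIPT_SCORES` and `script_score` and the value keys are hypothetical, and the open-ended "Other characteristics" row is left out.

```python
# Score adjustments transcribed from the script-feature table above.
# Dictionary keys are hypothetical labels chosen for this sketch.
SCRIPT_SCORES = {
    "capitalization": {"present": 0, "absent": 1},
    "g2p_correspondence": {
        "direct": 0,
        "direct_with_exceptions": 1,
        "not_direct": 2,
        "optional_vowel_marking": 3,
    },
    "loan_word_symbols": {"absent": 0, "present": -0.5},
    "tone_symbols": {"present": 0, "absent": 1},     # tone languages only
    "stress_symbols": {"absent": 0, "present": -1},  # free word stress only
    "punctuation": {"european": 0, "absent_or_limited": 1},
    "word_spaces": {"present": 0, "absent": 1},
    "homographs": {"absent": 0, "present": 1},
}


def script_score(features):
    """Sum the adjustments for a {feature: value} mapping.

    Features that do not apply (e.g. tone symbols for a non-tone language)
    are simply omitted from `features` and contribute nothing.
    Assumption: adjustments combine by plain addition.
    """
    total = 0
    for feature, value in features.items():
        total += SCRIPT_SCORES[feature][value]
    return total


if __name__ == "__main__":
    # Only the values the table itself attributes to Chinese; the remaining
    # rows are omitted, so this is a partial score, not a full one.
    chinese_partial = {
        "tone_symbols": "absent",
        "word_spaces": "absent",
        "homographs": "present",
    }
    print(script_score(chinese_partial))  # 3
```

Keeping the table as data rather than as branching logic means a new script can be described by listing its feature values, without touching the function.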
And the language features are:
| Language Feature | Score | TTS Modules based on the feature |
| Tones (cues for tone assignment) | For tone languages only: 0 – present (e.g., Panjabi); +1 – absent (e.g., Khmer); +2 – grammatical and terraced tones | Tone assignment |
| Lexical stress (cues for lexical stress assignment) | 0 – fixed (or no stress at all); +1 – rule-based (e.g., Arabic); +2 – free (e.g., Russian) | Stress assignment |
| Secondary stress or rhythm | 0 – no secondary stress (e.g., Bengali); +1 – rule-based (e.g., Estonian); +2 – not rule-based (e.g., German) | Stress assignment |
| Intonation patterns | More or less language-independent | Phrasing |
| Other phonetic characteristics | +1 – palatalization; +1 – reduction | Grapheme-to-phoneme conversion |
| Morpho-syntactic characteristics | 0 – analytical (e.g., German languages); +1 – agglutinating (e.g., Turkish) or isolating (e.g., Thai); +1 – inflecting (e.g., Hindi); +2 – highly inflecting (e.g., Russian); +2 – root inflecting (e.g., Arabic) | Morphological analysis |
| Proper syntactic characteristics (general organisation of a sentence) | 0 – fixed word order (e.g., English); +1 – free word order (e.g., Russian); +1 – isolating languages (e.g., Chinese) | Phrasing |
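The language-feature table can be sketched the same way. As before, this assumes the adjustments are added together, which the text does not state explicitly. The names `LANGUAGE_SCORES`, `PHONETIC_EXTRAS` and `language_score` are hypothetical, and the "Intonation patterns" row is omitted because the table assigns it no score.

```python
# Score adjustments transcribed from the language-feature table above.
# Dictionary keys are hypothetical labels chosen for this sketch.
LANGUAGE_SCORES = {
    "tone_cues": {  # tone languages only
        "present": 0,
        "absent": 1,
        "grammatical_terraced": 2,
    },
    "lexical_stress": {"fixed_or_none": 0, "rule_based": 1, "free": 2},
    "secondary_stress": {"none": 0, "rule_based": 1, "not_rule_based": 2},
    "morphology": {
        "analytical": 0,
        "agglutinating": 1,
        "isolating": 1,
        "inflecting": 1,
        "highly_inflecting": 2,
        "root_inflecting": 2,
    },
    "word_order": {"fixed": 0, "free": 1, "isolating": 1},
}

# "Other phonetic characteristics": each one that is present adds +1.
PHONETIC_EXTRAS = {"palatalization": 1, "reduction": 1}


def language_score(features, phonetic=()):
    """Sum the adjustments for a {feature: value} mapping plus phonetic extras.

    Assumption: adjustments combine by plain addition.
    """
    total = sum(LANGUAGE_SCORES[f][v] for f, v in features.items())
    total += sum(PHONETIC_EXTRAS[p] for p in phonetic)
    return total


if __name__ == "__main__":
    # Only the values the table itself attributes to Russian; other rows
    # are omitted, so this is a partial score, not a full one.
    russian_partial = {
        "lexical_stress": "free",
        "morphology": "highly_inflecting",
        "word_order": "free",
    }
    print(language_score(russian_partial))  # 5
```

The phonetic characteristics are passed separately because, unlike the other rows, they are independent flags rather than mutually exclusive values.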
© Local Language Speech Technology Initiative. All Rights Reserved.