TTS-related Database

The TTS-related language database has formed the foundation of the local language TTS program.
It encapsulates our knowledge of languages and scripts used worldwide, and specifically identifies the language
and script features that add complexity when building TTS systems.

The database can be searched by language features, script features, or both.
You can also search on many other features using the advanced search.

From these language and script features we can derive an approximate complexity score for any language the
project is considering developing. The procedure is not automatic, because the assignment of scores is to a certain
extent subjective. Some of the languages we have scored, and the basis for the scoring, are shown here.

The database is neither complete nor 100% accurate. There is a distinct lack of data for some of the required
linguistic information, in particular place of lexical stress, place of secondary stress (if any) and
presence/absence of homographs. The identification of lexical and secondary stress in certain languages
appears to need more research work. In fact, we are finding that the process of producing a TTS system
for a language is in itself a very effective way of refining our understanding of the language. We touch
on this in our paper "Issues in Porting TTS to Minority Languages".

Three languages in the database (Malay, Panjabi and Sindhi) can each be written in two scripts. From the TTS
development point of view, each language-script combination is treated as a separate entry.

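As a rough illustration only, and not a description of the actual database schema, the sketch below models each language-script combination as its own entry and filters entries by language features, script features, or both. The class name `Entry`, the function `search`, the feature key `right_to_left` and the specific script names are all assumptions made for this example; the real field names in the database are not given here.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    """One hypothetical database entry: a language in one particular script."""
    language: str
    script: str
    language_features: dict = field(default_factory=dict)
    script_features: dict = field(default_factory=dict)


def search(entries, language_features=None, script_features=None):
    """Return the entries that match every requested language and script feature."""
    wanted_lang = language_features or {}
    wanted_script = script_features or {}
    return [
        e for e in entries
        if all(e.language_features.get(k) == v for k, v in wanted_lang.items())
        and all(e.script_features.get(k) == v for k, v in wanted_script.items())
    ]


# Script names and the right_to_left flag are illustrative assumptions.
# Language features are left empty here, since the example does not invent linguistic data.
entries = [
    Entry("Malay", "Latin", script_features={"right_to_left": False}),
    Entry("Malay", "Arabic", script_features={"right_to_left": True}),
    Entry("Panjabi", "Gurmukhi", script_features={"right_to_left": False}),
    Entry("Panjabi", "Arabic", script_features={"right_to_left": True}),
    Entry("Sindhi", "Arabic", script_features={"right_to_left": True}),
    Entry("Sindhi", "Devanagari", script_features={"right_to_left": False}),
]

# Search by a script feature only; language features could be combined the same way.
rtl_entries = search(entries, script_features={"right_to_left": True})
print([(e.language, e.script) for e in rtl_entries])
```

Keeping one entry per language-script pair means that a script-level search can return one written form of a language without the other, which matches the idea that the two forms are separate TTS development targets.
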
Your feedback on the data is very welcome. Please send your comments, suggestions or corrections to
Ksenia Shalonova or Roger Tucker.

References
1. A great deal of information about language characteristics at the main levels (phonology, morphology and syntax) was taken from The Rosetta Project.
2. Some information about the place of lexical stress was taken from the StressTyp Database.