With the development of information technology and artificial intelligence, speech synthesis plays a significant role in the field of human-computer interaction. However, the main problem with current speech synthesis techniques is that the synthesized speech lacks naturalness and is too monotonous, and that there is no mechanism by which the user's subjective intent can drive the synthesis.
     This paper summarizes the general process of speech synthesis and points out that the prosody generation module is an important part of that process. The duration model and the stress model are the two key issues in prosody generation. For the duration model, a method is presented for synchronizing pronunciation duration in speech synthesis with the gaze duration of the eyes during reading. For the stress model, the Extreme Learning Machine (ELM) and the Semi-supervised Extreme Learning Machine (SELM) are used to predict stress, and the two are compared experimentally. Semantic stress estimation is also investigated. Because semantic stress depends on the expression of subjective awareness, the relationship between eye movement and stress is calculated and analyzed.
     Building on the aspects outlined above, the main research work and innovations are as follows:
     A method of using eye movement signals to control speech synthesis is proposed. Introducing eye movement characteristics into speech synthesis enriches the forms of human-computer interaction and has practical significance and application prospects for assistive speech interaction for people with disabilities. Based on the characteristics of implicit rhythm in reading, the relative independence of the text speech-processing system and the eye movement control system is discussed. It is shown that, for the same level of text familiarity, the gaze duration of the eyes during reading and the pronunciation duration of the inner voice are synchronous.
     The ELM, a single-hidden-layer feedforward neural network, is proposed for Chinese stress prediction. In the experiments, an ELM and an SVM with an RBF kernel function were each used for Chinese stress prediction, and the results showed that the ELM achieves high accuracy while greatly improving the speed of classification learning and prediction.
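     As an illustration of why ELM training is fast, the following is a minimal sketch of a generic ELM classifier, not the implementation used in this work. The hidden-layer weights are drawn at random and left fixed, and only the output weights are solved for in closed form with a pseudo-inverse. The class name `ELMClassifier`, the tanh activation, the hidden-layer size, and the two-cluster toy data are all assumptions made for the example.

```python
import numpy as np


class ELMClassifier:
    """Generic single-hidden-layer ELM (illustrative sketch only)."""

    def __init__(self, n_hidden=50, seed=0):
        self.n_hidden = n_hidden
        self.rng = np.random.default_rng(seed)

    def _hidden(self, X):
        # Random, fixed hidden layer with a tanh activation (assumed choice).
        return np.tanh(X @ self.W + self.b)

    def fit(self, X, y):
        X = np.asarray(X, dtype=float)
        y = np.asarray(y)
        self.classes_ = np.unique(y)
        # One-hot target matrix, one column per class.
        T = (y[:, None] == self.classes_[None, :]).astype(float)
        # Input weights and biases are sampled once and never trained.
        self.W = self.rng.normal(size=(X.shape[1], self.n_hidden))
        self.b = self.rng.normal(size=self.n_hidden)
        # Output weights: least-squares solution via the pseudo-inverse.
        self.beta = np.linalg.pinv(self._hidden(X)) @ T
        return self

    def predict(self, X):
        H = self._hidden(np.asarray(X, dtype=float))
        return self.classes_[np.argmax(H @ self.beta, axis=1)]


# Toy data standing in for stress / non-stress feature vectors (hypothetical).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2.0, 0.5, (100, 2)), rng.normal(2.0, 0.5, (100, 2))])
y = np.array([0] * 100 + [1] * 100)

model = ELMClassifier(n_hidden=50).fit(X, y)
acc = float(np.mean(model.predict(X) == y))
print(f"training accuracy: {acc:.3f}")
```

     Because no iterative weight updates are needed, training reduces to one matrix pseudo-inverse, which is the usual explanation for the speed advantage of ELM over iteratively trained classifiers.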
     A modified semi-supervised model, SELM, is proposed for stress prediction. SELM is intended for training sets that contain only a small number of labeled samples. After learning from the labeled samples, the algorithm tests the unlabeled samples against a confidence threshold. The test exchanges the training set and the prediction set to identify high-confidence samples with which to expand the training set. The experiments showed that the SELM algorithm is more efficient at classifying unlabeled samples. This semi-supervised strategy provides an effective way to reduce the sample labeling workload.
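     The general idea of expanding a small labeled set with high-confidence unlabeled samples can be sketched as a plain self-training loop. This is a generic pseudo-labeling sketch under stated assumptions, not the SELM algorithm of this work: it omits the exchange of training and prediction sets, and the helper names (`elm_fit`, `elm_scores`, `self_train`), the ridge term `lam`, the margin-based confidence measure, and the threshold value are all hypothetical.

```python
import numpy as np


def elm_fit(X, y, n_classes, n_hidden=20, lam=0.1, seed=0):
    """Ridge-regularized ELM: random hidden layer, closed-form output weights."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(X.shape[1], n_hidden))
    b = rng.normal(size=n_hidden)
    H = np.tanh(X @ W + b)
    T = np.eye(n_classes)[y]  # one-hot targets
    beta = np.linalg.solve(H.T @ H + lam * np.eye(n_hidden), H.T @ T)
    return W, b, beta


def elm_scores(model, X):
    W, b, beta = model
    return np.tanh(X @ W + b) @ beta


def self_train(X_lab, y_lab, X_unl, n_classes, threshold=0.5, max_rounds=5):
    """Add unlabeled samples whose score margin reaches the threshold."""
    X_lab, y_lab, X_unl = X_lab.copy(), y_lab.copy(), X_unl.copy()
    for _ in range(max_rounds):
        if len(X_unl) == 0:
            break
        scores = elm_scores(elm_fit(X_lab, y_lab, n_classes), X_unl)
        # Confidence = gap between the two highest class scores (assumed measure).
        top2 = np.sort(scores, axis=1)[:, -2:]
        confident = (top2[:, 1] - top2[:, 0]) >= threshold
        if not confident.any():
            break
        X_lab = np.vstack([X_lab, X_unl[confident]])
        y_lab = np.concatenate([y_lab, scores[confident].argmax(axis=1)])
        X_unl = X_unl[~confident]
    # Refit on the expanded set so the returned model reflects all additions.
    return elm_fit(X_lab, y_lab, n_classes), X_lab, y_lab


# Toy data: 10 labeled and 90 unlabeled samples per class (hypothetical).
rng = np.random.default_rng(1)
X0 = rng.normal(-3.0, 0.5, (100, 2))
X1 = rng.normal(3.0, 0.5, (100, 2))
X_lab = np.vstack([X0[:10], X1[:10]])
y_lab = np.array([0] * 10 + [1] * 10)
X_unl = np.vstack([X0[10:], X1[10:]])
y_unl_true = np.array([0] * 90 + [1] * 90)  # used for evaluation only

model, X_aug, y_aug = self_train(X_lab, y_lab, X_unl, n_classes=2)
acc = float(np.mean(elm_scores(model, X_unl).argmax(axis=1) == y_unl_true))
print(f"labeled set grew from {len(y_lab)} to {len(y_aug)}; accuracy {acc:.3f}")
```

     The threshold trades labeling effort against the risk of propagating wrong pseudo-labels: a higher value admits fewer samples per round but keeps the expanded set cleaner.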
     An exploratory study of semantic stress prediction based on the gaze characteristics of the human eye is presented. A group of eye-movement stress prediction experiments was carried out to examine how eye movement data can be used to predict semantic stress in a specific context. Three kinds of neural network models were also used to classify the experimental eye movement samples. The results showed that eye characteristics such as gaze duration and fixation count are related to the semantic stress level.
     The Fujisaki modeling method, which is based on tone superposition, is introduced to discuss fundamental frequency curve generation and rhythm modification. A modified speech synthesis model, the ED_Fujisaki model, is presented. This model can synthesize a personalized rhythm that reflects the reader's subjective expression.
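     For orientation, the standard Fujisaki superposition can be sketched as follows: the logarithm of the fundamental frequency is a baseline plus phrase components (impulse responses) plus accent components (responses to rectangular commands). This is the textbook formulation only and does not include the ED_Fujisaki modifications. The function names, the command timings and amplitudes, and the constants `alpha`, `beta`, and `gamma` are assumptions chosen for illustration.

```python
import numpy as np


def phrase_response(t, alpha=3.0):
    """Gp(t) = alpha^2 * t * exp(-alpha * t) for t >= 0, and 0 otherwise."""
    tp = np.maximum(t, 0.0)
    return alpha**2 * tp * np.exp(-alpha * tp)


def accent_response(t, beta=20.0, gamma=0.9):
    """Ga(t) = min(1 - (1 + beta*t) * exp(-beta*t), gamma) for t >= 0, else 0."""
    tp = np.maximum(t, 0.0)
    return np.minimum(1.0 - (1.0 + beta * tp) * np.exp(-beta * tp), gamma)


def fujisaki_f0(t, fb, phrases, accents, alpha=3.0, beta=20.0, gamma=0.9):
    """Superpose phrase and accent components on the baseline in the log-F0 domain.

    phrases: list of (onset_time, amplitude)
    accents: list of (onset_time, offset_time, amplitude)
    """
    log_f0 = np.full_like(t, np.log(fb), dtype=float)
    for t0, ap in phrases:
        log_f0 += ap * phrase_response(t - t0, alpha)
    for t1, t2, aa in accents:
        log_f0 += aa * (accent_response(t - t1, beta, gamma)
                        - accent_response(t - t2, beta, gamma))
    return np.exp(log_f0)


# One phrase command and one accent command over two seconds (hypothetical values).
t = np.linspace(0.0, 2.0, 201)
f0 = fujisaki_f0(t, fb=120.0, phrases=[(0.0, 0.5)], accents=[(0.3, 0.6, 0.4)])
print(f"F0 range: {f0.min():.1f} to {f0.max():.1f} Hz")
```

     In a formulation of this kind, rhythm or stress modifications would enter by changing the timing and amplitude of the commands rather than by editing the F0 curve directly.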
