The level of informatization in Chinese newspaper publishing has risen sharply since the introduction of laser phototypesetting systems for Chinese characters. In recent years, Chinese newspaper publishing has grown steadily in scale, and production processes such as news writing, typesetting, printing, and financial and circulation management have been digitized. However, the quality control process, which checks news texts and newspaper pages for errors and repeated content, is still entirely manual. Because of its low efficiency and high cost, manual quality control has become the bottleneck of newspaper publishing.
     In this thesis, the problems of the current newspaper publishing process were analyzed, the process was adapted, and several algorithms for automatic error checking and repetition detection were proposed, with the aim of achieving intelligent aided quality control for newspaper publishing. The primary contributions are as follows:
     1. The current production process and related software were integrated and optimized, and the concept and technical framework of intelligent aided quality control for Chinese newspaper publishing were presented. The adapted and optimized production process provides both a digital, coordinated quality control platform shared by users and computers and a closed-loop learning environment for computers. In this environment, computers learn new words and linguistic knowledge, which are then applied in the lexical semantic class based error checking and repetition detection algorithms, so that computers can assist quality control with a high degree of intelligence.
     2. To detect semantic errors in text using lexical semantics, a semantic classification taxonomy for substantive words was proposed, together with a seed word based algorithm that automatically acquires semantic classes for the Chinese substantive lexicon. The algorithm learns the semantic classes of substantive words from an unsegmented Chinese corpus, acquires multiple semantic classes for words with multiple senses, and can also acquire subjective words. A semantic class based Chinese lexical analysis process was also presented, in which a conditional random fields model labels the semantic class of each segmented Chinese word and identifies noun phrase boundaries.
     3. Four algorithms, each targeting particular error types and causes, were proposed to detect syntactic, semantic, and inconsistency errors, which traditional Chinese automatic proofreading has not solved. The semantic class based trigram error checking algorithm was used to detect word substitution errors and some syntactic and semantic errors. The selectional preference based error checking algorithm was used to detect subject-predicate and verb-object collocation errors. The pointwise mutual information based error checking algorithm was used to detect syntactic and punctuation errors from the pointwise mutual information between syntactic conjunctions and punctuation marks. The inconsistency error checking algorithm was used to detect inconsistencies in person names and titles within a text.
     4. A repetition detection algorithm was proposed that also organizes historical news texts automatically. The historical news texts were first classified by general topic and then clustered by event. In online repetition detection, the first paragraph of the input text was used to classify the text into a general topic and assign it to an event, and the whole text was then used to decide whether the input was a repetition. The algorithm thus both organizes historical texts automatically and detects repetitions, and computing the similarity between paragraphs of different texts improved the precision of repetition detection.
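The online stage described in contribution 4 can be illustrated with a minimal sketch. It is not the thesis implementation: the character-bigram features, the cosine measure, the 0.8 threshold, and the function names `assign_event`, `paragraph_similarity`, and `is_repetition` are all assumptions made for illustration, and the general topic classification step is omitted for brevity. A document is assumed to be a list of paragraph strings.

```python
import math
from collections import Counter


def bigrams(text):
    """Character-bigram counts; usable on unsegmented Chinese text."""
    text = "".join(text.split())
    return Counter(text[i:i + 2] for i in range(len(text) - 1))


def cosine(a, b):
    """Cosine similarity between two bigram count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def assign_event(doc, events):
    """Pick the event whose stored texts best match the first paragraph.

    events maps a (hypothetical) event id to a list of historical documents.
    """
    lead = bigrams(doc[0])
    best_id, best_score = None, 0.0
    for event_id, docs in events.items():
        score = max((cosine(lead, bigrams(d[0])) for d in docs), default=0.0)
        if score > best_score:
            best_id, best_score = event_id, score
    return best_id


def paragraph_similarity(doc_a, doc_b):
    """Average, over paragraphs of doc_a, of the best match in doc_b."""
    if not doc_a:
        return 0.0
    vectors_b = [bigrams(p) for p in doc_b]
    scores = [max((cosine(bigrams(p), v) for v in vectors_b), default=0.0)
              for p in doc_a]
    return sum(scores) / len(scores)


def is_repetition(doc, events, threshold=0.8):
    """Use the whole text, but only against texts of the assigned event."""
    event_id = assign_event(doc, events)
    if event_id is None:
        return False
    return any(paragraph_similarity(doc, old) >= threshold
               for old in events[event_id])
```

Assigning the event from the first paragraph keeps the expensive whole-text comparison restricted to one event cluster, which is the design the abstract describes; the paragraph-level score is one plausible way to realize the similarity computation between paragraphs of different texts.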
     The application system built on the adapted and optimized production process has been in use at Changjing Newspaper for more than two years, and its advantages in efficiency and cost have been demonstrated. Most of the error checking and repetition detection algorithms have been deployed in the system.
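As an illustration of the pointwise mutual information check in contribution 3, the following sketch scores (conjunction, punctuation) pairs with PMI(c, p) = log2(P(c, p) / (P(c) P(p))) and flags pairs that score below a threshold. The input format, the threshold of 0, the treatment of unseen pairs, and the function names `pmi_table` and `is_suspicious` are assumptions for illustration, not details taken from the thesis.

```python
import math
from collections import Counter


def pmi_table(pairs):
    """Compute PMI for each observed (conjunction, punctuation) pair.

    pairs is a list of co-occurrence observations taken from a
    trusted corpus (hypothetical input format).
    """
    pair_counts = Counter(pairs)
    conj_counts = Counter(c for c, _ in pairs)
    punct_counts = Counter(p for _, p in pairs)
    total = len(pairs)
    return {
        (c, p): math.log2(n * total / (conj_counts[c] * punct_counts[p]))
        for (c, p), n in pair_counts.items()
    }


def is_suspicious(table, conj, punct, threshold=0.0):
    """Flag a pair whose PMI is below the threshold.

    Pairs never seen in the corpus are treated as suspicious
    (an assumption; a real system might back off or smooth instead).
    """
    return table.get((conj, punct), float("-inf")) < threshold


# Toy counts, invented purely for the example.
observations = (
    [("因为", "，")] * 8 + [("因为", "。")] * 1
    + [("所以", "，")] * 3 + [("所以", "。")] * 8
)
table = pmi_table(observations)
```

With these toy counts the pair ("因为", "，") receives a positive score and ("因为", "。") a negative one, so only the latter would be flagged for a proofreader to review.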
