A linguistic knowledge base is a fundamental resource for natural language processing. The completeness, representation, and organization of its knowledge directly affect the performance of knowledge-based natural language processing applications.
     Most taxonomy-based knowledge bases were built from human-oriented dictionaries, so they have low coverage and a long update cycle. Storing domain knowledge bases in isolation also makes it hard to share knowledge and reduce redundancy. In addition, many existing natural language processing applications use only word-level knowledge; few of them use semantic knowledge about concepts and the relationships among concepts, which limits their performance.
     To solve these problems, this dissertation proposes a domain label assignment method based on manually constructed machine-readable dictionaries, which can be used to build domain dictionaries automatically. By using the well-defined taxonomy and the formal description of concepts in an ontology, we can improve knowledge storage, representation, and sharing for existing knowledge bases. The main contributions of this dissertation are summarized as follows:
     1. A word domain assignment method based on word glosses is proposed. Domain-specialized dictionaries and a general dictionary are used to train a labeling model, which is then used to automatically add domain labels to new words in the general dictionary. This method effectively reduces labor cost while improving the coverage of the knowledge base.
     2. An adaptive method for generating a hierarchical classification system is proposed in Chapter 3, together with a hierarchical domain assignment method based on the automatically generated classification system. The method uses vocabulary information to analyze the relevance between domains; on this basis, a hierarchical classification tree is generated automatically and then used in the top-down hierarchical domain labeling step.
     3. A new ontology-based conceptualized feature description model, C-VSM, is proposed to resolve the polysemy and synonymy problems in domain terminology. We disambiguate polysemous words and merge synonyms by mapping the words in a text to concept nodes in the ontology. This reduces the number of features and increases the weight of the main features, which improves the efficiency of text representation. The training documents and the new documents are represented in C-VSM and then passed to a traditional classifier.
     4. We introduce the C-VSM model into text classification and discuss the related techniques, including feature selection, feature weighting, and text similarity calculation. A new balanced feature selection method that combines information gain and document frequency is presented to improve classification performance. The feature weights are also adjusted by analyzing the semantic relations between concepts, which improves the text similarity calculation.
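
     As an illustration of contributions 3 and 4, the following minimal sketch maps words to concept identifiers so that synonyms collapse into one feature, then scores each feature by combining information gain with document frequency. The toy ontology WORD_TO_CONCEPT, the function names, and the linear combination controlled by alpha are all assumptions made for this example; the abstract does not specify how the two criteria are combined, and word sense disambiguation is omitted here.

```python
import math
from collections import Counter

# Hypothetical toy ontology: surface word -> concept ID (synonyms share one concept).
WORD_TO_CONCEPT = {
    "car": "C_VEHICLE",
    "automobile": "C_VEHICLE",
    "auto": "C_VEHICLE",
    "doctor": "C_DOCTOR",
    "physician": "C_DOCTOR",
}


def to_concept_vector(tokens):
    """Count concept IDs instead of words; unmapped words are kept as they are."""
    return Counter(WORD_TO_CONCEPT.get(t, t) for t in tokens)


def entropy(counts):
    """Shannon entropy (base 2) of a list of class counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)


def information_gain(feature, docs, labels):
    """Information gain of the presence or absence of `feature` for the labels."""
    classes = sorted(set(labels))

    def class_counts(subset):
        return [sum(1 for lab in subset if lab == c) for c in classes]

    with_f = [lab for d, lab in zip(docs, labels) if feature in d]
    without_f = [lab for d, lab in zip(docs, labels) if feature not in d]
    n = len(labels)
    conditional = 0.0
    for subset in (with_f, without_f):
        if subset:
            conditional += len(subset) / n * entropy(class_counts(subset))
    return entropy(class_counts(labels)) - conditional


def balanced_scores(docs, labels, alpha=0.5):
    """Assumed combination: alpha * IG + (1 - alpha) * relative document frequency."""
    features = set().union(*docs)
    n = len(docs)
    scores = {}
    for f in features:
        df = sum(1 for d in docs if f in d) / n
        scores[f] = alpha * information_gain(f, docs, labels) + (1 - alpha) * df
    return scores


# Usage: four tiny documents, two classes.
raw_docs = [
    ["car", "engine"],
    ["automobile", "engine"],
    ["doctor", "hospital"],
    ["physician", "hospital"],
]
labels = ["auto", "auto", "med", "med"]
docs = [to_concept_vector(tokens) for tokens in raw_docs]
scores = balanced_scores(docs, labels)
```

     In this toy data, "car" and "automobile" each occur in only one document as words, but after mapping they become a single concept feature that occurs in both documents of its class, which is the effect the C-VSM representation aims for.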
