A knowledge base is the "brain" of a natural language processing system: it is what enables the system to "understand" and process natural language. This dissertation explores new techniques for domain knowledge acquisition. The main contributions are as follows:
     1. To deal with redundant web information when acquiring domain knowledge sources, a web document duplicate removal algorithm based on keyword sequences (KSM) is presented. Drawing on comprehensive information theory, KSM uses the keyword sequences of a web document to represent its structural features and intension (content) features, then judges information redundancy by comparing the keyword sequences of similar documents. In experiments on detecting various kinds of obscured duplicates, the overall precision and recall of KSM are 99.2% and 97.7% respectively.
     2. To improve the recall of low-frequency terms, an automatic Chinese term extraction algorithm based on language cognition theory is presented. Making use of discourse markers in research papers, the algorithm introduces a "weighted frequency" factor into the C-Value and SCP_f measures, then proposes the MC-SCP measure to evaluate both the "unithood" and the "termhood" of candidate terms. In term extraction for the "License Plate Recognition" domain, the overall recall and precision are 96.5% and 77.8% respectively, while the recall and precision for low-frequency terms are 96.2% and 79.3% respectively.
     3. To acquire the various relations between terms, a multi-strategy relation acquisition model is designed, comprising a) rule-based synonym relation acquisition, b) hierarchical relation acquisition based on the morphological similarity of terms, c) non-hierarchical relation acquisition based on all weighted association rules, and d) term clustering based on particle swarm optimization (PSO).
     4. To ease the conflict between the flood of multi-domain research paper submissions and the limits of editors' knowledge, a domain-knowledge-guided assistant system for the first review is presented. Based on editors' experience, the first review is broken down into four judgments. In an experiment on 2365 research papers, the system can help editors filter out 35% of unqualified manuscripts.
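
The idea behind contribution 1, comparing the keyword sequences of two documents to judge redundancy, can be illustrated with a minimal sketch. The abstract does not specify how KSM selects keywords or scores sequences, so everything here is an assumption made for illustration: the function names, the whitespace tokenizer, the use of Python's `difflib.SequenceMatcher` as the order-sensitive comparison, and the 0.8 threshold are all hypothetical and are not the dissertation's algorithm.

```python
from difflib import SequenceMatcher


def keyword_sequence(text, keywords):
    """Return the keywords that occur in `text`, in order of appearance.

    Hypothetical helper: a whitespace tokenizer stands in for whatever
    keyword extraction the real KSM algorithm uses.
    """
    return [tok for tok in text.lower().split() if tok in keywords]


def sequence_similarity(seq_a, seq_b):
    """Order-sensitive similarity in [0, 1] between two keyword sequences."""
    if not seq_a or not seq_b:
        return 0.0  # no keywords on one side: no evidence of duplication
    return SequenceMatcher(None, seq_a, seq_b, autojunk=False).ratio()


def is_duplicate(text_a, text_b, keywords, threshold=0.8):
    """Flag two documents as redundant when their keyword sequences are close.

    The threshold is an illustrative value, not one taken from the source.
    """
    seq_a = keyword_sequence(text_a, keywords)
    seq_b = keyword_sequence(text_b, keywords)
    return sequence_similarity(seq_a, seq_b) >= threshold


if __name__ == "__main__":
    vocab = {"license", "plate", "recognition", "camera"}
    doc_a = "license plate recognition uses a camera"
    doc_b = "our license plate recognition system uses one camera"
    doc_c = "camera recognition of a plate with license"
    print(is_duplicate(doc_a, doc_b, vocab))  # same keyword order
    print(is_duplicate(doc_a, doc_c, vocab))  # same keywords, reversed order
```

Because the comparison works on the ordered sequence rather than on a bag of words, two documents that share the same keywords in a different order are not treated as duplicates, which is one plausible reading of how a keyword sequence can capture structure as well as content.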
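
Contribution 2 builds on the C-Value measure. As background, the sketch below computes the standard C-Value of Frantzi and Ananiadou, in which the frequency of a candidate term is discounted by the average frequency of the longer candidates that contain it. It does not reproduce the dissertation's "weighted frequency" factor or the MC-SCP measure, whose definitions are not given in the abstract; the function names and the sample frequencies are hypothetical.

```python
import math


def contains(longer, shorter):
    """True if `shorter` is a contiguous sub-sequence of the strictly longer term."""
    n, m = len(longer), len(shorter)
    return m < n and any(longer[i:i + m] == shorter for i in range(n - m + 1))


def c_value(freq):
    """Standard C-Value for each candidate term.

    `freq` maps a candidate term (a tuple of words) to its corpus frequency.
    For a term a nested in longer candidates T_a:
        C-Value(a) = log2(|a|) * (f(a) - mean(f(b) for b in T_a))
    and for a non-nested term:
        C-Value(a) = log2(|a|) * f(a)
    """
    scores = {}
    for term, f in freq.items():
        nested_in = [f_b for other, f_b in freq.items() if contains(other, term)]
        if nested_in:
            adjusted = f - sum(nested_in) / len(nested_in)
        else:
            adjusted = f
        scores[term] = math.log2(len(term)) * adjusted
    return scores


if __name__ == "__main__":
    # Made-up frequencies, for illustration only.
    candidates = {
        ("license", "plate"): 10,
        ("plate", "recognition"): 5,
        ("license", "plate", "recognition"): 4,
    }
    for term, score in c_value(candidates).items():
        print(" ".join(term), round(score, 3))
```

Note that the standard measure scores single-word candidates as zero, since log2(1) = 0, and that it depends on raw frequency, which is why low-frequency terms are hard to recall with it and why a frequency re-weighting such as the one the dissertation proposes is a natural extension.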
