Research on Key Technologies for Mining Bilingual Corpora from the Web
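Abstract
With the rapid development of statistical methods, large-scale bilingual corpora have become an indispensable basic resource for cross-language information processing. Bilingual corpora are widely used to mine finer-grained translation equivalents, such as bilingual terms, named entities, and bilingual dictionaries, in support of fields such as statistical machine translation and cross-language information retrieval. However, existing bilingual corpus resources are scarce, and those for low-density languages are scarcer still. In recent years, raw bilingual resources on the web have grown rapidly and offer fresh content and broad sources, so methods for mining bilingual corpora from the web have become a focus of attention.
     This thesis takes web bilingual corpus mining as its subject. It designs systems for mining parallel corpora and comparable corpora, and studies four key technologies: parallel webpage identification, webpage content extraction, keyword extraction, and cross-language text similarity computation. These are described below:
     1) Parallel webpage identification based on new feature information. To address the asymmetric page structure encountered when mining parallel corpora from the web, this thesis proposes using, as features, the similarity of the pages' HTML tag sequences computed with an improved edit distance and the similarity of their digit sequences computed by maximum matching, among others, and identifies parallel webpages with a support vector machine. The method reduces the dependence on page structure information and adapts better to the existing web resources of low-density languages.
     2) Webpage content extraction based on a text density model. To address the misjudged boundaries that arise when extracting the main content from pages with widely varying layouts, this thesis proposes a content extraction method for news webpages based on a text density model. A statistical model that fuses page structure and language features converts the page document, text line by text line, into a sequence of positive and negative text densities; Gaussian smoothing then corrects the sequence according to the continuity of content in neighboring lines; finally, an improved maximum subsequence method segments the density sequence to extract the main content. The method preserves the completeness of the main content while excluding noise, and requires no manual intervention or repeated training.
     3) Document keyword extraction based on the LDA model. To address the failure of existing keyword extraction methods to reflect jointly the salience, readability, and coverage of a text's topics, this thesis proposes TFITF, a new keyword extraction algorithm based on the latent topics of documents. It uses a large-scale corpus to build a latent topic model and compute the TFITF weight of each word with respect to each topic, derives from these the weight of each word with respect to the document, merges adjacent words using co-occurrence information to form candidate keyphrases, and finally uses similarity to exclude redundant phrases whose latent topics are close. The method effectively improves the precision and recall of document keyword extraction.
     4) Cross-language document similarity based on the Bi-LDA model. To address the inability to measure the topical relatedness of a document pair when cross-language documents are matched on features such as mutually translated words, this thesis uses the Bi-LDA model to build a cross-language topic model over documents in different languages, and gives three methods for computing the similarity of documents in different languages: the KL divergence between document-topic distributions, the cosine of topic frequency-inverse document frequency values, and the conditional probability of documents. These provide a basis for selecting similar text pairs and automatically building comparable corpora. The method strengthens the use of document semantics, overcomes the superficiality of matching documents by mutually translated words, and effectively matches documents in different languages that share a topic.
     In parallel corpus mining, the main techniques used are parallel webpage identification and webpage content extraction; in comparable corpus mining, they are webpage content extraction, keyword extraction, and cross-language text similarity. Experiments show that the proposed methods improve the utilization of web resources and the quality of bilingual corpora mined from the web.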
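As an illustration of the two page-pair features named for parallel webpage identification, the following is a minimal Python sketch. It is not the thesis's method: the abstract does not specify the "improved" edit distance or the exact digit-sequence matching, so plain Levenshtein distance and a longest common subsequence are used as stand-ins, and all function names are hypothetical. In the described pipeline, scores like these would be fed to a support vector machine as features.

```python
import re

# Matches the name of an opening or closing HTML tag.
TAG_RE = re.compile(r"<\s*(/?)\s*([a-zA-Z][a-zA-Z0-9]*)")


def tag_sequence(html):
    """Return the page's tag names in order; closing tags keep a '/' prefix."""
    return [slash + name.lower() for slash, name in TAG_RE.findall(html)]


def edit_distance(a, b):
    """Plain Levenshtein distance (assumption: stands in for the improved variant)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def tag_similarity(html_a, html_b):
    """1.0 for identical tag sequences, falling toward 0.0 as they diverge."""
    a, b = tag_sequence(html_a), tag_sequence(html_b)
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def number_similarity(text_a, text_b):
    """Share of digit runs that the two texts have in common, in the same order."""
    a, b = re.findall(r"\d+", text_a), re.findall(r"\d+", text_b)
    if not a and not b:
        return 0.0  # no numbers on either side: treated here as no evidence
    return 2.0 * lcs_length(a, b) / (len(a) + len(b))
```

Numbers are a useful language-independent signal because dates, prices, and counts usually survive translation unchanged, which is why they can help when the tag structures of the two pages differ.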
With the development of statistical techniques, large-scale bilingual corpora have become indispensable resources for cross-language processing research. Bilingual corpora are used to mine fine-grained translation equivalents, such as bilingual terminology, named entities and bilingual lexicons, in support of statistical machine translation and cross-language information retrieval. However, existing bilingual corpora are scarce in practice, especially for low-density languages. In recent years, raw bilingual resources on the web have grown rapidly, with the advantages of fresh content and a wide range of sources. Mining bilingual corpora from the web has therefore become a focus of attention.
     To study the mining of bilingual corpora, this thesis designs two systems, one for mining parallel corpora and one for mining comparable corpora, together with four key technologies: parallel webpage identification, content extraction, keyphrase extraction and cross-language document similarity. The main work is as follows:
     1) Parallel Webpage Identification Based on New Heuristic Information. To address the heterogeneous page structure encountered when mining parallel corpora from the web, this thesis introduces two new heuristics: tag structure alignment, computed with an improved edit distance, and the similarity of co-occurring number sequences, computed from maximal common subsequences. A support vector machine then combines these heuristics to classify page pairs as parallel or not. This approach reduces the dependence on page structure information and so adapts better to low-density languages.
     2) Web Content Extraction Based on a Text Density Model. To avoid misjudged boundaries and obtain the useful content from webpages with different layouts, this thesis proposes a content extraction approach based on a text density model, which integrates page structure features with language features to convert the text lines of a page into a sequence of positive and negative densities. Gaussian smoothing is then applied to revise the density sequence, taking the content continuity of adjacent lines into account. Finally, an improved maximum subsequence segmentation splits the sequence and extracts the content. Without human intervention or repeated training, this approach preserves the integrity of the content and eliminates noise.
     3) Keyphrase Extraction Based on the LDA Model. Existing methods lack a combined treatment of the significance, readability and coverage of document topics. To address this, the thesis presents TFITF, a new keyphrase extraction algorithm based on a latent topic model. The algorithm uses a large-scale corpus to build a latent topic model, computes the TFITF weight of each word with respect to each topic, and from these derives the weight of each word with respect to the document. Adjacent words are then merged into candidate keyphrases based on co-occurrence information. Lastly, redundant phrases are excluded according to the similarity of their topics. The method effectively improves the precision and recall of keyphrase extraction.
     4) Cross-language Document Similarity Based on the Bi-LDA Model. Existing methods that rely on mutually translated words and related features cannot evaluate the topical relation between cross-language document pairs. To address this, the thesis uses the Bi-LDA model to analyze document topic structure and measures the similarity of cross-language documents in three ways: the KL divergence between document-topic distributions, the cosine similarity between Topic Frequency-Inverse Document Frequency values, and the conditional probability between documents. These measures are used to construct comparable corpora. The method strengthens the use of document semantic information, overcomes the superficial matching of vocabulary and retrieves similar documents with consistent topics.
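To make the topic-based comparison concrete, here is a minimal Python sketch of scoring two documents in different languages by their topic distributions. It assumes that a bilingual topic model has already assigned each document a probability vector over one shared set of topics; the inference step itself is not shown. The symmetrized form of the KL divergence, the plain cosine over raw topic vectors (rather than Topic Frequency-Inverse Document Frequency values), and all names are assumptions made for illustration, not details taken from the thesis.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length.

    eps guards against log(0) when q assigns zero mass to a topic.
    """
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def symmetric_kl(p, q):
    """Average of KL(p || q) and KL(q || p); smaller means more similar."""
    return 0.5 * (kl_divergence(p, q) + kl_divergence(q, p))


def cosine(u, v):
    """Cosine similarity of two vectors; larger means more similar."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Made-up topic distributions over three shared topics.
zh_doc = [0.7, 0.2, 0.1]
en_same_topic = [0.6, 0.3, 0.1]
en_other_topic = [0.1, 0.1, 0.8]
```

With these made-up vectors, `symmetric_kl(zh_doc, en_same_topic)` is smaller than `symmetric_kl(zh_doc, en_other_topic)`, so the first pair would be the better candidate for a comparable corpus even if the two documents share few dictionary translations.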
     The system for mining parallel corpora mainly uses parallel webpage identification and content extraction. The system for mining comparable corpora mainly uses content extraction, keyphrase extraction and cross-language document similarity. The experimental results show that the methods of this thesis effectively improve the utilization of web resources and the quality of the mined bilingual corpora.
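Content extraction is shared by both systems, and its three stages (per-line density, Gaussian smoothing, maximum subsequence) can be sketched in Python as follows. This is a simplified stand-in, not the thesis's model: the line score used here (text length minus a fixed penalty per tag), the parameter values `tag_penalty`, `sigma` and `radius`, and the use of the standard maximum-sum contiguous run in place of the "improved" segmentation are all assumptions.

```python
import math
import re

TAG = re.compile(r"<[^>]+>")


def line_scores(lines, tag_penalty=10.0):
    """Score each line: positive when text dominates, negative when tags do."""
    scores = []
    for line in lines:
        text = TAG.sub("", line).strip()
        n_tags = len(TAG.findall(line))
        scores.append(len(text) - tag_penalty * n_tags)
    return scores


def gaussian_smooth(values, sigma=1.0, radius=2):
    """Replace each value by a Gaussian-weighted average of its neighbors."""
    offsets = range(-radius, radius + 1)
    kernel = [math.exp(-(k * k) / (2 * sigma * sigma)) for k in offsets]
    smoothed = []
    for i in range(len(values)):
        num = den = 0.0
        for k, w in zip(offsets, kernel):
            j = i + k
            if 0 <= j < len(values):  # renormalize at the sequence edges
                num += w * values[j]
                den += w
        smoothed.append(num / den)
    return smoothed


def max_subsequence(values):
    """Return (start, end), inclusive, of the contiguous run with the largest
    sum, or None if no run has a positive sum."""
    best_sum, best = 0.0, None
    cur_sum, cur_start = 0.0, 0
    for i, v in enumerate(values):
        if cur_sum <= 0:
            cur_sum, cur_start = v, i
        else:
            cur_sum += v
        if cur_sum > best_sum:
            best_sum, best = cur_sum, (cur_start, i)
    return best


def extract_content(html):
    """Return the text of the highest-density block of consecutive lines."""
    lines = html.splitlines()
    span = max_subsequence(gaussian_smooth(line_scores(lines)))
    if span is None:
        return ""
    start, end = span
    texts = (TAG.sub("", line).strip() for line in lines[start:end + 1])
    return "\n".join(t for t in texts if t)
```

The smoothing step is what lets a short line inside the article, which scores negative on its own, be carried by its long neighbors, while link-heavy navigation and footer lines at the edges stay negative and fall outside the selected run.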
