A Method for Screening Scientific Papers Based on Laplacian Spectral Analysis
Abstract
Academic papers, written in natural and formal languages, are the most important tool humanity has for preserving and disseminating knowledge. Yet many low-quality and even fabricated papers now pass for scholarship, consuming publication resources and polluting the body of human knowledge. Whether written by people or generated automatically by algorithms, these papers share one feature: their grammar and formatting are sound, but their semantics are obscure or entirely meaningless. Such papers should differ in some essential way from serious papers of real academic value. Finding that essential difference, and using it for a preliminary screening of academic papers, is the main subject of this paper. The research also gives a deeper understanding of the structural features of human knowledge as it is expressed mainly in natural language. From a practical standpoint, a reasonably reliable first-pass screening of the very large number of submitted manuscripts would keep reviewers from wasting valuable time on inauthentic papers, which would be of real practical value.
Language networks are real-world complex networks, and scholars in China and abroad have demonstrated their small-world and scale-free properties. From these properties one can conjecture that the word co-occurrence networks of inauthentic papers differ markedly from those of genuine papers: the former are more likely to resemble random networks, while the latter tend toward small-world or scale-free networks. In studying the structure of complex networks, some researchers have applied the Laplacian spectrum from spectral graph theory to analyze network topology from a geometric viewpoint, and have found that the Laplacian spectra of random, small-world, and scale-free networks differ significantly.
This paper takes the word co-occurrence networks of scientific papers as its object of study and uses Laplacian spectral analysis to examine their structure. By comparing the Laplacian spectral features of genuine and inauthentic papers (the eigenvalue distribution, the spectral density distribution, and the extreme eigenvalues), it identifies the essential differences between the two classes as characterized by the Laplacian spectrum, and on that basis designs a Laplacian spectrum screening method for automatically distinguishing genuine from inauthentic scientific papers.
The method is applied to four collected samples of 100 papers each: MIS Quarter papers, papers accepted by the International Conference on Management Science and Engineering, papers rejected by the same conference, and pseudo-papers randomly generated by SCI engine. Laplacian spectrum plots were drawn and analyzed in depth for every sample. The spectral distributions of genuine and inauthentic papers differ significantly, which demonstrates that the Laplacian spectral features of a paper's word co-occurrence network can be used to tell genuine papers from inauthentic ones.
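The abstract names the ingredients of the analysis (a word co-occurrence network, its Laplacian eigenvalues, the spectral density, and the extreme eigenvalues) but not the construction details. The following is a minimal sketch of how such features might be computed, not the author's actual procedure. The tokenizer, the co-occurrence window of two tokens, the unweighted edges, the histogram bin count, and the function names `cooccurrence_laplacian` and `laplacian_features` are all assumptions made for illustration.

```python
import re

import numpy as np


def cooccurrence_laplacian(text, window=2):
    """Return (vocabulary, Laplacian matrix) of an unweighted co-occurrence graph.

    Assumption: two distinct words are linked if they occur within `window`
    tokens of each other; sentence boundaries are ignored for simplicity.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    vocab = sorted(set(tokens))
    index = {word: i for i, word in enumerate(vocab)}
    adj = np.zeros((len(vocab), len(vocab)))
    for i, word in enumerate(tokens):
        for other in tokens[i + 1 : i + 1 + window]:
            if other != word:
                adj[index[word], index[other]] = 1.0
                adj[index[other], index[word]] = 1.0
    degree = np.diag(adj.sum(axis=1))
    return vocab, degree - adj  # L = D - A


def laplacian_features(laplacian, bins=10):
    """Eigenvalue distribution, histogram-based spectral density, and extremes."""
    eig = np.linalg.eigvalsh(laplacian)  # real eigenvalues in ascending order
    density, bin_edges = np.histogram(eig, bins=bins, density=True)
    return {
        "eigenvalues": eig,
        "spectral_density": density,
        "bin_edges": bin_edges,
        "lambda_max": float(eig[-1]),
        # Second-smallest eigenvalue (algebraic connectivity).
        "lambda_2": float(eig[1]) if len(eig) > 1 else 0.0,
    }


if __name__ == "__main__":
    sample = "the spectrum of the graph reflects the structure of the network"
    words, lap = cooccurrence_laplacian(sample)
    feats = laplacian_features(lap)
    print(len(words), feats["lambda_2"], feats["lambda_max"])
```

Comparing these feature values across samples of genuine and generated papers would be the next step. Any decision threshold would have to be estimated from labeled data, since the abstract does not report one.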
