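Research on Code Conversion Technology in Program Code Similarity Detection

Original title (Chinese): 程序代码相似度中的代码转换技术的研究
Author: Pei Dongmei (裴冬梅)
Thesis level: Master's
Discipline: Computer Application Technology
Keywords: program code conversion; similarity; word list; string matching algorithm
Keywords (Chinese): 程序代码转换; 相似度; 词表; 字符串匹配算法
Degree year: 2008
Advisor: Liu Dongsheng (刘东升)
Discipline code: 081203
Degree-granting institution: Inner Mongolia Normal University (内蒙古师范大学)
Thesis submission date: 2008-04-10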

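Abstract

Segmenting program code into words and converting it is a key technique for building a program code similarity detection system. A good segmentation and conversion technique improves both the speed and the accuracy of the similarity computation, which matters in practice for the development of such systems.
     Segmentation and conversion is widely used in program code similarity detection systems. A program can be treated as a text string, which grammatical analysis then converts into a string of tokens describing the program's basic information. Comparing two programs for similarity therefore reduces to comparing their token strings, and the step that turns program text into a token string is the segmentation and conversion process.
     This thesis first introduces program code similarity detection: its definition and classification, the state of research in China and abroad, and existing detection systems. It then introduces the algorithms used during segmentation and conversion, including word segmentation algorithms and string matching algorithms.
     The thesis designs an experimental system made up of four parts. The first part preprocesses and segments the program code: preprocessing removes content that is present in the program but has no effect on similarity detection, such as comments, runs of consecutive spaces, and blank lines, and the preprocessed code is then segmented into words. The second part builds the word list needed for code conversion. The third part uses a string matching algorithm to convert the preprocessed and segmented program into string identifiers. The fourth part outputs the converted form of the source program through a user interface.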
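
The first three parts of the system described above can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the word list contents, the single-character codes, the function names, and the C-like input are all assumptions made for illustration, and a plain dictionary lookup stands in for the string matching algorithm that the thesis uses against its word list.

```python
import re

# Hypothetical word list mapping lexemes to one-character codes.
# The thesis's actual word list is not given in the abstract.
WORD_LIST = {
    "int": "T", "float": "T", "char": "T",
    "if": "I", "else": "E", "for": "L", "while": "L", "return": "R",
    "=": "A", "+": "O", "-": "O", "*": "O", "/": "O",
    "<": "C", ">": "C", "==": "C", "!=": "C", "<=": "C", ">=": "C",
    "(": "(", ")": ")", "{": "{", "}": "}", ";": ";", ",": ",",
}

# Identifiers, numbers, two-character operators, then any other single symbol.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+(?:\.\d+)?|==|!=|<=|>=|[^\sA-Za-z_0-9]")


def preprocess(source: str) -> str:
    """Drop C-style comments and collapse whitespace runs and blank lines.

    Simplification: comment markers inside string literals are not handled.
    """
    source = re.sub(r"/\*.*?\*/", " ", source, flags=re.DOTALL)
    source = re.sub(r"//[^\n]*", " ", source)
    return re.sub(r"\s+", " ", source).strip()


def tokenize(source: str) -> list[str]:
    """Split preprocessed source text into lexemes."""
    return TOKEN_RE.findall(source)


def convert(tokens: list[str]) -> str:
    """Map each lexeme to a code: word-list entry, N (number), V (identifier)."""
    codes = []
    for tok in tokens:
        if tok in WORD_LIST:
            codes.append(WORD_LIST[tok])
        elif tok[0].isdigit():
            codes.append("N")
        elif tok[0].isalpha() or tok[0] == "_":
            codes.append("V")
        else:
            codes.append("?")  # symbol missing from the word list
    return "".join(codes)


def convert_source(source: str) -> str:
    """Run preprocessing, segmentation, and conversion in sequence."""
    return convert(tokenize(preprocess(source)))


if __name__ == "__main__":
    original = "int a = 1; /* init */ // note\nif (a > 0) { a = a + 2; }"
    renamed = "int b = 7;\n\nif (b > 3) { b = b + 9; }"
    print(convert_source(original))
    print(convert_source(renamed))
```

Because identifiers and numeric literals collapse to the generic codes `V` and `N`, the two inputs in the usage example produce the same token string even though the variable name, constants, comments, and layout differ. That property is what makes the converted form useful for string-based similarity comparison.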
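     Finally, the experimental system is verified and analyzed through a set of simple experiments. The experimental data come from students' programming assignments. The results indicate that the system supports conversion of multiple programming languages and that the converted output can be used by similarity detection algorithms based on string comparison. This provides reliable test data for follow-up research, namely computing similarity over the converted token strings to obtain a measure of how similar two programs are.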
English abstract

Segmenting program code is an important technique for implementing a system that detects program code similarity. A good word segmentation technique gives such a system a faster and more accurate method, which has a significant effect on similarity detection.
     Word segmentation is widely used in program code similarity detection systems. We first treat the program code as a text string, then use grammatical analysis to convert that string into tokens that describe the basic information and properties of the code. This process is word segmentation and conversion.
     This thesis introduces program code similarity detection, including its definition and the categories of detection techniques. It then reviews the development of these techniques abroad and in China. Finally, it introduces the algorithms used for code segmentation and conversion: word segmentation algorithms, string matching algorithms, and others.
     For this research we implemented an experimental system with four functions. The first is preprocessing and segmenting the program code, where preprocessing removes content that is not useful, such as comments and extra spaces. The second is creating a word list for code conversion. The third is using a string matching algorithm to convert the program code into tokens. The fourth is printing the conversion results through the system's GUI.
     Finally, we test and analyze the experimental system through a set of experiments. All experimental data come from students' homework. The results show that the system supports conversion of multiple programming languages and that its output, being a character string, can be used by similarity detection algorithms based on string comparison. The experiments provide stable test data, which is important for further research on program code similarity detection systems.
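
The abstract leaves the similarity computation over converted token strings to follow-up work and does not name an algorithm. As one possible illustration, the sketch below scores two token strings by their longest common subsequence (LCS). The choice of LCS, the normalization `2 * LCS / (len(a) + len(b))`, the function names, and the sample token strings are all assumptions, not something the thesis specifies.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Normalized LCS score in [0, 1]; two empty strings count as identical."""
    if not a and not b:
        return 1.0
    return 2 * lcs_length(a, b) / (len(a) + len(b))


if __name__ == "__main__":
    # Hypothetical token strings, one character per token.
    submission_1 = "TVAN;I(VCN){VAVON;}"
    submission_2 = "TVAN;TVAN;I(VCN){VAVON;}"
    print(f"{similarity(submission_1, submission_2):.2f}")
```

A subsequence-based score tolerates inserted or deleted statements, which suits plagiarism-style edits, but it costs time proportional to the product of the two string lengths. For large assignment sets a cheaper filter would probably be needed first.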
