Design and Research of Digital Watermarking in Natural Language Text
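Abstract
Natural language is the primary, most precise, and most efficient means of human communication. With the arrival of the digital age, people encounter large volumes of electronic documents, online news, forums, blogs, and similar material every day. Digital natural language text has become the most important carrier at this new level of communication, and protecting its copyright is a problem that urgently needs solving.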
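     Digital watermarking is an important means of protecting the copyright of digital files. Research on digital watermarking began with multimedia carriers: for images, audio, and video, watermarking algorithms have been developed that exploit characteristics of human vision or hearing. Because these media are processed in similar ways and have high redundancy, the research has deepened steadily, and attack analysis such as the detection of watermarking algorithms has also gradually gained attention. Text, by contrast, poses difficulties: its processing methods are unusual, its redundancy is low, natural language rules are complex, and computational linguistics is limited. Research on text watermarking therefore started late and has produced fewer results. Because text is both common and important, however, a growing number of researchers have entered the field in recent years. Novel algorithms have appeared, ranging from formatting-based to syntactic and semantic ones, and work on detecting text watermarks has begun. Overall, though, the field has not yet produced a sufficiently practical scheme, results on watermark detection are very rare, and a systematic theoretical foundation is lacking.
     In view of this, the research work of this dissertation and its corresponding results mainly include:
     1. Research on models of digital watermarking in natural language text. We establish a communication model suited to text. Following the methodology of the foundations of cryptography, we define the concepts of watermark undetectability, program adversary, human adversary, invisible attack, and robustness. We construct a method for verifying the security of watermarking algorithms with interactive proof systems and apply it to the evaluation of real watermarking systems.
     2. Design of digital watermarks in natural language text. We propose and implement a new text watermarking algorithm, the Song Ci watermark. It is an appended, generated-text watermark: the algorithm generates a piece of Song Ci (Song-dynasty lyric poetry) directly from the watermark information. The piece conforms to a particular tune pattern (cipai) in character count, line count, sentence form, tonal pattern, and rhyme, so it is highly deceptive. The generated Ci is appended to the carrier text. At verification time it is extracted, and the watermark information is recovered by looking it up in a lexicon. Because the generated Ci is highly deceptive, the watermark is well concealed. Experiments show that the ratio of watermark information size to generated text size reaches 16%, so the method can also serve as a high-embedding-rate text steganography algorithm. To the best of our knowledge, this is the first text watermarking algorithm that uses a special literary genre.
     3. Research on detecting digital watermarks in natural language text. For the formatting-based Snow watermark, we design a detection algorithm and point out an approach to detecting formatting-based watermarking algorithms in general. For semantic watermarks based on synonym substitution, we design a detection algorithm that uses context information: by examining whether a keyword is the word in its synonym set best suited to the context, we judge whether information has been embedded at that point, and the results over all keywords in the text determine whether the text is judged to carry a watermark. When comparing how well words from the same synonym set fit the same context, we use an IDF coefficient to adjust for the gap between common and rare words. Experiments show that the detection algorithm achieves 90.0% accuracy, 86.6% precision, and 82.5% recall against the T-Lex synonym watermarking system. We also design a detection method for translation-based watermarking systems.
     4. The idea of treating the entire Internet as a corpus. If every web page containing natural language text is regarded as a document in the corpus, the whole Internet can be viewed as a very large-scale, influence-ordered, continuously updated corpus. With tools such as search engines, one can extract from it information, such as natural language usage habits, that traditional corpora cannot effectively provide because of limited scale or excessive cost.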
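The lexicon lookup behind the Song Ci watermark in item 2 can be illustrated with a minimal sketch. It assumes a fixed template in which each slot offers four candidate words, so each slot carries two bits. The toy English lexicon, the three-slot template, and the names `LEXICON`, `encode`, and `decode` are all hypothetical and are not taken from the dissertation. A real scheme would also have to satisfy the tune's tonal and rhyme constraints, which this sketch ignores.

```python
# Toy slot lexicon (an assumption for illustration, not the dissertation's lexicon).
# Every slot offers 4 candidate words, so each slot carries 2 bits.
LEXICON = [
    ["spring", "autumn", "winter", "summer"],
    ["river", "mountain", "moon", "wind"],
    ["flows", "sleeps", "rises", "falls"],
]
BITS_PER_SLOT = 2


def encode(bits: str) -> list:
    """Map a bit string to one word per slot, zero-padding to fill the template."""
    capacity = BITS_PER_SLOT * len(LEXICON)
    if len(bits) > capacity:
        raise ValueError("watermark too long for this template")
    padded = bits.ljust(capacity, "0")
    words = []
    for slot, candidates in enumerate(LEXICON):
        chunk = padded[slot * BITS_PER_SLOT:(slot + 1) * BITS_PER_SLOT]
        words.append(candidates[int(chunk, 2)])
    return words


def decode(words: list) -> str:
    """Recover the bit string by looking up each word's index in its slot."""
    return "".join(
        format(LEXICON[slot].index(word), "0{}b".format(BITS_PER_SLOT))
        for slot, word in enumerate(words)
    )
```

With this toy lexicon, `encode("011011")` selects one word per slot, and `decode` recovers the same six bits by looking up each word's position in its slot.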
Natural language is the primary, most exact, and most efficient means of human communication. With the development of digital technology, people encounter large numbers of electronic documents, online news, forums, blogs, and so on. Digital natural language documents have become the most important medium on the Internet. How to protect the copyright of these digital documents is an urgent problem.
     Digital watermarking is an important way to protect the copyright of digital files. Research in this area first developed for multimedia. By exploiting the limitations of the human visual and auditory systems, researchers have designed watermarking algorithms for images, audio, and video. Because these multimedia carriers are processed in similar ways and have ample redundancy, research on watermark design has developed rapidly, and research on steganalysis of these schemes has received growing attention.
     By contrast, owing to special processing methods, low redundancy, the complexity of natural language rules, and the limitations of computational linguistics, research on watermarking in digital text started late and has achieved less. However, because text is common and important in daily life, more and more researchers have entered this area in recent years. New watermarking algorithms have emerged, from formatting-based to syntactic and semantic ones. Meanwhile, steganalysis of text watermarking has begun. Generally speaking, in the area of digital watermarking in natural language text, schemes suitable for practical application have not yet been designed, steganalysis results are still rare, and a theoretical basis has yet to be established. With this in mind, the main research work and corresponding contributions of this dissertation are as follows:
     1. Research on models for digital watermarking in natural language text. We establish a communication model specifically for text and use the methodology of the foundations of cryptography to define the concepts of undetectability, program adversary, human adversary, invisible attack, and robustness. We also develop an approach to proving the security of watermarking algorithms with interactive proof systems, and we use it to evaluate some actual watermarking systems.
     2. Design of watermarking schemes for digital natural language text. We propose and implement a new digital text watermarking system, StegCi. It is an appending watermarking scheme: the encoding algorithm produces a piece of Ci from the watermark. The generated Ci accords with a particular tune in its number of lines and words, sentence patterns, rhythm, and rhyme, so it looks innocuous. The stego Ci is then added to the carrier text. During verification, the watermark is extracted from the stego Ci by looking it up in a lexicon. Because stego Ci look innocuous, the watermark is difficult to detect. Experimental results show that the ratio of watermark to carrier reaches 16%, which means StegCi is also a text steganography system with a high embedding ratio. To the best of our knowledge, this is the first text watermarking scheme that makes use of a special literary genre.
     3. Detection of watermarking schemes for digital natural language text. For the Snow algorithm, which belongs to the class of formatting methods, we design a detection algorithm and point out a general way to steganalyze formatting schemes. For synonym-substitution-based schemes, which are semantic, we design a detection algorithm that makes use of context information. By examining whether a keyword is the most suitable word in its synonym set for the context, we judge whether that keyword carries a watermark bit. The examination over the whole text leads to the final judgement of whether the text is watermarked. When comparing words in a synonym set for the same context, we use IDF to balance common words against rare ones. Experimental results for the T-Lex watermarking system show 90% accuracy, 86.6% precision, and 82.5% recall. We also design a detection algorithm for translation-based watermarking systems.
     4. Developing the idea of treating the whole Internet as a corpus. If each web page that contains natural language text is treated as a document in this corpus, the whole Internet can be regarded as a large-scale, influence-weighted, up-to-date corpus. With the help of tools such as search engines, people can obtain useful information about natural language usage that is very difficult to get from traditional corpora because of their limited size or unaffordable cost.
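The context-based detection idea in item 3 can be sketched roughly as follows. The abstract does not give the exact scoring, so everything here is an assumption: the co-occurrence table `cooc`, the document-frequency table `doc_freq`, the smoothed IDF formula, the choice to multiply raw co-occurrence by IDF, and the 0.5 decision threshold are hypothetical stand-ins chosen only to show the shape of the approach.

```python
import math


def idf(word, doc_freq, n_docs):
    """Smoothed inverse document frequency; rarer words get a larger weight."""
    return math.log((1 + n_docs) / (1 + doc_freq.get(word, 0))) + 1.0


def best_fit(synset, context, cooc, doc_freq, n_docs):
    """Return the synonym with the highest IDF-weighted co-occurrence with the context."""
    def score(word):
        raw = sum(cooc.get((word, c), 0) for c in context)
        return raw * idf(word, doc_freq, n_docs)
    return max(synset, key=score)


def looks_watermarked(positions, cooc, doc_freq, n_docs, threshold=0.5):
    """Flag a text when more than `threshold` of its keywords are not the best fit.

    `positions` is a list of (observed_word, synonym_set, context_words) tuples.
    """
    if not positions:
        return False
    suspicious = sum(
        1 for word, synset, context in positions
        if best_fit(synset, context, cooc, doc_freq, n_docs) != word
    )
    return suspicious / len(positions) > threshold
```

The IDF weight keeps a very common synonym from winning purely on raw frequency, which is one plausible reading of "balancing common words against rare ones". A real detector would estimate the tables from a large corpus and tune the threshold on labelled data.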
