Rapid development has made the Internet an important resource for global information transmission and sharing. The volume of data on the Web is growing geometrically, so it is increasingly difficult to find a useful piece of information, and "information overload" has become an urgent problem. Ideally, people could query data on the Web in the same way they query a database. However, how to extract useful information from the vast amount of data on the Web remains a problem that researchers hope to solve.
     Characteristics such as large volume, heterogeneity, and dynamic change distinguish Web information extraction from traditional information extraction and bring new challenges. In recent years extraction techniques have multiplied as demand has grown, and many information extraction methods now exist in China and abroad. This dissertation investigates a method for automatically acquiring subject knowledge from the Web according to users' needs, in order to build the subject knowledge base of a smart instructional system.
     The specialized knowledge acquisition method proposed in this dissertation is based on Web information extraction and is inspired mainly by SRV, which treats information extraction as a classification problem. Combining this idea with Web information extraction based on HTML structure, we construct the framework of a Web specialized knowledge acquisition system based on information extraction and classification, and study several key techniques in that system. The contents of this dissertation are as follows:
     1. Bulk acquisition and preprocessing of web pages are analyzed. Acquiring specialized knowledge from the Web requires collecting a large number of web pages on the same topic, a need that the services of current search engines cannot meet. We present a simple and efficient method that automatically acquires web pages in bulk and matches pages on the same topic using regular expressions.
     2. A page preprocessing method is studied. Based on the meaning of the tags in the HTML document structure, an HTML container-tag tree is constructed. Using the characteristics of noise blocks and topic content blocks in pages, noise nodes are deleted from the tag tree and the topic content block is identified.
     3. A method for extracting topic information from pages is discussed. Existing information extraction methods require considerable manual intervention and prior knowledge, and different systems use different description languages. We therefore adopt an information extraction method based on XML mapping: a Jtree is built using DOM, extraction paths are acquired automatically from the tree nodes, and information extraction rules are learned, so that information extraction is automated.
     4. Chinese text feature representation methods and text classification algorithms are also analyzed. In the vector space model of text representation, the number of feature words and the dimensionality of the search space are closely tied to the efficiency of the classification algorithm. We therefore develop a feature word extraction method based on part of speech, which reduces the dimensionality of the feature vector. We also propose two modified KNN algorithms, based on feature word reduction and on data partitioning respectively, which improve the efficiency and performance of classification.
     5. A method for automatically building the training set is studied. Improving the performance of the classification algorithm requires a high-quality training set. Previous research has relied on training sets that were already established. In this study, a high-quality training set is generated automatically through Web mining, which further increases the degree of automation of specialized information acquisition.
     6. Information organization and storage methods are analyzed. The extracted specialized knowledge is organized into a form that the user's application system, the smart instructional system, can access directly, and the data are given an initial arrangement according to the needs of that system.
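Items 1 and 2 above can be illustrated with a minimal sketch. It is not the dissertation's implementation: the URL pattern, the set of container tags, and the link-text ratio used to flag noise blocks are all assumptions chosen for illustration, and the sketch assumes reasonably well-formed HTML.

```python
import re
from html.parser import HTMLParser

# Hypothetical URL pattern for one topic section of a site (an assumption).
TOPIC_URL = re.compile(r"^https?://example\.org/physics/\d+\.html$")

def match_topic_pages(urls):
    """Keep only the URLs whose form matches the topic pattern."""
    return [u for u in urls if TOPIC_URL.match(u)]

# Tags treated as containers, i.e. nodes of the tag tree (assumed set).
CONTAINERS = {"body", "div", "table", "tr", "td", "ul", "p"}

class Node:
    def __init__(self, tag, parent=None):
        self.tag, self.parent = tag, parent
        self.children = []
        self.text_len = 0  # characters of plain text directly inside this node
        self.link_len = 0  # characters of anchor text directly inside this node

class TagTreeBuilder(HTMLParser):
    """Builds a tree of container tags and records text and link-text lengths."""
    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root
        self.in_link = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.in_link += 1
        elif tag in CONTAINERS:
            node = Node(tag, self.current)
            self.current.children.append(node)
            self.current = node

    def handle_endtag(self, tag):
        if tag == "a":
            self.in_link = max(0, self.in_link - 1)
        elif tag in CONTAINERS and self.current.tag == tag and self.current.parent:
            self.current = self.current.parent

    def handle_data(self, data):
        n = len(data.strip())
        if self.in_link:
            self.current.link_len += n
        else:
            self.current.text_len += n

def totals(node):
    """Return (plain text length, link text length) for a whole subtree."""
    text, link = node.text_len, node.link_len
    for child in node.children:
        t, l = totals(child)
        text, link = text + t, link + l
    return text, link

def prune_noise(node, max_link_ratio=0.5):
    """Delete child subtrees that are mostly link text (assumed noise heuristic)."""
    kept = []
    for child in node.children:
        text, link = totals(child)
        if text + link and link / (text + link) > max_link_ratio:
            continue  # navigation-like block: drop the whole subtree
        prune_noise(child, max_link_ratio)
        kept.append(child)
    node.children = kept

def content_block(node):
    """Return the node holding the most plain text of its own."""
    best = node
    for child in node.children:
        candidate = content_block(child)
        if candidate.text_len > best.text_len:
            best = candidate
    return best
```

A typical use is to filter crawled URLs with `match_topic_pages`, feed each page to `TagTreeBuilder`, call `prune_noise` on its `root`, and then take `content_block(root)` as the candidate topic block. The link-text ratio is only one possible noise signal; the abstract does not say which block characteristics the dissertation uses.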
     This dissertation studies the key techniques at every stage of specialized knowledge acquisition based on Web information extraction, establishes the knowledge acquisition framework, and achieves a preliminary degree of automation in the acquisition process.
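The classification step in item 4 can be illustrated with a small sketch of part-of-speech-based feature word selection followed by plain k-nearest-neighbor voting. The abstract does not describe the two modified KNN algorithms, so this shows only the standard technique. The tag set (`n`, `v`), the pre-tagged input format, and all function names are assumptions; English words stand in for segmented Chinese text.

```python
import math
from collections import Counter

# Part-of-speech tags kept as feature words (assumed: nouns and verbs only).
KEEP_POS = {"n", "v"}

def features(tagged_tokens):
    """Term-frequency vector over words whose POS tag is in KEEP_POS."""
    return Counter(word for word, pos in tagged_tokens if pos in KEEP_POS)

def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def knn_classify(query, training, k=3):
    """Label a query vector by similarity-weighted vote of its k nearest examples.

    training is a list of (feature_vector, label) pairs.
    """
    nearest = sorted(training, key=lambda ex: cosine(query, ex[0]), reverse=True)[:k]
    votes = Counter()
    for vector, label in nearest:
        votes[label] += cosine(query, vector)
    return votes.most_common(1)[0][0]

# Toy training set of pre-tagged documents (hypothetical data).
TRAINING = [
    (features([("force", "n"), ("equals", "v"), ("mass", "n"), ("the", "u")]), "physics"),
    (features([("mass", "n"), ("accelerates", "v"), ("force", "n")]), "physics"),
    (features([("acid", "n"), ("reacts", "v"), ("base", "n"), ("quickly", "d")]), "chemistry"),
    (features([("acid", "n"), ("dissolves", "v"), ("salt", "n")]), "chemistry"),
]

if __name__ == "__main__":
    query = features([("the", "u"), ("force", "n"), ("moves", "v"), ("mass", "n")])
    print(knn_classify(query, TRAINING, k=3))
```

Dropping particles, adverbs, and other function words before building the vectors shrinks each vector, which in turn reduces the cost of every similarity computation in the nearest-neighbor search.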
