The Internet has an enormous capacity for disseminating information, which benefits web users. At the same time, it has become a carrier of harmful information involving subversion, pornography, and violence. The spread of such information, especially information sensitive to national security, has become a serious social problem. Identifying harmful information quickly and effectively, so as to prevent its spread and keep information on the Internet safe, has become an important task in content security.
     Related research concentrates on information filtering and automatic blocking at the gateway or on the client computer. Active inspection of suspicious sites, however, is carried out mostly by national security departments through inefficient manual work. To address this, this paper develops several ideas based on information gathering and content analysis, centering the research on how to gather and process harmful information. Overall, it studies relevant principles and technologies of the Web, natural language processing, artificial intelligence, and machine learning. First, it examines the structure of the Web and methods for calculating hyperlink weights, and proposes a crawler search strategy based on content evaluation. Second, it analyses the formal features of harmful information and then studies a repeats-based term extraction algorithm aimed at the concealed and variable character of such information. Third, it proposes a real-time text categorization method based on Bayesian theory, and introduces feedback of document features to improve classifier performance. Finally, it proposes an architecture for a system that finds harmful information on the Internet.
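     As a rough illustration of the third point, Bayesian text categorization with a feedback step can be sketched as follows. The abstract does not give the details of the proposed method, so this is only a generic multinomial naive Bayes classifier with add-one smoothing, not the paper's algorithm. The class name, the method names, the labels "flagged" and "normal", and the toy tokens are all hypothetical, and treating feedback as an incremental update of the term counts is an assumption.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesTextClassifier:
    """Generic multinomial naive Bayes with add-one (Laplace) smoothing.

    Hypothetical sketch: not the method proposed in the paper.
    """

    def __init__(self):
        self.doc_counts = Counter()              # documents seen per label
        self.term_counts = defaultdict(Counter)  # term frequencies per label
        self.vocabulary = set()                  # all terms seen so far

    def update(self, tokens, label):
        """Add one labelled document. Calling this again with a corrected
        label is the (assumed) feedback step."""
        self.doc_counts[label] += 1
        self.term_counts[label].update(tokens)
        self.vocabulary.update(tokens)

    def log_scores(self, tokens):
        """Return log P(label) + sum of log P(token | label) per label."""
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        scores = {}
        for label, n_docs in self.doc_counts.items():
            counts = self.term_counts[label]
            denom = sum(counts.values()) + vocab_size
            score = math.log(n_docs / total_docs)
            for tok in tokens:
                score += math.log((counts[tok] + 1) / denom)
            scores[label] = score
        return scores

    def classify(self, tokens):
        """Return the label with the highest posterior score."""
        scores = self.log_scores(tokens)
        return max(scores, key=scores.get)


# Toy training data (made up for illustration only).
clf = NaiveBayesTextClassifier()
clf.update(["win", "cash", "now"], "flagged")
clf.update(["cash", "prize"], "flagged")
clf.update(["meeting", "agenda", "notes"], "normal")
clf.update(["notes", "now", "lunch"], "normal")

print(clf.classify(["cash", "now"]))
print(clf.classify(["notes", "lunch"]))
```

     Because the model is just a set of counts, a new or corrected document can be folded in with a single `update` call and no retraining pass, which is one plausible reason count-based Bayesian classifiers suit real-time use.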
     Internet applications are developing rapidly. This work is of practical significance for improving the efficiency of the relevant departments, cleaning up the web environment, and promoting the construction of a harmonious society. It is also a useful exploration of content security on the Internet. Moreover, its results are valuable for combining networking, natural language processing, and artificial intelligence in information security.
