Feature analysis of diagrams with applications to retrieval and classification.

详细信息

作者：Shao ; Mingyan.
学历：Ph.D.
年：2010
导师：Futrelle, Robert P.,eadvisorAslam, Javed A.ecommittee memberBetke, Margritecommittee memberFell, Harriet J.ecommittee memberRonald, Williams J.ecommittee member
毕业院校：Northeastern University
Department：Computer Science
ISBN：9781109733297
CBH：3398929
Country：USA
语种：English
FileSize：3466860
Pages：101

文摘

Millions of diagrams are available online in scientific and technical documents. The knowledge contained in diagrams is a rich resource, but one that has been little exploited, in comparison to text. In this thesis we demonstrate that it is possible to efficiently build compact representations of diagram content. The representations can support a wide variety of diagram-based systems such as retrieval, classification, and the building of knowledge bases that integrate text and diagrams. We demonstrate the strengths of our approach through studies of diagram retrieval as well as supervised and unsupervised machine learning for classification. The techniques are applied to a 700 diagram subset of more than 10,000 diagrams harvested from articles from the Open Access publisher, BioMed Central. A substantial set of Java-based tools was developed exclusively for this research. This will allow others to build on and extend what we have done. The major accomplishment of this thesis has been in devising a representation of diagram content that can support high-quality retrieval and classification. Past work in content-based image retrieval CBIR) has focused on approaches that include region analysis and texture in raster images. At the other extreme, detailed constraint-based grammars have been developed that can discover the full set of components in a diagram - the entities and their relations. CBIR has proven to be too imprecise, plagued by the "semantic gap". Full parsing does not scale well, because it would demand that thousands of complex grammar components would have to be developed. Our results show that there is an intermediate approach, the grapheme, that is relatively simple to work with, scales well, and yet has the power to successfully retrieve and classify diagrams. Graphemes are simple sub-elements of diagrams such as data plots, bar charts, or tree structures. Graphemes include primitives such as horizontal and vertical lines, as well as composite structures such as tick marks, groups of bars rectangles) and data point markers such as circles, triangles, diamonds, and rectangles. In the studies reported here, only twenty-four different types of graphemes were required to produce excellent results for retrieval and classification for a diagram collection made up of five diagram types. The description of a each diagram is a feature set, consisting of twenty-four attribute-value pairs: <； grapheme, count ＞. The feature set can be described as a "bag of graphemes", similar to the "bag of words" description of documents that has been so successful in document retrieval. A feature vector forms a set of points in an n-dimensional feature space which is used as input to unsupervised and supervised machine learning analysis. Unsupervised learning was explored in a diagram information retrieval paradigm. A diagram is chosen as the query. The nearest diagrams, using a feature-based metric, are retrieved as a ranked set. As an example, when each diagram in a collection of bar charts is applied as a query on our full diagram collection, the mean precision of the results is 98.3% for the top ten diagrams returned. The accuracy of the results from supervised learning varies from 81% for point-based data plots to 93% for trees using the AdaBoostM1 algorithm. We have focused on articles in PDF format that contain vector drawing instructions in a discrete and precise form. The technical challenges in extracting vector elements from PDF files are substantial. A PDF file is a sequence of commands for drawing primitives and for altering the graphics state containing line widths, colors, fonts, and other parameters. The sequence of commands does not map in any simple way to the two-dimensional layout of the document it represents. A Java-based emulator was designed to track the command sequences and graphic state changes, with the final result being a collection of static, non-procedural java object instances. Thus, the pair of commands to position at a line start and then draw to the line end, become a single line object defined by its start and end points. The resulting objects are installed in a spatial index that leads to a substantial speed-up in identifying graphemes. It should be obvious that our approach can be extended to hundreds or even thousands of grapheme types. The results of this thesis demonstrate, beyond a doubt, that the grapheme-based approach is a powerful and potentially useful one for future information systems. Combined with text-based search, our techniques should lead to a unique new class of systems with greater versatility and breadth than any that exist today.

地址：北京市海淀区学院路29号邮编：100083

电话：办公室：(+86 10)66554848；文献借阅、咨询服务、科技查新：66554700