两元矩阵聚类算法研究

英文题名：Studies on Algorithms for Clustering of Binary Matrices
作者：刘培强
论文级别：博士
学科专业名称：计算机软件与理论
中文关键词：算法 ; 复杂性 ; 两元矩阵 ; 聚类 ; 近似算法
英文关键词：Algorithm ; Complexity ; Binary Matrix ; Clustering ; Approximation
英文关键词：Algorithm
学位年度：2013
导师：朱大铭
学科代码：081202
学位授予单位：山东大学
论文提交日期：2013-04-15

摘要

聚类是人们认识和理解世界的最基本方式之一,广泛应用于分子生物学、金融、市场、电子商务等众多领域的数据分析。
     聚类的目标是根据一个对象群体的属性数据,基于相似性函数,寻找多个对象分组,使得同一组内的对象相似度高,而不同组的对象相似度低。同组内具有共同特征的对象称为一个簇(cluster)。在计算对象相似性时,若将对象的全部属性用于比较,即是人们通常所指的聚类,或称为单向聚类；在一些应用场合中,我们期望使用部分属性而不是全部属性衡量对象的相似性,这种利用部分属性来比较对象特征的聚类计算则称为双向聚类。双向聚类结果中,同组内具有共同特征的对象称为一个双向簇(bicluster)。
     聚类数据通常表示为实数矩阵,行对应于对象,列对应于属性。对于单向聚类计算,恰当地重排序行或列后,簇在聚类矩阵中对应于条状子矩阵；而对于双向聚类计算,恰当地重排序行和列后,双向簇在聚类矩阵中对应于块状子矩阵。
     在众多应用领域中,聚类实数矩阵可简化为元素均为0/1的两元矩阵。例如,在文本挖掘技术中,矩阵的行表示文档,列表示关键词。若文档i中包含关键词j,则矩阵的元素(i,j)值为1,否则为O。若一个子矩阵的元素均为1,则这个矩阵中的文档含有相同关键词,这些文档可能属于同一领域。此外,有时为便于分析或计算,也会将实数矩阵转化为两元矩阵。
     由于两元矩阵的重要性和普遍性,两元矩阵聚类分析算法广泛应用于市场购物篮分析、文本挖掘技术、基因表达谱分析、电子商务数据分析、社区发现等领域。本文主要研究两元矩阵上的两个聚类问题：带缺失值的两元指纹向量聚类问题、两元矩阵的k-子矩阵划分问题。
     1.带缺失值的两元指纹向量聚类问题
     带缺失值的两元指纹向量聚类问题源自于计算生物学中的基因表达谱聚类分析。
     基因是控制生命体性状的基本遗传单位,它是基因组序列上的一个片段。了解基因的功能,对于我们研究生命奥秘至关重要。然而,在已测序的基因中,功能尚且未知的超过40%。如何快速、准确推断基因功能是目前分子生物学亟待解决的问题。
     序列相似性比对是发现基因功能的一个重要方法,但遗憾的是在基因的一个功能家族中,基因序列相似性是很弱的,此外,还有一些基因,其功能是相同的,但没有任何序列相似性,故不能完全依靠序列比对推断基因功能。另外一种用来推断基因功能的重要方法是DNA微阵列技术,此项技术是近年来发展起来的分子生物学革命性技术之一。DNA微阵列实验产生的数据称为基因表达谱。由于基因表达谱分析算法的重要性和极具挑战性,使其成为当前分子生物学中研究的热点问题。
     基因表达谱可表示为一个m×n实数矩阵A=[aij]m*n,行表示基因,列表示实验条件。矩阵A的元素aij表示基因i在条件j下的表达水平。基因表达谱中存在较多数据未能真实反应基因的表达水平,称之为缺失数据(missing values)。在基因表达谱中,缺失数据可能会多达10%甚至更高,至少有一个缺失数据的基因占到90%以上。如何处理这些缺失值是分析基因表达谱的一个重要问题。
     聚类是分析基因表达谱的重要方法。在大多数基因表达谱聚类算法中,或是没有考虑缺失值,或是简单地替之以随机数。Figueroa、Borneman等人将基因表达谱转换为0-1-N向量集,称为两元指纹向量集,提出了带缺失值的指纹向量聚类计算问题,称为带缺失值的两元聚类问题(the binary clustering with missing values problem,简记为BCMV).在聚类的同时,此算法考虑了如何通过计算确定缺失值的数值。记BCMV(p)为含有常量参数p的BCMV I问题,p表示指纹向量中缺失值个数均不超过po Figueroa等给出了BCMV(1)的多项式时间精确算法；证明了BCMV(3)是NP-难的；给出了求解BCMV问题的启发式算法—贪心图划分算法(GCP)及其散列表实现法。BCMV(2)的复杂性则一直没有确定。本文讨论了BCMV(2)问题的复杂性,以三元素严格覆盖问题作为归约问题,证明其为NP-难的。此外,本文基了Figueroa等给出的GCP算法,给出了一种改进实现方式：链表法。理论证明链表法的时间、空间复杂度均低于散列表法,求解速度快于散列表法。实践表明,求解相同实例时,链表法计算耗时平均为散列表法的20%;当计算缺失值位数超过6的实例时,链表法计算耗时几乎总不超过散列表法的11%。此外,本文提出了求解BCMV(?)司题的基于线性规划舍入法的算法(the BCMV algorithm based on LP-rounding,简记为BLP),其近似比为BLP算法的速度略快于GCP链表法。本文使用闵可夫斯基测度和雅可比相似度系数比较了BLP算法与GCP算法的聚类质量,比较结果表明,BLP算法略优于GCP算法。
     2.两元矩阵的k-子矩阵划分问题
     以矩阵A=[aij]m*n表示m个对象、且每个对象有n个属性的数据集,其中aij表示第i个对象的第j个属性值。双向聚类的最简单求解目标是寻找矩阵行(对象)的一个子集,及列(属性)的一个子集,使得此行子集中的对象在此列子集中的属性下是高度相似的。这样的对象子集及属性子集下的数据称为双向簇。如果双向聚类问题的实例是两元矩阵,则双向聚类的目标是在矩阵中寻找元素值全为“1”的子矩阵(双向簇)。
     在许多应用,双向聚类的目标是同时寻找多个子矩阵(双向簇)。实际上,Madeira和Oliveira等人已经讨论了这个问题,并总结出了8种双向簇解格式：行与列排他式、行排他式、列排他式、无重叠棋盘式、树结构无重叠式、任意重叠式等。本文所研究的两元矩阵的k-子矩阵划分问题属于行列排他式解格式的双向聚类问题。两元矩阵的子矩阵划分问题(the submatrix partition of binary ma-trices problem,简记为SPBM)是双向聚类的一种计算模型。SPBM要求在两元矩阵中寻找多个行、列排他,且元素均为1的子矩阵,且使得矩阵的每一行(列)出现且仅出现在一个子矩阵中。记κ-SPBM为含有常量参数k的SPBM问题,k表示要求寻找的子矩阵为常数k个。
     本文中,我们讨论了κ-SPBM的复杂性。证明中用到了单调三取一可满足问题(M03)和二分图的二分团k划分问题(the partition of a bipartite into k bicliques problem,简记为κ-PBB),其中k是一个正整数常量。MO3是由Schaefer证明的一个著名NP-完全问题；k≥3时,κ-PBB的复杂性尚且是未知的。本文首先证明了M03问题的一个变体形式是NP-完全的。然后,以此变体形式的M03问题为归约问题,证明了3-PBB是NP-完全的,进而证明了κ-PBB(κ≥3)亦是NP-完全的。最后,以κ-PBB (κ≥3)为归约问题,证明了κ-SPBM (κ≥3)是NP-完全的。
     此外,本文还给了一个求解κ-SPBM算法的指数精确算法。
     本文的进一步研究工作主要包括二个方面：
     1.真实世界的数据大多是有噪声的数据,如分子生物学领域的基因表达谱、物联网领域的传感器数据、金融领域的股票数据等。因此,设计可以处理噪声数据的方法是十分必要的。目前,对于聚类分析,用于处理噪声数据的方法大体可分三类：图方法、贝叶斯法、概率法。准二分团法(quasi-biclique)是图方法中的主要方法。初步猜想准二分团问题的两个变体形式：最多边准二分团、准二分团划分问题均是NP-难的,下步将尝试证明这类未解决问题的复杂性,并设计求解算法。
     2.尽管不存在比较聚类算法性能的“黄金标准”,但是如何针对不同数据,选择恰当的聚类算法?如何比较某些算法的性能?仍是非常重要的。因此,不同聚类算法的性能比较方法及如何根据数据特点、聚类要求选择恰当聚类方法也是下一步的重要研究内容。
Clustering is one of the most fundamental modes of perceiving and understand-ing the world, and due to its importance and prevalence, it has been widely applied in many domains such as computational biology, marketing, finance, e-commerce.
     Given a data set of n objects and m attributes, clustering aims at finding groups of object based on a measure function of similarity such that the objects in the same groups are highly similar to each other, while the objects from different groups have low similarity to each other. The objects in the same groups are called clusters. In some applications, it is natural to expected that each associated part of objects recog-nized as a cluster is induced by a certain subset of attributes. The main goal of biclus-tering is to find a subset of objects which exhibit high similarity across a subset of at-tributes. In this case, the combination of the subset of objects and the subset of attrib-utes is called a bicluster.
     Usually, a data set of clustering is arranged in a data matrix with rows corre-sponding to objects and columns to attributes. After an appropriate reordering of rows and columns, a cluster forms a strip submatrix and a bicluster forms a block subma-trix.
     In many applications, a data matrix can be simply expressed as a matrix with en-tries "1" or "0", which are also called binary matrices. For example, a date set of m documents and n words is arranged in a binary matrix A=[a(?)]m*n with rows corre-sponding to documents and columns to words in a text mining problem. If an entry (i, j) of A is "1", then word j is present in the document i. If "0", the word is not. If each entry of a submatrix is "1", then the documents in the submatrix contain the same words, and these documents have a good chance of belonging to the same domain. In addition, we can transform a real matrix into a binary matrix in clustering for conven-ient analysis.
     Due to its importance and prevalence, binary matrix clustering algorithms are applied to many applications, such as market-basket data analysis, text mining, com-munity detection, and numerous other applications. In this paper, we investigate two biclustering problems in binary matrices:the binary clustering with missing values problem and the κ-submatrix partition of binary matrices problem.
     1. The binary clustering with missing values problem (BCMV)
     The binary clustering with missing values problem comes from gene expression data analysis in computational biology.
     A gene is locatable region of genomic sequence, corresponding to a unit of in-heritance. The basic knowledge of functions of genes is a powerful tool to understand mechanisms of living. However, the genes in sequenced genomes, whose functions are still unknown, are greater than40%. How to rapidly and accurately discover func-tions of genes is a problem to be solved urgently in molecular biology.
     We can discover the function of a newly sequenced gene by sequence compari-son, i.e., similarity comparison between it and previously sequenced genes with known functions. Unfortunately, since for some genes in a functional family, perhaps the sequence similarity of them is so weak, and for some genes with the same function, perhaps they have no sequence similarity at all, we cannot reliably infer the function of the newly sequenced gene based on sequence similarity. Another approach used to derive gene functions is to analyze gene expression data from DNA microarray. Re-cently, DNA microarray is one of revolutionary technologies in molecular biology. Gene expression data analysis has generated considerable recent research interest, be-cause of its importance and the significant challenge it poses.
     Usually, gene expression data is arranged in a data matrix A=[a(?)]m*n with rows corresponding to genes and columns to conditions. Entry a,j of A is a real a number, and represents the expression level of gene/under conditionj.
     Due to a variety of reasons including hybridization failures, insufficient resolu-tion, image noise and corruption, portions of data are unavailable or unobserved. These data are referred to as missing values. Up to10%of gene expression data are missing values. Even up to90%of genes have one or more missing values in some data sets. Methods of handling missing values is a challenging problem for gene ex-pression data analysis.
     Clustering is a vital approach of gene expression data analysis. Most of the gene expression data biclustering algorithms do not concern missing values or simply re-place them with randomized numbers. Figueroa et al. transformed gene expression data into0-1-TV vectors, called fingerprint vectors, and proposed the binary clustering with missing values problem (BCMV) to identify clusters and resolve the missing values simultaneously. Let BCMV(p) denote the BCMV problem containing the con-stant parameter p, where each fingerprint vector has at most p missing values. Figueroa et al. showed that BCMV(1) can be solved in polynomial time and proved that BCMV(3) is NP-hard. Further, they presented a heuristic algorithm, called the greedy clique partition algorithm (GCP). and gave an implementation of GCP based on hash tables.The hardness of BCMV(2) is still an interesting open problem. In this paper, we discuss the complexity of BCMV(2), and show that BCMV(2) is NP-hard by reduction from the exact cover by3-sets problem (X3C). Moreover, we propose a novel implementation of GCP based on linked lists. The performance of the novel im-plementation is better than the original one. We use a linked list to store the sets of compatible vertices. The linked list can be obtained by scanning the fingerprint vector bits one by one such that the time complexity for obtaining the sets of compatible ver-tices is reduced from0(m· n·2P) to O(m·(n-p+1)·2P), and the the running time of finding a unique maximal clique or a maximum clique is improved from O(m· p·2P) to O(m·2P). The results on real datasets and synthetic datasets all display that the improved implementation takes49%or lower space complexity of the original one averagely for computing the same instance. It can use20%time of the original one for solving the same instance. Particularly, the novel implementation can almost always use not more than11%time of the original one to solve the instance with more than6missing values in each fingerprint. Furthermore, we present a linear programming based approximation algorithm for resolving the BCMV problem, called the BCMV algorithm based on LP-rounding (BLP). We show that the approximation guarantee of BLP is proved to be O(logn) and the expect number of the running times for obtaining an approximation solution is proved to be at most3. We compare the clustering quali-ty of BLP and GCP using the Minkowski measure and the Jaccard's coefficient, and the results on both real and simulated data show that BLP runs faster and performs better than GCP.
     2. The κ-submatrix partition of binary matrices problem (κ-SPBM)
     Let a data set of m objects and n attributes be given as a matrix A=[aij]m*n, where the value of a,(?) is the value of jth attribute in ith object. The simplest aim of bi-clustering is to find a subset of rows (objects) that exhibit similar behavior across a subset of columns (attributes), or vice versa. The combination of the subset of objects and the subset of attributes is called a bicluster. A bicluster forms a contiguous rectan-gle after an appropriate reordering of rows and columns, i.e., a bicluster is a submatrix of A. If an instance of the biclustering problem is a binary matrix, then biclustering aims at finding submatrices with entries "1", i.e., each element of biclusters is "1"
     In some applications, the main goal of biclustering is to simultaneously find many submatrices (biclusters) in a matrix. Madeira and Oliveira discussed this issue and summarized eight biclustering patterns:exclusive row and column biclusters, ex-clusive row biclusters, exclusive column biclusters, checkerboard structure, arbitrarily positioned overlapping biclusters, nonoverlapping biclusters with tree structure, nonoverlapping nonexclusive biclusters, and overlapping biclusters with hierarchical structure. The biclustering pattern of the κ-submatrix partition of binary matrices problem (κ-SPBM) is exclusive row and column biclusters.κ-SPBM is a variant of the biclustering problem. Its input contains an n x m binary matrix and a constant positive integer k. The main goal of κ-SPBM is to find exactly k submatrices with entries "1" such that these k submatrices are pairwise row and column exclusive and each row (column) of the matrix occurs in exactly one of these k submatrices.
     In this paper, we discuss the complexity of κ-SPBM. There are two problems used to prove the complexity of κ-SPBM:the monotone one-in-three3SAT problem (MO3) and the partition of a bipartite into k bicliques problem (κ-PBB), where k is a constant positive integer. MO3is a well known NP-complete problem proved by Schaefer, but κ-PBB proposed by Bein et al. is still an open problem for κ≥3, where k is a constant positive integer. We first show that a variant of MO3is NP-complete. Then we show that3-PBB is NP-complete by reduction from the variant of MO3. Moreover, we show that κ-PBB (k>3) is also NP-complete by reduction from3-PBB. Finally, we prove that κ-SPBM is NP-complete for κ≥3, where k is a constant posi-tive integer.
     In addition, we present an exact exponential algorithm for κ-SPBM.
     In the past60years, thousands of clustering algorithms have been successfully applied to different domains. Despite this, many novel issues to improve our under-standing of clustering are addressed in many applications, including computational biology, social network, and collaborative filtering. Future work should focus on the following directions.
     1. Usually, real-world data are noisy, e.g.. gene expression data, sensors data, and financial ratios. Therefore, it is important to handle noisy data in clustering. There are three main different approaches presented to handle noisy data:the graph based ap-proach, the Bayesian based approach, and the probability based approach. The qua- si-biclique based approach is a graph based approach. We conjecture that the maxi-mum edge quasi-biclique problem and the quasi-biclique partition of a bipartite are all NP-hard. We should try to prove the complexity of these two problems and design efficient algorithms.
     2. There is no single clustering algorithm simultaneously satisfying all clustering problem, and there is no "golden" criterion to evaluate performances of clustering al-gorithms. In spite of this, it is still important that how to choose a right clustering al-gorithms by guidance, and how to compare two different clustering algorithms by cri-teria. It is another topic of future research.

引文

[1]S Arora, B Barak. Computational complexity:a modern approach. Cambridge University Press Cambridge, UK,2009.
    [2]堵丁柱,葛可一,王洁.计算复杂性导论.高等教育出版社,2002.
    [3]M Garey, D Johnson. Computers and intractability. A guide to the theory of NP-completeness. WH Freeman and Company, San Francisco, Calif,1979.
    [4]朱大铭,马绍汉.算法设计分析.高等教育出版社,2009.
    [5]DZ Du, KI Ko, X Hu. Design and Analysis of Approximation Algorithms. Springer New York,2012.
    [1]P-N Tan. Introduction to data mining. Pearson Education India,2007.
    [2]AK Jain. Data clustering:50 years beyond K-means. Pattern Recognition Letters, 2010,31 (8):651-666.
    [3]Y Cheng, GM Church. Biclustering of Expression Data. In:Bourne PE, Gribskov M, Altman RB, eds. Proc. of ISMB'00:AAAI Press,2000.93-103.
    [4]SC Madeira, AL Oliveira. Biclustering Algorithms for Biological Data Analysis: A Survey. IEEE/ACM Trans Comput Biol Bioinformalics,2004, 1(1):24-45.
    [5]A Tanay. R Sharan, R Shamir.Biclustering Algorithms:A Survey. Handbook of Computational Molecular Biology. Aluru S. ed. London; Chapman and Hall/CRC Press.2006:2621-2617.
    [6]K Sim, V Gopalkrishnan, A Zimek, et al. A survey on enhanced subspace clustering. Data Mining and Knowledge Discovery,2013,26(2):332-397.
    [7]BGe Mirkin. Mathematical classification and clustering. Springer.1996.
    [8]J Hartigan. Direct clustering of a data matrix. Journal of the American Statistical Association,1972:123-129.
    [9]S Busygin, O Prokopyev, P Pardalos. Biclustering in data mining. Computers & Operations Research,2008.35(9):2964-2987.
    [10]H Wang. W Wang, J Yang, et al. Clustering by pattern similarity in large data sets.In:Proc. of the 2002 ACM SIGMOD international conference on Management of data:ACM.2002.394-405.
    [11]R Agrawal, J Gehrke. D Gunopulos, et al. Automatic subspace clustering of high dimensional data for data mining applications. SIGMOD Rec.1998, 27(2):94-105.
    [12]F Sebastiani. Machine learning in automated text categorization. ACM computing surveys (CSUR),2002,34(1):1-47.
    [13]J Yang, W Wang, H Wang, et al.6-clusters:capturing subspace correlation in a large data set. In:Agrawal R, Dittrich K, Ngu AH, eds. Proc. of the 18th International Conference on Data Engineering:2002.517-528.
    [14]H-P Kriegel, P Kroger, A Zimek. Clustering high-dimensional data:A survey on subspace clustering, pattern-based clustering, and correlation clustering. ACM Transactions on Knowledge Discovery from Data (TKDD),2009,3(1):1-58.
    [15]A Figueroa, J Borneman, T Jiang. Clustering binary fingerprint vectors with missing values for DNA array data analysis. Journal of Computational biology, 2004, 11(5):887-901.
    [16]ZY Zhang, T Li, C Ding, et al. Binary matrix factorization for analyzing gene expression data. Data Mining and Knowledge Discovery,2010,20(1):28-52.
    [17]A Tanay, R Sharan, R Shamir. Discovering statistically significant biclusters in gene expression data. Bioinformatics 2002,18 Suppl 1:S136-144.
    [18]A Prelic, S Bleuler, P Zimmermann, et al. A systematic comparison and evaluation of biclustering methods for gene expression data. Bioinformatics, 2006,22(9):1122-1129.
    [19]L Wang, Y Lin, X Liu. Approximation Algorithms for Biclustering Problems. SIAM Journal on Computing,2008,38(4):1504-1518.
    [20]R Gelbard, O Goldman, I Spiegler. Investigating diversity of clustering methods: An empirical comparison. Data& Knowledge Engineering,2007, 63(1):155-166.
    [21]刘培强,朱大铭,谢青松,等.两元指纹向量聚类问题的复杂性与改进启发式算法.软件学报,2008,19(3)：500-510.
    [22]AW-C Liew. N-F Law, H Yan. Missing value imputation for gene expression data:computational techniques to recover missing data from available information. Briefings in bioinformatics,2011,12(5):498-513.
    [1]B Alberts, A Johnson, J Lewis, et al. Molecular Biology of the Cell. New York; Garland Science Inc.2002.
    [2]R Shamir, R Sharan, D Tsur. Cluster graph modification problems. Discrete Applied Mathematics,2004.144(1-2):173-182.
    [3]SC Madeira, AL Oliveira. Biclustering Algorithms for Biological Data Analysis: A Survey. IEEE/ACM Trans Comput Biol Bioinformatics,2004, 1(1):24-45.
    [4]A Tanay, R Sharan, R Shamir.Biclustering Algorithms:A Survey. Handbook of Computational Molecular Biology.Aluru S, ed. London; Chapman and Hall/CRC Press.2006:2621-2617.
    [5]S Busygin, O Prokopyev, P Pardalos. Biclustering in data mining. Computers& Operations Research,2008,35(9):2964-2987.
    [6]CHQ Ding. Analysis of gene expression profiles:class discovery and leaf ordering.Jn:Proc. of the sixth annual international conference on Computational biology:ACM,2002.127-136.
    [7]E Hartuv, AO Schmitt, J Lange, et al. An algorithm for clustering cDNA fingerprints. Genomics,2000,66(3):249-256.
    [8]R Sharan, R Shamir. CLICK:a clustering algorithm with applications to gene expression analysis. In:Bourne P, ed. Proc. of the Eighth International Conference on Intelligent Systems for Molecular Biology. San Diego, California, USA:AAAI Press,2000.307-316.
    [9]EP Xing, RM Karp. CLIFF:clustering of high-dimensional microarray data via iterative feature filtering using normalized cuts. Bioinformatics,2001,17(suppl l):S306-S315.
    [10]R Herwig, AJ Poustka, C Muller, et al. Large-scale clustering of cDNA-fingerprinting data. Genome research,1999,9(11):1093-1105.
    [11]R Drmanac, S Drmanac. cDNA screening by array hybridization. Methods in enzymology,1999,303:165-178.
    [12]S Drmanac, N Stavropoulos, I Labat, et al. Gene-representing cDNA clusters defined by hybridization of 57,419 clones from infant brain libraries with short oligonucleotide probes. Genomics,1996,37(1):29-40.
    [13]MB Eisen, PT Spellman, PO Brown, et al. Cluster analysis and display of genome-wide expression patterns.In:Proc. of the National Academy of Sciences, 1998,95(25):14863-14868.
    [14]P Sneath. R Sokal. Numerical taxonomy:The principles and practice of numerical classification. San Francisco:CA:Freeman,1973.
    [15]P Tamayo. D Slonim, J Mesirov, et al. Interpreting patterns of gene expression with self-organizing maps:methods and application to hematopoietic differentiation. In:Proc. of the National Academy of Sciences 1999,1999, 96(6):2907-2912.
    [16]P Toronen, M Kolehmainen, G Wong, et al. Analysis of gene expression data using self-organizing maps. FEBS letters,1999,451(2):142-146.
    [17]Y Barash. N Friedman. Context-specific Bayesian clustering for gene expression data. Journal of Computational Biology,2002,9(2):169-191.
    [18]A Ben-Dor, R Shamir, Z Yakhini. Clustering gene expression patterns. Journal of computational biology,1999,6(3-4):281-297.
    [19]GJ McLachlan, R Bean, D Peel. A mixture model-based approach to the clustering of microarray expression data. Bioinformatics,2002,18(3):413-422.
    [20]W Pan, J Lin, CT Le. Model-based cluster analysis of microarray gene-expression data. Genome Biology,2002,3(2):1-0009.0008.
    [21]Q Sheng, Y Moreau, B De Moor. Biclustering microarray data by Gibbs sampling. Bioinformalics-Oxford,2003,19(2):196-205.
    [22]A Ben-Dor, B Chor, R Karp, et al. Discovering local structure in gene expression data:the order-preserving submatrix problem. In:Myers G, Hannenhalli S, Sankoff D, eds. Proc. of the sixth annual international conference on Computational biology. New York, NY, USA:ACM,2002.49-57.
    [23]R Shamir, R Sharan.Algorithmic approaches to clustering gene expression data. Current Topics in Computational Biology MIT Press.2002:269-300.
    [24]AK Jain. Data clustering:50 years beyond K-means. Pattern Recognition Letters, 2010,31(8):651-666.
    [25]K Sim, V Gopalkrishnan, A Zimek, et al. A survey on enhanced subspace clustering. Data Mining and Knowledge Discovery,2013,26(2):332-397.
    [26]J Tuikkala, L Elo, OS Nevalainen, et al. Improving missing value estimation in microarray data with gene ontology. Bioinformatics,2006,22(5):566-572.
    [27]AG De Brevern, S Hazout, A Malpertuy. Influence of microarrays experiments missing values on the stability of gene groups by hierarchical clustering. BMC bioinformatics,2004,5(1):114.
    [28]A Figueroa, J Borneman, T Jiang. Clustering binary fingerprint vectors with missing values for DNA array data analysis. Journal of Computational biology. 2004, 11(5):887-901.
    [29]R Karp. Reducibility among combinatorial problems. Complexity of Computer Computations,(RE Miller and JM Thatcher, eds.),85-103. Plenum Press.1972.
    [30]M Garey, D Johnson. Computers and intractability. A guide to the theory of NP-completeness. WH Freeman and Company, San Francisco, Calif,1979.
    [1]A Figueroa, J Borneman, T Jiang. Clustering binary fingerprint vectors with missing values for DNA array data analysis. Journal of Computational biology, 2004, 11(5):887-901.
    [2]R Motwani, P Raghavan. Randomized algorithms. Cambridge university press, 1995.
    [3]L Valinsky, G Delia Vedova, AJ Scupham, et al. Analysis of bacterial community composition by oligonucleotide fingerprinting of rRNA genes. Applied and Environmental Microbiology,2002,68(7):3243-3250.
    [4]L Valinsky, G Delia Vedova, T Jiang, et al. Oligonucleotide fingerprinting of rRNA genes for analysis of fungal community composition. Applied and Environmental Microbiology,2002,68(12):5999-6004.
    [1]J Hartigan. Direct clustering of a data matrix. Journal of the American Statistical Association,1972:123-129.
    [2]Y Cheng, GM Church. Biclustering of Expression Data. In:Bourne PE, Gribskov M, Altman RB, eds. Proc. of ISMB'00: AAAI Press,2000.93-103.
    [3]SC Madeira, AL Oliveira. Biclustering Algorithms for Biological Data Analysis: A Survey. IEEE/ACM Trans Comput Biol Bioinformatics,2004, 1(1):24-45.
    [4]S Busygin, O Prokopyev, P Pardalos. Biclustering in data mining. Computers& Operations Research,2008,35(9):2964-2987.
    [5]F Sebastiani. Machine learning in automated text categorization. ACM computing surveys (CSUR),2002,34(1):1-47.
    [6]H Wang, W Wang, J Yang, et al. Clustering by pattern similarity in large data sets.In:Proc. of the 2002 ACM SIGMOD international conference on Management of data:ACM,2002.394-405.
    [7]J Yang, W Wang, H Wang, et al.8-clusters:capturing subspace correlation in a large data set. In:Agrawal R, Dittrich K, Ngu AH, eds. Proc. of the 18th International Conference on Data Engineering:2002.517-528.
    [8]IS Dhillon. Co-clustering documents and words using bipartite spectral graph partitioning.In:Proc. of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining:ACM,2001.269-274.
    [9]Q Sheng, Y Moreau, B De Moor. Biclustering microarray data by Gibbs sampling. Bioinformatics-Oxford,2003,19(2):196-205.
    [10]A Ben-Dor, B Chor, R Karp, et al. Discovering local structure in gene expression data:the order-preserving submatrix problem. In:Myers G. Hannenhalli S. Sankoff D, eds. Proc. of the sixth annual international conference on Computational biology. New York, NY, USA:ACM,2002.49-57.
    [11]A Tanay, R Sharan. R Shamir.Biclustering Algorithms:A Survey. Handbook of Computational Molecular Biology.Aluru S, ed. London; Chapman and Hall/CRC Press.2006:2621-2617.
    [12]A Prelic, S Bleuler, P Zimmermann, et al. A systematic comparison and evaluation of biclustering methods for gene expression data. Bioinformatics, 2006,22(9):1122-1129.
    [13]A Tanay, R Sharan, R Shamir. Discovering statistically significant biclusters in gene expression data. Bioinformatics 2002,18 Suppl 1:S136-144.
    [14]L Wang, Y Lin, X Liu. Approximation Algorithms for Biclustering Problems. SIAM Journal on Computing,2008,38(4):1504-1518.
    [15]ZY Zhang, T Li, C Ding, et al. Binary matrix factorization for analyzing gene expression data. Data Mining and Knowledge Discovery,2010,20(1):28-52.
    [16]A Figueroa, J Borneman, T Jiang. Clustering binary fingerprint vectors with missing values for DNA array data analysis. Journal of Computational biology, 2004, 11(5):887-901.
    [17]R Gelbard, O Goldman, I Spiegler. Investigating diversity of clustering methods: An empirical comparison. Data & Knowledge Engineering,2007,63(1): 155-166.
    [18]刘培强,朱大铭,谢青松,等.两元指纹向量聚类问题的复杂性与改进启发式算法.软件学报,2008,1 9(3)：500-510.
    [19]M Dawande, P Keskinocak, JM Swaminathan, et al. On Bipartite and Multipartite Clique Problems. Journal of Algorithms,2001,41(2):388-403.
    [20]R Peeters. The maximum edge biclique problem is NP-complete. Discrete Applied Mathematics,2003,131(3):651-654.
    [21]N Amit. The Bicluster Graph Editing problem.Master, Tel Aviv:Tel Aviv University,2004.
    [22]M Dawande, P Keskinocak, S Tayur. On the biclique problem in bipartite graphs.GSIA Working Paper, Pittsburgh, PA 15213, USA:Carnegie Mellon University,1996.
    [23]MH Heydari, L Morales, J C. O. Shields, et al. Computing Cross Associations for Attack Graphs and Other Applications. In:Ralph H. Sprague J, ed. Proc. of the 40th Annual Hawaii International Conference on System Sciences. Washington, DC, USA:IEEE Computer Society.2007.270b.
    [24]D Bein, L Morales, W Bein, et al. Clustering and the Biclique Partition Problem. In:Ralph H. Sprague J, ed. Proc. of the 41st Annual Hawaii International Conference on System Sciences (HICSS 2008). Big Island. Hawaii:IEEE Computer Society Press,2008.475.
    [25]TJ Schaefer. The complexity of satisfiability problems. In:Lipton RJ, Burkhard W, Savitch W, eds. Proc. of the tenth annual ACM symposium on Theory of computing. New York, NY, USA:ACM,1978.216-226.
    [26]M Garey, D Johnson. Computers and intractability. A guide to the theory of NP-completeness. WH Freeman and Company, San Francisco, Calif,1979.
    [1]AW-C Liew, N-F Law, H Yan. Missing value imputation for gene expression data:computational techniques to recover missing data from available information. Briefings in bioinformatics,2011,12(5):498-513.
    [2]K Sim, G Liu, V Gopalkrishnan, et al. A case study on financial ratios via cross-graph quasi-bicliques. Information Sciences,2011,181(1):201-216.
    [3]K Sim, V Gopalkrishnan, A Zimek, et al. A survey on enhanced subspace clustering. Data Mining and Knowledge Discovery,2013,26(2):332-397.
    [4]W-C Chang, S Vakati, R Krause, et al. Exploring biological interaction networks with tailored weighted quasi-bicliques. BMC Bioinformatics,2012,13(Suppl 10):S16.
    [5]X Liu, J Li, L Wang. Modeling Protein Interacting Groups by Quasi-Bicliques: Complexity, Algorithm, and Application. IEEE/ACM Transactions on Computational Biology and Bioinformatics,2010,7(2):354-364.

地址：北京市海淀区学院路29号邮编：100083

电话：办公室：(+86 10)66554848；文献借阅、咨询服务、科技查新：66554700