用户名: 密码: 验证码:
Fast and scalable support vector clustering for large-scale data analysis
详细信息    查看全文
  • 作者:Yuan Ping (1) (2)
    Yun Feng Chang (3)
    Yajian Zhou (2)
    Ying Jie Tian (4)
    Yi Xian Yang (2) (5)
    Zhili Zhang (1)

    1. School of Information Engineering
    ; Xuchang University ; Xuchang ; 461000 ; China
    2. Information Security Center
    ; Beijing University of Posts and Telecommunications ; Beijing ; 聽100876 ; China
    3. College of Science
    ; China Three Gorges University YiChang ; Hubei聽 ; 443002 ; China
    4. Graduate University of Chinese Academy of Sciences
    ; Beijing ; 100190 ; China
    5. School of Information Engineering
    ; Beijing Institute of Graphic Communication ; Beijing ; 102600 ; China
  • 关键词:Large ; scale problem ; Support vector clustering ; Convex decomposition ; Cluster boundary ; Cluster labeling
  • 刊名:Knowledge and Information Systems
  • 出版年:2015
  • 出版时间:May 2015
  • 年:2015
  • 卷:43
  • 期:2
  • 页码:281-310
  • 全文大小:1,487 KB
  • 参考文献:1. Alzate, C, Suykens, JAK (2010) Multiway spectral clustering with out-of-sample extensions through weighted kernel pca. IEEE Trans Patt Anal Mach Intell 32: pp. 335-347 CrossRef
    2. Alzate, C, Suykens, JAK (2011) Sparse kernel spectral clustering models for large-scale data analysis. Neurocomputing 74: pp. 1382-1390 CrossRef
    3. Ban T, Abe S (2004) Spatially chunking support vector clustering algorithm. In: Proceedings of international joint conference on neural networks, pp 414鈥?18
    4. Ben-Hur A, Horn D, Siegelmann HT, Vapnik VN (2000) A support vector cluster method. In: Proceedings of 15th international conference on pattern recognition, vol 2, Barcelona, Spain, pp 724鈥?27
    5. Ben-Hur, A, Horn, D, Siegelmann, HT, Vapnik, VN (2001) Support vector clustering. J Mach Learn Res 2: pp. 125-137
    6. Bergmann G, Hommel G (1988) Improvements of general multiple test procedures for redundant systems of hypotheses. In: Multiple Hypotheses Testing, vol 70, pp. 100鈥?15
    7. Boyd, S, Vandenberghe, L (2009) Convex Optimization. cambridge university press, Cambridge
    8. Burges, C (1998) A tutorial on support vector machines for pattern recognition. Data Min Know Dis 2: pp. 121-167 CrossRef
    9. Camastra, F, Verri, A (2005) A novel kernel method for clustering. IEEE Trans Patt Anal Mach Intell 27: pp. 801-805 CrossRef
    10. Chiang, JH, Hao, PY (2003) A new kernel-based fuzzy clustering approach: Support vector clustering with cell growing. IEEE Trans Fuzzy Syst 11: pp. 518-527 CrossRef
    11. Estivill-Castro V, Lee I (2000) Amoeba: Hierarchical clustering based on spatial proximity using delaunay diagram. In: Proceedings of the 9th international symposium on spatial data handling, pp 7a.26-7a.4
    12. Estivill-Castro V, Lee I, Murray AT (2001) Criteria on proximity graphs for boundary extraction and spatial clustering. In: Proceedings of the 5th Pacific-Asia conference on knowledge discovery and data mining (PAKDD鈥?1), vol 2035, pp 348鈥?57
    13. Forghani, Y, Yazdi, HS, Effati, S (2012) An extension to fuzzy support vector data description (fsvdd*). Patt Anal Appl 15: pp. 237-247 CrossRef
    14. Frank A, Asuncion A (2010) UCI machine learning repository. http://archive.ics.uci.edu/ml
    15. Garcia, S, Herrera, F (2008) An extension on statistical comparisons of classifiers over multiple data sets for all pairwise comparisons. J Mach Learn Res 9: pp. 2677-2694
    16. Graven M, DiPasquo D, Freitag D, McCallum A, Mitchell T, Nigam K, Slattery S (1998) Learning to extract symbolic knowledge form the world wide web. In: Proceedings of 15th Nat鈥檒 conference for artificial intelligence (AAAI鈥?8), pp 509鈥?16. Madison, Wisconsin
    17. Guo, CH, Li, F (2011) An improved algorithm for support vector clustering based on maximum entropy principle and kernel matrix. Exp Syst Appl 38: pp. 8138-8143 CrossRef
    18. Hersh WR, Buckley C, Leone TJ, Hickam DH (1994) Ohsumed: An interactive retrieval evaluation and new large test collection for research. In: Proceedings of 17th annual ACM SIGIR conference, pp 192鈥?01
    19. Hubert, PAL (1985) Comparing partitions. J Classif 2: pp. 193-218 CrossRef
    20. Hurley, J, Garcia-Palacios, E, Sezer, S (2011) Classifying network protocols: a 鈥榯wo-way鈥?flow approach. IET Commun 5: pp. 79-89 CrossRef
    21. Jung, KH, Kim, N, Lee, J (2011) Dynamic pattern denoising method using multi-basin system with kernels. Patt Recogn 44: pp. 1698-1707 CrossRef
    22. Jung, KH, Lee, D, Lee, J (2010) Fast support-based clustering method for large-scale problems. Patt Recogn 43: pp. 1975-1983 CrossRef
    23. Kim, HC, Lee, J (2007) Clustering based on gaussian processes. Neural Comput 19: pp. 3088-3107 CrossRef
    24. Lang K (1995) Newsweeder: Learning to filter netnews. In: Proceedings 12th international conference machine learning (ICML鈥?5), pp 331鈥?39. Tahoe City.
    25. Lee, CH, Yang, HC (2009) Construction of supervised and unsupervised learning systems for multilingual text categorization. Exp Syst Appl 2鈥?: pp. 2400-2410 CrossRef
    26. Lee, D, Jung, KH, Lee, J (2009) Constructing sparse kernel machines using attractors. IEEE Trans Pat Anal Mach Intell 20: pp. 721-729
    27. Lee, D, Lee, J (2007) Equilibrium-based support vector machine for semisupervised classification. IEEE Trans Neural Netw 18: pp. 578-583 CrossRef
    28. Lee, J, Lee, D (2005) An improved cluster labeling method for support vector clustering. IEEE Trans Patt Anal Mach Intell 27: pp. 461-464 CrossRef
    29. Lee, J, Lee, D (2006) Dynamic characterization of cluster structures for robust and inductive support vector clustering. IEEE Trans Patt Anal Mach Intell 28: pp. 1869-1874 CrossRef
    30. Lee SH, Daniels KM (2005) Gaussian kernel width selection and fast cluster labeling for support vector clustering. Technical Report No. 2005鈥?09, USA
    31. Lee SH, Daniels KM (2005) Cone cluster labeling for support vector clustering. In: Proceedings of 6th SIAM conference on data mining, pp 484鈥?88
    32. Lewis DD (1997) Reuters-21578 text categorization collection. http://kdd.ics.uci.edu/databases/reuters21578/
    33. Li, YH (2011) Selecting training points for one-class support vector machines. Patt Recogn Lett 32: pp. 1517-1522 CrossRef
    34. Li, YH, Maguire, L (2011) Selecting critical patterns based on local geometrical and statistical information. IEEE Trans Patt Anal Mach Intell 33: pp. 1189-1201 CrossRef
    35. Ling, P, Zhou, CG, Zhou, X (2010) Improved support vector clustering. Eng Appl Artif Intell 23: pp. 552-559 CrossRef
    36. Mehrotra S (1992) On the implementation of a primal-dual interior point method. SIAM J Optim 2: 575鈥?01
    37. Peng, JF, Zhou, YJ, Wang, C, Yang, YX, Ping, Y (2011) Early TCP traffic classification. J Appl Sci Electron Inform Eng 29: pp. 73-77
    38. Ping, Y, Tian, YJ, Zhou, YJ, Yang, YX (2012) Convex decomposition based cluster labeling method for support vector clustering. J Comput Sci Technol 27: pp. 428-442 CrossRef
    39. Ping Y, Zhou YJ, Xue C, Yang YX (2012) Efficient representation of text with multiple perspectives. J China Uni Posts Telecommun 19(1):101鈥?11
    40. Ping Y, Zhou YJ, Yang YX (2011) A novel scheme for accelerating support vector clustering (in press)
    41. Platt JC (1999) Fast training of support vector machines using sequential minimal optimization. In: Advances in kernel methods: support vector learning. MIT Press, Cambridge
    42. Puma-Villanueva WJ, Bezerra GB, Lima CAM, Zuben FJV (2005) Improving support vector clustering with ensembles. In: Proceedings of international joint conference on neural networks, Montreal, pp 13鈥?5
    43. Scholkopf, B, Platt, JC, Shawe-Taylor, J, Smola, AJ, Williamson, RC (2001) Estimating the support of a high-dimensional distribution. Neural Comput 13: pp. 1443-1472 CrossRef
    44. Shamir O, Tishby N (2010) Stability and model selection in k-means clustering. Mach Learn 81(1): 213鈥?43
    45. Sheskin, D (2003) Handbook of parametric and nonparametric statistical procedures. Chapman & Hall, London CrossRef
    46. Su, MY (2011) Using clustering to improve the KNN-based classifiers for online anomaly network traffic identification. J Netw Comput Appl 34: pp. 722-730 CrossRef
    47. Tax, DMJ, Duin, PRW (1999) Support vector domain description. Patt Recogn Lett 11鈥?3: pp. 1191-1199 CrossRef
    48. UNIBS: The UNIBS anonymized 2009 internet traces (Mar 18, 2010)
    49. Veenman, CJ, Reinders, MJT, Backer, E (2002) A maximum variance cluster algorithm. IEEE Trans Patt Anal Mach Intell 24: pp. 1273-1280 CrossRef
    50. Wang CD, Lai JH (2013) Position regularized support vector domain description. Patt Recogn 46(3): 875鈥?84
    51. Wang CD, Lai JH, Huang D, Zheng WS (2012) Svstream: A support vector based algorithm for clustering data streams. IEEE Trans Know Data Eng (in press), 1鈥?4 . doi: 10.1109/TKDE.2011.263
    52. Wang, DF, Shi, L, Yeung, DS, Tsang, ECC, Heng, PA (2007) Ellipsoidal support vector clustering for functional MRI analysis. Patt Recogn 40: pp. 2685-2695 CrossRef
    53. Wang JS, Chiang JC (2009) An efficient data preprocessing procedure for support vector clustering. J Univ Comput Sci 15(4):705鈥?21
    54. Wu M, Scholkopf B (2007) A local learning approach for clustering. In: Advances in neural information processing systems (NIPS 2007), vol 19, pp 1529鈥?536. Vancouver, Canada (2007)
    55. Xu, R, Wunsch, DC (2008) Clustering. Wiley, Hoboken CrossRef
    56. Yang JH, Estivill-Castro V, Chalup SK (2002) Support vector clustering through proximity graph modelling. In: Proceedings of 9th international conference on neural information processing (ICONIP鈥?2), pp 898鈥?03. Orchid Country Club, Singapore
  • 刊物类别:Computer Science
  • 刊物主题:Information Systems and Communication Service
    Business Information Systems
  • 出版者:Springer London
  • ISSN:0219-3116
文摘
As an important boundary-based clustering algorithm, support vector clustering (SVC) benefits multiple applications for its capability of handling arbitrary cluster shapes. However, its popularity is degraded by both its highly intensive pricey computation and poor label performance which are due to redundant kernel function matrix required by estimating a support function and ineffectively checking segmers between all pairs of data points, respectively. To address these two problems, a fast and scalable SVC (FSSVC) method is proposed in this paper to achieve significant improvement on efficiency while guarantees a comparable accuracy with the state-of-the-art methods. The heart of our approach includes (1) constructing the hypersphere and support function by cluster boundaries which prunes unnecessary computation and storage of kernel functions and (2) presenting an adaptive labeling strategy which decomposes clusters into convex hulls and then employs a convex-decomposition-based cluster labeling algorithm or cone cluster labeling algorithm on the basis of whether the radius of the hypersphere is greater than 1. Both theoretical analysis and experimental results (e.g., the first rank of a nonparametric statistical test) show the superiority of our method over the others, especially for large-scale data analysis under limited memory requirements.

© 2004-2018 中国地质图书馆版权所有 京ICP备05064691号 京公网安备11010802017129号

地址:北京市海淀区学院路29号 邮编:100083

电话:办公室:(+86 10)66554848;文献借阅、咨询服务、科技查新:66554700