CRFs based parallel biomedical named entity recognition algorithm employing MapReduce framework
详细信息    查看全文
  • 作者:Zhuo Tang ; Lingang Jiang ; Li Yang ; Kenli Li ; Keqin Li
  • 关键词:Biomedical big data ; Conditional random fields ; MapReduce ; Named entity recognition ; Parallel algorithm
  • 刊名:Cluster Computing
  • 出版年:2015
  • 出版时间:June 2015
  • 年:2015
  • 卷:18
  • 期:2
  • 页码:493-505
  • 全文大小:1,157 KB
  • 参考文献:1.Wikipedia, Text mining [EB/OL]. http://鈥媏n.鈥媤ikipedia.鈥媜rg/鈥媤iki/鈥婽ext_鈥媘ining . 24 Oct 2013
    2.Wikipedia, Named-entity recognition [EB/OL]. http://鈥媏n.鈥媤ikipedia.鈥媜rg/鈥媤iki/鈥婲amed_鈥媏ntity_鈥媟ecognition . 22 Aug 2013
    3.Wikipedia, MEDLINE [EB/OL]. http://鈥媏n.鈥媤ikipedia.鈥媜rg/鈥媤iki/鈥婱EDLINE . 14 Sep 2013
    4.Lin, J., Dyer, C.: Data-Intensive Text Processing with MapReduce. Morgan and Claypool Publishers, San Francisco (2010). doi:10.鈥?200/鈥婼00274ED1V01Y201鈥?06HLT007
    5.Shen, L., Shen, H., Cheng, L.: New algorithms for efficient mining of association rules. In: The Seventh Symposium on the Frontiers of Massively Parallel Computation, pp. 234鈥?41 (1999)
    6.Li, L., Zhou, R., Huang, D.: Two-phase biomedical named entity recognition using CRFs. Comput. Biol. Chem. 33(4), 334鈥?38 (2009)View Article
    7.Finkel, J., Dingare, S., Nguyen, H.: Exploiting context for biomedical entity recognition: from syntax to the web. In: Proceedings of the International Joint Workshop on Natural Language Processing in Biomedicine and Its Applications (JNLPBA), pp. 88鈥?1 (2004)
    8.Wang, H., Zhao, T., Li, S., Yu, H.: A conditional random fields approach to biomedical named entity recognition. J. Electron. 6(24), 838鈥?44 (2007)
    9.Settles, B.: Biomedical named entity recognition using conditional random fields and rich feature sets. In: Proceedings of the International Joint Workshop on Natural Language Processing in Biomedicine and its Applications (JNLPBA), pp. 104鈥?07 (2004)
    10.Li, L., Fan, W., Huang, D.: A two-phase bio-NER system based on integrated classifiers and multi-agent strategy. IEEE/ACM Trans. Comput. Biol. Bioinform. (2013). doi:10.鈥?109/鈥婽CBB.鈥?013.鈥?06
    11.Yang, L., Zhou, Y.: Exploring feature sets for two-phase biomedical named entity recognition using semi-CRFs. Knowl. Inf. Syst. (2013). doi:10.鈥?007/鈥媠10115-013-0637-7
    12.Lee, K.-J., Hwang, Y.-S., Rim, H.-C.: Two-phase biomedical NE recognition based on SVMs. In: Proceedings of the ACL Workshop on Natural Language Processing in Biomedicine (BioMed), pp. 33鈥?0 (2003)
    13.Kim, S., Yoon, J., Park, K.-M., Rim, H.-C.: Two-phase biomedical named entity recognition using a hybrid method. In: Proceedings of the 2nd International Joint Conference (IJCNLP), pp. 646鈥?57 (2005)
    14.Kim, S., Yoon, J.: Experimental study on a two phase method for biomedical named entity recognition. IEICE Trans. Inf. Syst. 7(E90鈥揇), 1103鈥?110 (2007)View Article
    15.Li, Lishuang, Zhou, Rongpeng, Huang, Degen: Two-phase biomedical named entity recognition using CRFs. Comput. Biol. Chem. 33, 334鈥?38 (2009)View Article
    16.Wang, L., Ke, L., Liu, P., Ranjan, R., Chen, L.: IK-SVD: dictionary learning for spatial big data via incremental atom update. Comput. Sci. Eng. 16(4), 41鈥?2 (2014)View Article
    17.Wang, L., von Laszewski, G., Younge, A.J., He, X., Kunze, M., Tao, J.: Cloud computing: a perspective study. New Gener. Comput. 28(2), 137鈥?46 (2010)View Article MATH
    18.Wittek, P., Dar谩nyi, S.: Accelerating text mining workloads in a MapReduce-based distributed GPU environment. J. Parallel Distrib. Comput. 2(73), 98鈥?06 (2013)
    19.Wang, L., Tao, J., Marten, H., Streit, A., Khan, S.U., Kolodziej, J., Chen, D.: MapReduce across distributed clusters for data-intensive applications. In: The 26th IEEE International Parallel & Distributed Processing Symposium (IPDPS) Workshops 2012: 2004鈥?011
    20.Laclavik, M., Seleng, M., Hluchy, L.: Towards large scale semantic annotation built on MapReduce architecture. Lecture Notes in Computer Science 3(5103), 331鈥?38 (2008)
    21.Wang, L., Tao, J., Ranjan, R., Marten, H., Streit, A., Chen, J., Chen, D.: G-Hadoop: MapReduce across distributed data centers for data-intensive computing. Future Gener. Comput. Syst. 29(3), 739鈥?50 (2013)View Article
    22.Whitney, M., Clifton, A., Sarkar, A., Fedorova, A.: Making the most of a distributed perceptron for NLP. In: Pacific Northwest Regional NLP Workshop, Redmond, Washington, USA (2012)
    23.Lafferty, J., McCallum, A., Pereira, F.: Conditional random fields: probabilistic models for segmenting and labeling sequence data. In: 27th Proceedings of the International Conference on Machine Learning (ICML), pp. 282鈥?89 (2010)
    24.Atkinson, J., Bull, V.: A multi-strategy approach to biological named entity recognition. Expert Syst. Appl. 39(17), 12968鈥?2974 (2012)View Article
    25.Forney, G.D. Jr.: The viterbi algorithm. In: Proceedings of the IEEE, vol. 3(61), pp. 268鈥?78. Codex Corporation. Newton, MA (2005)
    26.Vijay Sundar Ram, R., Akilandeswari, A., Lalitha Devi, S.: Linguistic features for named entity recognition using CRFs. In: International Conference on Asian Language Processing (IALP), pp. 158鈥?61 (2010)
    27.Langford, J.: Parallel machine learning on big data, XRDS: crossroads. ACM Mag. Stud. 1(19), 60鈥?2 (2012)
    28.Meraji, S., Tropper, C.: A machine learning approach for optimizing parallel logic simulation. In: 39th International Conference on Parallel Processing (ICPP), pp. 545鈥?54 (2010)
    29.Livieris, I.E., Apostolopoulou, M.S., Sotiropoulos, D.G., Sioutas, S., Pintelas, P.: Classification of large biomedical data using ANNs based on BFGS method. In: 13th Panhellenic Conference on Informatics (PCI), pp. 87鈥?1 (2009)
    30.Munkhdalai, T., Li, M., Kim, T., Namsrai, O.-E., Jeong, S.-p., Shin, J., Ryu, K.H.: Bio named entity recognition based on co-training algorithm. In: 26th International Conference on Advanced Information Networking and Applications Workshops (WAINA), pp. 857鈥?62 (2012)
    31.Zhang, J., Shen, D., Zhou, G., Tan, C.-L.: Enhancing HMM-based biomedical named entity recognition by studying special phenomena. J. Biomed. Inform. 6(37), 411鈥?22 (2004)View Article
    32.Mathur, A., Chakrabarti, S.: Accelerating newton optimization for log-linear models through feature redundancy. In: 6th International Conference on Data Mining, pp. 404鈥?13 (2006)
    33.Nocedal, J.: Updating quasi-Newton matrices with limited storage. Math. Comput. 35, 773鈥?82 (1980)View Article MATH MathSciNet
    34.Liu, D.C., Nocedal, J.: On the limited memory BFGS method for large scale optimization. J. Math. Program. B 3(45), 503鈥?28 (1989)View Article MathSciNet
    35.Wang, L., Chen, D., Ranjan, R., Khan, S.U., Kolodziej, J., Wang, J.: Parallel processing of massive EEG data with MapReduce. In: The 18th IEEE International Conference on Parallel and Distributed Systems (ICPADS), pp. 164鈥?71 (2012)
    36.Guodong, Z., Jian, S.: Exploring deep knowledge resources in biomedical name recognition. In: Proceedings of the International Joint Workshop on Natural Language Processing in Biomedicine and Its Applications (JNLPBA), pp. 96鈥?9 (2004)
    37.Okanohara, D., Miyao, Y., Tsuruoka, Y., Tsujii, J.: Improving the scalability of semi-Markov conditional random fields for named entity recognition. In: Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the ACL, pp. 465鈥?72 (2006)
    38.Zhao, Jiaqi, Wang, Lizhe, Tao, Jie, Chen, Jinjun, Sun, Weiye, Ranjan, Rajiv, Kolodziej, Joanna, Streit, Achim, Georgakopoulos, Dimitrios: A security framework in G-Hadoop for big data computing across distributed cloud data centres. J. Comput. Syst. Sci. 80(5), 994鈥?007 (2014)View Article MATH MathSciNet
    39.Xie, J., Yin, S., Ruan, X., Ding, Z., Tian, Y., Majors, J., Qin, X.: Improving MapReduce performance through data placement in heterogeneous Hadoop clusters. In: IEEE International Symposium on Parallel and Distributed Processing, Workshops and Phd Forum (IPDPSW), pp. 1鈥? (2010)
  • 作者单位:Zhuo Tang (1)
    Lingang Jiang (1)
    Li Yang (2)
    Kenli Li (1)
    Keqin Li (3)

    1. College of Information Science and Engineering, Hunan University, Changsha, 410082, China
    2. College of Computer and Communication Engineering, Changsha University of Science and Technology, Changsha, 410004, China
    3. Department of Computer Science, State University of New York, New Paltz, NY, 12561, USA
  • 刊物类别:Computer Science
  • 刊物主题:Processor Architectures
    Operating Systems
    Computer Communication Networks
  • 出版者:Springer Netherlands
  • ISSN:1573-7543
文摘
As the rapid growth of the biomedical literature, the model training time in biomedical named entity recognition increases sharply when dealing with large-scale training samples. How to increase the efficiency of named entity recognition in biomedical big data becomes one of the key problems in biomedical text mining. For the purposes of improving the recognition performance and reducing the training time, this paper proposes an optimization method for two-phase recognition using conditional random fields. In the first stage, each named entity boundary is detected to distinguish all real entities. In the second stage, we label the semantic class of the entity detected. To expedite the training speed, in these two phases, we implement the model training process on a parallel optimization program framework based on MapReduce. Through dividing the training set into several parts, the iterations in the training algorithm are designed as map tasks which can be executed simultaneously in a cluster, where each map function is designed to complete the calculation of a gradient vector component for each part in the training set. Our experiments show that the proposed method in this paper can achieve high performance with short training time, which has important implications for the current biological big data processing.

© 2004-2018 中国地质图书馆版权所有 京ICP备05064691号 京公网安备11010802017129号

地址:北京市海淀区学院路29号 邮编:100083

电话:办公室:(+86 10)66554848;文献借阅、咨询服务、科技查新:66554700