Robust semantic text similarity using LSA, machine learning, and linguistic resources

Authors: Abhay Kashyap; Lushan Han; Roberto Yus…
Keywords: Latent semantic analysis; WordNet; Term alignment; Semantic similarity
Journal: Language Resources and Evaluation
Publication year: 2016
Publication date: March 2016
Volume: 50
Issue: 1
Pages: 125-161
Full-text size: 1,351 KB
Author affiliations: Abhay Kashyap (1)
Lushan Han (1)
Roberto Yus (2)
Jennifer Sleeman (1)
Taneeya Satyapanich (1)
Sunil Gandhi (1)
Tim Finin (1)

1. University of Maryland, Baltimore County, MD, USA
2. University of Zaragoza, Zaragoza, Spain
Journal category: Humanities, Social Sciences and Law
Journal subjects: Linguistics
Computational Linguistics
Computer Science, general
Languages and Literature
Publisher: Springer Netherlands
ISSN: 1574-0218

Abstract

Semantic textual similarity is a measure of the degree of semantic equivalence between two pieces of text. We describe the SemSim system and its performance in the *SEM 2013 and SemEval-2014 tasks on semantic textual similarity. At the core of our system lies a robust distributional word similarity component that combines latent semantic analysis and machine learning, augmented with data from several linguistic resources. We used a simple term alignment algorithm to handle longer pieces of text. Additional wrappers and resources were used to handle task-specific challenges, which include processing Spanish text, comparing text sequences of different lengths, handling informal words and phrases, and matching words with sense definitions. In the *SEM 2013 task on Semantic Textual Similarity, our best-performing system ranked first among the 89 submitted runs. In the SemEval-2014 task on Multilingual Semantic Textual Similarity, we ranked a close second in both the English and Spanish subtasks. In the SemEval-2014 task on Cross-Level Semantic Similarity, we ranked first in the Sentence–Phrase, Phrase–Word, and Word–Sense subtasks and second in the Paragraph–Sentence subtask.
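
The abstract does not give the details of the term alignment step, so the following is only a rough sketch of the general idea: each term in one text is aligned to its most similar term in the other, and the alignment scores are averaged in both directions. The names toy_word_sim, directional_score, and align_similarity are hypothetical, and the tiny lookup table stands in for the LSA-based word similarity component; none of this is the SemSim implementation itself.

```python
from typing import Callable, List

WordSim = Callable[[str, str], float]


def toy_word_sim(a: str, b: str) -> float:
    """Placeholder word similarity (assumption): 1.0 for identical words,
    otherwise a value from a tiny hand-made table, else 0.0.
    A real system would query a distributional model here."""
    if a == b:
        return 1.0
    table = {frozenset(("car", "automobile")): 0.9}
    return table.get(frozenset((a, b)), 0.0)


def directional_score(src: List[str], tgt: List[str], sim: WordSim) -> float:
    """Align every source term to its best-matching target term and
    return the mean of those best-match similarities."""
    if not src or not tgt:
        return 0.0
    return sum(max(sim(s, t) for t in tgt) for s in src) / len(src)


def align_similarity(text1: str, text2: str, sim: WordSim = toy_word_sim) -> float:
    """Symmetric text similarity: average of the two directional scores.
    Tokenization is plain lowercased whitespace splitting (assumption)."""
    t1 = text1.lower().split()
    t2 = text2.lower().split()
    return 0.5 * (directional_score(t1, t2, sim) + directional_score(t2, t1, sim))


if __name__ == "__main__":
    print(align_similarity("The car stopped", "The automobile stopped"))
```

Averaging both directions keeps the score symmetric, so a short text that is fully covered by a much longer one does not automatically score as a perfect match.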
