Modifying Jaccard Coefficient for Texts Similarity
Abstract
Calculating the similarity between texts written in any language remains one of the most important challenges in natural language processing. This paper presents a modified Jaccard similarity coefficient for texts. The main aim of the modification is to count the number of similar sentences between texts, instead of counting the number of similar words between them as in previous works. The modification is implemented as an equation that combines the Jaccard coefficient with the similarity coefficient. Two criteria are employed in the proposed equation: the first is multiplied by the Jaccard coefficient and the second is multiplied by the similarity coefficient. The purpose of these criteria is to keep the similarity degree between 0 and 1. The experimental results are consistent with expectations: the similarity degree of the proposed equation was approximately 3% higher than the Jaccard coefficient when the texts were chosen from the same class, and lower than the Jaccard coefficient when the texts were chosen from different classes.
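To make the word-level versus sentence-level distinction concrete, the following is a minimal Python sketch, not the equation proposed in the paper, which the abstract does not state. It computes the standard Jaccard coefficient |A ∩ B| / |A ∪ B| once over word sets and once over sentence sets. The helper names (`split_sentences`, `split_words`, `combined_similarity`), the naive punctuation-based sentence splitter, and the assumption that the two criteria act as non-negative weights summing to 1 are all hypothetical choices made for illustration. The second score passed to `combined_similarity` stands in for the "similarity coefficient", whose definition is not given here.

```python
import re


def split_sentences(text):
    """Naive splitter (assumption): break on ., ! or ?, then normalize case and spaces."""
    parts = re.split(r"[.!?]+", text)
    return {" ".join(p.lower().split()) for p in parts if p.strip()}


def split_words(text):
    """Lower-cased set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))


def jaccard(a, b):
    """Standard Jaccard coefficient |A ∩ B| / |A ∪ B| for two sets."""
    if not a and not b:
        return 0.0  # convention chosen here for two empty sets
    return len(a & b) / len(a | b)


def combined_similarity(jaccard_score, other_score, w1=0.5, w2=0.5):
    """Hypothetical weighted combination of two scores in [0, 1].

    Assumes the two criteria are non-negative weights that sum to 1,
    which keeps the result in [0, 1]. This is an illustrative guess,
    not the equation from the paper.
    """
    if w1 < 0 or w2 < 0 or abs(w1 + w2 - 1.0) > 1e-9:
        raise ValueError("weights must be non-negative and sum to 1")
    return w1 * jaccard_score + w2 * other_score


text_a = "The cat sat. The dog ran. Birds fly."
text_b = "The cat sat. The dog barked. Birds fly."

# Word level: 6 shared words out of 8 distinct words.
word_score = jaccard(split_words(text_a), split_words(text_b))  # 0.75

# Sentence level: 2 shared sentences out of 4 distinct sentences.
sentence_score = jaccard(split_sentences(text_a), split_sentences(text_b))  # 0.5
```

In this toy pair a single changed word lowers the sentence-level score more than the word-level score, because sentences are matched exactly. A less strict sentence-matching rule would behave differently.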