Text mining in the National Transparency Survey 2019

González-Évora, Felipe; Centeno-Mora, Óscar

doi:10.15517/rmta.v29i2.46379

Revista de Matemática Teoría y Aplicaciones

Print version ISSN 1409-2433

Rev. Mat vol.29 n.2 San José Jul./Dec. 2022

http://dx.doi.org/10.15517/rmta.v29i2.46379

Article

Text mining in the National Transparency Survey 2019

(Original title: Minería de texto en la Encuesta Nacional de Transparencia 2019)

Felipe González-Évora¹

Óscar Centeno-Mora²

¹Universidad de Costa Rica, Escuela de Estadística, San José, Costa Rica; juan.gonzalezevora@ucr.ac.cr

²Universidad de Costa Rica, Escuela de Tecnologías en Salud, San José, Costa Rica; oscar.centenomora@ucr.ac.cr

Abstract

Coding and analyzing open-ended questions from opinion surveys is often laborious. Text mining offers an alternative for this kind of problem. The data come from the open-ended questions of the 2019 National Survey of Perception on Transparency (Encuesta Nacional de Percepción sobre la Transparencia 2019). Text mining is applied from both a descriptive and a predictive approach; the latter is of primary interest, since it performs the automatic coding of responses into categories through supervised machine learning. The algorithms used are support vector machines, the naive Bayes classifier, random forests, XGBoost, and nearest neighbors. The descriptive analysis provides descriptions, visualizations, and relationships for the open-ended questions. The predictive analysis shows that the algorithms most often selected for the open-ended questions were the naive Bayes classifier and random forests, with accuracies between 48% and 76%. The results were similar to those obtained with the manually coded categories. Overall, the results of the comprehensive analysis of the 12 survey questions are satisfactory.

Keywords: opinion surveys; open-ended questions; text mining; supervised machine learning.

Mathematics Subject Classification: 68T45, 68T50.
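
To give a rough idea of what automatic coding of open-ended answers involves, the sketch below trains a multinomial naive Bayes classifier (one of the algorithms named in the abstract) using only the Python standard library. The example responses, the category labels, and the function names `tokenize`, `train_naive_bayes`, and `predict` are all hypothetical; they are not taken from the survey or from the authors' implementation, which is described in the full PDF.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-záéíóúñü]+", text.lower())


def train_naive_bayes(responses, labels):
    """Count words per category and how often each category occurs."""
    word_counts = defaultdict(Counter)
    class_counts = Counter(labels)
    vocabulary = set()
    for text, label in zip(responses, labels):
        tokens = tokenize(text)
        word_counts[label].update(tokens)
        vocabulary.update(tokens)
    return word_counts, class_counts, vocabulary


def predict(model, text):
    """Return the category with the highest log-posterior (Laplace smoothing)."""
    word_counts, class_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)  # log prior of the category
        total_words = sum(word_counts[label].values())
        for token in tokenize(text):
            if token not in vocabulary:
                continue  # ignore words never seen during training
            count = word_counts[label][token]
            score += math.log((count + 1) / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical, manually coded training responses (invented for illustration).
responses = [
    "officials hide information from citizens",
    "there is too much corruption in institutions",
    "public information is easy to find online",
    "institutions publish their budgets openly",
]
labels = ["negative", "negative", "positive", "positive"]

model = train_naive_bayes(responses, labels)
print(predict(model, "institutions publish information online"))
```

In practice the manually coded responses would be split into training and test sets, and the accuracy on the test set would be compared across candidate algorithms for each question, which is the kind of comparison the abstract summarizes.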

See the full content in the PDF.

Acknowledgments and funding

We thank the Contraloría General de la República for providing the information that served as the input for the analysis presented above. We also thank the Universidad de Costa Rica for sponsoring and disseminating this research.

Received: April 20, 2021; Revised: November 18, 2021; Accepted: May 30, 2022

This is an open-access article published under a Creative Commons license.