SciELO - Scientific Electronic Library Online

 

Revista Tecnología en Marcha

On-line version ISSN 0379-3982
Print version ISSN 0379-3982

Abstract

CALVO-VALVERDE, Luis Alexander  and  MENA-ARIAS, José Andrés. Evaluation of different text representation techniques and distance metrics using KNN for documents classification. Tecnología en Marcha [online]. 2020, vol.33, n.1, pp.64-79. ISSN 0379-3982.  http://dx.doi.org/10.18845/tm.v33i1.5022.

Text data is now a fundamental part of databases around the world, and one of the biggest challenges has been extracting meaningful information from large sets of text. The literature on text classification is extensive; over the last 25 years, statistical methods (in which similarity functions are applied to vectors of words) have achieved good results in many areas of text mining. Several models, such as topic modelling, have also been proposed to achieve dimensionality reduction and to incorporate semantic information. In this paper we evaluate different text representation techniques, including the traditional bag of words and topic modelling. The evaluation tests different combinations of text representations and text distance metrics (Cosine, Jaccard and Kullback-Leibler divergence) using K-Nearest Neighbors (KNN), in order to determine how effective topic modelling representations are for dimensionality reduction when classifying text. The results show that the simplest version of bag of words, combined with the Jaccard similarity, outperformed the other combinations in most cases. A statistical test showed that the accuracy values obtained with supervised Latent Dirichlet Allocation (LDA) representations, combined with the relative entropy (Kullback-Leibler divergence) metric, were not significantly different from those obtained with traditional text classification techniques. LDA managed to abstract thousands of words into fewer than 60 topics for the main set of experiments. Additional experiments suggest that topic modelling can perform better on short text documents, or when the number of topics (dimensions) is increased at the time the model is generated.
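The kind of combination evaluated above (a text representation, a distance metric, and a KNN vote) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the toy corpus, the function names, the whitespace tokeniser and the smoothing constant `eps` are all assumptions made for illustration. The LDA step is omitted, so the Kullback-Leibler divergence is applied here to smoothed term-frequency distributions rather than to topic distributions.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Term counts from a lowercased, whitespace-tokenised text (simplest bag of words)."""
    return Counter(text.lower().split())


def cosine_distance(a, b):
    """1 - cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def jaccard_distance(a, b):
    """1 - |A intersect B| / |A union B| over the sets of distinct terms."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 0.0
    return 1.0 - len(set_a & set_b) / len(union)


def kl_divergence(a, b, eps=1e-9):
    """KL(P || Q) over the joint vocabulary; eps is an assumed additive smoothing term."""
    vocab = set(a) | set(b)
    if not vocab:
        return 0.0
    total_a = sum(a.values()) + eps * len(vocab)
    total_b = sum(b.values()) + eps * len(vocab)
    result = 0.0
    for term in vocab:
        p = (a[term] + eps) / total_a
        q = (b[term] + eps) / total_b
        result += p * math.log(p / q)
    return result


def knn_predict(query, train, distance, k=3):
    """Label a query text by majority vote among its k nearest training documents."""
    q = bag_of_words(query)
    neighbours = sorted(train, key=lambda item: distance(q, bag_of_words(item[0])))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Hypothetical toy corpus, for illustration only.
train = [
    ("the cat sat on the mat", "animals"),
    ("dogs and cats are pets", "animals"),
    ("the dog chased the cat", "animals"),
    ("stocks fell on the market today", "finance"),
    ("the bank raised interest rates", "finance"),
    ("investors sold stocks and bonds", "finance"),
]

for name, metric in [("cosine", cosine_distance),
                     ("jaccard", jaccard_distance),
                     ("kl", kl_divergence)]:
    print(name, knn_predict("the cat and the dog", train, metric, k=3))
```

Passing the metric as a function keeps the classifier unchanged while the representation and distance vary, which mirrors the experimental design described in the abstract. Note that the Kullback-Leibler divergence is asymmetric, so the order of the arguments matters.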

Keywords: Text similarity; text classification; KNN; topic modeling.
