A STUDY ON TEXT CLASSIFICATION TECHNIQUES AND APPROACHES
Keywords:
Data Mining, Clustering, Classification, Similarity Measure, Feature Selection
Abstract
The world around us changes rapidly each day, driven mainly by the growth of technologies such as the Internet, the World Wide Web, and mobile communications. These technologies gave rise to the data-driven era in which we live today. Data refers to knowledge or information represented in different formats, and data mining extracts from it the information useful to various application domains. The input dataset for mining is collected from the application domain; it is typically huge and may contain noisy and irrelevant data. To remove the irrelevant data, the dataset is subjected to cleaning and pre-processing, which results in a complete dataset for information extraction. The dataset is then reduced and the useful features are identified. The mining process is carried out on this transformed data, and the information it yields is then presented in an understandable form suited to the application domain. Data mining is a complex and innovative field of research. This study focuses on the techniques and methods adopted for each phase of the clustering task. A comparative study of similar methods is also carried out, through which the efficient methods are identified.
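The phases outlined in the abstract (cleaning and pre-processing, reduction to useful features, and a similarity computation that a clustering step could build on) can be illustrated with a minimal Python sketch. It is not the method evaluated in this study: the stop-word list, the document-frequency threshold `min_df`, the sample documents, and all function names are assumptions chosen only for illustration, and raw term counts stand in for whatever term weighting a real system would use.

```python
import math
import re
from collections import Counter

# Hypothetical stop-word list, chosen only for this illustration.
STOP_WORDS = {"the", "a", "an", "is", "of", "and", "to", "in"}


def preprocess(text):
    """Cleaning phase: lowercase, keep alphabetic tokens, drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def select_features(docs, min_df=2):
    """Reduction phase: keep terms found in at least min_df documents."""
    doc_freq = Counter(term for doc in docs for term in set(doc))
    return sorted(term for term, n in doc_freq.items() if n >= min_df)


def vectorize(doc, vocabulary):
    """Represent a document as raw term counts over the selected vocabulary."""
    counts = Counter(doc)
    return [counts[term] for term in vocabulary]


def cosine_similarity(u, v):
    """Cosine of the angle between two count vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Assumed sample documents, not data from the study.
raw_docs = [
    "Data mining extracts useful information from the data.",
    "Text mining is the mining of information in text data!",
    "The weather is sunny and warm.",
]

docs = [preprocess(d) for d in raw_docs]
vocabulary = select_features(docs)
vectors = [vectorize(d, vocabulary) for d in docs]

print(vocabulary)
print(cosine_similarity(vectors[0], vectors[1]))
print(cosine_similarity(vectors[0], vectors[2]))
```

In this sketch the two mining-related documents share every selected term and so score a high similarity, while the unrelated third document reduces to a zero vector and scores 0.0; a clustering algorithm would use such pairwise scores to group documents.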