Using Topological Data Analysis for Text Classification

Doshi, Pratik

Doshi, P. (2018). Using Topological Data Analysis for Text Classification. UNC Charlotte Electronic Theses and Dissertations.

Abstract

I show that by applying discourse features derived through topological data analysis (TDA), namely homological persistence, we can improve classification results on the task of movie genre detection, including identification of overlapping movie genres. On the IMDB dataset we improve on prior art: we increase the Jaccard score by 4.7% over the recent results of [1]. I also significantly improve the F-score (by over 15%) and slightly improve the hit rate (by 0.5%, ibid.). The limitations of my work, mostly due to the smaller data set, are also discussed at the end. I see my contribution as threefold: (a) for the general audience of computational linguists, I want to increase their awareness of topology as a possible source of semantic features; (b) for researchers using machine learning for NLP tasks, I want to propose the use of topological features when the number of training examples is small; and (c) for those already aware of the existence of computational topology, I see this work as contributing to the discussion about the value of topology for NLP, in view of mixed results reported by others.
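
For readers unfamiliar with the Jaccard score cited above: in multi-label tasks such as overlapping genre detection, it is commonly computed per movie as the size of the intersection of the true and predicted genre sets divided by the size of their union, then averaged over all movies. The Python sketch below illustrates that common definition only. This record does not state the thesis's exact evaluation procedure, so the sample-averaging, the function names, and the toy genre labels are all assumptions made for illustration.

```python
def jaccard(true_genres, predicted_genres):
    """Jaccard similarity between two genre sets: |A & B| / |A | B|."""
    true_set, pred_set = set(true_genres), set(predicted_genres)
    union = true_set | pred_set
    if not union:
        # Assumption: two empty label sets count as a perfect match.
        return 1.0
    return len(true_set & pred_set) / len(union)


def mean_jaccard(y_true, y_pred):
    """Average the per-movie Jaccard score over a collection of movies."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one example is required")
    scores = [jaccard(t, p) for t, p in zip(y_true, y_pred)]
    return sum(scores) / len(scores)


# Hypothetical toy labels, not data from the thesis.
y_true = [{"Drama", "Romance"}, {"Action"}, {"Comedy", "Drama"}]
y_pred = [{"Drama"}, {"Action", "Thriller"}, {"Comedy", "Drama"}]

print(mean_jaccard(y_true, y_pred))
```

In this toy example the first two movies each share one genre out of two in the union, and the third is predicted exactly, so partial overlap is rewarded rather than scored as a miss.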

Details

Author: Doshi, Pratik
Title: Using Topological Data Analysis for Text Classification
Physical Description: 1 online resource (44 pages) : PDF
Date: 2018
Degree Granting Institution: University of North Carolina at Charlotte
Genre: masters theses
Subjects--Topics: Computer science
Degree: M.S.
Keywords: Barcodes; Movie Genre Classification; Persistent Homology; TDA; Text Classification; Topological Data Analysis
Subject Area: Computer Science
Advisor(s): Zadrozny, Wlodek
Committee Members: Akella, Srinivas; Wartell, Zackery
Degree Note: Thesis (M.S.)--University of North Carolina at Charlotte, 2018.
Rights Statement: This Item is protected by copyright and/or related rights. You are free to use this Item in any way that is permitted by the copyright and related rights legislation that applies to your use. For other uses you need to obtain permission from the rights-holder(s). For additional information, see http://rightsstatements.org/page/InC/1.0/.
Rights Holder Information: Copyright is held by the author unless otherwise indicated.
Identifier: Doshi_uncc_0694N_11775
Permalink: http://hdl.handle.net/20.500.13093/etd:1098

J. Murrey Atkins Library
