List of text mining methods

{{Short description|none}}

Different text mining methods are used depending on how well they suit a given data set. Text mining is the process of extracting information from unstructured text and finding patterns or relationships in it. Below is a list of text mining methodologies.

  • Centroid-based Clustering: Unsupervised learning method. Each cluster is represented by a central vector (centroid), and data points are assigned to the cluster with the nearest centroid.{{Cite web |date=2018-01-15 |title=Different Types of Clustering Algorithm |url=https://www.geeksforgeeks.org/different-types-clustering-algorithm/ |access-date=2024-04-04 |website=GeeksforGeeks |language=en-US}}
    • Fast Global KMeans: A variant of Global K-means designed to run faster.{{Cite journal |last=Jalil |first=Abdennour Mohamed |last2=Hafidi |first2=Imad |last3=Alami |first3=Lamiae |last4=Khouribga |first4=Ensa |date=2016 |title=Comparative Study of Clustering Algorithms in Text Mining Context |url=https://reunir.unir.net/bitstream/123456789/11227/1/ijimai20163_7_6_pdf_27159.pdf |journal=International Journal of Interactive Multimedia and Artificial Intelligence |language=en |volume=3 |issue=7 |pages=42 |doi=10.9781/ijimai.2016.376 |issn=1989-1660}}
    • Global K-Means: An algorithm that begins with one cluster and then adds clusters one at a time until the required number is reached.
    • KMeans: An algorithm that requires two inputs: the number of clusters k and the data set. It alternates between assigning points to the nearest centroid and recomputing the centroids (see the K-means sketch after this list).
    • FW-KMeans: Used with the vector space model. Assigns weights to features to reduce the influence of noise.
    • Two-Level-KMeans: The regular KMeans algorithm runs first; clusters that do not reach a quality threshold are then selected and subdivided into subclusters.
  • Cluster Algorithm
    • Hierarchical Clustering
      • Agglomerative Clustering: Bottom-up approach. Each data point starts in its own small cluster, and clusters are then merged step by step into larger ones.{{Cite web |date=2021-02-01 |title=Agglomerative Methods in Machine Learning |url=https://www.geeksforgeeks.org/agglomerative-methods-in-machine-learning/ |access-date=2024-04-04 |website=GeeksforGeeks |language=en-US}}
      • Divisive Clustering: Top-down approach. Large clusters are split into smaller clusters.
  • Density-based Clustering: Clusters correspond to regions where data points are densely packed, separated by sparser regions.{{Cite web |last=Hahsler |first=Michael |display-authors=etal |title=dbscan: Fast Density-based Clustering with R |url=https://cran.r-project.org/web/packages/dbscan/vignettes/dbscan.pdf |access-date=4 March 2024 |website=cran.r-project.org}}
    • DBSCAN
  • Distribution-based Clustering: Clusters are formed by fitting probability distributions to the data, grouping points that are likely to come from the same distribution.
    • Expectation-maximization algorithm
  • Collocation
  • Stemming Algorithm
    • Truncating Methods: Remove the suffix or prefix of a word.
      • Lovins Stemmer: Removes the longest matching suffix in a single pass.
      • Porters Stemmer: Applies a fixed, multi-phase sequence of suffix-stripping rules; its Snowball framework also lets programmers define their own stemming rules (see the stemming sketch after this list).
    • Statistical Methods: Rely on statistical procedures, which typically result in affixes being removed.
      • N-Gram Stemmer: Breaks words into sequences of n consecutive characters; words that share many n-grams are grouped together, and stems are derived from these groups.
      • Hidden Markov Model (HMM) Stemmer: Models a word as a path through states, with transitions governed by probability functions; the most probable path determines the split between stem and affix.
      • Yet Another Suffix Stripper (YASS) Stemmer: Uses a hierarchical approach to create clusters. Each cluster is treated as an equivalence class of words, and its centroid is taken as the stem.
    • Inflectional & Derivational Methods
      • Krovetz Stemmer: Converts words to stems that are valid English words.
      • Xerox Stemmer: Maps surface word forms to base forms using a lexical database; unlike most stemmers, it also removes prefixes.{{Cite web |last=Ganesh Jivani |first=Anjali |title=A Comparative Study of Stemming Algorithms |url=https://kenbenoit.net/assets/courses/tcd2014qta/readings/Jivani_ijcta2011020632.pdf}}
  • Term Frequency
  • Term Frequency Inverse Document Frequency
  • Topic Modeling
    • Latent Semantic Analysis (LSA)
    • Latent Dirichlet Allocation (LDA) (see the topic-modeling sketch after this list)
    • Non-Negative Matrix Factorization (NMF)
    • Bidirectional Encoder Representations from Transformers (BERT)
  • Wordscores: First estimates scores for word types from reference texts with known positions. These word scores are then applied to non-reference ("virgin") texts to produce document scores. Finally, the virgin document scores are rescaled so that they can be compared with the reference texts (see the Wordscores sketch after this list).{{Cite journal |last=Lowe |first=Will |date=2008 |title=Understanding Wordscores |url=https://faculty.washington.edu/jwilker/559/Lowe.pdf |journal=Methods and Data Institute, School of Politics and International Relations, University of Nottingham, Nottingham |doi=10.2139/ssrn.1095280 |issn=1556-5068}}
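
The K-means entry above can be made concrete with a short example. The following is a minimal sketch, assuming scikit-learn is available; the four toy documents are invented for illustration. Each document is first converted into a term frequency-inverse document frequency (TF-IDF) vector, then grouped into k = 2 clusters.

<syntaxhighlight lang="python">
# A minimal K-means sketch; the documents and k = 2 are illustrative only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = [
    "cats purr and chase mice",
    "dogs bark and chase cats",
    "stocks rose as markets rallied",
    "investors sold shares as markets fell",
]

# Represent each document as a TF-IDF weighted term vector.
X = TfidfVectorizer().fit_transform(docs)

# K-means takes its two required inputs: the number of clusters k
# and the data set. It alternates assignment and centroid updates.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)  # e.g. [0 0 1 1]: animal texts vs. market texts
</syntaxhighlight>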
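
The stemming algorithms above are easiest to see on concrete words. Below is a minimal sketch assuming the NLTK library, which ships an implementation of the Porter stemmer; note that outputs such as "poni" are not valid English words, which is precisely what the Krovetz stemmer is designed to avoid.

<syntaxhighlight lang="python">
# A minimal suffix-stripping sketch using NLTK's Porter stemmer.
from nltk.stem import PorterStemmer

porter = PorterStemmer()
words = ["connected", "connecting", "connection", "ponies", "caresses"]

# All three "connect" variants reduce to the same stem; "ponies"
# becomes the non-word "poni".
print([porter.stem(w) for w in words])
# ['connect', 'connect', 'connect', 'poni', 'caress']
</syntaxhighlight>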
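
Topic models such as LDA infer latent topics from patterns of word co-occurrence. The sketch below assumes a recent version of scikit-learn; the four documents and the choice of two topics are invented for illustration. Note that LDA is fit on raw term counts rather than TF-IDF weights.

<syntaxhighlight lang="python">
# A minimal LDA topic-modeling sketch; corpus and topic count are illustrative.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "the patient received a dose of the new drug",
    "the drug trial reported mild side effects",
    "the team won the final match in overtime",
    "the coach praised the players after the match",
]

vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(docs)  # raw term counts, as LDA expects

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

# Print the top words that characterize each inferred topic.
terms = vec.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    top = [terms[j] for j in topic.argsort()[-4:][::-1]]
    print(f"topic {i}: {top}")
</syntaxhighlight>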
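
The three Wordscores steps translate almost directly into code. The following is a minimal NumPy sketch; the word counts and the reference positions (-1.0 and 1.0) are invented, and the final rescaling step is only noted in a comment since it needs several virgin documents to be meaningful.

<syntaxhighlight lang="python">
# A minimal Wordscores sketch; counts and reference scores are invented.
import numpy as np

# Rows: two reference documents with known positions; columns: word types.
ref_counts = np.array([[10.0, 2.0, 0.0],
                       [1.0, 5.0, 8.0]])
ref_scores = np.array([-1.0, 1.0])

# Step 1: estimate a score for each word type from the reference texts.
F = ref_counts / ref_counts.sum(axis=1, keepdims=True)  # word freq per doc
P = F / F.sum(axis=0, keepdims=True)                    # P(ref doc | word)
word_scores = P.T @ ref_scores

# Step 2: score a virgin (non-reference) text as the frequency-weighted
# mean of its word scores.
virgin_counts = np.array([4.0, 3.0, 1.0])
virgin_score = (virgin_counts / virgin_counts.sum()) @ word_scores
print(round(virgin_score, 3))

# Step 3 (not shown): rescale the virgin scores so their dispersion
# matches that of the reference scores, making them directly comparable.
</syntaxhighlight>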

References

{{Reflist}}