List of text mining methods
Text mining methods are techniques for text mining, chosen according to their suitability for a given data set. Text mining is the process of extracting information from unstructured text and finding patterns or relations in it. Below is a list of text mining methods.
- Centroid-based Clustering: Unsupervised learning method. Each cluster is represented by a central point (centroid), and data points are assigned to the cluster with the nearest centroid.[1]
- Fast Global K-Means: A variant designed to accelerate Global K-Means.[2]
- Global K-Means: An algorithm that begins with one cluster and adds clusters one at a time until the required number is reached.[2]
- K-Means: An algorithm that takes two inputs: K, the number of clusters, and a data set, which it partitions into K clusters.[2]
- FW-K-Means: Used with the vector space model. Uses weighting to reduce the effect of noise.[2]
- Two-Level-K-Means: The regular K-Means algorithm runs first. Clusters that do not reach the threshold are then selected for subdivision into subclasses.[2]
- Cluster Algorithm
- Hierarchical Clustering
- Agglomerative Clustering: Bottom-up approach. Each data point starts as its own small cluster, and clusters are then merged step by step to form larger clusters.[3]
- Divisive Clustering: Top-down approach. Large clusters are split into smaller clusters.[3]
- Density-based Clustering: Clusters are determined by the density of data points: dense regions form clusters, and sparse regions separate them.[4]
- Distribution-based Clustering: Clusters are formed by fitting statistical distributions to the data.[1]
- Collocation
- Stemming Algorithm
- Truncating Methods: Removing the suffix or prefix of a word.
- Lovins Stemmer: Removes the longest suffix from a word.
- Porter's Stemmer: Strips suffixes by applying rules in a series of steps. The related Snowball framework allows programmers to develop their own stemmers.
- Statistical Methods: Based on statistical analysis; they typically result in affixes being removed.
- N-Gram Stemmer: Uses n-grams, sequences of n consecutive characters taken from a word. Words that share many n-grams are treated as related.
- Hidden Markov Model (HMM) Stemmer: Transitions between states are governed by probability functions.
- Yet Another Suffix Stripper (YASS) Stemmer: Uses a hierarchical approach to create clusters. The clusters are then treated as classes of equivalent words, and their centroids are the stems.
- Inflectional & Derivational Methods
- Krovetz Stemmer: Changes words to word stems that are valid English words.
- Xerox Stemmer: Removes prefixes.[5]
- Term Frequency
- Topic Modeling
- Wordscores: First, scores for word types are estimated from reference texts. Then, the word scores are applied to a text that is not a reference text to obtain a document score. Lastly, the scores of the non-reference documents are rescaled so that they can be compared with the reference texts.[6]
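
The first two Wordscores steps listed above can be sketched in Python. This is only a minimal illustration, not the reference implementation: it assumes whitespace tokenization, assumes each reference text comes with a known numeric position, and leaves out the final rescaling step. The function names (`relative_freqs`, `word_scores`, `document_score`) are hypothetical and chosen for this example.

```python
from collections import Counter


def relative_freqs(text):
    """Return each word's share of the total tokens in a text."""
    tokens = text.lower().split()  # assumption: simple whitespace tokenization
    counts = Counter(tokens)
    total = len(tokens)
    return {word: n / total for word, n in counts.items()}


def word_scores(reference_texts, reference_positions):
    """Step 1: score each word type from reference texts with known positions."""
    freqs = [relative_freqs(t) for t in reference_texts]
    vocab = set().union(*freqs)
    scores = {}
    for word in vocab:
        # Relative frequency of the word in each reference text.
        weights = [f.get(word, 0.0) for f in freqs]
        total = sum(weights)
        # Average of the reference positions, weighted by how strongly
        # the word is associated with each reference text.
        scores[word] = sum(
            w / total * pos for w, pos in zip(weights, reference_positions)
        )
    return scores


def document_score(text, scores):
    """Step 2: frequency-weighted mean word score of a non-reference text."""
    freqs = relative_freqs(text)
    # Words that never occur in a reference text have no score and are ignored.
    scored = {w: f for w, f in freqs.items() if w in scores}
    total = sum(scored.values())
    if total == 0:
        return None  # no overlap with the reference vocabulary
    return sum(f / total * scores[w] for w, f in scored.items())


# Made-up toy data: two reference texts placed at -1.0 and +1.0.
toy_scores = word_scores(
    ["tax cut tax cut", "spend more spend more"], [-1.0, 1.0]
)
balanced = document_score("tax spend", toy_scores)
```

In this toy data, `balanced` comes out as 0.0 because the new text uses one word from each reference text equally often. Raw document scores tend to cluster near the middle of the reference scale, which is why the method includes the rescaling step that this sketch omits.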
References
- [1] "Different Types of Clustering Algorithm". 2018-01-15. https://www.geeksforgeeks.org/different-types-clustering-algorithm/.
- [2] Jalil, Abdennour Mohamed; Hafidi, Imad; Alami, Lamiae; Khouribga, Ensa (2016). "Comparative Study of Clustering Algorithms in Text Mining Context". International Journal of Interactive Multimedia and Artificial Intelligence 3 (7): 42. doi:10.9781/ijimai.2016.376. ISSN 1989-1660. https://reunir.unir.net/bitstream/123456789/11227/1/ijimai20163_7_6_pdf_27159.pdf.
- [3] "Agglomerative Methods in Machine Learning". 2021-02-01. https://www.geeksforgeeks.org/agglomerative-methods-in-machine-learning/.
- [4] Hahsler, Michael. "dbscan: Fast Density-based Clustering with R". https://cran.r-project.org/web/packages/dbscan/vignettes/dbscan.pdf.
- [5] Ganesh Jivani, Anjali. "A Comparative Study of Stemming Algorithms". https://kenbenoit.net/assets/courses/tcd2014qta/readings/Jivani_ijcta2011020632.pdf.
- [6] Lowe, Will (2008). "Understanding Wordscores". Methods and Data Institute, School of Politics and International Relations, University of Nottingham, Nottingham. doi:10.2139/ssrn.1095280. ISSN 1556-5068. https://faculty.washington.edu/jwilker/559/Lowe.pdf.
