Most frequent k characters

From HandWiki

In information theory, MostFreqKDistance is a string metric technique for quickly estimating how similar two ordered sets or strings are. The scheme was invented by Sadi Evren Seker (2014),[1] and initially used in text mining applications like author recognition.[1] The method is originally based on a hashing function MaxFreqKChars[2] classical author recognition problem and idea first came out while studying on data stream mining.[3] The algorithm is suitable for coding in any turing complete programming language.[4]

Definition

Method has two steps.

  • Hash input strings str1 and str2 separately using MostFreqKHashing and output hstr1 and hstr2 respectively
  • Calculate string distance (or string similarity coefficient) of two hash outputs, hstr1 and hstr2 and output an integer value

Most frequent K hashing

The first step of algorithm is calculating the hashing based on the most frequent k characters. The hashing algorithm has below steps:

String function MostFreqKHashing (String inputString, int K)
    def string outputString
    for each distinct character
        count occurrence of each character
    for i := 0 to K
        char c = next most freq ith character  (if two chars have same frequency then get the first occurrence in inputString)
        int count = number of occurrence of the character
        append to outputString, c and count
    end for
    return outputString

Above function, simply gets an input string and an integer K value and outputs the most frequent K characters from the input string. The only condition during the creation of output string is adding the first occurring character first, if the frequencies of two characters are equal. Similar to the most of hashing functions, Most Frequent K Hashing is also a one way function.

Most frequent K distance

The second step of algorithm works on two outputs from two different input strings and outputs the similarity coefficient (or distance metric).

int function MostFreqKSimilarity (String inputStr1, String inputStr2, int limit)
    def int similarity
    for each c = next character from inputStr1
        lookup c in inputStr2
        if c is null
             continue
        similarity += frequency of c in inputStr1
    return limit - similarity

Above function, simply gets two input strings, previously outputted from the MostFreqKHashing function. From the most frequent k hashing function, the characters and their frequencies are returned. So, the similarity function calculates the similarity based on characters and their frequencies by checking if the same character appears on both strings. The limit is usually taken to be 10 and in the end the function returns the result of the subtraction of the sum of similarities from limit.

In some implementations, the distance metric is required instead of similarity coefficient. In order to convert the output of above similarity coefficient to distance metric, the output can be subtracted from any constant value (like the maximum possible output value). For the case, it is also possible to implement a wrapper function over above two functions.

String distance wrapper function

In order to calculate the distance between two strings, the below function can be implemented

int function MostFreqKSDF (String inputStr1, String inputStr2, int K, int maxDistance)
    return MostFreqKSimilarity(MostFreqKHashing(inputStr1, K), MostFreqKHashing(inputStr2, K), maxDistance)

Any call to above string distance function will supply two input strings and a maximum distance value. The function will calculate the similarity and subtract that value from the maximum possible distance. It can be considered as a simple additive inverse of similarity.

Examples

Let's consider maximum 2 frequent hashing over two strings ‘research’ and ‘seeking’. MostFreqKHashing('research', 2) = r2e2 because we have 2 'r' and 2 'e' characters with the highest frequency and we return in the order they appear in the string. MostFreqKHashing('seeking', 2) = e2s1 Again we have character 'e' with highest frequency and rest of the characters have same frequency of 1, so we return the first character of equal frequencies, which is 's'. Finally we make the comparison: MostFreqKSimilarity('r2e2', 'e2s1') = 2 We simply compared the outputs and only character occurring in both input is character 'e' and the occurrence in both input is 2. Instead running the sample step by step as above, we can simply run by using the string distance wrapper function as below: MostFreqKSDF('research', 'seeking', 2) = 2

Below table holds some sample runs between example inputs for K=2:

Inputs Hash outputs SDF output (max from 10)
'night'
'nacht'
n1i1
n1a1
9
'my'
'a'
m1y1
a1NULL0
10
‘research’
‘research’	
r2e2
r2e2	
6
‘aaaaabbbb’
‘ababababa’	
a5b4
a5b4	
1
‘significant’
‘capabilities’	
i3n2
i3a2	
7

Method is also suitable for bioinformatics to compare the genetic strings like in FASTA format.

Str1 = LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV
Str2 = EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG
MostFreqKHashing(str1, 2) = L9T8
MostFreqKHashing(str2, 2) = F9L8
MostFreqKSDF(str1, str2, 2, 100) = 83

Algorithm complexity and comparison

The motivation behind algorithm is calculating the similarity between two input strings. So, the hashing function should be able to reduce the size of input and at the same time keep the characteristics of the input. Other hashing algorithms like MD5 or SHA-1, the output is completely unrelated with the input and those hashing algorithms are not suitable for string similarity check.

On the other hand, string similarity functions like Levenshtein distance have the algorithm complexity problem.

Also algorithms like Hamming distance, Jaccard coefficient or Tanimoto coefficient have relatively low algorithm complexity but the success rate in text mining studies are also low.

Time complexity

The calculation of time complexity of 'most frequent k char string similarity' is quite simple. In order to get the maximum frequent K characters from a string, the first step is sorting the string in a lexiconical manner. After this sort, the input with highest occurrence can be achieved with a simple pass in linear time complexity. Since major classical sorting algorithms are working in O(nlogn) complexity like merge sort or quick sort, we can sort the first string in O(nlogn) and second string on O(mlogm) times. The total complexity would be O(nlog n ) + O (m log m) which is O(n log n) as the upper bound worst case analysis.

Comparison

Below table compares the complexity of algorithms:

Algorithm Time complexity
Levenshtein distance [math]\displaystyle{ O(nm) = O(n^2) }[/math]
Jaccard index [math]\displaystyle{ O(n+m) = O(n) }[/math]
MostFreqKSDF [math]\displaystyle{ O(n \log n + m \log m) = O(n \log n) }[/math]

For the above table, n is the length of first string and m is the length of second string.

Success on text mining

The success of string similarity algorithms are compared on a study. The study is based on IMDB62 dataset which is holding 1000 comment entries in Internet Movie Database from each 62 people. The data set is challenged for three string similarity functions and the success rates are as below:

Algorithm Running time Error (RMSE) Error (RAE)
Levenshtein distance 3647286.54 sec 29 0.47
Jaccard index 228647.22 sec 45 0.68
MostFreqKSDF 2712323.51 sec 32 0.49

The running times for author recognition are in seconds and the error rates are root mean square error (RMSE) and relative absolute error (RAE).

Above table shows, the 'most frequent k similarity' is better than Levenshtein distance by time and Jaccard index by success rate.

For the time performance and the success rates, the bitwise similarity functions like Sørensen–Dice index, Tversky index or Hamming distance are all in the same category with similar success rates and running times. There are obviously slight differences but the idea behind bitwise operation, loses the string operations like deletion or addition. For example, a single bit addition to the front of one of the input strings would yield a catastrophic result on the similarity for bitwise operators while Levenshtein distance is successfully catching.

Unfortunately, big data studies often require a faster algorithm that still has an acceptable success rate. In such situations, the 'max frequent k characters' is a conceptually simpler algorithm that is also straight forward to implement.

See also

References

  1. 1.0 1.1 "A Novel String Distance Function based on Most Frequent K Characters". International Journal of Machine Learning and Computing (IJMLC) (International Association of Computer Science and Information Technology Press (IACSIT Press)) 4 (2): 177–183. 2014. Bibcode2014arXiv1401.6596E. 
  2. "A Novel Feature Hashing For Text Mining". Journal of Technical Science and Technologies (International Black Sea University) 2 (1): 37–41. 2013. ISSN 2298-0032. http://journal.ibsu.edu.ge/index.php/jtst/article/view/428. 
  3. "Author attribution on streaming data". Information Reuse and Integration (IRI), 2013 IEEE 14th International Conference on, San Francisco, USA, August 14–16, 2013 (IEEE): 497–503. 2013. doi:10.1109/IRI.2013.6642511. http://ieeexplore.ieee.org/xpl/articleDetails.jsp?tp=&arnumber=6642511. 
  4. Rosetta Code: Most frequent k chars distance, http://rosettacode.org/wiki/Most_frequent_k_chars_distance, retrieved 2014-10-16