Fuzzy hashing
Fuzzy hashing, also known as similarity hashing,[1] is a technique for detecting data that is similar, but not exactly the same, as other data. This is in contrast to cryptographic hash functions, which are designed to have significantly different hashes for even minor differences. Fuzzy hashing has been used to identify malware[2][3] and has potential for other applications, like data loss prevention and detecting multiple versions of code.[4][5]
Background
A hash function is a mathematical algorithm which maps arbitrary-sized data to a fixed size output. Many solutions use cryptographic hash functions like SHA-256 to detect duplicates or check for known files within large collection of files.[4] However, cryptographic hash functions cannot be used for determining if a file is similar to a known file, because one of the requirements of a cryptographic hash function is that a small change to the input should change the hash value so extensively that the new hash value appears uncorrelated with the old hash value (avalanche effect).[6]
Fuzzy hashing exists to solve this problem of detecting data that is similar, but not exactly the same, as other data. Fuzzy hashing algorithms specifically use algorithms in which two similar inputs will generate two similar hash values. This property is the exact opposite of the avalanche effect desired in cryptographic hash functions.
Fuzzy hashing can also be used to detect when one object is contained within another.[1]
Approaches
There are a few approaches used for building fuzzy hash algorithms:[7][5]
- Context-triggered piecewise hashing (CTPH), which constructs a hash by splitting the input into multiple pieces, calculating traditional hashes for each piece, and then combining those traditional hashes into a single string.[8]
- Locality-sensitive hashing places similar input items into the same "buckets", which can be used for data clustering and nearest neighbor searches.
Notable tools and algorithms
- spamsum is a tool written by Andrew Tridgell that uses fuzzy hashing to determine whether an email is similar to known spam. It operates by generating a fuzzy hash for an email that it compares against the fuzzy hashes from known spam emails to generate a match result between 0 (complete mismatch) to 100 (perfect match). If the match result is high enough, the email is classified as spam.[9][10]
- Nilsimsa Hash is an anti-spam–focused locality-sensitive hashing algorithm.
- ssdeep is a fuzzy hashing tool based on context-triggered pairwise hashing to compare files.[4]
- sdhash is a fuzzy hashing tool based on using Bloom filters to determine whether one file is contained within another or how similar two files are to each other.[11]
- TLSH is a locality-sensitive hashing scheme for comparing whether files are similar to each other. It has been used for malware clustering.[12]
- Rspamd uses fuzzy hashing to detect spam emails, using the shingles algorithm for this purpose.[13][14]
See also
References
- ↑ 1.0 1.1 Breitinger, Frank (May 2014). NIST Special Publication 800-168. doi:10.6028/NIST.SP.800-168. https://nvlpubs.nist.gov/nistpubs/SpecialPublications/NIST.SP.800-168.pdf. Retrieved January 11, 2023.
- ↑ Pagani, Fabio; Dell'Amico, Matteo; Balzarotti, Davide (2018-03-13). "Proceedings of the Eighth ACM Conference on Data and Application Security and Privacy". New York, NY, USA: ACM. pp. 354–365. doi:10.1145/3176258.3176306. ISBN 9781450356329.
- ↑ Sarantinos, Nikolaos; Benzaïd, Chafika; Arabiat, Omar (2016). "Forensic Malware Analysis: The Value of Fuzzy Hashing Algorithms in Identifying Similarities". 2016 IEEE Trustcom/BigDataSE/ISPA. pp. 1782–1787. doi:10.1109/TrustCom.2016.0274. 10.1109/TrustCom.2016.0274. ISBN 978-1-5090-3205-1. https://ieeexplore.ieee.org/document/7847157.
- ↑ 4.0 4.1 4.2 Kornblum, Jesse (2006). "Identifying almost identical files using context triggered piecewise hashing". Digital Investigation 3, Supplement (September 2006): 91–97. doi:10.1016/j.diin.2006.06.015.
- ↑ 5.0 5.1 Oliver, Jonathan; Cheng, Chun; Chen, Yanggui (2013). "2013 Fourth Cybercrime and Trustworthy Computing Workshop". IEEE. pp. 7–13. doi:10.1109/ctc.2013.9. ISBN 978-1-4799-3076-0.
- ↑ Al-Kuwari, Saif; Davenport, James H.; Bradford, Russell J. (2011). "Cryptographic Hash Functions: Recent Design Trends and Security Notions". Cryptology ePrint Archive. Report 2011/565. https://eprint.iacr.org/2011/565.
- ↑ Oliver, Jonathan; Hagen, Josiah (2021). "2021 IEEE 19th International Conference on Embedded and Ubiquitous Computing (EUC)". IEEE. pp. 1–6. doi:10.1109/euc53437.2021.00028. ISBN 978-1-6654-0036-7.
- ↑ "Open Source Similarity Digests DFRWS August 2016". https://github.com/trendmicro/tlsh/blob/master/TLSH_Introduction.pdf.
- ↑ "spamsum README". https://www.samba.org/ftp/unpacked/junkcode/spamsum/README.
- ↑ "spamsum.c". https://download.samba.org/pub/unpacked/junkcode/spamsum/spamsum.c.
- ↑ Roussev, Vassil (2010). "Data Fingerprinting with Similarity Digests". Advances in Digital Forensics VI. IFIP Advances in Information and Communication Technology. 337. Berlin, Heidelberg: Springer Berlin Heidelberg. pp. 207–226. doi:10.1007/978-3-642-15506-2_15. ISBN 978-3-642-15505-5. https://inria.hal.science/hal-01060620.
- ↑ "Fast Clustering of High Dimensional Data Clustering the Malware Bazaar Dataset". https://tlsh.org/papersDir/n21_opt_cluster.pdf.
- ↑ "Usage of fuzzy hashes". Rspamd. https://docs.rspamd.com/tutorials/fuzzy_storage/.
"Fuzzy check module". Rspamd. https://docs.rspamd.com/modules/fuzzy_check/. - ↑ Broder, Andrei Z.; Glassman, Steven C.; Manasse, Mark S.; Zweig, Geoffrey G.. "Syntactic clustering of the Web". Computer Networks and ISDN Systems (Elsevier Science) 29 (September 1997): 1157–1166. doi:10.1016/S0169-7552(97)00031-7. https://cadmo.ethz.ch/education/lectures/FS18/SDBS/papers/broder.pdf. Retrieved 2025-09-25.
