Software:Ssdeep

From HandWiki
ssdeep
Original author(s)Jesse D. Kornblum
Repositorygithub.com/ssdeep-project/ssdeep
Written inC, C++
LicenseGNU General Public License
Websitessdeep-project.github.io/ssdeep/index.html

ssdeep is a utility for generating and comparing Context Triggered Piecewise Hashes (CTPH), more commonly known as fuzzy hashes.[1] Unlike traditional cryptographic hashes such as MD5 and SHA-256, ssdeep can detect similar files even if they have been modified, repacked, or had data inserted or deleted.[2] Ssdeep is a vital tool in digital forensics.

Algorithm

In contrast to traditional piecewise hashing, Ssdeep uses a rolling hash to create context triggered boundaries. [3] As Ssdeep reads each byte of a file, it keeps track of a small window containing the last 7 bytes of data. It then constantly computes a sequential fast hash of those 7 bytes with a modified Adler-32 checksum. Next, the algorithm defines a trigger value. When the rolling hash window Modulo the current block size equals block size − 1, a boundary is triggered. Because the boundary is triggered by the content instead of a fixed count, inserting or removing bytes only changes the block where the change occurred. The rest of the file's boundaries are resynchronized.[4] Every time the rolling hash hits a trigger, Ssdeep marks the end of a block. The algorithm takes all the bytes in the current block and runs them through the non-rolling, stronger Fowler–Noll–Vo hash function (FNV-1). The resulting FNV-1 value is trimmed down to a single 6-bit integer. The integer is mapped to a custom Base64 character set. This one character represents that entire chunk of the file.[5]

Dynamic block size selection

The output of an Ssdeep hash must be small enough to easily compare. To achieve this whether a file is one megabyte or several gigabytes, Ssdeep dynamically guesses the best block size. It starts with a base block size of 3, simulates the hash, and if the file is too large and the resulting signature is over 64 characters, Ssdeep doubles the block size and tries again. This is continued until a signature under 64 characters is obtained. [4]

Dual hashing input

To make sure a file can still be matched even if automatic block size selection is off, Ssdeep calculates two hashes simultaneously; a hash using the chosen block size, and a hash using twice the chosen block size.[6]

Edit distance

When two ssdeep hashes are compared to get a similarity score, the tool doesn't look at the files, only the two text strings. Ssdeep only compares two hashes if their block size matches or one is double the other exactly. If the block sizes are completely different, an output of 0 is instantly returned.[7] The algorithm runs a weighted Edit distance algorithm on the Base64 strings, calculating the minimum amount of operations required to transform the first string into the second. Before running the edit distance, consecutive identical characters are compressed (e.g. AAAAAAA -> AA) so that large blocks of repeating padding data don't artificially inflate the similarity score. Finally, the edit distance is normalized against the total string length and converted into a percentage between 0 and 100%.[7]

References