String interning

From HandWiki
Short description: Data structure for reusing strings

In computer science, string interning is a method of storing only one copy of each distinct string value, which must be immutable.[1] Interning strings makes some string processing tasks more time-efficient or space-efficient at the cost of requiring more time when the string is created or interned. The distinct values are stored in a string intern pool.

The single copy of each string is called its intern and is typically looked up by a method of the string class, for example String.intern()[2] in Java. All compile-time constant strings in Java are automatically interned using this method.[3]

String interning is supported by some modern object-oriented programming languages, including Java, Python, PHP (since 5.4), Lua[4] and .NET languages.[5] Lisp, Scheme, Julia, Ruby and Smalltalk are among the languages with a symbol type that are basically interned strings. The library of the Standard ML of New Jersey contains an atom type that does the same thing. Objective-C's selectors, which are mainly used as method names, are interned strings.

Objects other than strings can be interned. For example, in Java, when primitive values are boxed into a wrapper object, certain values (any boolean, any byte, any char from 0 to 127, and any short or int between −128 and 127) are interned, and any two boxing conversions of one of these values are guaranteed to result in the same object.[6]

History

Lisp introduced the notion of interned strings for its symbols. Historically, the data structure used as a string intern pool was called an oblist (when it was implemented as a linked list) or an obarray (when it was implemented as an array).

Modern Lisp dialects typically distinguish symbols from strings; interning a given string returns an existing symbol or creates a new one, whose name is that string. Symbols often have additional properties that strings do not (such as storage for associated values, or namespacing): the distinction is also useful to prevent accidentally comparing an interned string with a not-necessarily-interned string, which could lead to intermittent failures depending on usage patterns.

Motivation

String interning speeds up string comparisons, which are sometimes a performance bottleneck in applications (such as compilers and dynamic programming language runtimes) that rely heavily on associative arrays with string keys to look up the attributes and methods of an object. Without interning, comparing two distinct strings may involve examining every character of both.[Note 1] This is slow for several reasons: it is inherently O(n) in the length of the strings; it typically requires reads from several regions of memory, which take time; and the reads fill up the processor cache, meaning there is less cache available for other needs. With interned strings, a simple object identity test suffices after the original intern operation; this is typically implemented as a pointer equality test, normally just a single machine instruction with no memory reference at all.

String interning also reduces memory usage if there are many instances of the same string value; for instance, it is read from a network or from storage. Such strings may include magic numbers or network protocol information. For example, XML parsers may intern names of tags and attributes to save memory. Network transfer of objects over Java RMI serialization object streams can transfer strings that are interned more efficiently, as the String object's handle is used in place of duplicate objects upon serialization.[7]

Issues

Multithreading

If the interned strings are not immutable, one source of drawbacks is that string interning may be problematic when mixed with multithreading. In many systems, string interns are required to be global across all threads within an address space (or across any contexts which may share pointers), thus the intern pool(s) are global resources that should be synchronized for safe concurrent access. While this only affects string creation (where the intern pool must be checked and modified if necessary), and double-checked locking may be used on platforms where this is a safe optimization, the need for mutual exclusion when modifying the intern pool can be expensive.[8]

Contention can also be reduced by partitioning the string space into multiple pools, which can be synchronized independently of one another.

Reclaiming unused interned strings

Many implementations of interned strings do not attempt to reclaim (manually or otherwise) strings that are no longer used. For applications where the number of interned strings is small or fixed, or which are short-lived, the loss of system resources may be tolerable. But for long-running systems where large numbers of string interns are created at runtime, the need to reclaim unused interns may arise. This task can be handled by a garbage collector, though for this to work correctly weak references to string interns must be stored in the intern pool.

See also

Notes

  1. The string comparison can halt at the first character mismatch. For strict equality, the lengths of the strings can also be compared before traversing the string: but finding the length of null-terminated strings does itself require traversing the string.

References

External links