Database storage structures
Database tables and indexes may be stored on disk in one of a number of forms, including ordered/unordered flat files, ISAM, heap files, hash buckets, or B+ trees. Each form has its own particular advantages and disadvantages. The most commonly used forms are B-trees and ISAM. Such forms or structures are one aspect of the overall schema used by a database engine to store information.
Unordered
Unordered storage typically stores the records in the order they are inserted. Such storage offers good insertion efficiency ([math]\displaystyle{ O\left(1\right) }[/math]), but inefficient retrieval times ([math]\displaystyle{ O\left(n\right) }[/math]). Typically these retrieval times are better, however, as most databases use indexes on the primary keys, resulting in retrieval times of [math]\displaystyle{ O\left(\log n\right) }[/math] or [math]\displaystyle{ O\left(1\right) }[/math] for keys that are the same as the database row offsets within the storage system.[citation needed]
Ordered
Ordered storage typically stores the records in order and may have to rearrange or increase the file size when a new record is inserted, resulting in lower insertion efficiency. However, ordered storage provides more efficient retrieval as the records are pre-sorted, resulting in a complexity of [math]\displaystyle{ O\left(\log n\right) }[/math].[citation needed]
Structured files
Heap files
Heap files are lists of unordered records of variable size. Although sharing a similar name, heap files are widely different from in-memory heaps. In-memory heaps are ordered, as opposed to heap files.
- Simplest and most basic method
- insert efficient, with new records added at the end of the file, providing chronological order
- retrieval efficient when the handle to the memory is the address of the memory
- search inefficient, as searching has to be linear
- deletion is accomplished by marking selected records as "deleted"
- requires periodic reorganization if file is very volatile (changed frequently)
- Advantages
- efficient for bulk loading data
- efficient for relatively small relations as indexing overheads are avoided
- efficient when retrievals involve large proportion of stored records
- Disadvantages
- not efficient for selective retrieval using key values, especially if large
- sorting may be time-consuming
- not suitable for volatile tables
Hash buckets
- Hash functions calculate the address of the page in which the record is to be stored based on one or more fields in the record
- hashing functions chosen to ensure that addresses are spread evenly across the address space
- ‘occupancy’ is generally 40% to 60% of the total file size
- unique address not guaranteed so collision detection and collision resolution mechanisms are required
- Open addressing
- Chained/unchained overflow
- Pros and cons
- efficient for exact matches on key field
- not suitable for range retrieval, which requires sequential storage
- calculates where the record is stored based on fields in the record
- hash functions ensure even spread of data
- collisions are possible, so collision detection and restoration is required
B+ trees
These are the most commonly used in practice.
- Time taken to access any record is the same because the same number of nodes is searched
- Index is a full index so data file does not have to be ordered
- Pros and cons
- versatile data structure – sequential as well as random access
- access is fast
- supports exact, range, part key and pattern matches efficiently.
- volatile files are handled efficiently because index is dynamic – expands and contracts as table grows and shrinks
- less well suited to relatively stable files – in this case, ISAM is more efficient
Data orientation
Most conventional relational databases use "row-oriented" storage, meaning that all data associated with a given row is stored together. By contrast, column-oriented DBMS store all data from a given column together in order to more quickly serve data warehouse-style queries. Correlation databases are similar to row-based databases, but apply a layer of indirection to map multiple instances of the same value to the same numerical identifier.
See also
Original source: https://en.wikipedia.org/wiki/Database storage structures.
Read more |