Database storage structures

From HandWiki
Short description: None

Database tables and indexes may be stored on disk in one of a number of forms, including ordered/unordered flat files, ISAM, heap files, hash buckets, or B+ trees. Each form has its own particular advantages and disadvantages. The most commonly used forms are B-trees and ISAM. Such forms or structures are one aspect of the overall schema used by a database engine to store information.

Unordered

Unordered storage typically stores the records in the order they are inserted. Such storage offers good insertion efficiency ([math]\displaystyle{ O\left(1\right) }[/math]), but inefficient retrieval times ([math]\displaystyle{ O\left(n\right) }[/math]). Typically these retrieval times are better, however, as most databases use indexes on the primary keys, resulting in retrieval times of [math]\displaystyle{ O\left(\log n\right) }[/math] or [math]\displaystyle{ O\left(1\right) }[/math] for keys that are the same as the database row offsets within the storage system.[citation needed]

Ordered

Ordered storage typically stores the records in order and may have to rearrange or increase the file size when a new record is inserted, resulting in lower insertion efficiency. However, ordered storage provides more efficient retrieval as the records are pre-sorted, resulting in a complexity of [math]\displaystyle{ O\left(\log n\right) }[/math].[citation needed]

Structured files

Heap files

Heap files are lists of unordered records of variable size. Although sharing a similar name, heap files are widely different from in-memory heaps. In-memory heaps are ordered, as opposed to heap files.

  • Simplest and most basic method
    • insert efficient, with new records added at the end of the file, providing chronological order
    • retrieval efficient when the handle to the memory is the address of the memory
    • search inefficient, as searching has to be linear
    • deletion is accomplished by marking selected records as "deleted"
    • requires periodic reorganization if file is very volatile (changed frequently)
  • Advantages
    • efficient for bulk loading data
    • efficient for relatively small relations as indexing overheads are avoided
    • efficient when retrievals involve large proportion of stored records
  • Disadvantages
    • not efficient for selective retrieval using key values, especially if large
    • sorting may be time-consuming
    • not suitable for volatile tables

Hash buckets

Main page: Hash table
  • Hash functions calculate the address of the page in which the record is to be stored based on one or more fields in the record
    • hashing functions chosen to ensure that addresses are spread evenly across the address space
    • ‘occupancy’ is generally 40% to 60% of the total file size
    • unique address not guaranteed so collision detection and collision resolution mechanisms are required
  • Open addressing
  • Chained/unchained overflow
  • Pros and cons
    • efficient for exact matches on key field
    • not suitable for range retrieval, which requires sequential storage
    • calculates where the record is stored based on fields in the record
    • hash functions ensure even spread of data
    • collisions are possible, so collision detection and restoration is required

B+ trees

These are the most commonly used in practice.

  • Time taken to access any record is the same because the same number of nodes is searched
  • Index is a full index so data file does not have to be ordered
  • Pros and cons
    • versatile data structure – sequential as well as random access
    • access is fast
    • supports exact, range, part key and pattern matches efficiently.
    • volatile files are handled efficiently because index is dynamic – expands and contracts as table grows and shrinks
    • less well suited to relatively stable files – in this case, ISAM is more efficient

Data orientation

Most conventional relational databases use "row-oriented" storage, meaning that all data associated with a given row is stored together. By contrast, column-oriented DBMS store all data from a given column together in order to more quickly serve data warehouse-style queries. Correlation databases are similar to row-based databases, but apply a layer of indirection to map multiple instances of the same value to the same numerical identifier.

See also