One way to organize data entries is to hash data entries on the sea.rch key. Another way to organize data entries is to build a tree-like data structure that directs a search for data entries.
In hash-based indexing, a hash function is used to convert a key into a hash code. This hash code serves as an index where the value associated with that key is stored. The goal is to distribute the keys uniformly across an array, so that access time is, on average, constant.
Let's break down some of these elements to further understand how hash-based indexing works in practice:
In hash-based indexing, the data space is divided into a fixed number of slots known as "buckets." A bucket usually contains a single page (also known as a block), but it may have additional pages linked in a chain if the primary page becomes full. This is known as overflow.
The hash function is a mapping function that takes the search key as an input and returns the bucket number where the record should be located. Hash functions aim to distribute records uniformly across buckets to minimize the number of collisions (two different keys hashing to the same bucket).
Hash-based indexing is particularly efficient when it comes to disk I/O operations. Given a search key, the hash function quickly identifies the bucket (and thereby the disk page) where the desired record is located. This often requires only one or two disk I/Os, making the retrieval process very fast.
When a new record is inserted into the dataset, its search key is hashed to find the appropriate bucket. If the primary page of the bucket is full, an additional overflow page is allocated and linked to the primary page. The new record is then stored on this overflow page.
To find a record with a specific search key, the hash function is applied to the search key to identify the bucket. All pages (primary and overflow) in that bucket are then examined to find the desired record.
Hash-based indexing is not suitable for range queries or when the search key is not known. In such cases, a full scan of all pages is required, which is resource-intensive.
Let's consider a simple example using employee names as the search key.
Employee Records
| Name | Age | Salary
|-----------|----------|--------
| Alice | 28 | 50000
| Bob | 35 | 60000
| Carol | 40 | 70000
Hash Function: H(x) = ASCII value of first letter of the name mod 3
Buckets:
Bucket 0: Bob
Bucket 1: Carol
Bucket 2: Alice
The most commonly used tree-based index structure is the B-Tree, and its variations like B+ Trees and B* Trees. In tree-based indexing, data is organized into a tree-like structure. Each node represents a range of key values, and leaf nodes contain the actual data or pointers to the data.
Tree-based indexes like B-Trees offer a number of advantages:
Continuing with the "Students" table:
ID Name
1 Abhi
2 Bharath
3 Chinni
4 Devid
A simplified B-Tree index could look like this:
[1, 3]
/ \
[1] [3, 4]
/ \ / \
1 2 3 4
In the tree, navigating from the root to the leaf nodes will lead us to the desired data record.