DBMS Menu

File Organization and Indexing in DBMS

File organization and indexing are fundamental concepts in database management systems (DBMS) that deal with efficient storage, retrieval, and management of data.

File Organization

File organization refers to the arrangement of data on storage devices. The method chosen can have a profound effect on the efficiency of various database operations.

Common methods of file organization include:

Sequential (or Serial) File Organization

In sequential file organization, records are stored in sequence, one after the other, based on a key field. This key field is a unique identifier for records, ensuring that they have some order. The records are inserted at the end of the file, ensuring the sequence is maintained.

Features of Sequential File Organization

Ordered Records: Records in a sequential file are stored based on a key field.
Continuous Memory Allocation: The records are stored in contiguous memory locations.
No Direct Access: To access a record, you have to traverse from the first record until you find the desired one.

Advantages Sequential File Organization

Simplicity: The design and logic behind sequential file organization are straightforward.
Efficient for Batch Processing: Since records are stored in sequence, sequential processing (like batch updates) can be done efficiently.
Less Overhead: There's no need for complex algorithms or mechanisms to store records.

Disadvantages Sequential File Organization

Inefficient for Random Access**: If you need a specific record, you may have to go through many records before finding the desired one. This makes random access slow.
Insertion and Deletion**: Inserting or deleting a record (other than at the end) can be time-consuming since you may need to shift records to maintain the order.
Redundancy Issues**: There's a risk of redundancy if checks are not made before inserting records. For example, a record with the same key might get added twice if not checked.

Practical Application: Suppose you have a file of students ordered by their roll number:


Roll No | Name
--------|--------
1       | Madhu
2       | Naveen
4       | Shivaji
5       | Durga

In a sequential file, if you wanted to add a student with roll number 6, you would append them at the end. However, if you wanted to insert a student with a roll number 3 which is between 2 and 4, you would need to shift all subsequent records to maintain the sequence, which can be time-consuming.

Direct (or Hashed) File Organization

In hash file organization, a hash function is used to compute the address of a block (or bucket) where the record is stored. The value returned by the hash function using a record's key value is its address in the database.

Features of Hash File Organization

Hash Function: A hash function converts a record's key value into an address.
Buckets: A bucket typically stores one or more records. A hash function might map multiple keys to the same bucket.
No Ordering of Records: Records are not stored in any specific logical order.

Advantages Hash File Organization

Rapid Access: If the hash function is efficient and there's minimal collision, the retrieval of a record is very quick.
Uniform Distribution: A good hash function will distribute records uniformly across all buckets.
Efficient Search: Searching becomes efficient as only a specific bucket needs to be searched rather than the entire file.

Disadvantages Hash File Organization

Collisions**: A collision occurs when two different keys hash to the same bucket. Handling collisions can be tricky and might affect access time.
Dependency on Hash Function**: The efficiency of a hash file organization heavily depends on the hash function used. A bad hash function can lead to clustering and inefficient utilization of space.
Dynamic Growth and Shrinking**: If the number of records grows or shrinks significantly, rehashing might be needed which is an expensive operation.

Practical Application: Imagine a database that holds information about books.

Each book has a unique ISBN number. A hash function takes an ISBN and returns an address. When you want to find a particular book's details, you hash the ISBN, which directs you to a particular bucket. If two books' ISBNs hash to the same value, you handle that collision, maybe by placing the new record in a linked list associated with that bucket.

Indexed File Organization

Indexed file organization is a method used to store and retrieve data in databases. It is designed to provide quick random access to records based on key values. In this organization, an index is created which helps in achieving faster search and access times.

Features Indexed File Organization:

Primary Data File: The actual database file where records are stored.
Index: An auxiliary file that contains key values and pointers to the corresponding records in the data file.
Multi-level Index: Sometimes, if the index becomes large, a secondary (or even tertiary) index can be created on the primary index to expedite searching further.

Advantages Indexed File Organization:

Quick Random Access: Direct access to records is possible using the index.
Flexible Searches: Since an index provides a mechanism to jump directly to records, different types of search operations (like range queries) can be efficiently supported.
Ordered Access: If the primary file is ordered, then indexed file organization can support efficient sequential access too.

Disadvantages Indexed File Organization:

Overhead of Maintaining Index: Every time a record is added, deleted, or updated, the index also needs to be updated. This can introduce overhead.
Space Overhead: Indexes consume additional storage space.
Complexity: Maintaining multiple levels of indexes can introduce complexity in terms of design and implementation.

Practical Application: Consider a database that holds information about students.

where each student has a unique student ID. The main file would contain detailed records for each student. A separate index file would contain student IDs and pointers to the location of the detailed records in the main file. When you want to find a specific student's details, you first search the index to find the pointer and then use that pointer to fetch the record directly from the main file.

Note: Indexed file organization strikes a balance between the direct and sequential access of records. While it offers faster search capabilities, there's a trade-off in terms of space and maintenance overhead. It is ideal for databases where search operations are predominant, and where the overhead of maintaining the index can be justified by the improved access times.

Indexed Sequential Access Method (ISAM)

ISAM is a popular method for indexed file organization. In ISAM:

The primary file is stored in a sequential manner based on a primary key.
There's a static primary index built on the primary key.
Overflow areas are designated for insertion of new records, which keeps the main file in sequence. Periodically, the overflow area can be merged back into the main file.

Indexing

Indexing involves creating an auxiliary structure (an index) to improve data retrieval times. Just like the index in the back of a book, a database index provides pointers to the locations of records.

Structure of Index

We can create indices using some columns of the database.


|-------------|----------------|
|  Search Key | Data Reference |
|-------------|----------------|

The search key is the database’s first column, and it contains a duplicate or copy of the table’s candidate key or primary key. The primary key values are saved in sorted order so that the related data can be quickly accessible.
The data reference is the database’s second column. It contains a group of pointers that point to the disk block where the value of a specific key can be found.

Types of Indexes:

Single-level Index: A single index table that contains pointers to the actual data records.
Multi-level Index: An index of indexes. This hierarchical approach reduces the number of accesses (disk I/O operations) required to find an entry.
Dense and Sparse Indexes:
- In a dense index, there's an index entry for every search key value in the database.
- In a sparse index, there are fewer index entries. One entry might point to several records.
Primary and Secondary Indexes:
- A primary index is an ordered file whose records are of fixed length with two fields. The first field is the same as the primary key, and the second field is a pointer to the data block. There's a one-to-one relationship between the number of entries in the index and the number of records in the main file.
- A secondary index provides a secondary means of accessing data. For each secondary key value, the index points to all the records with that key value.
Clustered vs. Non-clustered Index:
- In a clustered index, the rows of data in the table are stored on disk in the same order as the index. There can only be one clustered index per table.
- In a non-clustered index, the order of rows does not match the index's order. You can have multiple non-clustered indexes.
Bitmap Index: Used mainly for data warehousing setups, a bitmap index uses bit arrays (bitmaps) and usually involves columns that have a limited number of distinct values.
B-trees and B+ trees: Balanced tree structures that ensure logarithmic access time. B+ trees are particularly popular in DBMS for their efficiency in disk I/O operations.

Benefits of Indexing:

Faster search and retrieval times for database operations.

Drawbacks of Indexing:

Overhead for insert, update, and delete operations, as indexes need to be maintained.
Additional storage requirements for the index structures.

Next Topic :Cluster Indexes in DBMS