MySQL interviews: What is the difference between a clustered index and a non-clustered index?

Jan 3 2024 mysql 16 minutes read (About 2326 words)

Have you ever been asked a similar question in an interview? Or you will meet in the future, let’s explore and master it together

Thank you for reading this article. More Interview Questions here:
https://programmerscareer.com/software-interview-set/

Topic: 1.1 Introduction to Clustered and Non-Clustered Indexes

Knowing how your data is stored and retrieved can greatly impact the performance of your databases. More specifically, understanding how MySQL makes use of indexes is crucial to this understanding. This is where the concepts of Clustered and Non-Clustered Indexes come into play.

Indexes, as you may have learned from your prerequisite studies, are a vital component of databases. They are essentially lookup tables that the database engine utilizes to speed up the data retrieval process, much like how an index in a book helps you quickly locate information without having to read each page.

Now, let’s specifically dive into what clustered and non-clustered indexes are:

Clustered Indexes:

Just as it sounds, a clustered index dictates the physical order of data in a table. To understand this better, consider a clustered index as a dictionary. In a dictionary, words are not randomly scattered across but are stored alphabetically, so you can quickly jump to a section for a particular letter and find the word you’re looking for. In the context of a database, this “word” is your data row. MySQL, in fact, organizes rows of data based on the clustered index so that retrieval is faster. Please note that there can only be one clustered index per table.

Non-Clustered Indexes:

Non-clustered indexes, on the other hand, do not dictate the physical order of data. Yet, they carry a ‘pointer’ to the data. To understand better, if a clustered index is like a dictionary, then a non-clustered index can be considered like an index in a book. The index points you to the pages where information resides, but it doesn’t contain the information itself. This means that once the database engine looks through the non-clustered index, it must perform extra work to ‘go’ to the data row, compared to a clustered index where the data resides with the index. Hence, data retrieval via non-clustered indexes may be slower. However, you can have multiple non-clustered indexes, which can be beneficial for a variety of data retrieval scenarios.

Topic: 1.2 Clustered Indexes Explained

Clustered indexes are all about the physical storage of data. When you create a clustered index on a table, the rows of the table are stored on disk in the same order as the index key. There can only be one clustered index per table, and if you don’t explicitly define a clustered index, SQL Server will automatically create one for you. This is known as a heap.

The structure of a clustered index is also known as a B+ tree structure. At the root of the tree, which is the topmost level, you have the root node. This root node then branches out into multiple leaf nodes at the bottom-most level. These leaf nodes contain the actual data rows in the order of the index.

A primary key constraint automatically creates a clustered index on that particular column. However, in some cases, you might want to manually create a clustered index on a non-primary-key column. This depends on your requirements. For example, if you have a table with employees’ data and constantly query data in the order of employees’ hire date, it might be beneficial to create a clustered index on the hire date column to speed up these queries.

Furthermore, the update of records in a table with a clustered index may be slower than that in a heap. That’s because when a record is updated in a table with a clustered index, the database might need to physically move the entire row to maintain the sort order.

Topic: 1.3 Non-Clustered Indexes Explained

So now we have a good understanding of what a clustered index is and how it sorts and stores data on disk. But sometimes, we don’t always want to retrieve data based on the (single) clustered index. That’s where the non-clustered index comes into play.

A non-clustered index is markedly different from a clustered index. For starters, creating a non-clustered index does not rearrange the physical order of data in the table. Instead, it creates a distinct object within the database that houses a sorted list of pointers that point to the data in the table.

Here’s a neat example to illustrate. Imagine a book — rather than looking through every page for a particular topic, you would generally turn to the book’s index, right? It guides you straight to the pages that contain your specified topic. That quick directing is the function a non-clustered index performs!

The architecture of a non-clustered index is similar to a clustered one — a B-tree data structure with root nodes, intermediate level nodes, and leaf nodes. However, the leaf nodes of a non-clustered index consist only of the index columns and a pointer to the corresponding row in the data table. You can have multiple non-clustered indexes on a single table, with each catering to a specific query you want to speed up.

In MySQL, a non-clustered index is essentially all secondary indexes you create, with each of them containing a copy of the primary key columns for the rows where the search key matches.

Topic: 1.4 Differences Between Clustered and Non-Clustered Indexes

With a solid understanding of what clustered and non-clustered indexes are, let’s now clarify the key differences between the two:

Order of Data: A clustered index determines the physical order of data in a table. On the other hand, a non-clustered index doesn’t alter the way the records are stored but creates a separate object within a database that points back to those original records.
Number of Indexes: A table can have only one clustered index, but multiple non-clustered indexes. Remember, the more indexes, the more disk space required.
Data Retrieval Speed: Clustered indexes can lead to faster data retrieval than non-clustered, but that’s not always the case. If a non-clustered index covers a query (meaning, the query’s data can all be served from the index’s leaf nodes), it can retrieve data faster despite having a few extra hoops to jump through.
Update Performance: Clustered indexes can slow down updates, while non-clustered indexes often have little effect on performance.
Storage Space: As non-clustered indexes are stored separately from the table data, they require additional storage. Each non-clustered index is a separate disk structure that stores a sorted array of column values, whereas a clustered index is the actual table data and forms the lowest level of an index.

In the grand scheme of databases — the speed at which data can be retrieved, the efficiency of storage, the quickness of updates — all of these factors rely heavily on proper indexing. Being clear on when and why to use either clustered and non-clustered indexes puts you in control of optimizing your database performance.

Topic: 1.5 Choosing the Right Index

Great job! Now that we know what clustered and non-clustered indexes are and the key differences between them, let’s dive into choosing the right index.

Choosing the right index for performance optimization in MySQL comes down to understanding the queries that will be executed against your database. It’s not just about whether to use clustered or non-clustered indexes, but also includes understanding columns and their cardinality.

Here are a few key points to help you decide:

Choose a Clustered Index for Wide Column Queries: Since clustered indexes are essentially the table data itself, they are excellent for wide column queries because of the reduced number of reads required.
Choose a Non-Clustered Index for Specific Column Queries: Non-clustered indexes are useful when you need to retrieve a smaller subset of columns often. In such cases, creating a non-clustered index on these columns can be beneficial.
High-Cardinality Columns: When a column has a high cardinality (each row is unique), using it as a clustered index can result in quicker look-ups.
Low-Cardinality Columns: For low-cardinality columns (many rows share the same value), usage of non-clustered indexes is generally more efficient.
Data Modification Operations: If your application entails frequent modifications (INSERT, UPDATE, DELETE operations), non-clustered indexes might be a better choice since they don’t impact the physical ordering of the data on disk.
Space Considerations: Since non-clustered indexes are separate disk structures, they consume additional storage space. If storage space is a constraint, clustered indexes might be a better fit, albeit at the cost of speed, in some cases.

Remember, the best strategy is always dictated by the specific workload at hand. It’s essential to continuously monitor and analyze performance and adjust your indexing strategy accordingly.

Topic: 1.6 Examples and Use-cases

Wonderful! Well done so far. To solidify the understanding, let’s look at a few real-world examples and use-cases.

Starting with a basic example, let’s say you directly manage an online bookstore. You have a Books table that contains the following columns: BookID, Title, Author, Genre, Price, and PublicationDate.

Using a Clustered Index: Let’s say customers frequently search books by the BookID in your store. To enhance the speed of these common pull requests, you can use a clustered index on the BookID column. Since a clustered index determines the physical order of data in a table, row look-ups can be significantly faster.
Using a Non-Clustered Index: On the flip side, if customers often look up books by Genre or Author, it could be beneficial to create a non-clustered index on these columns. As mentioned, non-clustered indexes are particularly useful when you need to retrieve a smaller subset of columns, which perfectly fits our case here.

Use-case: Suppose your database has a Customers table storing millions of records, and you frequently need to retrieve customer information based on CustomerID. A clustered index on CustomerID can speed up these look-ups dramatically. However, if business needs require you to retrieve records based not just on CustomerID but also, let’s say, LastName and ZipCode, then non-clustered indexes on LastName and ZipCode can be more efficient.

Keep in mind that these are just examples and the actual implementation may greatly vary depending on factors like data size, query complexity, and hardware capabilities. Understanding when to use clustered and non-clustered indexes — predicated on intelligent database design — is a crucial aspect of managing SQL databases.

Topic: 1.7 Review and Assessments

Amazing progress! Let’s review the main concepts we’ve covered and then move onto some assessments.

Clustered Index: This type of index determines the physical order of data in a table. A table can have only one clustered index.
Non-Clustered Index: This type of index is a separate disk structure referencing the table data, which helps speed up the queries that are not covered by clustered index. A table can have multiple non-clustered indexes.
High vs. Low Cardinality: High-cardinality refers to columns with unique values on most, if not all, rows. Clustered indexes on high-cardinality columns can result in quicker look-ups. Low-cardinality refers to columns where many rows share the same value. Non-clustered indexes are generally more efficient for such columns.
Choosing the Right Index: This depends on various factors including query types, cardinality, data modification needs, and space constraints.

Now, let’s assess our understanding with a few questions:

What is the key difference between a clustered and a non-clustered index?
In which case would a non-clustered index be a more suitable choice over a clustered index?
What do high-cardinality and low-cardinality mean and how do they affect the choice of index type?

Question: What is the key difference between a clustered and a non-clustered index?
Answer: The key difference between a clustered and a non-clustered index lies in the way they store and reference data. A clustered index determines the physical order of data in a table, essentially being the table itself, whereas a non-clustered index is a separate structure that points to the data located elsewhere in the database.

Question: In which case would a non-clustered index be a more suitable choice over a clustered index?
Answer: A non-clustered index would be more suitable when the database needs to support a lot of search queries on columns that are not part of the clustered index. Moreover, non-clustered indexes would also be preferable when the table undergoes frequent changes as changes do not result in the entire table needing to be reorganized, unlike with a clustered index.

Question: What do high-cardinality and low-cardinality mean and how do they affect the choice of index type?
Answer: Cardinality refers to the uniqueness of data values in a column. High cardinality means a column contains a large percentage of totally unique values, and low cardinality means a column contains many repeated values. High-cardinality columns are good candidates for a clustered index, whilst indexes on low-cardinality columns, where the column values are very repetitive, are less effective.

中文文章: https://programmerscareer.com/zh-cn/mysql-interview13/
Author: Wesley Wei – Twitter Wesley Wei – Medium
Note: If you choose to repost or use this article, please cite the original source.

#interview