Tools: Djangonaut diaries, week 3 - Working on an ORM issue

What is an index, anyway? Why not index everything?

How indexes are used

What about M2M relations, composite indexes and unique constraints?

Back to the ticket

Is Django really producing the extra index?

The PR from the previous week is currently being reviewed, with some great suggestions from one of the Django Fellows, Jacob Walls. In the meantime, I picked up one of the issues suggested by the navigator for this cohort, Akash: an ORM optimization issue titled "Unnecessary creation of index for ManyToManyField". In summary, when Django creates the migrations for a many-to-many "through" table, it creates a database index which may be redundant.

If you understood every word in that sentence and the ticket title, maybe this blog post is not for you; you can wait for the next one, in which I actually fix the issue. I won't judge you. What? You're still here? Great!! Let's dig into what database indexes are, how the most common index for relational databases works (the balanced tree, aka b-tree), what many-to-many (m2m) relations and junction tables (also called through-tables) are, and finally how "unique" indexes for m2m tables work. Knowing the basics of all that, we will have the tools to work on a proper fix. There's no point trying to fix something we don't really understand, right? ;)

If you have a solid grasp of b-tree indexes and just want to jump to the Django-related part, in which I demonstrate the issue, scroll down to the section "Back to the ticket" ;)

So, we all know indexes from books: if I want to find a topic or a chapter in a book, instead of reading every page, I can check the index at the beginning or the end of the book, search for the topic there, get the page number, and jump straight to that page. In a way, it's a trade-off: we make the book a bit heavier and more expensive to print (more pages) in exchange for making it more practical to use.

Database indexes also help us find data quickly: they're shortcuts that avoid table scans, which is when the database engine reads every row in a table to find what was requested.
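To make the trade-off concrete, here's a toy Python sketch (illustrative only, with made-up titles and rowids): a table scan checks every row one by one, while a sorted "index" lets us binary-search in O(log N), which is roughly what a b-tree lookup does.

```python
import bisect

# A tiny "table": rows in insertion order (no useful ordering for searching).
table = [(7, "High Performance Django"), (88, "Two Scoops of Django"),
         (12, "Django for APIs"), (41, "Fluent Python")]

# A b-tree-ish "index": entries kept sorted by title, pointing back at rowids.
index = sorted((title, rowid) for rowid, title in table)

def table_scan(title):
    # Reads every row until it finds a match: O(N).
    for rowid, row_title in table:
        if row_title == title:
            return rowid
    return None

def index_lookup(title):
    # Binary search over the sorted entries: O(log N).
    pos = bisect.bisect_left(index, (title,))
    if pos < len(index) and index[pos][0] == title:
        return index[pos][1]  # the rowid pointing back at the table row
    return None

print(table_scan("Two Scoops of Django"), index_lookup("Two Scoops of Django"))
```

Both lookups find the same rowid; the difference is how much work they do to get there, which is exactly what the b-tree below buys us.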
An index is a separate data structure from the main table, optimized for speed and size. The most common type is the B-Tree (Balanced Tree), which maintains a consistent depth so that searching for any record takes roughly the same amount of time. Just like book indexes, there's a trade-off here: to make data retrieval faster, we use a bit more disk space and also make write operations a bit slower, since every write must update not only the table but also the affected indexes. In other words, indexes are not "free", and having unnecessary indexes is a bad idea.

In the diagram below, the database is searching for the book "Two Scoops of Django" (an excellent book, btw!). It follows a specific logical path through the pages:

1. The search starts at the Root Page. The engine compares "Two Scoops" against the keys J and S. Since "T" comes after "S", it ignores the first two branches and follows the pointer to the third.
2. It lands on an Internal Page for the S-Z range. It sees another split and determines the title must be in the S-U section.
3. Finally, it arrives at the Leaf Page, which contains the actual index entries. It finds "Two Scoops of Django" and its associated rowid, 88, which is where the record is physically located on disk, so the row can be retrieved for the user.

By traversing this tree, the database reads only a few pages from disk instead of scanning the entire table. For those who remember big-O notation, that's O(log N) instead of O(N), which is much better :D

Many-to-many relations, or M2M for short, are relationships where both parties involved can occur several times. For example, if you have a table for Books and a table for Authors, you need a third table to link them: a book can have one or more authors, and an author can have written one or more books. This middle table is called a through-table, or junction table, or a join table, or Larry (just kidding, not Larry).
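As a toy illustration (with made-up books and authors), the junction table is just a list of (book_id, author_id) pairs, and it lets us answer the relationship in both directions:

```python
# Made-up sample data for a books/authors M2M relationship.
books = {1: "REST APIs", 2: "Async Python"}
authors = {10: "Ana", 20: "Bruno"}

# The through-table (aka junction table): one row per (book, author) pair.
book_authors = [(1, 10), (1, 20), (2, 20)]

def authors_of(book_id):
    # All authors linked to one book.
    return [authors[a] for b, a in book_authors if b == book_id]

def books_of(author_id):
    # All books linked to one author.
    return [books[b] for b, a in book_authors if a == author_id]

print(authors_of(1))   # book 1 has two authors
print(books_of(20))    # author 20 wrote two books
```

Neither the Book table nor the Author table could store this by itself without duplicating rows; that's the whole point of the through-table.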
One rule we usually want for our through-table is to enforce that the pairs (book_id, author_id) are unique: there is no reason to record twice that a specific book has a specific author! To enforce this rule (aka constraint) efficiently, a composite index combining both IDs is needed. The database checks this index before every insert or update to ensure the pair doesn't already exist. It's the same B-Tree structure we saw before, but instead of sorting by a single value, it sorts by tuples of (book_id, author_id).

The order of columns in this index is critical. Because the B-Tree is sorted, the intermediate pages first split the data by the first value, book_id. If there are multiple authors for the same book, it then splits those by the second value, author_id. In the database world we call the first the "leftmost" column, as it's the most important of the pair, for the reasons we'll explain now.

Check the diagram below: when we search for book ID 10 and author ID 5, it first finds the book, then its authors. So whenever I try to associate an author with a book, the database engine can use this index to quickly check if the relation already exists, and if so raise an error. From this arrangement we may note two things:

- If I want to find all authors for a given book ID, this index is good enough: I just collect all entries once I narrow down to that book ID.
- On the other hand, if I want to quickly search for books related to a specific author, this index won't help me at all. In this case, I would need a separate index on this table for author_id.

Ooookay, I hope you're still with me after this rabbit hole :D Onto our ticket: it says that for M2M relations, Django is creating three indexes:

- One for the first entity ID
- One for the second entity ID
- One for the pair (the unique constraint index)

From the explanation above, you can agree with the creator of the ticket that the first index is redundant, right? But let's see if this is actually true in practice: would a real database engine optimize and just use the composite index if I search by book ID? To check, we can use a nice tool (mentioned in the ticket comments, which is how I learned about it) called SQL Fiddle.
This tool allows us to provide code for tables and queries, pick a database engine, and, by running a special SQL command called EXPLAIN QUERY PLAN, see how the database would perform the query: its "query plan". Very handy, especially because SQL Fiddle supports all relational databases supported by Django (SQLite, Postgres, MySQL, Oracle, MariaDB).

This is the test we did there. Note that, just like the issue described, we created three indexes: one for each column, plus the composite index (the unique constraint).

...and the result is... drumroll...

Okay! So we reproduced the issue. We searched by book ID and SQLite rightly picked the composite index. If we use SQL Fiddle to switch to Postgres, Oracle and MariaDB, the same will happen (try it!)

If we search by author, on the other hand, it will use the specific index for author_id.

Finally, it's good to "check the plug": is Django really creating three indexes, including the extra one for the leftmost column? If not, you may ask me why I made you read such a long article... well, at least you didn't have to write it!

To test on Django, I created the models. Note that the m2m relation is defined in the Book model. Then I ran the command to create the migration, and used manage.py sqlmigrate to see the SQL commands that will produce the tables and indexes.

There it is! The redundant index on book_id, named blog_book_authors_book_id_35eae5ed. Now that we know that:

- Django creates three indexes for m2m through-tables, including one for the leftmost column
- Database engines like SQLite, Postgres etc. optimize usage of indexes and will prefer the composite index when possible

...we can move on and find a fix to prevent the extra index from being created! But that's for the next post, soon ;)

PS: If you want to know a bit more about other types of indexes, I have some slides from a talk I gave at Python Cerrado (a regional conference here in Brazil) in 2024. That's how I got so nerdy about b-trees etc.: to have a reason to travel and meet more Pythonistas ;)

PPS: Yes, I used AI to draw these Mermaid diagrams, I'm not that good at Mermaid.
And I learned that dev.to does not support Mermaid either :(


```sql
-- 1. Create the main tables
CREATE TABLE Book (
    ID INTEGER PRIMARY KEY AUTOINCREMENT,
    Title TEXT
);

CREATE TABLE Author (
    ID INTEGER PRIMARY KEY AUTOINCREMENT,
    Name TEXT
);

-- 2. Create the M2M through-table
CREATE TABLE Book_Authors (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    book_id INTEGER,
    author_id INTEGER,
    FOREIGN KEY (book_id) REFERENCES Book(ID),
    FOREIGN KEY (author_id) REFERENCES Author(ID)
);

-- 3. Add the indexes
CREATE INDEX idx_book_author_book_id ON Book_Authors(book_id);
CREATE INDEX idx_book_author_author_id ON Book_Authors(author_id);
CREATE UNIQUE INDEX idx_book_author_unique_book_id_author_id ON Book_Authors(book_id, author_id);

-- 4. Gather statistics
ANALYZE;

-- 5. THE TEST
EXPLAIN QUERY PLAN SELECT * FROM Book_Authors WHERE book_id = 10;
```

The query plan when searching by book_id:

```
QUERY PLAN
`--SEARCH Book_Authors USING COVERING INDEX idx_book_author_unique_book_id_author_id (book_id=?)
```

And when searching by author_id:

```sql
EXPLAIN QUERY PLAN SELECT * FROM Book_Authors WHERE author_id = 5;
```

```
QUERY PLAN
`--SEARCH Book_Authors USING INDEX idx_book_author_author_id (author_id=?)
```
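The same experiment can also be run locally with Python's built-in sqlite3 module instead of SQL Fiddle. This is just a sketch with shortened, made-up index names: it checks which index SQLite's planner picks for a book_id-only lookup, and also shows the unique composite index doing its job of rejecting a duplicate pair.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Book_Authors (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    book_id INTEGER,
    author_id INTEGER
);
CREATE INDEX idx_book ON Book_Authors (book_id);
CREATE INDEX idx_author ON Book_Authors (author_id);
CREATE UNIQUE INDEX idx_pair ON Book_Authors (book_id, author_id);
ANALYZE;
""")

# Ask the planner which index it would use for a book_id-only search.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM Book_Authors WHERE book_id = 10"
).fetchall()
print(plan[0][-1])  # the "detail" column of the query plan

# The unique composite index rejects a duplicate (book_id, author_id) pair.
conn.execute("INSERT INTO Book_Authors (book_id, author_id) VALUES (10, 5)")
try:
    conn.execute("INSERT INTO Book_Authors (book_id, author_id) VALUES (10, 5)")
except sqlite3.IntegrityError as exc:
    print("duplicate pair rejected:", exc)
```

On my reading of the SQL Fiddle result above, SQLite should report a SEARCH using the composite (covering) index here too, since the index contains every column the query needs.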
```python
class Author(models.Model):
    name = models.CharField(max_length=100)

    def __str__(self):
        return self.name


class Book(models.Model):
    title = models.CharField(max_length=200)
    authors = models.ManyToManyField(Author, related_name="books")
    published_date = models.DateField()

    def __str__(self):
        return f"{self.title}"
```

```shell
> python manage.py makemigrations
Migrations for 'blog':
  blog/migrations/0003_author_book.py
    + Create model Author
    + Create model Book
```

```sql
... create table etc
CREATE UNIQUE INDEX "blog_book_authors_book_id_author_id_0a5bb3b3_uniq" ON "blog_book_authors" ("book_id", "author_id");
CREATE INDEX "blog_book_authors_book_id_35eae5ed" ON "blog_book_authors" ("book_id");
CREATE INDEX "blog_book_authors_author_id_fa034e3d" ON "blog_book_authors" ("author_id");
```
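If you want to double-check the result outside of Django, here's a hedged sqlite3 sketch (index names shortened, not Django's real generated names) that recreates the three through-table indexes and lists them with PRAGMA index_list:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Recreate the same three indexes Django generates for the through-table.
conn.executescript("""
CREATE TABLE blog_book_authors (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    book_id INTEGER,
    author_id INTEGER
);
CREATE UNIQUE INDEX pair_uniq ON blog_book_authors (book_id, author_id);
CREATE INDEX book_idx ON blog_book_authors (book_id);
CREATE INDEX author_idx ON blog_book_authors (author_id);
""")

# PRAGMA index_list reports every index on the table:
# (seq, name, unique flag, origin, partial flag) per row.
for row in conn.execute("PRAGMA index_list('blog_book_authors')"):
    print(row)
```

You should see exactly three indexes, one of them flagged as unique: the same situation the sqlmigrate output shows, redundant book_id index included.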