Herding your Database Queries in Django

The Django ORM lets you seamlessly represent database (DB) data as instances of Python classes (models). This includes relationships between objects: when a model (A) has a foreign key relationship (say, a field named 'related_b') to another model (B), all you need to do to navigate from an instance of A (a) to the related instance of B (b) is "a.related_b". By default, this fetches "b" from the database, which is both amazingly convenient and terrifying. I say terrifying with some degree of jest, but also with plenty of seriousness that will be appreciated by those of us who know what it's like to deal with a production database that is melting (struggling) under load.

While Django provides ways of dealing with this, such as select_related and prefetch_related, as well as cached properties, the need for intervention is often not apparent until said database starts melting. In an application of any complexity, it is advisable to have a means of diagnosing and guarding against excessive queries at some granular level, such as per request.

We learned the hard way

In the run-up to one of our production releases, we (at PathAI) started load testing our application and noticed significant performance problems, primarily caused by serializers aggregating information from different related objects. While relatively easy to write, the code represented one of the typical trade-offs in software development: convenience vs. control. It is convenient when the framework takes your "a.related_b" and runs a query under the hood, relieving you of that burden. But then it isn't easy for you to see and control that behavior. Sprinkle that inside a few for-loops, or serializers that are applied to lists of objects, and you get thousands of queries per request in some cases. What's worse, things will get even slower over time as you accumulate more rows in your tables. User experience will be compromised at best, or your app will be at risk of a cascading failure when the DB can't keep up with the load. Luckily, we had some time to debug and address these issues. The rest of this post shares how we implemented request-level query tracking. You will find a specific recipe and a pointer to a reference implementation that will help you "herd" your queries such that they do their job without stampeding and stomping all over your database :-).

Some examples to illustrate the issues

For demonstration purposes, let’s assume we have the following models used to represent books in a library that can be borrowed:

 

from django.db import models


class Author(models.Model):
    name = models.TextField()


class Book(models.Model):
    title = models.TextField()
    author = models.ForeignKey(Author, on_delete=models.CASCADE)
    published_date = models.DateTimeField(auto_now_add=True)


class User(models.Model):
    name = models.TextField()


class PhysicalBook(models.Model):
    book = models.ForeignKey(Book, on_delete=models.CASCADE,
                             related_name='physical_books')
    borrowed_at = models.DateTimeField(null=True, blank=True)
    due_by = models.DateTimeField(null=True, blank=True)
    borrowed_by = models.ForeignKey(User, null=True, blank=True,
                                    on_delete=models.SET_NULL)

 

With the above code, here are some things that you can easily do with the ORM:

 

# Who borrowed this book?
physical_book.borrowed_by.name

# Who wrote the borrowed book?
physical_book.book.author.name

# Are there any copies of a book available?
class Book(models.Model):
    ...
    @property
    def num_copies_available(self):
        return self.physical_books.filter(borrowed_at__isnull=True).count()

 

With Django REST Framework, you can apply the above to surface the data in a RESTful API response via a Serializer implementation:

 

class BookSerializer(serializers.ModelSerializer):
    num_copies_available = serializers.ReadOnlyField()
    author_name = serializers.ReadOnlyField(source='author.name')

    class Meta:
        model = Book
        fields = [<other fields>, 'num_copies_available', 'author_name']

 

What you get in your response might look like the JSON below (a hypothetical example; the values are illustrative):
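[
    {
        "title": "The Hobbit",
        "author": 1,
        "author_name": "J. R. R. Tolkien",
        "num_copies_available": 2
    }
]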

 

It is important to realize that unless optimized, the above will result in two DB queries being run for each instance of Book rendered through this serializer: one for the author and one for the count of physical books. And while these queries are fairly inexpensive, since they filter on primary and foreign keys, the extra overhead should not be dismissed easily as you scale your application. Moreover, it is easy to imagine something more expensive that is just as easy to write:

 

class Book(models.Model):
    ...
    @property
    def time_until_available(self):
        if self.num_copies_available > 0:
            return datetime.timedelta(0)
        # find the copy that is due back soonest
        soonest_due = self.physical_books.order_by("due_by").first()
        if soonest_due:
            return soonest_due.due_by - \
                datetime.datetime.now(datetime.timezone.utc)
        # no physical copies exist at all; effectively "never"
        return datetime.timedelta.max

 

And then this property, exposed in the serializer, will add two more queries per book, and the order_by can make them slower in the absence of an index on due_by.

In the case of queries resulting from the use of serializers on a list of objects of the same kind, one possible saving grace is paginating your responses, thus reducing the queries per request to (page size) * (queries per serializer). But even then, depending on the number of requests you are getting, you can still inflict a lot of unnecessary load on your database, not to mention slow down your response times as a result.

Arguably, one might say all these properties should not be exposed in a single serializer, and instead there should be more purpose-built APIs. That's a debate for a different day, but all I will say for now is that while the example above is hypothetical, in real life we have seen clear-cut examples of complex data from many related objects that logically made sense to render together in the same response.

It is also important to understand that besides the cost of making the database call (wait time + database load), there is the cost of instantiating one or more model instances in the memory of your backend.

Diagnosing the causes

As you can see from the above examples, queries can spring up from lots of expected and unexpected places, and depending on the execution flow, we can end up with lots of them.  Finding where they are coming from is not always obvious, though there are some commonly known options:

  • django-debug-toolbar is a great option if you are serving fairly simple static pages. If you have a dynamic front end, such as React or Vue, backed by a Django-based RESTful API, it is not going to help directly. Also, even if it can help in interactive investigations, there is a case to be made for metrics that can be recorded and monitored over time.
  • django-silk is similar to the above and has similar limitations.
  • You can log each query by cranking up the django.db.backends log level to DEBUG (see the sketch after this list), but that's a lot of unstructured output that you'd have to sift through.
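
For the last option, here is a minimal LOGGING sketch for settings.py. Note that Django only emits SQL to the django.db.backends logger when settings.DEBUG is True:

# settings.py (sketch): log every SQL statement -- very verbose
LOGGING = {
    "version": 1,
    "disable_existing_loggers": False,
    "handlers": {"console": {"class": "logging.StreamHandler"}},
    "loggers": {
        "django.db.backends": {
            "level": "DEBUG",
            "handlers": ["console"],
        },
    },
}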

The suggestion here instead is to take advantage of a couple of open-source ingredients to cook up a custom Django middleware that uses query wrapping to capture the information we seek.

A fairly straightforward reference implementation can be found at https://github.com/Path-AI/django_request_stats_example/blob/main/req_stats/middleware.py, along with a simple Django project that illustrates its use, the schema being the "lending library" described above.

The key open-source pieces here are:

  • The concept of middleware that can wrap requests
  • django.db.connections.all() – a way to iterate through all Django DB connections
  • execute_wrapper – a Connection method for instrumenting DB calls; it returns a context manager that installs the wrapper for its duration
  • ExitStack – a stack of contexts that can be collected and cleaned up all at once on exit

With the above, we can create:

  • Our own middleware implementation – in a nutshell, middleware allows you to wrap your request execution in a callable object where you can do all sorts of interesting things, including metrics collection.
  • QueryStats – our custom callable object that collects the data on each DB call

The core part of the middleware looks like this:

query_stats = QueryStats(
    track_details=self.detailed_db_query_diagnostics_active)
with ExitStack() as stack:
    # wrap every configured DB connection so each query goes through query_stats
    for conn in connections.all():
        stack.enter_context(conn.execute_wrapper(query_stats))
    response = self.get_response(request)
# ... log the collected stats here, then:
return response
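
For context, here is a minimal sketch of what such a QueryStats callable could look like (illustrative, not the actual reference implementation; the wrapper signature is the one Django passes to execute_wrapper):

import time
import traceback
from collections import Counter

class QueryStats:
    """Counts and times DB queries; an instance is passed to execute_wrapper()."""

    def __init__(self, track_details=False):
        self.track_details = track_details
        self.query_counts = Counter()  # sql -> number of executions
        self.stack_counts = Counter()  # (sql, call stack) -> number of executions
        self.total_time = 0.0          # seconds spent in DB calls

    def __call__(self, execute, sql, params, many, context):
        # Django invokes this wrapper instead of running the query directly;
        # calling execute() performs the actual database call.
        start = time.monotonic()
        try:
            return execute(sql, params, many, context)
        finally:
            self.total_time += time.monotonic() - start
            self.query_counts[sql] += 1
            if self.track_details:
                stack = "".join(traceback.format_stack())
                self.stack_counts[(sql, stack)] += 1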



The exact mechanics of how QueryStats collects the data and how the middleware outputs the logging message can be examined in detail in the code, but in summary:

  • Each unique query is counted
  • We measure the time the query took (simple “wall clock” but it’s a good enough proxy)
  • Optionally (a flag can be provided via settings), within each unique query, each unique call stack is also counted

The logger outputs:

  • Total time for the request
  • Number of queries in the request
  • Total time it took to execute all the queries in the request
  • If detailed query information is being collected, then a breakdown for each unique query:
    • Number of times it was run
    • Each unique call stack initiating that query and how many of the DB-calls each stack is responsible for

Example high level output:

Received request GET /library/books/, status 200, db_query_count=29, db_query_time_ms=136.421, duration_ms=192.56

 

The above already has some important information: it tells us that 29 queries took place. Whether that is reasonable or not is something where you need to exercise judgment, and ideally it is something you want to examine right after the initial implementation, while it's all fresh in your mind.

Ask yourself:

  • How many objects am I expected to be reading from the database here per endpoint call?
  • What’s the minimum number of queries that might be needed?
  • How many queries can I afford given the expected load and sizing of my infrastructure?
  • How does this scale with the increase in the number of objects and increases in traffic?

The detailed output below can help in understanding the source of the queries and deciding on the course of action for reducing them.

Example excerpt of the detailed output (redacted for space, only one query mentioned):

Detailed DB query info:

24 instances of the following query:

SELECT COUNT(*) AS "__count" FROM "library_physicalbook" WHERE ("library_physicalbook"."book_id" = %s AND "library_physicalbook"."borrowed_at" IS NULL)

This code location accounted for 12 queries:


<redacted... >


  File "... venv_for_django_con/lib/python3.9/site-packages/rest_framework/fields.py", line 97, in get_attribute

    instance = getattr(instance, attr)

  File "...django_request_stats_example/library/models.py", line 22, in num_copies_available

    return self.physical_books.filter(borrowed_at__isnull=True).count()


<redacted... >

 

The above is an excerpt that only contains one query and a couple of the most relevant call-stack lines, and perhaps additional filtering options should be added to the actual implementation for ease of use; but suffice it to say that with just a bit of scrolling and squinting it is fairly easy to find the offending call here. A fairly obvious tip: if the same query is executed many times, it is a clear candidate for optimization.

Approaches to query reduction

Frankly this deserves its own post with examples, but briefly here are some ideas you should consider:

  • select_related – retrieves related objects in one shot via a SQL join
  • prefetch_related – similar to the above, except it runs a separate query for each relationship and stitches the results together in Python
  • when dealing with complex serializers, retrieve the data in the endpoint before calling the serializers (utilizing joins/select_related/prefetch_related as necessary), then pass it to the serializers via context, as sketched after this list
  • @cached_property – persists a property's value on the instance. Use this with care, and note that it's usually not the best remedy: the cached value can become stale, and it does not solve the problem of running a query for different instances; all it can do is save the re-query/re-computation of a single property.
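
A minimal sketch of the context-passing idea (the name available_counts and the field list are made up for illustration): the view computes per-book availability in a single aggregate query, and the serializer reads from context instead of querying per book:

from django.db.models import Count
from rest_framework import serializers

class BookSerializer(serializers.ModelSerializer):
    num_copies_available = serializers.SerializerMethodField()

    class Meta:
        model = Book
        fields = ['title', 'num_copies_available']

    def get_num_copies_available(self, obj):
        # populated in bulk by the view; no per-book query here
        return self.context['available_counts'].get(obj.pk, 0)

# In the view: one aggregate query covers the whole page of books
available_counts = dict(
    PhysicalBook.objects.filter(borrowed_at__isnull=True)
    .values_list('book_id')
    .annotate(n=Count('id'))
)
serializer = BookSerializer(
    books, many=True, context={'available_counts': available_counts})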

Example of optimization with select_related + prefetch_related

 

class BookViewSet(viewsets.ModelViewSet):
    queryset = Book.objects.all().order_by('title')
    serializer_class = BookSerializer

 

With the serializer we described, we get (page size) * 2 queries, because num_copies_available and ReadOnlyField(source='author.name') each invoke a separate query.

The first step:

 

class BookViewSet(viewsets.ModelViewSet):
    queryset = Book.objects.all() \
        .select_related('author').order_by('title')
    serializer_class = BookSerializer

 

This takes care of author.name, but not the other field.  So we still have a linear dependency with (page size) * 1 queries.

The next steps:

 

class BookViewSet(viewsets.ModelViewSet):
    queryset = Book.objects.all() \
        .select_related('author') \
        .prefetch_related('physical_books') \
        .order_by('title')
    serializer_class = BookSerializer


class Book(models.Model):
    ...
    @property
    def num_copies_available(self):
        # count in Python over the prefetched objects instead of querying
        return len([pb for pb in self.physical_books.all()
                    if pb.borrowed_at is None])

 

With all that, the number of queries no longer grows with the number of books we return!

Note that the property itself had to be modified. `filter().count()` will always invoke a query, regardless of which related objects are already in memory, whereas iterating over `physical_books.all()` reuses the prefetched objects, so counting in Python avoids the extra query.

As is often the case with software choices, the above approach represents another trade-off: we retrieve a page's worth of related objects into memory with the original queries, so one has to beware of that too. At some point, if there are too many of those objects or they are heavy, the trade-off may no longer be worth it, and you'd need to start looking at other optimization techniques.

It’s all better now, are we done yet?

No rest for the weary.  If you want to be sure your code does not regress, you should install some guardrails against additional queries being introduced.  

You might have already guessed: unit tests are your best bet here. You will need real DB operations (@pytest.mark.django_db) and the django_assert_max_num_queries fixture to verify that a given scope (usually an endpoint) does not exceed a limit; a sketch follows below. Of course, there can be legitimate reasons why the number of queries may grow, but at least you will be aware of it happening, and most importantly you will prevent linear (or worse) runaway increases.
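
For illustration, a minimal pytest-django sketch (the URL and the query budget of 5 are hypothetical):

import pytest

@pytest.mark.django_db
def test_book_list_query_budget(client, django_assert_max_num_queries):
    # fail the test if the endpoint issues more than 5 DB queries
    with django_assert_max_num_queries(5):
        response = client.get('/library/books/')
    assert response.status_code == 200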

But what if a new endpoint is added that isn't covered by a test like this? You can use coverage contexts to mark the unit tests that guard query counts, and then conduct periodic test-coverage reviews to see which endpoints or other scopes are not exercised within those contexts; a configuration sketch follows below.
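
A minimal sketch, assuming coverage.py's dynamic contexts feature (the report can then show which test covered each line, e.g. via coverage html --show-contexts):

# .coveragerc (sketch): record which test function executed each line
[run]
dynamic_context = test_function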

Finally, you may want to instrument your code for operational use so that you can measure, monitor, and alert on some of the data that can now be logged. Specifically, the high-level per-request summary can be streamed to something like Kibana, where you can see your DB query count and DB time under real usage (and/or in test environments if that helps). Depending on your operational posture and amount of traffic, you can set up real-time alerting on high counts or high DB time, or simply conduct regular reviews of something like the 50th or 95th percentile per endpoint. Some great ideas on metrics and alerting can be found in the "Four Golden Signals" for monitoring systems.

Acknowledgements

My huge gratitude goes to the Django community for the excellent code and docs that helped in coming up with the solutions and describing these recommendations.

I am also very grateful to PathAI colleagues for their innovative ideas, as well as for helping refine this post.