Making django-filer super fast in 4 steps!

Fabian Braun

June 27, 2023

Turning django-filer into a speedboat!

Django-filer is a file management application for Django that makes handling files and images a “breeze”, or so says its GitHub readme. And indeed, I have been using it in production for years without any problems. The only thing I needed was some patience. Why would it not work faster? I decided to take a look.

I logged the SQL queries when trying to pick an image from a folder with 71 image files. I was shocked: 972 SQL queries to display a directory listing with 71 items?

Here's what helped...

Step 1: Get rid of the tree library

Filer stores uploaded files in a file storage, e.g., the local file system, and creates a folder tree using two Django models: `Folder` and `File`. Trees are a challenge for SQL databases (or at least they were before the advent of common table expressions, CTEs). Typically, a node element (a folder in our case) has a foreign key to itself describing a “parent” relationship. This setup makes it easy to get, say, all files of a specific folder:

File.objects.filter(folder=my_folder)

It is, however, costly to get, say, the full path of a file: you know the file's folder, but you have to query the database to get its parent, then query again for the grandparent, and so on. For a file at nesting level n this requires n+1 database queries. We all learned in Databases 101 that we need to avoid those. Or should we?
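
The cost of walking up the tree can be made concrete with a small sketch. It uses Python's built-in sqlite3 module rather than Django's ORM, and the `folder` table, its columns, and the `full_path` helper are made up for illustration; they are not filer's actual schema.

```python
import sqlite3

# Hypothetical, simplified folder table with a self-referencing parent_id.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE folder (id INTEGER PRIMARY KEY, name TEXT, parent_id INTEGER)"
)
conn.executemany(
    "INSERT INTO folder VALUES (?, ?, ?)",
    [(1, "media", None), (2, "2023", 1), (3, "june", 2), (4, "holiday", 3)],
)


def full_path(folder_id):
    """Walk up the parent chain, issuing one query per nesting level."""
    parts, queries = [], 0
    while folder_id is not None:
        name, folder_id = conn.execute(
            "SELECT name, parent_id FROM folder WHERE id = ?", (folder_id,)
        ).fetchone()
        queries += 1
        parts.append(name)
    return "/".join(reversed(parts)), queries


path, queries = full_path(4)
print(path, queries)
```

With four nesting levels the loop issues four queries. The number grows with the depth of the tree, not with the number of folders.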

There are several remedies out there that mitigate this performance issue, e.g., libraries like django-treebeard or django-mptt. They have in common that they gain reading speed at the expense of writing speed and keeping information redundantly.

What is filer's use case? The tree is “only” a convenience for the user and is only needed in user interaction through the admin site. Users typically do not nest their files excessively. If you have 5 nesting levels, you’ll probably have an issue finding files.

The first optimization is to drop filer's django-mptt dependency and accept the 5 queries you occasionally have to make to get the full path of a file or folder. In return, you save the overhead of keeping tree data in several places. It turns out that plain Django without a tree library performed at least as well as with django-mptt, at least in the tests I did. This is obviously only true for django-filer's specific use case. In other situations, tree libraries are extremely helpful.

Step 2: Optimize thumbnailing

Filer uses the easy-thumbnails package to generate appropriate thumbnails for use on the web page. When confronted with a large image, it will generate smaller thumbnails on demand for, say, listing files in a directory. This happens when the directory template is rendered:

  • First, easy-thumbnails checks whether there is an 80x80 thumbnail available for the image in question.
  • If not, it generates one and stores it in Django’s file storage.
  • Then it renders the <img> tag with a reference to this thumbnail file. Later, the user’s browser will request it from the server to display it.

Since this happens when rendering the template, it is part of the request-response cycle. Imagine you have just uploaded 100 high-resolution images, each 20 MB in size, and want to list the directory: filer would ask for an 80x80 thumbnail to be generated for each of those 100 images. This consumes huge amounts of CPU time and memory. The user will have to wait, if the request does not time out first.

easy-thumbnails does support pre-emptive thumbnail generation, which many projects hand off to a dedicated Celery queue. Still, right after an upload it is unlikely that all those jobs have already been processed.

The solution is to take the thumbnail generation out of the request-response cycle:

  • Only check if there is a thumbnail already and if so, provide its URL to the <img> tag.
  • If the thumbnail is not available, provide a URL of a new Django admin view that will generate a single required thumbnail and redirect to it upon completion.
  • Finally, tell the browser to load those thumbnails lazily, i.e. only when they appear in the viewport. This will initially reduce the number of requests for thumbnails to, say, 20 and add a natural load balancing.
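
The three points above might look roughly like this in code. This is a simplified, framework-free sketch and not filer's actual implementation: the function `thumbnail_img_tag`, the `existing_thumbnails` mapping, and the `/admin/filer/file/<id>/thumbnail/` URL pattern are all hypothetical.

```python
from html import escape


def thumbnail_img_tag(file_id, existing_thumbnails, size=80):
    """Build an <img> tag without ever generating a thumbnail inline."""
    # Only look up whether a thumbnail is already known for this file and size.
    url = existing_thumbnails.get((file_id, size))
    if url is None:
        # Hypothetical on-demand view: it would create the one missing
        # thumbnail and then redirect the browser to the finished file.
        url = f"/admin/filer/file/{file_id}/thumbnail/?size={size}"
    # loading="lazy" asks the browser to fetch the image only when it
    # gets close to the viewport.
    return (
        f'<img src="{escape(url)}" width="{size}" height="{size}" loading="lazy">'
    )


existing = {(1, 80): "/media/thumbs/cat.jpg__80x80_q85.jpg"}
print(thumbnail_img_tag(1, existing))  # thumbnail exists: link to it directly
print(thumbnail_img_tag(2, existing))  # thumbnail missing: link to the view
```

Rendering the listing now costs a dictionary lookup per file. The expensive image work only happens when the browser actually requests a missing thumbnail.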

Step 3: Optimize database queries

So far, so good. Now, with the thumbnails out of the way, we can start optimizing the directory_listing view. Looking at the generated SQL queries with django-debug-toolbar, it turns out that easy-thumbnails queries the database for each file in a folder to check whether the thumbnail exists. While folder structures are typically not deeply nested, people often keep many images in a single folder. This implies 1,000 database hits before a folder with 1,000 files can be displayed.

Typically, you can avoid this by using the select_related or prefetch_related methods on querysets to prefetch, say, all thumbnails referring to a file. This requires a foreign key from the thumbnail to the file. Unfortunately, easy-thumbnails has no such key to filer’s models: they are independent packages.

Django’s excellent ORM comes to the rescue: you can annotate queries even with data from totally different models if you use a Subquery. easy-thumbnails identifies the source of a thumbnail by its filename:

from django.db.models import OuterRef, Subquery
from easy_thumbnails.models import Thumbnail

thumbnail_qs = (
    Thumbnail.objects
    .filter(
        source__name=OuterRef("file"),
        modified__gte=OuterRef("modified_at"),
    )
    .exclude(name__contains="upscale")  # Heuristic: filer thumbnails do not use upscaling
    .order_by("-modified")  # Out of all thumbnails of the relevant size, take the newest
)

file_qs = file_qs.annotate(
    thumbnail_name=Subquery(thumbnail_qs.filter(name__contains=f"__{size}_").values_list("name")[:1]),
    thumbnailx2_name=Subquery(thumbnail_qs.filter(name__contains=f"__{size_x2}_").values_list("name")[:1])
).select_related("owner")

What does this do? thumbnail_qs is the subquery. It selects the thumbnail objects whose source has the same filename as the file (OuterRef("file")) and which are at least as new as the filer object (OuterRef("modified_at")).

In the second statement, the file queryset is annotated with two file names: first, a thumbnail of the required size, and second, a thumbnail of twice the size (for retina displays). Each annotation takes one field from the subquery (name, the filename of the thumbnail). Since only one value can be annotated but the subquery can match multiple rows, I use a heuristic, a rule that holds in most cases but potentially not in all: just take the newest thumbnail and exclude thumbnails that use the upscale option of easy-thumbnails (since filer does not use it).

And voilà: Only one SQL query will return not only all files in a folder, but also the file names of their thumbnails if they exist.
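
For readers who think in SQL, the annotation corresponds roughly to a correlated subquery in the SELECT list. The sketch below reproduces the idea with Python's built-in sqlite3 module. The table and column names are simplified stand-ins chosen for illustration, not the schema that filer or easy-thumbnails actually creates, and the timestamps are plain integers.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE file (id INTEGER PRIMARY KEY, file TEXT, modified_at INTEGER);
    CREATE TABLE thumbnail (
        id INTEGER PRIMARY KEY, source_name TEXT, name TEXT, modified INTEGER
    );
    INSERT INTO file VALUES (1, 'cat.jpg', 10), (2, 'dog.jpg', 10);
    INSERT INTO thumbnail VALUES
        (1, 'cat.jpg', 'cat.jpg__80x80_q85.jpg', 11),
        (2, 'cat.jpg', 'cat.jpg__80x80_q85_crop.jpg', 12),
        (3, 'cat.jpg', 'cat.jpg__80x80_q85_upscale.jpg', 13),
        (4, 'cat.jpg', 'cat.jpg__160x160_q85.jpg', 11);
""")

# One query: every file plus the name of its newest matching thumbnail.
# instr() is used instead of LIKE because "_" is a wildcard in LIKE patterns.
rows = conn.execute("""
    SELECT f.file,
           (SELECT t.name
              FROM thumbnail AS t
             WHERE t.source_name = f.file
               AND t.modified >= f.modified_at
               AND instr(t.name, 'upscale') = 0
               AND instr(t.name, '__80x80_') > 0
             ORDER BY t.modified DESC
             LIMIT 1) AS thumbnail_name
      FROM file AS f
     ORDER BY f.id
""").fetchall()
print(rows)
```

The file without any thumbnail still appears in the result, with `None` as its thumbnail name, which is exactly the case where the listing has to fall back to generating one.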

As a side remark, note the select_related call at the end: it ensures that the owner objects are fetched in the same query, since we want to display the owner’s name.

Step 4: Optimize file access

Now that the number of database queries is down to essentially one, the directory listing should be fast, right? So, I start up the server and hit enter, and need to wait. But why?

While some setups store the uploaded data in a dedicated folder on the local file system, many store it on dedicated file servers such as AWS S3. So, while file access might be fast on your test system, on many production systems it requires its own request-response cycle with a file server. This may be costly.

It turns out that filer’s directory_listing view checked whether the actual file behind each file object exists before showing its thumbnail. This is useful if someone moved the file within the file system but, for some reason, did not update the filer database. The check is no problem on local file storage, but if files reside on a file server at the other end of the world, it is nearly as costly as fetching the file itself.

The fix is easy: Do not check if files still exist for non-local file storage. Assume they do and deal with missing files not in the directory_listing, but when looking at a specific file.
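
A minimal sketch of that rule, using stand-in classes instead of Django's storage backends: `LocalStorage`, `RemoteStorage`, the `is_local` flag, and `file_is_listed` are hypothetical names for illustration, and the counter only simulates the network round trips a remote `exists()` call would cost.

```python
class LocalStorage:
    """Stand-in for a storage on the local file system (cheap checks)."""

    is_local = True

    def __init__(self, names):
        self.names = set(names)

    def exists(self, name):
        return name in self.names


class RemoteStorage:
    """Stand-in for a remote file server (every check is a round trip)."""

    is_local = False

    def __init__(self, names):
        self.names = set(names)
        self.round_trips = 0

    def exists(self, name):
        self.round_trips += 1
        return name in self.names


def file_is_listed(storage, name):
    """Verify existence only where the check is cheap."""
    if storage.is_local:
        return storage.exists(name)
    # Remote storage: assume the file is there. A missing file is dealt
    # with later, when the user opens that specific file.
    return True


local = LocalStorage(["a.jpg"])
remote = RemoteStorage(["a.jpg"])
print(file_is_listed(local, "missing.jpg"))
print(file_is_listed(remote, "missing.jpg"), remote.round_trips)
```

The trade-off is deliberate: the listing may occasionally show an entry whose file has vanished from remote storage, in exchange for not paying one round trip per file.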

Result

Now, I can easily list 1,000 files within a single directory without noticeable delay. What a change! The improved django-filer is available with version 3.0 - together with many other great improvements. 

At the time of writing, django-filer 3.0 is available as a release candidate. Download and test it using
pip install "django-filer>=3.0.0rc1"

My key takeaways are:

  • Do not blindly rely on expert libraries. Check your use case instead. Sometimes they are more overhead than benefit.
  • Take computation time out of the request-response cycle as much as possible. If that is not fully possible, at least spread it over many requests.
  • Learn about advanced Django ORM features. They can be extremely helpful.
  • Do not assume files are local and fast to access.

Please let me know what you think in the comments!
