Django multitenancy: choosing the architecture for your SaaS application

Filipe Ximenes
September 13, 2017

Picture a healthtech startup building a patient management platform that serves multiple healthcare providers. Each provider needs to securely store and access their patients' data while complying with standards like FHIR and regulations like HIPAA and GDPR. For this startup, implementing multitenancy in Django isn't just good practice; it's essential for maintaining data isolation, security compliance, and scalability.

A well-designed Django SaaS architecture ensures proper healthtech data security while producing scalable Django applications that can grow with their customer base. So what's the best approach for implementing multitenancy in such scenarios?

The answer is: it depends. It depends on the number of customers you are planning to have. It depends on the size of your company. It depends on the technical skills of the developers working on the platform. And it depends on how sensitive the data is.

In this article, I'll go over architectural patterns for Django SaaS applications to tackle this problem and how to apply them in Django. My goal here is to give you an overview so you can decide what is best for your Django SaaS project.

If you already know what multitenancy is and why we need it, feel free to jump to the "Our Application" section.

First of all, let's agree that we are looking to build an application that is capable of serving multiple clients from a single version of the codebase. The reason is that we don't want to set up a new server for each client that signs up. With that in mind, we can finally start talking about multitenancy.

First things first. Here is the Wikipedia definition for it:

"The term "software multitenancy" refers to a software architecture in which a single instance of software runs on a server and serves multiple tenants."

In the context of this article, every time we say "tenant" we mean "customer". Cool? Cool! Alright, so multitenancy is a very broad concept and it can be observed at many levels of the infrastructure modern applications rely on. Here are some examples of multitenant systems we deal with on a daily basis:

  • Amazon AWS - AWS is an IaaS (Infrastructure as a Service). They provide one infrastructure that allows many companies (tenants) to host their own isolated infrastructure;
  • Heroku - Heroku operates much like AWS but from the PaaS (Platform as a Service) level. Again, many companies (tenants) can run their own applications isolated from one another in Heroku's platform;
  • Basecamp - Basecamp (which is a SaaS) is an application that allows many companies to manage their day-to-day, each in their own isolated context.

As you've guessed by now, we are going to focus on the SaaS scope. But why do we want to build multitenant applications in the first place? The main reason is to benefit from sharing resources amongst our clients. Not everyone is using the software at the same time. So many clients fit into the same infrastructure where only one would exist otherwise. But there are some other very important reasons.

Imagine the situation where you are going to set up a new infrastructure with a server and a database for each client. Even if you try to keep the architecture and code base the same for everyone, it's going to be a maintenance hell. A simple change in the software means multiple deploys. It might work for 10 or 20 clients, but it quickly becomes pretty much unmanageable.

There are 3 main approaches to multitenancy in this context. These Django multi-tenant strategies each have their own advantages and disadvantages. There's no established convention for naming them, but for this article, we are going to use the names: Single Shared Schema, Multiple Databases, and Multiple Schemas.

Let's explore practical Django multitenancy patterns that work in real-world scenarios.

Our Application

I'm not sure how familiar you are with the enterprise software market, but you would be surprised to know there's an increasing demand for Fidget Spinner software. Because of this, we will be using an Enterprise Fidget Spinner Tracker application in our code samples. Here is the database schema of the application:

[Figure: Application Schema]

Single Shared Schema

This is the approach used by most of the big companies. The reason for that is that it's the most scalable approach. Each of your clients will be sharing every piece of your infrastructure. There's a single database and a single server application.

[Figure: One Server, One Database]

That scalability does not come without some compromises. The biggest problem here is assuring data isolation. Because tables contain data from all clients, you rely on manually filtering the correct information on queries. You can learn how Salesforce does it in this talk. The short version is that they have internal tools that assist developers doing the job. The other thing is that they automatically double check every single query that goes to the database. Please keep in mind that Salesforce has plenty of engineers to take care of building those assisting tools. So if isolation is important to you and you are not Salesforce, this might not be the best approach.

Enough talking, let's see how to do this in Django. The idea is to add a tenant attribute to the request object that our views receive. The way to do this in Django is through a middleware. Historically, Django middleware was implemented as a class, but since Django 1.10 we can also write function-based middleware. That's the style we are going to use.

Here is the code, read through comments to follow each step:

def tenant_middleware(get_response):
    def middleware(request):
        # we are going to use subdomains to identify customers.
        # the first step is to extract the identifier from the request host
        host = request.get_host()  # host name plus optional port, something like this: 'ibm.spinnertracking.com:8000'
        host = host.split(':')[0]  # we remove the port part: 'ibm.spinnertracking.com'
        subdomain = host.split('.')[0]  # and now we get the subdomain: 'ibm'
        
        # now is just a matter of using the subdomain to find the
        # corresponding Customer in our database
        try:
            customer = Customer.objects.get(name=subdomain)
        except Customer.DoesNotExist:
            customer = None
            
        # and add it to the request
        request.tenant = customer
        
        # all done, the view will receive a request with a tenant attribute
        return get_response(request)
    
    return middleware
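
The subdomain parsing step is easy to get subtly wrong, so it can help to isolate it in a small function that is testable without Django. The sketch below is one way to do that; the helper name extract_subdomain and the base_domain parameter are hypothetical and not part of the middleware above. It assumes its input looks like the value of request.get_host(), which is a host name optionally followed by a port.

```python
def extract_subdomain(host, base_domain='spinnertracking.com'):
    """Return the tenant subdomain of ``host``, or None when there is none.

    ``host`` is expected to look like what ``request.get_host()`` returns:
    a host name, optionally followed by ``:port``.
    """
    hostname = host.split(':')[0].lower()  # drop the optional port
    suffix = '.' + base_domain
    if not hostname.endswith(suffix):
        return None  # bare domain or an unrelated host
    subdomain = hostname[:-len(suffix)]
    # reject empty and nested subdomains such as 'a.b.spinnertracking.com'
    if not subdomain or '.' in subdomain:
        return None
    return subdomain


tenant_key = extract_subdomain('ibm.spinnertracking.com:8000')
```

Returning None for hosts without a tenant subdomain mirrors the middleware, which sets the tenant to None when no customer matches.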

All done, add the middleware in your settings:

# function-based middleware must be registered in MIDDLEWARE (Django 1.10+)
MIDDLEWARE = [
    'my_project.middlewares.tenant_middleware',
    ...
]

And finally, you will be able to access it in your views to display things accordingly. In this example we are getting the average duration scoped by customer:

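from django.db.models import Avg
from django.shortcuts import render


def my_view(request):
    # request.tenant is the Customer set by tenant_middleware
    avg_duration = (
        Spin.objects
        .filter(user_spinner__user__customer=request.tenant)
        .aggregate(avg=Avg('duration')))['avg']

    return render(request, 'show_average.html', {'avg_duration': avg_duration})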

Neat, huh? But there's a way to improve this. A common technique to avoid all that nesting in the query is to decorate every model with the customer it belongs to.

[Figure: Customer annotation in every table]

This will denormalize your database, but in this case that is generally a good thing. First of all, this is a field that will never change, so you shouldn't get into too much trouble. It will also help you simplify queries and perform sanity checks on the data you are working with. Here is the same query we did before using the new field:

avg_duration = (
    Spin.objects
    .filter(customer=request.tenant)
    .aggregate(avg=Avg('duration')))['avg']

Cool! Just a little note for you to keep in mind when doing this: you might have some trouble if you try to use a third-party lib that introduces models to your application but does not let you customize them. Also, in case you are wondering, this is similar to the approach taken by the Django sites framework. You should definitely take a look at it.

There's a lot more you can do to improve your work when dealing with a shared database, but this should be enough to get you understanding the basics of how to do it. You should also check the django-shared-schema-tenants lib.
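
To make the "sanity checks" idea concrete, here is a minimal sketch of a guard that verifies every row about to be used belongs to the current tenant. It uses plain dataclasses as stand-ins for model instances so that it runs without Django; SpinRow, TenantMismatch, and assert_same_tenant are hypothetical names, and a real project would apply the same check to model instances carrying the denormalized customer field.

```python
from dataclasses import dataclass


@dataclass
class SpinRow:
    """Stand-in for a Spin model instance with a denormalized customer."""
    id: int
    customer_id: int
    duration: float


class TenantMismatch(Exception):
    """Raised when a row belongs to a different tenant than expected."""


def assert_same_tenant(rows, customer_id):
    """Return ``rows`` as a list, raising if any row belongs to another tenant."""
    rows = list(rows)
    for row in rows:
        if row.customer_id != customer_id:
            raise TenantMismatch(
                'row %s belongs to customer %s, expected %s'
                % (row.id, row.customer_id, customer_id))
    return rows


rows = [SpinRow(1, customer_id=7, duration=31.5),
        SpinRow(2, customer_id=7, duration=12.0)]
checked = assert_same_tenant(rows, customer_id=7)
```

A check like this is cheap because the customer is stored on the row itself, so no joins are needed to find out who owns it.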

Now let's try juggling with multiple databases:

Multiple Databases

We'll now move to the extreme opposite of single shared schema. Each tenant will now have its own database instance.

[Figure: Multiple Databases]

While this approach appears more complex, it offers several important advantages that you should consider. The first is isolation: you get not only logical but also physical data isolation. This is particularly valuable for applications where healthtech data security compliance is non-negotiable.

Each database can be on different hardware, and it will be quite hard for someone to mistakenly show the wrong data to a client. Also, you can tune each instance for the requirements of each customer. It also means you are probably going to spend more money on infrastructure since there's less resource sharing. Alright, let's make this work in Django.

You've probably noticed that Django's DATABASES setting is a dictionary. Although we normally only set a default key in it, it's possible to have multiple database entries. Every time we onboard a new client we are going to add a new entry there:

DATABASES = {
    'default': {
        'ENGINE': ...,
        'NAME': ...,
    },
    'ibm': {
        'ENGINE': ...,
        'NAME': ...,
    }
}
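
Maintaining those entries by hand does not scale past a handful of tenants, so the dictionary is often generated. Here is a minimal sketch, assuming every tenant database lives on the same PostgreSQL server and is named after the tenant; build_databases, the spinner_ prefix, and the TENANTS list are hypothetical.

```python
def build_databases(tenant_names, default_name='spinner_default'):
    """Build a DATABASES-style dict with one entry per tenant."""
    def entry(db_name):
        return {
            'ENGINE': 'django.db.backends.postgresql',
            'NAME': db_name,
        }

    databases = {'default': entry(default_name)}
    for tenant in tenant_names:
        if tenant == 'default':
            raise ValueError("'default' is reserved for the default alias")
        databases[tenant] = entry('spinner_' + tenant)
    return databases


TENANTS = ['ibm', 'vinta']  # hypothetical list, e.g. loaded from a config file
DATABASES = build_databases(TENANTS)
```

Keep in mind that settings are read when the process starts, so a newly added tenant normally only becomes available after the application is restarted.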

We can now use the same middleware we defined in the last approach. This will give us access to the customer object in our views. We can then use the queryset's using method to select the desired database.

spinners = (
    Spinner.objects
    .using(request.tenant.name)
    .annotate(
        avg_duration=Avg('owned_spinners__spins__duration'))
    .order_by('-avg_duration'))

There's also a db_manager method on model managers for creating objects in a specific database; check the documentation for more on that.

This is all nice and should work alright. One thing you might not enjoy as much is having to write using in every query. There's a way to get rid of this, but to do it you will need to use threadlocals. If you don't know what those are, be warned that there's no consensus on their use. You will find people strongly advising against them as a bad practice and saying that you should not use them in your code. For educational purposes, I'm going to show how it's done, but if someone asks, you didn't hear it from me.

The threadlocal approach

I've prepared an example app that uses the threadlocal approach. If you want to understand more about how threadlocal works you can read the Python docs. In the following examples, you will see the use of a @thread_local decorator (click here to see the source code for it).

The first thing we will need is a middleware that is very similar to the ones we did so far:

def multidb_middleware(get_response):
    def middleware(request):
        subdomain = get_subdomain(request)
        customer = get_customer(subdomain)
        request.customer = customer
        
        @thread_local(using_db=customer.name)
        def execute_request(request):
            return get_response(request)
        
        response = execute_request(request)
        return response
    
    return middleware

The only real difference is that this time we are setting using_db as a variable to the current thread using the decorator I mentioned before. Django allows us to define a custom database router. This is very useful for things like separate read and write databases. But this time we are going to use it to select the customer database. We will use the using_db variable we set in the middleware:

class TenantRouter(object):
    def db_for_read(self, model, **hints):
        return get_thread_local('using_db', 'default')
        
    def db_for_write(self, model, **hints):
        return get_thread_local('using_db', 'default')
    
    # …
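
The thread_local decorator and the get_thread_local function used above come from the example app. As an illustration of the mechanism, here is a minimal sketch of how such helpers could be written with threading.local from the standard library. It is an assumption about their behavior based on how they are called in this article, and not necessarily identical to the code in the example app.

```python
import threading
from functools import wraps

_state = threading.local()  # each thread sees its own attributes
_MISSING = object()


def thread_local(**values):
    """Set ``values`` on the current thread while the wrapped function runs."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            # remember what was set before so nested calls restore correctly
            previous = {k: getattr(_state, k, _MISSING) for k in values}
            for key, value in values.items():
                setattr(_state, key, value)
            try:
                return func(*args, **kwargs)
            finally:
                for key, old in previous.items():
                    if old is _MISSING:
                        delattr(_state, key)
                    else:
                        setattr(_state, key, old)
        return wrapper
    return decorator


def get_thread_local(name, default=None):
    """Read a value set by ``thread_local`` on the current thread."""
    return getattr(_state, name, default)


@thread_local(using_db='ibm')
def current_db():
    return get_thread_local('using_db', 'default')


inside = current_db()
outside = get_thread_local('using_db', 'default')
```

Restoring the previous value in the finally clause matters: worker threads are reused between requests, and a leftover using_db would route the next request to the wrong tenant.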

Last thing, change the settings to use the custom router:

DATABASE_ROUTERS = ['multitenancy.routers.TenantRouter']

Database routing is abstracted from querying:

spinners = (
    Spinner.objects
    .annotate(
        avg_duration=Avg('owned_spinners__spins__duration'))
    .order_by('-avg_duration'))

Voilà, we have a per-client database application with all the complexity hidden from the MVC logic. Now, if this is a good thing or not, I'll leave it for you to decide.

Is this multitenancy?

A thing some people might be questioning is whether having multiple databases is actually multitenancy or not. It depends on how you see it. From the database point of view, it's not. Each database instance is serving a single client. But from the application point of view, it is. We have a single instance of our application code serving multiple clients. While we're focusing on a fidget spinner application, these patterns apply equally to more complex domains like Django FHIR implementations.

Multiple Schemas

[Figure: Multiple Schemas]

Django multi-tenant solutions like django-tenant-schemas provide robust tools for schema management. But what are schemas in the first place? The first thing to notice is that in the context of this blog post, every time we talk about schemas we are referring to PostgreSQL schemas. That said, I can now safely state that if you've ever used PostgreSQL you have already used schemas. A schema is simply a namespace that groups tables inside a database. If you don't specify one, PostgreSQL falls back to the public schema by default. Simple enough. Now, how do we create a new schema? Easy:

CREATE SCHEMA ibm;

Just like the public schema, created schemas can have any number of tables. Each schema may or may not have tables with the same name. The important thing to understand is that regardless, tables in different schemas are disjoint. If you want to add a new field to tables in different schemas, you will need to run two different commands. So how do we query a table that is not in the public schema?

SELECT id, name FROM ibm.user WHERE ibm.user.name LIKE 'F%';

Makes sense, right? But there's a better way. PostgreSQL provides a search_path that you can use to scope your queries without the need to keep repeating yourself.

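SET search_path TO ibm;
-- without a schema prefix the table name must be quoted, because USER is a reserved word in PostgreSQL
SELECT id, name FROM "user" WHERE "user".name LIKE 'F%';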
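
When the schema name is derived from a request, it ends up interpolated into SQL, so it is worth validating it first. The helper below is a sketch under that assumption; search_path_sql and its validation rule (lowercase letters, digits, and underscores only) are hypothetical and stricter than what PostgreSQL itself allows.

```python
import re

_SCHEMA_NAME = re.compile(r'[a-z_][a-z0-9_]*')


def search_path_sql(*schemas):
    """Return a ``SET search_path`` statement for the given schema names."""
    if not schemas:
        raise ValueError('at least one schema is required')
    for schema in schemas:
        # fullmatch rejects anything with spaces, quotes, or semicolons
        if not _SCHEMA_NAME.fullmatch(schema):
            raise ValueError('invalid schema name: %r' % (schema,))
    quoted = ', '.join('"%s"' % schema for schema in schemas)
    return 'SET search_path TO %s' % quoted


statement = search_path_sql('ibm', 'public')
```

PostgreSQL searches the listed schemas in order, so putting public after the tenant schema keeps shared tables reachable while tenant tables take precedence.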

Good, that's most of what you need to know to understand how schemas work. From there you could use the same middleware techniques we showed before to automate the use of schemas. But fear not, someone already did that and it's a stable open source project!

Enter django-tenant-schemas.

Django-tenant-schemas will do most of the work of managing schemas for you. And it does that using many of the tricks we talked about before. Here is a block of code from its middleware:

...
connection.set_schema_to_public()
hostname = self.hostname_from_request(request)
TenantModel = get_tenant_model()
try:
    tenant = self.get_tenant(TenantModel, hostname, request)
    assert isinstance(tenant, TenantModel)
except TenantModel.DoesNotExist:
    # ...
request.tenant = tenant
connection.set_tenant(request.tenant)
...

Similar, isn't it? There's an extra interesting bit. Django-tenant-schemas also overrides the default PostgreSQL database backend. There, it sets the search path just before executing queries. Here is the code:

...
try:
    cursor_for_search_path.execute(
        'SET search_path = {0}'.format(','.join(search_paths)))
except (django.db.utils.DatabaseError, psycopg2.InternalError):
    self.search_path_set = False
else:
    self.search_path_set = True
if name:
    cursor_for_search_path.close()
...

Apart from managing schemas for CRUD operations, Django-tenant-schemas also provides some very useful commands such as a custom python manage.py createsuperuser that will ask you for a schema name, a python manage.py migrate_schemas command to automatically run migrations in each of your customer schemas and a python manage.py tenant_command <manage command here> that will give multitenant support to any naive command.

Querying across schemas

Since schema tables are disjoint, you might be wondering how you can aggregate data from multiple clients. This is something that is pretty easy to do in the single shared schema approach. With multiple schemas, the answer is to UNION the per-schema tables:

SELECT id, duration FROM ibm.spinner_spin WHERE duration > 120
UNION
SELECT id, duration FROM vinta.spinner_spin WHERE duration > 120;

But how about IDs? When you make unions you will end up with rows from different schemas that have the same ID. A solution to that is to use UUIDs.

SELECT uuid, duration FROM ibm.spinner_spin WHERE duration > 120
UNION
SELECT uuid, duration FROM vinta.spinner_spin WHERE duration > 120;
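
Writing the UNION by hand stops being practical as tenants are added, so the statement is usually generated. Here is a minimal sketch, assuming every tenant schema contains the same spinner_spin table with a uuid column; union_spins_sql is a hypothetical helper, and the %s placeholder follows the parameter style used by psycopg2 and Django database cursors.

```python
import re

_SCHEMA_NAME = re.compile(r'[a-z_][a-z0-9_]*')


def union_spins_sql(schemas, min_duration):
    """Build one query that reads spinner_spin from every given schema.

    Returns ``(sql, params)``; ``params`` is meant to be passed to
    ``cursor.execute`` so the duration is never interpolated into the SQL.
    """
    if not schemas:
        raise ValueError('at least one schema is required')
    parts = []
    for schema in schemas:
        # schema names cannot be bound as parameters, so validate them
        if not _SCHEMA_NAME.fullmatch(schema):
            raise ValueError('invalid schema name: %r' % (schema,))
        parts.append(
            'SELECT uuid, duration FROM "%s".spinner_spin WHERE duration > %%s'
            % schema)
    sql = '\nUNION\n'.join(parts)
    params = [min_duration] * len(schemas)
    return sql, params


sql, params = union_spins_sql(['ibm', 'vinta'], 120)
```

The list of schemas would typically come from the table that stores your tenants, so the query grows automatically as customers sign up.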

Now each row has an attribute that uniquely identifies it across the whole database and can be used in more complex queries.

These patterns form the foundation of many scalable Django applications in production today. I hope you’ve got a good overview of how multitenancy works and that you are now able to choose the approach that best suits your next project!
