Building privacy-first Django applications: model-based anonymization

Gustavo Carvalho
May 9, 2025

One of the biggest challenges in software development is ensuring that changes introduced in a controlled environment behave the same way when deployed to production.

A key factor in effective testing is having a staging environment that closely mirrors production, both in terms of data volume and structure. However, privacy and security concerns prevent us from using raw production data containing user information in lower environments.

To solve this, anonymization techniques allow us to leverage production-like data while ensuring that Personally Identifiable Information (PII) remains protected. In this post, I'll walk through our approach to implementing an anonymization mechanism that enables realistic QA testing without compromising user privacy.

Challenges in database anonymization

Before diving into our solution, let's examine the key challenges that any database anonymization strategy must address:

  • Ensuring data integrity - The anonymized data should maintain the same structure and characteristics as the original data to avoid breaking application logic;
  • Scalability - The anonymization process should efficiently handle large datasets without slowing down development workflows;
  • Developer responsibility - It should be easy for developers to mark fields as sensitive and define appropriate anonymization rules;
  • Automation - The process should minimize manual intervention, reducing the chances of human error.

Fostering developer ownership

We developed a model-integrated approach that makes privacy considerations an inherent part of the development process rather than an afterthought.

Our core goals were to:

  • Make anonymization rules visible and accessible to all developers;
  • Keep the rules close to the model definitions they protect;
  • Ensure rules are updated whenever model changes occur;
  • Distribute responsibility across the entire development team rather than centralizing it;
  • Enable automated checks through AI code review tools.

By embedding anonymization rules directly within model definitions, we ensure that developers can't add or modify sensitive fields without considering their privacy implications. This approach creates a natural checkpoint during code reviews, where team members can verify that appropriate anonymization measures are in place.

Alternatives considered

Before settling on our final approach, we evaluated several existing tools:

  • PostgreSQL Anonymizer: An excellent tool that operates at the database level, but it lacks support for Azure Managed PostgreSQL and requires SQL-based rule definitions, which would introduce a bottleneck in the development workflow. More importantly, it separates anonymization rules from model definitions, reducing developer awareness and ownership.
  • dj-anonymizer: Useful but follows a registry approach that requires defining separate anonymization classes, adding maintenance overhead similar to PostgreSQL Anonymizer. This separation makes it easier for anonymization rules to become out of sync with model changes.
  • django-scrubber: The closest to what we needed, but it had limitations, such as not properly handling non-integer primary key fields. It also didn't fully integrate with Django's existing model definition patterns.

While each of these tools offers valuable features, none provided the tight integration with model definitions that we needed to foster developer ownership of data privacy.

Model-based anonymization: our approach

Instead of maintaining a separate anonymization registry, we decided to define sensitive fields directly within each Django model's Meta class. This approach integrates privacy concerns directly into the data model, similar to how Django already handles permissions, verbose names, and other metadata.

By extending Django's Meta class (which developers are already familiar with), we create a natural place for privacy concerns to be addressed alongside other model metadata. This integration ensures that developers consider privacy implications as an inherent part of model design.
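
One detail worth noting: Django rejects Meta attributes it doesn't recognize, so these custom options need to be registered before any model that declares them is loaded. Here is a minimal sketch of one way to do it, assuming the common approach of extending Django's internal DEFAULT_NAMES tuple in a module imported before the models:

import django.db.models.options as options

# Allow our custom Meta attributes; without this, Django raises
# "TypeError: 'class Meta' got invalid attribute(s)" at model definition time.
options.DEFAULT_NAMES = options.DEFAULT_NAMES + (
    "sensitive_fields",        # per-field anonymization rules
    "anonymization_queryset",  # optional filter, introduced later in this post
)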

Leveraging Faker for realistic data generation

A critical component of our solution is the Faker library, which generates realistic but fake data. Faker provides various providers such as name, email, address, and text, allowing us to generate meaningful test data while preserving data privacy.

from faker import Faker
faker = Faker()
print(faker.name())          # John Doe
print(faker.email())         # johndoe@example.com
print(faker.company())       # ABC Corp
print(faker.phone_number())  # +1-202-555-0143

By using Faker, we can replace sensitive information with realistic-looking data that preserves the format and characteristics of the original data. This ensures that application behavior remains consistent in testing environments while protecting actual user information.

Defining sensitive fields in models

With Faker ready to generate our anonymized data, we introduced a sensitive_fields attribute in the Meta class of each model that needs anonymization. This allows developers to explicitly define which fields require protection and specify how they should be anonymized.

Here's how we define sensitive fields in various scenarios:

class UserProfile(models.Model):
    email = models.EmailField()
    phone = models.CharField(max_length=20)

    class Meta:
        sensitive_fields = {
            "email": "unique.email",
            "phone": "phone_number"
        }
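
The unique.email rule uses Faker's unique proxy, which guarantees that no generated value repeats within a run — useful when the target column carries a unique constraint. A quick illustration (not tied to any particular model):

from faker import Faker

faker = Faker()
# The unique proxy tracks previously returned values and raises
# faker.exceptions.UniquenessException if it cannot produce a new one.
emails = {faker.unique.email() for _ in range(1000)}
assert len(emails) == 1000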

When a developer adds a new field to a model, the proximity of the sensitive_fields attribute serves as a natural prompt to consider whether the new field contains sensitive information that should be anonymized.

During code reviews, team members can easily verify that appropriate anonymization rules are defined for any new or modified fields that might contain sensitive information. This collective responsibility helps maintain a strong privacy posture throughout the development lifecycle.
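
Because the rules live on the model itself, part of this verification can also be automated. As a hypothetical sketch (not something from our actual setup), a Django system check registered from an AppConfig.ready() could warn whenever a field with a suspicious name has no anonymization rule:

from django.apps import apps
from django.core import checks

# Hypothetical watchlist of field names that usually hold PII.
SUSPICIOUS_FIELD_NAMES = {"email", "phone", "name", "address"}


@checks.register()
def check_sensitive_fields(app_configs, **kwargs):
    warnings = []
    for model in apps.get_models():
        rules = getattr(model._meta, "sensitive_fields", None) or {}
        for field in model._meta.get_fields():
            if field.name in SUSPICIOUS_FIELD_NAMES and field.name not in rules:
                warnings.append(
                    checks.Warning(
                        f"{model.__name__}.{field.name} looks like PII but has no anonymization rule.",
                        obj=model,
                        id="anonymization.W001",
                    )
                )
    return warnings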

Here are some more examples showing the flexibility of sensitive field definitions:

# Passing arguments to faker methods

class UserBarAdmission(APINode):
    jurisdiction = models.CharField(max_length=255)
    license_number = models.CharField(max_length=255)
    admission_year = models.PositiveIntegerField()
    extra_note = models.TextField()
    us_state = models.CharField(max_length=255)

    class Meta:
        sensitive_fields = {
            "jurisdiction": "state",
            "license_number": ("random_int", {"min": 1000000000, "max": 9999999999}),
            "admission_year": "year",
            "extra_note": ("text", {"max_nb_chars": 50}),
            "us_state": "state_abbr",
        }

# Conditional rules

def user_profile_name_rule(obj):
    return 'company' if obj.is_company else 'name'


class UserProfile(models.Model):
    name = models.CharField(max_length=255)
    is_company = models.BooleanField()

    class Meta:
        sensitive_fields = {
            "name": lambda obj: user_profile_name_rule(obj)
        }

# Hardcoded values

class UserProfile(models.Model):
    ...
    logo_url = models.URLField()
    logo_name = models.CharField(max_length=255)

    class Meta:
        sensitive_fields = {
            "logo_url": lambda: "https://placehold.co/600x400/png",
            "logo_name": ("file_name", {"extension": "png"}),
        }

Custom querysets for partial anonymization

Not all records require anonymization; for some models, only a subset of records needs to be processed. To support this, we introduced an optional anonymization_queryset attribute.

from django.db.models import Q

class CompanyProfile(models.Model):
    ...
    name = models.TextField(blank=True)

    class Meta:
        sensitive_fields = {"name": "company"}
        anonymization_queryset = ~Q(name="")

In the example above, only instances of CompanyProfile where the field name is not an empty string will be anonymized.

This allows fine-grained control over which records are processed, improving efficiency.
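
Since anonymization_queryset is just a Q object, conditions can be combined as needed. Building on the example above, and using a hypothetical is_test_account field purely for illustration, internal test accounts could be skipped entirely:

class CompanyProfile(models.Model):
    ...
    class Meta:
        sensitive_fields = {"name": "company"}
        # Hypothetical: leave internal test accounts untouched, skip empty names.
        anonymization_queryset = Q(is_test_account=False) & ~Q(name="")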

Bulk processing for performance

Handling large datasets efficiently is crucial. Our implementation processes records in batches using Django's bulk_update, significantly reducing database overhead.

However, even with batch processing, we still need to iterate over all database rows and fields that require anonymization. In large databases, this can still take a considerable amount of time.

While bulk processing is the fastest approach we found, it does not eliminate the need to process each relevant record individually, which remains a time-consuming operation.

Automating the process

Our database is quite large, making the anonymization process time-consuming even with bulk processing. However, we can afford for our QA data to be slightly outdated. To balance performance and usability, we decided to run the anonymization process once a week as part of our CI/CD pipeline.

The automation process follows these steps:

  1. Read from the latest backup snapshot available on production;
  2. Make a copy of the backup into the QA database;
  3. Run the anonymization process on the QA database.

We implemented the anonymization logic as a Django management command, making it easy to integrate with our CI/CD tools:

import logging
import random
import time
from inspect import signature

from django.apps import apps
from django.contrib.postgres.fields import ArrayField
from django.core.management.base import BaseCommand
from django.db import transaction
from faker import Faker

LOGGER = logging.getLogger(__name__)
DEFAULT_BATCH_SIZE = 5000
MAX_ARRAY_LENGTH = 5

FAKER = Faker()


def get_faker_method_from_str(rule):
    """
    Retrieve a Faker method based on a string representation.

    This function takes a string that represents a Faker method, potentially
    including nested attributes separated by dots, and returns the corresponding
    method from the Faker library.

    Args:
        rule (str): A string representing the Faker method, with nested attributes separated by dots (e.g., 'unique.email', 'company', 'phone_number').

    Returns:
        function: The corresponding Faker method.

    Raises:
        ValueError: If the provided rule does not correspond to a valid Faker method.
	"""
    # Split on dots because getattr cannot resolve nested attribute paths such as "unique.email" in one call
    parts = rule.split(".")
    faker_method = FAKER
    for part in parts:
        faker_method = getattr(faker_method, part, None)
        if not faker_method:
            raise ValueError(f"Invalid faker method {rule}")
    return faker_method


def anonymize_field(instance, field_name, rule):
    """
    Anonymizes a field of a given instance based on the provided rule.

    Args:
        instance (object): The instance containing the field to be anonymized.
        field_name (str): The name of the field to be anonymized.
        rule (str, tuple, list, callable): The rule defining how to anonymize the field.
            - If a string, it is interpreted as the name of a Faker method.
            - If a tuple or list, the first element is the Faker method name and the second element is a dictionary of arguments.
            - If a callable, it can either be a function that takes the instance as an argument or a function that directly returns the anonymized value.

    Returns:
        The anonymized value for the specified field.

    Raises:
        ValueError: If the rule is not a valid type or format.
    """
    arguments = {}
    if isinstance(rule, str):
        faker_method = get_faker_method_from_str(rule)
    elif isinstance(rule, (tuple, list)):
        faker_method = get_faker_method_from_str(rule[0])
        arguments = rule[1]
    elif callable(rule):
        sig = signature(rule)
        if len(sig.parameters) == 1:
            faker_method = get_faker_method_from_str(rule(instance))
        else:
            faker_method = rule
    else:
        raise ValueError(f"Invalid rule for field {field_name}")

    field_value = faker_method(**arguments) if callable(faker_method) else faker_method

    field = instance._meta.get_field(field_name)
    if isinstance(field, ArrayField):
        max_array_length = field.size if field.size else MAX_ARRAY_LENGTH
        amount = random.randint(1, max_array_length)
        return [field_value for _ in range(amount)]
    return field_value


@transaction.atomic
def anonymize_model(
    model,
    field_rules,
    custom_filter=None,
    dry_run=True,
    batch_size=DEFAULT_BATCH_SIZE,
):
    """
    Anonymizes data for a given Django model based on specified field rules.

    Args:
        model (django.db.models.Model): The Django model to anonymize.
        field_rules (dict): A dictionary where keys are field names and values are anonymization rules.
        custom_filter (Q, optional): A Django Q object to filter the queryset.
        dry_run (bool, optional): If True, no changes will be saved to the database.
        batch_size (int, optional): Number of records to process per bulk_update call.

    Returns:
        None
    """
    start = time.time()
    LOGGER.info(
        "Starting anonymization for model %s (dry_run=%s)", model.__name__, dry_run
    )

    queryset = model.objects.all()
    if custom_filter:
        queryset = queryset.filter(custom_filter)
    queryset = queryset.order_by("pk")
    count = queryset.count()

    for offset in range(0, count, batch_size):
        LOGGER.info("Anonymizing %s %d of %d", model.__name__, offset, count)
        updates = []
        for instance in queryset[offset : offset + batch_size]:
            for field, rule in field_rules.items():
                setattr(instance, field, anonymize_field(instance, field, rule))
            updates.append(instance)
        if not dry_run:
            model.objects.bulk_update(
                updates, field_rules.keys(), batch_size=batch_size
            )

    LOGGER.info(
        "Anonymized %d records for %s in %s", count, model.__name__, time.time() - start
    )


class Command(BaseCommand):
    help = "Anonymize production data"

    def add_arguments(self, parser):
        parser.add_argument(
            "--dry-run",
            action="store_true",
            help="Perform a dry run without updating the database.",
        )
        parser.add_argument(
            "--batch-size",
            type=int,
            default=DEFAULT_BATCH_SIZE,
            help="Number of records to process in each batch (default: %(default)s)",
        )
        parser.add_argument(
            "--model",
            type=str,
            help="Anonymize only the specified model (e.g., 'app.ModelName').",
        )

    def handle(self, *args, **options):
        dry_run = options["dry_run"]
        batch_size = options["batch_size"]

        models = apps.get_models()
        if options["model"]:
            app_label, model_name = options["model"].split(".")
            models = [apps.get_model(app_label, model_name)]

        for model in models:
            sensitive_field_rules = getattr(model._meta, "sensitive_fields", None)
            if not sensitive_field_rules:
                continue

            custom_filter = getattr(model._meta, "anonymization_queryset", None)
            anonymize_model(
                model,
                sensitive_field_rules,
                custom_filter,
                dry_run=dry_run,
                batch_size=batch_size,
            )

Usage examples

Here are some examples of how to run the anonymization command:

Anonymize the entire database:

python manage.py anonymize_database

Anonymize a single model:

python manage.py anonymize_database --model accounts.User

Adjust the batch size for performance tuning:

python manage.py anonymize_database --batch-size 500

Wrapping up: integrated anonymization transforms testing

Our model-based anonymization approach offers a developer-friendly solution that integrates seamlessly with Django's ORM. By defining sensitive fields directly in the model's Meta class, we ensure that anonymization rules remain connected to the data they protect.

Besides creating a natural checkpoint during development and code reviews, this approach works well with AI code review tools, which add an extra layer of verification and help catch potential privacy issues before they reach production.

While our approach has limitations in performance with large databases and requires familiarity with the Faker library, it has significantly improved our testing capabilities. Our QA team can now work with production-like data without exposing sensitive information, while privacy considerations become embedded directly into our development workflow.