CNK's Blog

Trimming Wagtail Migration Cruft

Django creates migrations for Django model changes that do not alter the database, for example, changes to help text or verbose names. In my previous post, I shared code for telling Django not to track non-database attributes in its migrations. This post is about something similar for Wagtail’s migrations.

At work, we are using Wagtail as our Content Management System (CMS). The Wagtail core team decided to follow Django’s example and record all model changes in migrations - including ones that do not change the database schema. Unfortunately for us, this means that when we add new blocks to our pages, “makemigrations” thinks it should make a new version of our StreamField - even though no SQL will be run when the migration is installed. We have a lot of blocks and they change fairly frequently, so these StreamField migrations take up a lot of space. And because they are large, they are nearly impossible to diff, so even if we kept them, it would be hard to use them to track down changes to our StreamField definitions.

For the most part, we just ignore it when “manage.py migrate” tells us we have changes in our code that are not reflected in our migrations. But when we do need to create a migration for database schema changes, we either need to accept a new large chunk of code that doesn’t do anything - or we have to manually remove those lines before committing the migration to version control.

I have read the discussion about the issue on the Wagtail issue queue. And, while I tend to agree with the policy decision, I still want to see what life is like without including StreamField definitions in our migrations. So I added the following monkey patch to the app we already have for all of our monkey patches.

  # wagtail_patches/monkey_patches.py

  import wagtail.core.fields

  def deconstruct_without_block_definition(self):
      name, path, _, kwargs = super(wagtail.core.fields.StreamField, self).deconstruct()
      block_types = list()
      args = [block_types]
      return name, path, args, kwargs
  wagtail.core.fields.StreamField.deconstruct = deconstruct_without_block_definition

This is simply a copy of the StreamField deconstruct method but I replaced “block_types = self.stream_block.child_blocks.items()” with an empty list. Now any field defined as:

  body = wagtail.core.fields.StreamField([ <large list of blocks here> ])

will be represented in the migration file as the following - with no list of blocks:

  ('body', wagtail.core.fields.StreamField([]))

Then I went through all the apps in our project and squashed migrations. This automatically ‘removed’ the StreamField definitions in the migrations included in the squash. Then I manually edited any migrations prior to the current squashing to remove the StreamField definitions from them. Deploying the squashed migrations went smoothly. Now we just need to do some development and see if there is any reason to want to change our minds and start tracking StreamField definitions in our migration files once more.

Trimming Django Migration Cruft

Django creates migrations for Django model changes that do not alter the database, for example, changes to help text or verbose names. In most cases when I see a migration for a change I am pretty sure doesn’t run any SQL, I check my assumption using python manage.py sqlmigrate <app> <migration_name>, and if it does not produce any SQL, then I edit the most recent migration to have touched that column to match the “changes” Django wants to make. For the most part that isn’t difficult but it is sometimes annoying. Other people have a similar opinion and one of them shared the following code on a Slack channel I am on.

WARNING: I have included the code as it was from the shared file, but my application had some data migrations with RunPython commands that invoke related_name. So in our application, we deleted the code below that removed attributes in MIGRATION_IGNORE_RELATED_FIELD_ATTRS.

  # app/management/commands/__init__.py

  """
  Django creates redundant migrations for Django model changes that do not alter the database.
  Here we patch Django's migration machinery to ignore attrs.

  The management commands `makemigrations` and `migrate` will ignore the attrs defined in:

      - MIGRATION_IGNORE_MODEL_ATTRS
      - MIGRATION_IGNORE_FIELD_ATTRS
      - MIGRATION_IGNORE_FILE_FIELD_ATTRS
      - MIGRATION_IGNORE_RELATED_FIELD_ATTRS

  This will reduce the number of migrations and therefore speed-up development
  """

  import logging
  from functools import wraps

  from django.db.migrations.operations import AlterModelOptions
  from django.db.models import Field, FileField
  from django.db.models.fields.related import RelatedField

  logger = logging.getLogger(__name__)

  MIGRATION_IGNORE_MODEL_ATTRS = ["verbose_name", "verbose_name_plural"]
  MIGRATION_IGNORE_FIELD_ATTRS = ["validators", "choices", "help_text", "verbose_name"]
  MIGRATION_IGNORE_FILE_FIELD_ATTRS = ["upload_to", "storage"]

  MIGRATION_IGNORE_RELATED_FIELD_ATTRS = ["related_name", "related_query_name"]

  for attr in MIGRATION_IGNORE_MODEL_ATTRS:
      logger.info(f"Model {attr} attr will be ignored.")

  for attr in MIGRATION_IGNORE_FIELD_ATTRS:
      logger.info(f"Field {attr} attr will be ignored.")

  for attr in MIGRATION_IGNORE_FILE_FIELD_ATTRS:
      logger.info(f"File field {attr} attr will be ignored.")

  for attr in MIGRATION_IGNORE_RELATED_FIELD_ATTRS:
      logger.info(f"Related field {attr} attr will be ignored.")


  def patch_ignored_model_attrs(cls):
      for attr in MIGRATION_IGNORE_MODEL_ATTRS:
          cls.ALTER_OPTION_KEYS.remove(attr)


  def patch_field_deconstruct(old_func):
      @wraps(old_func)
      def deconstruct_with_ignored_attrs(self):
          name, path, args, kwargs = old_func(self)
          for attr in MIGRATION_IGNORE_FIELD_ATTRS:
              kwargs.pop(attr, None)
          return name, path, args, kwargs

      return deconstruct_with_ignored_attrs


  def patch_file_field_deconstruct(old_func):
      @wraps(old_func)
      def deconstruct_with_ignored_attrs(self):
          name, path, args, kwargs = old_func(self)
          for attr in MIGRATION_IGNORE_FILE_FIELD_ATTRS:
              kwargs.pop(attr, None)
          return name, path, args, kwargs

      return deconstruct_with_ignored_attrs


  def patch_related_field_deconstruct(old_func):
      @wraps(old_func)
      def deconstruct_with_ignored_attrs(self):
          name, path, args, kwargs = old_func(self)
          for attr in MIGRATION_IGNORE_RELATED_FIELD_ATTRS:
              kwargs.pop(attr, None)
          return name, path, args, kwargs

      return deconstruct_with_ignored_attrs


  Field.deconstruct = patch_field_deconstruct(Field.deconstruct)
  FileField.deconstruct = patch_file_field_deconstruct(FileField.deconstruct)
  RelatedField.deconstruct = patch_related_field_deconstruct(RelatedField.deconstruct)
  patch_ignored_model_attrs(AlterModelOptions)

And now, create override files for the two manage commands we need to load our patches: migrate and makemigrations.

  # app/management/commands/makemigrations.py

  """
  Override of Django makemigrations. When we use this version, we
  will load the __init__ file above that patches models.Field.
  """

  from django.core.management.commands.makemigrations import Command  # noqa
  # app/management/commands/migrate.py

  """
  Override of Django migrate. When we use this version, we
  will load the __init__ file above that patches models.Field.
  """

  from django.core.management.commands.migrate import Command  # noqa

Django manage.py Commands

sendtestemail

Someone on the PyDev Slack channel was having trouble getting emails from the Django admin system and another member mentioned there is a sendtestemail manage.py command that can help one debug your email settings. That made me wonder what else is available that I didn’t know about.

diffsettings

The Django settings object is a bit odd in that you can’t do print(settings.__dict__) to figure out what is available - and even if you know (or guess) the name of a setting, how do you know if the value is the default or something you have overridden in your app? There is a manage.py command for that! The most useful version is python manage.py diffsettings --output unified. That gives you all the settings - with the overridden versions in red.

showmigrations

I have used the default version to check to see if I have applied all the existing migrations in my test and prod environments - that’s the python manage.py showmigrations --list version. But there is also a python manage.py showmigrations --plan version. That will show you the order in which Django will apply migrations.

inspectdb

If you run python manage.py inspectdb against an existing database, it will generate the Django models that would have created those tables (and indexes and constraints). This command is for projects that must use a legacy database so all of the models are created with a Meta class with managed = False.

ping_google

If your site has had a bunch of changes that you want Google to recrawl, you can use this command to submit your sitemap: python manage.py ping_google. If you have the Django sitemaps app installed and a url configured for your sitemap, this command will figure out what the url should be.

New Django Query Tricks

Union queries

Union queries are surprisingly easy to create. I need a list of ids and content type ids from a bunch of different models. I was very surpised at how straightforward it is in Django 3.2 to create the UNION query I want.

  union_query = None
  for content_type in <queryset of content types>:
      model = apps.get_model(content_type.app_label, content_type.model)
      query = model.objects.filter(<criteria>).values('pk', <content_type.id>)
      if union_query is None:
         union_query = query
      else:
          union_query = union_query.union(query, all=True)

Note: I used all=True because I will never have duplicates in my (id, content_type_id) tuples and UNION ALL is faster than UNION in this case because we can skip the DISTINCT operation on the final result.

The observant among you will have noticed a bit of pseudocode in the example above. I want to insert the content_type_id from python into my query. In SQL this would be something like:

  SELECT id, 99 FROM myapp_model;

In the Django ORM, that turns out to be something I didn’t know how to do. I can’t leave it as a bare name and I can’t quote it or the ORM tries to turn it into a column name or relation that could be turned into a column name. Turns out I need to use Value:

  query = model.objects \
               .filter(<criteria>) \
               .values('pk', Value(content_type.id, output_field=IntegerField()))

OK so that now will give me a queryset that produces a list of dicts like: [{pk: 3, content_type_id: 44}, {pk: 3, content_type_id: 48}] But when I tried to use those results in the filter section of another query… I had my next problem.

Querying by value - without Foreign Key relationships

So now I need to use those ids and content_type_ids to filter another model that has rows with content_type_id and object_id columns. I want all the lines in the table for the ModelLogEntry model where the (object_id, content_type_id) tuple is in the list of (pk, content_type_id) tuples created by our UNION query above.

If I only needed to match on a single value, I would probably evaluate the UNION query, and then do something like .filter(pk__in=<list of pks>) - as I did to get the list of content types I need. But I need to match the id and content_type_id fields. In SQL, I would do:

  SELECT wagtailcore_modellogentry.*
    FROM wagtailcore_modellogentry
    INNER JOIN (
    (
    ((SELECT `link_farms_audience`.`id`, 104 AS `content_type_id`  FROM `link_farms_audience` WHERE `link_farms_audience`.`site_id` =  12)
     UNION
     (SELECT `link_farms_collection`.`id`, 105 AS `content_type_id` FROM `link_farms_collection` WHERE `link_farms_collection`.`site_id` = 12))
     UNION
     (SELECT `link_farms_link`.`id`, 106 AS `content_type_id` FROM `link_farms_link` WHERE `link_farms_link`.`site_id` = 12))
     UNION
     (SELECT `core_didyouknowitem`.`id`, 110 AS `content_type_id` FROM `core_didyouknowitem` WHERE `core_didyouknowitem`.`site_id` = 12 ORDER BY `core_didyouknowitem`.`text` ASC)
    ) AS models
    ON models.id = wagtailcore_modellogentry.object_id
    AND models.content_type_id = wagtailcore_modellogentry.content_type_id

This was relatively straightforward to write in SQL, so I tried using raw SQL, e.g. ModelLogQuery.objets.raw('<query here>'). That definitely gave me the data I was looking for when I ran it in shell_plus. But when I tried to use it in my monkey patch, the calling function wanted to use values(), which is a function only defined on real ORM QuerySets - and not available when using raw.

At this point I suspect I won’t want to use this in production. Goodness only knows how big the union query is likely to get. But it is bothering me that I don’t know how to get Django to let me do a relatively straightforward join without having defined a ForeignQuery relationship in my Python model code.

I still don’t know how to tell Django “do this join damn it!”, but after some reading and thinking of alternate ways to write the SQL, I think I have found a way to write this in the ORM using Exists to create a correlated subquery.

    from django.apps import apps
    from django.db.models import Exists, IntegerField, OuterRef, Value
    from django.contrib.contenttypes.models import ContentType

    request = get_current_request()
    site = Site.find_for_request(request)
    union_query = None
    content_types = (
        ContentType.objects
                   .filter(id__in=ModelLogEntry.objects.values_list('content_type_id', flat=True).distinct())
    )
    for content_type in content_types:
        model = apps.get_model(content_type.app_label, content_type.model)
        query = (model.objects.filter(site_id=site.id)
                 .values('pk', content_type_id=Value(content_type.id, output_field=IntegerField()))
                 )
        if union_query is None:
            union_query = query
        else:
            union_query = union_query.union(query, all=True)

    return ModelLogEntry.objects.filter(Exists(
        union_query.filter(pk=OuterRef('object_id'), content_type_id=OuterRef('content_type_id'))
    ))

Sigh. One can’t combine .filter with a union query.

    NotSupportedError at /admin/reports/site-history/
    Calling QuerySet.filter() after union() is not supported.

I tested the Exists query by setting the union_query to be just one type and it works fine. So I learned something useful about the Django ORM - even if I can’t apply that knowledge in the context in which I wanted to to use it this time.

Hosting static websites on AWS

I have had a VM for web hosting for ….. about as long as VMs have been a thing. Before that I shared a physical machine with some old work collegues from ArsDigita. But ever since I hosed the PHP on my current VM, I haven’t really been doing anything that needed dynamic hosting. So it’s about time I saved some time and effort and move things to static hosting.

We do some AWS stuff at work so one of the simpler ways to move away from having my own server would be to host static sites in S3. The AWS console is a hot mess AND constantly changing AND isn’t version controlled. But terraform will let me manage my AWS configuration using code that can be version controlled. And will give me some additional experience using terraform and AWS. So win, win, win.

I found this blog post that does a pretty good job of explaining how to use terraform to set up a static web site on aws: The only things I had questions about where:

  1. Why use CloudFront for a low traffic site?
  2. Why not move DNS registration to AWS?
  3. Why redirect everything to www instead of to the bare hostname?

Item 1: CloudFront. I think for the author, part of the answer might be to get good scores on page response speed. I don’t really do a lot of speed optimization (other than mostly having just text on my site) but CloudFront is important for another reason. From the AWS docs as of August 2021:

Amazon S3 website endpoints do not support HTTPS or access points. If you want to use HTTPS, you can use Amazon CloudFront to serve a static website hosted on Amazon S3.

Item 2: DNS registration. I had originally been thinking of moving my DNS registration to AWS so I had everything in one place. I have been using DotYou.com for ages and they are fine, but nothing special. So while I was moving things, why not consolidate everything? Because AWS doesn’t have a super simple way to do email forwarding. With DotYou.com, I can navigate to the configuration for any of my domains and add one or more email addresses and where the email to those addresses should go. It is a personal life goal never to run a mail server. So I’ll be keeping my domains registered elsewhere - at least until setting up email forwarding on AWS is as simple as with my current provider.

Note bene: Because I want to continue to use the email forwarding from DotYou.com, I had to keep using their default nameservers rather than setting the name servers to be the Route53 name servers that Terraform set up for me. AND because CloudFront gives me a name rather than an IP to point my domain at, I can no longer have an A record for *.example.com; it needs to be a cname. I am not 100% there aren’t some implications of this that I am missing - but so far having a cname seems to work just fine….. or mostly, which brings me to item 3, what should be my canonical url?

Item 3: Canonial url. Google prefers if there is one canonical copy of any page - but it treats http://example.com, https://example.com, http://www.example.com, and https://www.example.com as 4 different sites - even if they all serve the exact same content. The best way to consolidate your analytics is to set up redirects so that all of those similar urls actually end up on one canonical url. I usually choose https://example.com as my real url. But the blog post I was following chose to redirect everything to https://www.example.com. It wasn’t until I had everything set up and was updating the DNS records, that I realized why it might have been better if I had redirected everything to www.example.com as in the example. I can create a wildcard cname record and point everything to some CloudFront distribution, and I have a wildcard SSL certificate. But if I try accessing http://foo.example.com, I get “403 Forbidden” and a message about this being a mis-configuration. If I try https://foo.example.com, I get an error with the error code: SSL_ERROR_NO_CYPHER_OVERLAP

CloudFlare supports wildcard CNAMEs so I might be able to set things up to catch all names - but that if that record points to the canonical distribution, then no redirect happens - so no consolidation of requests. So I think what I need to do is reverse my setup and make https://www.example.com my real bucket and real url. I did find this post about wildcard CloudFront distributions not behaving as the author expected - but I am not sure I understand how this is going to affect my setup.


Nope

After much fooling around with DNS settings and S3 bucket settings, I finally concluded that having working email addresses AND a web site served by S3/Cloudflare on my bare domain name (e.g. cynthiakiser.com) are incompatible. So I guess I will be keeping my VM after all.