CNK's Blog

Trimming Django Migration Cruft

Django creates migrations for Django model changes that do not alter the database, for example, changes to help text or verbose names. In most cases when I see a migration for a change I am pretty sure doesn’t run any SQL, I check my assumption using python manage.py sqlmigrate <app> <migration_name>, and if it does not produce any SQL, then I edit the most recent migration to have touched that column to match the “changes” Django wants to make. For the most part that isn’t difficult but it is sometimes annoying. Other people have a similar opinion and one of them shared the following code on a Slack channel I am on.

WARNING: I have included the code as it was from the shared file, but my application had some data migrations with RunPython commands that invoke related_name. So in our application, we deleted the code below that removed attributes in MIGRATION_IGNORE_RELATED_FIELD_ATTRS.

  # app/management/commands/__init__.py

  """
  Django creates redundant migrations for Django model changes that do not alter the database.
  Here we patch Django's migration machinery to ignore attrs.

  The management commands `makemigrations` and `migrate` will ignore the attrs defined in:

      - MIGRATION_IGNORE_MODEL_ATTRS
      - MIGRATION_IGNORE_FIELD_ATTRS
      - MIGRATION_IGNORE_FILE_FIELD_ATTRS
      - MIGRATION_IGNORE_RELATED_FIELD_ATTRS

  This will reduce the number of migrations and therefore speed-up development
  """

  import logging
  from functools import wraps

  from django.db.migrations.operations import AlterModelOptions
  from django.db.models import Field, FileField
  from django.db.models.fields.related import RelatedField

  logger = logging.getLogger(__name__)

  MIGRATION_IGNORE_MODEL_ATTRS = ["verbose_name", "verbose_name_plural"]
  MIGRATION_IGNORE_FIELD_ATTRS = ["validators", "choices", "help_text", "verbose_name"]
  MIGRATION_IGNORE_FILE_FIELD_ATTRS = ["upload_to", "storage"]

  MIGRATION_IGNORE_RELATED_FIELD_ATTRS = ["related_name", "related_query_name"]

  for attr in MIGRATION_IGNORE_MODEL_ATTRS:
      logger.info(f"Model {attr} attr will be ignored.")

  for attr in MIGRATION_IGNORE_FIELD_ATTRS:
      logger.info(f"Field {attr} attr will be ignored.")

  for attr in MIGRATION_IGNORE_FILE_FIELD_ATTRS:
      logger.info(f"File field {attr} attr will be ignored.")

  for attr in MIGRATION_IGNORE_RELATED_FIELD_ATTRS:
      logger.info(f"Related field {attr} attr will be ignored.")


  def patch_ignored_model_attrs(cls):
      for attr in MIGRATION_IGNORE_MODEL_ATTRS:
          cls.ALTER_OPTION_KEYS.remove(attr)


  def patch_field_deconstruct(old_func):
      @wraps(old_func)
      def deconstruct_with_ignored_attrs(self):
          name, path, args, kwargs = old_func(self)
          for attr in MIGRATION_IGNORE_FIELD_ATTRS:
              kwargs.pop(attr, None)
          return name, path, args, kwargs

      return deconstruct_with_ignored_attrs


  def patch_file_field_deconstruct(old_func):
      @wraps(old_func)
      def deconstruct_with_ignored_attrs(self):
          name, path, args, kwargs = old_func(self)
          for attr in MIGRATION_IGNORE_FILE_FIELD_ATTRS:
              kwargs.pop(attr, None)
          return name, path, args, kwargs

      return deconstruct_with_ignored_attrs


  def patch_related_field_deconstruct(old_func):
      @wraps(old_func)
      def deconstruct_with_ignored_attrs(self):
          name, path, args, kwargs = old_func(self)
          for attr in MIGRATION_IGNORE_RELATED_FIELD_ATTRS:
              kwargs.pop(attr, None)
          return name, path, args, kwargs

      return deconstruct_with_ignored_attrs


  Field.deconstruct = patch_field_deconstruct(Field.deconstruct)
  FileField.deconstruct = patch_file_field_deconstruct(FileField.deconstruct)
  RelatedField.deconstruct = patch_related_field_deconstruct(RelatedField.deconstruct)
  patch_ignored_model_attrs(AlterModelOptions)

And now, create override files for the two manage commands we need to load our patches: migrate and makemigrations.

  # app/management/commands/makemigrations.py

  """
  Override of Django makemigrations. When we use this version, we
  will load the __init__ file above that patches models.Field.
  """

  from django.core.management.commands.makemigrations import Command  # noqa
  # app/management/commands/migrate.py

  """
  Override of Django migrate. When we use this version, we
  will load the __init__ file above that patches models.Field.
  """

  from django.core.management.commands.migrate import Command  # noqa

Django manage.py Commands

sendtestemail

Someone on the PyDev Slack channel was having trouble getting emails from the Django admin system and another member mentioned there is a sendtestemail manage.py command that can help one debug your email settings. That made me wonder what else is available that I didn’t know about.

diffsettings

The Django settings object is a bit odd in that you can’t do print(settings.__dict__) to figure out what is available - and even if you know (or guess) the name of a setting, how do you know if the value is the default or something you have overridden in your app? There is a manage.py command for that! The most useful version is python manage.py diffsettings --output unified. That gives you all the settings - with the overridden versions in red.

showmigrations

I have used the default version to check to see if I have applied all the existing migrations in my test and prod environments - that’s the python manage.py showmigrations --list version. But there is also a python manage.py showmigrations --plan version. That will show you the order in which Django will apply migrations.

inspectdb

If you run python manage.py inspectdb against an existing database, it will generate the Django models that would have created those tables (and indexes and constraints). This command is for projects that must use a legacy database so all of the models are created with a Meta class with managed = False.

ping_google

If your site has had a bunch of changes that you want Google to recrawl, you can use this command to submit your sitemap: python manage.py ping_google. If you have the Django sitemaps app installed and a url configured for your sitemap, this command will figure out what the url should be.

New Django Query Tricks

Union queries

Union queries are surprisingly easy to create. I need a list of ids and content type ids from a bunch of different models. I was very surpised at how straightforward it is in Django 3.2 to create the UNION query I want.

  union_query = None
  for content_type in <queryset of content types>:
      model = apps.get_model(content_type.app_label, content_type.model)
      query = model.objects.filter(<criteria>).values('pk', <content_type.id>)
      if union_query is None:
         union_query = query
      else:
          union_query = union_query.union(query, all=True)

Note: I used all=True because I will never have duplicates in my (id, content_type_id) tuples and UNION ALL is faster than UNION in this case because we can skip the DISTINCT operation on the final result.

The observant among you will have noticed a bit of pseudocode in the example above. I want to insert the content_type_id from python into my query. In SQL this would be something like:

  SELECT id, 99 FROM myapp_model;

In the Django ORM, that turns out to be something I didn’t know how to do. I can’t leave it as a bare name and I can’t quote it or the ORM tries to turn it into a column name or relation that could be turned into a column name. Turns out I need to use Value:

  query = model.objects \
               .filter(<criteria>) \
               .values('pk', Value(content_type.id, output_field=IntegerField()))

OK so that now will give me a queryset that produces a list of dicts like: [{pk: 3, content_type_id: 44}, {pk: 3, content_type_id: 48}] But when I tried to use those results in the filter section of another query… I had my next problem.

Querying by value - without Foreign Key relationships

So now I need to use those ids and content_type_ids to filter another model that has rows with content_type_id and object_id columns. I want all the lines in the table for the ModelLogEntry model where the (object_id, content_type_id) tuple is in the list of (pk, content_type_id) tuples created by our UNION query above.

If I only needed to match on a single value, I would probably evaluate the UNION query, and then do something like .filter(pk__in=<list of pks>) - as I did to get the list of content types I need. But I need to match the id and content_type_id fields. In SQL, I would do:

  SELECT wagtailcore_modellogentry.*
    FROM wagtailcore_modellogentry
    INNER JOIN (
    (
    ((SELECT `link_farms_audience`.`id`, 104 AS `content_type_id`  FROM `link_farms_audience` WHERE `link_farms_audience`.`site_id` =  12)
     UNION
     (SELECT `link_farms_collection`.`id`, 105 AS `content_type_id` FROM `link_farms_collection` WHERE `link_farms_collection`.`site_id` = 12))
     UNION
     (SELECT `link_farms_link`.`id`, 106 AS `content_type_id` FROM `link_farms_link` WHERE `link_farms_link`.`site_id` = 12))
     UNION
     (SELECT `core_didyouknowitem`.`id`, 110 AS `content_type_id` FROM `core_didyouknowitem` WHERE `core_didyouknowitem`.`site_id` = 12 ORDER BY `core_didyouknowitem`.`text` ASC)
    ) AS models
    ON models.id = wagtailcore_modellogentry.object_id
    AND models.content_type_id = wagtailcore_modellogentry.content_type_id

This was relatively straightforward to write in SQL, so I tried using raw SQL, e.g. ModelLogQuery.objets.raw('<query here>'). That definitely gave me the data I was looking for when I ran it in shell_plus. But when I tried to use it in my monkey patch, the calling function wanted to use values(), which is a function only defined on real ORM QuerySets - and not available when using raw.

At this point I suspect I won’t want to use this in production. Goodness only knows how big the union query is likely to get. But it is bothering me that I don’t know how to get Django to let me do a relatively straightforward join without having defined a ForeignQuery relationship in my Python model code.

I still don’t know how to tell Django “do this join damn it!”, but after some reading and thinking of alternate ways to write the SQL, I think I have found a way to write this in the ORM using Exists to create a correlated subquery.

    from django.apps import apps
    from django.db.models import Exists, IntegerField, OuterRef, Value
    from django.contrib.contenttypes.models import ContentType

    request = get_current_request()
    site = Site.find_for_request(request)
    union_query = None
    content_types = (
        ContentType.objects
                   .filter(id__in=ModelLogEntry.objects.values_list('content_type_id', flat=True).distinct())
    )
    for content_type in content_types:
        model = apps.get_model(content_type.app_label, content_type.model)
        query = (model.objects.filter(site_id=site.id)
                 .values('pk', content_type_id=Value(content_type.id, output_field=IntegerField()))
                 )
        if union_query is None:
            union_query = query
        else:
            union_query = union_query.union(query, all=True)

    return ModelLogEntry.objects.filter(Exists(
        union_query.filter(pk=OuterRef('object_id'), content_type_id=OuterRef('content_type_id'))
    ))

Sigh. One can’t combine .filter with a union query.

    NotSupportedError at /admin/reports/site-history/
    Calling QuerySet.filter() after union() is not supported.

I tested the Exists query by setting the union_query to be just one type and it works fine. So I learned something useful about the Django ORM - even if I can’t apply that knowledge in the context in which I wanted to to use it this time.

Hosting static websites on AWS

I have had a VM for web hosting for ….. about as long as VMs have been a thing. Before that I shared a physical machine with some old work collegues from ArsDigita. But ever since I hosed the PHP on my current VM, I haven’t really been doing anything that needed dynamic hosting. So it’s about time I saved some time and effort and move things to static hosting.

We do some AWS stuff at work so one of the simpler ways to move away from having my own server would be to host static sites in S3. The AWS console is a hot mess AND constantly changing AND isn’t version controlled. But terraform will let me manage my AWS configuration using code that can be version controlled. And will give me some additional experience using terraform and AWS. So win, win, win.

I found this blog post that does a pretty good job of explaining how to use terraform to set up a static web site on aws: The only things I had questions about where:

  1. Why use CloudFront for a low traffic site?
  2. Why not move DNS registration to AWS?
  3. Why redirect everything to www instead of to the bare hostname?

Item 1: CloudFront. I think for the author, part of the answer might be to get good scores on page response speed. I don’t really do a lot of speed optimization (other than mostly having just text on my site) but CloudFront is important for another reason. From the AWS docs as of August 2021:

Amazon S3 website endpoints do not support HTTPS or access points. If you want to use HTTPS, you can use Amazon CloudFront to serve a static website hosted on Amazon S3.

Item 2: DNS registration. I had originally been thinking of moving my DNS registration to AWS so I had everything in one place. I have been using DotYou.com for ages and they are fine, but nothing special. So while I was moving things, why not consolidate everything? Because AWS doesn’t have a super simple way to do email forwarding. With DotYou.com, I can navigate to the configuration for any of my domains and add one or more email addresses and where the email to those addresses should go. It is a personal life goal never to run a mail server. So I’ll be keeping my domains registered elsewhere - at least until setting up email forwarding on AWS is as simple as with my current provider.

Note bene: Because I want to continue to use the email forwarding from DotYou.com, I had to keep using their default nameservers rather than setting the name servers to be the Route53 name servers that Terraform set up for me. AND because CloudFront gives me a name rather than an IP to point my domain at, I can no longer have an A record for *.example.com; it needs to be a cname. I am not 100% there aren’t some implications of this that I am missing - but so far having a cname seems to work just fine….. or mostly, which brings me to item 3, what should be my canonical url?

Item 3: Canonial url. Google prefers if there is one canonical copy of any page - but it treats http://example.com, https://example.com, http://www.example.com, and https://www.example.com as 4 different sites - even if they all serve the exact same content. The best way to consolidate your analytics is to set up redirects so that all of those similar urls actually end up on one canonical url. I usually choose https://example.com as my real url. But the blog post I was following chose to redirect everything to https://www.example.com. It wasn’t until I had everything set up and was updating the DNS records, that I realized why it might have been better if I had redirected everything to www.example.com as in the example. I can create a wildcard cname record and point everything to some CloudFront distribution, and I have a wildcard SSL certificate. But if I try accessing http://foo.example.com, I get “403 Forbidden” and a message about this being a mis-configuration. If I try https://foo.example.com, I get an error with the error code: SSL_ERROR_NO_CYPHER_OVERLAP

CloudFlare supports wildcard CNAMEs so I might be able to set things up to catch all names - but that if that record points to the canonical distribution, then no redirect happens - so no consolidation of requests. So I think what I need to do is reverse my setup and make https://www.example.com my real bucket and real url. I did find this post about wildcard CloudFront distributions not behaving as the author expected - but I am not sure I understand how this is going to affect my setup.


Nope

After much fooling around with DNS settings and S3 bucket settings, I finally concluded that having working email addresses AND a web site served by S3/Cloudflare on my bare domain name (e.g. cynthiakiser.com) are incompatible. So I guess I will be keeping my VM after all.

Upgrading to MySQL 8

Character set considerations

We have largely been using UTF8 for our databases for some time. However, if we want to support emojis or other more exotic characters, we need to start using utf8mb4. In fact, our Rails projects are set up to require that character set.

So while we are upgrading databases versions (from the EOL MySQL 5.6 to MySQL 8), we decided to upgrade the character sets of our databases at the same time. I initially tried to configure our database with collation_server as utf8mb4_0900_ai_ci to get all the latest unicode goodness, but AWS said I couldn’t set collation_server to that value. The closest option was utf8mb4_unicode_ci so I went with that. Once I got into the database, I found the following settings:

  MySQL [mysql]> show variables like '%collation%';
  +-------------------------------+--------------------+
  | Variable_name                 | Value              |
  +-------------------------------+--------------------+
  | collation_connection          | utf8_general_ci    |
  | collation_database            | utf8mb4_0900_ai_ci |
  | collation_server              | utf8mb4_unicode_ci |
  | default_collation_for_utf8mb4 | utf8mb4_0900_ai_ci |
  +-------------------------------+--------------------+
  4 rows in set (0.00 sec)

  MySQL [mysql]> show variables like '%character_set%';
  +--------------------------+-------------------------------------------+
  | Variable_name            | Value                                     |
  +--------------------------+-------------------------------------------+
  | character_set_client     | utf8                                      |
  | character_set_connection | utf8                                      |
  | character_set_database   | utf8mb4                                   |
  | character_set_filesystem | binary                                    |
  | character_set_results    | utf8                                      |
  | character_set_server     | utf8mb4                                   |
  | character_set_system     | utf8                                      |
  | character_sets_dir       | /rdsdbbin/mysql-8.0.23.R3/share/charsets/ |
  +--------------------------+-------------------------------------------+
  8 rows in set (0.00 sec)

So from that we see databases should automatically be created with character set utf8b4 and collation utf8mb4_0900_ai_ci because that is the default collation for the character set utf8mb4. However, I suspect I may want to set the character set (and collation) for the client and connection to utf8b4.

Changing the values in my terraform aws_db_parameter_group did not change the values in the running database.

Authentication plugin

I was expecting the password hashing mechanism for the new database to be caching_sha2_password since I had read that was the new method for MySQL8. But when I created my first user on the new database, I found they had their password stored as mysql_native_password.

  MySQL [mysql]> show variables like '%auth%';
  +-------------------------------+-----------------------+
  | Variable_name                 | Value                 |
  +-------------------------------+-----------------------+
  | default_authentication_plugin | mysql_native_password |
  +-------------------------------+-----------------------+
  1 row in set (0.00 sec)

We do still have some clients that will need to connect with mysql_native_password but I want those to be the exceptions, not the rule.

Trying to update that in place gave:

  Error: Error modifying DB Parameter Group: InvalidParameterValue: The parameter default_authentication_plugin cannot be modified.
    on main.tf line 726, in resource "aws_db_parameter_group" "mysql-sensible-defaults-8-0":
   726: resource "aws_db_parameter_group" "mysql-sensible-defaults-8-0" {

Timezone information

The Django docs about using MySQL say you need to load the timezone tables when creating a new MySQL instance. Both the docker containers we use in dev and the AWS RDS MySQL instances take care of loading this data automatically.

Notes from MySQL 8 Docs

https://dev.mysql.com/doc/refman/8.0/en/charset-server.html The server character set and collation are used as default values if the database character set and collation are not specified in CREATE DATABASE statements. They have no other purpose.

The values you see for show variables like '%collation%'; will vary depending on the session details. The database you are in (e.g. one of the user schemas vs in mysql or information_schema) will affect what you see for collation_database and character_set_database. And how you connected will affect the values you see for collation_connection and character_set_connection, character_set_results, and character_set_client. It looks to me like the controlling variable there is character_set_client.