Permission Patches for Multitenancy

November 05, 2023

To completely separate sites within the Wagtail admin, we need to make changes to page and collection permissions and do some patching of the user management, workflow, and history systems. My previous post covered the mechanics of how we introduce monkey patches into our project. In this post I am going to explain how we have customized Wagtail 5.1’s new PagePermissionPolicy to preserve our version of multitenancy.

SuperAdmins

It is impractical for us to add our developers to the actual Admin groups of the hundreds of sites on the system, so we invented a concept we call “superadmins”. Superadmins are users who the system pretends are in the “Admin” group for whichever site they’re currently logged in to. In this way, our system presents each site to a superadmin as if it’s the only site on the server and lets us see exactly what an actual admin of the site sees. is_superadmin is a boolean field on our user model:

    class User(AbstractUser):
        """
        Replaces the auth.User model with our customized version.
        """
        is_superadmin = models.BooleanField(
            default=False,
            verbose_name='Super Admin',
            help_text='Enable this flag to make this user a Super Admin, which causes the system to treat them like they '
                    'are an Admin on whatever site they are logged into.'
        )

Page Permission Patches

Prior to Wagtail 5.1 we were patching wagtail.admin.auth.user_has_any_page_permission, wagtail.admin.navigation.get_pages_with_direct_explore_permission, and wagtail.core.models.UserPagePermissionsProxy.__init__.

In Wagtail 5.1, UserPagePermissionsProxy and get_pages_with_direct_explore_permission are both deprecated and permission checking has been consolidated into a new PagePermissionPolicy class. I was initially planning to try subclassing PagePermissionPolicy so I could explicitly initialize it with the current site. But because PagePermissionPolicy is instantiated in 27 places across 17 different files, switching out the policy class for a subclass is impractical. So I have gone back to our monkey patching strategy.

Method diagram for Wagtail's PagePermissionPolicy

When I diagram the method calls within PagePermissionPolicy, I see that they nearly all go through get_all_permissions_for_user - the main method used to query the GroupPagePermission table. The results of this query are cached and used by other parts of the Wagtail admin interface as needed.

To enforce our site separation requirement, I added a filter for pages on the current site:

    return GroupPagePermission.objects.filter(
        group__user=user,
        page__path__startswith=site.root_page.path
    ).select_related(
        "page", "permission"
    )
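The reason a simple prefix filter is enough is that Wagtail stores pages as a materialized-path tree: every page's path begins with the path of its parent. The following is a minimal plain-Python sketch of that idea, with made-up paths and titles (it does not touch the database or Wagtail itself):

```python
# Plain-Python illustration of a materialized-path prefix filter.
# The paths and titles below are invented for this example.
pages = {
    "0001": "Root",
    "00010001": "Site A home",
    "000100010001": "Site A about",
    "00010002": "Site B home",
    "000100020001": "Site B news",
}


def pages_under(root_path):
    """Return titles of the page at root_path and all of its descendants."""
    return [title for path, title in sorted(pages.items()) if path.startswith(root_path)]


site_a_pages = pages_under("00010001")
print(site_a_pages)  # ['Site A home', 'Site A about']
```

Filtering on the site root's path therefore keeps the root page and everything below it, and drops pages that belong to any other site.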

To allow superadmins to behave as site admins, I explicitly filtered for the site admin group:

    # Give them the permissions of the site admin group
    group = Group.objects.filter(name=f'{site.hostname} Admins').first()
    return GroupPagePermission.objects.filter(group=group).select_related(
        "page", "permission"
    )

Combining those two, our full version of get_all_permissions_for_user is:

    def mutitenant_get_all_page_permissions_for_user(self, user):
        if not user.is_active or user.is_anonymous or user.is_superuser:
            return GroupPagePermission.objects.none()

        # BEGIN PATCH
        request = get_current_request()
        if not request:
            logger.error(
                'In PagePermissionPolicy.mutitenant_get_all_page_permissions_for_user but could not get the request.'
            )
            return GroupPagePermission.objects.none()

        # So now restrict checks to permissions for the current site
        site = Site.find_for_request(request)
        if user.is_superadmin:
            # Give them the permissions of the site admin group
            group = Group.objects.filter(name=f'{site.hostname} Admins').first()
            return GroupPagePermission.objects.filter(group=group).select_related(
                "page", "permission"
            )
        else:
            # filter for current user and for permissions relevant only to this site
            return GroupPagePermission.objects.filter(
                group__user=user,
                page__path__startswith=site.root_page.path
            ).select_related(
                "page", "permission"
            )
    # Getting this function used is covered below

The behavior changes are both relatively straightforward; the tricky bit is getting the site. In the code above that is taken care of by Site.find_for_request plus our get_current_request method. This could be a problem if get_all_permissions_for_user were called from code that does not have access to the request. Fortunately almost all the places that instantiate PagePermissionPolicy are views or, if the instantiating code is not itself a view, the methods that need the permission policy are only executed from a view. For example, the is_shown method for MenuItem subclasses is only executed when a user is viewing the admin UI.

Looking at the diagram above, you can see that the next-to-bottom row contains, in addition to get_all_permissions_for_user, two other methods that query GroupPagePermission. As of Wagtail 5.1.1, users_with_any_permission is not in use and users_with_any_permission_for_instance is used in only one place (send_moderation_notification). But for the sake of completeness, I have monkey patched them both:

    def mutitenant_users_with_any_permission(self, actions, include_superusers=True):
        """
        2023-07-22 cnk: I patched this because it had a query in it but as of Wagtail 5.1.1 this
        method is not in use, nor is users_with_permission which delegates to this method
        """
        # User with only "add" permission can still edit their own pages
        actions = set(actions)
        if "change" in actions:
            actions.add("add")

        # BEGIN PATCH
        request = get_current_request()
        if not request:
            logger.error('In PagePermissionPolicy.mutitenant_users_with_any_permission but could not get the request.')
            return get_user_model()._default_manager.none()

        # So now restrict checks to permissions for the current site
        site = Site.find_for_request(request)
        groups = GroupPagePermission.objects.filter(
            permission__codename__in=self._get_permission_codenames(actions),
            group__name__startswith=site.hostname
        ).values_list("group", flat=True)

        q = Q(groups__in=groups)
        # Superadmins will have all page permissions because Admins do
        q |= Q(is_superadmin=True)
        # END PATCH
        if include_superusers:
            q |= Q(is_superuser=True)

        return (
            get_user_model()
            ._default_manager.filter(is_active=True)
            .filter(q)
            .distinct()
        )


    def multitenant_users_with_any_permission_for_instance(
        self, actions, instance, include_superusers=True
    ):
        """
        2023-07-22 cnk: I patched this because it had a query in it but as of Wagtail 5.1.1 the only
        place this is used is send_moderation_notification. Since this is for an instance, it naturally
        filters for just one site - but we need to add in superadmins.
        """
        # Find permissions for all ancestors that match any of the actions
        ancestors = instance.get_ancestors(inclusive=True)
        groups = GroupPagePermission.objects.filter(
            permission__codename__in=self._get_permission_codenames(actions),
            page__in=ancestors,
        ).values_list("group", flat=True)

        q = Q(groups__in=groups)

        # BEGIN PATCH
        # Superadmins will have all page permissions because Admins do
        q |= Q(is_superadmin=True)
        # END PATCH
        if include_superusers:
            q |= Q(is_superuser=True)

        # If "change" is in actions but "add" is not, then we need to check for
        # cases where the user has "add" permission on an ancestor, and is the
        # owner of the instance
        if "change" in actions and "add" not in actions:
            add_groups = GroupPagePermission.objects.filter(
                permission__codename=get_permission_codename("add", self.model._meta),
                page__in=ancestors,
            ).values_list("group", flat=True)

            q |= Q(groups__in=add_groups) & Q(pk=instance.owner_id)

        return (
            get_user_model()
            ._default_manager.filter(is_active=True)
            .filter(q)
            .distinct()
        )

And finally, to get our versions of these methods used, we import PagePermissionPolicy and replace the functions:

    from wagtail.permission_policies.pages import PagePermissionPolicy
    PagePermissionPolicy.get_all_permissions_for_user = mutitenant_get_all_page_permissions_for_user
    PagePermissionPolicy.users_with_any_permission = mutitenant_users_with_any_permission
    PagePermissionPolicy.users_with_any_permission_for_instance = multitenant_users_with_any_permission_for_instance
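Assigning a function to a class attribute works here because Python looks methods up on the class at call time, so every instance, wherever it was created, picks up the replacement. A minimal sketch of that mechanism, using a hypothetical stand-in class rather than Wagtail's real policy:

```python
class Policy:
    """Stand-in for a library class that is instantiated in many places."""

    def permissions_for(self, user):
        return ["all-sites"]


def patched_permissions_for(self, user):
    # Replacement logic; `self` is still the Policy instance.
    return [f"current-site-only:{user}"]


existing = Policy()  # created before the patch is applied
Policy.permissions_for = patched_permissions_for

print(existing.permissions_for("alice"))  # ['current-site-only:alice']
print(Policy().permissions_for("bob"))    # ['current-site-only:bob']
```

This is why we do not have to touch any of the places that instantiate the policy: patching the class once, at startup, is enough.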

Collection Permission Patches

In addition to managing their own pages, site owners need to be able to manage their own images and documents. Permissions for images and documents are controlled by permissions set on the collection that contains them. When we create a new site, we create a collection for it and give the site’s Admin group permission to create collections underneath that parent collection. Permissions for managing the collections themselves are handled by the CollectionManagementPermissionPolicy, and access to images and documents is handled by the CollectionOwnershipPermissionPolicy. Both of those use the CollectionPermissionLookupMixin to query GroupCollectionPermission. In the diagrams below, methods coming from CollectionPermissionLookupMixin are denoted with a “*”. Prior to Wagtail 5.1 we were patching CollectionPermissionLookupMixin.check_perm and CollectionPermissionLookupMixin.collections_with_perm, but as of Wagtail 5.1 most of the collection permission logic goes through CollectionPermissionLookupMixin.get_all_permissions_for_user.

Document and Image Permissions

The more important set of permissions is in the CollectionOwnershipPermissionPolicy class. This class decides what permissions a user has over the images and documents stored in the site’s collections. As you can see in the diagram below, all of the policy’s queries flow through get_all_permissions_for_user, so we can enforce our rules by patching that one method.

Method diagram for Wagtail's CollectionOwnershipPermissionPolicy

As with page permissions, the first access to a Collection triggers a query against the GroupCollectionPermission model (via get_all_permissions_for_user), and the result is cached on the user object. So we make patches similar to the ones we made above for pages: one line filters the permissions down to this site’s collection, and a separate change treats superadmins as members of the site’s Admin group. Our naming convention ensures that we can find the site’s base collection once we know the site for this request.

    def mutitenant_get_all_collection_permissions_for_user(self, user):
        """
        This method does a lot of the filtering for collections the user has access to. If we can get a
        request here, we can enforce a lot of our special cases right here.
            1. Users should only see collections for the current site - even if they have permissions on
               other sites. So we need to filter permissions for the site's root collection.
            2. If the user is a superadmin, we need to fake assigning them to the site's Admin group.
        """
        # For these users, we can determine the permissions without querying
        # GroupCollectionPermission by checking it directly in _check_perm()
        if not user.is_active or user.is_anonymous or user.is_superuser:
            return GroupCollectionPermission.objects.none()

        # BEGIN PATCH
        request = get_current_request()
        if not request:
            logger.error(
                'In CollectionPermissionLookupMixin.mutitenant_get_all_collection_permissions_for_user '
                'but could not get the request.'
            )
            return GroupCollectionPermission.objects.none()

        # So now restrict checks to the collections for the current site
        site = Site.find_for_request(request)
        collection = Collection.objects.filter(name=site.hostname).first()
        if user.is_superadmin:
            group = Group.objects.filter(name=f'{site.hostname} Admins').first()
            return GroupCollectionPermission.objects.filter(
                group=group,
                collection=collection
            ).select_related("permission", "collection")
        else:
            return GroupCollectionPermission.objects.filter(
                group__user=user,
                collection=collection
            ).select_related("permission", "collection")
        # END PATCH


    from wagtail.permission_policies.collections import CollectionPermissionLookupMixin
    CollectionPermissionLookupMixin.get_all_permissions_for_user = mutitenant_get_all_collection_permissions_for_user

Collection Management

Collection management permissions allow admins to create their own nested set of collections. As you can see in the diagram below, the CollectionManagementPermissionPolicy’s permissions also all flow through get_all_permissions_for_user so the patch above that we used for managing items stored in collections takes care of most of the policy changes needed for managing the collections themselves.

Collection Management Permissions

Method diagram for Wagtail's CollectionManagementPermissionPolicy

The one additional thing we need to patch is a helper method used to decide which collections a user may delete: _descendants_with_perm. (If we omit this patch, admins can’t delete any collections.)

    def multitenant__descendants_with_perm(self, user, action):
        """
        Return a queryset of collections descended from a collection on which this user has
        a GroupCollectionPermission record for this action. Used for actions, like edit and
        delete where the user cannot modify the collection where they are granted permission.
        """
        # Get the permission object corresponding to this action
        permission = self._get_permission_objects_for_actions([action]).first()

        # BEGIN PATCH
        # Replace the check for permission on the User's full list of Groups to a check for
        # permissions on only the current Site's Groups. Also take SuperAdmins into account.
        request = get_current_request()
        if not request:
            logger.error('In CollectionManagementPermissionPolicy.multitenant__descendants_with_perm but could not get the request.')
            return Collection.objects.none()

        site = Site.find_for_request(request)
        collection = Collection.objects.filter(name=site.hostname).first()

        # Fill in SuperAdmin groups
        if user.is_superadmin:
            groups = Group.objects.filter(name=f'{site.hostname} Admins').all()
        else:
            # user.groups.all() is what is in the original; we could restrict by site but the collection
            # filter will remove permissions not relevant to this site
            groups = user.groups.all()

        # Get the collections that have a GroupCollectionPermission record
        # for this permission and any of the user's groups; create a list of their paths
        # PATCH: restrict to collections belonging to this site
        collection_roots = Collection.objects.descendant_of(collection, inclusive=True).filter(
            group_permissions__group__in=groups,
            group_permissions__permission=permission,
        ).values("path", "depth")
        # END PATCH

        if collection_roots:
            # build a filter expression that will filter our model to just those
            # instances in collections with a path that starts with one of the above
            # but excluding the collection on which permission was granted
            collection_path_filter = Q(
                path__startswith=collection_roots[0]["path"]
            ) & Q(depth__gt=collection_roots[0]["depth"])
            for collection in collection_roots[1:]:
                collection_path_filter = collection_path_filter | (
                    Q(path__startswith=collection["path"])
                    & Q(depth__gt=collection["depth"])
                )
            return Collection.objects.all().filter(collection_path_filter)
        else:
            # no matching collections
            return Collection.objects.none()


    from wagtail.permission_policies.collections import CollectionManagementPermissionPolicy
    CollectionManagementPermissionPolicy._descendants_with_perm = multitenant__descendants_with_perm
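The path and depth filter at the end of that method selects collections strictly below the ones where permission was granted. A plain-Python sketch of the same rule, using an invented collection tree instead of a queryset:

```python
# (path, depth) pairs for a made-up collection tree
collections = [
    ("0001", 1),          # Root
    ("00010001", 2),      # site-a.example.com
    ("000100010001", 3),  # site-a / Photos
    ("000100010002", 3),  # site-a / Reports
    ("00010002", 2),      # site-b.example.com
]


def strict_descendants(granted_roots):
    """Collections below any granted root, excluding the roots themselves."""
    return [
        path
        for path, depth in collections
        if any(
            path.startswith(root_path) and depth > root_depth
            for root_path, root_depth in granted_roots
        )
    ]


deletable = strict_descendants([("00010001", 2)])
print(deletable)  # ['000100010001', '000100010002']
```

The depth comparison is what keeps a site admin from deleting the site's own base collection while still allowing them to delete anything nested under it.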

Permissions for other models

We also need per-site permissions to manage other kinds of models - Snippets in Wagtail’s terminology. Please see the last section of Snippets for the code we use in our authentication backend.

Monkey Patching Wagtail

November 04, 2023

At work we run a large multitenant version of Wagtail (~500 separate websites on a single installation). To achieve this and to make some other changes to the way Wagtail behaves, we have a number of monkey patches. So we have consolidated all of them in their own Django app which we called wagtail_patches. This is loaded into our INSTALLED_APPS after most of our own apps but before any of the Wagtail apps:

    # settings.py
    INSTALLED_APPS = [
        # Multitenant apps. These are ordered with regard to template overrides.
        'core',
        'search',
        'site_creator',
        'calendar',
        'theme_v6_5',
        'theme_v7_0',
        'robots_txt',
        'wagtail_patches',  #####
        'sitemap',
        'features',
        'custom_auth',

        # Wagtail apps.
        'wagtail.embeds',
        'wagtail.sites',
        'wagtail.users',
        'wagtail.snippets',
        'wagtail.documents',
        # We use a custom replacement for wagtail.images that makes it add decoding="async" and loading="lazy" attrs.
        # 'wagtail.images',
        'wagtail_patches.apps.MultitenantImagesAppConfig',
        'wagtail.search',
        'wagtail.admin',
        'wagtail',
        'wagtail.contrib.modeladmin',
        'wagtail.contrib.settings',
        'wagtail.contrib.routable_page',

        # Wagtail dependencies, django, etc.....
    ]

And then in that app, we use the apps.py file to load everything from the patches directory:

    from django.apps import AppConfig
    from wagtail.images.apps import WagtailImagesAppConfig


    class WagtailPatchesConfig(AppConfig):
        name = 'wagtail_patches'
        verbose_name = 'Wagtail Patches'
        ready_is_done = False
        # If there are multiple AppConfigs in a single apps.py, one of them needs to be default=True.
        default = True

        def ready(self):
            """
            This function runs as soon as the app is loaded. It executes our monkey patches to various parts of Wagtail
            that change it to support our architecture of fully separated tenants.
            """
            # As suggested by the Django docs, we need to make absolutely certain that this code runs only once.
            if not self.ready_is_done:
                # The act of performing this import executes all the code in patches/__init__.py.
                from . import patches  # noqa
                self.ready_is_done = True
            else:
                print("{}.ready() executed more than once! This method's code is skipped on subsequent runs.".format(
                    self.__class__.__name__
                ))


    class MultitenantImagesAppConfig(WagtailImagesAppConfig):
        default_attrs = {"decoding": "async", "loading": "lazy"}

You will note that the first of our customizations is right in apps.py. We use this file to configure default HTML attributes for image tags generated by Wagtail - per the instructions in “Adding default attributes to all images”.

Patching views

We have a handful of views that need overrides. Mostly these involve changing querysets or altering filters so the choices are limited to users belonging to the current site. The easiest option is to subclass the existing view, make our changes, then assign our subclass to the same path as the original.

I use the show_urls command from django_extensions to find the existing mapping. And then I map my replacement view to the same pattern. So for replacing the page explorer view, I added the following two lines:

    # patched_urls.py
    from django.urls import path

    from .views.page_explorer import MultitenantPageIndexView

    patched_wagtail_urlpatterns = [
        # This overrides the wagtailadmin_explore_page (aka page listing view) so we can monkey patch the filters
        path('admin/pages/', MultitenantPageIndexView.as_view()),
        path('admin/pages/<int:parent_page_id>/', MultitenantPageIndexView.as_view()),
    ]

Because we have a bunch of overrides, we have a patched_urls.py in our wagtail_patches app. Then, in our main urls.py file, we add that pattern before our other mappings:

    # urls.py
    from django.urls import include, path
    from wagtail import views as wagtailcore_views
    from wagtail_patches.patched_urls import patched_wagtail_urlpatterns

    # We override several /admin/* URLs with our own custom versions
    urlpatterns = patched_wagtail_urlpatterns + [
        # We now include Wagtail's own admin URLs.
        path('admin/', include('wagtail.admin.urls')),
        path('documents/', include('wagtail.documents.urls')),
        # ... our custom urls and the rest of the standard Wagtail url mappings
    ]

I then use show_urls to check my mapping. As long as our version is the second one, then it will get used. If you feel like your changes are getting ignored, start by checking to see that the url pattern for your override exactly matches the original pattern.

Multitenancy with Wagtail

November 01, 2023

If you want to run several sites from the same Wagtail codebase, you have a couple of options which are summarized in the Wagtail docs.

Wagtail fully supports “multi-site” installations “where content creators go into a single admin interface and manage the content of multiple websites”. But at work, we would like our Wagtail installation to treat every site as if it were completely independent. So if you have permissions on Site A and Site B, when you’re logged in to Site A, you should only see content, images, etc. from Site A. We also want site owners to be able to manage just about everything for their site. This means that they need to be able to configure their own site’s settings, manage their own collections, images, and documents, and manage their own users. This series of blog posts will cover the changes we have made to enforce our version of multitenancy for sites built with the Wagtail CMS.

These posts were originally written describing our patches while running Wagtail 5.1 (and Django 3.2). I have subsequently updated them for additional patches I made to upgrade to Wagtail 6.0 (and Django 4.2).

Determining the current site

When we first started using Wagtail, it included its own site middleware so request.site was available in all views. When this was removed in Wagtail 2.9, we started using CrequestMiddleware to make the request information available from a variety of contexts. We generally access the request via our own get_current_request method, which allows us to provide a useful error message if the request is not available.

    def get_current_request(default=None, silent=True, label='__DEFAULT_LABEL__'):
        """
        Returns the current request.

        You can optionally use ``default`` to pass in a fake request object to act as the default if there
        is no current request, e.g. when ``get_current_request()`` is called during a manage.py command.

        :param default: (optional) a fake request object
        :type default: an object that emulates a Django request object

        :param silent: If ``False``, raise an exception if CrequestMiddleware can't get us a request object.  Default: True
        :type silent: boolean

        :param label: If ``silent`` is ``False``, put this label in our exception message
        :type label: string

        :rtype: a Django request object
        """
        request = CrequestMiddleware.get_request(default)
        if request is None and not silent:
            raise NoCurrentRequestException(
                "{} failed because there is no current request. Try using djunk.utils.FakeCurrentRequest.".format(label)
            )
        return request

NOTE: get_current_request has a parameter for supplying a default request if none is available when the method is called, but in practice we never provide a default in code that is trying to access the request. Instead we use one of the methods below to fake the request and then let get_current_request use that to determine the site.
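The general idea behind this kind of middleware can be sketched with threading.local: stash the request somewhere that is private to the current thread, and read it back from code that was never handed the request. This is a simplified stand-in, not CrequestMiddleware's actual implementation; the names set_request, clear_request, get_request, and NoCurrentRequest are hypothetical.

```python
import threading

# Each thread gets its own independent attribute namespace on this object.
_state = threading.local()


class NoCurrentRequest(Exception):
    pass


def set_request(request):
    _state.request = request


def clear_request():
    if hasattr(_state, "request"):
        del _state.request


def get_request(default=None, silent=True):
    """Return the request stored for this thread, or `default` if there is none."""
    request = getattr(_state, "request", default)
    if request is None and not silent:
        raise NoCurrentRequest("no current request")
    return request


set_request({"host": "site-a.example.com"})
print(get_request()["host"])  # site-a.example.com
clear_request()
print(get_request())  # None
```

In a real project the middleware would call the equivalent of set_request at the start of each request and clear it at the end, which is why code running outside a request (scripts, tests) finds nothing there.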

Setting current site in scripts and tests

Our data imports, manage.py scripts, and tests do not have a browser context, so get_current_request will fail in those circumstances. We have created a couple of methods to help set the request and site in those circumstances. This is working but it remains a bit of a pain point.

    class FakeRequest:
        """
        FakeRequest takes the place of the Django HttpRequest object in various testing scenarios where
        a real one doesn't exist, but the code under test expects one to be there.

        Wagtail 2.9 now determines the current Site by looking at the hostname and port in the request object,
        which means it calls get_host() on our faked out requests. Thus, we need to emulate it.
        """

        def __init__(self, site=None, user=None, **kwargs):
            self.user = user
            # Include empty GET and POST attrs, so code which expects request.GET or request.POST to exist won't crash.
            # They are separate dicts so that changing one does not silently change the other.
            self.GET = {}
            self.POST = {}
            # Callers can override GET and POST, or override/add any other attribute using kwargs.
            self.__dict__.update(kwargs)
            self._wagtail_site = site

        def get_host(self):
            if not self._wagtail_site:
                return 'fakehost'
            return self._wagtail_site.hostname

        def get_port(self):
            # It should be safe to pretend all test traffic is on port 443.
            # HttpRequest.get_port() explicitly returns a string, so we do, too.
            return '443'


    def set_fake_current_request(site=None, user=None, request=None, **kwargs):
        """
        Sets the current request to either a specified request object or a FakeRequest object built from the given Site
        and/or User. Any additional keyword args are added as attributes on the FakeRequest.
        """
        # If the caller didn't provide a request object, create a FakeRequest.
        if request is None:
            request = FakeRequest(site, user, **kwargs)
        # Set the created (or provided) request as the "current request".
        CrequestMiddleware.set_request(request)
        return request


    class FakeCurrentRequest():
        """
        Implements set_fake_current_request() as a context manager. Use like this:

            with FakeCurrentRequest(some_site, some_user):
                # .. do stuff

        OR

            with FakeCurrentRequest(request=some_request):
                # .. do stuff

        When the context manager exits, the current request will be automatically reverted to its previous state.
        """
        NO_CURRENT_REQUEST = 'no_current_request'

        def __init__(self, site=None, user=None, request=None, **kwargs):
            self.site = site
            self.user = user
            self.request = request
            self.kwargs = kwargs

        def __enter__(self):
            # Store a copy of the original current request, so we can restore it when the context manager exits.
            self.old_request = CrequestMiddleware.get_request(default=self.NO_CURRENT_REQUEST)
            return set_fake_current_request(self.site, self.user, self.request, **self.kwargs)

        def __exit__(self, *args):
            if self.old_request == self.NO_CURRENT_REQUEST:
                # If there wasn't a current request when we entered the context manager, remove the current request.
                CrequestMiddleware.del_request()
            else:
                # Otherwise, set the current request back to whatever it was when we entered.
                CrequestMiddleware.set_request(self.old_request)

On Campus Middleware

November 23, 2022

Note this code uses regular expressions to determine if a request comes from one of our allowed IPs. This should really be reworked to use a library that does proper netmask calculations.

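    import re

    from django.conf import settings
    from django.utils.deprecation import MiddlewareMixin

    # get_client_ip is a helper that is not shown in this post.


    class OnCampusMiddleware(MiddlewareMixin):
        """
        Middleware sets ON_CAMPUS session variable to True if the request
        came from a campus IP or if the user is authenticated.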

        2022-04-09 Storing ON_CAMPUS in the session is causing us to set a
        cookie for every request which interferes with Cloudflare caching.
        If your site is largely for anonymous users, store ON_CAMPUS in the request
        itself by adding STORE_ON_CAMPUS_IN_SESSION=False to your settings.py
        """

        CAMPUS_ADDRESSES = [
            # redacted
            r'192\.168\.\d{1,3}\.\d{1,3}',
            r'127\.0\.0\.1',
        ]

        def check_ip(self, request):
            client_ip = get_client_ip(request)

            if client_ip:
                for ip_regex in self.CAMPUS_ADDRESSES:
                    # fullmatch anchors both ends, so '127.0.0.1' does not also match '127.0.0.10'
                    if re.fullmatch(ip_regex, client_ip):
                        return True
            return False

        def process_request(self, request):
            # A user is considered "on campus" if they are visiting from a campus IP, or are logged in
            # to the site.
            if getattr(settings, 'STORE_ON_CAMPUS_IN_SESSION', True):
                request.session['ON_CAMPUS'] = request.user.is_authenticated or self.check_ip(request)
            else:
                request.on_campus = request.user.is_authenticated or self.check_ip(request)
            return None

Then to use this in a Django project:

    # settings.py
    ...
    MIDDLEWARE = [
        # Normal Django middleware stack
        # Sets request.on_campus = True for logged-in users, and for visitors who come from a campus IP.
        # Set STORE_ON_CAMPUS_IN_SESSION to False to prevent setting cookies for anonymous users.
        'djunk.middleware.OnCampusMiddleware',
    ]

    STORE_ON_CAMPUS_IN_SESSION = False
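The note at the top of this post says the regular expressions should be replaced with proper netmask calculations. One way that could look, using Python's standard ipaddress module, is sketched below. The networks are examples only (the real campus ranges are redacted above), and is_campus_ip is a hypothetical helper name, not something from our codebase.

```python
import ipaddress

# Example networks only; the real campus ranges are not shown in this post.
CAMPUS_NETWORKS = [
    ipaddress.ip_network("192.168.0.0/16"),
    ipaddress.ip_network("127.0.0.1/32"),
]


def is_campus_ip(client_ip):
    """Return True if client_ip falls inside any of the configured networks."""
    try:
        address = ipaddress.ip_address(client_ip)
    except ValueError:
        # Not a valid IP address (empty string, garbage header value, etc.)
        return False
    return any(address in network for network in CAMPUS_NETWORKS)


print(is_campus_ip("192.168.12.7"))  # True
print(is_campus_ip("127.0.0.10"))    # False
print(is_campus_ip("10.1.2.3"))      # False
```

Expressing the ranges in CIDR notation avoids the partial-match pitfalls of pattern matching on dotted strings and also rejects malformed addresses outright.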

Import files into Wagtail

July 02, 2022

I am building a site that is replacing an older site and I want to preserve a substantial number of PDF files. So I wrote a manage.py command to import all the files in a nested set of directories into corresponding nested collections in Wagtail. For example, given the following local directory:

  archive
      - some-file.pdf
      - 2020
        - file1.pdf
        - file2.pdf
      - 2021
        - file3.pdf

my script will create collections for 2020 and 2021 and then import the 4 PDF files into the correct collections and sub-collections.
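Independently of Wagtail, the mapping from a file's location to its chain of nested collection names can be sketched with os.walk. This is an illustration of the idea only; files_with_collection_path is a hypothetical helper and is not necessarily how the importer below finds its files.

```python
import os
import tempfile


def files_with_collection_path(base_dir):
    """Yield (collection_names, filename) for every file under base_dir."""
    for dirpath, dirnames, filenames in os.walk(base_dir):
        dirnames.sort()  # make the traversal order deterministic
        relative = os.path.relpath(dirpath, base_dir)
        # Files directly in base_dir belong to the base collection itself
        collection_names = [] if relative == "." else relative.split(os.sep)
        for filename in sorted(filenames):
            yield collection_names, filename


# Build the example tree from above in a temporary directory
with tempfile.TemporaryDirectory() as archive:
    for relative_path in ["some-file.pdf", "2020/file1.pdf", "2020/file2.pdf", "2021/file3.pdf"]:
        full_path = os.path.join(archive, relative_path)
        os.makedirs(os.path.dirname(full_path), exist_ok=True)
        open(full_path, "w").close()
    found = list(files_with_collection_path(archive))

print(found)
# [([], 'some-file.pdf'), (['2020'], 'file1.pdf'), (['2020'], 'file2.pdf'), (['2021'], 'file3.pdf')]
```

Each list of names describes the sub-collections to create (or look up) under the base collection before attaching the file.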

  # core/management/commands/import_documents_from_directory.py

  from django.core.exceptions import ObjectDoesNotExist
  from django.core.management import BaseCommand, CommandError
  from wagtail.models import Collection, get_root_collection_id

  from core.jobs.document_importer import DocumentImporter

  class Command(BaseCommand):
      help = (
          "Imports all files nested under `pdf-directory` into "
          "corresponding collection under the given base collection."
      )

      def add_arguments(self, parser):
          parser.add_argument(
              '--pdf-directory',
              dest='pdf_directory',
              default='/tmp/documents',
              help="Path to the local directory where the PDFs are located"
          )

          parser.add_argument(
              '--base-collection',
              dest='base_collection',
              required=False,
              help="Which collection should get these files? Will use the root collection if this is missing."
          )

          parser.add_argument(
              '--dry-run',
              action='store_true',
              dest='dry_run',
              default=False,
              help='Try not to change the database; just show what would have been done.',
          )

      def handle(self, **options):
          if options['base_collection']:
              try:
                  base_collection = Collection.objects.get(name=options['base_collection'])
              except ObjectDoesNotExist:
                  raise CommandError(f"Base collection \"{options['base_collection']}\" does not exist")
          else:
              base_collection = Collection.objects.get(pk=get_root_collection_id())

          importer = DocumentImporter()
          importer.import_all(options['pdf_directory'], base_collection, options['dry_run'])

  # core/jobs/document_importer.py

  import hashlib
  import os
  from django.core.files import File

  from wagtail.documents import get_document_model
  from wagtail.models import Collection

  from core.logging import logger


  class DocumentImporter(object):
      """
      Given a nested directory of files, import them into Wagtail's document model, preserving the
      folder structure as nested collections.
      """

      def import_all(self, pdf_directory, base_collection, dry_run=False):
          for path, file in self._get_files(pdf_directory):
              collection = self._get_collection(path, pdf_directory, base_collection, dry_run)
              self._create_document(file, path, collection, dry_run)

      def _get_files(self, root):
          """Recursively yield (directory, filename) for every file in the root directory and below"""
          for path, dirs, files in os.walk(root):
              yield from ((path, file) for file in files)

      def _get_collection(self, path, pdf_directory, base_collection, dry_run):
          """
          Construct a nested set of collections corresponding to the nested directories.
          """
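          current_parent = base_collection
          rel_path = os.path.relpath(path, pdf_directory)
          # Files directly inside pdf_directory have a relative path of '.'; they belong in the
          # base collection itself, so there is nothing to create for them.
          parts = [] if rel_path == os.curdir else rel_path.split(os.sep)
          for part in parts: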
              # Only look at direct children, so a same-named collection deeper in the tree is not matched.
              collection = current_parent.get_children().filter(name=part).first()
              if collection:
                  current_parent = collection
                  logger.info(
                      'document_importer.collection.found',
                      dry_run=dry_run,
                      name=part,
                  )
              else:
                  # create this collection
                  if not dry_run:
                      collection = Collection(name=part)
                      current_parent.add_child(instance=collection)
                      # Set this as the parent for the next node in our list
                      current_parent = collection
                  logger.info(
                      'document_importer.collection.create',
                      dry_run=dry_run,
                      name=part,
                  )
          return current_parent

      def _create_document(self, file, path, collection, dry_run):
          doc = get_document_model().objects.filter(file__endswith=file).first()
          if doc:
              op = "update"
              if dry_run:
                  self.__log_document_changes(op, file, collection, dry_run)
              else:
                  with open(os.path.join(path, file), "rb") as fd:
                      new_hash = hashlib.sha1(fd.read()).hexdigest()
                      if new_hash != doc.file_hash:
                          # Hashing read the handle to the end, so rewind before handing it to Django.
                          fd.seek(0)
                          doc.file = File(fd, name=file)
                          doc.file_size = len(doc.file)
                          doc.file_hash = new_hash
                          doc.save()
                          self.__log_document_changes(op, file, collection, dry_run)
                      if collection != doc.collection:
                          doc.collection = collection
                          doc.save()
                          self.__log_document_changes(op, file, collection, dry_run)
          else:
              op = "create"
              if dry_run:
                  self.__log_document_changes(op, file, collection, dry_run)
              else:
                  with open(f'{path}/{file}', "rb") as fd:
                      doc = get_document_model()(title=file, collection=collection)
                      doc.file = File(fd, name=file)
                      doc.file_size = len(doc.file)
                      doc.file_hash = hashlib.sha1(fd.read()).hexdigest()
                      # Hashing read the handle to the end, so rewind before saving.
                      fd.seek(0)
                      doc.save()
                      self.__log_document_changes(op, file, collection, dry_run)

      def __log_document_changes(self, op, file, collection, dry_run):
          logger.info(
              "document_importer.document.{}".format(op),
              dry_run=dry_run,
              file=file,
              collection=collection,
          )
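
The update branch decides whether to re-upload a file by comparing SHA-1 hashes. The importer reads each file into memory in one go, which is fine for modest PDFs. For larger files, the same check could hash in chunks and rewind the handle afterwards. This is a sketch only, and the names sha1_of and needs_upload are hypothetical:

```python
import hashlib


def sha1_of(fileobj, chunk_size=64 * 1024):
    """Hash a binary file object in chunks, then rewind it for reuse."""
    digest = hashlib.sha1()
    # iter() with a sentinel keeps reading until read() returns b"" at EOF.
    for chunk in iter(lambda: fileobj.read(chunk_size), b""):
        digest.update(chunk)
    fileobj.seek(0)
    return digest.hexdigest()


def needs_upload(fileobj, stored_hash):
    """True when the local content differs from the stored hash."""
    return sha1_of(fileobj) != stored_hash
```

Because sha1_of rewinds the handle, the same open file can be passed to Django's File wrapper right after the comparison.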
