Refactoring with Rector

Rector is a tool for making changes to PHP code, which powers tools that assist with upgrading deprecated code in Drupal. When I recently made some refactoring changes in Drupal Code Builder, which were too complex to do with search and replace regexes, it seemed like a good opportunity to experiment with Rector, and learn a bit more about it.

Besides, I'm an inveterate condiment-passer: I tend to prefer spending longer on a generalisation of a problem than on the problem itself, and the more dull the problem and the more interesting the generalisation, the more the probability increases.

So faced with a refactoring from this return from the getFileInfo() method:

    return [
      'path' => '',
      'filename' => $this->component_data['filename'],
      'body' => $body,
      'merged' =>$merged,
    ];

to this:

    return new CodeFile(
      body_pieces: $body,
      merged: $merged,
    );

which was going to be tedious as hell to do in a ton of files, obviously, I elected to spend time fiddling with Rector.

The first thing I'll say is that the same sort of approach as I use with migrations works well: work with a small indicative sample, and iterate small changes. With a migration, I will find a small number of source rows which represent different variations (or if there is too much variation, I'll iterate the iteration multiple times). I'll run the migration with just those sources, examine the result, make refinements to the migration, then roll back and repeat.

With Rector, you can specify just a single class in the code that registers the rule to RectorConfig in the rector.php file, so I picked a class which had very little code, as the dump() output of an entire PHP file's PhpParser analysis is enormous.

You then use the rule class's getNodeTypes() method to declare which node types you are interested in. Here, I made a mis-step at first. I wanted to replace Array_ nodes, but only in the getFileInfo() method. So in my first attempt, I specified ClassMethod nodes, and then in refactor() I wrote code to drill down into them to get the array Array_ nodes. This went well until I tried returning a new replacement node, and then Rector complained, and I realised the obvious thing I'd skipped over: the refactor() method expects you to return a node to replace the found node. So my approach was completely wrong.

I rewrote getNodeTypes() to search for Array_ nodes: those represent the creation of an array value. This felt more dangerous: arrays are defined in lots of places in my code! And I haven't been able to see a way to determine the parentage of a node: there do not appear to be pointers that go back up the PhpParser syntax tree (it would be handy, but would make the dump() output even worse to read!). Fortunately, the combination of array keys was unique in DrupalCodeBuilder, or at least I hoped it was fairly unique. So I wrote code to get a list of the array's keys, and then compare it to what was expected:

        foreach ($node->items as $item) {
            $seen_array_keys[] = $item->key->value;
        }
        if (array_intersect(static::EXPECTED_MINIMUM_ARRAY_KEYS, $seen_array_keys) != static::EXPECTED_MINIMUM_ARRAY_KEYS) {
            return NULL;
        }

Returning NULL from refactor() means we aren't interested in this node and don't want to change it.

With the arrays that made it through the filter, I needed to make a new node that's a class instantiation, to replace the array, passing the same values to the new statement as the array keys (mostly).

Rector's list of commonly used PhpParser nodes was really useful here.

A new statement node is made thus:

use PhpParser\Node\Name;
use PhpParser\Node\Expr\New_;

        $class = new Name('\DrupalCodeBuilder\File\CodeFile');
        return new New_($class);

This doesn't have any parameters yet, but running Rector on this with my sample set showed me it was working properly. Rector has a dry run option for development, which shows you what would change but doesn't write anything to files, so you can run it over and over again. What's confusing is that it also has a cache; until I worked this out I was repeatedly confused by some runs having no effect and no output. I have honestly no idea what the point is of caching something that's designed to make changes, but there is an option to disable it. So the command to run is: $ vendor/bin/rector --dry-run --clear-cache. Over and over again.

Once that worked, I needed to convert array items to constructor parameters. Fortunately, the value from the array items work for parameters too:

use PhpParser\Node\Arg;

        foreach ($node->items as $array_item) {
                $construct_argument = new Arg(
                   $array_item->value,
                );

That gave me the values. But I wanted named parameters for my constructor, partly because they're cool and mostly because the CodeFile class's __construct() has optional parameters, and using names makes that simpler.

Inspecting the Arg class's own constructor showed how to do this:

use PhpParser\Node\Arg;
use PhpParser\Node\Identifier;

                $construct_argument = new Arg(
                    value: $array_item->value,
                    name: new Identifier($key),
                );

Using named parameters here too to make the code clearer to read!

It's also possible to copy over any inline comments that are above one node to a new node:

            // Preserve comments.
            $construct_argument->setAttribute('comments', $array_item->getComments());

The constructor parameters are passed as a parameter to the New_ class:

        return new New_($class, $new_call_args);

Once this was all working, I decided to do some more refactoring in the CodeFile class in DrupalCodeBuilder. The changes I was making with Rector made it more apparent that in a lot of cases, I was passing empty values. Also, the $body parameter wasn't well-named, as it's an array of pieces, so could do with a more descriptive name such as $body_pieces.

Changes like this are really easy to do (though by this point, I had made a git repository for my Rector rule, so I could make further enhancements without breaking what I'd got working already).

        foreach ($node->items as $array_item) {
            $key = $array_item->key->value;

            // Rename the 'body' key.
            if ($key == 'body') {
                $key = 'body_pieces';
            }

And that's my Rector rule done.

Although it's taken me far more time than changing each file by hand, it's been far more interesting, and I've learned a lot about how Rector works, which will be useful to me in the future. I can definitely see how it's a very useful tool even for refactoring a small codebase such as DrupalCodeBuilder, where a rule is only going to be used once. It might even prompt me to undertake some minor refactoring tasks I've been putting off because of how tedious they'll be.

What I've not figured out is how to extract namespaces from full class names to an import statement, or how to put line breaks in the new statement. I'm hoping that a pass through with PHP_CodeSniffer and Drupal Coder's rules will fix those. If not, there's always good old regexes!

Defining bundle fields in code

Fields in Drupal 9 can be defined in code, or they can be defined in configuration. Both techniques have their uses and advantages. Typically code fields apply to all bundles of the entity type, as so-called base fields, while config fields apply only to a single bundle.

But there is a way to have code fields which apply only to specific bundles: these are called bundle fields. These are particularly useful for adding computed fields to specific bundles of an entity (more on computed fields in a future blog post).

The API for these is a bit clunky, as it wasn't finalised for Drupal 8, and it may in fact be completely changed in the future. If that causes you apprehension at implementing bundle fields in your code, it's worth mentioning that the last patch on that issue was in 2017: it's a complex problem without many applications, so accordingly doesn't get much attention.

Bundle fields need to be defined in two places: the field storage, and field definition itself. This is similar to how config fields work; it's actually the same as base fields, but the BaseFieldDefinition hides this from you as it does double duty. (There is an issue to remove the need to define the storage, but again, not much happening there.)

You also need a field definition class. The entity contrib module provides a BundleFieldDefinition class, but if you don't want to install that module just for a single class, copy-pasting that class to your custom module is fine too.

(There is a class in core FieldDefinition, which was added for this purpose, however, that requires a storage class to work alongside it, and the issue to provide that has not yet been fixed.)

Once you have that class, how you define the field depends on the entity you're adding it to.

Bundle fields on your own entity type

In your own entity type, you define the entity class. You can therefore implement the bundleFieldDefinitions() method in that class:

  // In MyEntity class:
  /**
   * {@inheritdoc}
   */
  public static function bundleFieldDefinitions(EntityTypeInterface $entity_type, $bundle, array $base_field_definitions) {
    $definitions = [];

    if ($bundle == 'mybundle') {
      $definitions['myfield'] = BundleFieldDefinition::create('entity_reference')
        // This is essential: the array key is NOT used as the field machine name,
        // so it MUST be specified here!
        ->setName('myfield')
        // These two are essential: they define the entity type and the bundle
        // that the field is on.
        ->setTargetEntityTypeId($entity_type->id())
        ->setTargetBundle($bundle)
        // Further define the field as you would with a base field.
        ->setLabel(t('Label'))
        ->setDescription(t('Description.'));
    }

    return $definitions;
  }

That defines the field. Because these are bundle fields, the storage must be defined separately in hook_entity_field_storage_info(). However, we can take advantage of the BundleFieldDefinition class doing double duty as a storage and a field, and piggy-back on the definition in the entity class:

/**
 * Implements hook_entity_field_storage_info().
 */
function mymodule_entity_field_storage_info(EntityTypeInterface $entity_type) {
  // Entity bundle fields need to declare their storage separately. However, we
  // can piggyback on the field definition itself, since the
  // BundleFieldDefinition class does double duty as field and storage
  // definition since it extends from BaseFieldDefinition.
  // We cheat on the parameters we pass in, as the entity's
  // bundleFieldDefinitions() method expects a list of base field definitions,
  // but we have none here to give. For our purposes, bundleFieldDefinitions()
  // does not need them.
  if ($entity_type->id() == 'myentity') {
    return MyEntity::bundleFieldDefinitions($entity_type, 'mybundle', []);
  }
}

The documentation for hook_entity_bundle_field_info() suggests you do it the other way round, and define the storage first, then derive the field from the storage, but doing it in the way shown above means that the code for your field is in your entity class where it's alongside the base field definition, and more easily discoverable.

Bundle fields on someone else's entity type

For an entity type you don't control, you could switch the entity class to a custom subclass and implement bundleFieldDefinitions() in that, but it's simpler to use hook_entity_bundle_field_info().

/**
 * Implements hook_entity_bundle_field_info().
 */
function mymodule_entity_bundle_field_info(EntityTypeInterface $entity_type, $bundle, array $base_field_definitions) {
  if ($entity_type->id() == 'myentity' && $bundle == 'mybundle') {
    $fields = [];
    $fields['myfield'] = BundleFieldDefinition::create('list_integer')
      // This is essential: the array key is NOT used as the field machine name,
      // so it MUST be specified here!
      ->setName('myfield')
      // These two are essential: they define the entity type and the bundle
      // that the field is on.
      ->setTargetEntityTypeId($entity_type->id())
      ->setTargetBundle($bundle)
      // Further define the field as you would with a base field.
      ->setLabel(t('Label'))
      ->setDescription(t('Description.'));

    return $fields;
  }
}

/**
 * Implements hook_entity_field_storage_info().
 */
function mymodule_entity_field_storage_info(EntityTypeInterface $entity_type) {
  if ($entity_type->id() == 'myentity') {
    return mymodule_entity_bundle_field_info($entity_type, 'mybundle', []);
  }
}

We use the same piggybacking trick here. In the case of altering someone else's entity type, it makes just as much sense to define the storage and derive the field, but following the same pattern as the custom entity keeps the two methods matching.

As you can see, this is an area of Drupal's entity system which is unfortunately incomplete, and liable to change in the future. Nonetheless, the usefulness of bundle fields means that it's worth using them. Just be sure to document your code so future developers (who might be you!) know that it's area of the code that may need maintainance when the API is updated. And keep an eye on the Drupal core change records!

Commit your Composer-managed files to version control

According to the Composer documentation, you shouldn't commit to version control any of the files in packages that are managed by Composer. For a Drupal project, that means the following folders are not in your git repository:

  • vendor
  • web/core
  • web/modules/contrib

Bluntly, I think this is the wrong thing to do. Doing the exact opposite saves you time, effort, heartache, and energy.

Let's start by comparing the two workflows.

Composer-recommended workflow: the wrong way

  1. Alice, the developer, updates some packages in her local dev.
  2. She commits only the composer.json and composer.lock files.
  3. Bob pulls changes, and runs composer install to update his local dev.
  4. When the project is deployed, the server needs to run composer install as part of the deployment process.

Git-managed workflow: the right way

  1. Alice, the developer, updates some packages in her local dev.
  2. She commits all the new and changed files.
  3. Bob pulls changes.
  4. When the project is deployed... it's deployed. All the files necessary to run the app are in git, and so updating the server's git clone gives it all the files.

The first reason this workflow is better should jump out: developers save time. Git is much faster and pulling files from a repository than Composer is at resolving, downloading, and installing packages. Switching git branches is fast, even when a lot of files change. Having to run Composer because one branch has package updates on it and the other doesn't, or has different ones, is a waste of developer time.

The second reason is related: save time in deployment. That's not just developer time, that's time your production server is down.

The third reason is more philosophical: Composer is not a file management tool. It's a dependency management tool. Let Composer do what it does well, which is figuring out which versions of which packages should be installed. Let git handle what it does well, which is synchronizing different copies of a codebase. Purists may point out that git is a version control tool, not a file management tool: that's a valid point, but doesn't diminish my argument: commit all files, and then use rsync or whatever to get the files from a git clone to the server.

Related to this: don't make your webserver do package management work; that's not its job. And don't make Bob repeat the work Alice already did.

Finally, committing files to git puts them under version control. That's merely stating the obvious, but putting files under version control has many benefits.

  • IDEs typically ignore git-ignored files for things like searching and code analysis. On my IDE at least (VSCode), disabling that means it then wants to include things like the uploaded files folder.

  • Similarly, search tools such as ag and ack take .gitignore files into account, and having to override that robs you of useful behaviour.

  • Finally, files under version control are easier to debug and tinker with. I for one often go digging in core or contrib modules, trying to figure out why something isn't working the way it's supposed to, or trying to fix a bug. To do that, I put in dump statements; sometimes tons of dump statements. (I can never get xdebug to work. Dump statements are quick and easy.) With code under version control, cleaning up all that mess is trivial: git can discard all changes. Without version control, I'm hunting through files for all the debug changes I made, or bouncing on CMD-Z like it's still the dark ages.

So, put all your files in version control. And remember the adage: if it's not under version control, it doesn't exist.

A better way to work on Drupal core

Often when things seem really complicated, I think it's because I must be doing it wrong.

Working on Drupal core since dependencies were removed from git has seemed really fiddly. For a long time I thought I must have missed something, that there was some undocumented technique I wasn't aware of.

But I've asked various people who work on core a lot more than I do, and they've confirmed that what I've been doing is pretty much the way that they do it:

  1. Get a git clone of Drupal core.
  2. Run composer install on it.
  3. Write code!
  4. Make a patch (well, a merge request now!)

That all sounds simple, right? But wait! If you're working on core, you're going to want Devel module for its useful debugging and inspection tools, right? And Drush, for quick cache clearing. And probably Admin Toolbar so going around the UI is quicker.

But you have to install all those with Composer. And doing that dirties the composer.lock file that's part of Drupal core's git clone.

It's fairly simple to keep changes to that file out of your merge request or patch, but pretty soon, you're going to do a git pull on core that's going to include changes to the composer.lock file, because core will have updates to dependencies.

And that's where it all starts to go wrong, because the git pull will fail because of conflicts in the composer.lock file and in other Composer files, and conflicts in that file are really painful to resolve.

So maintaining an ongoing Drupal install that uses a Drupal core git clone quickly becomes a mess. As far as I know, most core developers frequently reset the whole thing and reinstall from scratch.

The problem is caused by using the git clone of Drupal core as the Composer project, so that Drupal core's composer.json is being used as the project composer.json. But there's a better way...

Using Composer with git clones

Composer has a way of using a git clone for a package in a project:

  1. Create your own git clone of the package
  2. Declare that git clone as a Composer package repository in the project's composer.json
  3. Install the package

The result is that Composer creates a symlink from your git clone into the project, and doesn't touch the git clone. You need to be fairly lax in the version requirement you give for the package, so that Composer doesn't object to the git clone being on a feature branch later on.

This, as far as I know, is the standard way for working on a Composer package that you need to operate in the context of a project. It works for library packages and Composer plugins.

For Drupal core...? Well, it works, but as you might have guessed it's a little more complicated.

Drupal has opinions about where it expects to be located in a project, and furthermore, has a scaffolding system which writes files into the webroot when you install it. All of that gets a bit confused if you put Drupal core out of the way and symlink it in.

But with a few symlinks, and one sneaky patch to a scaffold file, it works. It's all quite fiddly and so I've made...

The Drupal Core development project

This is a Composer project template, available at https://github.com/joachim-n/drupal-core-development-project. It handles all the necessary tweaks to get Drupal to work when symlinked into a project: install it as a Composer project.

There are still a few limitations: those are detailed in the README too.

Of course, I've now fallen down the rabbithole of doing more work towards making Drupal completely agnostic of its location, rather than the core issues I wanted to work on in the first place.

Please try it, report any problems, and happy coding on Drupal core!

Different ways to make entity bundles

A lot of the time, a custom content entity type only needs a single bundle: all the entities of this type have the same structure. But there are times where they need to vary, typically to have different fields, while still being the same type. This is where entity bundles come in.

Bundles are to a custom entity type what node types are to the node entity type. They could be called sub-types instead (and there is a long-running core issue to rename them to this; I'd say go help but I'm not sure what can be done to move it forward), but the name 'bundle' has stuck because initially the concept was just about different 'bundles of fields', and then the name ended up being applied across the whole entity system.

History lesson over; how do you define bundles on your entity type? Well there are several ways, and of course they each have their use cases.

And of course, Module Builder can help generate your code, whichever of these methods you use.

Quick and simple: hardcode

The simplest way to do anything is to hardcode it. If you have a fixed number of bundles that aren't going to change over time, or at least so rarely that requiring a code deployment to change them is acceptable, then you can simply define the bundles in code. The way to do this is with a hook (though there's a core issue -- which I filed -- to allow these to be defined in a method on the entity class).

Here's how you'd define your hardcoded bundles:

/**
 * Implements hook_entity_bundle_info().
 */
function mymodule_entity_bundle_info() {
  $bundles['my_entity_type'] = [
    'bundle_alpha_' => [
      'label' => t('Alpha'),
      'description' => t('Represents an alpha entity.')
    ],
    'bundle_beta_' => [
      'label' => t('Beta'),
      'description' => t('Represents a beta entity.')
    ],
  ];

  return $bundles;
}

The machine names of the bundles (which are used as the bundle field values, and in field names, admin paths, and so on) are the keys of the array. The labels are used in UI messages and page titles. The descriptions are used on the 'Add entity' page's list of bundles.

Classic: bundle entity type

The way Drupal core does bundles is with a bundle entity type. In addition to the content entity type you want to have bundles, there is also a config entity type, called the 'bundle entity type'. Each single entity of the bundle entity type defines a bundle of the content entity type. So for example, a single node type entity called 'page' defines the 'page' node type; a single taxonomy vocabulary entity called 'tags' defines the 'tags' taxonomy term type.

This is great if you want extensibility, and you want the bundles to be configurable by site admins in the UI rather than developers. The downside is that it's a lot of extra code, as there's a whole second entity type to define.

Very little glue is required between the two entity types, though. They basically each need to reference the other in their entity type annotations:

The content entity type needs:

 *   bundle_entity_type = "my_entity_type_bundle",

and the bundle entity needs:

 *   bundle_of = "my_entity_type",

and the bundle entity class should inherit from \Drupal\Core\Config\Entity\ConfigEntityBundleBase.

Per-bundle functionality: plugins as bundles

This third method needs more than just Drupal core: it's a technique provided by Entity module.

Here, you define a plugin type (an annotation plugin type, rather than YAML), and each plugin of that type corresponds to a bundle. This means you need a whole class for each bundle, which seems like a lot of code compared to the hook technique, but there are cases where that's what you want.

First, because Entity module's framework for this allows each plugin class to define different fields for each bundle. These so-called bundle fields are installed in the same way as entity base fields, but are only on one bundle. This gives you the diversification of per-bundle fields that you get with config fields, but with the definition of the fields in your code where it's easier to maintain.

Second, because in your plugin class you can code different behaviours for different bundles of your entity type. Suppose you want the entity label to be slightly different. No problem, in your entity class simply hand over to the bundle:

class PluginAlpha {

  public function label() {
    $bundle_plugin = \Drupal::service('plugin.manager.my_plugin_type')
    return $bundle_plugin->label($this);
  }

}

Add a label() method to the plugin classes, and you can specialise the behaviour for each bundle. If you want to have behaviour that's grouped across more than one plugin, one way to do it is to add properties to your plugin type's annotation, and then implement the functionality in the plugin base class with a conditional on the value in the plugin's definition.

/**
 * @MyPluginType(
 *   id = "plugin_alpha",
 *   label = @Translation("Alpha"),
 *   label_handling = "combolulate"
class MyPluginBase {

  public function label() {
    switch ($this->getPluginDefinition()['floopiness']) {
      case 'combolulate':
        // Return the combolulated label. 
    }
  }

}

For a plugin type to be used as entity bundles, the plugins need to implement \Drupal\entity\BundlePlugin\BundlePluginInterface, and your entity type needs to declare the plugin:

 *   bundle_plugin_type = "my_plugin_type",

Here, the string 'my_plugin_type' is the part of the plugin manager's service name that comes after the 'plugin.manager' prefix. (The plugin system unfortunately doesn't have a standard concept of a plugin type's name, but this is informally what is used in various places such as Entity module and Commerce.)

The bundle plugin technique is well-suited to entities that are used as 'machinery' in some way, rather than content. For example, in Commerce License, each bundle of the license entity is a plugin which provides the behaviour that's specific to that type of license.

One thing to note on this technique: if you're writing kernel tests, you'll need to manually install the bundle plugins for them to exist in the test. See the tests in Entity module for an example.

These three techniques for bundles each have their advantages, so it's not unusual to find all three in use in a single project codebase. It's a question of which is best for a particular entity type.

Pages

Subscribe to Joachim's Drupal blog