Drupal: cache tags for all, regardless of your backend

This blog has been re-posted and edited with permission from Matt Glaman's blog.

Phil Karlton is quoted as having said, "There are only two hard things in Computer Science: cache invalidation and naming things." As someone who suffers horribly from the latter, I'm glad Drupal's caching APIs make the former a breeze. The long and short of it is that caching improves performance by storing the result of an operation the first time it runs, so the same work is not repeated until the result actually needs to be recomputed. It involves some kind of storage (a static variable in memory, a memory-backed store, or the database). Generally, you store three things: an identifier for later retrieval, the data to be cached for reuse, and possibly an expiration time for automatic invalidation.

And that's where cache invalidation gets hard. What if you need a cache object to be invalidated before its automatic expiration? Or what if the cache is set to be permanent? First, you need to know when and how to invalidate something. This gets even more complicated if your "when" requires invalidating multiple cache objects. That's where cache tags help: they allow invalidating a group of cache objects without knowing their identifiers.

Drupal's 8.0.0 release introduced cache tags to the Cache API. Previously, Drupal supported wildcard cache identifiers for bulk cache invalidation. Thanks to sdboyer, catch, and everyone else who made this possible!

What is a cache tag?

The Drupal core issue which added cache tags explains cache tags versus identifiers. I'll give a quick example that should be generic enough to relate to Drupal or any other framework or application.

Cache objects have a unique ID that is fixed or dynamically generated, e.g. entity:data:{entity_type}:{entity_id}. Generally, a cached object may also depend on specific contexts, such as the current user or language. That means entity:data:{entity_type}:{entity_id} could have different variants based on those contexts. We can still tell the cache to delete that specific cache object if we know the type and ID of the entity data we want to invalidate. But what if we wanted to delete all cache objects for that entity type? We could query the entity table for the IDs of all entities of that type, but that would be a huge performance hit. Instead, we could use cache tags to describe our cache objects.

When setting our cache object with the ID of entity:data:{entity_type}:{entity_id} we can choose to provide our own cache tags of entity_data and entity_data_{entity_type}. If we wanted to invalidate all cached data about our entity type, we just need to invalidate any cache containing the entity_data_{entity_type} cache tag! Or if we had a really big system change we can invalidate the entity_data cache objects without purging the entire system cache. Cache tags are not derived from cache IDs.
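To make that concrete, here is a minimal, language-neutral sketch in Python. It is not Drupal's API: the TaggedCache class and its method names are hypothetical, and it simply keeps an index from each tag to the cache IDs that carry it. (Drupal's database backend works differently, as described further down.)

```python
from collections import defaultdict


class TaggedCache:
    """Toy in-memory cache that remembers which cache IDs carry which tags."""

    def __init__(self):
        self._items = {}                     # cache ID -> cached data
        self._ids_by_tag = defaultdict(set)  # tag -> cache IDs tagged with it

    def set(self, cid, data, tags=()):
        self._items[cid] = data
        for tag in tags:
            self._ids_by_tag[tag].add(cid)

    def get(self, cid):
        return self._items.get(cid)

    def invalidate_tags(self, tags):
        # Drop every item carrying any of the given tags. The caller never
        # needs to know the cache IDs. IDs that were already removed through
        # another tag are skipped silently.
        for tag in tags:
            for cid in self._ids_by_tag.pop(tag, set()):
                self._items.pop(cid, None)


cache = TaggedCache()
cache.set("entity:data:node:1", {"title": "Hello"},
          tags=["entity_data", "entity_data_node"])
cache.set("entity:data:node:2", {"title": "World"},
          tags=["entity_data", "entity_data_node"])
cache.set("entity:data:user:1", {"name": "admin"},
          tags=["entity_data", "entity_data_user"])

# Invalidate everything cached about nodes, without listing node IDs.
cache.invalidate_tags(["entity_data_node"])
```

After the last call, both node objects are gone while entity:data:user:1 is still cached. Invalidating entity_data instead would have cleared all three.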

How are cache tags supported?

Cache tags are supported by various cache storage backends. In-memory cache storages like Redis support cache tags natively. Memcache does not, but there are forks which do, and tags can also be emulated in the application layer. Your SQL database doesn't support them natively either, but Drupal supports cache tags there via emulation.

For example, Laravel's Cache only supports cache tags for Redis and Memcache (I wasn't able to discern how they emulate tagging on Memcache). The Symfony Cache component, by contrast, only supports cache tags for Redis and the filesystem cache (tags relate to the directory structure, so I have no idea how multiple cache tags are supported; maybe the cache is duplicated).

Cache tags are also generally supported by reverse proxies and CDNs for granular cache invalidation. Having cache tags used by your system enables you to properly invalidate your HTTP cache as well.

What about PSR-6/PSR-16 caching standards?

PSR-6 provided a standards recommendation for a Caching Interface. It focuses on interfaces for cache objects (Item) and for handling them through a cache collection (Pool). The meta document explains how CacheItemPoolInterface could be extended to support tags. Later on, PSR-16 (Common Interface for Caching Libraries) was introduced to provide a simpler standard that is less formal and more flexible. Its main purpose is an interface for creating, reading, and deleting cache objects.

A PSR-6 compliant cache library could support cache tags, but a PSR-16 compliant cache library cannot. Drupal is not compliant with either standards recommendation, but that isn't much of a concern. These PSRs are for interoperability between libraries and general-use frameworks. Drupal's caching library is not a shared component for other libraries to consume.

How does Drupal emulate cache tags?

It's pretty simple. The \Drupal\Core\Cache\DatabaseBackend cache backend class defines an SQL schema which stores the tags as a space-separated string in a tags column in the database table. On MySQL/MariaDB/Percona it is a LONGTEXT field, and on PostgreSQL and SQLite it is a TEXT field. There is another table which tracks the number of times a specific cache tag has been invalidated. Here are the top ten invalidated cache tags on my personal site:

MySQL [main]> select * from cachetags order by invalidations desc limit 10;
+----------------------+---------------+
| tag                  | invalidations |
+----------------------+---------------+
| 4xx-response         |        180788 |
| aggregator_feed_list |        179243 |
| aggregator_feed:1    |        179242 |
| simple_sitemap       |        101414 |
| node_list            |           704 |
| entity_field_info    |           524 |
| route_match          |           445 |
| entity_types         |           437 |
| contact_message_list |           388 |
| entity_bundles       |           374 |
+----------------------+---------------+
10 rows in set (0.00 sec)

Whenever a cache object is written, a checksum is generated from the cache tags provided for that object and their current invalidation counts. This is done by fetching the current invalidation counts for the provided cache tags. Here is the SQL query performed:

SELECT [tag], [invalidations] FROM {cachetags} WHERE [tag] IN ( :tags[] )

Drupal has this checksum logic in the \Drupal\Core\Cache\CacheTagsChecksumTrait trait. The checksum is the sum of the current invalidation counts across the object's cache tags, and it is written alongside the cache object. Here are some cache objects for entities on my site:

+----------------+--------------------------------+----------+
| cid            | tags                           | checksum |
+----------------+--------------------------------+----------+
| values:media:1 | entity_field_info media_values | 524      |
| values:media:2 | entity_field_info media_values | 524      |
| values:media:4 | entity_field_info media_values | 524      |
| values:media:5 | entity_field_info media_values | 524      |
| values:node:5  | entity_field_info node_values  | 524      |
| values:node:6  | entity_field_info node_values  | 524      |
| values:node:7  | entity_field_info node_values  | 524      |
| values:node:8  | entity_field_info node_values  | 524      |
+----------------+--------------------------------+----------+
8 rows in set (0.07 sec)

If a cache tag is invalidated in the future, the checksum will be different. Currently entity_field_info has 524 invalidations. If I were to invalidate it again, the count would bump to 525. The checksum stored with my existing cache objects would no longer match the current one (524 !== 525), so they would be considered invalid. The checksum is compared once the cache object has been loaded from the database.
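The whole idea fits in a few lines. Below is a rough sketch in Python rather than Drupal's actual PHP: the ChecksumCache class is hypothetical, a dict stands in for the cachetags table, and I'm assuming a tag that has never been invalidated counts as 0.

```python
class ChecksumCache:
    """Toy cache that emulates tag invalidation with per-tag counters."""

    def __init__(self):
        self.invalidations = {}  # tag -> invalidation count ("cachetags")
        self.items = {}          # cid -> (data, tags, checksum at write time)

    def current_checksum(self, tags):
        # Sum of the current invalidation counts of the given tags.
        return sum(self.invalidations.get(tag, 0) for tag in tags)

    def set(self, cid, data, tags):
        # Store the checksum as it is right now, alongside the data.
        self.items[cid] = (data, tuple(tags), self.current_checksum(tags))

    def get(self, cid):
        item = self.items.get(cid)
        if item is None:
            return None
        data, tags, stored_checksum = item
        # A mismatch means at least one tag was invalidated after the write.
        if stored_checksum != self.current_checksum(tags):
            return None
        return data

    def invalidate_tags(self, tags):
        # Invalidation never touches cached items; it only bumps counters.
        for tag in tags:
            self.invalidations[tag] = self.invalidations.get(tag, 0) + 1


cache = ChecksumCache()
cache.invalidations["entity_field_info"] = 524  # count from the table above

cache.set("values:media:1", "media 1", ["entity_field_info", "media_values"])
cache.set("values:node:5", "node 5", ["entity_field_info", "node_values"])
stored = cache.items["values:node:5"][2]  # 524 + 0

cache.invalidate_tags(["media_values"])
media_after = cache.get("values:media:1")  # stale: stored 524, current 525
node_after = cache.get("values:node:5")    # still valid: 524 on both sides
```

Notice that invalidating a tag is a single counter update no matter how many cache objects carry it. The cost moves to read time, where each loaded object has its checksum compared.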

The following is taken from \Drupal\Core\Cache\DatabaseBackend::prepareItem, which is used to process cache objects retrieved from database cache.

    // Check if invalidateTags() has been called with any of the items's tags.
    if (!$this->checksumProvider->isValid($cache->checksum, $cache->tags)) {
      $cache->valid = FALSE;
    }

Cache tag invalidation and checksum generation support delayed operations to prevent a stampede effect when there are multiple invalidation calls during a single database transaction.

This checksum process is part of the Cache API. Any cache backend may implement \Drupal\Core\Cache\CacheTagsChecksumInterface to use checksums as a means for checking tag invalidations.

  • The Redis module leverages the checksum as the tag to push into Redis instead of the cache tags directly.
  • The Memcache module uses timestamps for its checksums to emulate cache tagging in Memcache without needing the archived memcache-tag project.

A practical example in Drupal

Let's use a typical Drupal example. Drupal provides full caching of responses, meaning once a render is completed that computed HTML can be served from cache for subsequent requests. A page in Drupal is made up of blocks placed in different regions of the page including the main content of the page. 

The cached page would have a cache object ID similar to render:page:{path}. Maybe it has a cache tag of rendered, so that all rendered cache objects may be invalidated at once. But, what about all of those blocks and the main content? What if they change? Then the rendered page cache would be stale. The solution is to capture all related cache tags and add them to the main render cache object.

In Drupal we call this cache metadata, and we have mechanisms to bubble this data throughout the response lifecycle. The rendered page cache object could have the following cache tags:

  • block:1
  • node:34
  • page

This allows invalidating only relevant page caches when a dependent object has been modified and its caches invalidated. The benefit is that if node:34 is ever modified, only the page caches with that dependency are invalidated! If block:1 happened to be the header block on all pages, modifying it ensures all page caches get invalidated.
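Here is a small sketch of that bubbling idea in Python. It is only an illustration under my own assumptions: the Element class and bubble_tags function are hypothetical and are not Drupal's render API. Each element carries its own tags, and the page collects the tags of everything nested inside it.

```python
from dataclasses import dataclass, field


@dataclass
class Element:
    """A renderable piece of a page with its own cache tags."""
    name: str
    tags: set = field(default_factory=set)
    children: list = field(default_factory=list)


def bubble_tags(element):
    """Return the element's own tags merged with those of all descendants."""
    merged = set(element.tags)
    for child in element.children:
        merged |= bubble_tags(child)
    return merged


header = Element("header block", tags={"block:1"})
page_34 = Element("page", tags={"page"}, children=[
    header,
    Element("main content", tags={"node:34"}),
])
page_35 = Element("page", tags={"page"}, children=[
    header,
    Element("main content", tags={"node:35"}),
])

# Tags stored with each rendered page cache object.
pages = {"/node/34": bubble_tags(page_34), "/node/35": bubble_tags(page_35)}

# Which cached pages go stale when a given tag is invalidated?
stale_after_node_34 = [path for path, tags in pages.items() if "node:34" in tags]
stale_after_block_1 = [path for path, tags in pages.items() if "block:1" in tags]
```

Invalidating node:34 only affects /node/34, while invalidating block:1 affects both pages because the header block appears on each of them.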

These invalidations can then be tracked and used to purge caches on your reverse proxy or CDN, as those caches should have the same cache tags tracked by Drupal.