[This is an old post that I wrote for System Seed's blog and meant to put on my own too but it fell off my radar until now. It's also about Drupal 7, but the general principle still applies.]
Handling clients with more than one site involves lots of decisions. And yet, it can sometimes seem like ultimately none of that amounts to a hill of beans to the end user, the site visitor. They won't care whether you use Domain module, multi-site, separate sites with a common codebase, and so on, because most people don't notice what's in their URL bar. They want ease of login and ease of navigation. That translates into things such as the single sign-on that drupal.org uses, common menus and headers, and also site search: they don't care that it's actually sites search, plural, they just want to find stuff.
For the University of North Carolina, who have a network of sites running on a range of different platforms, a unified search system was a key way of giving visitors the experience of a cohesive whole. The hub site, an existing Drupal 7 installation, needed to provide search results from across the whole family of sites.
This presented a few challenges. Naturally, I turned to Apache Solr. Hitherto, I'd always considered Solr to be some sort of black magic, from the way in which it requires its own separate server (HTTP not good enough for you?) to the mysteries of its configuration (both Drupal modules that integrate with it require you to dump a bunch of configuration files into your Solr installation). But Solr excels at what it sets out to do, and the Drupal modules around it are now mature enough that things just work out of the box. Even better, Search API module allows you to plug in a different search back-end, so you can develop locally using Drupal's own database as your search provider, with the intention of plugging it all into Solr when you deploy to servers.
One possible setup would have been to have the various sites each send their data into Solr directly. However, with the Pantheon platform this didn't look to be possible: in order to achieve close integration between Drupal and Solr, Pantheon locks down your Solr instance.
That left talking to Solr via Drupal.
Search API lets you define different datasources for your search data, and comes with one for each entity type on your site. In a datasource handler class, you can define how the datasource gets a list of IDs of things to index, and how it gets the content. So writing a custom datasource was one possibility.
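To give a flavour of what that involves, here is a rough sketch in Search API's Drupal 7 API. The module prefix, item type, class name and properties are all made up for the example, and the item tracking methods are left out; the point is simply that you register an item type pointing at a controller class, and that class answers the "what are the IDs?" and "give me the content" questions.

```php
/**
 * Implements hook_search_api_item_type_info().
 *
 * Registers a custom item type backed by our own datasource controller.
 * (The 'mymodule' prefix and the class name are purely illustrative.)
 */
function mymodule_search_api_item_type_info() {
  return array(
    'mymodule_rss_item' => array(
      'name' => t('Remote RSS item'),
      'datasource controller' => 'MyModuleRssItemDataSourceController',
    ),
  );
}

/**
 * Skeleton datasource controller for the custom item type.
 */
class MyModuleRssItemDataSourceController extends SearchApiAbstractDataSourceController {

  /**
   * Describes the field that uniquely identifies items of this type.
   */
  public function getIdFieldInfo() {
    return array(
      'key' => 'id',
      'type' => 'integer',
    );
  }

  /**
   * Loads the items with the given IDs so they can be indexed.
   */
  public function loadItems(array $ids) {
    $items = array();
    // Fetch or build each item here, keyed by its ID.
    return $items;
  }

  /**
   * Declares the properties the items expose to the index.
   */
  public function getPropertyInfo() {
    return array(
      'property info' => array(
        'title' => array('label' => t('Title'), 'type' => 'text'),
        'body' => array('label' => t('Body'), 'type' => 'text'),
      ),
    );
  }

  // Item tracking methods omitted for brevity.

}
```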
Enter the next problem: the external sites that needed to be indexed only exposed their content to us in one format: RSS. In theory, you could have a Search API datasource which pulls in data from an RSS feed, but then you need to write a Search API datasource class which knows how to parse RSS and extract the fields from it.
That sounded like reinventing Feeds, so I turned to that to see what I could do with it. Feeds normally saves data into Drupal entities, but maybe (I thought) there was a way to pass the data to Search API for indexing by writing a custom Feeds plugin?
However, this revealed a funny problem of the sort whose existence you don't consider until you stumble on it: Feeds works on cron runs, pulling in data from a remote source and saving it into Drupal somehow. But Search API also works on cron runs, pulling in data to index, usually entities. How do you get two processes to communicate when they both want to be the active participant?
With time pressing, I took the simple option: define a custom entity type for Feeds to put its data into and for Search API to read its data from. (I could have just used a node type, but then there would have been an ongoing burden of ensuring that type was excluded from any kind of interaction with nodes.)
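The bucket entity type itself is very little code. Here's a rough sketch with made-up names, assuming the Entity API contrib module is available to supply the controller class; it would also need a matching hook_schema() implementation to define the base table.

```php
/**
 * Implements hook_entity_info().
 *
 * Defines the 'bucket' entity type that Feeds writes into and Search API
 * reads from. (All names here are illustrative; a matching hook_schema()
 * implementation has to define the base table.)
 */
function mymodule_entity_info() {
  return array(
    'mymodule_bucket_item' => array(
      'label' => t('Remote search item'),
      'base table' => 'mymodule_bucket_item',
      'entity keys' => array(
        'id' => 'id',
        'label' => 'title',
      ),
      // Supplied by the Entity API contrib module.
      'controller class' => 'EntityAPIController',
      'module' => 'mymodule',
      'fieldable' => FALSE,
    ),
  );
}
```

Because it's a normal (if minimal) entity type, Search API should then be able to treat it like any other entity type for indexing, with no custom datasource code needed on that side.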
Essentially, this custom entity type acted like a bucket: Feeds dumps data in, Search API picks data out. As solutions go, it's not the most elegant at first glance. But if I had gone down the route of having Search API fetch from RSS directly, then re-indexing would have been a really lengthy process, and could have had consequences for the performance of the sites whose content was being slurped up. A sensible approach would then have been to implement some sort of caching on our server, either of the RSS feeds as files or of the processed RSS data. And suddenly our custom entity bucket system doesn't look so inelegant after all: it's basically a cache that both Feeds and Search API can talk to easily.
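The Feeds half of the bucket is a custom processor plugin that saves incoming items as those entities. Again, this is only a sketch, assuming Feeds 7.x-2.x and the Entity API module, with hypothetical names; a real processor would also need getMappingTargets() so the parsed RSS fields can be mapped onto the entity's properties.

```php
/**
 * Implements hook_feeds_plugins().
 *
 * Registers a processor that saves feed items as bucket entities.
 */
function mymodule_feeds_plugins() {
  return array(
    'MyModuleBucketItemProcessor' => array(
      'name' => 'Bucket item processor',
      'description' => 'Saves incoming RSS items into the search bucket entity type.',
      'handler' => array(
        'parent' => 'FeedsProcessor',
        'class' => 'MyModuleBucketItemProcessor',
        'file' => 'MyModuleBucketItemProcessor.inc',
        'path' => drupal_get_path('module', 'mymodule'),
      ),
    ),
  );
}

/**
 * Feeds processor that creates, updates and deletes bucket entities.
 */
class MyModuleBucketItemProcessor extends FeedsProcessor {

  /**
   * Declares the entity type the processor operates on.
   */
  public function entityType() {
    return 'mymodule_bucket_item';
  }

  /**
   * Creates a new, empty bucket entity for an incoming feed item.
   */
  protected function newEntity(FeedsSource $source) {
    // entity_create() is provided by the Entity API module.
    return entity_create('mymodule_bucket_item', array());
  }

  /**
   * Saves a bucket entity once the mappings have populated it.
   */
  protected function entitySave($entity) {
    entity_save('mymodule_bucket_item', $entity);
  }

  /**
   * Deletes bucket entities whose source items have gone away.
   */
  protected function entityDeleteMultiple($ids) {
    entity_delete_multiple('mymodule_bucket_item', $ids);
  }

  // getMappingTargets() omitted: it declares which entity properties the
  // feed's parsed fields can be mapped to.

}
```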
There were a few pitfalls. With Search API, our search index needed to work on two entity types (nodes and the custom bucket entities), and while Search API on Drupal 7 allows this, its multiple entity type datasource handler had a few issues to iron out or learn to live with. The good news, though, is that the Drupal 8 version of Search API has the concept of multi-entity type search indexes at its core rather than as a side feature: every index can handle multiple entity types, and there's no such thing as a datasource for a single entity type.
With Feeds, I found that not all of the configuration is exportable to Features for easy deployment. Everything about parsing the RSS feed into entities can be exported, except the actual feed URL, which is a separate piece of setup and not exportable. So I had to add a hook_update_N() to take care of setting that up.
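Roughly, that looks something like this. The importer machine name and feed URL are placeholders, and the sketch assumes a standalone importer, i.e. one not attached to a feed node.

```php
/**
 * Sets the source URL for the remote sites importer.
 */
function mymodule_update_7001() {
  // Load the standalone feed source for the importer (placeholder name).
  $source = feeds_source('remote_sites_importer');
  // The fetcher's source URL is per-source configuration, which is why it
  // isn't captured in the Features export of the importer itself.
  $source->addConfig(array(
    'FeedsHTTPFetcher' => array(
      'source' => 'https://example.edu/all-content.xml',
    ),
  ));
  $source->save();
}
```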
The end result, though, was a site search that seamlessly returns results from multiple sites, allowing users to work with a network of disparate sites built on different technologies as if they were all the same thing. Which is what they were probably thinking they were all along anyway.