Search in Drupal
One of the most important considerations when setting up a site with the Drupal content management system (or any web site, really) is navigation and information architecture: the process of organizing and laying out the site so that visitors can efficiently find the content they seek (or that you want them to find). Ideally, this happens in the early stages of site design: first work out what content the site's visitors should be able to find (bearing in mind that there may be several classes of visitors with different needs), and then work out the best way or ways for them to find it. Most sites employ several types of navigational elements to help visitors find content, which we can put into these categories:
- Image maps - Image maps are clickable images (geographical or non-geographical); clicking on the image takes you to content related to the portion of the image you click. Some image maps may have dynamic behavior, such as zooming/panning capability or hover-activated pop-ups.
- Launch pages - Launch pages are pages whose purpose is to provide a table of contents for the site or a section of the site. Launch pages often contain annotated lists of links to content.
- Embedded links - Links embedded in the text of pages, which lead the visitor to related content, are also navigational elements.
- Search - Search means giving the visitor the ability to type or choose terms, and be presented with a list of content related to those terms.
In my experience, search is the category of navigational aid that is the most problematic. Assuming sufficient careful thought has gone into organizing the content of the site around the needs of the visitors, translating that organization into a logical navigational structure of menus, maps, launch pages, and embedded links should be pretty straightforward. But it is not always so clear how searching fits into this structure, or what it will add to the overall goal of helping visitors find content. One reason for this lack of clarity is that there are many different ways to search a site, and many different ways to deploy search features on a web site. Site owners (and web professionals) may not be aware of all the possibilities and their strengths and limitations.
So, the purpose of this article is to familiarize you with several methods for deploying search features on a Drupal-based site. I'll explore what each method is best used for, and also talk about each method's strengths and weaknesses. You will need some basic familiarity with Drupal terminology and knowledge of Drupal site configuration to get much out of this article. My Drupal Cheat Sheet article is a good starting point.
But before we get into search options in Drupal, here is a bit of search terminology (at least for purposes of this article):
- Keyword search means that the site visitor types one or more keywords, and the search returns content that contains those keywords. A familiar example is searching on Google.
- Boolean search means that the site visitor has more options with multiple keywords, such as designating that all keywords must be present, or that they want to search for a phrase.
- Field-based search means that the site visitor types, or perhaps chooses from a drop-down list, one or more keywords specific to particular fields or aspects of the content, and the search returns content that contains those keywords in those fields. Field-based search is typically an advantage over keyword search if your content has a few fields that are understandable to site visitors and align well with how site visitors would want to search. For that reason, many public library catalogs, such as the Seattle Public Library, use field-based search.
- Faceted search is a field-based search with a more advanced search interface, which presents choices for refining the search along with the search results. Typically, the choices are presented as groups of links, where each group presents a few links to terms for a particular field, and clicking on the link restricts the results to that term. The links are also usually displayed with a number, designating how many results correspond to that term, and the site visitor is also often presented with the option to remove previously-applied restrictions and broaden the search. Faceted search is typically an advantage over field-based search if your content has many fields, or if each field has many choices, because it helps site visitors see how their choices will affect the number of search results they are presented with, and how their choices for different fields interact. For that reason, many product catalog sites, such as REI and CNet, use faceted search.
- Indexing is the process of building a database of correspondence between keywords and content. This typically costs some up-front time and storage space, but makes searching (i.e. actually finding results based on keywords the visitor chose) much more efficient.
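To make the indexing and Boolean keyword terms concrete, here is a minimal sketch in Python (not Drupal code, and not how any particular search engine is implemented). The function names and the sample documents are made up for illustration. It builds an index once, then answers an "all keywords must be present" search by intersecting sets rather than re-reading every document.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lower-case the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(documents):
    """Indexing: map each keyword to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index

def search_all(index, query):
    """Boolean AND search: ids of documents that contain every keyword."""
    words = tokenize(query)
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results

# Hypothetical content, keyed by an id
documents = {
    1: "Earthquake safety tips for your home",
    2: "Home repair after a quake",
    3: "Safety gear for hiking",
}
index = build_index(documents)
print(search_all(index, "home safety"))
```

The up-front cost is the pass over all documents in `build_index`; after that, each search only touches the index entries for the keywords typed.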
Comparison matrix of search methods in Drupal
Here is a summary of the search methods presented in this article -- see sections below for more detail.
|Search method|Type of search|
|---|---|
|Core "Search" module|Site-wide Boolean keyword search of Nodes and User names|
|Contributed "Search by Page" module|Site-wide Boolean keyword search of pages|
|Third-party crawler-based search engines (e.g. Google)|Site-wide Boolean keyword search of pages|
|Search appliances|Site-wide Boolean, keyword, and/or faceted search of nodes or page content, depending on appliance|
|Contributed "Faceted Search" module|Faceted search of one or more Node types|
|Contributed "Views" module with exposed filters|Field-based search of Nodes, Users, Files, or Comments|
The following sections explore these search options in more detail.
Drupal's core "Search" module
The core module "Search", which is distributed with Drupal, is the easiest search method to deploy on a Drupal-based site. To get the core Search feature working, all you have to do is:
- Enable the module
- Optionally, configure preferences, such as changing the priority ordering of search results
- Add a search box to your theme (either via a check box in the theme configuration, or by enabling the Search Block).
- Make sure Drupal's "cron" is set up correctly, because that is how the site will be indexed (without the index, search will return no results).
The way this module works is that it indexes all the published Nodes on your site, and then when someone does a search, it looks in that index and finds content appropriate to that visitor's permissions that contains the search term. A second tab on the search results page lets a visitor search for matching user names of registered users on the site; this requires no additional index.
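The "index everything, then filter by permissions at search time" idea can be sketched in a few lines of Python. This is only an illustration of the principle, not Drupal's actual access system, which is considerably more involved. The index contents, role names, and the `search_for_visitor` function are all hypothetical.

```python
def search_for_visitor(index, access, keyword, visitor_roles):
    """Look up the keyword in the full index, then keep only the
    items that at least one of the visitor's roles may view."""
    matches = index.get(keyword.lower(), set())
    return sorted(item for item in matches if access[item] & visitor_roles)

# Hypothetical index covering ALL published items, regardless of who may see them
index = {"budget": {10, 11}, "picnic": {11}}

# Hypothetical access table: which roles may view each item
access = {10: {"staff"}, 11: {"anonymous", "staff"}}

print(search_for_visitor(index, access, "budget", {"anonymous"}))
print(search_for_visitor(index, access, "budget", {"staff"}))
```

The same index serves every visitor; only the filtering step differs from one search to the next.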
Unfortunately, this search method has several drawbacks that limit its utility:
- For the content search, only exact keywords match. This means that if someone searches for "quake", and a node contains "quakes", "quaking", or "earthquake", it will not be matched. (There are ways to change this behavior -- see the Stemming note below.) In contrast, user name search always looks for substrings, so for instance if you search for "jo", you would find users called "mojo" and "josephine" as well.
- All of the Node content on your site will be indexed, whether you want it to be or not. For some sites, this may not be appropriate, as some content types are not ever meant to be displayed (or searched) on their own pages, but only on composite pages such as Views.
- Only Node content on the site will be indexed. So if you have a module that produces content that is not Nodes, or pages that are composites of multiple nodes (such as Views), they will not be included in search results. Again, this may be a problem for some sites.
- Profile fields are not searched in the User search, just the user names.
- There is no faceted search capability. However, the Advanced Search section of Content search does allow you to restrict by Taxonomy and Content Type.
- (technical note) If you have modified how the Node content on your site is displayed in the Theme, these modifications will not be indexed correctly. The content that is indexed is the default view of the node, not the view as rendered by the Theme.
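The difference between the two matching behaviors described above (whole keywords for content, substrings for user names) is easy to see in a small Python sketch. This is a simplified illustration with made-up function names, not the module's actual code.

```python
import re

def matches_whole_word(keyword, text):
    """Content-style matching: the keyword must equal a whole word."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return keyword.lower() in words

def matches_substring(fragment, name):
    """User-name-style matching: the fragment may appear anywhere."""
    return fragment.lower() in name.lower()

# "quake" does not match "quakes" or "earthquake" as a whole word
print(matches_whole_word("quake", "Recent quakes near the coast"))
print(matches_whole_word("quake", "Earthquake preparedness"))
print(matches_whole_word("quake", "A small quake last night"))

# "jo" matches both of these user names as a substring
print(matches_substring("jo", "mojo"))
print(matches_substring("jo", "josephine"))
```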
The core Search module exposes several API hooks that allow other modules to modify its behavior, and many contributed modules on drupal.org do so (though it looks like not many of them are currently maintained). One that looks promising is Search Files, which adds a tab that searches PDF files, Word documents, etc. on your site.
Another class of contributed modules for Search is "Stemming" modules, such as Porter-Stemmer. Stemming modules enable matching for inflected forms of words, so that "quakes", "quaking", and "quake" all count as matches for each other ("earthquake" would not, however, be a match). These modules are language-specific -- the Porter-Stemmer module works for English; for non-English Drupal sites, try searching the Projects on drupal.org for "stemmer". The Porter-Stemmer module requires no set-up beyond installing the module and setting up Search, and it works well.
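As a rough illustration of what stemming does, here is a deliberately crude suffix-stripping function in Python. It is NOT the Porter algorithm (which has many more rules and conditions) and it is not the module's code; the `crude_stem` name and its suffix list are assumptions chosen just to reproduce the "quake" example.

```python
def crude_stem(word):
    """Strip one common English suffix so related word forms collapse
    to the same stem. A toy rule set, far simpler than Porter's."""
    w = word.lower()
    for suffix in ("ing", "es", "s", "e"):
        # Keep at least three letters so short words are left alone
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            return w[: -len(suffix)]
    return w

for word in ("quake", "quakes", "quaking", "earthquake"):
    print(word, "->", crude_stem(word))
```

If both the indexed words and the search keywords are passed through the same stemmer, the three inflected forms land on one index entry, while "earthquake" keeps a different stem and still does not match.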
Contributed "Search by Page" module
I recently created a new contributed module for Drupal called "Search by Page", specifically to get around some of the limitations of the core "Search" module. Search by Page uses the core Search module's indexing and searching capabilities, but indexes different content in a slightly different way. The main difference is that Search by Page is oriented on paths (pages) on your site, rather than on Nodes. For each path that it indexes, it builds the content pane of the resulting page, and indexes the content as rendered by the theme (excluding headers, footers, sidebars, and other regions of the page). Like the core Search module, Search by Page indexes all specified content, and then presents search results appropriate to the permissions of the person who is searching.
The three sub-modules that come with Search by Page allow you to index Node pages (you can choose which content types you want to index), User profile pages (you can choose which user roles you want to index), and individually-entered paths (which you can use to index Views pages, pages generated by other modules, and other content that is not Nodes or Users). There is also an API for writing your own sub-modules, if there is another class of content you would like to index.
Search by Page suffers from the same problem of exact keyword matching as the core Search module, but you can use the same stemming modules to get around it (since Search by Page relies on the core Search for its indexing and searching technology). Search by Page does not offer faceted or field-based searching, and it doesn't let you change the priority ordering of search results.
You can download Search by Page from its project page on drupal.org.
Third-party crawler-based search engines
Another method for searching your Drupal-based site is to rely on a third-party search engine, such as Google, which will "crawl" the pages of your site (follow links in your menus and embedded links in your pages), index the pages it finds, and allow searching via the engine's external web site. I'm not specifically endorsing Google, but that's the only search engine I know much about, so here are two methods for using Google to search your Drupal site:
- It's straightforward to add a Google search box to your site that will allow a visitor to type in keywords and result in a search restricted to just your site, by just creating a form that points to Google. The disadvantage of this method is that the search results come out on google.com, so it takes your visitors away from your site (hopefully only temporarily).
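For the curious, the form described above boils down to sending the visitor to a Google search URL whose query is restricted with the `site:` operator. Here is a small Python sketch of building such a URL; `example.com` and the function name are placeholders, and a real form on your site would do the equivalent in HTML.

```python
from urllib.parse import urlencode

def google_site_search_url(site, keywords):
    """Build a Google search URL restricted to one site via the
    'site:' operator in the 'q' query parameter."""
    query = f"site:{site} {keywords}"
    return "https://www.google.com/search?" + urlencode({"q": query})

# Placeholder domain and keywords
url = google_site_search_url("example.com", "earthquake safety")
print(url)
```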
There are several advantages of using a third-party crawler search engine for your site:
- You don't have to maintain the search index, or allocate space for it.
- Searching tends to be pretty fast.
- The crawlers will index the entire page, as rendered by your theme.
- You can use a module such as XML Sitemap for more control over which pages are indexed on your site.
- Most search engines have good substring matching capabilities, and support multiple languages, so you shouldn't have to worry about exact matching of your search terms.
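To give an idea of what a sitemap module produces, here is a Python sketch that writes a minimal file in the sitemaps.org XML format. The page URLs and dates are invented, and the `build_sitemap` function is only an illustration, not what the XML Sitemap module does internally. Note that a sitemap is a suggestion to crawlers about which pages exist, not a guarantee of how they will be indexed.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def build_sitemap(pages):
    """Return sitemap XML listing each (url, last-modified date) pair."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, lastmod in pages:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        ET.SubElement(url, "lastmod").text = lastmod
    return ET.tostring(urlset, encoding="unicode")

# Hypothetical pages you would like crawlers to know about
pages = [
    ("https://example.com/about", "2009-03-01"),
    ("https://example.com/articles/search", "2009-03-15"),
]
sitemap_xml = build_sitemap(pages)
print(sitemap_xml)
```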
There are also several disadvantages:
- You don't have much control over when your site is indexed (you may be able to provide guidance, but you can't force Google to re-index your pages). Outdated content may be returned in search results for a long time.
- Aside from the Business option mentioned above, you have little control over how the search results are displayed (may include ads for your competition).
- The search engine has no knowledge of the fields on your content, so there is no faceted or field-based searching capability.
- Only content that is available to all site visitors is indexed (when you use the core Search or Search by Page module, all content is indexed, and then only content the particular site visitor can see is returned in search results).
- The entire page is indexed, which includes headers, sidebars, and footers. So if you have a particular keyword in your sidebar, and someone searches for it, they could potentially have every page on your site returned in the results (though probably not at the top of the results).
Search appliances
There has been a lot of work done lately on integrating search appliances into Drupal (by "search appliance", I mean a search engine program that runs outside of Drupal, on your web server or an external host). These appliances are somewhat difficult to deploy, and if your web site is hosted on standard shared PHP/MySQL hosting, some of them may be impossible to deploy. On the other hand, they offer advantages such as faster searching and faceted searching; if you are deploying a high-volume site, they may be worth looking into for the efficiency gains alone. The options I am aware of in this category (mostly from attending a panel discussion on Search in Drupal at DrupalCon DC 2009, which is available to watch as a video) are:
- Apache Solr - this is currently in use as the search engine on drupal.org. It promises good performance and faceted searching. The Solr search engine is difficult to set up and runs on Java; Acquia is offering a hosted solution, which should be easier (but potentially more costly).
- Lucene - a purely PHP implementation of the same technology behind the Apache Solr project (since it is fully PHP, you can probably deploy it on shared hosting). Though the functionality is apparently not as complete as Solr, because it is PHP-based you can configure it all within Drupal.
- Xapian - efficient rather than feature-rich; not yet released for Drupal 6, as of this writing, but they're working on it. Requires having the Xapian PHP libraries on your web server, so may not be possible on a shared hosting account (these are compiled libraries that add functionality to PHP, not PHP scripts).
- Sphinx - offers efficiency and facets. There are two Sphinx integration modules: Sphinx Search Integration, which is currently only released for Drupal 5 and involves some pretty technical set-up, and Sphinx Search, in development for Drupal 6, with better configuration options at the cost of a bit of efficiency. Probably neither can be installed on a shared hosting account.
Contributed "Faceted Search" module
The Faceted Search contributed module for Drupal allows you to set up one or more faceted search "environments" for your site. In each environment, you can decide which content types to search, and decide which facets you want to search them by. You can choose from facets of Taxonomy and Content Type; with the CCK Facets module, CCK fields can also be facets, and there are other Faceted Search modules that expand the list of available facets. You can also enable keyword searching in environments; this uses the core Search index, so it should be fairly efficient.
The Faceted Search module is pretty easy to set up, quite flexible, has a fairly intuitive search interface, and works well. Its best use, in my opinion, is to set up specific faceted searching of a single content type's fields and taxonomies, rather than for full-site searching. But for many sites, this type of searching is much more useful than just blindly deploying a full site search, and you can always define a second or third environment, probably with different facets, if you want to search different content.
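The core idea behind a faceted interface (show each term with the number of results it would give, and narrow the result set when a term is clicked) can be sketched in Python as follows. This is a conceptual illustration only, not the Faceted Search module's implementation; the product data and function names are made up.

```python
from collections import Counter

def facet_counts(items, field):
    """Count how many of the current results fall under each term of a facet."""
    return Counter(item[field] for item in items)

def refine(items, field, term):
    """Restrict the current results to one term, as clicking a facet link would."""
    return [item for item in items if item[field] == term]

# Hypothetical catalog content with two facets: category and brand
products = [
    {"name": "Trail tent", "category": "Tents", "brand": "Acme"},
    {"name": "Dome tent", "category": "Tents", "brand": "Summit"},
    {"name": "Down bag", "category": "Sleeping bags", "brand": "Acme"},
]

print(facet_counts(products, "category"))

# After the visitor clicks "Tents", the brand counts update to match
tents = refine(products, "category", "Tents")
print(facet_counts(tents, "brand"))
```

Removing a previously applied restriction is just a matter of re-running the counts on the broader result set.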
Contributed "Views" module with exposed filters
The Views contributed module for Drupal also allows you to set up one or more search environments, through "exposed filters". Although the experience is less sophisticated than searching with the Faceted Search module (drop-down lists or type-ahead text boxes, rather than lists of links with numbers and removal options), it offers equivalent searches. Also, because Views can be composed of Nodes, Comments, Files, or Users, you can set up searches for more than just Nodes on your site.
Views with exposed filters are reasonably easy to set up, once you get used to Views (if you develop many Drupal sites, you will probably be familiar with Views anyway). Like Faceted Search, Views-based searching is probably more appropriate to searching a specific content type than to searching the entire site.
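Conceptually, a set of exposed filters behaves like the following Python sketch: each filter the visitor fills in restricts one field, and filters left empty are ignored. The field names, sample data, and `filter_by_fields` function are hypothetical, and Views itself does this work with database queries rather than in-memory lists.

```python
def filter_by_fields(items, **filters):
    """Field-based search: keep items whose fields equal every filter
    value the visitor supplied; filters left as None are skipped."""
    active = {field: value for field, value in filters.items() if value is not None}
    return [
        item for item in items
        if all(item.get(field) == value for field, value in active.items())
    ]

# Hypothetical content with two searchable fields
books = [
    {"title": "Using Drupal", "format": "Book", "language": "English"},
    {"title": "Drupal en français", "format": "Book", "language": "French"},
    {"title": "Site building video", "format": "DVD", "language": "English"},
]

# The visitor chose a format from a drop-down and left language blank
print(filter_by_fields(books, format="Book", language=None))
```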