New Solr search module in the works (UPDATED)

The download files have been updated! Try out the much-improved Drupal module. Solr's schema.xml has also changed, so get the Solr download as well.

Many people ask whether there are alternatives to Drupal's core search module, and for many reasons. Some want features that Drupal search doesn't offer, some want more configuration options, and some want better performance. Anyone looking for an alternative should consider Solr. Solr is a project from the Apache Software Foundation that takes the power of Lucene, a fantastic indexer and searcher, and exposes it as a web service. Using HTTP POST and GET requests, you can feed documents to Solr for indexing and issue queries for searching. I've thought for a while now that this would be a perfect fit for large Drupal sites that have complicated searching needs or that need to scale their search infrastructure to meet heavy demand.
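
As a rough illustration of that web-service model, the Python sketch below prepares (without sending) one search request and one indexing request. The base URL matches the local address of the example application described later in this post, the /select and /update paths are the handlers of a stock Solr install, and the helper names are made up for this example.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumption: a stock Solr instance on the default local port.
SOLR_BASE = "http://localhost:8983/solr"


def search_request(keywords, rows=10):
    """Prepare a GET request that asks Solr for matching documents."""
    query = urlencode({"q": keywords, "rows": rows})
    return Request("%s/select?%s" % (SOLR_BASE, query), method="GET")


def index_request(add_xml):
    """Prepare a POST request that feeds an <add> XML message to Solr."""
    return Request(
        "%s/update" % SOLR_BASE,
        data=add_xml.encode("utf-8"),
        headers={"Content-Type": "text/xml; charset=utf-8"},
        method="POST",
    )


if __name__ == "__main__":
    get = search_request("drupal search")
    post = index_request("<add><doc><field name='nid'>1</field></doc></add>")
    print(get.get_method(), get.full_url)
    print(post.get_method(), post.full_url)
    # Pass either object to urllib.request.urlopen() to actually send it.
```

Note that newly posted documents only become searchable once a <commit/> message has been posted to the same update URL.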

Over the weekend I first evaluated the existing Solr project at Drupal.org, and then began to write my own Solr module. The new module aims first to replicate core Drupal search, and second, to expose all the power of Solr to Drupal.

What I found in the Solr project on Drupal.org is the work of hickory, who understands Solr very well. The module enforces some design decisions that I wasn't comfortable with, though. For example, to use it you have to turn Drupal's core search module off. This would shut down any Views/Fastsearch action you've got going, and I didn't want to be forced down a road where I can't use a core Drupal module. The existing project also needs a lot of cleanup before it meets Drupal coding standards, and it has some bugs.

What I wanted was an easy-to-install 1:1 replica of Drupal core search. And that's what I wrote over the weekend, mostly. The module makes use of the excellent PHP/Solr client by Donovan Jimenez (included, with a small bugfix, in the download I provide below). Here's what I've got so far:

  1. A Drupal module: solr module
  2. A slightly modified version of the Solr example application that comes with the Solr download.

You install the Drupal module like any normal module. Unpack it into sites/all/modules and enable it on the admin/build/modules screen. Then you unpack the example application somewhere (your home directory is fine). I put it in /home/robert.

Then start Solr:

cd /home/robert/solr
java -jar ./start.jar

Your terminal should show some logging information as Solr starts up.

You then need to run cron.php as often as needed until admin/settings/search tells you that the index is 100% complete. While cron is running you'll see the logging information in the terminal window telling you that documents are being ingested. When indexing is done, you can search your site with Solr from example.com/search/solr.

To explore Solr a bit more you can test the administration web application that is now running on your machine: http://localhost:8983/solr/admin/

The only file I changed in the Solr example application is solr_example/solr/conf/schema.xml. I added field definitions that mirror the node information that is needed for displaying search results.

   <field name="title" type="text" indexed="false" stored="true"/>
   <field name="body" type="text" indexed="false" stored="true"/>
   <field name="type" type="string" indexed="true" stored="true"/>
   <field name="uid"  type="sint" indexed="true" stored="true"/>
   <field name="changed" type="sint" indexed="true" stored="true"/>
   <field name="nid"  type="sint" indexed="true" stored="true"/>
   <field name="text" type="text" indexed="true" stored="false"/>
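
To make the field list concrete, here is a hedged Python sketch of how a node might be serialized into the <add> XML message that Solr's update handler accepts. The field names come from the schema excerpt above; the sample node values and the node_to_add_xml helper are invented for illustration, and the catch-all text field is left out because how it gets populated isn't covered here.

```python
import xml.etree.ElementTree as ET

# Stored fields from the schema excerpt, in the order they are emitted.
FIELDS = ("nid", "title", "body", "type", "uid", "changed")


def node_to_add_xml(node):
    """Serialize one node (a plain dict) as a Solr <add><doc> message."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for name in FIELDS:
        field = ET.SubElement(doc, "field", name=name)
        field.text = str(node[name])  # ElementTree escapes &, < and > on output
    return ET.tostring(add, encoding="unicode")


if __name__ == "__main__":
    sample = {  # hypothetical node values
        "nid": 42,
        "title": "Solr & Drupal",
        "body": "Lucene exposed as a web service.",
        "type": "story",
        "uid": 1,
        "changed": 1199145600,
    }
    print(node_to_add_xml(sample))
```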

The advantages I hope to gain by using Solr include:

  • Speed increase
  • More advanced indexing tools
  • More control over the nature of the search index
  • Support for multiple search indexes
  • Faceted search

The scoring in the current implementation isn't well tuned, and Drupal core search still returns results in a more relevant order. In my tests, Solr also had some problems with false positives, sometimes returning results that didn't match the search criteria in any way. These issues need to be addressed.

So much more can be done here, and I'm thrilled to have some working code. Please try it out, learn about Solr, join the Lucene, Nutch and Solr group on groups.drupal.org, and enjoy!

Attachment         Size
solr-drupal.zip    212.57 KB
solr-apache.zip    14.71 MB

Comments

Great Solr/Faceted search slideshow

If you want a better overview of Solr and faceted search, and a little more information on why I find this so exciting, look at the PDF document here: http://people.apache.org/~hossman/apachecon2006us/

Querying and updating 'external' Solr index from Drupal

As you said, this is really exciting development! If Drupal gets Solr as a simple turn-key solution, it will be leaps and bounds ahead of most CMSes, both open source and closed.

I have played a little with Solr, but with my mediocre programming skills (I'm more on the UI/IA side of things) I haven't yet figured out how to query and update a Solr index originally created from legacy data, i.e. data which is not Drupal nodes.

My final goal is to be able to hack together a UI prototype which combines stuff from Drupal (for example images) and query results from an external Solr index. Is this an insanely difficult or outright impossible thing to do, i.e. is putting all the data inside Drupal the only way to go?

There's nothing insane about it.

A custom interface will always require some custom programming, but the workflow is always the same: take a form submission, build a Solr query from it, send the query, display the response.
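
Here is a minimal Python sketch of those four steps, nothing more than an outline. The form layout, the helper names and the fake sender are all invented for the example; it does no input validation, and it assumes the Solr version in use offers the JSON response format (wt=json).

```python
from urllib.parse import urlencode


def build_params(form):
    """Step 2: turn a submitted form (a plain dict) into Solr parameters."""
    clauses = ["(%s)" % form["keywords"].strip()]
    # Optional exact-match restrictions, e.g. {"type": "story"}.
    for field, value in sorted(form.get("filters", {}).items()):
        clauses.append('%s:"%s"' % (field, value.replace('"', '\\"')))
    return {"q": " AND ".join(clauses), "wt": "json"}


def search(form, send):
    """Steps 3 and 4: send the query, then reduce the response to titles.

    `send` is any callable that takes a path-plus-query string and returns
    the parsed JSON response as a dict.
    """
    response = send("/select?" + urlencode(build_params(form)))
    docs = response["response"]["docs"]
    return [doc.get("title", "(untitled)") for doc in docs]


if __name__ == "__main__":
    def fake_send(path):  # stand-in for a real HTTP call
        print("would request:", path)
        return {"response": {"numFound": 1, "docs": [{"title": "Hello Solr"}]}}

    form = {"keywords": "solr", "filters": {"type": "story"}}  # step 1
    for title in search(form, fake_send):
        print(title)
```

Because the sender is passed in, the same code can talk to an index that was never filled from Drupal; only the field names in the form and in the display step change.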

Drupal and Java

A while ago, in a piece promoting PHP5, I read somewhere an assumption that Drupal could be rewritten in Python, or any other language for the sake of discussion, if PHP one day ceased to be developed.
What does it mean if we were to count on a Java application as core's search replacement?
(I've looked here: http://buytaert.net/drupal-community-skills but didn't find Java...)

Well for one thing it means

Well, for one thing it means that the people in charge of hosting your site have to have some minimal Java expertise. This includes, at a minimum, being able to run the command as per my instructions (so not much). A more robust solution would involve actually installing Jetty or Tomcat (or Resin or JBoss) and deploying the Solr .war file along with the proper configuration files. Also not hard, but not part of the standard LAMP skill set.

As far as community and support goes, however, you'll find lots of activity in this area. Lucene and Solr are both actively developed and have a wide user base. You'll not have problems finding people to hire who are familiar with them. I consider Java a pretty safe technology to deploy in terms of human resources, stability, support, etc.

That's great Robert, thanks

That's great Robert, thanks a lot! If I find time I will help; this module/integration is really helpful for many.

Why Java and what about attached documents?

I agree with the general principle of providing a better search mechanism than Drupal provides in core, and I can think of several requirements:
1) Speed
2) Speed again (I'm labouring the point that ideally a search engine should come back as fast as Google or as near as possible, and that this is really a prime requirement of any search engine no matter what it does)
3) Weighted search: it should be possible to weight the search results depending on various criteria: eg the keywords appear in the title, or in a header of the content, or in the associated taxonomy classification, etc. (actually there's a module in Drupal called SQL Search that allows you to do this already).
4) Handling attached documents: if you wanted to use Drupal as a document management system as opposed to a CMS, then this is essential. Swish-e does this for example - and ideally you would want to be able to do a weighted search on the document content: I would be curious to know what you consider to be the advantage of using solr over Swish-e.

The other thing that leaves me perplexed about Lucene and solr is why anybody would want to build a search engine using Java? I have never been convinced that Java is really suitable for work which absolutely must be FAST: no matter which way you look at it, Java is interpreted code with an intermediate layer between it and the operating system: I don't see how it can possibly be as fast as native C for example (Swish is apparently written in C by the way). It's a long time since I did any real programming... but nonetheless that still seems to me a pretty basic principle, unless of course it is the database engine at the back that actually does all the work.

Java is fast enough. Lucene can index anything you send at it.

For attached documents, you'd want to do something on the Drupal side to extract content. Solr is a REST API, not a package that knows how to parse Word or PDF documents, so it is up to the application to do that before sending stuff to Solr. That's fine, though, and there are plenty of packages that can do this.

You can also specify at index time or query time how different fields are to be weighted. I haven't started to tap these features yet in my Drupal code, but it will be possible in the future since Solr definitely supports it.
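
For the query-time case, Lucene's query syntax lets a clause carry a boost through a ^ suffix. The Python sketch below builds such a query string for a single search word. The boosted_query helper and the weights are made up, and it assumes the fields being searched are indexed, which the title and body definitions in the schema shown in the post currently are not.

```python
def boosted_query(term, weights):
    """Build a Lucene query that searches `term` in several weighted fields.

    `weights` is a list of (field, boost) pairs; a boost of 1 is the
    default weight and is therefore left out of the query string.
    """
    parts = []
    for field, boost in weights:
        clause = "%s:%s" % (field, term)
        if boost != 1:
            clause += "^%g" % boost
        parts.append(clause)
    return " OR ".join(parts)


if __name__ == "__main__":
    # A title match counts twice as much as a body match (hypothetical weights).
    print(boosted_query("drupal", [("title", 2), ("body", 1)]))
```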

Java is fast enough. The "interpreted code" argument is as old as Java itself (and Java did start out with some serious performance problems), so I'll let you read up on the lively and ongoing debate: http://www.kano.net/javabench/

swish-e is extremely fast

Swish-e is an extremely fast and valuable search engine that I have been using for years, yet it has a serious limitation: it cannot index Unicode text.

Moreover, the Solr approach is more flexible and does not require the executable to be started from PHP with all the shell mess. And the indexing machine can be on an independent server.

And Lucene must be fast enough, judging by the number of products that use it (and by my own anecdotal experience). Nutch was meant as a web search engine. But Google would not use Java to build their search engine, I agree :-)

Solr - Drupal implementation

Hello Robert,

Looking for a search platform based on Drupal and Solr for our regional newspaper website. I'd like the search forms and results pages customized for categories such as jobs, homes, cars, classifieds, etc. but probably all stored in the same Solr app.

In addition, I'm looking for the ability to write a query such as "Boston vertical:jobs category:professional" and then create a "block" to display in nodes (like a typical newspaper "Top Ads" implementation).

FYI, some parts of our site, such as "community", are based on Drupal, and other parts are based on .NET and ColdFusion, so I'd have to inject some records into Solr from outside Drupal.

If you know of any freelancers out there that may fit this opportunity, please let me know.

Thanks,

Tim

Results returned by dates, most recent?

Does this module have the same problem that all other Drupal Search modules have, which is that the results are returned in any kind of order BUT most recently posted first? Dated returns are the most common way of getting results in any other CMS and I can not fathom why Drupal's various modules do not have this as an option when users are accustomed to this.

solr etc.

Hey Robert,

nice job, thanks!!

Just installed this on our staging server, and after some abracadabra with set_include_path it worked pretty well, other than cron jobs that tend to fall over. It took me a long time to get Drupal to send everything to Solr, and I had to delete cron_semaphore many times to rerun.

Nevertheless, really good stuff. I had eyed the existing solr module in the past, but was not too comfortable with it for the same reasons you outlined. Also don't like to run multiple instances for different content types.

I am primarily interested in the faceted search aspects, and we will be working on that over the next few weeks. care for some patches? pop me a mail if you are.

I'll upload a better version in a day or so

And eventually we need to settle on how to host this module on Drupal.org. Glad to have you on the team!

I am also having a look at

I am also having a look at DBSight. It isn't open source, which is pretty important for us, and it seems to do a bit of what Solr does. The difference is that you specify a set of SQL queries to pull the content out of your site and feed that to Lucene. You can then run searches against it and parse responses using a variety of different methods.

I have a test setup I am happy to provide you access to, if you want to have a look.

I like the speed: it indexes all my content (about 2000 largish nodes for the test setup) and related taxonomy in a few seconds, and search results return instantly. I am on the fence about the fact that the whole thing sits outside of Drupal. In some ways that's nice: I can run it on a different machine, and it does not carry a lot of the Drupal overhead. In some ways it isn't nice, because, well, I like my stuff playing nicely with the overall framework. I don't like the fact at all that it isn't open source.

awesome

This is a really nice update. It installed much more cleanly, with a lot less fuss.
I had to make a very small modification to solr.module to be able to call solr_update_index() directly with a larger document set. Running a full re-index through cron takes a long time, especially when cron handles a lot of other jobs. In our case, we pull in a lot of feeds and a lot of financial data on each cron run. PHP ran out of memory a few times (I had a limit of 128M in php.ini).

As for speed, I noticed some improvement using solr over regular drupal search, but I will work on some more serious benchmarks over the next few days.

Are you going to host this module on drupal.org? 2 devs on my team will start working on solr soon, and we will likely take this module and run with it - I would like to be able to send you some patches :) Likely we will focus heavily on performance and faceting.

Drupal.org soon

There's no avoiding getting this project up on Drupal.org. The sooner the better, I suppose. Your team's involvement is a good motivator for doing that. I need to get the external PHP library from the Apache foundation patched with the bugfix that this download contains. Maybe I can persuade the author to dual-license it so that we can host the code on Drupal.org. I could also get rid of the include dir trickery =)

Any suggestions for a drupal project name? solr is taken. I was thinking "Apache Solr Search" and apache_solr as the human readable and unix names.

Apache Solr Search sounds

Apache Solr Search sounds fine, seeing that that's what it does :D

Following this module development

I've been following this module development since the initial blog post and also tried it out in a few development and testing environments. I'd love to see this make its way to drupal.org also.

I too had to change the include dir to make the module work, so I'll give my +1 for 'motivating' the author to dual-license it :)

Solr for library search

Robert,

I'm at DrupalCon in Boston, and just caught part of a discussion about libraries. Eric Goldhagen told me to look into your work, because it's very similar to mine...

I'm about to start working on allowing searching of a Koha database from a Drupal site. Eric says you did this with a proprietary LMS ("Triple I?") with Solr doing most of the hard work. Can you point me towards anything, beyond your module in this post? I'm completely new to Koha, Solr, etc... although my Drupal chops are solid enough.

btw- are you at DrupalCon? If so, I'd love to chat.

brad at sleepcamel dot net

Hi Rob, I've been

Hi Rob,

I've been implementing Solr in the hope that it can provide a better substitute for Drupal core search. I followed your instructions closely. This is what I have installed:
1. CentOS
2. Java 1.6
3. Ant
4. Drupal 5.1
5. Solr -- the one you provided
6. Solr Module -- the one you provided

But I still can't get it working. When I search for something using Solr (search/solr), a blank page is returned, as if an error had occurred.

My questions:
1. Is there anything that I might have missed during the installation?
2. I think memcache and Solr are having trouble with each other. I can't find exactly what the trouble is, but I suspect there is some. I'd just like to hear your thoughts on this.

Thanks!

The current ApacheSolr module

For those of you who are reading about Apache Solr and Drupal for the first time on this post, please visit http://drupal.org/project/apachesolr

That's where development has continued over the course of the past year, and where you can find the latest Drupal modules (currently for Drupal 5 and Drupal 6), as well as a bunch of documentation.

need help on Apache solr

Hi,
I need help with multisite search. I am using Apache Solr on my site, and everything is working fine except the multisite tab.
When I click the multisite tab it throws this error:
Fatal error: Call to undefined method Solr_Base_Query::get_query() in /opt/lampp/htdocs/drupal-6.10/sites/all/modules/apachesolr/contrib/apachesolr_multisitesearch/apachesolr_multisitesearch.module on line 105

I have tried a lot to solve this but I can't. Please help me.

Apache Solr

I am facing a problem with Apache Solr and Drupal for Chinese, Japanese and Korean text.

These languages don't use space characters between words.

If I search for a single Chinese character that has another character next to it in the text, the search finds nothing.

For example, if the text is Susan???, it should be found by each of the following search strings: "?", "??", "???", "Susan???"

I have searched the net, and it says you have to install the CJK package. I am not able to modify Apache's solr.war.

Can anyone provide some help? It's very urgent.