Top Ten Tips for Writing Plumtree Crawlers that Actually Work

Just in time for Halloween, I’ve decided to publish my Top Ten Tips for Writing Plumtree Crawlers that Actually Work. This post may scare you a little bit, but hey, that’s the spirit of Halloween, right?

[Editor’s note: yes, we’re still calling it Plumtree. Why? I did a Google search today and 771,000 hits came up for “plumtree” as opposed to around 300,000 for “aqualogic” and just over 400,000 for “webcenter.” Ignoring the obvious — that a short, simple name always wins over a technically convoluted one — it just helps clarify what we’re talking about. For example, if we say “WebCenter,” no one knows whether we’re talking about Oracle’s drag-n-drop environment for creating JSR-168 portlets (WebCenter Suite) or Plumtree’s Foundation/Portal (WebCenter Interaction). So, frankly, you can call it whatever you want, but we’re still gonna call it Plumtree so that people will know WTF we’re talking about.]

So, you want to write a Plumtree Crawler Web Service (CWS), eh?

Here are ten tips that I learned the hard way (i.e. by NOT doing them):

1. Don’t actually build a crawler
2. If you must, at least RTFM
3. “Hierarchyze” your content
4. Test first
5. When testing, use the Service Station 2.0 (if you can get it)
6. Code for thread safety
7. Write DRY code (or else)
8. Don’t confuse ChildDocument with Document
9. Use the named methods on DocumentMetaData
10. RTFM (again)

Before I get into the gory details, let me give you some background. First off, what’s a CWS anyway? It’s the code behind what Oracle now calls Content Services, which spider through various types of content (for lack of a better term) and import pointers to those bits of content into the Knowledge Directory. This ability to spider content and normalize its metadata is one of the most underrated features in Plumtree. (FYI, it was also the first feature we built and arguably, the best.)

Each bit of spidered content is called a Document or a Card or a Link depending on whether you’re looking at the product, the API or the documentation, respectively. It’s important to realize that CWSs don’t actually move content into Plumtree; rather, they store only pointers/links and metadata and they help the Plumtree search engine (known under the covers as Ripfire Ignite) build its index of searchable fulltext and properties.

Today, Plumtree ships with one OOTB CWS that knows how to crawl/spider web pages. Not surprisingly, it’s known as the Web Crawler. Don’t let the name mislead you: the web crawler can actually crawl almost anything, as I explain in my first tip, which is:

Don’t actually build a crawler.

But I’m getting ahead of myself.

So, back to the background on crawlers. Oracle ships five of ’em, AFAIK: one for Windows files, one for Lotus Notes databases, one for Exchange Public Folders, one for Documentum and one for Sharepoint. Their names give you blatantly obvious hints at what they do, so I won’t get into it. Along with the OOTB crawlers, Oracle also exposes a nice, clean API for writing crawlers in Java or .NET. (If you really want to push the envelope, you can try writing a crawler in PHP, Python, Ruby, Perl, C++ or whatever, but it’s hard enough to write one in Java or .NET, so I wouldn’t go there. If you do, though, make sure that your language has a really good SOAP stack.)

So, after reading this, you still want to write a crawler, yes?

Let’s get into my Top Ten Tips:

1. Don’t actually build a crawler

Yes, you really don’t want to go here. Building crawlers is not that hard, as there’s a clean, well documented API. However, getting them work is a whole other story.

Most applications these days have a web UI. So, take advantage of it. Point the OOTB web crawler at the web UI and see what it does. Some web UIs will work well, other won’t (particularly if they use frames or lots of javascript.)

Let’s assume for a moment that this technique doesn’t work. Or perhaps you’re dealing with some awful client-server or greenscreen POS that doesn’t have a web UI. Either way, you may still be able to use the web crawler.

How? Well, the web crawler can crawl almost anything using something that we used to call the Woogle Web. Think of it this way. Say you want to crawl a database. Perhaps that database contains bug reports. Rather than waste your time trying to write a database crawler, just write an index.jsp (or .php, .aspx, .html.erb, .you-name-it) page that does something like select id from bugs and dumps out a list of all the bugs in the database. Then, hyperlink each one to a detail page (that’s essentially a select * from bugs where id = ? query). Your index page can be sinfully ugly. However, put some effort into your detail pages, making them look pretty AND using meta tags to map each field in the database to its value.

Then, simply point the OOTB web crawler at your index page, tell it not to import your index page, map your properties to the appropriate meta tags, crawl at a depth of one level, and get yourself a nice cup of coffee. By the time you get back, the OOTB web crawler will have created documents/links/cards for every bug with links to your pretty detail pages and every bit of text will be full-text searchable. So will every property, meaning that you can run an advanced search on bug id, component, assignee, severity, etc.

At Plumtree, we used to call this a Woogle Web. It may sound ridiculous, but

a Woogle Web is a great way to crawl virtually anything without lifting a finger.

However, a Woogle Web won’t work for everything. If you’re dealing with a product where you can’t even get your head around the database schema AND you have a published Java, .NET or Web Service API, then you might want to think about writing a custom crawler.

2. If you must, at least RTFM

If you’re anything like me, reading the manual is what you do after you’ve tried everything else and nothing has worked out. In the case of Plumtree crawlers, I recommend swallowing your pride (at least momentarily) and reading all of the documentation, including their “tips” (which are totally different from and not nearly as entertaining as my tips, but equally valuable).

Once you’re done reading all the documentation, you might also want to consult Tip #10.

3. “Hierarchyze” your content

Um, yeah, I know “hierarchyze” isn’t a word. But since crawlers only know how to crawl hierarchies, if your data aren’t hierarchical, you darn well better figure out how to represent them hierarchically before you start writing code. Even if you don’t think this step is necessary, just do it because I said so for now. You’ll thank me later.

4. Test first

Don’t even try to write your crawler and then run it and expect it to work. Ha! Instead, write unit tests for every modular bit of code that you throw down. To every extent that’s it’s possible, write these tests first. It’ll save your butt later.

5. When testing, use the Service Station 2.0 (if you can find it)

When you do finally get around to integration testing your crawler, it’ll save you a lot of time if you use the Service Station 2.0. However, it may take you a long time to get it, so start the process early.

Unlike every product Oracle distributes, Service Station is 1) free and 2) not available for download. Yes, you read that correctly.

To get it, you need to contact support. I called them and told them I needed it and after two weeks I got nothing but a confused voicemail back saying that support doesn’t fulfill product orders. Um, yeah. So I called back and literally begged to talk to someone who actually knew what this thing was. Then I painstakingly explained why I couldn’t get it from edelivery (because it’s not there) nor from commerce.bea.com (because it’s permanently redirecting to oracle.com) nor from one.bea.com (because there it says to contact support). So, after my desperate pleas and 15 calendar days of waiting, I got an e-mail with an FTP link to download the Service Station 2.0.

After installing this little gem, my life got a lot easier. Now instead of testing by launching a Plumtree job to kick off the crawler (and then watching it crash and burn), I could use the Service Station to synchronously invoke each method on the crawler and log the results.

Another handy testing tool is the PocketSOAP TCPTrace utility. (It’s also very handy for writing Plumtree portlets.) You can set it up between the Service Station and your CWS and watch the SOAP calls go back and forth in clear text. Very nice.

6. Code for thread safety

So, as the documentation says (and as I completely ignored), crawlers are multithreaded. The portal will launch several threads against your code and, unless you code for thread safety, these threads will proceed to eat your lunch.

Coding for threadsafety means not only that you need to synchronize access to any class-level variables, but also that you must use only threadsafe objects (e.g. in Java, use ArrayList instead of Vector).

7. Write DRY code (or else)

Even though you’re probably writing your CWS in Java or .NET, stick to the ol’ Ruby on Rails adage:

Don’t Repeat Yourself.

Say for example, that you need to build a big ol’ Map of all the Document objects in order to retrieve a document and send its metadata back to the mothership (Plumtree). It’s really important that you don’t build that map every time IDocumentProvider.attachToDocument is called. If you do, your crawler is going to run for a very very very long time. Crawlers don’t have to be super fast, but they shouldn’t be dogshit slow either.

As a better choice, build the Map the first time attachToDocument is called and store it as a class-level variable. Then, with each successive call to attachToDocument, check for the existence of the Map and, if it’s already built, don’t build it again. And don’t forget to synchronize not only the building of the Map, but also the access to the variable that checks whether the Map exists or not. Like I said, this isn’t a walk in the park. (See Tip #1.)

8. Don’t confuse ChildDocument with Document

IContainer has a getChildDocuments method. This, contrary to how it looks on the surface, does not return IDocument objects. Instead, it expects you to return an array of ChildDocument objects. These, I repeat, are not IDocument objects. Instead, they’re like little containers of information about child documents that Plumtree uses so that it knows when to call IDocumentProvider.attachToDocument. It is that call (attachToDocument) and not getChildDocuments that actually returns an IDocument object, where all the heavy lifting of document retrieval actually gets done.

You may not understand this tip right now, but if drop by and read it again after you’ve tried to code against the API for a few hours, and it should make more sense.

9. Use the named methods on DocumentMetaData

This one really burned me badly. I saw that DocumentMetaData had a “put” method. So, naturally, I called it several times to fill it up with the default metadata. Then I wasted the next two hours trying to figure out why Plumtree kept saying that I was missing obvious default metadata properties (like FileName). The solution? Call the methods that actually have names like setFileName, setIndexingURL, etc. — don’t use put for standard metadata. Instead, only use it for custom metadata.

10. RTFM (again)

I can’t stress the importance of reading the documentation enough.

If you think you understand what to do, read it again anyway. I guarantee that you’ll set yourself up for success if you really read and thoroughly digest the documentation before you lift a finger and start writing your test cases (which of course you’re going to write before you write your code, right?).

* * *

As always, if you get stuck, feel free to reach out to the Plumtree experts at bdg. We’re here to help. But don’t be surprised if the first thing we do is try to dissuade you from writing a crawler.

Have a safe and happy Halloween and don’t let Plumtree Crawlers scare you more than they should.

Boo!