Oracle Announces Roadmap for Plumtree / AquaLogic / WebCenter

UPDATE 2: I’ve incorporated all the great feedback and comments from ex-Plumtreevians, ex-BEA and ex- and current Oracle folks.

UPDATE: A bunch of Plumtreevians are contributing really good comments on this post over on Facebook.

I worked at Plumtree Software, Inc. from June 1998 to December 9, 2002. In four-and-a-half years, the company grew from 25 employees to over 400 and it had thousands of happy customers before it was purchased by BEA Systems in 2005 for $220M. Here at bdg, we’ve been supporting dozens of Plumtree/AquaLogic Interaction (ALI)/WebCenter Interaction (WCI) customers since we opened our doors in December of 2002.

Back around 2005, BEA’s BID (Business Interaction Division) still had a lot of really smart engineers from Plumtree working on a lot of really interesting things, including Pages (think CMS 2.0), Pathways (kind of an enterprise version of del.icio.us) and Ensemble (the portlet engine/gateway, minus the overhead and UI of the portal itself).

They were also working on an enterprise social network, kind of a Facebook for business if you will.

However, there was a lot of wrangling at BEA, primarily between BID/AquaLogic and BEA’s flagship product, WebLogic (the world-class application server). Most of the strife came in the form of WebLogic Portal vs. AquaLogic/Plumtree Portal nonsense. Senior management at BEA, in their infinite wisdom, had taken a “let’s try not to alienate any customers” policy and in the process they confused all their customers and alienated/frustrated quite a few of them as well. They renamed Plumtree to AquaLogic User Interaction (ALUI), put in place a “separate but equal” policy with WebLogic Portal (WLP) and spewed some nonsense about how WLP was for “transactional portal deployments” vs. ALI for .NET and non-transactional portals, but no one, including BEA management, had any idea WTF that meant. To further confuse the issue, the WLP team, which also had a lot of really smart engineers, built products like “Adrenaline” (which was basically a less-functional and more buggy version of Ensemble) rather than do the unthinkable and integrate Ensemble into WLP so that WLP could finally host non-Java/JSR-168 portlets.

I was really pissed about BEA’s spineless portal strategy, their “separate but equal” policy between WLP and BID/ALUI and their waste of precious engineering resources in an arms race between WLP and ALUI rather than just stepping back, growing a spine, and coming up with a portal strategy.

Because I can’t keep my pie hole shut, I started several loud, messy and public fights with BEA management. Why? Because the real loser here is the customer.

And BEA, because management got mired in politics and chose to waste engineers’ time on in-fighting and competition instead of building enterprise Facebook, which Steve Hamrick and I arguably already wrote in our spare time. All they needed to do was product-ize that and they would have owned that market.

In 2008, Oracle inherited this clusterfuck of a portal strategy when they bought BEA for $7B+, giving me new hope that cooler heads would prevail and fix this mess. The first thing they did was fire all the impotent BEA managers who were afraid to make any decisions. It took Oracle a while, but they have finally arrived at a portal strategy that makes sense. I first learned about this strategy when I crashed the WebCenter Customer Advisory Board last Thursday.

First of all, let me say this: under the leadership of Vince Casarez, current (and future) customers are in good hands.

I realized when he said “everyone still calls it Plumtree” that this was going to be a bullshit-free presentation.

He also said something regarding the “portal stew” at Oracle that puts all of my ranting and raving in perspective: “Oracle did not buy BEA for Plumtree or WLP, just like it didn’t buy SUN for SUN’s portal product.” To rephrase that, Oracle bought BEA for WebLogic (the application server, not the portal) and Sun for their hardware (not for Java, NetBeans and all the rest of Sun’s baggage).

So, let’s face it, portals are a relatively insignificant part of Oracle.

However, they’ve finally done what I called for in 2008 and what BEA never had the wits to do: pick a single portal strategy/stack and stick to it. So, if you’re a current Plumtree/ALUI/WCI or a current WLP customer, you have a future with Oracle.

Here’s the plan, as I understand it.

All roads lead to Web Center (not Web Center Interaction, but Web Center)

At the heart of Web Center will be WebLogic’s app server and portal. Plumtree/ALUI as a code base will be supported, but eventually put into maintenance mode and retired. You get nine or twelve years of support and patches (blah blah blah) but if you want new features, you need to switch to the new Web Center, powered by WLP. CORRECTION: WebCenter will not be “powered by WLP.” At its core will be the Oracle-developed, ADF-based WebCenter Portal running on WebLogic Server.

All the “server products” (Collaboration, Studio, Analytics, Publisher) will be replaced by Web Center Services or Web Center Suite

Publisher will be subsumed by WCM/UCM (Web Content Management / Universal Content Management, formerly Stellent). The other products will be more-or-less covered by similar offerings in Suite or Services.

What about Pages, Ensemble and Pathways?

Pages is dead as WCM/UCM does it better. Pathways is getting rolled into the new Web Center somehow, but I’m not sure how yet. Perhaps I can follow up with another blog post on that. Ensemble has been renamed “Pagelet Producer” — more on that below. CORRECTION: Pathways is now called “Activity Graph” and it will be part of the new WebCenter. Think of an enterprise-class version of the Facebook News Feed crossed with Sales Force chatter and you’ll be on the right track.

What about .NET/SQL Server, IIS and everything else that isn’t Java?

This is a really interesting question and the key question that I think drove a lot of BEA’s failure to make any decision about portal strategy from 2005-2008. Plumtree had a lot of .NET customers and some of the biggest remaining Plumtree/ALUI customers are still running on an all-Microsoft stack. In fact, one of them told me recently that they have half a million named user accounts, two million documents and 72 Windows NT Servers to power their portal deployment.

So, let’s start with the bad news: Oracle doesn’t want you to run .NET/Windows and they REALLY don’t want you to run on SQL Server.

(That will change when Oracle acquires Microsoft, but that’s not gonna happen, at least not any time soon.) WebLogic app server and WLP/WCI, to the best of my knowledge, will not run on SQL Server. They will, however, run on Windows, but I would not recommend that approach.

It’s inevitable that large enterprises will have both .NET and Java systems along with a smattering of other platforms.

So, if you’re a .NET-heavy shop, you’ll need to bite the bullet and have at least one server running JRockit or Sun’s JVM, one of Oracle’s DBs (Oracle proper or MySQL), WLS/WLP/WCI and preferably Oracle Enterprise Linux, Solaris or some other flavor of Un*x. CORRECTION: WLP will run on SQL Server. Not sure about the new WebCenter Portal, but my guess is that it does not.

Now, for the good news: the new WCI, powered by WLP and in conjunction with the Pagelet Producer (formerly Ensemble) and the WSRP Producer (formerly the .NET Application Accelerator) will run any and all of your existing portlets, regardless of language or platform.

This was arguably the best feature in Plumtree and it will live on at Oracle.

.NET/WSRP and even MOSS (SharePoint) Web Parts will run in WebCenter through the WSRP Producer. The Pagelet Producer will run portlets written in ANY language through what is essentially a next-generation, backwards-compatible CSP (Content Server Protocol, the superset of HTTP that allows you to get/set preferences, etc. in Plumtree portlets). So, in theory, if you’re still writing your portlets in ASP 1.0 using CSP 1.0 and GSServices.dll, they will run in the new Web Center via the Pagelet Producer. Time for us to update the PHP and Ruby/Rails IDKs? Indeed it is. Let me know if you need that sooner rather than later.

How do I upgrade to the new WebCenter?

Well, first off, you have to wait for it to come out later this fall. Then, you have to start planning for what’s less of an upgrade and more of a migration. Oracle, between engineering and PSO, has promised to provide migration for all the portal metadata (users, communities, pages, portlets, security, etc.) from Plumtree/ALUI/WCI to the new Web Center. (Wouldn’t it have made sense for some of those WLP engineers to start building that migration script in 2005 instead of trying to compete with ALUI by building Adrenaline? Absolutely.) All your Java portlets, if you’re using JSR-168 or JSR-286, will run natively through a wrapper in WebCenter Portal. Everything else will either run in the WSRP Producer (if it’s .NET) or in the Pagelet Producer (if it’s anything else). The only thing I don’t fully understand yet is how to migrate from Publisher to UCM, but I’m due to speak with Oracle’s PSO about that soon. Please contact me directly if you need to do a migration from Publisher to WCM/UCM that’s too big to do by hand.

The only other unanswered question in my mind is how the new WebCenter will handle AWS/PWS services — the integrations that bring LDAP/AD users and profile information/metadata into Plumtree/ALUI/WCI. I wrote a lot of that code for Plumtree anyway, so if Oracle’s not working on a solution for the new Web Center, perhaps I can help you with that somehow as well. CORRECTION: User and group objects are fully externalized in Web Center, so there is no need for AWS/PWS synchronization. (Thanks, Vince, for pointing that out.)

So, that’s my understanding of the new portal strategy at Oracle.

Kudos to Oracle’s management for listening to their customers, making some really hard decisions and picking a path that I think is smart and achievable.

I’m here to help if you have questions or need help with your portal strategy or technical implementation/migration.

Q&A

(Some other notes about discussions that have spawned from this original post.)

Q: What’s the future of the Microsoft Exchange portlets (Mail, Calendar and Contacts) and the CWS for crawling Exchange public folders? Retired and replaced with something Beehive-related? Still supported? For how long? Against what versions of Exchange?

A: We’ve got updated portlets for Mail & Calendar in WebCenter now for Exchange 2003 & 2007. We don’t have a Contacts portlet but it could be added quickly if we see a large demand. Crawling public folders can be done with an adapter we have for SES [Oracle Secure Enterprise Search] already. We’re working but aren’t done with a new version of KD on top of the new infrastructure that will come out post PS3. (Contributed by Vince Casarez.)

Q: If migration scripts are provided to move WCI metadata into WebCenter, I understand that a portlet is a portlet, but what about pages and communities, users and groups, content sources and crawlers, etc.? Do they all have analogous objects in WebCenter or is there some reasonable mapping to some other objects?

A: Pages and Communities follow a model where we extract/export the meta data and data, then run it through a set of scripts that create a WebCenter Space for each Collab project/community and a JSPx page for every page. Users and Groups will come out of the LDAP/AD directory they are already using and the scripts associate the right permissions to each of the migrated objects. I don’t recall what we did about crawlers but since we use SES directly, all the hundred or more connectors we ship for SES are now available for direct usage. The scripts go through a multiphase approach to move content, then portlets, then pages, then communities so that dependencies can be fixed up versus trying to do a manual fix up. (Contributed by Vince Casarez.)

Q: Will any existing WCI-related products that are slated for retirement (e.g. Publisher, Collab, Studio, Analytics, etc.) be re-released with support for Windows Vista, Windows 7, IE 8, IE 9 or Chrome?

A: For Publisher, we are planning a set of migrations to quickly move them to UCM. For Collab & Studio, we have new capabilities in WebCenter Spaces to match these functions. For Analytics, we’ve also rebuilt it on top of the WebCenter stack with over 50 portlets for the different metrics and made sure we provide apis/ access to the data directly. These analytics data also feeds the activity graph in providing recommendations for people on the content and UIs that are relevant to them. These are tied into the personalization engine that we brought over from the WLP side. So there is a rich blending of the best features from WLP with WCI key features. As for Neo [the codename for the next release of WCI], we are certifying the additional platforms. On the IE 8 front, we’ve just released patches for WCI 10gR3 customers to be able to use IE8 without upgrading to Neo. (Contributed by Vince Casarez.)

Upcoming Oracle Web Center Interaction Training

Just wanted to let you know that I (formerly Plumtree’s Lead Engineer; I’ve worked with hundreds of different Plumtree, BEA and Oracle customers and am now an Oracle ACE Director) am leading a public training course over the next two weeks and, if you’re interested, there are a few slots left.

We’re partnering with training provider Peak Solutions and you can find the full details on their web site. Here’s the critical information:

THIS Monday, May 3rd and Tuesday, May 4th in Harrisburg, PA
Oracle WebCenter Interaction Administration – $1,200
Deployment planning, installation, configuration, maintenance, troubleshooting, etc.

NEXT Monday, May 10th, 11th and 12th in Harrisburg, PA
Oracle WebCenter Interaction Portlet Development in Java and .NET (also Ruby and PHP) – $1,800
Hello world all the way through advanced portlet dev concepts like setting preferences and using caching

Please drop us a note if you’d like to attend. There are only 4-5 slots left, so please act now to reserve your space!

Crowd Campaign, The Social Contest Builder

(I-Newswire) November 16, 2009 – FOR IMMEDIATE RELEASE

New York, New York – November 16th, 2009 – At Web 2.0 Expo today, Social Collective, Inc., the company behind the hosted social networking and scheduling software that powered SXSW and Oracle OpenWorld, announced the launch of their newest product, Crowd Campaign. Using Crowd Campaign, Twitter users can easily launch cost-effective branded online contests.

With the power of Twitter and the social web, people who enter these contests, vote on entries, or make comments help propagate their viral spread via easy-to-use sharing tools that facilitate posting contest updates to a multitude of different social networking sites.

When setting up a contest, Crowd Campaign offers the ability to customize the contest’s subdomain, upload a logo and background image, write a Terms & Conditions page, set colors and styles, insert Google Analytics tracking code and change all of the contest rules and other marketing copy. Contest winners can be decided by popular vote or by an “expert panel” or some combination of both. Contests can include text entries, photo or video submissions and links to other content such as blog posts or web sites, meaning that Crowd Campaign contests can be used for any type of competition. Crowd Campaign also offers a rich set of entry management tools for removing offensive content, merging duplicate entries and tallying entries, votes and page views. Crowd Campaign offers a free version for contests containing no more than 10 entries and/or 100 votes. To increase these limits, contest managers can pay as little as $95 up to $4,995 for a one-year unlimited-use license.

Crowd Campaign is used by ad agencies, event managers, social media marketers, small business owners, popular authors, independent film makers and musicians – anyone who wants to leverage the power of the social web to build a brand. Any type of contest can be set up in 10-15 minutes, limited only by one’s imagination and federal, state and local laws.

During the private beta period that ended with today’s launch, hundreds of Crowd Campaign sites were created, including these prominent contests:

http://w2e.crowdcampaign.com — Ask Beth Noveck, Deputy CTO of the US Government a Question: Tim O’Reilly will pick from among the top questions and ask Noveck the best one during their Web 2.0 Expo Keynote later this week. The winner will also receive an autographed copy of Noveck’s book, Wiki Government.

http://w2e.crowdcampaign.com — Search for the Best LinkedIn Profile: Mike O’Neil and Lori Ruff will feature the top five vote-getters in O’Neil’s new book, Rock the World with Your Online Presence.

http://expoexpo.crowdcampaign.com — Ask Guy Kawasaki, Chris Brogan, Rick Calvert, Ann Hamilton and David Rich a Question — the panel moderator will ask the panel the top question at Expo! Expo! in Atlanta in early December. The winner will also receive an autographed copy of Kawasaki’s book, Reality Check.

http://techcocktail.crowdcampaign.com — Enter a DC-area startup company and Frank Gruber and Eric Olson will give the top vote-getter a free Bronze Sponsorship at the next TECH Cocktail DC event (a $999 value).

http://digitalmarketingmixer.crowdcampaign.com — Suggest an idea for MarketingProfs’ Digital Marketing Mixer: the winner of a random drawing will receive a free registration to MarketingProfs B2B Forum 2010 in Boston, MA (a $695 value).

Mike O’Neil, who launched a Crowd Campaign contest in support of his book, Rock the World with Your Online Presence, said:

“With just weeks from start to finish, we embarked on a partnership with the folks at Crowd Campaign to see how we could find and refine applications for [online contest] technology in social media.”

O’Neil launched a very successful contest centered around finding top LinkedIn profile pages and featuring the winners in his book, which is “something that money can’t buy.” O’Neil remarked on how powerful it was for CrowdCampaign to “create a following around the pop culture image we represent with social media, networking and music.”

About Crowd Campaign

Crowd Campaign is a social tool for building brands. It’s used by ad agencies, event managers, social media marketers – anyone who wants to leverage the power of the social web to build a brand. Easy to customize and manage, it lets you launch a Twitter-powered contest, including a branded promotional site, in minutes from your desktop. It’s FREE to get started and there’s no sign-up or credit card required. Start a contest today by visiting http://crowdcampaign.com

For more information, contact Clinton Bonner at Social Collective, Inc., [email protected], 860-608-9074

Procrastinators Rejoice: OpenWorld’s Social Schedule Builder is Live!

If you’re attending Oracle OpenWorld and you’ve waited this long to build your event schedule, boy do we have some good news for you!

You can use your existing Oracle Mix credentials to access a brand new social schedule builder powered by our flagship product, The Social Collective.

Of course we did plan to launch this a bit earlier, but due to many factors outside of our control, we ended up launching it the day before the conference. Well, better late than never! Now you can build your schedule in a social context, seeing who in your Mix network is attending each event so that you can make more informed decisions about what content you want to see at OpenWorld.

Your Mix login and password will get you in and there’s no need to re-add anyone to your network — all the work you’ve spent building up your Mix profile and social network will carry right over into our product. In fact, unless you look carefully at the URL, it’s hard to tell that you’re even leaving Oracle Mix.

Many thanks to the peerless (and tireless) Oracle Mix team: Marius, Tim, Hassan and Phil and to Chirag, Eric and other dedicated folks who run the Oracle Single Sign On partner application registration team (among other things) for all their support, help, patience and dedication. We couldn’t have done it without you guys.

Now, if you’re one of the nearly 100,000 people attending Oracle OpenWorld, what are you waiting for? Come on in and get social with your conference schedule . . . you have 24 hours, so get moving!

My Oracle OpenWorld Sessions

I’m going to be speaking in two different Oracle OpenWorld sessions on Sunday. They are OOW-S312303 — Enterprise-Enable Dynamic PHP, Ruby, Python Apps: Oracle WebCenter Interaction and OOW-S312304 — Enterprise Ruby on Rails: Rolling with JRuby on Oracle WebLogic Suite.

Top Ten Tips for Writing Plumtree Crawlers that Actually Work

Just in time for Halloween, I’ve decided to publish my Top Ten Tips for Writing Plumtree Crawlers that Actually Work. This post may scare you a little bit, but hey, that’s the spirit of Halloween, right?

[Editor’s note: yes, we’re still calling it Plumtree. Why? I did a Google search today and 771,000 hits came up for “plumtree” as opposed to around 300,000 for “aqualogic” and just over 400,000 for “webcenter.” Ignoring the obvious — that a short, simple name always wins over a technically convoluted one — it just helps clarify what we’re talking about. For example, if we say “WebCenter,” no one knows whether we’re talking about Oracle’s drag-n-drop environment for creating JSR-168 portlets (WebCenter Suite) or Plumtree’s Foundation/Portal (WebCenter Interaction). So, frankly, you can call it whatever you want, but we’re still gonna call it Plumtree so that people will know WTF we’re talking about.]

So, you want to write a Plumtree Crawler Web Service (CWS), eh?

Here are ten tips that I learned the hard way (i.e. by NOT doing them):

1. Don’t actually build a crawler
2. If you must, at least RTFM
3. “Hierarchyze” your content
4. Test first
5. When testing, use the Service Station 2.0 (if you can get it)
6. Code for thread safety
7. Write DRY code (or else)
8. Don’t confuse ChildDocument with Document
9. Use the named methods on DocumentMetaData
10. RTFM (again)

Before I get into the gory details, let me give you some background. First off, what’s a CWS anyway? It’s the code behind what Oracle now calls Content Services, which spider through various types of content (for lack of a better term) and import pointers to those bits of content into the Knowledge Directory. This ability to spider content and normalize its metadata is one of the most underrated features in Plumtree. (FYI, it was also the first feature we built and arguably, the best.)

Each bit of spidered content is called a Document or a Card or a Link depending on whether you’re looking at the product, the API or the documentation, respectively. It’s important to realize that CWSs don’t actually move content into Plumtree; rather, they store only pointers/links and metadata and they help the Plumtree search engine (known under the covers as Ripfire Ignite) build its index of searchable fulltext and properties.

Today, Plumtree ships with one OOTB CWS that knows how to crawl/spider web pages. Not surprisingly, it’s known as the Web Crawler. Don’t let the name mislead you: the web crawler can actually crawl almost anything, as I explain in my first tip, which is:

Don’t actually build a crawler.

But I’m getting ahead of myself.

So, back to the background on crawlers. Oracle ships five of ’em, AFAIK: one for Windows files, one for Lotus Notes databases, one for Exchange Public Folders, one for Documentum and one for Sharepoint. Their names give you blatantly obvious hints at what they do, so I won’t get into it. Along with the OOTB crawlers, Oracle also exposes a nice, clean API for writing crawlers in Java or .NET. (If you really want to push the envelope, you can try writing a crawler in PHP, Python, Ruby, Perl, C++ or whatever, but it’s hard enough to write one in Java or .NET, so I wouldn’t go there. If you do, though, make sure that your language has a really good SOAP stack.)

So, after reading this, you still want to write a crawler, yes?

Let’s get into my Top Ten Tips:

1. Don’t actually build a crawler

Yes, you really don’t want to go here. Building crawlers is not that hard, as there’s a clean, well-documented API. However, getting them to work is a whole other story.

Most applications these days have a web UI. So, take advantage of it. Point the OOTB web crawler at the web UI and see what it does. Some web UIs will work well, others won’t (particularly if they use frames or lots of JavaScript).

Let’s assume for a moment that this technique doesn’t work. Or perhaps you’re dealing with some awful client-server or greenscreen POS that doesn’t have a web UI. Either way, you may still be able to use the web crawler.

How? Well, the web crawler can crawl almost anything using something that we used to call the Woogle Web. Think of it this way. Say you want to crawl a database. Perhaps that database contains bug reports. Rather than waste your time trying to write a database crawler, just write an index.jsp (or .php, .aspx, .html.erb, .you-name-it) page that does something like select id from bugs and dumps out a list of all the bugs in the database. Then, hyperlink each one to a detail page (that’s essentially a select * from bugs where id = ? query). Your index page can be sinfully ugly. However, put some effort into your detail pages, making them look pretty AND using meta tags to map each field in the database to its value.

Then, simply point the OOTB web crawler at your index page, tell it not to import your index page, map your properties to the appropriate meta tags, crawl at a depth of one level, and get yourself a nice cup of coffee. By the time you get back, the OOTB web crawler will have created documents/links/cards for every bug with links to your pretty detail pages and every bit of text will be full-text searchable. So will every property, meaning that you can run an advanced search on bug id, component, assignee, severity, etc.
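Here’s a minimal sketch of what one of those “pretty” detail pages might look like when generated from a row in the bugs table. Note that the column names (bug_id, component, assignee, severity) and the class name are invented for illustration — in real life this logic would live inside your index.jsp/.aspx detail page, not a standalone class.

```java
// Sketch of a "Woogle Web" detail-page generator. The schema (bug_id,
// component, assignee, severity) is hypothetical; the point is that each
// database column becomes a <meta> tag the OOTB web crawler can map to
// a portal property.
public class WoogleDetailPage {
    public static String render(String id, String component,
                                String assignee, String severity,
                                String description) {
        StringBuilder html = new StringBuilder();
        html.append("<html><head>\n");
        html.append("<title>Bug ").append(id).append("</title>\n");
        // Each meta tag maps one database column to a crawlable property
        html.append(meta("bug_id", id));
        html.append(meta("component", component));
        html.append(meta("assignee", assignee));
        html.append(meta("severity", severity));
        html.append("</head><body>\n");
        html.append("<h1>Bug ").append(id).append("</h1>\n");
        html.append("<p>").append(description).append("</p>\n");
        html.append("</body></html>");
        return html.toString();
    }

    private static String meta(String name, String content) {
        return "<meta name=\"" + name + "\" content=\"" + content + "\">\n";
    }
}
```

Once pages like this exist, the portal’s advanced search on bug id, component, assignee, etc. comes along for free: the crawler reads the meta tags, you map them to properties, and you’re done.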

At Plumtree, we used to call this a Woogle Web. It may sound ridiculous, but

a Woogle Web is a great way to crawl virtually anything without lifting a finger.

However, a Woogle Web won’t work for everything. If you’re dealing with a product where you can’t even get your head around the database schema AND you have a published Java, .NET or Web Service API, then you might want to think about writing a custom crawler.

2. If you must, at least RTFM

If you’re anything like me, reading the manual is what you do after you’ve tried everything else and nothing has worked out. In the case of Plumtree crawlers, I recommend swallowing your pride (at least momentarily) and reading all of the documentation, including their “tips” (which are totally different from and not nearly as entertaining as my tips, but equally valuable).

Once you’re done reading all the documentation, you might also want to consult Tip #10.

3. “Hierarchyze” your content

Um, yeah, I know “hierarchyze” isn’t a word. But since crawlers only know how to crawl hierarchies, if your data aren’t hierarchical, you darn well better figure out how to represent them hierarchically before you start writing code. Even if you don’t think this step is necessary, just do it because I said so for now. You’ll thank me later.

4. Test first

Don’t even try to write your crawler and then run it and expect it to work. Ha! Instead, write unit tests for every modular bit of code that you throw down. To the extent that it’s possible, write these tests first. It’ll save your butt later.

5. When testing, use the Service Station 2.0 (if you can find it)

When you do finally get around to integration testing your crawler, it’ll save you a lot of time if you use the Service Station 2.0. However, it may take you a long time to get it, so start the process early.

Unlike every product Oracle distributes, Service Station is 1) free and 2) not available for download. Yes, you read that correctly.

To get it, you need to contact support. I called them and told them I needed it and after two weeks I got nothing but a confused voicemail back saying that support doesn’t fulfill product orders. Um, yeah. So I called back and literally begged to talk to someone who actually knew what this thing was. Then I painstakingly explained why I couldn’t get it from edelivery (because it’s not there) nor from commerce.bea.com (because it’s permanently redirecting to oracle.com) nor from one.bea.com (because there it says to contact support). So, after my desperate pleas and 15 calendar days of waiting, I got an e-mail with an FTP link to download the Service Station 2.0.

After installing this little gem, my life got a lot easier. Now instead of testing by launching a Plumtree job to kick off the crawler (and then watching it crash and burn), I could use the Service Station to synchronously invoke each method on the crawler and log the results.

Another handy testing tool is the PocketSOAP TCPTrace utility. (It’s also very handy for writing Plumtree portlets.) You can set it up between the Service Station and your CWS and watch the SOAP calls go back and forth in clear text. Very nice.

6. Code for thread safety

So, as the documentation says (and as I completely ignored), crawlers are multithreaded. The portal will launch several threads against your code and, unless you code for thread safety, these threads will proceed to eat your lunch.

Coding for thread safety means not only that you need to synchronize access to any class-level variables, but also that you must use only threadsafe objects (e.g. in Java, use Vector or Collections.synchronizedList instead of a plain ArrayList).
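To make that concrete, here’s a generic Java sketch (nothing Plumtree-specific) of the kind of concurrent access the portal’s crawler threads generate. A plain HashMap here can lose updates or corrupt itself under contention; a ConcurrentHashMap with an atomic merge is safe without any explicit locking:

```java
import java.util.*;
import java.util.concurrent.*;

// Simulates many portal threads updating shared crawler state at once.
// ConcurrentHashMap.merge() makes each update atomic, so no counts are lost.
public class ThreadSafetyDemo {
    public static Map<String, Integer> crawlConcurrently(List<String> docs)
            throws InterruptedException {
        Map<String, Integer> counts = new ConcurrentHashMap<>();
        ExecutorService pool = Executors.newFixedThreadPool(8);
        for (String doc : docs) {
            pool.submit(() -> counts.merge(doc, 1, Integer::sum)); // atomic
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
        return counts;
    }
}
```

Swap the ConcurrentHashMap for a bare HashMap and this same test will fail intermittently — which is exactly the kind of bug that only shows up when the portal launches a real multithreaded crawl against your code.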

7. Write DRY code (or else)

Even though you’re probably writing your CWS in Java or .NET, stick to the ol’ Ruby on Rails adage:

Don’t Repeat Yourself.

Say, for example, that you need to build a big ol’ Map of all the Document objects in order to retrieve a document and send its metadata back to the mothership (Plumtree). It’s really important that you don’t build that map every time IDocumentProvider.attachToDocument is called. If you do, your crawler is going to run for a very very very long time. Crawlers don’t have to be super fast, but they shouldn’t be dogshit slow either.

As a better choice, build the Map the first time attachToDocument is called and store it as a class-level variable. Then, with each successive call to attachToDocument, check for the existence of the Map and, if it’s already built, don’t build it again. And don’t forget to synchronize not only the building of the Map, but also the access to the variable that checks whether the Map exists or not. Like I said, this isn’t a walk in the park. (See Tip #1.)
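A sketch of that build-once pattern in plain Java (the class and the stubbed-out buildDocumentMap query are illustrative, not the real Plumtree API):

```java
import java.util.*;

// Lazily build the document Map exactly once, no matter how many crawler
// threads make attachToDocument-style lookups concurrently. The synchronized
// block guards both the null check and the build, per the tip above.
public class DocumentCache {
    private Map<String, String> documents;   // built on first use
    private final Object lock = new Object();
    int buildCount = 0;                      // exposed so tests can verify

    public String attachToDocument(String id) {
        synchronized (lock) {                // one thread builds; others wait
            if (documents == null) {
                documents = buildDocumentMap();
                buildCount++;
            }
        }
        return documents.get(id);
    }

    // Stand-in for the expensive "load everything from the backend" call.
    private Map<String, String> buildDocumentMap() {
        Map<String, String> m = new HashMap<>();
        m.put("1", "First bug report");
        m.put("2", "Second bug report");
        return m;
    }
}
```

The key detail is that the existence check and the build happen inside the same synchronized block — check outside the lock and two threads can both see null and both pay for the build.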

8. Don’t confuse ChildDocument with Document

IContainer has a getChildDocuments method. This, contrary to how it looks on the surface, does not return IDocument objects. Instead, it expects you to return an array of ChildDocument objects. These, I repeat, are not IDocument objects. Instead, they’re like little containers of information about child documents that Plumtree uses so that it knows when to call IDocumentProvider.attachToDocument. It is that call (attachToDocument) and not getChildDocuments that actually returns an IDocument object, where all the heavy lifting of document retrieval actually gets done.

You may not understand this tip right now, but drop by and read it again after you’ve tried to code against the API for a few hours, and it should make more sense.
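The split between the two calls looks roughly like this toy version (the class names echo the Plumtree API, but these types are invented for illustration — the real IContainer/IDocument interfaces carry far more):

```java
import java.util.*;

// Toy model of the container/provider split: getChildDocuments() returns
// lightweight descriptors only; the full Document is materialized later,
// one at a time, by attachToDocument(), where the heavy lifting belongs.
public class CrawlerShapes {
    // Lightweight pointer: just enough for the portal to schedule a fetch.
    public static class ChildDocument {
        public final String location;
        public ChildDocument(String location) { this.location = location; }
    }

    // Heavyweight object carrying the actual content/metadata.
    public static class Document {
        public final String location;
        public final String content;
        public Document(String location, String content) {
            this.location = location;
            this.content = content;
        }
    }

    // Cheap: enumerate pointers; do NOT fetch content here.
    public static List<ChildDocument> getChildDocuments(List<String> locations) {
        List<ChildDocument> children = new ArrayList<>();
        for (String loc : locations) children.add(new ChildDocument(loc));
        return children;
    }

    // Expensive: called once per child, returns the real document.
    public static Document attachToDocument(ChildDocument child) {
        return new Document(child.location, "fetched:" + child.location);
    }
}
```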

9. Use the named methods on DocumentMetaData

This one really burned me badly. I saw that DocumentMetaData had a “put” method. So, naturally, I called it several times to fill it up with the default metadata. Then I wasted the next two hours trying to figure out why Plumtree kept saying that I was missing obvious default metadata properties (like FileName). The solution? Call the methods that actually have names like setFileName, setIndexingURL, etc. — don’t use put for standard metadata. Instead, only use it for custom metadata.
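Here’s a toy stand-in that mimics the trap (this is NOT the real DocumentMetaData class — it just reproduces the behavior that burned me):

```java
import java.util.*;

// Toy stand-in for DocumentMetaData: put() only records *custom* properties,
// while standard fields like the file name live behind dedicated setters
// that the portal actually reads.
public class MetaDataDemo {
    private final Map<String, String> custom = new HashMap<>();
    private String fileName;   // standard field, only set via setFileName

    public void put(String key, String value) { custom.put(key, value); }
    public void setFileName(String name) { this.fileName = name; }
    public String getFileName() { return fileName; }
    public String getCustom(String key) { return custom.get(key); }
}
```

Calling put("FileName", …) leaves the real file-name field empty, which is exactly why the portal kept complaining that my default metadata was missing.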

10. RTFM (again)

I can’t stress the importance of reading the documentation enough.

If you think you understand what to do, read it again anyway. I guarantee that you’ll set yourself up for success if you really read and thoroughly digest the documentation before you lift a finger and start writing your test cases (which of course you’re going to write before you write your code, right?).

* * *

As always, if you get stuck, feel free to reach out to the Plumtree experts at bdg. We’re here to help. But don’t be surprised if the first thing we do is try to dissuade you from writing a crawler.

Have a safe and happy Halloween and don’t let Plumtree Crawlers scare you more than they should.

Boo!

bdg welcomes Michael Buckbee!

We’re very pleased to announce that industry veteran, entrepreneur and Rails developer Michael Buckbee joined the bdg team today as CTO and Lead Developer on The Social Collective!

Mike’s career path is surprisingly similar to my own — in multiple ways. Fresh out of college, he joined a health industry startup called Aristar in 1999. They were acquired by SoftMed, which was then acquired by 3M HIS. (Recall that I worked at Plumtree, which was bought by BEA, which was bought by Oracle.) Unlike yours truly, who left Plumtree in 2002, Mike stayed on at 3M. As someone who just doesn’t know what it means to get bored, he also started several side projects. The most successful of these was Fabjectory, which could best be described as a 3D printshop that allows people to take their avatars (or other objects from the virtual world) offline and reincarnate them as real life figurines.

Again, our paths resemble one another, in an almost uncanny way. As I was toiling away on Feedhaus, Mike was building FeedMail, which essentially tried to allow people to read and respond to email from inside a feed reader. Like Feedhaus, the idea never really took off. However, some of Mike’s other projects — like Fabjectory — generated an amazing amount of buzz, including articles in the New York Times and WIRED. Prior to starting Fabjectory, Mike had yet another side project called Second411 which allowed people to search for virtual items both in-world and on the web. Second411 was purchased by ESC in October of 2006.

Never one to settle for just a few side-projects, Mike also worked on FoxyMelody, Watchlister, OneToFive, FeedSpeaker and an open source HTTP queue that runs on Google’s Application Engine. Here at bdg, we’ve long been in the business of throwing lots of spaghetti at the fridge and seeing what sticks, so obviously Mike will fit right in. Visit Mike’s blog and project page for more about his amazing career.

As Lead Developer on The Social Collective, Mike is already busy getting the site prepped for SXSW 2009, which will launch early in Q1.

Please join me and the rest of the bdg team in extending a warm welcome to Mike!