Open Access News

News from the open access movement


Friday, October 27, 2006

Another full-text cross-archive search engine

Les Carr at Southampton University has created a ROAR Search Engine, which searches the 748 OA repositories registered at ROAR.   Like the OpenDOAR search engine, launched yesterday, the new ROAR engine is built from Google Custom Search.  Here are Les' comments on the new ROAR engine from a posting this morning to the AmSci OA Forum:

[The OpenDOAR search engine] is a very interesting service!

There was a discussion on this list at the beginning of August about "Search Engines for Repositories Only". There were several attempts to define constrained searches using RollYO or similar, but they all suffered from one defect or another (too few sites, or logins required etc). The Google Custom Search that OpenDOAR have set up seems much more suitable to the repository community needs. Further, it would seem to be fairly simple to set up Country-specific searches (a la UKOLN's EPrints UK) by providing location-identifying annotations for each repository.

I have had a go with this, and created a ROAR-based Repository Search Engine [here].  You can search all the ROAR repositories for a keyword and then Derek Law can click on 'Scottish Research' to reduce the set of results to those coming from the Scottish repositories (the "small and smart" ones, according to his recent keynote at Open Scholarship :-)

There is a serious point that this opens up: why would we bother with OAI-based repositories, if you can do it all with Google? The advantage that OAI provided us was "metadata", i.e. the possibility of providing more accurate resource identification. The advantage of repositories was that they provided an identifiable source of (well-maintained) research material. Of course, the one can be simulated by the other, and if Google could support a simple quality-control "refereed material" tag then we could get by without OAI and without repositories.

Well, it doesn't, and so OAI still seems our best hope. However, even with five years of OAI our repositories are not doing a very good job of sharing metadata that helps a service to comprehend the status of the holdings that it harvests (Is this a published, refereed journal article or equivalent? Is this a paper from an unrefereed workshop? Is this a chemical data file?). Too much is still down to interpretation and subsequent data mining of the web pages. The Eprints Application Profile seems to be doing a good job in achieving consensus in the use of Dublin Core, but there is an urgent need for it to be implemented by all repositories!

We've spent a lot of time and effort on advocacy and policies over the last couple of years, but I think it's time that we went back to some of the technical fundamentals and made sure that our information interoperability is up to scratch, otherwise we'll find ourselves in a universe where the only thing you can do is a keyword search!

Comments.

  1. The ROAR search engine is as welcome as the OpenDOAR engine and for the same reasons.  Kudos to Bill Hubbard (at OpenDOAR) and Les Carr (at ROAR) for getting these off the ground.  I'd still like to see ROAR and OpenDOAR merge, rather than take the valuable time of valued OA activists to build duplicate services, but this doesn't detract in the slightest from the utility of their latest features.
  2. As for Les' reflections on the continuing utility of OAI, see my May 2004 article, The case for OAI in the age of Google.
  3. If I were revising that article today, I'd add that Google (and Google Scholar and Google Custom Search) could neutralize some of the remaining advantages of OAI if it would (1) label peer-reviewed articles as peer-reviewed and (2) label OA articles as OA.  It could make strides toward the first if it used, instead of discarding, the metadata it found in OA repositories.  To make strides toward the second it would have to produce an OA-detecting algorithm that could distinguish an abstract from a full-text article.  Authors could help by using machine-readable CC licenses, since the Google advanced search page already has a "usage rights" filter to limit results to CC-licensed content.
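
To make the metadata point concrete, here is a minimal sketch of how a harvesting service might read a peer-review label and a CC license out of a single Dublin Core record exposed through OAI. It rests on several assumptions: the sample record is invented for illustration, the "PeerReviewed" value in dc:type is only one convention a repository might follow, and the function name describe is hypothetical. It is not a description of how Google, ROAR, or OpenDOAR actually work.

```python
import xml.etree.ElementTree as ET

# Hypothetical oai_dc record, invented for illustration only.
# Real repositories differ in which dc:type and dc:rights values they expose.
SAMPLE_RECORD = """\
<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
           xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>An example eprint</dc:title>
  <dc:type>Article</dc:type>
  <dc:type>PeerReviewed</dc:type>
  <dc:rights>http://creativecommons.org/licenses/by/2.5/</dc:rights>
</oai_dc:dc>
"""

# Namespace of the Dublin Core element set, in ElementTree's {uri}tag form.
DC = "{http://purl.org/dc/elements/1.1/}"


def describe(record_xml):
    """Summarise the status of one record from its Dublin Core fields."""
    root = ET.fromstring(record_xml)
    types = [el.text.strip() for el in root.iter(DC + "type") if el.text]
    rights = [el.text.strip() for el in root.iter(DC + "rights") if el.text]
    return {
        "title": root.findtext(DC + "title"),
        # Assumption: the repository marks refereed items with this dc:type value.
        "peer_reviewed": "PeerReviewed" in types,
        # Assumption: a CC license is signalled by a license URL in dc:rights.
        "cc_licensed": any("creativecommons.org/licenses/" in r for r in rights),
    }


if __name__ == "__main__":
    print(describe(SAMPLE_RECORD))
```

A record that omits these fields simply comes back with both flags set to False, which is the situation Les describes: without the metadata, a service is left to guess from the web pages.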