Open Access News

News from the open access movement


Friday, March 13, 2009

The difficulty of automated harvesting of publisher self-archiving policies

Preben Hansen, Gunnar Ericsson and Oscar Täckström, Steps towards automatic acquisition and recognition of IPR conditions for parallel publishing, Swedish Institute of Computer Science, March 6, 2009.  Excerpt:

Parallel publishing is a rather new term within the area of access to copyrighted content produced by researchers and is sometimes also called post-print and self-archiving....

We examined 31 different publishers...of different size and contents. The initial goal was to visit a publisher and download the copyright agreement for publishing a journal article.

The assumption was that this single document would contain all the conditions and that a tool then could be trained to extract those conditions. The point of departure was to use a set of publishers not yet registered by the Romeo/Sherpa database and not previously examined.

However, during the project, it was observed that not all the examined publishers had a copyright agreement (or similar) in an online and downloadable form. Furthermore, of those that had their copyright agreements available, it was also observed that not all publishers had IPR conditions for parallel publishing in their copyright agreement, and finally, some of the IPR conditions were found on other web pages, such as within sections for authors and author rights.

This situation made us move in a modified direction, in which we needed to make a more detailed examination of what actually was available and recognizable in order to be used for an automatic acquisition of IPR conditions....

Extracting IPR conditions only from copyright agreements proved to be a more complex task than expected, and the results do not satisfy the initial goals of this part of the project....

Comment.  It's a pity we still haven't cracked this nut.  I argued in 2004 that:

[All stakeholders, including publishers, would benefit greatly] if journals would post their policy details to a central database, or post them on their own web sites with standardized terminology or tags.  Detail-harvesting, searching, and comparison could then be automated.  But for now [2004] this is too much to ask.  At least journals should put their policies on their own sites in their own words and keep them up to date.

Update:

Also see Erik Sandewall, Demonstrating the Use of Author-Deposit Restrictions in Publication-Related Software Systems, a technical report from the Analysis and Development of Electronic Publishing Technologies (ADEPT) project, from Sweden's Royal Institute of Technology and Linköping University, March 8, 2009. 

Abstract:   The present memo documents a system demonstration for the sponsoring agency, by first describing the goals of the project and its major design decisions, and then describing the demo setup. The project concerns the management of IPR information that determines whether, when and how a given research article can legally be posted on a public website, in particular in an institutional repository or archive. The challenge is to make this information available in structured form so that it can be applied and used in the automatic operations of, for example, an institutional repository, as well as in the autonomous operation of software agents for providing assistance to their users. The demo uses a configuration of agents representing the different interested parties in an article's lifecycle, and shows how the IPR information can be represented, made available, and put to effective use in such a network of software agents.
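
To make "structured form" concrete, here is a minimal sketch of how a repository might encode and apply one publisher's terms. It is not the ADEPT project's format: the DepositPolicy record, its field names, and the version and location labels are all hypothetical, chosen only to mirror the "whether, when and how" questions in the abstract.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class DepositPolicy:
    """Hypothetical structured record of one publisher's self-archiving terms."""
    publisher: str
    allowed_versions: FrozenSet[str]   # "whether": e.g. {"preprint", "postprint"}
    embargo_days: int                  # "when": days after publication
    allowed_locations: FrozenSet[str]  # "how": e.g. {"institutional_repository"}


def earliest_deposit_date(policy: DepositPolicy, version: str,
                          location: str, published: date) -> Optional[date]:
    """Return the first date on which posting is permitted, or None if never."""
    if version not in policy.allowed_versions:
        return None
    if location not in policy.allowed_locations:
        return None
    return published + timedelta(days=policy.embargo_days)


def may_deposit(policy: DepositPolicy, version: str, location: str,
                published: date, today: date) -> bool:
    """True if the given version may be posted at the given location today."""
    earliest = earliest_deposit_date(policy, version, location, published)
    return earliest is not None and today >= earliest


# Illustrative values only; "Example Press" is not a real publisher record.
example_policy = DepositPolicy(
    publisher="Example Press",
    allowed_versions=frozenset({"postprint"}),
    embargo_days=180,
    allowed_locations=frozenset({"institutional_repository"}),
)
```

With a record like this, a repository could call may_deposit(example_policy, "postprint", "institutional_repository", published, today) before accepting an upload. Real publisher policies carry many more conditions than these three fields capture.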

Also see Sandewall's Support for Managing IPR and Parallel Publishing in the MADMAN Research Author Support System, also from the ADEPT project, January 17, 2009.