Open Access News

News from the open access movement


Thursday, May 07, 2009

More on automating metadata annotation for repository deposits

Christine Urquhart, Deposit Plait Project: Final Report, February 23, 2009.  (Thanks to Charles Bailey.)  Excerpt:

The aim of the Deposit Plait project was to examine the potential for easing the deposit of journal articles into institutional repositories by making use of any metadata embedded within the document properties of the document being deposited....

The first stage of the project was to see how easy it is to extract this metadata. The target file formats that the project worked with were the Open Document Format (as created by OpenOffice), OpenXML (as created by Microsoft Office 2007), and .doc files (as created by versions of Microsoft Office from 97 to 2003). There are standard open source software libraries that can extract both standard and custom metadata fields from each of these file formats.
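As an illustration of this first stage (not taken from the report), here is a minimal Python sketch that reads the standard document properties from an OpenXML (.docx) file using only the standard library. It relies on the fact that a .docx file is a ZIP package whose core properties live in docProps/core.xml. The function names, the choice of fields, and the sample data are assumptions made for the example; the project itself may well have used different libraries.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespaces used by the OpenXML core-properties part (docProps/core.xml).
NS = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Fields chosen for illustration; the core-properties part can hold others.
FIELDS = {
    "title": "dc:title",
    "creator": "dc:creator",
    "subject": "dc:subject",
    "keywords": "cp:keywords",
}


def extract_core_properties(docx_file):
    """Return the non-empty core properties of a .docx path or file-like object."""
    with zipfile.ZipFile(docx_file) as archive:
        root = ET.fromstring(archive.read("docProps/core.xml"))
    found = {}
    for name, path in FIELDS.items():
        element = root.find(path, NS)
        text = (element.text or "").strip() if element is not None else ""
        if text:
            found[name] = text
    return found


def make_sample_docx():
    """Build an in-memory package holding only docProps/core.xml (made-up data)."""
    core = (
        '<?xml version="1.0" encoding="UTF-8"?>'
        '<cp:coreProperties xmlns:cp="%s" xmlns:dc="%s">'
        "<dc:title>Example Article</dc:title>"
        "<dc:creator>A. Author</dc:creator>"
        "<dc:subject></dc:subject>"
        "</cp:coreProperties>"
    ) % (NS["cp"], NS["dc"])
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("docProps/core.xml", core)
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    print(extract_core_properties(make_sample_docx()))
```

Fields the author never filled in are simply absent from the result, which mirrors the report's caveat that the metadata must actually exist in the document. An Open Document Format file could be handled along similar lines, since it is also a ZIP package with its metadata in meta.xml.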

The second stage of the project was to see how easy it is to use extracted metadata as search terms in order to search for a more complete metadata record. Where the item being deposited into the repository has been in existence for some time (a ‘retrospective deposit’), the metadata found can be used to perform a search....

The project concluded by creating an online demonstration system. In contrast to a normal repository deposit where the user enters metadata, and then uploads a file, this system requires the user to first upload a file. The metadata is extracted, and the user is allowed to choose which (one or more) of the fields to use as the basis of a search. The search is then initiated and matching records returned. The user can then pick and choose fields from the results to ‘plait’ together their final metadata record.
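The final ‘plait’ step described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the demonstration system's code: the function name plait_record, the record layout (plain dictionaries), and the sample values are all hypothetical.

```python
def plait_record(extracted, candidates, choices):
    """Assemble a final metadata record from user selections.

    extracted  -- dict of fields pulled from the uploaded document
    candidates -- list of dicts, one per matching record returned by the search
    choices    -- dict mapping a field name to the index of the candidate to
                  take it from, or None to keep the value from the document
    """
    record = {}
    for field, source in choices.items():
        pool = extracted if source is None else candidates[source]
        if field in pool:  # silently skip fields the chosen source lacks
            record[field] = pool[field]
    return record


if __name__ == "__main__":
    # Made-up data standing in for the extraction and search results.
    extracted = {"title": "Example Article", "creator": "A. Author"}
    candidates = [
        {"title": "Example article", "journal": "Journal A", "year": "2008"},
        {"title": "Example Article", "journal": "Journal B", "volume": "12"},
    ]
    choices = {"title": None, "creator": None, "journal": 0, "volume": 1}
    print(plait_record(extracted, candidates, choices))
```

Keeping the user's selections as an explicit mapping makes it easy to see which source each field of the final record came from.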

The end-to-end concept works well, subject to the following issues:

  • Metadata must exist within the deposited document. It is not common practice for authors to make use of these fields at present.
  • The item must have been published a reasonable amount of time earlier for the metadata record to have made its way into online metadata stores.
  • Licensing issues may restrict both the searching of online metadata stores and the re-use of the metadata found.