Open Access News

News from the open access movement


Friday, May 06, 2005

Statistical aids for identifying texts

Ryan Singel, Judging a Book by Its Contents, Wired News, May 5, 2005. On Amazon's cool tools for helping to identify books that fit your interests. Excerpt: 'Name that famous book from just these phrases: "pagan harpooneers," "stricken whale," "ivory leg." Or how about this one: "old sport." Yes, it's Herman Melville's Moby Dick and F. Scott Fitzgerald's The Great Gatsby, respectively, but the words aren't just a game. They are Statistically Improbable Phrases, the result of a new Amazon.com feature that compares the text of hundreds of thousands of books to reveal an author's signature constructions. The haiku-like SIPs are not the only word toys on the site. Customers can also see the 100 most common words in a book....While such services seem to have little value and have generated scant publicity, except from bibliophilic thrill seekers, web watchers say the madcap stats aren't just for kicks....Bill Carr, Amazon's executive vice president of digital media, confirms that this is a serious attempt to sell more books. "We've been spending a lot of time thinking, 'We have this rich digital content, how can we pull info out and expose it to customers that makes discovery even better?'" Carr said. "What you are seeing here are the fruits of a lot of experimenting and brainstorming....One of the cool things is getting people to discover books that are not only related, but that they would have a hard time finding anywhere else."...Benjamin Vershbow, a researcher at the Institute for the Future of the Book, sees Amazon's SIPs as an automated version of tagging, a concept that fuels sites like del.icio.us, a bookmark-sharing site, and photo-sharing site Flickr. Both rely heavily on users attaching descriptive names to websites or photos so others can discover them.
Vershbow found, however, that Amazon's SIPs work much better for nonfiction than for novels....Vershbow sees Amazon's data mining as part of a trend on the web where sites are learning to weave data sources together to create a new web experience. Amazon's Carr agrees. "We are pioneers here ... in that we have this amazing corpus -- no one else has a corpus of this magnitude -- and are finding exciting ways to leverage that content to make a better discovery process for customers." '

(PS: Open-access texts lend themselves to these kinds of fingerprinting, tagging, mining, and discovery tools. And the tools don't have to be provided by the publisher or host. They can be developed by third-party service providers. OA texts support layer upon layer of useful services -- whatever we are clever enough to conceive and good enough to let loose.)
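
For readers curious how a third party might build this kind of phrase fingerprinting on top of open-access texts, here is a minimal sketch in Python. It is not Amazon's method, which the article does not describe. It assumes one simple approach: count two-word phrases in a book, count them in a background corpus, and rank the book's phrases by how much more frequent they are in the book. The function names (`bigrams`, `improbable_phrases`) and the add-one smoothing are illustrative choices, not anything from the article.

```python
from collections import Counter
import re


def bigrams(text):
    """Lowercase the text, split it into words, and return adjacent word pairs."""
    words = re.findall(r"[a-z']+", text.lower())
    return list(zip(words, words[1:]))


def improbable_phrases(book, background, top_n=3):
    """Rank a book's two-word phrases by how over-represented they are
    relative to a background corpus (hypothetical scoring, for illustration)."""
    book_counts = Counter(bigrams(book))
    bg_counts = Counter(bigrams(background))
    book_total = sum(book_counts.values())
    bg_total = sum(bg_counts.values())
    vocab = len(set(book_counts) | set(bg_counts))

    def score(phrase):
        # Add-one smoothing: phrases missing from the background
        # get a small nonzero probability instead of dividing by zero.
        p_book = (book_counts[phrase] + 1) / (book_total + vocab)
        p_bg = (bg_counts[phrase] + 1) / (bg_total + vocab)
        return p_book / p_bg

    # Highest ratio first; ties are broken alphabetically for stable output.
    ranked = sorted(book_counts, key=lambda p: (-score(p), p))
    return [" ".join(p) for p in ranked[:top_n]]


if __name__ == "__main__":
    book = "the stricken whale and the ivory leg and the stricken whale"
    background = "the old man and the sea and the old boat and the sea"
    print(improbable_phrases(book, background))
```

With a real background corpus of many open-access texts, everyday phrases such as "and the" sink to the bottom of the ranking, while phrases peculiar to one book rise to the top. A production service would need longer phrases, a much larger corpus, and a more careful statistical test than this ratio.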