Open Access News

News from the open access movement


Friday, February 20, 2009

A tool for crowdsourcing digitization of public domain books

Patti Lane, reCaptcha: How to turn blather into books, Christian Science Monitor, February 19, 2009.

When you buy a concert ticket on Ticketmaster, post something for sale on Craigslist, or poke an old friend on Facebook, you may not know it, but you’re helping to put millions of books online in a vast free library.

To access these websites, you must decipher two squiggly words to prove that you’re not a computer program designed to spam the site. Once it knows you’re human, the website lets you continue.

Those two decoded words don’t disappear, however. In fact, your brain has deciphered words that had baffled the scanning software used for an enormous project to digitize every public domain book in the world. ...

Some 200 million of these words, dubbed “Captchas” for Completely Automated Public Turing test to tell Computers and Humans Apart, are typed every day by people around the world. ...

In 2007, [Luis von Ahn] came up with reCaptchas. Now, instead of frittering away their time typing random characters, Internet users spell actual words plucked from old books that computers have trouble reading.

The Open Content Alliance, a nonprofit group based in San Francisco, has enlisted about 150 libraries and research centers to digitize as many printed works as it legally can and post them online for anyone in the world to read. ...

The scanned texts are sent to a server in California, where they’re run through optical character recognition software.

But computer programs are only 80 percent accurate in older books. They stumble over blurry lines, places where the ink has bled together over time, and less uniform fonts.

Carnegie Mellon computers send the indecipherable words to more than 100,000 websites that use them in the reCaptcha security checks. Any website or blogger can sign up for the free service. ...
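The excerpt does not spell out how a typed answer is checked. As reCaptcha has been publicly described, each challenge pairs a word the scanning software could not read with a control word whose text is already known; a user who types the control word correctly is treated as human, and that user's reading of the other word is counted as a vote. The Python sketch below illustrates the idea only. The function names and the three-vote threshold are assumptions made for this example, not the service's actual code or settings.

```python
from collections import Counter


def check_challenge(control_answer, typed_control, typed_unknown, votes):
    """Return True if the user typed the known control word correctly.

    When the user passes, their reading of the unknown word is recorded
    as one vote in `votes` (a Counter keyed by normalized spelling).
    """
    passed = typed_control.strip().lower() == control_answer.strip().lower()
    if passed:
        votes[typed_unknown.strip().lower()] += 1
    return passed


def resolved_word(votes, min_votes=3):
    """Return the leading reading once it has at least `min_votes` votes.

    `min_votes` is a hypothetical threshold chosen for illustration.
    """
    if not votes:
        return None
    word, count = votes.most_common(1)[0]
    return word if count >= min_votes else None


# Example: three users pass the control word and agree on the unknown word;
# one fails the control word, so that reading is never counted.
votes = Counter()
check_challenge("morning", "Morning", "upon", votes)
check_challenge("morning", "mourning", "spam", votes)
check_challenge("morning", "morning", "upon", votes)
check_challenge("morning", "morning", "Upon", votes)
print(resolved_word(votes))
```

Counting votes only from users who pass the control word is what lets a security check double as a transcription step: answers from failed attempts never enter the tally.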

The system’s accuracy rate of 99.1 percent is about the same as that of professional human transcribers.

Web users now provide about 3,000 man-hours a day of free labor in 10-second bursts of human computation, correcting more than 10 million words every day. ReCaptchas have solved 5 billion words in less than two years. ...