Thu Jan 24, 2013, 08:51 PM

A Free Database of the Entire Web May Spawn the Next Google

Common Crawl supplies a database of over five billion Web pages in the hope that it will inspire new research or online services.

http://www.technologyreview.com/news/509931/a-free-database-of-the-entire-web-may-spawn-the-next-google/

Google famously started out as little more than a more efficient algorithm for ranking Web pages. But the company also built its success on crawling the Web—using software that visits every page in order to build up a vast index of online content.

A nonprofit called Common Crawl is now using its own Web crawler and making a giant copy of the Web that it makes accessible to anyone. The organization offers up over five billion Web pages, available for free so that researchers and entrepreneurs can try things otherwise possible only for those with access to resources on the scale of Google’s.

“The Web represents, as far as I know, the largest accumulation of knowledge, and there’s so much you can build on top,” says entrepreneur Gilad Elbaz, who founded Common Crawl. “But simply doing the huge amount of work that’s necessary to get at all that information is a large blocker; few organizations … have had the resources to do that.”

New search engines are just one of the things that can be built using an index of the Web, says Elbaz, who points out that Google’s translation software was trained using online text available in multiple languages. “The only way they could do that was by starting with a massive crawl. That’s put them on the way to build the Star Trek translator,” he says. “Having an open, shared corpus of human knowledge is simply a way of democratizing access to information that’s fundamental to innovation.”

Replies to this discussion thread (1 reply)
Original post: A Free Database of the Entire Web May Spawn the Next Google (Bill USA, Jan 2013)
Reply #1: hootinholler, Jan 2013

Response to Bill USA (Original post)

Fri Jan 25, 2013, 07:25 PM

1. Thanks for posting this. n/t
