Trip Report WWW2004: nutch
nutch
- is a young open source project
- is web search application software
- it's not a business
- it's not a search site
- it's not a research project
- aims to increase the transparency of web search
technical goals:
- scale to the entire web
- index billions of pages
- complete a crawl in a week
- handle 1000s of queries per second
scalability:
- fetches: 100 pages / sec / CPU
- database update: 100 million pages @ 100 pages / sec / CPU
- search: 2-20 million pages with 1 to 40 searches / sec / CPU
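
(my comment: a quick back-of-the-envelope check of what these rates imply. The per-CPU rates and the 100 million page database come from the talk. The web size of 1 billion pages, the target of 1000 queries per second and the 7-day crawl window are my own readings of "billions", "1000s" and "a week", so treat the results as rough lower bounds, not as figures from the speaker.)

```python
import math

SECONDS_PER_DAY = 24 * 60 * 60

# Figures quoted in the notes above.
FETCH_RATE = 100            # pages / sec / CPU
DB_UPDATE_RATE = 100        # pages / sec / CPU
DB_PAGES = 100_000_000      # pages in one database update
SEARCHES_PER_CPU = (1, 40)  # quoted range of searches / sec / CPU

# Assumptions for illustration only (not from the talk).
WEB_PAGES = 1_000_000_000   # "billions of pages": take 1 billion as a lower bound
CRAWL_DAYS = 7              # "a week"
TARGET_QPS = 1000           # "1000s of queries per second": take 1000


def cpus_needed(total_work, rate_per_cpu, seconds):
    """CPUs required to finish total_work items within the given time."""
    return math.ceil(total_work / (rate_per_cpu * seconds))


# Time for one CPU to update a 100 million page database.
db_update_days = DB_PAGES / DB_UPDATE_RATE / SECONDS_PER_DAY

# CPUs needed to fetch the assumed web size within the crawl window.
fetch_cpus = cpus_needed(WEB_PAGES, FETCH_RATE, CRAWL_DAYS * SECONDS_PER_DAY)

# CPUs needed for query throughput alone, per index slice. This ignores
# how many 2-20 million page slices are needed to cover the whole web.
search_cpus_best = math.ceil(TARGET_QPS / SEARCHES_PER_CPU[1])
search_cpus_worst = math.ceil(TARGET_QPS / SEARCHES_PER_CPU[0])

print(f"database update on one CPU: {db_update_days:.1f} days")
print(f"CPUs to fetch 1e9 pages in a week: {fetch_cpus}")
print(f"CPUs for {TARGET_QPS} queries/sec: {search_cpus_best} to {search_cpus_worst}")
```

Under these assumptions a single CPU needs about 11.6 days for the database update, so the one-week goal only works if fetching and updating are spread over multiple CPUs.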
intranets:
- fetch, database and search can run all on one box
- a complete crawl finishes within hours (my comment: I'm not sure about that)
- cleaner content (my comment: I'm not sure about that either)
- lessons learned:
  - not great for link analysis
  - good for anchor text analysis
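
(my comment: to make "anchor text analysis" concrete, here is a minimal sketch of the general idea: collect the link text that other pages use when pointing at a page and index it under the target URL. This is not nutch's code. The class `AnchorCollector`, the function `anchor_index` and the `intranet.example` pages are hypothetical names made up for illustration, and it uses only the Python standard library.)

```python
from collections import defaultdict
from html.parser import HTMLParser
from urllib.parse import urljoin


class AnchorCollector(HTMLParser):
    """Collects (target URL, anchor text) pairs from one HTML page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.pairs = []
        self._href = None   # target of the <a> currently open, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page's own URL.
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse runs of whitespace in the anchor text.
            text = " ".join("".join(self._text).split())
            if text:
                self.pairs.append((self._href, text))
            self._href = None


def anchor_index(pages):
    """Map each link target to the anchor texts that point at it."""
    index = defaultdict(list)
    for url, html in pages.items():
        collector = AnchorCollector(url)
        collector.feed(html)
        collector.close()
        for target, text in collector.pairs:
            index[target].append(text)
    return dict(index)


# Hypothetical two-page intranet, both pages linking to the same target.
pages = {
    "http://intranet.example/a.html": '<p><a href="/hr.html">vacation policy</a></p>',
    "http://intranet.example/b.html": '<a href="hr.html">HR  handbook</a>',
}
index = anchor_index(pages)
print(index)
```

Here the target page ends up described by the words "vacation policy" and "HR handbook" even if it never uses them itself, which is presumably why anchor text stays useful on an intranet where there are too few links for link analysis.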
see http://www.nutch.org/ for details.