A Search Engine Architecture Based on Collection Selection
Google Tech Talks
December 19, 2007

ABSTRACT

We present a distributed architecture for a Web search engine, based on the concept of collection selection. We introduce a novel approach to partitioning the collection of documents, able to greatly improve the effectiveness of standard collection selection techniques (CORI), and a new selection function outperforming the state of the art. Our technique is based on the novel query-vector (QV) document model, built from the analysis of query logs, and on our strategy of co-clustering queries and documents at the same time.

By suitably partitioning the documents in the collection, our system is able to select the subset of servers containing the most relevant documents for each query. Instead of broadcasting the query to every server in the computing platform, only the most relevant servers are polled, which reduces the average computing cost of answering a query. We introduce a novel strategy that uses the instant load at each server to drive the query routing. We also describe a new approach to caching, able to incrementally improve the quality of the stored results. Our caching strategy is effective both in reducing computing load and in improving result quality.

The proposed architecture, overall, presents a trade-off between computing cost and result quality, and we show how to guarantee very precise results in the face of a dramatic reduction in computing load. This means that, with the same computing infrastructure, our system can serve more users, more queries and more documents.

Speaker: Diego Puppin
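To make the idea in the abstract concrete, here is a minimal Python sketch of collection selection with load-aware routing. It is not the speaker's method: a plain term-count score stands in for CORI and for the talk's selection function, and the names `select_servers`, `stats`, `load`, `k` and `max_load` are all hypothetical, chosen only for illustration.

```python
def select_servers(query, stats, load, k=2, max_load=0.8):
    """Pick the servers to poll for a query instead of broadcasting it.

    stats: server name -> {term: count} summary of that server's partition
           (an assumed, simplified stand-in for real collection statistics).
    load:  server name -> current load in [0, 1] (assumed to be available).
    """
    terms = query.lower().split()
    # Score each server by how often the query terms occur in its partition.
    scores = {s: sum(c.get(t, 0) for t in terms) for s, c in stats.items()}
    # Rank servers with a non-zero score, best first (name breaks ties).
    ranked = sorted((s for s in scores if scores[s] > 0),
                    key=lambda s: (-scores[s], s))
    chosen = []
    for rank, server in enumerate(ranked[:k]):
        # Always poll the best server; poll the others only if not overloaded.
        if rank == 0 or load.get(server, 0.0) < max_load:
            chosen.append(server)
    return chosen


stats = {
    "s1": {"jaguar": 10, "car": 7},
    "s2": {"jaguar": 3, "cat": 9},
    "s3": {"python": 12},
}
load = {"s1": 0.5, "s2": 0.9, "s3": 0.1}
print(select_servers("jaguar car", stats, load))
```

In this toy setup the query reaches only the servers likely to hold relevant documents, and a busy second-choice server is skipped, which is the cost versus quality trade-off the abstract describes.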
Channel: Science & Technology
Uploaded: January 4, 2008 at 10:12 am
Author: googletechtalks
Length: 33:01
Rating: 4.60
Views: 4543
Tags: education engedu google googletechtalks talk talks techtalk techtalks
Video Comments
vicaya (January 7, 2008 at 10:59 am)
Sorry, this strategy doesn't work well with long-tail and personalized search load. The indexing cost (I'd consider cluster selection an indexing phase) is much higher as well. For aggregate performance, a much simpler caching strategy (multiple document partitions, e.g. for different types or languages, plus a pre-computed or trained distributed query cache) can be built that matches or outperforms this complicated solution.
wildchildplasma (January 5, 2008 at 7:43 am)
The cruising capabilities of active data clouds, you mean? One day it'll know the kind of stuff I want and I won't even have to make entries all the time (standard unified ratings data). I'll also be able to talk to a bot which will adapt its data personality to know me better.