Saturday, August 30, 2008

KDD talk on the Future of Image Search

Jitendra Malik from UC Berkeley gave an enjoyable and often quite amusing invited talk at KDD 2008 on "The Future of Image Search" where he argued that "shape-based object recognition is the key."

The talk started with Jitendra saying image search does not work well. To back this claim, he showed screen shots of searches on Google Image Search and Flickr for [monkey] and highlighted the false positives.

Jitendra then claimed that neither better analysis of the text around images nor more tagging will solve this problem because these techniques are missing the semantic component of images, that is, the shapes, concepts, and categories represented by the objects in the image.

Arguing that we need to move "from pixels to perception", Jitendra pushed for "category recognition" for objects in images where "objects have parts" and form a "partonomy", a hierarchical grouping of object parts. He said current attempts at this have at most 100 parts in the partonomy, but it appears humans commonly use 2-3 orders of magnitude more, 30k+ parts, when doing object recognition.

A common theme running through the talk was having a baseline model of a part and then deforming that model to match most variations. For example, he showed some famous examples from D'Arcy Thompson's 1917 text "On Growth and Form", then talked about how to apply these transformations to find the closest matching generic part for objects in an image.
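The machinery Jitendra has actually worked with here (shape contexts, thin-plate spline warps) is far more sophisticated, but a toy Procrustes-style alignment gives a feel for the deform-and-match idea: find the translation, rotation, and uniform scale that best map a stored part template onto an observed shape, then score the match by what is left over. Everything below, including the part templates, is a hypothetical sketch, not his method.

```python
import math

def procrustes_align(template, target):
    """Best similarity transform (translation, rotation, uniform scale) mapping
    `template` onto `target`, both lists of corresponding (x, y) points.
    Returns the rotation angle, the scale, and the mean residual after
    alignment (lower residual = closer match)."""
    n = len(template)
    # Remove translation by centering both shapes on their centroids.
    cx_t = sum(x for x, _ in template) / n
    cy_t = sum(y for _, y in template) / n
    cx_g = sum(x for x, _ in target) / n
    cy_g = sum(y for _, y in target) / n
    A = [(x - cx_t, y - cy_t) for x, y in template]
    B = [(x - cx_g, y - cy_g) for x, y in target]
    # Closed-form optimal rotation and scale for 2D point sets in correspondence.
    num = sum(ax * by - ay * bx for (ax, ay), (bx, by) in zip(A, B))
    den = sum(ax * bx + ay * by for (ax, ay), (bx, by) in zip(A, B))
    theta = math.atan2(num, den)
    scale = math.hypot(num, den) / sum(ax * ax + ay * ay for ax, ay in A)
    # Residual: how far the deformed template still is from the target shape.
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    residual = sum(
        math.hypot(scale * (ax * cos_t - ay * sin_t) - bx,
                   scale * (ax * sin_t + ay * cos_t) - by)
        for (ax, ay), (bx, by) in zip(A, B))
    return theta, scale, residual / n

def closest_part(observed, part_templates):
    """Pick the generic part whose deformed template best explains the observed shape."""
    return min(part_templates,
               key=lambda name: procrustes_align(part_templates[name], observed)[2])
```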

Near the end of the talk, Jitendra contrasted the techniques used for face detection, which he called "a solved problem", with the techniques he thought would be necessary for general object and part recognition. He argued that the approach used for face detection, sliding windows of various sizes across the image and looking for regions that match a face pattern, would both be too computationally expensive and have too high a false positive rate when applied to 30k objects and parts. Jitendra said something new would have to be created to deal with large-scale object recognition.
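To make the cost argument concrete, here is a purely schematic sketch of sliding-window detection (the names and thresholds are made up, and no real detector works exactly like this); the work grows as positions x scales x categories, which is tolerable for a handful of categories but explodes at 30k.

```python
def sliding_window_detect(image_w, image_h, classifiers, window_sizes, stride=8):
    """Schematic sliding-window detection: score every window position and scale
    with every category classifier. `classifiers` maps a category name to a
    hypothetical scoring function over (x, y, w, h)."""
    detections, evaluations = [], 0
    for w, h in window_sizes:                        # each window scale
        for y in range(0, image_h - h + 1, stride):  # each vertical position
            for x in range(0, image_w - w + 1, stride):
                for category, score in classifiers.items():
                    evaluations += 1
                    if score(x, y, w, h) > 0.99:     # even a strict threshold leaks false positives
                        detections.append((category, x, y, w, h))
    return detections, evaluations

# Back-of-the-envelope: a 1024x768 image at stride 8 with 5 scales is roughly 60k
# windows per category. A handful of categories is fine; 30,000 object and part
# categories is on the order of 2 billion classifier evaluations per image.
```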

What that something new is remains unclear, but Jitendra seemed to me to be pushing for a biologically inspired system, perhaps one that quickly comes up with rough candidate interpretations of parts of the image, discounts interpretations that violate prior knowledge about which objects tend to occur near each other, and then repeats at additional levels of detail until a steady state is reached.

Doing object recognition quickly, reliably, and effectively to find meaning in images remains a big, hairy, unsolved research problem, and probably will be for some time, but, if Jitendra is correct, it is the only way to make significant progress in image search.

Friday, August 22, 2008

Column versus row stores

Daniel Abadi, Samuel Madden, and Nabil Hachem had a paper at SIGMOD 2008, "Column-Stores vs. Row-Stores: How Different Are They Really?" (PDF), with a fascinating discussion of what optimizations could be implemented in a traditional row store database to make it behave like a column store.

The paper attempts to answer the question of whether there "is something fundamental about the way column-oriented DBMSs are internally architected" that yields performance gains or whether the column-oriented optimizations can be mimicked in a row-store design.

Specifically, the authors look at taking a row-store and vertically partitioning the rows, indexing all the columns, creating materialized views of the data (precomputing just the columns needed for a query), compressing the data, and using a couple of variations on late materialization when joining columns, which significantly impacts performance.
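As a rough illustration of the first of those ideas, the sketch below (hypothetical schema, SQLite only for convenience) vertically partitions a row-store table into narrow (record id, value) tables and answers a two-column query by joining the narrow tables back together on the record id. It is meant to make the idea concrete, not to reproduce the paper's setup.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# A conventional row-store fact table (hypothetical schema and data).
cur.execute("CREATE TABLE sales (rid INTEGER PRIMARY KEY, product TEXT, region TEXT, revenue REAL)")
cur.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)",
                [(1, "widget", "east", 10.0), (2, "gadget", "west", 20.0), (3, "widget", "west", 5.0)])

# Vertical partitioning: one narrow (record id, value) table per column.
for col in ("product", "region", "revenue"):
    cur.execute(f"CREATE TABLE sales_{col} AS SELECT rid, {col} AS val FROM sales")

# A query touching only two "columns" reads only the two narrow tables and
# stitches the rows back together by joining on the record id.
cur.execute("""
    SELECT p.val AS product, SUM(r.val) AS total_revenue
    FROM sales_product p JOIN sales_revenue r ON p.rid = r.rid
    GROUP BY p.val
""")
print(cur.fetchall())  # e.g. [('gadget', 20.0), ('widget', 15.0)]
```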

Despite strong statements in the abstract and introduction that "it is not possible for a row-store to achieve some of the performance advantages of a column-store" and that "there is in fact something fundamental about the design of column-store systems that makes them better suited to data-warehousing workloads", the discussion and conclusion further into the paper are far murkier. For example, the conclusion states that it "is not that simulating a column-store in a row-store is impossible", just that attempting to fully simulate a column-store is difficult in "today's row store systems".

The murkiness seems to come from the difficulty the authors had forcing a row-store to do some of the query plan optimizations a column-store does once they had jiggered all the underlying data to simulate a column-store.

For example, when discussing "index-only plans" (which basically means throwing indexes on all the columns of a row-store), the row-store used slow hash-joins to combine indexed columns and they "couldn't find a way to force [the system] to defer these joins until later in the plan, which would have made the performance of this approach closer to vertical partitioning".

Later, the authors say that they couldn't "trick" the row-store into storing columns "in sorted order and then using a merge join", so it instead did "relatively expensive hash joins". They say the problems are not fundamental and that "there is no reason why a row-store cannot store tuple headers separately, use virtual record-ids to join data, and maintain heap files in guaranteed position order", just that they were not able to force the row-store to do these things in their simulation.

In the end, the paper only concludes that:
A successful column-oriented simulation [in a row-store] will require some important system improvements, such as virtual record-ids, reduced tuple overhead, fast merge joins of sorted data, run-length encoding across multiple tuples ... operating directly on compressed data ... and late materialization ... some of ... [which] have been implemented or proposed to be implemented in various row-stores.
That seems like a much weaker conclusion than what the introduction promised. The conclusion appears to be that a row-store can perform as well as a column-store on data warehouse workloads, just that it looks awkward and difficult to make all the optimizations necessary for it to do so.

Please see also two blog posts, "Debunking a Myth: Column-Stores vs. Indexes" and "Debunking Another Myth: Column-Stores vs. Vertical Partitioning", by the first author, Daniel Abadi, where he expands on some parts of the paper.

If you are interested in this topic, you might also enjoy two of my older posts, "MapReduce a step backwards?" and "C-Store and Google BigTable".

Going to KDD 2008

I'll be at KDD 2008 next week. If you make it there, please say hello!

Friday, August 15, 2008

Cassandra data store at Facebook

Avinash Lakshman, Prashant Malik, and Karthik Ranganathan presented a talk at SIGMOD 2008, "Cassandra: Structured Storage System over a P2P Network", that describes a Google Bigtable-like data store used actively by Facebook.

What strikes me as remarkable about the system is its rather casual treatment of writes. As far as I can tell, a write is only sent to memory on one box, not written to disk, not even written to multiple replicas in memory. That seems fine for log or click data, but, for the kind of data Facebook deals with, it seems a little surprising to not see a requirement for multiple replicas to get the write in memory before the app is told that the write succeeded.

The code for Cassandra is open source and there is a wiki that adds a few tidbits to the SIGMOD slides. Note that HBase is also open source and also is modeled after Google's Bigtable; HBase is layered on top of Hadoop and sponsored heavily by Yahoo.

Please see also James Hamilton, who posts a summary of the slides and brief commentary, and Dare Obasanjo, who offers more detailed commentary on Cassandra.

Please see also my post, "Highly available distributed hash storage from Amazon", about Amazon's Dynamo (PDF). There are similarities in the design and references in the wiki that suggest that Facebook's Cassandra was influenced by Amazon's Dynamo.

If you are interested in all things Bigtable, you might also enjoy Phil Bernstein's post, "Google Megastore", that summarizes a Google talk at SIGMOD 2008 on a storage system they built on top of Bigtable that adds transactions and additional indexes, among other things.

Update: Avinash Lakshman swings by in the comments and clarifies that Cassandra does have a commit log, so writes do go to disk on at least one machine immediately. A write first updates the commit log, then updates the memory tables, and, finally, in batch some time later, goes out to all the tables on disk.
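For the curious, here is a minimal sketch of that write path as described in the update: append to a commit log on disk, update an in-memory table, and flush to on-disk tables in batches later. The class and file names are made up, and this is only an illustration of the idea, not Cassandra's actual implementation.

```python
import json, os

class TinyStore:
    """Minimal sketch of the write path described above: commit log, then
    memtable, then periodic batch flushes to immutable on-disk tables."""

    def __init__(self, data_dir, flush_threshold=1000):
        os.makedirs(data_dir, exist_ok=True)
        self.log_path = os.path.join(data_dir, "commit.log")
        self.data_dir = data_dir
        self.memtable = {}
        self.flush_threshold = flush_threshold
        self.flushes = 0

    def write(self, key, value):
        # 1. Append to the commit log first, so the write survives a crash.
        with open(self.log_path, "a") as log:
            log.write(json.dumps({"k": key, "v": value}) + "\n")
            log.flush()
            os.fsync(log.fileno())
        # 2. Update the in-memory table; reads are served from here.
        self.memtable[key] = value
        # 3. Later, in batch, flush the memtable to a table on disk.
        if len(self.memtable) >= self.flush_threshold:
            self._flush()

    def _flush(self):
        path = os.path.join(self.data_dir, f"table-{self.flushes}.json")
        with open(path, "w") as f:
            json.dump(self.memtable, f)
        self.flushes += 1
        self.memtable.clear()
        open(self.log_path, "w").close()  # log can be truncated once data is flushed
```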

Y Combinator's list of startup ideas

Paul Graham at Y Combinator posts an interesting list of "Startup Ideas We'd Like to Fund".

My favorites are Fix Advertising, Fixing E-mail Overload, and New News.

Found via John Cook, who notes that many are working on most of these ideas, but "ideas are a dime a dozen [and] it all comes down to execution." Can't argue with that.

Please see also "What to advertise when there is no commercial intent?", "Attention and Life Hacking", and Findory.

Thursday, August 07, 2008

Clever method of near duplicate detection

Martin Theobald, Jonathan Siddharth, and Andreas Paepcke from Stanford University have a cute idea in their SIGIR 2008 paper, "SpotSigs: Robust and Efficient Near Duplicate Detection in Large Web Collections" (PDF). They focus near duplicate detection on the important parts of a web page by using the next few words after a stop word as a signature.

An extended excerpt:
The frequent presence of many diverse semantic units in individual Web pages makes near-duplicate detection particularly difficult. Frame elements for branding, and advertisements are often freely interspersed with other content elements. The branding elements tend to be deliberately replicated across the pages of individual sites, creating a "noise level" of similarity among pages of the same site.

SpotSigs provides a robust scheme for extracting characteristic signatures from Web documents, thus aiming to filter natural-language text passages out of noisy Web page components ... [It] is much more efficient, easier to implement, and less error-prone than ... layout analysis .... [and] we are able to show very competitive runtimes to the fastest known but more error-prone schemes like I-Match and Locality Sensitive Hashing (LSH).

The points (or "spots") in the page at which spot signatures are generated are all the locations where ... anchor words [occur]. We call the anchor words antecedents, which are typically chosen to be frequent within the corpus. The most obvious, largely domain-independent choices ... are stopwords like is, the, do, have, etc. ... which are distributed widely and uniformly within any snippet of natural-language text.

A spot signature ... consists of a chain of words that follow an antecedent word .... Spot signatures ... [focus on] the natural-language passages of Web documents and skip over advertisements, banners, and navigational components.
The paper gives an example of generating spot signatures where a piece of text like "at a rally to kick off a weeklong campaign" produces two spots: "a:rally:kick" and "a:weeklong:campaign".
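Here is a minimal sketch of that extraction, with the parameters picked only to reproduce the example (antecedent "a", chains of two non-stopword words, and a small stopword list); the paper explores several antecedent sets and chain and distance settings.

```python
STOPWORDS = {"a", "an", "at", "is", "it", "of", "off", "on", "the", "to"}

def spot_signatures(text, antecedents=("a",), chain_len=2):
    """Sketch of spot signature extraction: at every occurrence of an antecedent
    (anchor) word, chain together the next `chain_len` non-stopword words."""
    words = text.lower().split()
    signatures = []
    for i, word in enumerate(words):
        if word in antecedents:
            chain = [w for w in words[i + 1:] if w not in STOPWORDS][:chain_len]
            if len(chain) == chain_len:
                signatures.append(":".join([word] + chain))
    return signatures

print(spot_signatures("at a rally to kick off a weeklong campaign"))
# ['a:rally:kick', 'a:weeklong:campaign']
```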

What I particularly like about this paper is that they take a very hard problem and find a beautifully simple solution. Rather than taking on the brutal task of tearing apart the page layout to discard ads, navigation, and other goo, they just noted that the most important part of the page tends to be natural language text. By starting all their signatures at stopwords, they naturally focus the algorithm on the important parts of the page. Very cool.

If you can't get enough of this topic, please see also my previous post, "Detecting near duplicates in big data", that discusses a couple papers out of Google that look at this problem.

Update: Martin Theobald posted a detailed summary of the paper on the Stanford InfoBlog.

Wednesday, August 06, 2008

BrowseRank: Ranking pages by how people use them

Liu et al. from Microsoft Research Asia had the best student paper at SIGIR 2008, "BrowseRank: Letting Web Users Vote for Page Importance" (PDF), that builds a "user browsing graph" of web pages where "edges represent real transitions between web pages by users."

An excerpt:
The user browsing graph can more precisely represent the web surfer's random walk process, and thus is more useful for calculating page importance. The more visits of the page made by users and the longer time periods spent by the users on the page, the more likely the page is important ... We can leverage hundreds of millions of users' implicit voting on page importance.

Some websites like adobe.com are ranked very high by PageRank ... [because] Adobe.com has millions of inlinks for Acrobat Reader and Flash Player downloads. However, web users do not really visit such websites very frequently and they should not be regarded [as] more important than the websites on which users spend much more time (like myspace.com and facebook.com).

BrowseRank can successfully push many spam websites into the tail buckets, and the number of spam websites in the top buckets in BrowseRank is smaller than PageRank or TrustRank.

Experimental results show that BrowseRank indeed outperforms the baseline methods such as PageRank and TrustRank in several tasks.
One issue that came up in discussions afterward is whether BrowseRank gets something much different than a smoothed version of visit counts.

Let's say all the transition probabilities between web pages are set by how people actually move across the web, all the starting points on the web are set by where people actually start, and then you simulate random walks. If all this is done correctly, your random walkers should move like people do on the web and visit where they visit on the web. So, after all that simulating, it seems like what you get should be quite close to visit counts.

This is a bit of an oversimplification. BrowseRank uses time spent on a page in addition to visit counts and its Markov model of user behavior ends up smoothing the raw visit count data. Nevertheless, the paper does not compare BrowseRank to a simple ranking based on visit counts, so the question still appears to be open.
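As a toy illustration of that visit-count argument, the simulation below (hypothetical pages and probabilities) draws starting points and transitions from "observed" behavior and counts where the random walkers go; the resulting distribution comes out looking much like a smoothed version of the raw visit counts.

```python
import random
from collections import Counter

# Toy browsing graph: empirical start and transition probabilities, as if
# estimated from real user sessions (all pages and numbers are made up).
starts = {"home": 0.7, "news": 0.2, "adobe": 0.1}
transitions = {
    "home":   [("news", 0.6), ("social", 0.3), ("adobe", 0.1)],
    "news":   [("home", 0.5), ("social", 0.5)],
    "social": [("social", 0.7), ("home", 0.3)],
    "adobe":  [("home", 1.0)],
}

def simulate(num_walks=100_000, walk_len=10, seed=0):
    """Random walks whose starts and transitions mirror observed behavior."""
    rng = random.Random(seed)
    visits = Counter()
    pages = list(starts)
    for _ in range(num_walks):
        page = rng.choices(pages, weights=[starts[p] for p in pages])[0]
        for _ in range(walk_len):
            visits[page] += 1
            next_pages, weights = zip(*transitions[page])
            page = rng.choices(next_pages, weights=weights)[0]
    total = sum(visits.values())
    return {p: round(v / total, 3) for p, v in visits.items()}

print(simulate())  # visit fractions: social highest, then home, news, adobe
```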

Please see also my earlier post, "Google Toolbar data and the actual surfer model".

Please see also my earlier post, "Ranking using Indiana University's user traffic", which discusses a WSDM 2008 paper that looked at supplementing PageRank with traffic data.

Tuesday, August 05, 2008

Caching, index pruning, and the query stream

A SIGIR 2008 paper out of Yahoo Research, "ResIn: A Combination of Results Caching and Index Pruning for High-performance Web Search Engines" (ACM page), looks at how performance optimizations to a search engine can impact each other. In particular, it looks at how caching the results of search queries impacts the query load that needs to be served from the search index, and therefore changes the effectiveness of index pruning, which attempts to serve some queries out of a reduced index.

An extended excerpt:
Typically, search engines use caching to store previously computed query results, or to speed up the access to posting lists of popular terms.

[A] results cache is a fixed-capacity temporary storage of previously computed top-k query results. It returns the stored top-k results if a query is a hit or reports a miss otherwise.

Results caching is an attractive technique because there are efficient implementations, it enables good hit rates, and it can process queries in constant time. ... One important issue with caches of query results is that their hit rates are bounded ... [by] the large fraction of infrequent and singleton queries, even very large caches cannot achieve hit rates beyond 50-70%, independent of the cache size.

To overcome this limitation, a system can make use of posting list caching or/and employ a pruned version of the index, which is typically much smaller than the full index and therefore requires fewer resources to be implemented. Without affecting the quality of query results, such a static pruned index is capable of processing a certain fraction of queries thus further decreasing the query rate that reaches the main index.

[A] pruned index is a smaller version of the main index, which is stored on a separate set of servers. Such a static pruned index resolves a query and returns the response that is equivalent to what the full index would produce or reports a miss otherwise. We consider pruned index organizations that contain either shorter lists (document pruning), fewer lists (term pruning), or combine both techniques. Thus, the pruned index is typically much smaller than the main index.

The original stream of queries contains a large fraction of non-discriminative queries that usually consist of few frequent terms (e.g., navigational queries). Since frequent terms tend to be popular, those queries are likely to repeat in the query log and therefore are typical hits in the results cache. At the same time, these queries would be good candidates to be processed with the document pruned index due to their large result set size. Therefore, document pruning does not perform well anymore when the results cache is included in the architecture.

We observed that the results cache significantly changes the characteristics of the query stream: the queries that are misses in the results cache match approximately two orders of magnitude fewer documents on average than the original queries. However, the results cache has little effect on the distribution of query terms.

These observations have important implications for implementations of index pruning: the results cache only slightly affects the performance of term pruning, while document pruning becomes less effective, because it targets the same queries that are already handled by the results cache.
It is an excellent point that one optimization can impact another, so analyzing performance and configuration twiddles in isolation of the other optimizations is likely to give incorrect results. The results in this paper could be problematic for past work on index pruning since they indicate that the query streams used in their evaluations may not be representative of reality.
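For concreteness, here is a minimal sketch of the architecture the paper describes, with hypothetical class names: a results cache in front of a pruned index in front of the full index, where the pruned index either answers a query exactly as the full index would or reports a miss.

```python
class ResultsCache:
    """Fixed-capacity store of previously computed top-k results (hypothetical)."""
    def __init__(self, capacity=100_000):
        self.capacity, self.store = capacity, {}
    def get(self, query):
        return self.store.get(query)
    def put(self, query, results):
        if len(self.store) >= self.capacity:
            self.store.pop(next(iter(self.store)))  # crude eviction, enough for a sketch
        self.store[query] = results

class ResInSketch:
    """Results cache -> pruned index -> full index, per the paper's architecture.
    `pruned_index.try_answer` returns None unless it can guarantee the same
    answer as the full index; both index objects are hypothetical."""
    def __init__(self, cache, pruned_index, full_index):
        self.cache, self.pruned, self.full = cache, pruned_index, full_index
    def search(self, query, k=10):
        hit = self.cache.get(query)
        if hit is not None:              # frequent, repeated queries stop here
            return hit
        results = self.pruned.try_answer(query, k)
        if results is None:              # only the residual queries reach the main index
            results = self.full.answer(query, k)
        self.cache.put(query, results)
        return results
```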

Monday, August 04, 2008

To personalize or not to personalize

Jaime Teevan, Sue Dumais, and Dan Liebling had a paper at SIGIR 2008, "To Personalize or Not to Personalize: Modeling Queries with Variation in User Intent" (PDF), that looks at how and when different people want different things from the same search query.

An excerpt:
For some queries, everyone ... is looking for the same thing. For other queries, different people want very different results.

We characterize queries using features of the query, the results returned for the query, and people's interaction history with the query ... Using these features we build predictive models to identify queries that can benefit from personalized ranking.

We found that several click-based measures (click entropy and potential for personalization curves) reliably indicate when different people will find different results relevant for the same query .... We [also] found that features of the query string alone were able to help us predict variation in clicks.
Click entropy is a measure of variation in the clicks on the search results. Potential for personalization is a measure of how well any one ordering of results can match the ordering each individual searcher would most prefer. The query features that worked the best for predicting ambiguity of the query were query length, the number of query suggestions offered for the query, and whether the query contained a url fragment.
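Click entropy is easy to compute. Here is a minimal sketch under the standard definition, the entropy of the distribution of clicks over result URLs for a query, which may differ in details from the paper's exact formulation; the example queries and counts are made up.

```python
import math
from collections import Counter

def click_entropy(clicked_urls):
    """Entropy of the click distribution over result URLs for one query.
    Low entropy: most people click the same result. High entropy: clicks are
    spread out, so personalized ranking has more room to help."""
    counts = Counter(clicked_urls)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# A navigational query: nearly everyone clicks the same URL, entropy near 0.
print(click_entropy(["ebay.com"] * 98 + ["ebay.co.uk"] * 2))                       # ~0.14
# An ambiguous query: clicks spread over many results, entropy is high.
print(click_entropy(["uefa.com", "fifa.com", "espn.com", "wikipedia.org"] * 25))   # 2.0
```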

One tidbit I found interesting in the paper was that average click position alone was a strong predictor of ambiguity of the query and was well correlated with both click entropy and potential for personalization. That's convenient. Average click position seems like it should be much easier to calculate accurately when facing sparse data.

Please see also my older posts, "Effectiveness of personalized search", "Potential of web search personalization", and "Characterizing the value of personalized search".

Friday, August 01, 2008

Modeling how searchers look at search results

Georges Dupret and Benjamin Piwowarski from Yahoo Research had a great paper at SIGIR 2008, "A User Browsing Model to Predict Search Engine Click Data from Past Observations" (ACM page). It nicely extends earlier models of searcher behavior to allow for people skipping results after hitting a lot of uninteresting results.

An extended excerpt:
Click data seems the perfect source of information when deciding which documents (or ads) to show in answer to a query. It can be thought of as the results of users voting in favor of the documents they find interesting.

Nevertheless, [click data] cannot be used without further processing: A fundamental problem is the position bias. The probability of a document being clicked depends not only on its relevance, but on other factors as its position on the result page ... Eye-tracking experiments show that a user is less likely to examine results near the bottom of the list ... [and] that a document is not clicked with the same frequency if situated after a highly relevant or mediocre document.

The Cascade Model assumes that users view search results from top to bottom, deciding whether to click each result before moving to the next ... The cascade model is based on a simplistic behavior model: Users examine all the documents sequentially until they find a relevant document and then abandon the search.

We generalize this and allow for the possibility that a user skips a document without examining it .... We propose that the probability of examination is dependent on the distance d from the last click as well as on the position r in the ranking. The intuition behind using the distance is that a user tends to abandon the search after seeing a long sequence of unattractive snippets on the page.

Our solution outperforms very significantly all previous models ... Our findings confirm that [users] almost always see the document directly after the clicked document ... and explain why documents situated just after a very relevant document are clicked more often.
The paper also discusses a "multiple browsing model" that attempts to capture that different people browse differently on different types of queries (e.g. navigational vs. informational), but that model, surprisingly, failed to yield an improvement.
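To make the examination model concrete, here is a rough sketch of how a click might be generated under it. The gamma table of examination probabilities and the attractiveness scores are hypothetical stand-ins for parameters the authors estimate from click logs; this is an illustration of the model as described above, not their estimation procedure.

```python
import random

def simulate_session(ranked_docs, attractiveness, gamma, rng=None):
    """Generate clicks for one query under the examination model: the user
    examines the snippet at rank r with probability gamma[(r, d)], where d is
    the distance from the last clicked rank, and clicks an examined result
    with probability equal to its attractiveness."""
    rng = rng or random.Random(0)
    clicks, last_click = [], 0
    for r, doc in enumerate(ranked_docs, start=1):
        d = r - last_click              # grows through a run of unattractive snippets
        examined = rng.random() < gamma.get((r, d), 0.0)
        if examined and rng.random() < attractiveness[doc]:
            clicks.append(r)
            last_click = r
    return clicks
```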

Please see also Craswell et al., "An Experimental Comparison of Click-Position Bias Models" (PDF), a fun WSDM 2008 paper that proposes a Cascade Model for how searchers review search results.