Skip to main content
Hero Image


Leading Conversational Search by Suggesting Useful Questions

2020 | The Web Conference 2020 (formerly WWW conference)

C Rosset, Ch Xiong, X Song, D Campos, N Craswell, S Tiwary, P Bennett

This paper studies a new scenario in conversational search, conversational question suggestion, which leads search engine users to more engaging experiences by suggesting interesting, informative, and useful follow-up questions. We first establish a novel evaluation metric, usefulness, which goes beyond relevance and measures whether the suggestions provide valuable information for the next step of a user’s journey. We construct a benchmark dataset for useful question suggestion which we seek to release publicly. Then we develop two suggestion systems, a BERT ranking model and a GPT-2 generation model, both trained in a multi-task learning framework. To make suggestions grounded in users’ information-seeking trajectories, one of the tasks brings in novel weak supervision training signals that convey past users’ search behaviors in search sessions. Using a novel embedding-based technique, we identify more coherent sessions in which users express salient information needs, and then inject them into our suggestion models to imitate how users transition to the next step of their session. Our offline experiments demonstrate the crucial role our “next-turn” inductive training plays in improving usefulness over a strong online system. Our online A/B test shows that our more useful question suggestions receive 8% more user clicks than the previous system.

Transformer-XH: Multi-evidence Reasoning with Extra Hop Attention

2020 | ICLR

C Zhao, C Xiong, C Rosset, X Song, P Bennett, S Tiwary

Transformers have obtained significant success modeling natural language as a sequence of text tokens. However, in many real world scenarios, textual data inherently exhibits structures beyond a linear sequence such as trees and graphs; many tasks require reasoning with evidence scattered across multiple pieces of texts. This paper presents Transformer-XH, which uses eXtra Hop attention to enable the intrinsic modeling of structured texts in a fully data-driven way. Its new attention mechanism naturally “hops” across the connected text sequences in addition to attending over tokens within each sequence. Thus, Transformer-XH better conducts multi-evidence reasoning by propagating information between multiple documents, constructing global contextualized representations, and jointly reasoning over multiple pieces of evidence. On multi-hop question answering, Transformer-XH leads to a simpler multi-hop QA system which outperforms previous state-of-the-art on the HotpotQA FullWiki setting. On FEVER fact verification, applying Transformer-XH provides state-of-the-art accuracy and excels on claims whose verification requires multiple evidence.

Generic Intent Representation in Web Search

2019 | SIGIR

H Zhang, X Song, Ch Xiong, C Rosset, P Bennett, N Craswell, S Tiwary

This paper presents GEneric iNtent Encoder (GEN Encoder) which learns a distributed representation space for user intent in search. Leveraging large scale user clicks from Bing search logs as weak supervision of user intent, GEN Encoder learns to map queries with shared clicks into similar embeddings end-to-end and then fine tunes on multiple paraphrase tasks. Experimental results on an intrinsic evaluation task – query intent similarity modeling–demonstrate GEN Encoder’s robust and significant advantages over previous representation methods. Ablation studies reveal the crucial role of learning from implicit user feedback in representing user intent and the contributions of multi-task learning in representation generality. We also demonstrate that GEN Encoder alleviates the sparsity of tail search traffic and cuts down half of the unseen queries by using an efficient approximate nearest neighbor search to effectively identify previous queries with the same search intent. Finally, we demonstrate distances between GEN encodings reflect certain information seeking behaviors in search sessions.

Incorporating Query Term Independence Assumption for Efficient Retrieval and Ranking using Deep Neural Networks

2019 | SIGIR

B Mitra, C Rosset, D Hawking, N Craswell, F Diaz, E Yilmaz

Classical information retrieval (IR) methods, such as query likelihood and BM25, score documents independently w.r.t. each query term, and then accumulate the scores. Assuming query term independence allows precomputing term-document scores using these models—which can be combined with specialized data structures, such as inverted index, for efficient retrieval. Deep neural IR models, in contrast, compare the whole query to the document and are, therefore, typically employed only for late stage re-ranking. We incorporate query term independence assumption into three state-of-the-art neural IR models: BERT, Duet, and CKNRM—and evaluate their performance on a passage ranking task. Surprisingly, we observe no significant loss in result quality for Duet and CKNRM—and a small degradation in the case of BERT. However, by operating on each query term independently, these otherwise computationally intensive models become amenable to offline precomputation—dramatically reducing the cost of query evaluations employing state-of-the-art neural ranking models. This strategy makes it practical to use deep models for retrieval from large collections—and not restrict their usage to late stage re-ranking.

An Axiomatic Approach to Regularizing Neural Ranking Models

2019 |

C Rosset, B Mitra, C Xiong, N Craswell, X Song, S Tiwary

Axiomatic information retrieval (IR) seeks a set of principle properties desirable in IR models. These properties when formally expressed provide guidance in the search for better relevance estimation functions. Neural ranking models typically contain a large number of parameters. The training of these models involve a search for appropriate parameter values based on large quantities of labeled examples. Intuitively, axioms that can guide the search for better traditional IR models should also help in better parameter estimation for machine learning based rankers. This work explores the use of IR axioms to augment the direct supervision from labeled data for training neural ranking models. We modify the documents in our dataset along the lines of well-known axioms during training and add a regularization loss based on the agreement between the ranking model and the axioms on which version of the document---the original or the perturbed---should be preferred. Our experiments show that the neural ranking model achieves faster convergence and better generalization with axiomatic regularization.

Serving DNNs in Real Time at Datacenter Scale with Project Brainwave

2018 | Microsoft Research

To meet the computational demands required of deep learning, cloud operators are turning toward specialized hardware for improved efficiency and performance. Project Brainwave, Microsoft's principal infrastructure for AI serving in real time, accelerates deep neural network (DNN) inferencing in major services such as Bing’s intelligent search features and Azure. Exploiting distributed model parallelism and pinning over low-latency hardware microservices, Project Brainwave serves state-of-the-art, pre-trained DNN models with high efficiencies at low batch sizes. A high-performance, precision-adaptable FPGA soft processor is at the heart of the system, achieving up to 39.5 TFLOPs of effective performance at Batch 1 on a state-of-the-art Intel Stratix 10 FPGA.

Neural Ranking Models with Multiple Document Fields

2018 | ACM

H Zamani, B Mitra, X Song, N Craswell, S Tiwary

Deep neural networks have recently shown promise in the ad-hoc retrieval task. However, such models have often been based on one field of the document, for example considering document title only or document body only. Since in practice documents typically have multiple fields, and given that non-neural ranking models such as BM25F have been developed to take advantage of document structure, this paper investigates how neural models can deal with multiple document fields. We introduce a model that can consume short text fields such as document title and long text fields such as document body. It can also handle multi-instance fields with variable number of instances, for example where each document has zero or more instances of incoming anchor text. Since fields vary in coverage and quality, we introduce a masking method to handle missing field instances, as well as a field-level dropout method to avoid relying too much on any one field. As in the studies of non-neural field weighting, we find it is better for the ranker to score the whole document jointly, rather than generate a per-field score and aggregate. We find that different document fields may match different aspects of the query and therefore benefit from comparing with separate representations of the query text. The combination of techniques introduced here leads to a neural ranker that can take advantage of full document structure, including multiple instance and missing instance data, of variable length. The techniques significantly enhance the performance of the ranker, and outperform a learning to rank baseline with hand-crafted features.

Optimizing Query Evaluations Using Reinforcement Learning for Web Search

2018 | ACM

C Rosset, D Jose, G Ghosh, B Mitra, S Tiwary

In web search, typically a candidate generation step selects a small set of documents---from collections containing as many as billions of web pages---that are subsequently ranked and pruned before being presented to the user. In Bing, the candidate generation involves scanning the index using statically designed match plans that prescribe sequences of different match criteria and stopping conditions. In this work, we pose match planning as a reinforcement learning task and observe up to 20% reduction in index blocks accessed, with small or no degradation in the quality of the candidate sets.

Systems and methods for automated query answer generation

2018 | Google Patents

S Tiwary, M Rosenberg, J Gao, X Song, R Majumder, L Deng

ystems and methods for automated generation of new content responses to answer user queries are provided. The systems and methods for automated generation of new content responses answer user queries utilizing deep learning and a reasoning algorithm. The generated response is composed of new content and is not merely cut or copied information from one or more search results. Accordingly, the systems and methods for automated generation of new content responses provide tailored query specific answers that can be long and detailed including several sentences of information or that can be short and concise, such as “yes” or “no.” The ability of the systems and methods described herein to create or generate new content in response to a user query improves the usability, improves the performance, and/or improves user interactions of/with a search query system.

Towards Language Agnostic Universal Representations

2018 |

A Aghajanyan, X Song, S Tiwary

When a bilingual student learns to solve word problems in math, we expect the student to be able to solve these problem in both languages the student is fluent in,even if the math lessons were only taught in one language. However, current representations in machine learning are language dependent. In this work, we present a method to decouple the language from the problem by learning language agnostic representations and therefore allowing training a model in one language and applying to a different one in a zero shot fashion. We learn these representations by taking inspiration from linguistics and formalizing Universal Grammar as an optimization process (Chomsky, 2014; Montague, 1970). We demonstrate the capabilities of these representations by showing that the models trained on a single language using language agnostic representations achieve very similar accuracies in other languages.

MS MARCO: A Human Generated MAchine Reading COmprehension Dataset

2016 |

P Bajaj, D Campos, N Craswell, L Deng, J Gao, X Liu, R Majumder, A McNamara, B Mitra, T Nguyen, M Rosenberg, X Song, A Stoica, S Tiwary, T Wang

We introduce a large scale MAchine Reading COmprehension dataset, which we name MS MARCO. The dataset comprises of 1,010,916 anonymized questions---sampled from Bing's search query logs---each with a human generated answer and 182,669 completely human rewritten generated answers. In addition, the dataset contains 8,841,823 passages---extracted from 3,563,535 web documents retrieved by Bing---that provide the information necessary for curating the natural language answers. A question in the MS MARCO dataset may have multiple answers or no answers at all. Using this dataset, we propose three different tasks with varying levels of difficulty: (i) predict if a question is answerable given a set of context passages, and extract and synthesize the answer as a human would (ii) generate a well-formed answer (if possible) based on the context passages that can be understood with the question and passage context, and finally (iii) rank a set of retrieved passages given a question. The size of the dataset and the fact that the questions are derived from real user search queries distinguishes MS MARCO from other well-known publicly available datasets for machine reading comprehension and question-answering. We believe that the scale and the real-world nature of this dataset makes it attractive for benchmarking machine reading comprehension and question-answering models.