QCRI @ TREC 2011: Microblog Track

Ali El-Kahki, Kareem Darwish
Qatar Computing Research Institute, Qatar
[email protected], [email protected]

Abstract

This paper briefly describes the Qatar Computing Research Institute (QCRI) participation in the TREC 2011 Microblog track. Our submissions focused on using a generative graphical model to perform query expansion. We trained a model that attempts to predict appropriate hashtags for expanding tweets as well as queries. In essence, we used hashtags to represent latent topics in tweets.

1 Introduction

Searching tweets, a popular type of microblog, poses interesting research problems, including: 1) the short length of tweets limits the context available for search; and 2) the language of tweets typically contains non-standard abbreviations and colloquial expressions. We focused on the first problem, namely the short length of tweets. In particular, we brought more context to tweets by expanding both tweets and queries with appropriate hashtags. Essentially, we used hashtags to represent the latent underlying topics in tweets. Massoudi et al. (2011) showed that hashtags can be effective expansion terms in the context of microblog search. Though hashtags appear in less than 19% of all tweets[1] and popular hashtags are often used by spammers, there are sufficient numbers of tagged tweets to build effective hashtag models. We used a Latent Dirichlet Allocation (LDA)-like graphical model to learn the relationship between words, latent topics, and hashtags. We assumed that the relationship between latent topics and hashtags is many-to-many (m to n) and that each tweet contains only one topic.

In the remainder of the paper, we describe: the preprocessing we performed on tweets (Sec. 2); the graphical model we employed (Sec. 3); the experimental setup of the submitted runs (Sec. 4); and the official TREC results for our runs (Sec. 5). We conclude in Sec. 6.

2 Tweet Preprocessing

According to the track guidelines, only English tweets are considered relevant. Thus, we needed to extract the English tweets from the approximately 16 million tweets in the collection. We used the open-source language-detection Java library[2]. In all, we extracted roughly 4.8 million English tweets. We then performed basic text tokenization in which words were split on delimiters, except for "#" and "@", which signify hashtags and user mentions respectively. A sketch of this filtering and tokenization step follows.
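The following is a minimal sketch of the preprocessing step, not the exact code behind the submission. The paper used the Java language-detection library; for illustration we use its Python port, langdetect, and the function names (is_english, tokenize) are ours.

    import re
    from langdetect import DetectorFactory, detect
    from langdetect.lang_detect_exception import LangDetectException

    DetectorFactory.seed = 0  # make language detection deterministic across runs

    def is_english(text):
        """Keep a tweet only if the detector labels it as English."""
        try:
            return detect(text) == "en"
        except LangDetectException:  # raised for empty or symbol-only tweets
            return False

    # Split on delimiters, but keep "#" and "@" attached so that hashtags
    # and user mentions survive as single tokens.
    TOKEN = re.compile(r"[#@]?\w+")

    def tokenize(text):
        return TOKEN.findall(text)

    # Example: tokenize("Protests in #Cairo today, says @user")
    # -> ['Protests', 'in', '#Cairo', 'today', 'says', '@user']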
3 Graphical Model

We used an LDA-like graphical model; Figure 1 shows the plate representation of the model. Formally, for each hashtag t, which is associated with a set of documents D, θt represents a distribution over the different topics Z. Each tag also has a set of topic-word distributions Φk,t and a distribution πt over background versus foreground generation. Each document is represented by its topic z and a set of words. Each word w is generated either from the background distribution Φbg or from the corresponding topic distribution Φz,t, depending on whether the word is a background or a foreground word, which is determined by the binary variable y.
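Putting the above together, the generative story for a single tweet is roughly the following. This is our reconstruction from the description above; the priors on θt, πt, and the Φ distributions are not specified here, though Dirichlet/Beta priors would be the conventional choice in an LDA-like model.

    For each tweet d carrying hashtag t:
      1. Draw a single topic z ~ θt.
      2. For each word position i in d:
         a. Draw the indicator yi ~ πt (background vs. foreground).
         b. If yi = background, draw the word wi from Φbg.
         c. If yi = foreground, draw the word wi from Φz,t.

This matches the paper's assumption that each tweet contains exactly one topic, while the shared background distribution Φbg can absorb common words that are uninformative about any particular topic.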