TripAdvisor: The Good, the Bad, and the Neutral

Problem and Motivation

TripAdvisor is the world's largest travel website, providing reviews for travel-related businesses. On its website, users can query anything from restaurants and tour guides to flights and hotels.


Our goal is to improve the quality of returned search results by interpreting past users' feelings toward the specific features of each business. For example, one may learn that for a certain hotel, the service is good but restroom cleanliness is not. Our approach allows us to rank businesses according to the magnitude of how past users feel about the objects in a user's query. These search results can be further enriched by displaying other positively (or negatively) mentioned items to the users. In addition, we present results from clustering businesses based on reviews, allowing users to view similar businesses when an exact match to their query cannot be found.

As the premier travel-planning tool, with nearly 50 million monthly visitors and millions of business reviews, TripAdvisor's mission depends on providing detailed, easily interpretable information to travelers. Toward this end, the construction of our search engine involves three main steps:

  • Entity Recognition: Determine what things (nouns/noun phrases) are being discussed in each review.
  • Sentiment Analysis: Determine how people feel towards each entity.
  • Topic Modeling: Cluster businesses according to the words discussed in their reviews. When a user's query doesn't match any entities, what might be the next closest thing?

Data

Our dataset comprises 582,626 TripAdvisor reviews for businesses in the Boston-Cambridge area. These reviews are supplied in a typical column format, with entries specified as follows:

Business | Rating | TimeStamp | ReviewTitle | ReviewText
The Krusty Krab | 3 | 02:24:12 06-13-2015 | Plankton Infestation | Great recipe. But this place is overrun...

We found that each business has roughly 174 reviews on average. Given the size of the dataset, we are reasonably confident that most commonly searched items appear somewhere in the reviews, and that we can therefore perform sentiment analysis on specific features.

Methodology

Entity Recognition

As discussed in the introduction, the first task for each review is to recognize what is being commented on; this is the entity recognition step.

A guiding principle in machine learning is "garbage in, garbage out." Parsing text is a hard problem, especially for text written on the internet. Our solution is to take advantage of Part-of-Speech (PoS) tagging.

PoS tagging is widely used in text processing. A PoS tag is assigned to each word according to its role in the sentence. Traditional grammar classifies words into eight parts of speech: the verb (VB), the noun (NN), the pronoun (PR+DT), the adjective (JJ), the adverb (RB), the preposition (IN), the conjunction (CC), and the interjection (UH).

Relation tags describe the relation between chunks, and clarify the role of a chunk in that relation. For example, we could extract SBJ (subject noun phrase) and OBJ (object noun phrase). They link NP to VP chunks.

In our approach, we use this labeling convention to tag each word and group words by their labels. We also match each extracted entity to the adjectives that describe it. These entities tell us what each review is about: when a customer later queries some words, we first search the entities to find the associated businesses. We also use these nouns for LDA topic modeling, which is covered in a later section.
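The entity-adjective matching described above can be sketched as follows. The hand-tagged sentence and the `extract_entity_adjectives` helper are illustrative stand-ins for the output of a real PoS tagger and our actual matching code:

```python
# Minimal sketch: extract entities (nouns) and pair them with nearby
# adjectives from a PoS-tagged review sentence. The tagged input below
# is a stand-in for the output of a real tagger such as NLTK's pos_tag.

def extract_entity_adjectives(tagged, window=3):
    """Map each noun to the adjectives appearing within `window` tokens."""
    pairs = {}
    for i, (word, tag) in enumerate(tagged):
        if tag.startswith("NN"):  # noun: a candidate entity
            nearby = tagged[max(0, i - window): i + window + 1]
            pairs[word.lower()] = [w for w, t in nearby if t == "JJ"]
    return pairs

# Hand-tagged example sentence: "The friendly staff served cold food"
tagged = [("The", "DT"), ("friendly", "JJ"), ("staff", "NN"),
          ("served", "VB"), ("cold", "JJ"), ("food", "NN")]
print(extract_entity_adjectives(tagged))
```

A real pipeline would also chunk noun phrases and use dependency relations (SBJ/OBJ) rather than a fixed word window, but the idea is the same.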

Sentiment Analysis

Formulation

We formulate the problem of learning sentiments as a multi-class classification problem. Specifically, using review contents as input, we treat the overall review ratings as the sentiments. Since ratings range from 1 to 5, we treat those higher than 3 as positive, those lower than 3 as negative, and ratings of exactly 3 as neutral. Thus our problem reduces to a 3-class classification task. With this setup, we experiment with 3 different types of models: Convolutional Neural Networks (Kim, 2014), FastText (Joulin, 2016), and our own approach using a semi-supervised model.
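The label construction above is simple enough to state directly in code; this is a sketch of the mapping, not our training pipeline:

```python
# Ratings above 3 are positive, below 3 negative, exactly 3 neutral,
# reducing the 5-way rating scale to a 3-class sentiment problem.

def rating_to_sentiment(rating):
    if rating > 3:
        return "positive"
    if rating < 3:
        return "negative"
    return "neutral"

labels = [rating_to_sentiment(r) for r in [1, 2, 3, 4, 5]]
print(labels)  # ['negative', 'negative', 'neutral', 'positive', 'positive']
```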

Approaches

Convolutional Neural Networks

We also experimented with the popular Convolutional Neural Network approach to text classification. Although we did not end up using this model, we briefly introduce our approach here. The figure above summarizes the implementation's flow. The first matrix is an n-by-k representation of a sentence in the review, where the sentence has n words and each word is embedded into a k-dimensional vector (sentences are padded to the same length). The next layer performs convolutions over the vectors using multiple filter sizes; intuitively, each filter size slides over a different number of words at a time. We then max-pool the outputs of the convolutional layers into a feature vector, apply regularization and dropout, and classify the result with a softmax layer (a binary output indicating whether the sentence is positive or negative).

We built on an existing CNN implementation by Denny Britz and tuned its variables. For simplicity, we discarded the non-static word-vector channel presented in the paper, and we also discarded sentences longer than 100 words to reduce training time on our relatively large dataset.
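The convolution-plus-max-pooling step can be illustrated with a small numpy sketch. Dimensions and weights here are toy values, and this stands in for the TensorFlow implementation we actually used:

```python
import numpy as np

# Illustrative sketch of the CNN feature extractor: slide each h x k
# filter over the n x k sentence matrix, apply ReLU, and max-pool over
# time so each filter yields one feature.

rng = np.random.default_rng(0)
n, k = 10, 8                      # 10 words, 8-dimensional embeddings
sentence = rng.standard_normal((n, k))

def conv_maxpool(sent, filters, h):
    """Apply each h-word filter at every window, then max-pool over time."""
    feats = []
    for f in filters:
        acts = [np.maximum(0.0, np.sum(sent[i:i + h] * f))  # ReLU activation
                for i in range(len(sent) - h + 1)]
        feats.append(max(acts))
    return np.array(feats)

filters = rng.standard_normal((4, 3, k))   # 4 filters spanning 3 words each
features = conv_maxpool(sentence, filters, h=3)
print(features.shape)  # (4,) -- one pooled feature per filter
```

In the full model, this pooled feature vector is what feeds the dropout and softmax layers.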

FastText

FastText is a highly efficient approach to text classification. As depicted in the figure below, the model uses bag-of-n-gram features constructed from the input sentence. Each n-gram term is transformed into a continuous vector space via an embedding (i.e., lookup) layer, and the average of these vectors is taken as the input features for classification. Training is as efficient as other bag-of-words models, yet the n-grams let the model pick up word ordering, which can be critical for discovering sentiment. Although not relevant to our setup, it is worth noting that when the number of output classes is large, FastText uses hierarchical softmax to speed up training, allowing it to process more than one billion words in less than ten minutes.
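A toy sketch of FastText's feature construction follows. The bucket count, dimensions, and random embedding table are illustrative; in the real model the embeddings are learned jointly with a linear classifier on top of the averaged vector:

```python
import zlib

import numpy as np

# Hash unigrams and word bigrams into a fixed number of buckets, look up
# one embedding per n-gram, and average them into a single sentence
# vector (the "bag-of-ngrams" representation used by FastText).

rng = np.random.default_rng(0)
BUCKETS, DIM = 1000, 16
embeddings = rng.standard_normal((BUCKETS, DIM))

def sentence_vector(tokens):
    """Average the embeddings of all unigram and bigram features."""
    grams = tokens + [a + " " + b for a, b in zip(tokens, tokens[1:])]
    ids = [zlib.crc32(g.encode()) % BUCKETS for g in grams]
    return embeddings[ids].mean(axis=0)

vec = sentence_vector("the service was great".split())
print(vec.shape)  # (16,)
```

The hashing trick keeps the n-gram vocabulary bounded, which is part of what makes training so fast.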

Semi-Supervised Model

While supervised training often yields high-quality models, the cost of obtaining labels can be prohibitively high in many scenarios. As a result, much research has gone into learning good unsupervised representations; a notable example is word2vec, which learns continuous vector representations for discrete words. In this work, we attempt to learn such features by first training a language model on the entire corpus of review data. That is, we first train a model to predict the next word given all previous words in the sequence, i.e., to learn the conditional probability of each word given its history. For this, we use a single-layer uni-directional LSTM that takes traditional word embeddings as input. Note that review ratings are not used at this stage, since the next word serves as the label.
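The language-model objective factorizes the joint probability of a review's word sequence into per-step conditionals, each of which the LSTM learns to predict:

```latex
P(w_1, \dots, w_T) = \prod_{t=1}^{T} P(w_t \mid w_1, \dots, w_{t-1})
```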

Once the language model is trained, we save the embedding matrices and LSTM parameters. Using these weights, we train a supervised model to predict review ratings: we first run the review text through the embedding and LSTM layers to obtain features, then train a logistic regression classifier on those features to produce the final sentiment model.

Topic Modeling

Latent Dirichlet Allocation (LDA)

For each business, we concatenate all of its reviews into a single string and consider this to be one 'document' in the topic model. After determining the distribution of topics in the dataset of businesses, we assign each business to its 'top' topic, i.e. the topic comprising the highest proportion of its reviews.

To determine an appropriate number of topics, we apply a CountVectorizer to our collection of documents and perform K-Means clustering on the businesses. An elbow plot of the within-cluster sum of squared errors across values of K lets us approximate the optimal number of topics.
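The elbow heuristic can be sketched on toy data. The tiny K-Means loop below is only for illustration (scikit-learn's KMeans plays this role in the actual pipeline), and the document vectors are synthetic:

```python
import numpy as np

# Run K-Means for several K on toy "document" vectors and watch how the
# within-cluster sum of squared errors (inertia) drops; the elbow is
# where the curve flattens.

rng = np.random.default_rng(0)
# 60 points in 5 well-separated synthetic clusters.
docs = rng.standard_normal((60, 5)) + np.repeat(np.eye(5) * 6, 12, axis=0)

def kmeans_inertia(X, k, iters=20):
    """Lloyd's algorithm; returns the final within-cluster squared error."""
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers) ** 2).sum(-1), axis=1)
        centers = np.array([X[labels == j].mean(0) if (labels == j).any()
                            else centers[j] for j in range(k)])
    return ((X - centers[labels]) ** 2).sum()

inertias = {k: kmeans_inertia(docs, k) for k in (2, 3, 5, 8)}
# Inertia falls steeply up to the true cluster count (5 here), then flattens.
```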

After determining the topics, we then attempt to create "summaries", or labels, for each topic according to its highest-weighted words. We calculate an approximation of KL-divergence using the corpus of words for the topic model.

Candidate labels are generated using unigrams of the highest-weighted words for each topic, as well as bigrams of words that occur within 10 words up- or downstream of these high-weight words. We use the approach of Mei, Shen, and Zhai (2007) as a reference.
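The bigram candidate-generation step can be sketched as follows; the toy corpus and `candidate_labels` helper are illustrative, and the real pipeline scores candidates with the KL-divergence approximation mentioned above rather than raw counts:

```python
from collections import Counter

# For each occurrence of a topic's high-weight words, collect the word
# bigrams that occur within a 10-word window around it; frequent bigrams
# become candidate labels for the topic.

def candidate_labels(tokens, top_words, window=10):
    cands = Counter()
    for i, tok in enumerate(tokens):
        if tok in top_words:
            span = tokens[max(0, i - window): i + window + 1]
            cands.update(" ".join(b) for b in zip(span, span[1:]))
    return cands

tokens = ("the clam chowder at this seafood place was the best "
          "chowder in boston").split()
print(candidate_labels(tokens, {"chowder"}).most_common(3))
```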

Results

Sentiment Analysis

The results of the five sentiment models we attempted are shown below. Our models are trained using user reviews as input, and user ratings (bubble scores) as output.

Model | Accuracy
NLTK Vader | 61%
DC GAN | Failed
CNN | 89%
FastText | 90.5%
Semi-Supervised | 92%

Among the five models, the semi-supervised approach was the most successful. We use it in our final demo to predict the sentiment expressed by each individual sentence within a review, and then attribute that sentence's predicted sentiment to the entities that appear in it.
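The sentence-to-entity attribution step can be sketched as follows. The `toy_predict` stub stands in for the semi-supervised model, and the period-based sentence split is a simplification:

```python
# Split a review into sentences, predict a sentiment per sentence
# (stubbed here), and assign that sentiment to every entity mentioned
# in the sentence.

def attribute_sentiments(review, entities, predict):
    """Return {entity: [sentiments]} for entities mentioned in `review`."""
    scores = {}
    for sentence in review.split("."):
        sentiment = predict(sentence)
        for entity in entities:
            if entity in sentence.lower():
                scores.setdefault(entity, []).append(sentiment)
    return scores

# Stub classifier standing in for the trained semi-supervised model.
toy_predict = lambda s: "positive" if "great" in s else "negative"
review = "The staff was great. The bathroom was dirty."
print(attribute_sentiments(review, {"staff", "bathroom"}, toy_predict))
```

Aggregating these per-sentence sentiments per entity is what lets the demo surface, say, good service alongside poor cleanliness for the same business.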

Topic Modeling



Determining K (# Topics)

Distribution of Topics

Based on the results of K-Means clustering, we determined that K=30 was a good approximation for the number of topics.

Highest-Weighted Words in Each Topic



While some topics are well-differentiated, others appear to cover similar themes (highlighted in blue). Meanwhile, one topic appears nonsensical, capturing words from non-English reviews. After performing topic labeling, we find that some of the similar-looking topics become easier to tell apart:



Topic Labels

Conclusion and Future Work

We have implemented a model for suggesting query results by searching for matching positively-discussed entities within reviews. Our demo allows a user to enter a search query, and returns results with the highest associated sentiment intensity for the query. Users can also examine other positively and negatively expressed features of each business.

The goal of this project is to engage with customers. Our search engine achieves this goal by improving the accuracy of returned results, and by trying to understand what our customers really need. One extension of this project might be to predict what customers will like in the future based on previous queries. Such a feature could be implemented by introducing a recommendation system that utilizes collaborative filtering and other approaches, allowing TripAdvisor to become a personal travel advisor to guide, help and provide advice for customers. Furthermore, by looking at user queries, we may anticipate potential business opportunities to create more market value.

Meet The Team