How Harvey (the AI assistant) Detects Thank You Emails Auto-Magically

Hargovind Singh Arora · Published in Hiver Engineering · 10 min read · Jul 28, 2022

Harvey — Hiver’s AI Bot

One of our objectives at Hiver is to make the customer support teams who use Hiver more productive. We have built many features that automate the workflows a customer support team needs, and we are always looking for ways to improve them.

This post covers the Thank You email detection problem we wanted to solve for our customers, and how we evaluated and built a solution for it.

What are Thank You emails and why does Hiver want to automatically detect them

Many customer support teams who use Hiver are very focused on providing the best customer support experience possible. So it is not surprising that many of their customers reply with a Thank You email after their ticket is resolved. All customer support teams like hearing a word of appreciation from their customers, but these Thank You emails also create some inconvenience in the workflow of a typical customer support agent.

The problem: After a support ticket is resolved, a typical customer support agent replies to the customer on the same ticket and then marks the ticket as closed. When the customer responds with a Thank You email, the ticket reopens for the agent even though there is absolutely no action needed from the customer support team. This causes unnecessary productivity loss, as the agent now has to close the ticket again. Additionally, the extra time the ticket stays open after reopening gets added to the Resolution Time metric for that ticket. Ideally, this additional time should not be added to Resolution Time, as it does not reflect the actual time the agent spent resolving the ticket.

We framed the problem statement as: automatically identify Thank You emails from a customer and prevent them from reopening a closed ticket.

Starting the search for a possible solution 🤔

Detecting a Thank You email seems pretty straightforward on paper. But when you dig deeper into the problem, you realize that a customer can convey thanks without even writing those two words (i.e., thank you). They can also combine a new request with a Thank You message. The cost of a false positive is huge: the customer support agent might end up with an unanswered customer email.

Regex-Based Solution: We evaluated the most straightforward approach first, a regular expression-based solution. In this approach, we identify all keywords and patterns that can possibly convey the Thank You sentiment and try to create a regular expression that covers most of the possible messages a customer can write.

We discarded this approach quickly. First, pattern matching does not work well on a free-flowing email message if the customer is even a little creative in conveying the Thank You. Second, the regular expression would be overly complex to debug and understand for future enhancements. It can neither learn the relationships among the words within a sentence nor capture the sentiment of the message.

Machine Learning-Based Classifier: The approach involved using machine learning to classify the emails into two separate classes (types):

  1. Class 0 (General class/Non-Thank you type emails)
  2. Class 1 (Thank You sentiment type emails)

To validate the approach, we performed a POC (proof of concept) on an openly available email dataset (the Enron Email Dataset). The intent was to validate whether we could encode our email content in a way that lets an ML algorithm model the difference and relationships between these two classes. The results were encouraging but not perfect. One reason was that we did not have very clean data. We also needed to generalize the model so that it works well for a wide variety of email content. Still, after this POC we were fairly confident that the approach could be tweaked to a point where it solved our business requirements.

Preparing the Training Data 🕵️‍♀️

Identifying the data to train on: The idea was to get as much real-world data as possible, which would help the model perform better in the real world too. At Hiver, our customer support team uses our own product to provide customer support, and a lot of customers end up appreciating us (i.e., conveying thanks) whenever they are happy with their ticket resolution. This gave us enough real-world data to train our model on.

Cleaning the data: The first step was to get rid of all PII (personally identifiable information) like email ids, phone numbers, etc. The next step was to remove all proper nouns used in the sentences. We used spaCy's transformer-based model, which detects the part of speech of each token and let us drop the tokens tagged as proper nouns.
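For illustration, here is a minimal sketch of such a cleaning step, assuming spaCy's en_core_web_trf pipeline is installed. The regex patterns and the function name are our own illustrative choices, not the exact production code.

```python
import re
import spacy

# Requires: pip install spacy, then: python -m spacy download en_core_web_trf
nlp = spacy.load("en_core_web_trf")

EMAIL_RE = re.compile(r"\S+@\S+\.\S+")
PHONE_RE = re.compile(r"\+?\d[\d\-\s()]{7,}\d")

def scrub_email_body(text: str) -> str:
    """Remove obvious PII and proper nouns from an email body (illustrative)."""
    # Strip email addresses and phone-number-like patterns first.
    text = EMAIL_RE.sub(" ", text)
    text = PHONE_RE.sub(" ", text)
    # Drop tokens that the transformer pipeline tags as proper nouns.
    doc = nlp(text)
    return "".join(tok.text_with_ws for tok in doc if tok.pos_ != "PROPN")

print(scrub_email_body("Thanks John, you can reach me at jane@example.com or +1 415 555 0100."))
```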

Data labeling: The intent was to manually label the Thank You emails so that they could be used to train the model. After spending some time on this, we realized that a few cases are ambiguous: an email might contain a Thank You and still ask for an action from the customer support team. Two different teammates performing the labeling exercise might end up tagging the same email differently. To solve this, we came up with a clearly documented set of instructions/policies for making the call, and this brought in a lot of consistency.

Data pre-processing: These are standard steps for processing any textual data; a sketch of the steps follows the list below.

  • Remove URLs, email ids, and HTML tags
  • Lemmatize all words, i.e., reduce each word to its root form
  • Remove punctuation and emojis
  • Remove stop words (a curated list of words that we did not want in our vocabulary)
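Here is a rough sketch of these pre-processing steps, assuming spaCy's small English model for the lemmatization and stop-word checks; our production pipeline used its own curated stop-word list, so the details below are illustrative.

```python
import re
import string
import spacy

nlp = spacy.load("en_core_web_sm")  # lemmatizer + default stop words (illustrative choice)

URL_RE = re.compile(r"https?://\S+|www\.\S+")
EMAIL_RE = re.compile(r"\S+@\S+\.\S+")
TAG_RE = re.compile(r"<[^>]+>")

def preprocess(text: str) -> str:
    # Remove URLs, email ids, and HTML tags.
    text = URL_RE.sub(" ", text)
    text = EMAIL_RE.sub(" ", text)
    text = TAG_RE.sub(" ", text)
    # Drop emojis (and other non-ASCII symbols) and punctuation.
    text = text.encode("ascii", errors="ignore").decode()
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Lemmatize and remove stop words.
    doc = nlp(text.lower())
    return " ".join(tok.lemma_ for tok in doc if not tok.is_stop and not tok.is_space)

print(preprocess("Thanks a lot!! That worked 🎉 See https://example.com"))
```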

Words to numbers (Encoding the training data 🥸)

The training data was text, and it needed to be converted into a machine-understandable form. The next step, therefore, was to encode the textual data into numeric data that can be used to train the model. Although there are multiple ways to approach this, we decided to go ahead with the TF-IDF (Term Frequency and Inverse Document Frequency) approach.

Source: https://towardsdatascience.com/tf-term-frequency-idf-inverse-document-frequency-from-scratch-in-python-6c2b61b78558

TF-IDF is a vectorizer that gives higher weightage to the rarer words in a corpus (a collection of documents, i.e., emails in our case). Words that occur often are common across documents and cannot act as differentiators, so they get a low score, while rarer words help separate one kind of document from another and hence get a higher score. Although TF-IDF does not preserve the semantics of a sentence, it was good enough for our use case, and the encodings it generates have proven to work very well over the years for this kind of classification.

The output after this step is a Compressed Sparse Matrix.
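As a concrete illustration, scikit-learn's TfidfVectorizer produces exactly this kind of compressed sparse output. The tiny corpus below is made up for the example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Tiny made-up corpus standing in for the pre-processed email bodies.
emails = [
    "thank issue resolve perfectly appreciate help",
    "order still not arrive please share update",
    "thank quick response great support team",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(emails)       # SciPy compressed sparse row matrix

print(X.shape)                             # (3, vocabulary size)
print(vectorizer.get_feature_names_out())  # learned vocabulary
```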

Splitting the Data: We separated the data into two parts: one for training the machine learning model and the other for evaluating how well the model has been trained. We used an 80–20 split (80% for training and 20% for validation). An important point is that both splits should preserve the same proportion of each class (i.e., Class 0 and Class 1). We used a stratified splitting strategy to ensure that.
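A minimal sketch of the stratified split with scikit-learn, using tiny placeholder data so the snippet runs on its own:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholders for the TF-IDF matrix and the 0/1 labels from the earlier steps.
X_tfidf = np.random.rand(10, 5)
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

X_train, X_val, y_train, y_val = train_test_split(
    X_tfidf, y, test_size=0.2, stratify=y, random_state=42
)
print(y_train.mean(), y_val.mean())  # class balance preserved in both splits
```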

Building the classifier (The artificial separator)

Now that the training data was ready, the next step was to choose which classifiers to use to build the model. We evaluated tree-based classifiers first: the Random Forest classifier and the LightGBM (Light Gradient Boosting Machine) classifier. The results were impressive, but we thought we could still do better and wanted to explore more models. Next, we tried out a linear classifier: Logistic Regression.

Source: http://www.ritchieng.com/logistic-regression/

Logistic Regression performed better than our more advanced tree-based models. In fact, it kept getting better with more data, whereas LightGBM's performance dropped as more data was added. This meant that LightGBM was overfitting at the earlier stage with less data and, when provided with more data, was not able to generalize very well.

Learning Curve Plot for LightGBM
Learning Curve Plot for Logistic Regression

The fact that a linear classifier like Logistic Regression worked well on our binary classification problem indicates that our two classes were close to linearly separable in this feature space. Additionally, on very high-dimensional data, linear models like Logistic Regression can perform better than some more advanced models or algorithms.
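A sketch of how such a model comparison could be run, assuming X_train and y_train are the TF-IDF features and labels from the split above; the hyperparameters and the scoring choice are illustrative rather than our exact configuration.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from lightgbm import LGBMClassifier  # pip install lightgbm

models = {
    "random_forest": RandomForestClassifier(n_estimators=300, random_state=42),
    "lightgbm": LGBMClassifier(random_state=42),
    "logistic_regression": LogisticRegression(max_iter=1000),
}

for name, model in models.items():
    # Precision is the metric we care about most (see the next section).
    scores = cross_val_score(model, X_train, y_train, cv=5, scoring="precision")
    print(f"{name}: mean precision = {scores.mean():.3f}")
```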

Setting up metrics to judge the performance of the model

The key to finding the right metric is to look at the business problem and pick the metric that conveys most precisely how well the model is solving the business requirement. The metric should also help us improve the model and tell us whenever its performance is going down.

For our use case, we wanted to be very sure that an email predicted as Class 1 (i.e., Thank You) is genuinely Class 1, because based on this prediction the Hiver system would refrain from re-opening the ticket. Precision was therefore our primary metric, with Recall as a secondary metric that we were slightly more lenient on.

Source: https://medium.com/@shrutisaxena0617/precision-vs-recall-386cf9f89488

To summarize the product requirement: it was occasionally okay to miss a genuine Thank You email, but if we classify an email as a Thank You email, we had better be very confident that it is genuine.

We evaluated our model’s performance at different thresholds to find the sweet spot where precision was fairly high without compromising too much on recall.

Precision-Recall trade-off at different thresholds

Since it was a binary classification problem, we measured the AUC-ROC too when we were done with the training.

AUC ROC Curve
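The threshold analysis and the AUC-ROC measurement can be sketched as below, assuming clf is the fitted Logistic Regression model and X_val, y_val are the validation split from earlier; the precision floor is a made-up number, not our production value.

```python
from sklearn.metrics import precision_recall_curve, roc_auc_score

# clf, X_val, y_val are assumed from the earlier training and splitting steps.
probs = clf.predict_proba(X_val)[:, 1]  # class-1 (Thank You) probabilities

precision, recall, thresholds = precision_recall_curve(y_val, probs)

# Pick the lowest threshold that still keeps precision above a target floor.
TARGET_PRECISION = 0.95  # illustrative, not the production value
candidates = [t for p, t in zip(precision[:-1], thresholds) if p >= TARGET_PRECISION]
chosen_threshold = min(candidates) if candidates else 0.5

print("chosen threshold:", chosen_threshold)
print("ROC AUC:", roc_auc_score(y_val, probs))
```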

Taking the model to production 🚀 (The ML + Web service combo)

Taking a model to production needs a lot of planning and can lead to sub-optimal outcomes if not done correctly.

As VentureBeat reports, around 90 percent of machine learning models never make it into production. In other words, only one in ten of a data scientist’s workdays actually end up producing something useful for the company.

One of our primary concerns was that our model should not disrupt anything that is already working well for our customers, so we wanted to sandbox our approach to whatever extent possible. We built a separate microservice as a wrapper around the model, which encapsulates the model properly and at the same time exposes it to other services through APIs for predictions.

We used a Python micro-framework (FastAPI) to expose the capability through REST APIs. We designed the system so that even if the REST APIs fail for any reason, the core workflow for a Hiver user is not impacted at all.
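A minimal sketch of what such a FastAPI wrapper could look like, assuming the fitted vectorizer and classifier were serialized with joblib; the file names, threshold value, and response fields are illustrative, not the actual service contract.

```python
import joblib
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="thank-you-detector")

# Hypothetical artifact names, produced by an offline training job.
vectorizer = joblib.load("tfidf_vectorizer.joblib")
model = joblib.load("logreg_model.joblib")
THRESHOLD = 0.95  # illustrative, tuned from the precision-recall analysis

class EmailIn(BaseModel):
    body: str

@app.post("/predict")
def predict(email: EmailIn):
    probability = model.predict_proba(vectorizer.transform([email.body]))[0, 1]
    return {
        "is_thank_you": bool(probability >= THRESHOLD),
        "probability": float(probability),
    }
```

The service can then be run with any ASGI server (for example uvicorn), and callers simply treat a failed API call as "no prediction", so the core ticket workflow is never blocked.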

For this type of problem, the data and its trends do not change very fast, so we did not need continuous training pipelines for this use case. Continuous monitoring of the predictions at some cadence is a must, though.

Versioning the model: When we deploy a machine learning model to production, it is very important to version it well. Tools like MLflow can easily do this for us, but our use case did not require changing the model too frequently, so we postponed this for later. We might need to retrain our model once we have enough new data on cases where the model went wrong (false positives or false negatives).
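If and when versioning is needed, logging a retrained model with MLflow could look roughly like this; the run name, the metric value, and the registered model name are hypothetical.

```python
import mlflow
import mlflow.sklearn

# clf is the newly retrained scikit-learn classifier (assumed from earlier steps).
with mlflow.start_run(run_name="thank-you-classifier"):
    mlflow.log_param("vectorizer", "tfidf")
    mlflow.log_metric("val_precision", 0.97)  # placeholder metric value
    mlflow.sklearn.log_model(
        clf,
        artifact_path="model",
        registered_model_name="thank_you_classifier",
    )
```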

The story doesn’t end here — Capturing feedback to check the performance of the model in the real world

Taking and collating feedback is the most important stage of a machine learning model lifecycle in production.

We identified two ways of taking feedback:

  1. Implicit Feedback — Our model closed the conversation, i.e., it predicted an email as Class 1. Now, if the conversation gets reopened again, either by an agent or by another incoming email in the same conversation, we can infer that the prediction might be a false positive. Why do I say “might”? Because we have observed examples where a conversation got reopened even though the email was classified correctly. So we have to be careful while gathering implicit feedback; a rough sketch of this heuristic follows the list below.
  2. Explicit Feedback — Explicit feedback requires input from the user who is currently using the feature. It also requires more effort (building a way for users to give feedback) and can sometimes disrupt the user flow. At this point, we have not implemented the capability to get feedback from the user directly and will pick that up later.
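Here is a rough sketch of the implicit-feedback heuristic described above; the event fields and the feedback store are hypothetical stand-ins for our internal data structures.

```python
def record_implicit_feedback(prediction: dict, event: dict, feedback_store: list) -> None:
    """Flag a prediction for review when a conversation our model closed gets reopened."""
    reopened = event["type"] == "conversation_reopened"
    if prediction["predicted_class"] == 1 and reopened:
        # "Suspected" only: a reopen does not always mean the prediction was wrong,
        # so each flagged item still needs a manual check before it feeds retraining.
        feedback_store.append({
            "conversation_id": event["conversation_id"],
            "suspected_false_positive": True,
        })

feedback: list = []
record_implicit_feedback(
    {"predicted_class": 1},
    {"type": "conversation_reopened", "conversation_id": "c-123"},
    feedback,
)
print(feedback)
```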

Final notes

We had a bunch of happy customers come back and tell us how well this feature is working for them. Our model in production is performing well too, and we have not seen any false positives reported yet. As more customers gradually enable this feature, we will get to see how it performs across a wide variety of email responses.

We are very excited about the next feature our ML team is working on. We should be able to open up beta access pretty soon, so watch out for the release. By the way, if you found what we are working on interesting, come and join us here at Hiver.
