CURTIS — CURe by Therapy Intelligent System

Kaushik Shakkari
6 min read · Nov 2, 2020

Curtis is a mental health therapy bot that comforts users with short, accurate, and empathetic responses using deep contextual NLP models.

A photo from Factor Daily

On March 22nd of this year, at the peak of COVID in Los Angeles, I read news about people losing their jobs and facing anxiety and depression because of the pandemic.

Tweet by CBS Chicago

I got the idea to build CURTIS, a mental health conversational AI bot that can be your friend and counsel you during these difficult times.

CURTIS is trained on real therapist responses to many questions from CounselChat, so the model can generate empathetic responses that soothe users. CounselChat is a website where therapy-seekers post questions on a forum and view responses from real therapists. Each question on CounselChat has its own URL, so I wrote a Python script to retrieve all the question URLs and then crawled every therapist answer from them. In total there are 12,139 therapist answers across 33 topics, including depression, anxiety, stress, and parenting.
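The URL-collection step can be sketched with the standard library alone. This is a minimal, hypothetical example: the `/questions/` href pattern used to recognize question links is an assumption for illustration, not the site's actual markup.

```python
from html.parser import HTMLParser

# Sketch of collecting question URLs from a listing page. The assumption
# (hypothetical) is that each question link's href starts with "/questions/".
class QuestionLinkParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href", "").startswith("/questions/"):
            self.urls.append(attrs["href"])

# Tiny stand-in for a fetched listing page.
listing_html = """
<div>
  <a href="/questions/why-am-i-so-anxious">Why am I so anxious?</a>
  <a href="/about">About</a>
  <a href="/questions/how-to-handle-stress">How to handle stress?</a>
</div>
"""

parser = QuestionLinkParser()
parser.feed(listing_html)
print(parser.urls)  # only the two question links are kept
```

In the real pipeline each collected URL would then be fetched and parsed the same way to extract the therapist answers on that page.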

Screenshot from author: Topic distribution of the scraped dataset (plotted with Plotly) follows a Zipfian distribution

There are several challenges in using this dataset for modeling. Despite 14,268 therapist answers, there are only 831 unique questions, and selecting the single best therapist answer for each question is difficult. The topic distribution also follows a Zipfian distribution, and such an imbalanced dataset is a poor fit for a topic classification model.

I addressed these challenges by applying a sample-selection technique to the CounselChat dataset and by collecting an additional dataset from Kaggle. I used upvotes and views as features to select one answer from each question's multiple therapist answers. To avoid the Zipfian imbalance while modeling the baseline topic classifier, I created a new label called root_topics, which consists of only 3 topics.
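The sample-selection step can be sketched as picking, per question, the answer with the most upvotes, breaking ties by views. The exact weighting the project used is not stated, so the lexicographic (upvotes first, then views) rule below is an assumption, and the answer texts are invented examples.

```python
# Hypothetical answer records: several therapist answers per question.
answers = [
    {"question_id": 1, "text": "Try grounding exercises.", "upvotes": 3, "views": 120},
    {"question_id": 1, "text": "Consider talking to a counselor.", "upvotes": 5, "views": 80},
    {"question_id": 2, "text": "Sleep hygiene matters a lot.", "upvotes": 2, "views": 300},
]

def best_answer_per_question(answers):
    """Keep one answer per question: most upvotes, then most views."""
    best = {}
    for a in answers:
        qid = a["question_id"]
        score = (a["upvotes"], a["views"])  # lexicographic comparison
        if qid not in best or score > (best[qid]["upvotes"], best[qid]["views"]):
            best[qid] = a
    return best

selected = best_answer_per_question(answers)
print(selected[1]["text"])  # the 5-upvote answer wins for question 1
```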

Screenshot from author

I created a multi-label feature for the root topics. For example, the question "I am very stressed because of my girlfriend. How can I deal with it?" gets the label [1, 1, 0] because it falls under both the Relationship Conflicts and Emotion Conflicts categories.
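The multi-label encoding can be sketched as a binary vector over the 3 root topics. The root-topic names below are inferred from the article's example (the third one is a placeholder assumption):

```python
# Root-topic order is assumed; the first two come from the article's example,
# the third ("Other") is a hypothetical placeholder.
ROOT_TOPICS = ["Relationship Conflicts", "Emotion Conflicts", "Other"]

def encode_root_topics(topics):
    """Return a binary vector with a 1 for each root topic present."""
    return [1 if root in topics else 0 for root in ROOT_TOPICS]

# The girlfriend-stress question belongs to two categories at once.
label = encode_root_topics(["Relationship Conflicts", "Emotion Conflicts"])
print(label)  # [1, 1, 0]
```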

Modeling: After understanding the problem, I came up with 3 main ideas for generating good responses. All 3 models build on deep contextual bidirectional transformer models like BERT, XLNet, and RoBERTa. I have finished implementing models 1 and 2; model 3 is currently in progress.

Example:

Question Title (Question): I don’t know why my mother cannot trust me
Question Text (Question Context): My mother started hating me. She doesn’t believe in me and keeps scolding me. She is also avoiding my questions.

Sample Response (from the bot): It seems like you are having conflicts with someone important in your life.

Any such model requires good word and sentence representations, whether the task is supervised or unsupervised. In our case, it is very important to capture the context of the user's question. For example, consider the following scenarios.

Screenshot from author

The questionText feature from the dataset helps us capture context and produce better word and sentence representations while building the model.

Model 1: (Fine-Tuned Multi-Label Multi-Class Classification Model) — My Baseline Model. Description: This model is similar to detecting the intent of a query and generating an intent-specific response. Given a questionTitle and its context (questionText), I built a multi-label classifier by fine-tuning Google BERT. I add an output layer (a 768 × number-of-labels weight matrix) on top of the “BERT BASE UNCASED” pretrained model, which has 110 million parameters, to classify questions. The topic feature from the dataset serves as the classification label. After a question is classified into a topic, a topic-specific response is returned.
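The added classification head can be sketched in NumPy: a single linear layer over BERT's 768-dimensional pooled output, followed by an independent sigmoid per label (which suits multi-label classification better than a softmax). The pooled vector and weights below are random stand-ins, not a real fine-tuned model.

```python
import numpy as np

rng = np.random.default_rng(0)
HIDDEN, NUM_LABELS = 768, 3  # BERT-base pooled size; 3 root topics

# Stand-in for the pooled [CLS] output BERT produces for one question.
pooled = rng.normal(size=(1, HIDDEN))

# The added output layer: a 768 x num_labels weight matrix plus a bias.
W = rng.normal(scale=0.02, size=(HIDDEN, NUM_LABELS))
b = np.zeros(NUM_LABELS)

logits = pooled @ W + b
probs = 1.0 / (1.0 + np.exp(-logits))   # one sigmoid per label (multi-label)
predicted = (probs > 0.5).astype(int)   # each label decided independently
print(probs.shape, predicted.shape)     # (1, 3) (1, 3)
```

During fine-tuning, this head would be trained jointly with BERT using a per-label binary cross-entropy loss.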

Metric: label ranking average precision score = 91.46 %

Advantages: Classification tasks are easy to evaluate with metrics like accuracy, precision, recall, and F1 score. Since this is a multi-label classification problem, I used the label ranking average precision score as the metric to evaluate the model.
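For readers unfamiliar with the metric, here is a from-scratch sketch of label ranking average precision (LRAP), following scikit-learn's definition: for each true label, compute the fraction of higher-or-equal-scored labels that are also true, then average over true labels and over samples.

```python
def lrap(y_true, y_score):
    """Label ranking average precision for binary label matrices."""
    total = 0.0
    for truth, scores in zip(y_true, y_score):
        true_idx = [j for j, t in enumerate(truth) if t == 1]
        sample = 0.0
        for j in true_idx:
            # rank of label j among all labels, by score
            rank = sum(1 for s in scores if s >= scores[j])
            # how many true labels score at least as high as label j
            hits = sum(1 for k in true_idx if scores[k] >= scores[j])
            sample += hits / rank
        total += sample / len(true_idx)
    return total / len(y_true)

# Worked example: sample 1 scores its true label 2nd of 3 -> 1/2;
# sample 2 scores its true label 3rd of 3 -> 1/3; mean = 5/12.
y_true = [[1, 0, 0], [0, 0, 1]]
y_score = [[0.75, 0.5, 1.0], [1.0, 0.2, 0.1]]
print(round(lrap(y_true, y_score), 3))  # 0.417
```

A perfect ranking (every true label scored above every false one) gives 1.0, so the reported 91.46% indicates the model ranks the correct topics near the top.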

Disadvantages: This model can only generate one specific message per topic. For example, both “How do I get my girl to fall back in love with me?” and “How do I get over my breakup and concentrate on my career?” get classified under the “Relationship Conflicts” root topic (or “Relationships” topic), so a generic response like “The decision regarding relationship issues is always a difficult one” is generated, which may not be apt for every question in that topic.

Model 2: (Contextual Similarity Model) — questionText acts as the context for the questionTitle. As explained earlier, it is very important to get a contextual representation of the questionTitle.

For each questionTitle in the dataset, I generate a 768-dimensional contextual vector using the pretrained BERT model. When a user submits a questionTitle and its questionText context, I generate the BERT contextual vector for that question and compute its cosine similarity with every vector in the dataset. The therapist answer associated with the most similar question becomes the bot's response.
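The retrieval step can be sketched with plain Python. The 4-dimensional vectors below stand in for the 768-dimensional BERT vectors, and the paired responses are invented examples:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical (vector, therapist answer) pairs standing in for the dataset.
dataset = [
    ([0.9, 0.1, 0.0, 0.2],
     "It seems like you are having conflicts with someone important in your life."),
    ([0.1, 0.8, 0.3, 0.0],
     "Job stress is exhausting; small daily routines can help you regain control."),
]

# Stand-in for the BERT vector of the user's question.
user_vec = [0.85, 0.15, 0.05, 0.1]

# Pick the stored question most similar to the user's, return its answer.
score, response = max((cosine(vec, user_vec), text) for vec, text in dataset)
print(round(score, 2), response)
```

The similarity score doubles as a confidence estimate, which is what the table below reports alongside each response.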

Sample responses generated from some questions:

Screenshot from author
Screenshot from author

From the table above, we can see that the model's last response is irrelevant. We can also observe that the confidence for such responses is lower than for the relevant ones.

Advantages: Unlike the baseline model, this model will generate specific responses for different questions.

Disadvantages: Evaluating any unsupervised approach without user feedback is very challenging. Currently, I simply return the response with the highest similarity score. With user feedback, I could statistically compute a similarity threshold and check whether the best score clears it, then perform error analysis on the false positives and false negatives to tune the model.
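The thresholding idea can be sketched as a fallback rule: return the retrieved response only when its similarity score clears the cutoff, otherwise fall back to a safe generic reply. The 0.6 threshold and the fallback text below are placeholder assumptions; the point of collecting user feedback is precisely to calibrate that number.

```python
# Placeholder fallback reply; to be replaced by whatever the bot should
# say when no stored question matches well enough.
FALLBACK = ("I'm here for you. Could you tell me a little more "
            "about what you're going through?")

def respond(best_response, best_score, threshold=0.6):
    """Return the retrieved response only if its score clears the threshold."""
    return best_response if best_score >= threshold else FALLBACK

print(respond("It seems like you are having conflicts...", 0.91))
print(respond("Some irrelevant retrieved answer", 0.32))  # falls back
```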

Model 3: (Response Generative Model) — In progress

The first model is retrieval-based and the second is context-similarity-based. We could also build a sequence-to-sequence generative model, with an architecture similar to language translation models. A generative model requires many more data points, but we have only 837 unique questions. I am currently exploring more websites like CounselChat to scrape additional data.

I am also building the user interface for CURTIS. (In Progress)

Video from author

Stay tuned for updates on the generative model and user interface!

Add me on LinkedIn. Attaching my portfolio link.


Kaushik Shakkari

Senior Data Scientist | My biggest success till day was converting my passion to my profession (Data Science 🚀)