Text Classification

Introduction

  • Definition: Text classification is a supervised learning task in which a model learns to predict the category or class of a document from its text content. State-of-the-art methods are based on neural networks of various architectures, often combined with pre-trained language models or word embeddings.
  • Applications: Spam classification, sentiment analysis, email classification, service ticket classification, question and comment classification
  • Scope: Multiclass and multilabel classification
  • Tools: TorchText, spaCy, NLTK, fastText, Hugging Face, PySS3
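
The two settings named under Scope differ mainly in how model scores are turned into labels. The following is a minimal sketch, assuming a model that already produces one raw score per class; the class names, the scores and the 0.5 threshold are made up for illustration.

```python
import math

CLASSES = ["billing", "technical", "account"]  # hypothetical label set


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(score):
    """Squash one raw score into an independent probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-score))


def predict_multiclass(scores):
    """Multiclass: exactly one label, the class with the highest probability."""
    probs = softmax(scores)
    return CLASSES[probs.index(max(probs))]


def predict_multilabel(scores, threshold=0.5):
    """Multilabel: every class whose own probability clears the threshold."""
    return [c for c, s in zip(CLASSES, scores) if sigmoid(s) >= threshold]


scores = [2.0, 1.5, -3.0]  # made-up model outputs for one document
single_label = predict_multiclass(scores)
label_set = predict_multilabel(scores)
```

With these scores the multiclass rule returns a single class, while the multilabel rule can return several (or none) because each class is judged on its own.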

Models

FastText

Bag of Tricks for Efficient Text Classification. arXiv, 2016.

fastText is an open-source library, developed by the Facebook AI Research lab. Its main focus is on achieving scalable solutions for the tasks of text classification and representation while processing large datasets quickly and accurately.
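
As a rough illustration of how data is usually handed to the library, the sketch below formats labeled examples in the `__label__` convention that fastText's supervised mode reads. The helper names, the toy examples and the hyperparameter values are assumptions made for this sketch; the `train` function needs the `fasttext` package installed and is not called here.

```python
def to_fasttext_line(label, text):
    """Format one example for fastText's supervised mode:
    the label prefixed with __label__, then the text on a single line."""
    label = label.strip().replace(" ", "_")
    text = " ".join(text.split())  # collapse newlines and repeated spaces
    return f"__label__{label} {text}"


def write_training_file(examples, path):
    """Write (label, text) pairs to a fastText training file."""
    with open(path, "w", encoding="utf-8") as handle:
        for label, text in examples:
            handle.write(to_fasttext_line(label, text) + "\n")


def train(path):
    """Train a supervised model; requires `pip install fasttext`."""
    import fasttext  # imported here so the helpers above work without it

    # Hyperparameter values are illustrative, not tuned.
    return fasttext.train_supervised(input=path, epoch=10, lr=0.5, wordNgrams=2)


# Toy examples, made up for this sketch.
examples = [
    ("spam", "Win a FREE prize\nclick now"),
    ("not spam", "Meeting moved to 3pm"),
]
lines = [to_fasttext_line(label, text) for label, text in examples]
```

Each document must sit on one line, which is why the helper collapses embedded newlines before writing.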

XLNet

XLNet: Generalized Autoregressive Pretraining for Language Understanding. arXiv, 2019.

BERT

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv, 2018.

TextCNN

What Does a TextCNN Learn? arXiv, 2018.

Embedding

Extract features with a pre-trained embedding model (e.g. GloVe or fastText embeddings) or a custom-trained embedding model (e.g. Doc2Vec), then train an ML classifier (e.g. SVM or logistic regression) on the extracted features.
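
The embedding approach can be sketched as follows. This is a toy illustration, not a reference implementation: the three-dimensional vectors are invented stand-ins for a real pre-trained table, documents are represented by the mean of their word vectors, and a nearest-centroid rule is swapped in for the SVM or logistic regression to keep the example dependency-free.

```python
import math

# Toy 3-dimensional vectors standing in for a real pre-trained table.
EMBEDDINGS = {
    "great": [0.9, 0.1, 0.0],
    "love": [0.8, 0.2, 0.1],
    "awful": [-0.9, 0.1, 0.0],
    "hate": [-0.8, 0.2, 0.1],
    "movie": [0.0, 0.9, 0.2],
}
DIM = 3


def embed(text):
    """Average the vectors of known words; unknown words are skipped."""
    vectors = [EMBEDDINGS[w] for w in text.lower().split() if w in EMBEDDINGS]
    if not vectors:
        return [0.0] * DIM
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def fit_centroids(texts, labels):
    """Stand-in classifier: one mean feature vector per class."""
    grouped = {}
    for text, label in zip(texts, labels):
        grouped.setdefault(label, []).append(embed(text))
    return {
        label: [sum(col) / len(vecs) for col in zip(*vecs)]
        for label, vecs in grouped.items()
    }


def predict(centroids, text):
    """Assign the class whose centroid is closest in Euclidean distance."""
    features = embed(text)
    return min(centroids, key=lambda label: math.dist(features, centroids[label]))


centroids = fit_centroids(
    ["great movie", "love this movie", "awful movie", "hate this movie"],
    ["pos", "pos", "neg", "neg"],
)
prediction = predict(centroids, "love it , great")
```

In practice the `embed` output would be fed to whichever classifier the project uses; only the feature-extraction step is specific to this approach.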

Bag-of-words

Extract features with a bag-of-words method (e.g. CountVectorizer or TF-IDF), then train an ML classifier (e.g. SVM or logistic regression) on the extracted features.
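
To make the two stages concrete, here is a small pure-Python sketch, assuming whitespace tokenization, a smoothed IDF weighting and a binary logistic regression fitted by stochastic gradient descent. The toy texts, labels and hyperparameters are invented; a real project would more likely rely on a library vectorizer and classifier.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


def build_vocab(texts):
    """Map each distinct token to a column index."""
    vocab = {}
    for text in texts:
        for token in tokenize(text):
            vocab.setdefault(token, len(vocab))
    return vocab


def idf_weights(texts, vocab):
    """Smoothed inverse document frequency per vocabulary term."""
    doc_freq = Counter(t for text in texts for t in set(tokenize(text)))
    n_docs = len(texts)
    return {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}


def tfidf_vector(text, vocab, idf):
    """Term counts scaled by IDF; tokens outside the vocabulary are ignored."""
    vector = [0.0] * len(vocab)
    for token, count in Counter(tokenize(text)).items():
        if token in vocab:
            vector[vocab[token]] = count * idf[token]
    return vector


def train_logreg(rows, labels, epochs=200, lr=0.1):
    """Binary logistic regression fitted with plain stochastic gradient descent."""
    weights, bias = [0.0] * len(rows[0]), 0.0
    for _ in range(epochs):
        for x, y in zip(rows, labels):
            z = bias + sum(w * v for w, v in zip(weights, x))
            error = 1 / (1 + math.exp(-z)) - y  # gradient of log loss w.r.t. z
            weights = [w - lr * error * v for w, v in zip(weights, x)]
            bias -= lr * error
    return weights, bias


def predict(text, vocab, idf, weights, bias):
    """Return 1 if the linear score is positive, else 0."""
    x = tfidf_vector(text, vocab, idf)
    return int(bias + sum(w * v for w, v in zip(weights, x)) > 0)


texts = ["free prize click now", "win free money", "lunch at noon", "project meeting at noon"]
labels = [1, 1, 0, 0]  # 1 = spam, 0 = not spam (toy data)
vocab = build_vocab(texts)
idf = idf_weights(texts, vocab)
rows = [tfidf_vector(t, vocab, idf) for t in texts]
weights, bias = train_logreg(rows, labels)
```

The vocabulary and IDF weights are learned from the training texts only and then reused unchanged at prediction time.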

Process flow

Step 1: Collect text data

Collect via surveys, scrape from the internet, or use public datasets

Step 2: Create Labels

Label in-house or outsource the work, e.g. via Amazon Mechanical Turk

Step 3: Data Acquisition

Set up the database connection and fetch the data into the Python environment
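
A minimal sketch of this step, assuming a SQL database with a `documents` table that has `text` and `label` columns. The table name, the column names and the sample rows are hypothetical, and an in-memory SQLite database is used only to keep the example self-contained.

```python
import sqlite3


def fetch_labeled_texts(connection, table="documents"):
    """Return (text, label) rows from a table with `text` and `label` columns."""
    # The table name is interpolated, so it must come from trusted code,
    # never from user input.
    query = f"SELECT text, label FROM {table} WHERE text IS NOT NULL"
    return connection.execute(query).fetchall()


# An in-memory database keeps the sketch self-contained; a real project
# would connect to its own database instead.
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE documents (text TEXT, label TEXT)")
connection.executemany(
    "INSERT INTO documents VALUES (?, ?)",
    [
        ("Oracle connection giving error", "Database"),
        ("Cannot reset my password", "Access"),
        (None, "Database"),  # row without text, filtered out by the query
    ],
)
rows = fetch_labeled_texts(connection)
connection.close()
```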

Step 4: Data Exploration

Explore the data, validate it, and create a preprocessing strategy
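
A few summary statistics usually drive the preprocessing strategy. The sketch below assumes the data is a list of (text, label) pairs; the chosen checks (class balance, empty texts, duplicates, token lengths) and the sample rows are illustrative rather than exhaustive.

```python
from collections import Counter
from statistics import median


def explore(rows):
    """Summarise a list of (text, label) pairs before any modelling."""
    texts = [text for text, _ in rows]
    lengths = [len(text.split()) for text in texts]
    return {
        "n_rows": len(rows),
        "label_counts": Counter(label for _, label in rows),
        "n_empty": sum(1 for text in texts if not text.strip()),
        "n_duplicates": len(texts) - len(set(texts)),
        "median_tokens": median(lengths),
        "max_tokens": max(lengths),
    }


# Made-up rows for illustration.
rows = [
    ("great app", "positive"),
    ("great app", "positive"),
    ("crashes on every login attempt", "negative"),
    ("", "neutral"),
]
report = explore(rows)
```

A skewed `label_counts`, a non-zero `n_empty` or a large `max_tokens` each point to a concrete decision: resampling, dropping rows, or truncating long documents.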

Step 5: Data Preparation

Clean the data and make it ready for modeling
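
What cleaning involves depends on the data. As one hedged example, the sketch below strips HTML tags and URLs, decodes HTML entities, lowercases and normalises whitespace. The regular expressions are deliberately simple and the sample string is made up; production pipelines often need a proper HTML parser.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")
URL_RE = re.compile(r"https?://\S+|www\.\S+")
SPACE_RE = re.compile(r"\s+")


def clean_text(raw):
    """Normalise one document: drop markup and URLs, lowercase, tidy spaces."""
    text = TAG_RE.sub(" ", raw)   # remove HTML tags
    text = html.unescape(text)    # decode entities such as &amp; and &nbsp;
    text = URL_RE.sub(" ", text)  # remove links
    text = text.lower()
    return SPACE_RE.sub(" ", text).strip()


cleaned = clean_text("<p>Hello&nbsp;World!</p> Visit https://example.com <b>now</b>")
```

Tags are removed before entities are decoded so that escaped markup in the text (for example `&lt;b&gt;`) is kept as literal characters rather than stripped.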

Step 6: Model Building

Create the model architecture in Python and perform a sanity check

Step 7: Model Training

Start the training process and track the progress and experiments
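
Experiment tracking can be as simple as appending one record per run to a file. The sketch below assumes a JSON Lines log and a single validation metric; the file name, parameter names and metric values are invented, and dedicated tracking tools offer far more than this.

```python
import json
import os
import tempfile


def log_run(path, params, metrics):
    """Append one experiment record as a JSON line."""
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(json.dumps({"params": params, "metrics": metrics}) + "\n")


def best_run(path, metric):
    """Return the logged record with the highest value of `metric`."""
    with open(path, encoding="utf-8") as handle:
        runs = [json.loads(line) for line in handle if line.strip()]
    return max(runs, key=lambda run: run["metrics"][metric])


# Made-up runs written to a temporary directory for illustration.
log_path = os.path.join(tempfile.mkdtemp(), "runs.jsonl")
log_run(log_path, {"model": "fasttext", "lr": 0.5}, {"val_f1": 0.81})
log_run(log_path, {"model": "bert", "lr": 2e-5}, {"val_f1": 0.88})
winner = best_run(log_path, "val_f1")
```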

Step 8: Model Validation

Validate the final set of models and select/assemble the final model
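
One way to compare candidate models on held-out data is macro-averaged F1, which weighs rare classes as much as common ones. The metric choice is an assumption of this sketch (the right metric depends on the use case), and the labels and predictions are made up.

```python
def per_class_f1(y_true, y_pred):
    """F1 score for each class, computed one-vs-rest."""
    scores = {}
    for cls in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[cls] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


def select_best(y_true, candidate_predictions):
    """Pick the candidate model name with the highest macro F1."""
    return max(
        candidate_predictions,
        key=lambda name: macro_f1(y_true, candidate_predictions[name]),
    )


y_true = ["a", "a", "a", "b"]
candidates = {
    "always_a": ["a", "a", "a", "a"],  # 75% accuracy, ignores class b
    "balanced": ["a", "a", "b", "b"],  # also 75% accuracy, finds class b
}
best = select_best(y_true, candidates)
```

Both candidates have the same accuracy here, but macro F1 separates them because one of them never predicts the minority class.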

Step 9: User Acceptance Testing (UAT)

Wrap the model inference engine in an API for client testing
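
A minimal sketch of such a wrapper using only the Python standard library. The `classify` function is a keyword rule standing in for the trained model, and the JSON field names (`text`, `label`), host and port are assumptions; a web framework would normally replace the hand-written handler.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def classify(text):
    """Placeholder model: a keyword rule standing in for the trained classifier."""
    return "Database" if "oracle" in text.lower() else "Other"


def handle_payload(body):
    """Validate a JSON request body and return (status_code, response_dict)."""
    try:
        payload = json.loads(body)
    except ValueError:  # covers malformed JSON and undecodable bytes
        return 400, {"error": "body must be valid JSON"}
    text = payload.get("text") if isinstance(payload, dict) else None
    if not isinstance(text, str) or not text.strip():
        return 400, {"error": "field 'text' must be a non-empty string"}
    return 200, {"label": classify(text)}


class PredictHandler(BaseHTTPRequestHandler):
    """Answers POST requests with the predicted label as JSON."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        status, response = handle_payload(self.rfile.read(length))
        data = json.dumps(response).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)


def serve(port=8000):
    """Start the server; blocks until interrupted."""
    HTTPServer(("127.0.0.1", port), PredictHandler).serve_forever()
```

Keeping validation and prediction in `handle_payload`, separate from the HTTP handler, lets the request logic be tested without starting a server.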

Step 10: Deployment

Deploy the model to the cloud or to the edge, as required

Step 11: Documentation

Prepare the documentation and transfer all assets to the client

Use Cases

Email Classification

The objective is to build an email classifier trained on 700K emails across 300+ categories. A preprocessing pipeline handles HTML and template-based content, and the final model is an ensemble of FastText and BERT classifiers. Check out this Notion page.
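
The source does not say how the two classifiers are combined. One common option, shown here purely as an assumption, is to average the per-class probabilities of the models (optionally weighted) and pick the top class. The category names and probabilities are made up.

```python
def average_probabilities(prob_dicts, weights=None):
    """Weighted average of per-class probabilities from several models."""
    weights = weights or [1.0] * len(prob_dicts)
    total = sum(weights)
    classes = set().union(*prob_dicts)
    return {
        cls: sum(w * probs.get(cls, 0.0) for w, probs in zip(weights, prob_dicts)) / total
        for cls in classes
    }


def ensemble_predict(prob_dicts, weights=None):
    """Return the class with the highest averaged probability."""
    averaged = average_probabilities(prob_dicts, weights)
    return max(averaged, key=averaged.get)


# Made-up outputs for one email from the two models.
fasttext_probs = {"invoice": 0.55, "complaint": 0.40, "newsletter": 0.05}
bert_probs = {"invoice": 0.30, "complaint": 0.65, "newsletter": 0.05}
label = ensemble_predict([fasttext_probs, bert_probs])
```

Passing `weights` lets a stronger model count for more; the weights themselves would be tuned on validation data.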

User Sentiment towards Vaccine

Build a binary text classifier from users' tweets and manually annotated labels (label 0 means against vaccines, label 1 means in favor of vaccines). A 1D-CNN was trained on the training dataset. Check out this Notion page.

ServiceNow IT Ticket Classification

Based on the ticket's short description, along with its long description if available, identify the subject of the incident ticket and automatically classify it into one of a set of pre-defined categories. For example, if a customer wrote "Oracle connection giving error", the ticket should be labeled "Database". Check out this Notion page.
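
Combining the two description fields into one model input might look like the sketch below. The function name and the choice to join the fields with a single space are assumptions; the sample long description is invented.

```python
def build_ticket_text(short_description, long_description=None):
    """Join the two fields into one model input, skipping a missing long description."""
    parts = [short_description.strip()]
    if long_description and long_description.strip():
        parts.append(long_description.strip())
    return " ".join(parts)


short_only = build_ticket_text("Oracle connection giving error")
combined = build_ticket_text(
    "Oracle connection giving error",
    "  Started after last night's patch.  ",  # made-up long description
)
```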

Toxic Comment Classification

Check out this Notion page.

Pre-trained Transformer Experiments

Experiment with the different types of text classification models available in the Hugging Face Transformers library. The experiment-based inference is wrapped as a Streamlit app.

Long Docs Classification

Check out this Colab notebook.

BERT Sentiment Classification

Scrape app review data from the Android Play Store. Fine-tune a BERT model to classify each review as positive, neutral, or negative, then deploy the model as an API using FastAPI. Check out this Notion page.

Libraries

  • pySS3
  • FastText
  • TextBrewer
  • HuggingFace
  • QNLP
  • RMDL
  • spaCy

Common applications

  • Sentiment analysis.
  • Hate speech detection.
  • Document indexing in digital libraries.
  • Forum data: Find out how people feel about various products and features.
  • Restaurant and movie reviews: What are people raving about? What do people hate?
  • Social media: What is the sentiment around a hashtag, e.g. one about a company or a politician?
  • Call center transcripts: Are callers praising or complaining about particular topics?
  • General-purpose categorization in medical, academic, legal, and many other domains.