Annotating implicit toxicity and subtle forms of abusive language is difficult
When my academic PhD advisor Thomas Bonald and I started collaborating with Jigsaw and the Athens University of Economics and Business in Spring 2019, we faced the following issue: explicit toxicity in online conversations has already been annotated at scale (e.g. 100k English Wikipedia Talk Page comments labelled for personal attacks, 2M comments from the Civil Comments platform annotated with the following attributes: severe toxicity, obscene, threat, insult, identity attack, sexual explicit), but asking crowdworkers to label implicitly offensive comments is much more complicated. Zeerak Waseem et al. state in Understanding abuse: A typology of abusive language detection subtasks that:
“Annotation (via crowd-sourcing and other methods) tends to be more straightforward when explicit instances of abusive language can be identified and agreed upon, but is considerably more difficult when implicit abuse is considered. […] Furthermore, while some argue that detailed guidelines can help annotators to make more subtle distinctions, others find that they do not improve the reliability of non-expert classifications”.
Besides, more subtle abusive comments may be easier to rephrase since crude (and easy to detect) toxicity often involves identity-based attacks, obscenity or insults that are the main content of the comment. As proposed by The challenge of identifying subtle forms of toxicity online, a system suggesting automatic rewordings of “detoxifiable” comments could nudge healthier conversations at scale.
The difficulty of getting accurately labeled comments containing passive toxicity and our interest in a general AI detoxification model motivate the exploration of Unsupervised Text Style Transfer. As stunning results in style transfer came first from Computer Vision research, we introduced some milestones achieved in Image-to-Image translation research in this article.
Unsupervised Sequence to Sequence Models
Sequence to Sequence tasks aim at transforming a sequence with a certain attribute (the source sequence) into a generated sequence with another attribute (the destination sequence). Text input is seen as a sequence of words or, more generally, a sequence of tokens (for instance subword units).
Examples of Sequence to Sequence tasks for text include machine translation, text style transfer, abstraction-based summarization and conversational systems. In this article we’ll focus on models based on unsupervised learning tricks as we saw earlier that annotating parallel text sequences for “detoxification” was difficult and costly.
Unsupervised Machine Translation
Unsupervised Machine Translation has made significant progress recently at the instigation of work from Facebook AI Research. If you want a more detailed chronological wrap-up of their research, please refer to the Facebook blog post Unsupervised machine translation: A novel approach to provide fast, accurate translations for more languages (Marc’Aurelio Ranzato et al.).
In their first article Word Translation Without Parallel Data (2017), Alexis Conneau et al. built a bilingual dictionary by first learning word embeddings in each language, then learning rotations forcing the word embeddings in both languages to match.
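To make the idea of "learning a rotation" concrete, here is a minimal sketch on made-up 2-D embeddings. It is a simplification, not the method of the paper: Conneau et al. work in high dimension and learn the initial mapping adversarially without any dictionary, whereas this toy assumes two known seed pairs and uses the closed-form best rotation in the plane. The names `fit_rotation`, `translate` and the toy vectors are all hypothetical.

```python
import math

def fit_rotation(pairs):
    """Angle of the 2-D rotation R minimizing sum ||R x - y||^2 over (x, y) pairs."""
    dot = sum(x[0] * y[0] + x[1] * y[1] for x, y in pairs)
    cross = sum(x[0] * y[1] - x[1] * y[0] for x, y in pairs)
    return math.atan2(cross, dot)

def rotate(vec, angle):
    c, s = math.cos(angle), math.sin(angle)
    return (c * vec[0] - s * vec[1], s * vec[0] + c * vec[1])

def cosine(u, v):
    return (u[0] * v[0] + u[1] * v[1]) / (math.hypot(*u) * math.hypot(*v))

def translate(word, src_emb, tgt_emb, angle):
    """Rotate the source embedding, then pick the closest target word (cosine)."""
    mapped = rotate(src_emb[word], angle)
    return max(tgt_emb, key=lambda w: cosine(mapped, tgt_emb[w]))

# Toy data (assumption): the French space is the English space rotated by 40 degrees.
english = {"cat": (1.0, 0.0), "dog": (0.8, 0.6), "house": (0.0, 1.0)}
true_angle = math.radians(40)
word_pairs = [("cat", "chat"), ("dog", "chien"), ("house", "maison")]
french = {fr: rotate(english[en], true_angle) for en, fr in word_pairs}

# Fit the rotation on two seed pairs only, then translate an unseen word.
seed_pairs = [(english["cat"], french["chat"]), (english["dog"], french["chien"])]
angle = fit_rotation(seed_pairs)
print(translate("house", english, french, angle))
```

The word "house" is never part of the seed pairs, yet it lands on its French counterpart because the same rotation aligns the whole space.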
The second step towards good Machine Translation systems is to generate sentences that sound fluent in the destination language, which cannot be obtained only from a word-by-word translation system. Fluency of a translated sentence is quantitatively measured by its perplexity. Guillaume Lample et al. describe in Unsupervised Machine Translation Using Monolingual Corpora Only (2018a) and Phrase-Based & Neural Unsupervised Machine Translation (2018b) methods to achieve fluent generation of translations, built on three principles:
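As a reminder of what perplexity measures, the sketch below computes it as the exponential of the average negative log-likelihood per token. The per-token probabilities are invented for illustration; in practice they would come from a language model of the destination language.

```python
import math

def perplexity(token_probs):
    """Perplexity of a sentence given the probability a language model assigned to each token."""
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)

# Made-up probabilities: the clumsy sentence contains two very unlikely tokens.
fluent = [0.5, 0.4, 0.5, 0.6]
clumsy = [0.5, 0.01, 0.02, 0.6]

print(perplexity(fluent), perplexity(clumsy))
```

A lower perplexity means the language model finds the sentence more likely, which is used as a proxy for fluency.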
- Initialization: align the representation of similar words (or phrases or subwords) in both languages. This allows coarse word-by-word translation. In a neural machine translation system, we can learn a joint token embedding table for the two corpora.
- Language modeling: independently learn a language model per language. A language model captures how likely a sequence of words is to be fluent in a specific language. It refines the word-by-word translation. The language model is learnt through a denoising auto-encoder.
- Back-translation (Rico Sennrich et al.): similarly to the cycle consistency loss of CycleGAN, translating a sentence from the source language to the target language and then back to the source language should give a sentence close to the original one. The intermediate sentence is called a pseudo-translation since, during training, the translation is very noisy. Training is done by minimizing the distance between the source sentence and the back-translated sentence.
There are different ways of implementing these principles. We’ll focus on the neural Encoder-Decoder architecture described in Lample et al. (2018b).
A Neural Encoder-Decoder is made of an encoder module, a decoder module and a token generation module. The encoder and the decoder are often based on the same network or the same network class. The initialization step is implemented by learnt embedding matrices that produce the word embeddings at the input of the encoder and decoder modules. The embedding lookup matrix may or may not be shared by the encoder and the decoder. When generating a translation at inference time or a pseudo-translation at training time, the decoder takes as input the representation output by the encoder, a Beginning of Sentence token, the attribute (language, style, etc.) of the destination and the tokens already generated at previous time steps. The straightforward implementation of the generator module is a linear layer followed by a softmax function, whose output distribution is used to sample generated tokens.
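The generation loop described above can be sketched as follows. Only the generator (linear layer, softmax, sampling) is spelled out; the decoder is replaced by a hypothetical stand-in, `toy_decoder_step`, which ignores the encoder output and simply returns a one-hot hidden state depending on the destination attribute and the current step. The vocabulary, weights and function names are assumptions made for the example.

```python
import math
import random

def linear(hidden, weights, bias):
    """One row of weights per vocabulary token."""
    return [sum(w * h for w, h in zip(row, hidden)) + b for row, b in zip(weights, bias)]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def generate(encoded, attribute, decoder_step, weights, bias, vocab, rng, max_len=10):
    prefix = ["<s>"]  # Beginning of Sentence token
    for _ in range(max_len):
        hidden = decoder_step(encoded, attribute, prefix)
        probs = softmax(linear(hidden, weights, bias))
        token = vocab[rng.choices(range(len(vocab)), weights=probs, k=1)[0]]
        if token == "</s>":
            break
        prefix.append(token)
    return prefix[1:]

# --- Toy stand-ins (hypothetical, for illustration only) ---
VOCAB = ["<s>", "</s>", "hello", "world", "bonjour", "monde"]
ATTRS = ["en", "fr"]
TARGETS = {"en": ["hello", "world", "</s>"], "fr": ["bonjour", "monde", "</s>"]}
HIDDEN = 6

def toy_decoder_step(encoded, attribute, prefix):
    """One-hot hidden state indexed by (destination attribute, step); ignores `encoded`."""
    step = min(len(prefix) - 1, 2)
    hidden = [0.0] * HIDDEN
    hidden[3 * ATTRS.index(attribute) + step] = 1.0
    return hidden

# Weights chosen so that each hidden unit strongly favours one token.
weights = [[0.0] * HIDDEN for _ in VOCAB]
for a, attr in enumerate(ATTRS):
    for step, token in enumerate(TARGETS[attr]):
        weights[VOCAB.index(token)][3 * a + step] = 50.0
bias = [0.0] * len(VOCAB)

rng = random.Random(0)
print(generate(None, "fr", toy_decoder_step, weights, bias, VOCAB, rng))
print(generate(None, "en", toy_decoder_step, weights, bias, VOCAB, rng))
```

Changing only the destination attribute changes the generated sequence, which is the mechanism the same decoder relies on to serve several languages or styles.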
Training the Encoder-Decoder
Now that we have seen the architecture of the Sequence to Sequence model at inference time, how can we use the three principles of unsupervised machine translation to train the model?
Language modeling can be trained by minimizing a denoising auto-encoder loss function:
$$\mathcal{L}_{LM} = \mathbb{E}_{x \sim X}\left[-\log P_{LM}\big(x \mid C(x), a_{src}(x)\big)\right]$$
where $C$ is the noise applied to input sentences and $a_{src}(x)$ is the language of the input sentence $x$ (pseudo-translation is not performed when training the Language Model; the destination language is the source language).
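A possible noise function C is sketched below. It assumes the two corruptions commonly used for this kind of denoising auto-encoder (random word dropout and a slight local shuffle, as in Lample et al. 2018a); the function name, default values and the way the shuffle is implemented are choices made for this example.

```python
import random

def add_noise(tokens, rng, drop_prob=0.1, shuffle_window=3):
    """Corrupt a non-empty token list: drop some tokens, then shuffle locally."""
    # Word dropout: each token is removed with probability drop_prob.
    kept = [t for t in tokens if rng.random() >= drop_prob]
    if not kept:  # never return an empty sentence
        kept = [rng.choice(tokens)]
    # Local shuffle: sort positions perturbed by noise in [0, shuffle_window),
    # so a token moves by fewer than shuffle_window positions.
    keys = [i + rng.random() * shuffle_window for i in range(len(kept))]
    order = sorted(range(len(kept)), key=keys.__getitem__)
    return [kept[i] for i in order]

rng = random.Random(0)
sentence = "i hope they fix the door before the weekend".split()
print(add_noise(sentence, rng))
```

The auto-encoder is then trained to reconstruct the clean sentence from its corrupted version, which forces it to model word order and missing words instead of copying its input.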
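The back-translation training consists in pseudo-translating the source sentence into the destination language through the Encoder-Decoder, then translating the generated sentence back to the source language and minimizing the error between the back-translated sentence and the source sentence:
$$\mathcal{L}_{BT} = \mathbb{E}_{x \sim X}\left[-\log P_{a_{dst}(x) \to a_{src}(x)}\big(x \mid D(E(x), a_{dst}(x))\big)\right]$$
where $a_{dst}(x)$ is the target language of the pseudo-translation, $D(E(x), a_{dst}(x))$ is the pseudo-translation of $x$ produced by the encoder $E$ and the decoder $D$, and the subscript $a_{dst}(x) \to a_{src}(x)$ indicates that the pseudo-translation is translated back to the source language.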
Finally, the total loss to minimize is $\mathcal{L}_{LM} + \lambda \mathcal{L}_{BT}$.
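As a small worked example of how the two terms combine, the sketch below approximates each expectation by a mean over a batch and uses made-up token probabilities in place of real model outputs. The helper names and the value of λ are assumptions.

```python
import math

def sentence_nll(token_probs):
    """Negative log-likelihood of one sentence from its per-token probabilities."""
    return -sum(math.log(p) for p in token_probs)

def total_loss(lm_batch, bt_batch, lam=1.0):
    """Batch estimate of the denoising loss plus lam times the back-translation loss."""
    l_lm = sum(sentence_nll(s) for s in lm_batch) / len(lm_batch)
    l_bt = sum(sentence_nll(s) for s in bt_batch) / len(bt_batch)
    return l_lm + lam * l_bt

# Probabilities the model gave to the tokens of the original sentence x:
lm_batch = [[0.5, 0.5]]  # when reconstructing x from its noisy version C(x)
bt_batch = [[0.25]]      # when translating the pseudo-translation back to x
print(total_loss(lm_batch, bt_batch, lam=0.5))
```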
Wait a minute, how can we backpropagate the error through the entire pipeline if the sampling operation generating the pseudo-translation tokens (based on the probability distribution produced by the softmax layer) is not differentiable? Lample et al. (2018b) did not backpropagate through the model generating the pseudo-translation (its weights are treated as frozen during that step), for “simplicity and because [they] did not observe improvements when doing so”. Christos Baziotis et al. described in SEQ3: Differentiable Sequence-to-Sequence-to-Sequence Autoencoder for Unsupervised Abstractive Sentence Compression an approach to work around the non-differentiability of sampling during back-translation. We expand on the non-differentiability of sampling in this article.
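One common way to relax discrete sampling is the Gumbel-softmax trick, which, to our understanding, is the kind of continuous approximation SEQ3 builds on. The sketch below only illustrates the forward computation in plain Python (no automatic differentiation): adding Gumbel noise to the logits and taking the argmax gives an exact categorical sample, while replacing the argmax by a softmax with a temperature gives a smooth vector through which a framework could propagate gradients. Function names and parameter values are assumptions.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def gumbel_softmax(logits, temperature, rng):
    """Relaxed one-hot sample: softmax((logits + Gumbel noise) / temperature)."""
    noisy = []
    for logit in logits:
        u = max(rng.random(), 1e-300)  # avoid log(0)
        gumbel = -math.log(-math.log(u))
        noisy.append((logit + gumbel) / temperature)
    return softmax(noisy)

rng = random.Random(0)
probs = [0.7, 0.2, 0.1]
logits = [math.log(p) for p in probs]

# The argmax of the relaxed sample follows the original categorical distribution.
counts = [0, 0, 0]
n_samples = 20000
for _ in range(n_samples):
    sample = gumbel_softmax(logits, temperature=0.5, rng=rng)
    counts[sample.index(max(sample))] += 1
print([c / n_samples for c in counts])
```

In a deep learning framework, a straight-through variant would feed the one-hot argmax to the next module in the forward pass while using the gradient of the relaxed sample in the backward pass.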
All right, we saw the general architecture and the training procedure, but which models should we use for the encoder and the decoder? Lample et al. (2018b) used LSTMs and transformer cells, but later Lample et al. introduced in Cross-lingual Language Model Pretraining (2019a) a network entirely based on the transformer architecture and a training procedure inspired by the Masked Language Modeling described by Jacob Devlin et al. in BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.
Pre-trained transformers as encoder and decoder networks
If you’re not familiar with how Attention Is All You Need leveraged attention mechanisms to build the Transformer, I would recommend taking a look at The Annotated Transformer. Compared to recurrent neural networks, transformers capture longer-range dependencies, which has proved to work well on Natural Language Understanding (BERT, XLNet) and Natural Language Generation (GPT-2).
Furthermore, to understand how Jacob Devlin et al. pre-train the Transformer architecture with Masked Language Modeling, The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning) is an excellent resource.
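For readers who want the gist of Masked Language Modeling before following the links, here is a minimal sketch of the input corruption described in the BERT paper: about 15% of the tokens are selected for prediction, and a selected token is replaced by a mask token 80% of the time, by a random token 10% of the time, and left unchanged otherwise. The function name, the toy vocabulary and the use of `None` for positions that are not predicted are choices made for this example.

```python
import random

def mask_tokens(tokens, vocab, rng, select_prob=0.15):
    """Return (corrupted inputs, labels); labels are None where nothing is predicted."""
    inputs, labels = [], []
    for token in tokens:
        if rng.random() < select_prob:
            labels.append(token)  # the model must recover the original token here
            r = rng.random()
            if r < 0.8:
                inputs.append("[MASK]")
            elif r < 0.9:
                inputs.append(rng.choice(vocab))
            else:
                inputs.append(token)
        else:
            labels.append(None)
            inputs.append(token)
    return inputs, labels

rng = random.Random(0)
sentence = "the service was slow but the food was great".split()
vocab = sorted(set(sentence))
inputs, labels = mask_tokens(sentence, vocab, rng)
print(inputs)
print(labels)
```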
As Lample et al. (2019a) obtained very good results on unsupervised Machine Translation tasks, we are genuinely interested in studying how the three principles of Unsupervised Machine Translation and the Neural Encoder-Decoder architecture with pre-trained transformers can help us transfer the style of text sequences and, more generally, improve unsupervised Sequence to Sequence Natural Language Processing tasks.
Quick overview of previous work in Unsupervised Text Style Transfer
Text Style Transfer with non-parallel data has been approached with various methods; you can find an exhaustive list of research papers since 2017 here.
In 2017, Zhiting Hu et al. and Tianxiao Shen et al. based their models on disentangled latent representations and established first results in Text Style Transfer. Examples below show sentiment style transfer applied to short sentences with some loss of meaning or fluency:
- Source (👎): “it was super dry and had a weird taste to the entire slice .”
- Destination (👍 – Hu et al.): “it was a great meal and the tacos were very kind of good .”
- Destination (👍 – Shen et al.): “it was super flavorful and had a nice texture of the whole side .”
- Source (👍): “i love the ladies here !”
- Destination (👎 – Hu et al.): “i avoid all the time !”
- Destination (👎 – Shen et al.): “i hate the doctor here !”
To our knowledge, the only application of style transfer to the detoxification of abusive comments is Fighting Offensive Language on Social Media with Unsupervised Text Style Transfer (Cicero Nogueira dos Santos et al., 2018). Despite giving a first try at paraphrasing offensive messages, this work states that “current unsupervised text style transfer approaches can only handle well cases where the offensive language problem is lexical […]. The models experimented in this work will not be effective in cases of implicit bias where ordinarily inoffensive words are used offensively.”; this conclusion motivates deeper research to capture subtle forms of toxicity.
- Source (☹): “i hope they pay out the as* , fraudulent or no .”
- Destination (🙂 Shen et al.): “i hope the work , we out the UNK and no .”
- Destination (🙂 dos Santos et al.): “i hope they pay out the state , fraudulent or no .”
In Multiple-Attribute Text Rewriting (2019b), Lample et al. applied their Machine Translation method to style transfer. Results seem pretty good, as shown by the following examples:
Here ends our tour of Unsupervised Style Transfer. We don’t claim to have covered all of the most promising techniques: for instance, Reinforcement Learning and GANs for text arouse ongoing interest and are especially relevant for controlled text generation. Given the growing interest in Text Style Transfer, things are moving fast, so the models we presented might be outdated soon. Anyway, diving into this community and studying how AI and NLP methods can help fight offensive behaviors online is particularly thrilling.