Workarounds for the non-differentiability of sampling when generating text


In Deep Learning for Natural Language Processing, supervised text generation models are usually trained by minimizing the cross-entropy loss between the ground-truth sequence and the predicted sequence. However, when we tackle an unsupervised text generation task, we may want to feed the generated sentence into other networks such as classifiers (or Encoder-Decoders when doing back-translation). Yet if the predicted sequence was generated by sampling tokens from the distribution produced by a softmax layer, the sampling operation “breaks the chains of differentiability” in the model and stops the backpropagation of gradients during training. Reinforcement Learning has been used to work around the non-differentiability of sampling, but we present here two different tricks, taken from two papers: SEQ3: Differentiable Sequence-to-Sequence-to-Sequence Autoencoder for Unsupervised Abstractive Sentence Compression by Christos Baziotis et al., and Von Mises-Fisher Loss for Training Sequence to Sequence Models with Continuous Outputs by Sachin Kumar and Yulia Tsvetkov.

Approximation of discrete sampling with continuous mixture of embeddings

The work of Christos Baziotis et al. tackles a Sequence-to-Sequence task similar to Machine Translation or Text Style Transfer: Unsupervised Sentence Compression. Their Encoder-Decoder is trained similarly to the back-translation training in Lample et al. (2018b), where the (pseudo-translation, back-translation) pair becomes a (compressor, reconstructor) pair. Note that both systems are Sequence-to-Sequence-to-Sequence Autoencoders.

Model architecture of SEQ3. Figure from Baziotis et al. 

Christos Baziotis et al. describe a method to make sampling differentiable, which allows gradients to be backpropagated into the compressor at training time. During the forward pass, they perform discrete sampling, but during the backward pass they use the Gumbel-Softmax reparametrization trick: a weighted sum of word embeddings approximates the embedding of the token that would have been sampled after the softmax layer.
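To make the idea concrete, here is a minimal sketch of the Gumbel-Softmax relaxation in plain Python. It is not the SEQ3 implementation: the function names, the toy logits, the temperature and the 2-dimensional embedding table are all assumptions chosen for illustration. It only shows the forward computation; in an autograd framework, the discrete embedding would be used in the forward pass and the gradient of the soft mixture in the backward pass.

```python
import math
import random

def gumbel_softmax_weights(logits, temperature, rng):
    """Relaxed one-hot weights: softmax((logits + Gumbel noise) / temperature)."""
    noisy = []
    for logit in logits:
        # Gumbel(0, 1) noise by inverse transform: -log(-log(u)), u ~ Uniform(0, 1).
        # The clamp avoids log(0) when the generator returns exactly 0.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        noisy.append((logit - math.log(-math.log(u))) / temperature)
    # Numerically stable softmax over the perturbed logits.
    m = max(noisy)
    exps = [math.exp(x - m) for x in noisy]
    total = sum(exps)
    return [e / total for e in exps]

def mix_embeddings(weights, embedding_table):
    """Weighted sum of the embedding rows (the continuous 'soft' token)."""
    dim = len(embedding_table[0])
    return [sum(w * row[d] for w, row in zip(weights, embedding_table))
            for d in range(dim)]

# Hypothetical toy setup: a 3-word vocabulary with 2-dimensional embeddings.
rng = random.Random(0)
logits = [2.0, 0.5, -1.0]
embedding_table = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]

weights = gumbel_softmax_weights(logits, temperature=0.5, rng=rng)

# Discrete sample (Gumbel-max trick): the argmax of the perturbed logits,
# which is also the argmax of the relaxed weights.
sampled_index = max(range(len(weights)), key=weights.__getitem__)
hard_embedding = embedding_table[sampled_index]

# Differentiable surrogate: mixture of embeddings weighted by the relaxed sample.
soft_embedding = mix_embeddings(weights, embedding_table)
```

Lower temperatures push the weights toward a one-hot vector, so the soft embedding gets closer to the embedding of the sampled token, at the price of noisier gradients.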

Differentiable Sampling in SEQ3. Figure from the presentation given at NAACL-HLT 2019 by Baziotis et al.

What if discrete output could be entirely replaced by continuous output?

Actually, there is a radically different approach: getting rid of the softmax layer and of discrete sampling altogether. The idea of using continuous outputs to enable controllable language generation and paraphrasing was introduced by Yulia Tsvetkov in her presentation Towards Personalized & Adaptive NLP: Modeling Output Spaces in Continuous-Output Language Generation at The 4th Workshop on Representation Learning for NLP (ACL19).

The main issues with softmax are its high memory complexity, its slowness, and the non-differentiability of sampling from its output distribution.

Therefore, in Von Mises-Fisher Loss for Training Sequence to Sequence Models with Continuous Outputs, Sachin Kumar and Yulia Tsvetkov propose to replace the softmax layer with the direct generation of a word embedding (a continuous representation of a word, produced by a differentiable layer). For a supervised Sequence to Sequence task, training is performed by minimizing the distance between the generated word embedding and the pre-trained embedding of the target word, looked up in a pre-trained word embedding table. At inference time, they obtain the predicted token by decoding the generated word embedding with the k-nearest neighbors algorithm.
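The train-then-decode loop described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: it uses a simple cosine distance as a stand-in for the von Mises-Fisher loss of the paper, and the three-word embedding table, the predicted vector and the function names are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def cosine_loss(predicted, target):
    """Training signal: distance between the generated and target embeddings.
    (A stand-in for the von Mises-Fisher loss, not the loss from the paper.)"""
    return 1.0 - cosine_similarity(predicted, target)

def decode_nearest(predicted, embedding_table):
    """Inference: return the word whose embedding is closest to the
    generated vector (nearest neighbor, i.e. k = 1)."""
    return max(embedding_table,
               key=lambda word: cosine_similarity(predicted, embedding_table[word]))

# Hypothetical pre-trained embedding table (3 words, 3 dimensions).
embedding_table = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.7, 0.6, 0.1],
    "car": [0.0, 0.2, 0.9],
}

# Vector that a decoder might output for one time step.
predicted = [0.8, 0.2, 0.1]

# Training: the ground-truth word is "cat", so the loss compares the
# generated vector with the embedding of "cat".
loss = cosine_loss(predicted, embedding_table["cat"])

# Inference: map the generated vector back to a token.
token = decode_nearest(predicted, embedding_table)
```

Because the loss is a differentiable function of the generated vector, no sampling step sits between the decoder and the loss; the discrete lookup only happens at inference time.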

Continuous representation of words in conditional language generation. Figure from Tsvetkov.
