In Deep Learning for Natural Language Processing, supervised text generation models are usually trained by minimizing the cross-entropy loss (i.e. an error) between the ground-truth sequence and the predicted sequence. However, when we tackle an unsupervised text generation task, we may want to feed the generated sentence into other networks, such as classifiers (or Encoder-Decoders when doing back-translation). Yet, if the predicted sequence was generated by sampling tokens from the distribution output by a softmax layer, the sampling operation "breaks the chain of differentiability" in the model and stops the backpropagation of gradients during training. Reinforcement Learning has been used to work around the non-differentiability of sampling, but here we present two different tricks: SEQ3: Differentiable Sequence-to-Sequence-to-Sequence Autoencoder for Unsupervised Abstractive Sentence Compression by Christos Baziotis et al., and Von Mises-Fisher Loss for Training Sequence to Sequence Models with Continuous Outputs by Sachin Kumar et al.
Approximation of discrete sampling with continuous mixture of embeddings
Christos Baziotis et al.'s work tackles a Sequence-to-Sequence task similar to Machine Translation or Text Style Transfer: Unsupervised Sentence Compression. Their Encoder-Decoder is trained similarly to the back-translation training in Lample et al. (2018b), where the (pseudo-translation, back-translation) pair is now a (compressor, reconstructor) pair. Note that both systems are Sequence-to-Sequence-to-Sequence Autoencoders.
Christos Baziotis et al. describe a method to make sampling differentiable, hence allowing the backpropagation of gradients through the compressor at training time. During the forward pass, they perform discrete sampling, but during the backward pass they use the Gumbel-Softmax reparametrization trick: a weighted sum of word embeddings approximates the embedding of the token that would have been sampled after the softmax layer.
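This forward/backward asymmetry is exactly what PyTorch's straight-through Gumbel-Softmax provides. As a minimal sketch (the vocabulary size, embedding dimension, and variable names here are illustrative, not taken from the SEQ3 code):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

vocab_size, embed_dim = 10, 4
logits = torch.randn(1, vocab_size, requires_grad=True)  # decoder output scores
embedding = torch.nn.Embedding(vocab_size, embed_dim)    # word embedding table

# Straight-Through Gumbel-Softmax: with hard=True the forward pass yields
# a discrete one-hot sample, while the backward pass uses the gradients of
# the underlying soft (relaxed) distribution.
y = F.gumbel_softmax(logits, tau=1.0, hard=True)

# A weighted sum over the embedding table: with a one-hot vector this
# selects a single embedding, yet gradients still flow back to `logits`.
sampled_embedding = y @ embedding.weight

sampled_embedding.sum().backward()
assert logits.grad is not None  # the chain of differentiability is intact
```

With a plain `torch.multinomial` sample instead, `logits.grad` would stay `None`: the discrete draw has no gradient, which is precisely the problem the trick solves.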
What if discrete output could be entirely replaced by continuous output?
Actually, there is a radically different approach: getting rid of the softmax layer and discrete sampling altogether. The idea of using continuous outputs to enable controllable language generation and paraphrasing was introduced by Yulia Tsvetkov in her presentation Towards Personalized & Adaptive NLP: Modeling Output Spaces in Continuous-Output Language Generation at The 4th Workshop on Representation Learning for NLP (ACL19).
The main issues with softmax are its high memory complexity, its slowness, and the non-differentiability of sampling from its output distribution.
Therefore, Sachin Kumar and Yulia Tsvetkov proposed in Von Mises-Fisher Loss for Training Sequence to Sequence Models with Continuous Outputs to replace the softmax layer with the generation of a word embedding (a continuous representation of words produced by a differentiable layer). For a supervised Sequence to Sequence task, training is performed by minimizing the distance between the generated word embedding and the corresponding entry of a pre-trained word embedding table. At inference time, they generate the predicted token by decoding the generated word embedding with the k-nearest neighbors algorithm.