A Dual Reinforcement Learning Framework for Unsupervised Text Style Transfer

Overview

  • propose a dual reinforcement learning framework that transfers the style of a text directly via a one-step mapping model, without any separation of content and style.
  • consider the learning of the source-to-target and target-to-source mappings as a dual task, and design two rewards based on this dual structure to reflect style accuracy and content preservation, respectively.
  • the two one-step mapping models can then be trained via reinforcement learning, without any parallel data.

Approach


  • DualRL

    • the reward for changing style: a pre-trained classifier evaluating how well the transferred sentence matches the target style.

[Figure 3]

  • the reward for preserving content: the probability that the model g reconstructs x when taking y′ as input.

[Figure 4]

  • the overall reward is the harmonic mean of the above 2 rewards.

[Figure 5]

  • policy gradient: sample k times.

[Figure 6]
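The reward combination and policy-gradient update above can be sketched as follows; the helper names and the mean-reward baseline are illustrative assumptions, not taken from the paper's code.

```python
# Sketch of the DualRL reward: the overall reward is the harmonic mean of
# the style reward (from the pre-trained classifier) and the content
# reward (reconstruction probability of g), and the policy gradient is
# estimated from k sampled transfers.

def harmonic_mean(r_style, r_content, eps=1e-8):
    """Overall reward as the harmonic mean of the two partial rewards."""
    return 2.0 * r_style * r_content / (r_style + r_content + eps)

def reinforce_loss(log_probs, rewards):
    """REINFORCE over k samples; the mean reward serves as a simple
    variance-reducing baseline (an assumption, not necessarily the
    paper's exact choice)."""
    baseline = sum(rewards) / len(rewards)
    return -sum(lp * (r - baseline) for lp, r in zip(log_probs, rewards))
```

A transfer that scores well on style but poorly on content (or vice versa) is penalized, since the harmonic mean is dominated by the smaller of the two rewards.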

  • annealing pseudo teacher-forcing
    • gets rid of the dependence on pseudo-parallel data.
    • enlarges the interval between teacher-forcing updates, so that the frequency of MLE parameter updates decays.
    • at iteration i, the teacher-forcing interval increases exponentially.

[Figure 8]
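The annealing schedule can be sketched in a few lines; the base of 2 and the cap are illustrative assumptions.

```python
# Sketch of annealed pseudo teacher-forcing: the interval between MLE
# (teacher-forcing) updates grows exponentially with the iteration count,
# so the model gradually relies less on pseudo-parallel data.

def teacher_forcing_interval(i, base=2, cap=512):
    """Exponentially increasing interval at iteration i (capped)."""
    return min(base ** i, cap)

def run_mle_update(iteration, last_mle_iteration, i):
    """True when enough RL steps have passed since the last MLE update."""
    return iteration - last_mle_iteration >= teacher_forcing_interval(i)
```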

Code

  • https://github.com/luofuli/DualLanST

Thoughts

  • doesn’t sound feasible without pre-training

  • universal solution
  • worth trying
  • IJCAI, filled with incremental work

SCALABLE SENTIMENT FOR SEQUENCE-TO-SEQUENCE CHATBOT RESPONSE WITH PERFORMANCE ANALYSIS

Overview

  • propose five models to scale or adjust the sentiment of the chatbot response: a persona-based model, reinforcement learning, a plug-and-play model, a sentiment transformation network, and CycleGAN, all based on the conventional seq2seq model.

  • develop two evaluation metrics to estimate if the responses are reasonable given the input.
  • reinforcement learning and CycleGAN were shown to be the most attractive.

Approach

  • persona-based model

    • adds extra information, a sentiment score, to the input of the decoder at each time step
    • the input of the decoder at every time step is then the concatenation of the word embedding and the sentiment score

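The decoder-input construction can be sketched in a few lines; the 300-dimensional embedding size is an illustrative assumption.

```python
import numpy as np

# Sketch of the persona-based decoder input: at every time step the
# decoder consumes the concatenation of the word embedding and a scalar
# sentiment score.

def decoder_input(word_emb, sentiment_score):
    return np.concatenate([word_emb, [sentiment_score]])

step_input = decoder_input(np.zeros(300), 0.9)  # a 301-dim decoder input
```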

  • Reinforcement Learning
    • semantic coherence 1: the response y should be semantically coherent with the input x, so a pre-trained seq2seq model estimates this coherence as a probability
    • semantic coherence 2: two RNN encoders represent the input x and the output y as two embeddings; these embeddings are concatenated and followed by an FC layer to produce a score between 0 and 1
    • sentiment score: based on the classifier above

[Figure 11]
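Semantic coherence 2 can be sketched as below; the encoders are stubbed with mean pooling and the FC weights are random, purely to make the scoring pipeline concrete.

```python
import numpy as np

# Sketch of "semantic coherence 2": encode input x and response y into two
# embeddings, concatenate them, and map through one FC layer plus a
# sigmoid to get a score between 0 and 1.

rng = np.random.default_rng(0)
dim = 64
fc_w = rng.normal(size=2 * dim)    # FC weights (illustrative)
fc_b = 0.0

def encode(token_embs):
    return token_embs.mean(axis=0)  # stand-in for an RNN encoder

def coherence_score(x_embs, y_embs):
    h = np.concatenate([encode(x_embs), encode(y_embs)])
    return 1.0 / (1.0 + np.exp(-(fc_w @ h + fc_b)))  # sigmoid
```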

  • Plug and Play Model


  • Sentiment Transformation Network
  • CycleGAN
    • the model is trained on two sets of sentences from a corpus with labeled sentiments: a positive sentiment set P and a negative sentiment set N
    • two discriminators, Dp and Dn, are trained; each takes a sequence of word embeddings as input and learns to distinguish whether the sequence comes from a real sentence or was generated by the model
    • the translator outputs continuous word embeddings, so the gradient can be back-propagated
    • to convert the output sequence of word embeddings back into a sentence, the words with the highest cosine similarity are selected
    • loss for Dp and Dn

[Figure 13]
[Figure 14]

  • gradient penalty

[Figure 15]
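The gradient penalty can be illustrated with a linear critic, whose input gradient is just its weight vector; this sidesteps the need for an autograd framework and is purely an assumption for the sketch (the paper's discriminators are neural networks).

```python
import numpy as np

# Sketch of the WGAN-style gradient penalty on word-embedding inputs:
# penalize the critic when its gradient norm at a random interpolation of
# real and fake inputs deviates from 1.

rng = np.random.default_rng(0)
w = rng.normal(size=128)                # linear critic D(e) = w . e

def critic_input_gradient(interp):
    return w                            # dD/de for a linear critic

def gradient_penalty(real_emb, fake_emb, lam=10.0):
    eps = rng.uniform()
    interp = eps * real_emb + (1.0 - eps) * fake_emb
    grad = critic_input_gradient(interp)
    return lam * (np.linalg.norm(grad) - 1.0) ** 2
```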

Evaluation Metrics

  • Sentiment coherence 1 and 2 (aforementioned in RL part)
  • Sentiment Classifier Score (aforementioned in Persona-based model)
  • Language Model Score

[Figure 17]
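The Language Model Score can be sketched as an average per-token log probability; the uniform stub model is an assumption so that the computation runs.

```python
import math

# Sketch of a language-model fluency score: average log probability of the
# response tokens under a language model (stubbed here as uniform over a
# 1000-word vocabulary). Higher means more fluent.

def lm_score(tokens, token_logprob=lambda tok, ctx: math.log(1 / 1000)):
    total = sum(token_logprob(t, tokens[:i]) for i, t in enumerate(tokens))
    return total / max(len(tokens), 1)
```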

Code

no code available

Thoughts

  • the Discriminator is simple, good
  • unique RL strategy, worth trying
  • could make use of these metrics

Style Transformer: Unpaired Text Style Transfer without Disentangled Latent Representation

Overview

  • propose the Style Transformer

  • makes no assumption about the latent representation of the source sentence
  • leverages the power of the Transformer's attention mechanism to achieve better style transfer and better content preservation


Approach

  • Style Transformer Network
    • To enable style control in the standard Transformer framework, add an extra style embedding as input to the Transformer encoder
    • the conditional probability for output sentence:

[Figure 19]
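Feeding the style embedding to the encoder can be sketched as follows; treating it as an extra prepended position and the dimensions used are illustrative choices.

```python
import numpy as np

# Sketch of the Style Transformer input: a learned style embedding is fed
# to the Transformer encoder alongside the token embeddings.

d_model, num_styles = 64, 2
style_table = np.random.default_rng(0).normal(size=(num_styles, d_model))

def encoder_input(token_embs, style_id):
    style_vec = style_table[style_id][None, :]      # (1, d_model)
    return np.concatenate([style_vec, token_embs])  # (1 + T, d_model)
```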

  • Intuition
    • we only have supervision for the case where the chosen style matches the input sentence x, because both come from the dataset D; one optimal solution is then simply to reproduce the input sentence.
    • for the other cases, we construct supervision in two ways.
    • for content preservation, we train the network to reconstruct the original input sentence x when the transferred sentence ŷ is fed back to the Style Transformer network together with the original style s.
    • for style control, we train a discriminator network (two architectures) to assist the Style Transformer network.
  • Discriminator Network
    • Conditional Discriminator: the discriminator is asked to answer whether the input sentence has the corresponding style.
    • Multi-class Discriminator: only one sentence is fed into the discriminator, which aims to answer the style of this sentence. The discriminator is a classifier with K+1 classes: the first K classes represent the K different styles, and the last class stands for data generated by the model, often referred to as fake samples.
  • Learning
    • Discriminator


  • Style Transformer
    • Self Reconstruction
    • Cycle Reconstruction
    • Style Controlling

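The three training signals can be combined as sketched below; the model and discriminator are replaced by toy stubs solely so the loss combination runs, and every interface name is an assumption.

```python
# Sketch of the three generator objectives: self reconstruction, cycle
# reconstruction, and style controlling via the discriminator.

class StubModel:                        # toy stand-in for the transformer
    def log_prob(self, target, source, style):
        return -0.1 * len(target)       # fake log-likelihood
    def sample(self, source, style):
        return source                   # identity "transfer"

class StubDiscriminator:                # toy stand-in for the style critic
    def log_prob(self, sentence, style):
        return -0.5

def generator_loss(f, d, x, s, s_hat):
    l_self = -f.log_prob(x, x, s)        # reconstruct x from (x, s)
    y_hat = f.sample(x, s_hat)           # transfer to the target style
    l_cycle = -f.log_prob(x, y_hat, s)   # reconstruct x from (y_hat, s)
    l_style = -d.log_prob(y_hat, s_hat)  # fool the style discriminator
    return l_self + l_cycle + l_style
```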

Code

https://github.com/fastnlp/style-transformer

Thoughts

  • impressive
  • we could try complete non-paired training as well
  • one discriminator is enough?

MULTIPLE-ATTRIBUTE TEXT REWRITING

Overview

  • propose a new model that controls several factors of variation in textual data, where the usual disentanglement condition is replaced with a simpler mechanism based on back-translation

  • the method allows control over multiple attributes, like gender, sentiment, and product type, and finer-grained control over the trade-off between content preservation and change of style via a pooling operator in the latent space

Approach

  • ARE ADVERSARIAL MODELS REALLY DOING DISENTANGLEMENT?

    • disentanglement may not be achieved in practice
  • Denoising Auto-Encoder:
    • a natural way to learn a generator that is both fluent and able to reconstruct the input, including both the content and the attribute
    • but a weak way to learn how to change the style, so back-translation is used to force the decoder to also leverage the externally provided attribute information
    • we take an input (x, y) and encode x into z, but then decode using another set of attribute values ỹ, yielding the reconstruction x̃. We then use x̃ as input to the encoder and decode it using the original y, training the model to map (x̃, y) back to x
  • Attribute Conditioning:
    • In order to handle multiple attributes, we separately embed each target attribute value and then average their embeddings. We then feed the averaged embeddings to the decoder as a start-of-sequence symbol
  • Latent Representation Pooling:
    • to control the amount of content preservation, we use pooling
  • Model Architecture
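The attribute-conditioning step above can be sketched as below; the attribute names, table sizes, and embedding dimension are illustrative.

```python
import numpy as np

# Sketch of attribute conditioning: embed each target attribute value,
# average the embeddings, and use the result as the decoder's
# start-of-sequence input.

dim = 32
tables = {   # one embedding table per attribute
    "gender": np.random.default_rng(0).normal(size=(2, dim)),
    "sentiment": np.random.default_rng(1).normal(size=(2, dim)),
}

def start_of_sequence(attr_values):
    """attr_values maps attribute name -> chosen value index."""
    embs = [tables[name][idx] for name, idx in attr_values.items()]
    return np.mean(embs, axis=0)
```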

Code

https://github.com/facebookresearch/MultipleAttributeTextRewriting

Thoughts

  • the purely DAE-based approach is appealingly simple

  • could apply to our future work

Style Transfer in Text: Exploration and Evaluation

Overview

  • propose two models to achieve non-paired training

  • propose two novel evaluation metrics that measure two aspects of style transfer: transfer strength and content preservation

    Approach

  • multi-decoder


  • style-embedding


Code

https://github.com/fuzhenxin/text_style_transfer

Thoughts

  • outdated
  • disentanglement is nearly impossible
  • could be used as a reference

Adversarial Learning for Neural Dialogue Generation

Overview

  • RL: Generator + Discriminator

  • describe a model for adversarial evaluation

Approach

  • Adversarial REINFORCE

    • Generative model
    • Discriminative model: a binary classifier that takes as input a sequence of dialogue utterances {x, y} and outputs a label indicating whether the input is generated by humans or machines. The input dialogue is encoded into a vector representation using a hierarchical encoder, which is then fed to a 2-class softmax function, returning the probability of the input dialogue episode being a machine-generated dialogue.
    • Policy Gradient: the reward for the current utterance is the probability, assigned by the discriminator, that it is human-generated. The concatenation of the generated utterance y and the input x is fed to the discriminator.
  • Reward for Every Generation Step
    • the vanilla REINFORCE model assigns the same reward to every token in the machine-generated response, which is considered inappropriate.
    • proper credit assignment in training would give separate rewards, high rewards for good tokens, whereas low for bad ones. Rewards for intermediate steps or partially decoded sequences are thus necessary.
    • two strategies for computing intermediate step rewards: using Monte Carlo (MC) search and training a discriminator that is able to assign rewards to partially decoded sequences.
    • in Monte Carlo search, given a partially decoded sequence s, the model keeps sampling tokens from its distribution until decoding finishes. This process is repeated N (set to 5) times, and the N generated sequences share the common prefix s. These N sequences are fed to the discriminator, and their average score is used as the reward for s.
    • alternatively, directly train a discriminator that can assign rewards to both fully and partially decoded sequences: break the human- and machine-generated sequences into partial sequences, and use all partial sequences of the human-generated responses as positive examples and those of the machine-generated responses as negative examples.
      • this approach can cause over-fitting, since the early tokens appear in every sub-sequence
      • an AlphaGo-like trick is applied: randomly sample only one partial sequence from the positive set and one from the negative set for training.
      • time-efficient, but with worse performance
  • Teacher Forcing
    • this training strategy is fragile: once the generator (accidentally) deteriorates in some training batches and the discriminator consequently does an extremely good job at recognizing sequences from the generator, the generator immediately gets lost.
    • to alleviate this issue and give the generator more direct access to the gold-standard targets, human-generated responses are also fed to the generator for model updates.
    • the discriminator and generator are thus also trained on real data.
    • this modification is the same as the standard training of SEQ2SEQ models, making the final training alternately update the SEQ2SEQ model with the adversarial objective and the MLE objective.

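The Monte Carlo search for intermediate rewards can be sketched as follows; the sampler and discriminator are passed in as plain callables, which is an assumption for illustration.

```python
# Sketch of Monte Carlo search: roll out N completions of a partial
# response, score each full sequence with the discriminator, and use the
# average score as the reward for the shared prefix.

def mc_reward(prefix, sample_completion, discriminator, n=5):
    scores = []
    for _ in range(n):
        full = prefix + sample_completion(prefix)  # rollout to the end
        scores.append(discriminator(full))
    return sum(scores) / n
```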

Code

https://github.com/liuyuemaicha/Adversarial-Learning-for-Neural-Dialogue-Generation-in-Tensorflow
https://github.com/CatherineWong/dancin_seq2seq
https://github.com/jsbaan/DPAC-DialogueGAN

Thoughts

  • very classic, definitely works
  • must-try
  • finally found a GAN that works for seq2seq
  • could use its metrics for evaluating GAN

Unpaired Sentiment-to-Sentiment Translation: A Cycled Reinforcement Learning Approach

Overview

  • propose a cycled reinforcement learning method that enables training on unpaired data by collaboration between a neutralization module and an emotionalization module

  • tackles the bottleneck of keeping semantic information by explicitly separating sentiment information from semantic content

Approach

  • Neutralization Module: first extracts non-emotional semantic content

    • use a single LSTM to generate the probability of being neutral or being polar for every word in a sentence
    • this module is responsible for producing a neutralized sequence
  • Emotionalization Module: attaches sentiment to the semantic content
    • use a bi-decoder based encoder-decoder framework, which contains one encoder and two decoders
    • One decoder adds the positive sentiment and the other adds the negative sentiment


  • Cycled Reinforcement Learning
    • refer to the neutralization module as the first agent and the emotionalization module as the second
    • in cycled training, the original sentence can be viewed as the supervision for training the second agent
    • the reward consists of two parts: sentiment confidence and BLEU

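The two-part reward can be sketched as below; BLEU is reduced to unigram precision for brevity, and the harmonic-mean combination is an illustrative assumption rather than the paper's exact formula.

```python
from collections import Counter

# Sketch of the cycled-RL reward: combine the sentiment classifier's
# confidence with a BLEU-style overlap against the original sentence.

def unigram_precision(candidate, reference):
    cand, ref = Counter(candidate), Counter(reference)
    overlap = sum(min(c, ref[w]) for w, c in cand.items())
    return overlap / max(len(candidate), 1)

def cycled_reward(sentiment_conf, candidate, reference, eps=1e-8):
    bleu = unigram_precision(candidate, reference)
    return 2.0 * sentiment_conf * bleu / (sentiment_conf + bleu + eps)
```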

Code

https://github.com/lancopku/unpaired-sentiment-translation

Thoughts