Pre-trained language models (PLMs) have excelled in a variety of natural language processing (NLP) tasks. Based on their training objectives, PLMs are most commonly classified as auto-encoding or auto-regressive. Bidirectional Encoder Representations from Transformers (BERT), which models input text through deep transformer layers and creates deep contextualized representations, is a representative auto-encoding PLM. The Generative Pre-training (GPT) model is a good example of an auto-regressive PLM.

The masked language model (MLM) is the most common pre-training task for auto-encoding PLMs. In MLM, a few input tokens are replaced with masking tokens (i.e., [MASK]), and the goal is to recover the original tokens over the vocabulary space. MLM has a simple formulation, but it can model the contextual information around the masked token, which is akin to the Continuous Bag of Words (CBOW) model of word2vec.
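The masking step described above can be sketched in a few lines of Python. This is a simplified illustration, not BERT's actual procedure: the function name `mask_tokens`, the 15% default rate, and the fixed seed are assumptions made for the example, and the sketch always substitutes [MASK] for a selected token.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Replace a random subset of tokens with [MASK] (hypothetical helper).

    Returns (masked, labels): labels[i] holds the original token at masked
    positions and None everywhere else, so a loss would only be computed
    on the masked positions.
    """
    rng = random.Random(seed)
    # Mask at least one token, but never more than the sequence holds.
    n_mask = min(len(tokens), max(1, round(len(tokens) * mask_prob)))
    positions = set(rng.sample(range(len(tokens)), n_mask))
    masked = [MASK if i in positions else t for i, t in enumerate(tokens)]
    labels = [t if i in positions else None for i, t in enumerate(tokens)]
    return masked, labels

tokens = "the quick brown fox jumps over the lazy dog".split()
masked, labels = mask_tokens(tokens, mask_prob=0.3)
print(masked)
print(labels)
```

Each non-None label is a token from the vocabulary, which is why the MLM prediction space is the whole vocabulary.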

Several modifications to the MLM pre-training task have been proposed to improve its performance, such as whole word masking and N-gram masking. ERNIE, RoBERTa, ALBERT, ELECTRA, MacBERT and other PLMs have been proposed alongside these MLM-like pre-training tasks.
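To illustrate how whole word masking differs from token-level masking, here is a minimal sketch. It assumes WordPiece-style tokens in which a leading "##" marks a continuation of the previous word. The helper names `group_whole_words` and `mask_whole_words` are hypothetical and do not come from any of the models named above.

```python
def group_whole_words(pieces):
    """Group token indices so that '##' continuation pieces join the preceding word."""
    groups = []
    for i, piece in enumerate(pieces):
        if piece.startswith("##") and groups:
            groups[-1].append(i)
        else:
            groups.append([i])
    return groups

def mask_whole_words(pieces, word_ids):
    """Mask every piece of each selected word, never only a fragment of it."""
    groups = group_whole_words(pieces)
    to_mask = {i for w in word_ids for i in groups[w]}
    return ["[MASK]" if i in to_mask else p for i, p in enumerate(pieces)]

pieces = ["the", "un", "##believ", "##able", "story"]
print(group_whole_words(pieces))      # [[0], [1, 2, 3], [4]]
print(mask_whole_words(pieces, [1]))  # ['the', '[MASK]', '[MASK]', '[MASK]', 'story']
```

N-gram masking extends the same idea by selecting runs of consecutive words rather than single words.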

However, a natural question arises: can a task other than MLM be used for pre-training? In recent work, researchers study a pre-training task that is not derived from MLM in order to answer this question. The original motivation behind the approach is intriguing. There are many sayings to the effect that "permuting several Chinese characters has little effect on your reading".

Although some words in such a sentence are out of order, its essential meaning can still be understood. The team is intrigued by this phenomenon and wonders whether contextual representations can be learned from permuted sentences. To study this question, the team presents a new pre-training task called the Permuted Language Model (PerLM). PerLM attempts to recover the word order of a shuffled sentence by predicting the position of each original token.


PERT and BERT have the same neural architecture, but their inputs and training objectives differ slightly. PERT takes shuffled text as input, and its training objective is to predict the position of the original token. The main features of PERT are as follows.

• Apart from replacing MLM with PerLM, PERT essentially follows the original BERT architecture, including tokenization (using WordPiece), vocabulary (adopted directly), etc.

• PERT does not use the artificial masking token [MASK].

• To improve performance, the researchers apply both whole word masking and N-gram masking strategies when selecting candidate tokens.

• The prediction space depends on the length of the input sequence rather than on the entire vocabulary (as in MLM).

• PERT can directly replace BERT with proper fine-tuning because its main body is the same as BERT's.
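The PerLM objective outlined in the features above can be sketched as follows. This is a rough illustration under stated assumptions, not the authors' implementation: the function name `permute_tokens` is hypothetical, and the labeling convention (at each selected position, the target is the index where the original token now sits) is one plausible reading of "predicting the position of the original token".

```python
import random

def permute_tokens(tokens, positions, seed=0):
    """Shuffle the tokens at `positions` among themselves (hypothetical helper).

    Returns (permuted, labels). For each selected position i, labels[i] is
    the index in `permuted` that now holds the token originally at i.
    Unselected positions keep their token and get the label None.
    """
    rng = random.Random(seed)
    positions = list(positions)
    targets = positions[:]
    rng.shuffle(targets)
    permuted = list(tokens)
    labels = [None] * len(tokens)
    for src, dst in zip(positions, targets):
        permuted[dst] = tokens[src]  # the token from src moves to dst
        labels[src] = dst            # at src, the answer is "look at dst"
    return permuted, labels

tokens = ["the", "cat", "sat", "on", "the", "mat"]
permuted, labels = permute_tokens(tokens, positions=[1, 2, 3])
print(permuted)
print(labels)
# Every label is an index into the input, so the output layer would
# classify over len(tokens) positions rather than over the vocabulary.
print("prediction space size:", len(tokens))
```

No [MASK] token appears in the permuted input, and the labels are positions rather than vocabulary entries. This is consistent with the points above that PERT avoids the artificial masking token and that its prediction space scales with the input length.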

To test its effectiveness, the team pre-trains both Chinese and English PERT models. Extensive experiments, ranging from sentence level to document level, are carried out on Chinese and English NLP datasets, covering machine reading comprehension, text classification, and more. The results demonstrate that the proposed PERT helps on some tasks. Meanwhile, the researchers also find that it falls short on others.

The Word Order Recovery task is tested on the following four domains (train/dev): Wikipedia (990K/86K), Formal Doc. (1.4M/33K), Customs (682K/34K) and Legal (1.8M/13K). The researchers report precision, recall, and F1 scores. On all evaluation measures (P/R/F), PERT produces consistent and significant gains over all baseline systems. This is in line with expectations, given that the fine-tuning task is quite similar to the PERT pre-training task. Even though fine-tuning is done in a sequence-tagging mode (similar to NER), it can still benefit from PERT pre-training, which focuses on placing words in the correct order.
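For readers unfamiliar with these metrics, the sketch below computes precision, recall, and F1 for a sequence-tagging setup. It assumes, purely for illustration, that each token carries a binary tag (1 = out of order, 0 = in place). The tag scheme and the function name `precision_recall_f1` are assumptions and may differ from the evaluation protocol actually used.

```python
def precision_recall_f1(gold, pred, positive=1):
    """Compute P/R/F1 for one positive tag over aligned tag sequences."""
    tp = sum(1 for g, p in zip(gold, pred) if g == positive and p == positive)
    fp = sum(1 for g, p in zip(gold, pred) if g != positive and p == positive)
    fn = sum(1 for g, p in zip(gold, pred) if g == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

gold = [0, 1, 1, 0, 1, 0]  # hypothetical gold tags
pred = [0, 1, 0, 0, 1, 1]  # hypothetical model predictions
# tp = 2, fp = 1, fn = 1, so precision = recall = F1 = 2/3
print(precision_recall_f1(gold, pred))
```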


In this work, the researchers propose PERT, a new pre-trained language model that uses the Permuted Language Model (PerLM) as its pre-training task. Unlike MLM-like pre-training tasks, PerLM aims to predict the position of the original token in shuffled input text. The researchers conducted comprehensive experiments on NLU tasks in Chinese and English to assess PERT's performance. The results suggest that PERT improves performance on machine reading comprehension (MRC) and named entity recognition (NER) tasks. PERT also undergoes additional quantitative analyses to better understand the model and the necessity of each design choice. The researchers hope that PERT will encourage others to design non-MLM-like pre-training tasks for text representation learning.