ACL 2019 Highlights
This post discusses highlights from the main conference of the 2019 Annual Meeting of the Association for Computational Linguistics (ACL 2019). These notes are written with business applications in mind.
The conference accepted 660 papers (a 22.7% acceptance rate), ran 6 parallel oral sessions plus one poster session, and drew over 3,000 participants. It was sponsored by various big industrial players from the NLP research community (full list of sponsors).
The presidential address tackled several topics; the following stood out to me most:
- Error analysis, interpretability and reproducibility
- Multilinguality, low resource tasks and domain adaptation
- Commonsense, reasoning and context aware modeling
- Ethical NLP: Bias and environmental impact
Error analysis, interpretability and reproducibility
Deep Learning models are very sensitive to noisy input. In machine translation, for instance, small variations in the user’s input can reveal bias in the language modeling and training data. To limit these drawbacks, Cheng et al. (Google) suggest perturbing the model with adversarial inputs using an algorithm named Adversarial Generation (AdvGen), which generates plausible adversarial examples and feeds them back into the model. This idea is inspired by GANs but doesn’t rely on a discriminator: instead, it simply augments the training data with these adversarial examples (more on AdvGen here).
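To make the data augmentation idea concrete, here is a minimal sketch of greedy adversarial word substitution in the spirit of AdvGen (not the paper’s actual algorithm); the `translation_loss` function and the embedding dictionary are hypothetical stand-ins for the NMT model’s loss and its word embeddings:

```python
import numpy as np

def nearest_neighbors(word, embeddings, k=5):
    """Return the k words whose embeddings are closest (cosine) to `word`."""
    v = embeddings[word]
    sims = {w: np.dot(v, embeddings[w]) /
               (np.linalg.norm(v) * np.linalg.norm(embeddings[w]) + 1e-9)
            for w in embeddings if w != word}
    return sorted(sims, key=sims.get, reverse=True)[:k]

def adversarial_example(src_tokens, tgt_tokens, embeddings, translation_loss):
    """Greedily replace one source word with the neighbor that most increases the loss."""
    best_tokens, best_loss = list(src_tokens), translation_loss(src_tokens, tgt_tokens)
    for i, word in enumerate(src_tokens):
        if word not in embeddings:
            continue
        for cand in nearest_neighbors(word, embeddings):
            perturbed = list(src_tokens)
            perturbed[i] = cand
            loss = translation_loss(perturbed, tgt_tokens)
            if loss > best_loss:
                best_tokens, best_loss = perturbed, loss
    return best_tokens

# The adversarial pairs are then simply appended to the training set:
# train_data += [(adversarial_example(s, t, emb, loss_fn), t) for s, t in train_data]
```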
Interpretability also keeps coming back in different papers, showing a general interest in understanding and explaining models’ inner mechanisms. Serrano et al. argue that attention weights don’t always identify the information that models find important: across many experiments, attention weights mostly don’t correlate with the model outputs, whereas gradient-based rankings of attention weights better predict their effects (to be developed further in future papers). On the other hand, Bastings et al. show that sparse re-parameterized samples and unbiased gradients lead to effective latent rationales (a short, informative part of the input text) for text classification. The approach consists of training two models: a latent model that selects the rationale, which is then fed to a second model that performs the classification task.
Equally, many papers try to explain models’ capabilities and low-level representations. For instance, Yang et al. show that SANs (self-attention networks, popular for their parallelization and strong performance on various NLP tasks such as machine translation) cannot efficiently learn positional information, even with position embeddings, when trained on a word reordering detection task. Nevertheless, SANs learn better positional information than RNNs when trained on a different downstream task such as MT. When it comes to multi-head self-attention, Voita et al. show that the most important (in magnitude) and most confident heads play consistent and often linguistically interpretable roles. Furthermore, they show that pruning heads (using a method based on stochastic gates and a differentiable relaxation of the L0 penalty) makes it possible to remove the vast majority of heads with little or no impact on performance. More generally, Jawahar et al. (Inria) investigate which structures of language BERT learns. They show that BERT learns phrase-level information in the lower layers and a hierarchy of linguistic information in the intermediate layers (surface features at the bottom, syntactic features in the middle and semantic features at the top). They also show that BERT requires the deeper layers to encode long-distance information (e.g. to track subject-verb agreement).
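To illustrate the pruning mechanism, here is a rough numpy sketch of the general stochastic-gate / L0-relaxation technique (the Hard Concrete gate of Louizos et al.), not Voita et al.’s exact implementation: each head’s output is multiplied by a sampled gate, and the expected L0 penalty drives most gates to exactly zero so the corresponding heads can be removed.

```python
import numpy as np

# Hard Concrete parameters: stretch limits and temperature.
GAMMA, ZETA, BETA = -0.1, 1.1, 0.66

def sample_gates(log_alpha, rng):
    """Sample one stochastic gate per attention head (values in [0, 1])."""
    u = rng.uniform(1e-6, 1.0 - 1e-6, size=log_alpha.shape)
    s = 1.0 / (1.0 + np.exp(-(np.log(u) - np.log(1 - u) + log_alpha) / BETA))
    return np.clip(s * (ZETA - GAMMA) + GAMMA, 0.0, 1.0)

def expected_l0(log_alpha):
    """Differentiable relaxation of the L0 penalty: probability each gate is non-zero."""
    return 1.0 / (1.0 + np.exp(-(log_alpha - BETA * np.log(-GAMMA / ZETA))))

rng = np.random.default_rng(0)
log_alpha = np.zeros(8)                 # one learnable parameter per head
gates = sample_gates(log_alpha, rng)    # multiply each head's output by its gate
penalty = expected_l0(log_alpha).sum()  # added to the training loss
```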
Further details and resources about interpretability will be shared at the BlackboxNLP workshop, following the main conference (live streaming link).
On another note, there is a growing interest in robust evaluation and reproducibility within the NLP community. Indeed, the benchmark leaderboard race often comes with small core changes and “public hyper-parameter search with more training data” (e.g. RoBERTa), which leads to over-fitting benchmark datasets and drives attention away from more fundamental research and new ideas. One of the selected “outstanding papers” deals with this issue: Gorman et al. show that system rankings based on standard splits fail to reproduce, and recommend “Bonferroni-corrected random split hypothesis testing” instead. The best demo paper also provides an open-source framework for machine translation quality estimation called OpenKiwi.
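As a rough illustration of the recommended protocol (a sketch under assumptions, not Gorman et al.’s exact recipe), one can compare two systems over several random splits, run a McNemar test per split, and use a Bonferroni-corrected significance threshold; the `fit_and_predict_*` functions are hypothetical stand-ins for the two systems:

```python
import numpy as np
from scipy.stats import binomtest

def mcnemar_p(correct_a, correct_b):
    """Exact McNemar test on per-example correctness of two systems."""
    b = int(np.sum(correct_a & ~correct_b))   # A right, B wrong
    c = int(np.sum(~correct_a & correct_b))   # B right, A wrong
    if b + c == 0:
        return 1.0
    return binomtest(min(b, c), b + c, 0.5).pvalue

def compare_over_random_splits(X, y, fit_and_predict_a, fit_and_predict_b,
                               n_splits=20, test_frac=0.1, alpha=0.05, seed=0):
    """Compare two systems across random splits with a Bonferroni-corrected threshold."""
    rng = np.random.default_rng(seed)
    corrected_alpha = alpha / n_splits        # Bonferroni correction
    significant = 0
    for _ in range(n_splits):
        idx = rng.permutation(len(y))
        n_test = int(test_frac * len(y))
        test, train = idx[:n_test], idx[n_test:]
        pred_a = fit_and_predict_a(X[train], y[train], X[test])
        pred_b = fit_and_predict_b(X[train], y[train], X[test])
        p = mcnemar_p(pred_a == y[test], pred_b == y[test])
        significant += p < corrected_alpha
    return significant  # number of splits with a significant difference
```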
Multilinguality, low resource tasks and domain adaptation
Since last year, interest in multilingual training to support low-resource languages has become more pronounced. Indeed, most NLP benchmarks and SOTA results are reported on English, whereas real-life applications need other languages as well. This has led to a wave of cross-lingual and multilingual research results. Ormazabal et al. argue that most cross-lingual embeddings are learnt with offline methods (learn embeddings in different languages, then map them to a shared space), which rely on the isomorphism assumption, i.e. that embeddings in different languages have the same structure. They investigate whether the limitations of these methods come from the mapping step or from the way the embeddings are learnt. They show that joint learning (using an extension of the skip-gram model with bilingual data) yields more isomorphic embeddings and better bilingual lexicon induction. Thus, they conclude that mapping methods have strong limitations and call for more research in joint learning. Indeed, many research groups have been interested in aligning cross-lingual embeddings to build bilingual dictionaries and induce word translation pairs through nearest-neighbor or related retrieval methods. Furthermore, Artetxe et al. suggest inducing bilingual lexicons through a different approach: generating a synthetic parallel corpus with a phrase-based translation system (built using cross-lingual embeddings) and then extracting the bilingual lexicon using statistical word alignment methods. Thus, they use no more resources than the monolingual corpora used to learn the embeddings.
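For context, bilingual lexicon induction from a shared embedding space often boils down to nearest-neighbor retrieval (possibly with hubness corrections such as CSLS, omitted here). A minimal sketch, assuming hypothetical word-to-vector dictionaries that already live in the same cross-lingual space:

```python
import numpy as np

def induce_lexicon(src_emb, tgt_emb, k=1):
    """Nearest-neighbor bilingual lexicon induction in a shared embedding space.

    src_emb / tgt_emb: dicts mapping words to vectors in the same cross-lingual
    space (e.g. after offline mapping or joint training).
    """
    tgt_words = list(tgt_emb)
    tgt_matrix = np.stack([tgt_emb[w] for w in tgt_words])
    tgt_matrix /= np.linalg.norm(tgt_matrix, axis=1, keepdims=True)

    lexicon = {}
    for word, vec in src_emb.items():
        sims = tgt_matrix @ (vec / np.linalg.norm(vec))   # cosine similarities
        best = np.argsort(-sims)[:k]
        lexicon[word] = [tgt_words[i] for i in best]
    return lexicon
```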
The work on cross-lingual and multilingual NLP has led the community to take a firmer position on how papers present results across languages. The Bender Rule (ref. Emily M. Bender) states that any work should name the language(s) it works on rather than defaulting to English, since that default emphasizes English as a reference language (even though it doesn’t capture multilingual variation) and relegates other languages to second place. On the one hand, Mielke et al. investigate which languages are harder to model, given the observation that SOTA methods do not perform equally well across high-resource languages (from the Europarl corpus). To do so, they start by defining an objective measure of difficulty using parallel corpora covering 69 languages (13 language families) and propose a paired-sample multiplicative mixed-effects model to obtain language difficulty coefficients, handling missing data and inter-sentence variation. On the other hand, Rahimi et al. propose a “massively” multilingual transfer approach for NER models, applied to emergency response. They show that, when using cross-lingual embeddings, some language pairs yield better performance than ensemble voting models, with unexpected pairs (Indonesian is best transferred from Italian, suggesting that these results depend on the embeddings used). Although transfer results are best within language families, they remain noisy overall, so the authors suggest transferring from multiple source languages. The main idea is to encode the input sentence into a cross-lingual representation, then use models trained on high-resource languages with cross-lingual representations to predict the tags. More details in the slides. In another work, on multilingual BERT, Pires et al. show that transfer doesn’t really depend on vocabulary overlap but rather on typological similarity (e.g. SVO vs. SOV word order); indeed, they noticed a performance drop when the word order changes. Furthermore, they show that mBERT is good at transferring mixed-language (code-switching) input but not transliterated targets. Finally, they confirm that translations have similar representations.
Furthermore, in the presidential address, Ming Zhou listed topics to pursue for low-resource tasks, including transfer learning, unsupervised learning, cross-language learning and prior knowledge with a human in the loop. For instance, Wang et al. suggest using the most related high-resource language to improve low-resource NMT, instead of using all available multilingual data. Moreover, they show that an intelligent selection method over the other auxiliary languages can further improve performance. To do so, they propose an algorithm dubbed Target Conditioned Sampling (TCS), which first samples a target sentence and then conditionally samples its source sentence. On a different note, Huck et al. propose an approach to better translate OOVs (out-of-vocabulary words) in MT systems. These words are usually represented with BPE tokens, but that can lead to bad translations for languages like German. The main idea here is to translate the OOVs using bilingual embeddings, keeping track of 5 candidate translations, and then back-translate the targets while forcing the target OOV to translate to the source OOV. By fine-tuning the model with this synthetic data, they report better OOV translation performance. Moreover, Xia et al. suggest an approach for using monolingual data when training low-resource translation models that pivots through a related high-resource language (HRL). First, they inject low-resource language (LRL) words into HRL sentences using bilingual dictionaries. Second, they edit the modified sentences using an unsupervised MT framework. They show that the proposed method outperforms back-translation. Moreover, in an attempt to reuse elaborate methods in low-resource tasks, Beryozkin et al. (Google) use a tag hierarchy to adapt a pre-trained NER model with no additional training data. The main idea consists in constructing a tag hierarchy, training the model with the highest level of tags as targets and back-propagating with the fine-grained tags.
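As a toy illustration of the two-step sampling in TCS (the actual sampling distributions in the paper differ), one can first sample a target sentence and then sample its source from the auxiliary languages in proportion to an assumed similarity-to-LRL score:

```python
import random

def tcs_sample(parallel_data, lang_similarity, rng=random):
    """parallel_data: dict mapping a target sentence -> {aux_lang: source sentence}.
    lang_similarity: dict mapping aux_lang -> similarity to the low-resource language."""
    target = rng.choice(list(parallel_data))            # step 1: sample a target sentence
    sources = parallel_data[target]
    langs = list(sources)
    weights = [lang_similarity.get(l, 0.0) for l in langs]
    lang = rng.choices(langs, weights=weights, k=1)[0]  # step 2: conditional source sample
    return sources[lang], target
```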
As for speech translation, recent papers have shown that it can be done with end-to-end models (avoiding cascaded models: STT ➜ MT), but under the unrealistic assumption of equally sized training data. Sperber et al. show that end-to-end speech translation models require more data to achieve the same performance as cascaded models. To address this, they propose training end-to-end multi-task models with two attention mechanisms: the first establishes source speech to source text alignments, the second models source to target text alignment. Furthermore, they introduce an attention-passing technique that alleviates error propagation issues.
Commonsense, reasoning and context aware modeling
This topic comes back in every session dealing with QA; it seems more challenging, with a certain lag between research results and industry applications. Indeed, the research community is focusing more on seq2seq models, which do not yet ensure a good user experience, in both chit-chat and task-oriented systems. Consequently, industrial systems still rely more on separate NLU, DST, DM and NLG components. In the second keynote, Pascal Fung explained various aspects of QA systems and the ethical challenges they raise, including, in certain use cases, presenting them as humans and thus deceiving the user at the other end.
Research groups in companies as well as in academia are focusing efforts on building meaningful datasets, intended to bring QA / reasoning systems closer to the end user. For example, Google Research presented a dataset paper with user-formulated requests matched to their answers from Wikipedia articles, at two levels of granularity: the paragraph containing the answer and the entity corresponding to the exact answer to the question. This pushes end-to-end systems to deal directly with the user’s query and infer the exact information to look for. Here is the link to the data. On the one hand, Xiong et al. propose a new end-to-end question answering model, which learns to aggregate answer evidence from an incomplete knowledge base (KB) and a set of retrieved text snippets. On the other hand, Lewis et al. follow the insights from Lample et al. on unsupervised machine translation to investigate unsupervised extractive QA. They suggest various methods to perform cloze-to-natural-question translation and show that modern QA systems perform well even when trained only on synthetic data. Moreover, Talmor et al. investigate generalization and transfer in reading comprehension tasks and show that training on a source RC dataset and transferring to a target dataset substantially improves performance. They propose MultiQA, a BERT-based model trained on multiple RC datasets, which leads to state-of-the-art performance on 5 RC datasets.
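To give a feel for the synthetic-data idea in unsupervised extractive QA, here is a toy sketch of the cloze-generation step only (the cloze-to-natural-question translation, the interesting part of the paper, is omitted); the candidate answer spans are assumed to come from some external tagger:

```python
def make_cloze_examples(paragraph, answer_spans, mask_token="[MASK]"):
    """Turn a paragraph and its candidate answer spans into (cloze question, answer) pairs.

    answer_spans: list of (start, end) character offsets of candidate answers.
    The cloze questions would then be rewritten into natural questions
    (e.g. with an unsupervised seq2seq model) before training the QA system.
    """
    examples = []
    for start, end in answer_spans:
        answer = paragraph[start:end]
        cloze = paragraph[:start] + mask_token + paragraph[end:]
        examples.append({"question": cloze, "answer": answer, "context": paragraph})
    return examples

examples = make_cloze_examples("ACL 2019 was held in Florence.", answer_spans=[(21, 29)])
```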
Moreover, there seems to be significant interest in building systems with in-domain data that ensure context-aware predictions. Studies show that systems lacking context in real-life applications (away from the SOTA and leaderboard benchmarks on very specific, well-studied datasets) can lead to biased decision making and erroneous predictions. Sap et al. show that systems trained on out-of-domain data lead to strongly biased decisions and dangerous censorship.
On a different note, the Transformer-XL paper by Dai et al. was presented during the conference. The main idea is to go beyond the fixed-length context (a long text sequence is otherwise truncated into fixed-length segments processed separately) using two techniques: a segment-level recurrence mechanism and a relative positional encoding scheme. More details here. Furthermore, the first keynote, by Liang Huang, introduced a new architecture dubbed prefix-to-prefix, an evolution of sequence-to-sequence that takes into account the temporal evolution of the captured context in simultaneous translation. More in the paper.
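The simplest instance of prefix-to-prefix decoding is the wait-k policy, where the decoder always lags k tokens behind the incoming source. A toy sketch, with `predict_next_token` as a hypothetical stand-in for the trained model:

```python
def wait_k_decode(source_tokens, predict_next_token, k=3, eos="</s>", max_len=100):
    """Prefix-to-prefix (wait-k) simultaneous decoding sketch.

    source_tokens: iterable over incoming source tokens.
    predict_next_token(src_prefix, tgt_prefix) -> next target token (hypothetical model call).
    """
    src_prefix, tgt_prefix = [], []
    for token in source_tokens:
        src_prefix.append(token)
        if len(src_prefix) >= k:      # start emitting once k source tokens have been read
            tgt_prefix.append(predict_next_token(src_prefix, tgt_prefix))
    # source exhausted: finish the translation
    while len(tgt_prefix) < max_len and (not tgt_prefix or tgt_prefix[-1] != eos):
        tgt_prefix.append(predict_next_token(src_prefix, tgt_prefix))
    return tgt_prefix
```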
Ethical NLP: Bias and environmental impact
There was a significant number of papers dealing with debiasing NLP methods at ACL 2019. Indeed, recent studies have shown that pre-trained word embeddings and NLP resources carry a certain level of gender and geographical bias, among others. Since these models can be used in decision-making systems, ethical concerns have emerged, as illustrated by the wide interest of the research community. For instance, Sun et al. present a literature overview of gender bias mitigation techniques. Furthermore, Zmigrod et al. show that commonly employed debiasing approaches produce ungrammatical sentences in morphologically rich languages and present a novel approach for converting between masculine-inflected and feminine-inflected sentences. They test their approach on 4 languages, showing bias reduction without harming grammaticality. On demographic bias, Sweeney et al. argue that most demographic bias evaluation approaches rely on vector-space-based metrics like the Word Embedding Association Test (WEAT), which don’t measure the impact on downstream tasks. They suggest a new metric (Relative Negative Sentiment Bias, RNSB) that measures the relative negative sentiment associated with demographic identity terms. Moreover, a position paper by Jurgens et al. argues that the community needs to make three substantive changes to address online abuse: first, expanding the scope of problems to tackle both more subtle and more serious forms of abuse; second, developing proactive technologies that counter or inhibit abuse before it harms; and third, reframing these efforts within a framework of justice to promote healthy communities.
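As a rough sketch of how such a downstream metric can work (in the spirit of RNSB, not necessarily the paper’s exact formulation): score each identity term with a sentiment classifier, normalise the negative-sentiment probabilities into a distribution, and measure its divergence from uniform; `negative_sentiment_prob` is a hypothetical stand-in for a classifier trained on word embeddings.

```python
import numpy as np

def relative_negative_sentiment_bias(identity_terms, negative_sentiment_prob):
    """How unevenly is negative sentiment spread across demographic identity terms?

    negative_sentiment_prob(term) -> P(negative) from a sentiment classifier.
    Returns the KL divergence between the normalised negative-sentiment
    distribution over identity terms and a uniform distribution.
    """
    probs = np.clip(np.array([negative_sentiment_prob(t) for t in identity_terms]), 1e-9, None)
    dist = probs / probs.sum()                              # normalise to a distribution
    uniform = np.full(len(identity_terms), 1.0 / len(identity_terms))
    return float(np.sum(dist * np.log(dist / uniform)))     # KL divergence from uniform
```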
More details on the workshop for Gender Bias in Natural Language Processing and WiNLP workshop.
Another interesting paper by Strubell et al. illustrates the environmental impact of training and especially tuning SOTA NLP models. The authors quantify the approximate financial and environmental costs of training a variety of recent neural network models for NLP and propose actionable recommendations to reduce costs and improve equity in NLP research and practice.
See also
- Video recordings
- NeuLab Presentations at ACL 2019
- Unsupervised Cross-lingual Representation Learning tutorial
- Best paper nominations here.
- Trends in Natural Language Processing: ACL 2019 In Review by Mihail Eric.
- Knowledge Graphs in Natural Language Processing @ ACL 2019 by Michael Galkin.
- Notes on ACL 2019 by Noe Casas.
- ACL 2019, my take home messages by Sergio Oramas.
- ACL 2019: Highlights and Trends by Maria Khvalchik.
- ACL 2019 Best Papers Announced by Synced.
- ACL 2019 Thoughts and Notes