Findings of the Fourth Workshop on Neural Generation and Translation

Kenneth Heafield (University of Edinburgh), Hiroaki Hayashi (Carnegie Mellon University), Yusuke Oda (Google Research), Ioannis Konstas (Heriot-Watt University), Andrew Finch (Apple), Graham Neubig (Carnegie Mellon University), Xian Li (Facebook), Alexandra Birch (University of Edinburgh)

Abstract

We describe the findings of the Fourth Workshop on Neural Generation and Translation, held in concert with the annual conference of the Association for Computational Linguistics (ACL 2020). First, we summarize the research trends of papers presented in the proceedings. Second, we describe the results of the three shared tasks: 1) efficient neural machine translation (NMT), where participants were tasked with creating NMT systems that are both accurate and efficient; 2) document-level generation and translation (DGT), where participants were tasked with developing systems that generate summaries from structured data, potentially with assistance from text in another language; and 3) the STAPLE task, in which participants were asked to produce as many correct translations as possible for a given input text. This last shared task was organised by Duolingo.

1 Introduction

Neural sequence-to-sequence models (Kalchbrenner and Blunsom, 2013; Sutskever et al., 2014; Bahdanau et al., 2015) are the workhorse behind a wide variety of natural language processing tasks such as machine translation, generation, summarization, and simplification. The 4th Workshop on Neural Machine Translation and Generation (WNGT 2020) provided a forum for research in applications of neural models to machine translation and other language generation tasks (including summarization, NLG from structured data, and dialog response generation, among others).

Overall, the workshop was held with two goals. First, it aimed to synthesize the current state of knowledge in neural machine translation and generation: this year we continued to encourage submissions that not only advance the state of the art through algorithmic advances, but also analyze and understand the current state of the art, pointing to future research directions. Towards this goal, we received a number of high-quality research contributions on the workshop topics, as summarized in Section 2.

Second, the workshop aimed to expand the research horizons in NMT. We continued to organize the Efficient NMT task, which encouraged participants to develop systems that are not only accurate but also computationally efficient; this task had three participants, each with a number of individual systems. We organized the second shared task on “Document-level Generation and Translation”, which aims to push forward document-level generation technology and contrast methods for different types of inputs; unfortunately, this task had only one participant. Finally, we introduced a new shared task, organised by Duolingo, which encouraged models to produce as many correct translations as possible for a given input; this task generated a lot of interest and had 11 participants. The results of the shared tasks are summarized in Sections 3, 4 and 5.

2 Summary of Research Contributions

Similar to last year, we invited the MT and NLG communities to contribute to the workshop with long papers, extended abstracts for preliminary work, and cross-submissions of papers that have appeared in other venues. In keeping with the main vision of the workshop, we aimed for a variety of works at the intersection of machine translation and language generation.
We received a total of 28 submissions, of which we accepted 16: 2 cross-submissions, 3 extended abstracts, and 11 full papers. There were also 15 system submission papers. We elicited two double-blind reviews for each submission, avoiding conflicts of interest. In terms of topic, 8 papers focused on natural language generation and 8 on machine translation. The common emphasis across submissions this year was on capitalizing on pre-trained models (e.g., BERT; Devlin et al., 2019), especially for low-resource datasets. The quality of the accepted publications was very high; however, the number of submissions dropped significantly in comparison to last year (36 accepted papers from 68 submissions), most likely due to the extra overhead of conducting research under the lockdown policies enacted globally in response to the COVID-19 pandemic.

3 Efficiency Task

The efficiency task complements machine translation quality evaluation campaigns by also measuring and optimizing the computational cost of inference. This is the third edition of the task, updating and building upon the second edition (Hayashi et al., 2019). We asked participants to build English→German machine translation systems following the data condition of the 2019 Workshop on Machine Translation (Barrault et al., 2019) and to submit them as Docker containers. Docker containers enabled consistent measurement of computational cost along several dimensions: time, memory, and disk space. These are measured under three hardware conditions: a GPU, a single CPU core, and a multi-core CPU using all cores. Participants were free to choose which metrics and hardware platforms to optimize for.

Three teams submitted to the shared task: NiuTrans, OpenNMT, and UEdin. All teams submitted to the GPU and multi-core CPU tracks; OpenNMT and UEdin also submitted to the single-CPU track. Some CPU submissions from UEdin had a memory leak; their post-deadline fix is shown as “UEdin Fix.” Common techniques across teams were variations on the transformer architecture, model distillation, 16-bit floating point inference on GPUs (except OpenNMT), and 8-bit integer inference on CPUs (except NiuTrans). Curiously, all submissions used autoregressive models despite the existence of non-autoregressive models motivated by speed.

3.1 Hardware

The GPU track used a g4dn.xlarge instance with one NVIDIA T4 GPU, 16 GB GPU RAM, 16 GB host RAM, and 2 physical cores of an Intel Xeon Platinum 8259CL CPU. The NVIDIA T4 GPU is relatively small compared to the NVIDIA V100 GPU, but its newer Turing architecture introduces support for 4-bit and 8-bit integer operations in Tensor Cores. In practice, however, participants used floating-point operations on the GPU even though both OpenNMT and UEdin used 8-bit integers in their CPU submissions; this was primarily due to code readiness. Timing was run on a non-exclusive virtual machine because the instance is not yet available without virtualization.

The CPU tracks used a c5.metal instance, which has two sockets of the Intel Xeon Platinum 8275CL CPU, 48 physical cores, hyperthreading enabled, and 192 GB RAM. As a Cascade Lake processor, it supports the Vector Neural Network Instructions (VNNI) that OpenNMT and UEdin used for 8-bit integer matrix multiplication. For the single-core track, we reserved the entire machine and then ran Docker with --cpuset-cpus=0; a minimal sketch of this setup appears below. For the multi-core track, participants were free to configure their own CPU sets and affinities.
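The evaluation harness itself is not part of this paper; the following is only a rough sketch of the single-core condition described above, pinning a submission container to core 0 with --cpuset-cpus=0 and timing the run. The image name, entry point, and mount layout are hypothetical placeholders, not the actual harness.

```python
import subprocess
import time

# Minimal sketch of the single-core CPU condition; image tag, entry point,
# and mount paths are illustrative assumptions, not the real evaluation setup.
IMAGE = "team/wngt20-submission:cpu"  # placeholder image tag
cmd = [
    "docker", "run", "--rm",
    "--cpuset-cpus=0",                      # pin the container to core 0
    "-v", "/data/input.txt:/input.txt:ro",  # the 1M-line source file
    "-v", "/data/out:/out",
    IMAGE, "/run.sh", "/input.txt", "/out/translations.txt",
]

start = time.monotonic()
subprocess.run(cmd, check=True)
print(f"wall time: {time.monotonic() - start:.1f} s")
```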
The c5.metal instance runs directly on the full hardware; it is not a virtual machine. Teams were offered AWS time to tune their submissions on the test hardware. All participants experimented on the test hardware using the provided time or their own funds.

3.2 Measurement

Previous editions of the task specified the test set, but last year's organizers removed a team for generating the test outputs even with empty input. Moreover, translation time for some submissions was approaching one second and was often lower than loading time. Hence we updated the task to make it more robust to adversarial participants while also increasing the reliability of speed measurements. We told participants that the test set would have one million lines, lines would have at most 100 space-separated words, source sentences from an unspecified quality evaluation corpus would be hidden in their input, and quality would be evaluated with BLEU.

After the submission deadline, we announced that the main quality score is the unweighted average SacreBLEU (Post, 2018) on WMT test sets from 2010–2019, excluding 2012 (signature: BLEU+case.mixed+lang.en-de+numrefs.1+smooth.exp+test.wmt*+tok.13a+version.1.4.8 for the various WMT test sets). Participants are likely to have used these test sets in development; the WMT 2020 test set was not yet available, and other test sets were out of the domain the systems were trained for. The 2012 test set was excluded because it has lines longer than 100 words. We refer to this score as WMT1*, while also reporting the usual WMT19 scores for the translation task; a minimal scoring sketch appears below.

Shown in Table 1, the test set consisted of the aforementioned WMT input sentences and filler. For filler, we used parallel corpora outside the WMT data condition to verify that the system was still translating reasonably. Specifically, we used a recent crawl of the European Medicines Agency (EMEA, https://edin.ac/2TSPnC7), the Tatoeba project (https://edin.ac/2ywYp01), and a crawl of the German Federal Foreign Office Berlin (https://edin.ac/3bWrBes), all gathered by the European Language Resource Consortium. We do not consider the filler corpora clean or in-domain enough to be official evaluations of quality; results appear in the supplementary material. To meet our promise to participants that lines would not be longer than 100 words (space-separated tokens), we excluded WMT12 and removed any English sentences longer than 100 words from the filler. We then truncated the German Federal Foreign Office Berlin corpus to obtain a total of 1 million lines. The input sentences were randomly shuffled and mixed across corpora, retaining a separate file to enable reconstruction. The final corpus and evaluation tools are available at http://data.statmt.org/heafield/wngt20/test/.

Corpus  | Lines   | Words    | Characters
EMEA    | 759876  | 13152485 | 86584513
Tatoeba | 214943  | 1398154  | 7303297
Federal | 785     | 13458    | 87724
WMT10   | 2489    | 54021    | 328648
WMT11   | 3003    | 65829    | 396884
WMT13   | 3000    | 56089    | 332972
WMT14   | 2737    | 54268    | 329121
WMT15   | 2169    | 40771    | 241016
WMT16   | 2999    | 56789    | 337711
WMT17   | 3004    | 56435    | 336817
WMT18   | 2998    | 58628    | 351779
WMT19   | 1997    | 42034    | 249742
Total   | 1000000 | 15048961 | 96880224

Table 1: Size of corpora in the efficiency task input.

Time was measured with wall (real) time reported by the time utility and CPU time reported by the kernel for the process group. We no longer measure loading time because it is small compared to the cost of translating 1 million sentences, is easy to game with busywork, and some toolkits do lazy initialization, which makes loading time difficult to measure.
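As a minimal sketch of the WMT1* computation referenced above, the snippet below averages SacreBLEU scores over the WMT 2010–2019 test sets (excluding 2012) using sacrebleu's Python API. It assumes the per-test-set system outputs and references have already been extracted from the shuffled million-line file using the reconstruction index; the file paths are illustrative, not the harness's actual layout.

```python
import sacrebleu  # pip install sacrebleu

# WMT test sets entering the WMT1* average: 2010-2019, excluding 2012.
YEARS = [2010, 2011, 2013, 2014, 2015, 2016, 2017, 2018, 2019]

def read_lines(path):
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]

scores = []
for year in YEARS:
    # Hypothetical layout: one hypothesis and one reference file per test set,
    # already pulled out of the shuffled input via the index file.
    hyp = read_lines(f"output/wmt{year}.hyp.de")
    ref = read_lines(f"references/wmt{year}.ref.de")
    bleu = sacrebleu.corpus_bleu(hyp, [ref])  # 13a tokenization by default
    scores.append(bleu.score)

print("WMT1* =", sum(scores) / len(scores))  # unweighted average
```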
Peak RAM consumption was measured using the kernel's cgroup memory.max_usage_in_bytes counter for the CPU and by polling nvidia-smi for the GPU; a rough polling sketch appears at the end of this subsection. Swap was disabled. Participants were told to separate their Docker images into model and code files so that models could be measured separately from the relatively noisy size of code and libraries. A model was defined as “everything derived from data: all model parameters, vocabulary files, BPE configuration if applicable, quantization parameters or lookup tables where applicable, and hyperparameters like embedding sizes.” Code could include “simple rule-based tokenizer scripts and hard-coded model structure that could plausibly be used for another language pair.” Participants were also permitted to use standard compression tools such as xz to compress models; decompression time was included in the results but was small relative to the cost of translation. We report the size of the model directory and the Docker image size, both captured before the model ran.

Each evaluation started from a fresh boot of a constant Ubuntu 18.04 LTS disk image (one for CPU and one for GPU). Internet access was blocked at the cloud provider level except for the evaluation controller; this also prevented automatic upgrades.
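As a rough illustration of the GPU memory measurement just described, the following sketch polls nvidia-smi while a submission runs and keeps the maximum reported memory.used value. The container image is a hypothetical placeholder, and the actual harness may sample differently or use NVML directly.

```python
import subprocess
import time

# Sketch only: launch a (placeholder) GPU submission and record peak GPU RAM
# by polling nvidia-smi until the container exits.
proc = subprocess.Popen(["docker", "run", "--rm", "--gpus", "all",
                         "team/wngt20-submission:gpu"])  # placeholder image

peak_mib = 0
while proc.poll() is None:
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True).stdout
    values = [int(v) for v in out.split() if v.strip().isdigit()]
    if values:                     # one value per visible GPU
        peak_mib = max(peak_mib, max(values))
    time.sleep(0.5)                # sampling interval is illustrative

print("peak GPU RAM (MiB):", peak_mib)
```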
3.3 Results

Measurements are reported in Table 2. The trade-offs between quality, model size, speed, and RAM are shown in Figure 1. We compare the cost-effectiveness of GPU and multi-core CPU hardware at the prices charged by Amazon Web Services in Figure 2.

Every team had a Pareto-optimal submission for speed. This is largely due to teams focusing on different parts of the Pareto curve. OpenNMT focused on fast, small, and lower-quality systems plus one higher-quality submission. UEdin focused on higher-quality systems that were slower. Two of NiuTrans's four GPU submissions were Pareto optimal on speed, lying between OpenNMT and UEdin; their multi-core CPU submission performed poorly on all metrics.

NVIDIA T4 GPU
Team     | Variant         | WMT19 BLEU | WMT1* BLEU | Wall s | CPU s | Model MB | Docker MB | CPU RAM MB | GPU RAM MB
UEdin    | large           | 42.9 | 35.3 | 5441 | 5462 | 422 | 933 | 5463 | 4992
UEdin    | base            | 42.7 | 34.5 | 2385 | 2406 | 157 | 668 | 3793 | 3196
OpenNMT  | base            | 42.9 | 34.0 | 2328 | 2377 | 104 | 308 |  488 | 1528
UEdin    | tiny.untied     | 41.9 | 33.3 | 1971 | 1994 |  73 | 584 | 3146 | 2514
UEdin    | tiny.push.i6    | 41.1 | 32.4 | 1536 | 1558 |  64 | 579 | 1000 | 1228
NiuTrans | 35-6            | 40.9 | 32.2 | 3166 | 3450 | 291 | 887 | 2115 | 7748
NiuTrans | 35-1            | 40.7 | 32.0 | 2023 | 2318 | 251 | 847 | 2115 | 5700
NiuTrans | 18-1            | 40.2 | 31.4 | 1355 | 1646 | 149 | 745 | 2117 | 5700
NiuTrans | 9-1             | 40.0 | 31.1 |  978 | 1260 |  95 | 691 | 2117 | 5444
OpenNMT  | 4-3-256-2ffn    | 40.0 | 30.9 |  762 |  812 |  32 | 235 |  388 | 1256
OpenNMT  | 6-3-256         | 39.9 | 30.7 |  731 |  782 |  30 | 233 |  393 |  892
OpenNMT  | 4-3-256         | 38.9 | 30.0 |  706 |  758 |  28 | 232 |  402 | 1064

Single core Intel Cascade Lake CPU
Team      | Variant         | WMT19 BLEU | WMT1* BLEU | Wall s | CPU s | Model MB | Docker MB | CPU RAM MB
UEdin     | base32          | 42.6 | 34.5 | 18649 | 18648 | 160 | 659 |   1728
UEdin Fix | base8           | 42.5 | 34.3 |  9128 |  9127 |  54 | 751 |   2001
OpenNMT   | base            | 42.2 | 33.6 | 15978 | 15977 | 104 | 198 |    378
UEdin     | tiny            | 41.6 | 32.9 | 14634 | 14634 |  41 | 737 | 164686
UEdin Fix | tiny            | 41.6 | 32.9 |  4799 |  4799 |  34 | 559 |   1549
UEdin     | tiny.steady.i12 | 40.8 | 32.0 | 14553 | 14553 |  49 | 578 | 163388
UEdin     | tiny.pushy.i6   | 40.5 | 32.0 | 14399 | 14399 |  49 | 578 | 164427
UEdin Fix | tiny.steady.i12 | 40.8 | 32.0 |  4577 |  4577 |  49 | 587 |    674
UEdin Fix | tiny.pushy.i6   | 40.5 | 32.0 |  4554 |  4554 |  49 | 587 |    675
OpenNMT   | 4-3-256-2ffn    | 39.8 | 30.8 |  3922 |  3922 |  32 | 125 |    238
OpenNMT   | 6-3-256         | 39.5 | 30.5 |  3717 |  3717 |  30 | 123 |    233
OpenNMT   | 4-3-256         | 38.7 | 29.8 |  3348 |  3348 |  28 | 122 |    220
UEdin     | micro.voc8k     | 37.5 | 29.0 |  7184 |  7184 |  27 | 723 |  77158
UEdin Fix | micro.voc8k     | 37.5 | 29.0 |  4660 |  4660 |  19 | 716 |   2540

Multi-core Intel Cascade Lake CPU
Team      | Variant         | WMT19 BLEU | WMT1* BLEU | Wall s | CPU s | Model MB | Docker MB | CPU RAM MB
OpenNMT   | base            | 42.0 | 33.5 | 795 | 38300 | 104 | 198 |   1552
UEdin     | tiny            | 41.5 | 32.9 | 215 | 10014 |  41 | 737 | 108124
UEdin Fix | tiny            | 41.5 | 32.9 | 210 |  9840 |  34 | 737 |  28890
OpenNMT   | 4-3-256-2ffn    | 39.7 | 30.7 | 181 |  8735 |  32 | 125 |   1283
OpenNMT   | 6-3-256         | 39.4 | 30.5 | 155 |  7471 |  30 | 123 |    904
OpenNMT   | 4-3-256         | 38.6 | 29.7 | 144 |  6959 |  28 | 122 |    958
UEdin     | micro.voc8k     | 37.4 | 29.0 | 188 |  8711 |  27 | 723 |  77157
UEdin Fix | micro.voc8k     | 37.4 | 29.0 | 190 |  8768 |  19 | 723 |  35051
NiuTrans  | cpu             | 33.8 | 27.0 | 811 | 36198 |  64 | 432 |  19732

Table 2: Submissions to the efficiency shared task, sorted in decreasing order of WMT1* BLEU. Systems translated 1,000,000 lines with 15,048,961 space-separated words.

[Figure 1: Performance of efficiency task submissions, plotting WMT1* BLEU for the NiuTrans, OpenNMT, UEdin, and UEdin Fix systems. Panels: (a) model size on disk (MB) regardless of hardware; (b) peak GPU RAM usage (GB); (c) GPU submissions by thousand words per real second and host CPU RAM (GB), with GPU RAM shown above; (d) single-core CPU submissions; (e) multi-core CPU submissions, where UEdin's fixed submissions had similar speed.]

Regarding model size, OpenNMT and UEdin made a range of Pareto-optimal submissions,
mostly driven by the number of parameters and 8-bit quantization. OpenNMT's small lower-quality models have low CPU RAM and Docker image size; UEdin is Pareto-optimal for higher-quality models. OpenNMT was the only team to optimize for these metrics in their system description. In their multi-core CPU submission, OpenNMT shared memory amongst processes, while the other participants simply used multiple processes with copies of the model.

[Figure 2: Price comparison of GPU and multi-core CPU submissions (WMT1* BLEU versus million words per USD) based on Amazon Web Services pricing of $4.08/hr for the c5.metal CPU instance and $0.526/hr for a g4dn.xlarge GPU instance. A single CPU core does not have a well-defined price.]

4 Document Generation and Translation Task

Following the previous workshop, we continued the shared task on document-level generation and translation. This task is motivated as a central evaluation testbed for document-level generation systems with different types of inputs, providing a parallel dataset consisting of structured tables and text in two languages. We host various tracks within the testbed based on input and output constraints, and we investigate and contrast the differences between systems. In particular, we conducted the following six tracks:

• NLG (Data → En, Data → De): Generate a document summary in the target language given only structured tables (i.e., data-to-text).
• MT (De ↔ En): Translate a document in the source language to the target language (i.e., document-level translation).
• MT+NLG (Data+En → De, Data+De → En): Generate a document summary given the structured tables and the summary in another language.

4.1 Evaluation Measures

Following Hayashi et al. (2019), we employ standard evaluation metrics for the tasks above along two axes:

Textual Accuracy: BLEU (Papineni et al., 2002) and ROUGE (Lin, 2004) as measures of surface-level textual accuracy compared to reference summaries.

Content Accuracy: Relation generation (RG), content selection (CS), and content ordering (CO) metrics (Wiseman et al., 2017) to assess the fidelity of the content to the input data. An information extraction model is employed for the content accuracy measures for each target language. We followed Wiseman et al. (2017) and ensembled six information extraction models (three CNN-based, three LSTM-based) with different random seeds.

4.2 Data

We re-use the Rotowire English-German dataset (Hayashi et al., 2019), which consists of a subset of the Rotowire dataset (Wiseman et al., 2017) with professional German translations. Each instance corresponds to an NBA game and consists of a box-score table for the match, basic information about the teams (e.g., team name, city), an English game summary, and the same game summary translated into German. Final evaluation was performed on the test split of the Rotowire English-German dataset. We followed the same setting as before in terms of the additional resources participants could adopt. Systems conforming to the data requirements are marked constrained, otherwise unconstrained; results are indicated by the initials (C/U).

4.3 Baselines

We prepared two baselines for different tracks:

FairSeq-19: We use FairSeq (Ng et al., 2019) (WMT'19 single model; model identifiers transformer.wmt19.en-de and transformer.wmt19.de-en) for the MT and MT+NLG tracks.

NCP+CC: We use the two-stage model from Puduppully et al. (2019) for the NLG tracks. The English model used the pretrained weights released by the authors, and the German model was trained only on the Rotowire English-German dataset.

4.4 Submitted Systems

One team participated in the task, focusing on the German-English MT track. Team FJWU developed a system around a Transformer-based sequence-to-sequence model. Additionally, the model employed hierarchical attention following Miculicich et al. (2018) for both the encoder and decoder to account for document-level context. The system was trained in a two-stage process, where a base (sentence-level) NMT model was trained first, followed by training of the hierarchical attention network component. To handle the scarcity of in-domain translation data, they experimented with upsampling the in-domain data up to three times when constructing the training data. Their ablation experiments showed that this upsampling of in-domain data is effective at increasing the BLEU score.

4.5 Results

We show the MT track results in Table 3. We confirm that using both document-level models and in-domain data helps achieve a better BLEU score, which was also shown at the last workshop (Hayashi et al., 2019).

System     | BLEU  | Type
FJWU       | 45.04 | C
FairSeq-19 | 42.91 | C

Table 3: DGT results on the MT track (De → En).

5 STAPLE Task

Machine translation systems are typically trained to produce a single output, but in certain cases, it is desirable to have many possible translations of a given input text. At Duolingo, the world's largest online language-learning platform (www.duolingo.com), we grade translation-based challenges with sets of human-curated acceptable translation options. Given the many ways of expressing a piece of text, these sets are slow to create and may be incomplete. This process is ripe for improvement with the aid of rich multi-output translation and paraphrase systems. To this end, we introduce a shared task called STAPLE: Simultaneous Translation and Paraphrasing for Language Education (Mayhew et al., 2020).

5.1 Task Description

In this shared task, participants are given a training set consisting of 2500 to 4000 English sentences (or prompts), each of which is paired with a list of comprehensive translations in the target language, weighted and ordered by normalized learner response frequency. At test time, participants are given 500 English prompts and are required to produce the set of comprehensive translations for each prompt. We also provide a high-quality automatic reference translation for each prompt, in the event that a participant wants to work on paraphrase-only approaches. The target languages were Hungarian, Japanese, Korean, Portuguese, and Vietnamese.

5.2 Submitted Systems

There were 20 participants who submitted to the development phase, 14 who submitted to the test phase, and 11 who submitted system description papers. Submitted models largely consisted of high-quality machine translation systems fine-tuned on in-domain shared task data from Duolingo, with different tricks for training, ensembling, and output filtering; a rough sketch of this general recipe appears at the end of this subsection.

In the test phase, three teams submitted to all 5 language tracks, and one team submitted to two tracks (Portuguese and Hungarian). Of the remaining single-language submissions, Portuguese and Japanese were the most popular. In these single-language submissions, teams did not tend to take language-specific approaches.
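The submitted systems are described in the participants' own papers; the snippet below is only an illustration of the general recipe noted above, generating several beam-search candidates per prompt from an MT model and deduplicating them before filtering. The checkpoint name (a generic public Marian model standing in for a model fine-tuned on the task data), beam size, and number of kept candidates are placeholders, not values used by any team.

```python
import torch
from transformers import MarianMTModel, MarianTokenizer

# Placeholder checkpoint: a generic English->Hungarian model standing in for
# a system fine-tuned on the Duolingo STAPLE training data.
MODEL_NAME = "Helsinki-NLP/opus-mt-en-hu"
tokenizer = MarianTokenizer.from_pretrained(MODEL_NAME)
model = MarianMTModel.from_pretrained(MODEL_NAME)

def candidate_translations(prompt, num_beams=20, keep=10):
    """Return several beam-search candidates for one English prompt."""
    inputs = tokenizer([prompt], return_tensors="pt")
    with torch.no_grad():
        outputs = model.generate(**inputs,
                                 num_beams=num_beams,
                                 num_return_sequences=keep,
                                 early_stopping=True)
    decoded = tokenizer.batch_decode(outputs, skip_special_tokens=True)
    # Deduplicate while preserving beam order; real systems additionally
    # filtered candidates, e.g. with a trained classifier or score threshold.
    return list(dict.fromkeys(decoded))

print(candidate_translations("I drink water."))
```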
5.3 Results

Submission performance varied widely, but nearly all submissions improved significantly over organizer-provided baselines. The top submissions have scores comparable to taking the top 5 translations from each gold translation set. Techniques popular among the more successful teams included weighting the training data according to learner response frequency and classifier-based output filtering. Interestingly, techniques such as diverse beam search and beam reranking did not appear to improve results, despite their close relevance to the task. For more details and analysis, see Mayhew et al. (2020).

6 Conclusion

This paper summarized the results of the Fourth Workshop on Neural Generation and Translation, where we saw a number of research advances. In particular, this year introduced a more rigorous efficiency task and a new STAPLE task.

7 Acknowledgements

The efficiency shared task was partly funded by the European Union's Horizon 2020 research and innovation programme under grant agreement No 825303 (Bergamot) and by the Connecting Europe Facility (CEF) - Telecommunications from project No 2019-EU-IA-0045 (User-focused Marian). This work represents the authors' opinions, not necessarily those of the European Union. We thank Amazon Web Services for its gift of credits to support the efficiency shared task evaluation.

References

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In 3rd International Conference on Learning Representations, ICLR 2015.

Loïc Barrault, Ondřej Bojar, Marta R. Costa-jussà, Christian Federmann, Mark Fishel, Yvette Graham, Barry Haddow, Matthias Huck, Philipp Koehn, Shervin Malmasi, Christof Monz, Mathias Müller, Santanu Pal, Matt Post, and Marcos Zampieri. 2019. Findings of the 2019 conference on machine translation (WMT19). In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), pages 1–61, Florence, Italy. Association for Computational Linguistics.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Hiroaki Hayashi, Yusuke Oda, Alexandra Birch, Ioannis Konstas, Andrew Finch, Minh-Thang Luong, Graham Neubig, and Katsuhito Sudoh. 2019. Findings of the third workshop on neural generation and translation. In Proceedings of the 3rd Workshop on Neural Generation and Translation, pages 1–14, Hong Kong. Association for Computational Linguistics.

Nal Kalchbrenner and Phil Blunsom. 2013. Recurrent continuous translation models. In Proceedings of EMNLP.

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain. Association for Computational Linguistics.

Stephen Mayhew, Klinton Bicknell, Chris Brust, Bill McDowell, Will Monroe, and Burr Settles. 2020. Simultaneous translation and paraphrase for language education. In Proceedings of the ACL Workshop on Neural Generation and Translation (WNGT). ACL.

Lesly Miculicich, Dhananjay Ram, Nikolaos Pappas, and James Henderson. 2018. Document-level neural machine translation with hierarchical attention networks. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2947–2954, Brussels, Belgium. Association for Computational Linguistics.
Nathan Ng, Kyra Yee, Alexei Baevski, Myle Ott, Michael Auli, and Sergey Edunov. 2019. Facebook FAIR's WMT19 news translation task submission. In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), pages 314–319, Florence, Italy. Association for Computational Linguistics.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.

Matt Post. 2018. A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 186–191, Brussels, Belgium. Association for Computational Linguistics.

Ratish Puduppully, Li Dong, and Mirella Lapata. 2019. Data-to-text generation with content selection and planning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 6908–6915.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112.

Sam Wiseman, Stuart Shieber, and Alexander Rush. 2017. Challenges in data-to-document generation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2253–2263.