This blog post describes our recent paper:
Federico Bianchi and Dirk Hovy (2021). On the Gap between Adoption and Understanding in NLP. Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021.
The main focus of this work is to describe issues that currently affect NLP research and hinder scientific development. NLP is driven by methodological papers, which are faster to produce and therefore meet researchers’ incentives to publish as much as possible. However, the speed with which models are published and then used in applications can exceed the discovery of their risks and limitations. And as their size grows, it becomes harder to reproduce these models to discover those aspects.
Dirk and I explore five of these issues that are of most concern: the Early Adoption of models, the prevalence of Computational Papers in our community, Publication Bias, the Computational Unobtainability of models, and the Unexplainability of Methods.
These issues are briefly summarized in the following picture:
In short, with increased output (models, publications, data sets), we have created a situation where these methods are widely used before they are fully understood, and an incentive system to perpetuate the situation.
We do not have complete control over language models. Adopting new technologies without full awareness of their side effects is risky.
The time from research to industry application has shrunk greatly in the last year. For example, anyone can now release a model on the HuggingFace repository, and once that is done, it is readily available to any researcher around the world. However, this ready availability can come at a high cost.
We know that models like GPT-3 contain systematic bias against minorities. Abid et al. (2021) show that GPT-3’s anti-Muslim bias appears consistently and creatively across different uses of the model.
Thus, how reliable can applications built using those models be?
In a recent report, Buchanan and colleagues suggested that while GPT-3 cannot create misinformation on its own, it can be used as a tool to generate high-quality misinformation messages in a way that has never been done before.
This gap between the widespread adoption and complete understanding (or GAU) is influenced by the different aspects we are going to see in the next sections of this blog post.
If we do not evaluate source code, we cannot be sure of a paper’s quality.
As a computational science, Natural Language Processing is filled with computational papers. New models or methodological approaches tend to be highly appreciated by the community because they provide better solutions for a given task.
There is, however, limited consideration of how these papers hold up after publication. Once a paper is published, can its code be used to replicate the results? How easy is it to do so?
This issue is already known in machine learning. For example, Musgrave et al. (2020) reported that, due to methodological flaws in the experiments, accuracy gains in metric learning were overstated in several papers.
To quickly discover methodological errors, it is important to have access to the code. Code release is an issue in our community: it is common to see code released as a Jupyter notebook with few comments.
Indeed, there is a lack of systematicity in the way these kinds of papers are evaluated. To name one issue, code review is not a mandatory practice, and many published papers lack code altogether.
Better guidelines on how to design and release code could help alleviate this replication problem, which affects our community and the ML community in general. We already know the big advantage that easy-to-use code implementations bring: HuggingFace is a notable example of this.
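One concrete, low-cost practice that such guidelines could mandate is seeding every source of randomness, so that a released script reproduces exactly the numbers a paper reports. A minimal sketch in plain Python (the experiment function and the numbers are illustrative, not from any paper):

```python
import random


def run_experiment(seed: int) -> list[float]:
    """A stand-in 'experiment' whose outcome depends on randomness.

    Using a dedicated Random instance with a fixed seed, rather than the
    global random state, keeps the run reproducible and isolated.
    """
    rng = random.Random(seed)
    return [rng.random() for _ in range(3)]


# Two runs with the same seed yield identical results;
# a different seed yields a different outcome.
assert run_experiment(42) == run_experiment(42)
assert run_experiment(42) != run_experiment(7)
```

In real experiments the same idea extends to the seeds of NumPy and the deep-learning framework in use, and to logging those seeds alongside the results.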
Publish or perish. An unfortunate saying that tends to hold true, and one all academics must abide by until tenure.
Publishing papers and peer review are the foundations of modern science; you cannot propose a new model if you do not provide substantial proof that it is novel and/or better than what other researchers have done in the past. Indeed, the quality of a researcher is assessed mainly through the quality of their research output.
However, paper review requires a lot of care and attention, and with the large number of papers submitted to our conferences, this task often rests on the shoulders of three reviewers and one or two area chairs. They have to evaluate dozens of papers and decide what is good and what is not.
But does this rule help science, or does it just put junior researchers under pressure?
With this push towards publication, getting a paper published at a main conference in the field is becoming more and more difficult. To ease the process, services like ArXiv can be used to publish research. There is nothing wrong with ArXiv, but as a changeable venue with no barrier to entry, it comes with limitations: publication records built solely on ArXiv rest on quicksand and might collapse.
Most of us can use pre-trained BERT models; some of us can fine-tune BERT; very few of us can train BERT from scratch. Not to mention GPT-3.
Can you train your own BERT model from scratch multiple times to evaluate your new hypothesis? Probably not: it takes a lot of computational power to train BERT even once, and doing so repeatedly is too demanding for all but a few research outfits. Even fine-tuning BERT can be a costly operation.
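The asymmetry can be made concrete with a back-of-envelope sketch. All the numbers below are hypothetical placeholders (GPU count, hourly price, run durations), chosen only to show how quickly repeated from-scratch runs dwarf a fine-tuning budget:

```python
# HYPOTHETICAL figures: neither the price nor the durations are measured.
GPU_HOUR_USD = 3.0   # assumed cloud price per GPU-hour
GPUS = 8             # assumed size of the training rig


def cost_usd(wall_clock_hours: float) -> float:
    """Total cloud cost of one run on the assumed rig."""
    return wall_clock_hours * GPUS * GPU_HOUR_USD


pretrain = cost_usd(96)  # assume ~4 days for one pretraining run
finetune = cost_usd(4)   # assume ~4 hours for one fine-tuning run

print(f"one pretraining run:  ${pretrain:,.0f}")  # $2,304
print(f"one fine-tuning run:  ${finetune:,.0f}")  # $96
print(f"ten pretraining runs: ${10 * pretrain:,.0f}")  # $23,040
```

Under these toy assumptions, repeating pretraining ten times to test a hypothesis costs hundreds of times more than a single fine-tuning run, which is exactly the kind of budget a low-resource lab cannot absorb.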
Evaluating papers by asking for impossible experiments is a form of gatekeeping that we need to address.
We need to be aware, when we review and analyze papers, that not everyone can retrain models like BERT, and that low-resource universities might not be able to provide the resources necessary even for fine-tuning.
The limits of a technology are no less important than the advances it brings. We should describe GPT-3’s limits alongside its capabilities.
GPT-3 appeared in The Guardian last year. The article was “written” by GPT-3 and led to a public outcry about the future of AI and of human labor. But while all the arguments made about the possibilities were well worth the discussion, many people somehow forgot to also describe the limits of this method.
Descriptions of those limits did not find their way into newspapers; instead, they appeared in specialized venues that did not necessarily reach the general public (Bender & Koller, 2020; Floridi & Chiriatti, 2020). And while some technology news outlets did cover the issues with GPT-3, that coverage never made it into major newspapers.
The methods we have are often unexplainable in terms of how they work: telling people that GPT-3 can write articles might create panic, because the first hypothesis non-experts form is that machines are rising against human labor. When we communicate these unexplainable methods to the public, we also need to describe their limits.
Obviously, these are not the only issues our field has to deal with. To mention two that we did not cover here (but that are covered elsewhere), I point the reader to the following papers:
- Dual-use issues (Lau & Baldwin, 2020). Should we check whether, and how, the technology we build could be misused?
- Environmental Sustainability (Strubell et al., 2019). How do we address the fact that the models we are developing are not sustainable?
This blog post explored some issues that the current publication trends in NLP have exacerbated to a point where it is not possible to ignore them. We keep pushing ever-larger models to market, without ensuring that we understand their risks, communicating their limits, or making them widely replicable. It is important to take action and provide solutions to these problems if we want future positive outcomes in our community.
If you are interested, you can find me on Twitter.
Thanks to Dirk for editing and suggesting improvements to this article.
Abid, A., Farooqi, M., & Zou, J. (2021). Persistent Anti-Muslim Bias in Large Language Models. arXiv preprint arXiv:2101.05783.

Bender, E. M., & Koller, A. (2020). Climbing towards NLU: On Meaning, Form, and Understanding in the Age of Data. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (pp. 5185–5198).

Floridi, L., & Chiriatti, M. (2020). GPT-3: Its Nature, Scope, Limits, and Consequences. Minds and Machines, 30(4), 681–694.

Lau, J. H., & Baldwin, T. (2020). Give Me Convenience and Give Her Death: Who Should Decide What Uses of NLP are Appropriate, and on What Basis? In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (pp. 2908–2913).

Musgrave, K., Belongie, S., & Lim, S. N. (2020). A Metric Learning Reality Check. In European Conference on Computer Vision (pp. 681–699). Springer.

Strubell, E., Ganesh, A., & McCallum, A. (2019). Energy and Policy Considerations for Deep Learning in NLP. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (pp. 3645–3650).
This article was originally published on Towards Data Science and re-published to TOPBOTS with permission from the author.