Embracing Ethical AI at NeurIPS 2021

21 Dec 2021

Anna Bethke

The leading AI research conference, NeurIPS 2021, recently wrapped up after seven very full days, 2,344 accepted papers, eight invited talks, ten tutorials, and nearly 70 workshops. Though there was diverse and innovative thought leadership on display, I found myself drawn to the topics of Ethical AI and AI for Social Good, which were well represented at the conference. Three of my main takeaways from the conference were:

Benchmark datasets, the hallmark of SOTA (State of the Art) leaderboards and the foundation for training many of the models we use today, are not accurate or representative of everyone or everything, everywhere, and can result in harm.
Including those most impacted by AI is critical to developing accurate, robust, and safe AI; otherwise, you cannot assess ground truth or know when harm is occurring.
There are many positive uses for AI, including climate science, disaster response, health, and education, but these systems cannot be built in a vacuum either. They perform much better when human expertise is built into both the design and operation of the system.

None of these insights are necessarily new to the field of Ethical AI and AI for Social Good, but they bear repeating. The topics aren’t necessarily new to NeurIPS either; what was new to the conference was their force (as measured by someone who is intensely interested in the subject). The conference organizers felt these subjects to be consequential enough to create a new track dedicated to their consideration and to add a review process to assess the ethics of submitted papers. It is clear to me that there is a real, concerted effort to ensure that everyone building AI, not just our smaller Ethical AI community, hears these points and understands the value of using better datasets and including impacted communities in our work, even if it is “just research.”

Datasets, Benchmarks, SOTA – There’s not something for everyone

I have generally found the tutorials and workshops to be the most interesting part of NeurIPS. This year, however, one of the highlights was the new “Datasets and Benchmarks” track. As the new track’s organizers state in their announcement blog:

“Because there are no good models without good data, and only robust benchmarks measure true progress, NeurIPS launched the new Datasets and Benchmarks track, to serve as a venue for exceptional work focused on creating high-quality datasets, insightful benchmarks, and discussions on how to improve dataset development and data-oriented work more broadly.”

A robust and representative dataset is foundational for a high-quality model and for ensuring that the model is accurate for all. Benchmark datasets have long been used in our community to track progress towards a perfectly “accurate” solution, but they have been shown to be biased and to lack robustness (paper1, paper2, paper3, book chapter, etc.). They have their utility, but as the authors of the paper “AI and the Everything in the Whole Wide World Benchmark” point out: “In particular, we argue that benchmarks presented as measurements of progress towards general ability within vague tasks such as ‘visual understanding’ or ‘language understanding’ are as ineffective as the finite museum is at representing ‘everything in the whole wide world,’ and for similar reasons — being inherently specific, finite and contextual.” In short, we have been using benchmark datasets as if they represented all populations, tasks, languages, and objects when they actually do not. Because of this, our algorithms and the people impacted by them have suffered.

At NeurIPS, I found the papers that investigate new dataset curation and labeling techniques, such as PASS and PROCAT, particularly interesting because they mitigate some of the label bias that often occurs in dataset curation. Because these approaches use unsupervised learning techniques, labels are much less subjective and thus less prone to unconscious bias. They have some limitations with regard to requiring visual similarity, but semi-supervised approaches may be able to scale in a more robust and accurate way. New datasets, such as the many medical diagnosis datasets or those presented in “Constructing a Visual Dataset to Study the Effects of Spatial Apartheid in South Africa,” also open new beneficial research avenues.

Humane solutions require human inclusion

Other highlights of the main conference included the invited speakers. I was particularly enthralled by Dr. Mary Gray’s talk on “The Banality of Scale: A Theory on the Limits of Modeling Bias and Fairness Frameworks for Social Justice (and other lessons from the Pandemic).” She presented without slides because, as she stated, “I know I need a break from looking at a screen, so if you want you can just close your eyes and just listen to me because this will be a dramatic reading.” This was an incredibly refreshing break and a clear acknowledgment that most of us have been spending more time in front of our computers. The choice also reflected many of the key messages in her talk. Dr. Gray called for closer ties among subject matter experts, those who have been working in the field firsthand to solve an issue, those who have faced historical bias, and those who will be most affected by the technologies that we are creating. She encouraged us not to try to solve our technology problems with more technology, but to work directly with more people. Here at Salesforce, we have been conducting consequence scanning workshops to help us understand the intended and unintended positive and negative consequences of particular technologies. As Dr. Gray mentioned in her talk, we “cannot shield ourselves from the reality of the social world.” In her keynote, she painted a picture of how human rights and technology should and can work together.

Dr. Gray’s sentiments were echoed in several other tutorials, workshops, and papers at the conference. Dr. Timnit Gebru and Dr. Emily Denton hosted a powerful tutorial, “Beyond Fairness in Machine Learning,” where they spoke about the beneficiaries and victims of machine learning systems, as well as the difficulty of determining a ground truth in many datasets. They were joined for the live discussion and Q&A (at the 3:10:00 mark in the video replay) by William Agnew, Dr. Ria Kalluri, and Dr. Alex Hanna, and also by Dr. Emily Bender and Dr. Mary Gray in the live chat. The need for a greater understanding of potential harm was also echoed in the Human-Centric AI workshop. One particularly interesting statement came from Dr. Miguel Sicart: “The only ethical method of machine learning is supervised learning.” This sentiment is in fact reflected in GDPR and other regulations for high-risk AI, under which a machine cannot make the final decision when that decision has legal or "similarly significant" effects, a category that generally encompasses employment, housing, and financial decisions such as loans. Humans need to be able to understand AI suggestions and make final decisions. When we build a more ethical and robust human-computer system by design, unintended consequences are easier to detect and mitigate.

AI for Good can be Good for AI

Beyond acknowledging AI’s large carbon footprint, the sessions on AI for Social Impact discussed climate change in both tutorials and workshops. As a space nerd whose father studies the earth’s atmosphere via remote sensing satellites, I loved the content in the “Machine Learning and Statistics for Climate Science” tutorial. Karen McKinnon and Andrew Poppick covered the types of climate data available, how the data can be obtained, its considerations and limitations, methods to process and interpret it, and finally the ways that researchers are using AI to understand climate science. Beyond the fact that this is something my 66-year-old dad is utilizing in his own research, climate change is an area to which the AI community could contribute more than it currently does.

This sentiment was echoed in the “Tackling Climate Change with Machine Learning” workshop, which contained a long list of applied AI examples. The DeepQuake paper was particularly compelling, using both physical and environmental data to increase the accuracy of earthquake predictions. I was also quite interested in the “A day in a sustainable life” tutorial, primarily because we have been trying to optimize our own household’s electricity usage since we installed solar panels; because of our dryer, we now only do our laundry when the sun is shining. I fully appreciate the idea that ML can be used to optimize our carbon footprint both in our day-to-day lives and in larger decisions such as building design, as presented by Dr. Tianzhen Hong in his invited talk on “Machine Learning for Smart Buildings.”

The author’s solar energy production and energy consumption

Another NeurIPS staple workshop, “Machine Learning for the Developing World (ML4D),” specifically looked at the question of how the COVID-19 pandemic “makes us question how global challenges such as access to vaccines, good internet connectivity, sanitation, water, as well as poverty, climate change, environmental degradation, amongst others, have and will have local consequences in developing nations, and how machine learning approaches can assist in designing solutions that take into account these local particularities.” And the “Third Workshop on AI for Humanitarian Assistance and Disaster Response” presented how AI and ML could help first responders provide assistance when disasters occur.

Finally, the AI for Health and AI for Education communities were also present, the latter even in the keynote by Duolingo CEO Luis von Ahn. One common theme across all the AI for Social Impact sessions was that deep subject matter expertise is necessary for an AI system to be most effective. Again, while these are not new insights, the force with which these humane recommendations were made was evident. Of note: additional human intuition is required to know if a tornado will form, if a tumor is present, or if an individual will pass their A-level exams.

What’s striking across sessions and issues is the sense that humans and machines must work together to achieve the best results, each leaning on the strengths of the other and compensating for the other’s weaknesses. As we assess the best ways to bring AI’s advantages to the challenges we face, conferences like NeurIPS give us an opportunity to share what we learn about its peaks and pitfalls, as well as our own.

Many thanks to Kathy Baxter, Yoav Schlesinger, and Rayce Smallwood for their comments and contributions to this article.