AUTHORS: Zeyuan Chen, Ran Xu, Luyu Yang, Donald Rose
TL;DR: Many methods designed to preserve online privacy propose complex security measures to protect sensitive data. We believe that not storing any sensitive data is the optimal way to preserve privacy, so we propose a “burn after reading” online framework: user data samples are immediately deleted after they’re processed. Even though user data are unlabeled and not stored, our model can adapt to the novel samples very well and achieves state-of-the-art performance on four domain-adaptation benchmarks.
The pandemic has made the internet even more ubiquitous in our lives, and along with its many benefits comes a new reality: it’s extremely hard to escape one’s past now that every photo, status update, and tweet lives forever in the cloud. Face it: your face (along with other data) is out there, and hard to get rid of.
But changes may be afoot. Recommender systems that actively explore user data with data-driven algorithms have sparked debate about whether the right to privacy should take priority over convenience. And we have the Right to Be Forgotten (RTBF), which gives individuals the right to ask organizations to delete their personal data.
Since machine learning models use a lot of data, a valid question is whether there are ways to protect the privacy of the people associated with that data. We believe the answer is yes, and it’s fertile ground for researchers to explore. For example, here are two approaches:
While the above approaches show promise, they also face challenges and drawbacks:
Leaking data is, of course, never a good thing - and that’s why we argue that not storing any sensitive data is the optimal method of preserving privacy (deleting user data after use), which necessitates an online framework.
However, existing online learning frameworks cannot meet this need without addressing the distribution shift from source to target domains. This task, which is seemingly an extended setting of UDA, cannot simply be solved by an online implementation of the offline UDA methods. To begin with, existing offline UDA methods rely heavily on the rich combinations of cross-domain mini-batches that gradually adjust the model for adaptation, which the online streaming setting cannot afford to provide. In particular, many existing methods require discriminating a large number of source-target pairs to achieve the adaptation. Recently, state-of-the-art offline methods show promising results by exploiting target-oriented clustering, but that requires offline access to the entire target domain.
The upshot is that the online UDA task needs new solutions in order to succeed, given the scarcity of data from the target domain.
We believe we’ve found a new and innovative approach. In this work:
In addition to the distribution shift issue, we focus on the fundamental challenge of the online task—the lack of diverse cross-domain data pairs.
How Our Method Works
We aim straight at the most fundamental challenge of the online task—the lack of diverse cross-domain data pairs—and propose a novel algorithm based on cross-domain bootstrapping for online domain adaptation. At each online query, we increase data diversity across domains by bootstrapping the source domain to form diverse combinations with the current query. To fully exploit the valuable discrepancies among the diverse combinations, we train a set of independent learners to preserve the differences. We later integrate the knowledge of these learners by exchanging their predicted pseudo-labels on the current target query to co-supervise the learning on the target domain, but without sharing the weights to maintain the learners’ divergence. We obtain more accurate prediction on the current target query by an average ensemble of the diverse expertise of all the learners.
The figure below shows an overview of our proposed CroDoBo algorithm. CroDoBo adapts from a public source domain to a streaming online target domain. At each timestep j, only the current j-th query is available from the target domain. CroDoBo bootstraps the source domain batches that combine with the current j-th query in order to increase the cross-domain data diversity. The learners (w) exchange the generated pseudo-labels (y) as co-supervision.
Once the current query is adapted in the training phase, it is tested immediately to make a prediction. And then comes the privacy-preserving part: each target query is deleted after being tested (note the flame icons in the lower left of the figure, on the “j-1” box, indicating that data is deleted once it’s processed).
Example images of the source domain are from Fashion-MNIST, adapting to the target domain Deep-Fashion.
Given the labeled source data and unlabeled target data drawn from the target distribution, both offline and online adaptation aim at learning a classifier that makes accurate predictions on the target domain. The offline adaptation assumes access to every data point in both the source and target domains, and needs to train the model on all of those data points before making inferences.
However, for online adaptation, we can only assume access to the entire source domain, while data for the target domain arrives in a random streaming fashion of mini-batches. Each mini-batch (or what we call a “query”) is adapted, tested, and then erased without replacement to preserve privacy.
The fundamental challenge of the online task is limited access to the training data at each inference query, compared to the offline task. Undoubtedly, online adaptation faces a significantly smaller data pool and less data diversity, which means the training process of the online task suffers from two major drawbacks:
The goal of our proposed method is to minimize these drawbacks of the online setting. We first propose to increase data diversity by cross-domain bootstrapping, while data diversity is preserved in multiple independent learners. Then we fully exploit the valuable discrepancies of these learners by exchanging their expertise on the current target query to co-supervise each other.
The diversity of cross-domain data pairs is crucial for most prior offline methods to succeed. Since the target samples cannot be reused in the online setting, we propose to increase the data diversity across domains by bootstrapping the source domain to form diverse combinations with the current target domain query. At a high level, the bootstrap simulates multiple realizations of a specific target query given the diversity of source samples. The bootstrapping brings multi-view observations on a single target query.
After the independent learners have preserved the valuable discrepancies of cross-domain pairs, the question now is how to fully exploit the discrepancies to improve the online predictions on the target queries. On the one hand, we want to integrate the learners’ expertise into one better prediction on the current target query. On the other hand, we hope to maintain their differences. We train the learners jointly by exchanging their knowledge on the target domain as a form of co-supervision. Specifically, the learners are trained independently with bootstrapped source supervision, but they exchange the pseudo-labels generated for target queries.
We consider two metrics for evaluating online domain-adaptation methods: online average accuracy and one-pass accuracy. The online average is an overall estimate of the streaming effectiveness. The one-pass accuracy measures after training on the finite sample how much the online model has deviated from the beginning. A one-pass accuracy much lower than the online average indicates that the model might have overfitted to the fresh queries, but compromised its generalization ability to the early queries.
We conducted the experiments on four benchmarks.
VisDA-C is a classic benchmark adapting from synthetic images to real.
COVID-DA is adapting CT image diagnosis from common pneumonia to the novel disease.
WILDS-Camelyon17 is a large-scale medical dataset that contains histopathology images with patient population shift from source to target.
Fashion-MNIST to DeepFashion. Due to the lack of cross-domain fashion prediction datasets, we propose to evaluate adapting from these two. We select six fashion categories shared between the two datasets, and design the task as adapting from grayscale samples of Fashion-MNIST to real-world commercial samples from DeepFashion.
Below are image samples from these benchmarks and a qualitative comparison between our method (CroDoBo) and existing state-of-the-art (SOTA) methods.
We are pleased to report that CroDoBo outperforms other methods in the online setting by a large margin, and is comparable to the offline result from the state-of-the-art approach ATDOC-NA. With regard to time efficiency, CroDoBo is superior to other approaches by achieving high accuracy in only one epoch.
The table below summarizes the results on VisDA-C, and we plot the online streaming accuracy in the figure below. For more results on other datasets, such as WILDS-Camelyon17, Fashion-MNIST to DeepFashion, etc., please read our research paper.
Results of online accuracy on VisDA-C. X-axis is the streaming timestep of the target queries. Each approach takes the same randomly perturbed sequence of target queries.
Our work proposes an online domain-adaptation framework in which target data is erased immediately after being processed. This is beneficial, both with regard to an individual’s right to be forgotten and related aspects of online privacy protection. However, while our method improves data privacy, there still is a risk of data leakage. Like most neural networks, someone could purposefully exploit the memorization effect of the deep neural network weights to restore private information. We will focus on mitigating this issue in future work.
In addition to a method to achieve increased security, we developed a novel online UDA algorithm to tackle the lack of data diversity (the diverse source and target data combinations); our method augments data to make it easy to adapt to an unknown target domain. We hope this approach can aid future research in UDA and related areas of AI and ML. The fact that our proposed method achieves state-of-the-art online results, and comparable results to the offline domain adaptation approaches, validates our approach as a useful and fruitful line of research. Future work to expand its capabilities (for example, we would like to extend CroDoBo to more tasks, such as semantic segmentation) should further increase its positive impacts.
Key takeaways and contributions:
Salesforce AI Research invites you to dive deeper into the concepts discussed in this blog post (links below). Connect with us on social media and our website to get regular updates on this and other research projects.
Zeyuan Chen is a Lead Research Engineer at Salesforce Research. His research interests are in computer vision, robotics process automation (RPA), and multimodal foundation models.
Ran Xu is a Director of Applied Research at Salesforce Research. He leads computer vision and multimodal research, helps the team to transform research into AI products, and drives customer success.
During this work, Luyu Yang was a research intern at Salesforce Research. She received her Ph.D. degree from the computer science department at the University of Maryland. Her research interests include domain adaptation, noisy-label learning, and robust model learning.
Donald Rose is a Technical Writer at Salesforce Research. Specializing in content creation and editing, he works on multiple projects including blog posts, video scripts, news articles, media/PR material, writing workshops, and more. His passions include helping researchers transform their work into publications geared towards a wider audience, and writing think pieces about AI.