The European Data Protection Board (EDPB) recently published a report by a pool of experts on the enforcement of data subjects' privacy rights in the context of complex AI algorithms.
More specifically, the GDPR empowers data subjects with rights such as the right to rectification, the right to erasure, and the right to object to automated decision-making. However, implementing these rights in AI-driven systems presents substantial challenges due to the way AI models learn and retain information from personal data.
Challenges in implementing data subjects' rights
AI models, particularly those based on deep learning, memorize training data in a compressed form. This creates difficulties in ensuring compliance with the right to rectification and the right to erasure. The key challenges include:
- Limited understanding of how each data point impacts the model: AI models function as black boxes, making it difficult to determine the specific impact of individual data points.
- Stochasticity of training: The training process is inherently random due to batch sampling, random ordering, and parallel processing, leading to variations in the trained model.
- Incremental training process: In federated learning environments, data updates influence subsequent updates, making the removal of a single data point insufficient to eliminate its effect.
- Stochasticity of learning: The learning algorithm itself is probabilistic, making it difficult to trace how a specific data point contributed to the "learning" in the model.
Techniques for deleting and unlearning data
- Retraining models from scratch
A straightforward approach to data erasure is deleting the personal data, retraining the model without it, and replacing the old model with the retrained version. While effective for small models, this method is computationally expensive for large-scale AI systems, making it impractical for frequent data deletion requests.
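For illustration, the following is a minimal sketch of this delete-retrain-replace flow, assuming a small scikit-learn classifier and toy data; the function name and dataset are illustrative and not drawn from the EDPB report.

```python
# Minimal sketch of "delete, retrain, replace": drop the records to be erased,
# train a fresh model on what remains, and swap it in for the old artefact.
# The dataset, model choice and function name are illustrative only.
import numpy as np
from sklearn.linear_model import LogisticRegression

def erase_and_retrain(X, y, erase_idx):
    """Retrain from scratch without the erased records."""
    keep = np.ones(len(X), dtype=bool)
    keep[erase_idx] = False
    model = LogisticRegression(max_iter=1000)
    model.fit(X[keep], y[keep])  # the erased rows never enter training
    return model, X[keep], y[keep]

# Toy data standing in for personal data.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

model, X_kept, y_kept = erase_and_retrain(X, y, erase_idx=[3, 17, 42])
# In production, the previous model artefact would now be replaced by `model`.
```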
- Exact unlearning methods
Several machine unlearning methods have been developed to remove specific data points without retraining the entire model:
- Model-agnostic unlearning: This method stores model gradients or modifies the training process to facilitate unlearning. A notable approach is the SISA (Sharded, Isolated, Sliced, and Aggregated) technique, which divides training data into multiple shards, limiting the influence of individual data points to specific portions of the model (a minimal sketch of this sharded approach appears after this list).
- Model-intrinsic unlearning: Some unlearning techniques are designed for specific AI models, such as decision trees and random forests, where strategic modifications allow selective forgetting.
- Application-specific unlearning: In recommendation systems, where data sparsity is high, efficient data structures can be used to remove personal data without retraining the entire model.
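The sketch below illustrates the sharding idea behind SISA under simplifying assumptions: a handful of constituent models, each trained on its own shard and aggregated by majority vote, so that forgetting a record only requires retraining the shard that contained it. The shard count, model type, and helper names are illustrative.

```python
# Sketch of the SISA idea: split the training data into shards, train one
# constituent model per shard, and aggregate predictions. Unlearning a record
# only requires retraining the shard that held it.
# Shard count, model type and helper names are illustrative assumptions.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

N_SHARDS = 4

def train_shards(X, y):
    shards = np.array_split(np.arange(len(X)), N_SHARDS)
    models = [DecisionTreeClassifier().fit(X[idx], y[idx]) for idx in shards]
    return shards, models

def unlearn(X, y, shards, models, point_idx):
    """Retrain only the shard containing the record to be forgotten."""
    for s, idx in enumerate(shards):
        if point_idx in idx:
            kept = idx[idx != point_idx]
            shards[s] = kept
            models[s] = DecisionTreeClassifier().fit(X[kept], y[kept])
    return shards, models

def predict(models, X_new):
    """Aggregate the constituent models by majority vote."""
    votes = np.stack([m.predict(X_new) for m in models])
    return (votes.mean(axis=0) >= 0.5).astype(int)
```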
- Approximate unlearning techniques
When exact unlearning is computationally prohibitive, approximate methods are used to minimize the influence of deleted data without completely retraining the model (a gradient-based sketch follows this list):
- Finetuning: The model undergoes limited additional training to reduce the impact of specific data points.
- Influence unlearning: This method estimates the influence of deleted data on the model and updates parameters accordingly.
- Intentional misclassification: Instead of removing data, models are retrained to misclassify deleted data points, making them unrecognizable.
- Parameter deletion: By storing historical parameter updates, unlearning can be achieved by rolling back specific updates.
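As a rough illustration of this approximate family, the sketch below dampens the influence of a forget set by taking a few gradient ascent steps on its loss after normal training. This is a generic gradient-based sketch on a toy logistic regression, not a method prescribed by the report; the learning rates and step counts are arbitrary assumptions.

```python
# Sketch of gradient-based approximate unlearning on a toy logistic regression:
# after normal training, ascend the loss on the forget set so its imprint on
# the parameters is reduced, without retraining from scratch.
# Learning rates, step counts and the model itself are illustrative assumptions.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad(w, X, y):
    """Gradient of the average logistic loss with respect to the weights."""
    return X.T @ (sigmoid(X @ w) - y) / len(X)

def train(X, y, lr=0.1, steps=500):
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        w -= lr * grad(w, X, y)  # ordinary gradient descent
    return w

def approximate_unlearn(w, X_forget, y_forget, lr=0.05, steps=20):
    """Take gradient *ascent* steps on the forget set to dampen its influence."""
    for _ in range(steps):
        w = w + lr * grad(w, X_forget, y_forget)
    return w
```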
Verification and concerns with machine unlearning
One of the biggest challenges in unlearning is verification. Metrics such as unlearning accuracy, remaining accuracy, and membership inference attacks are used to assess whether a model has successfully forgotten data. However, approximate unlearning lacks strong guarantees, and some models can produce nearly identical outputs despite differences in training data.
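A toy membership-inference-style check, assuming a classifier that exposes class probabilities, might compare the unlearned model's confidence on the supposedly forgotten records with its confidence on records it never saw; the helper names below are illustrative.

```python
# Sketch of a membership-inference-style verification: compare the model's
# confidence on records it was asked to forget with its confidence on records
# it never saw. A large gap suggests the forgotten data still influences the
# model. The helper names and the use of predict_proba are assumptions.
import numpy as np

def true_label_confidence(model, X, y):
    """Probability assigned by the model to each record's true label."""
    proba = model.predict_proba(X)  # assumes a scikit-learn-style classifier
    return proba[np.arange(len(y)), y]

def membership_gap(model, X_forgotten, y_forgotten, X_unseen, y_unseen):
    """Mean confidence on 'forgotten' records minus mean confidence on unseen ones."""
    return (true_label_confidence(model, X_forgotten, y_forgotten).mean()
            - true_label_confidence(model, X_unseen, y_unseen).mean())
```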
Additional concerns include:
- Privacy risks: If attackers can compare model outputs before and after unlearning, they may infer which data was removed.
- Bias and fairness issues: Deletion requests are more likely from certain demographic groups, which could introduce biases in AI models.
Addressing data leakage in generative AI
Generative AI models, such as large language models and image generators, pose unique risks as they may inadvertently output personal data. To mitigate these risks, several approaches have been developed:
- Model finetuning: Adjusting training to prevent the generation of specific data or concepts.
- Data redaction: Using adversarial training to prevent models from learning certain types of personal information.
- Output modification: Employing classifiers to filter and block certain outputs before they reach users (see the filtering sketch after this list).
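As a simple illustration of output modification, the sketch below filters generated text for obvious personal-data patterns before it reaches the user. The regexes and redaction behaviour are assumptions for the example; a deployed system would more likely rely on a trained classifier, as described above.

```python
# Sketch of output-side filtering: redact obvious personal-data patterns from
# generated text before returning it to the user. The regexes and redaction
# token are illustrative; a deployed filter would typically combine patterns
# with a trained classifier.
import re

PII_PATTERNS = [
    re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),   # email-like strings
    re.compile(r"\+?\d[\d\s().-]{7,}\d"),     # phone-number-like strings
]

def filter_output(generated_text: str) -> str:
    """Redact pattern matches from model output before it reaches the user."""
    for pattern in PII_PATTERNS:
        generated_text = pattern.sub("[REDACTED]", generated_text)
    return generated_text

print(filter_output("Write to jane.doe@example.com or call +1 555 123 4567."))
# -> "Write to [REDACTED] or call [REDACTED]."
```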
Conclusion
Ensuring compliance with data subjects' rights in AI systems remains a complex challenge. While retraining from scratch offers the most robust solution, it is impractical for large models. Emerging unlearning techniques, both exact and approximate, provide alternative solutions, though they still require refinement.
As AI continues to evolve, the focus should be on data protection by design, incorporating mechanisms for data rectification and deletion from the outset. Additionally, stricter regulations and transparency measures can help ensure that AI systems respect individuals' rights while balancing technical feasibility.
On the same topic, you can read the article “How synthetic data can address IP and privacy issues of artificial intelligence”.