The web scraping of information and personal data by artificial intelligence (AI) systems for their training has come under the scrutiny of the Italian privacy authority.
The Italian Data Protection Authority (the Garante) recently launched an investigation into websites to verify their adoption of appropriate security measures to prevent the massive collection of personal data by artificial intelligence systems.
With this recent investigation, the Garante aims to highlight the growing risk to personal data posed by the uncontrolled expansion of artificial intelligence (AI) systems. Such systems are trained through the processing of large amounts of data to refine and enhance their language processing capabilities. Large language models rely primarily on data drawn from the Internet, some of which may be of a personal nature.
Technical measures can be adopted to limit the collection of data through the “spiders” of these systems. For example, some French newspapers have already taken steps to prevent the content they publish from being used unchecked for the development of AI systems. In this context, the investigation conducted by the Garante aims to gather relevant comments and inputs on the security measures already taken and those that can be implemented to counter the massive collection of personal data on websites.
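One common technical measure of this kind is a robots.txt file that asks specific crawlers not to index a site. The sketch below is only an illustration, not a measure endorsed by the Garante: it uses Python's standard `urllib.robotparser` to check which crawlers a given robots.txt would turn away. The crawler tokens `GPTBot` and `CCBot`, the `example.com` URL and the helper name `is_allowed` are assumptions chosen for the example and do not come from the investigation itself. Note also that robots.txt is a voluntary convention, so it only limits crawlers that choose to honor it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: blocks two AI-related crawlers (assumed tokens)
# and leaves the site open to every other user agent.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file content as a list of lines
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Illustrative page that might contain personal data
    page = "https://example.com/people/profile.html"
    for agent in ("GPTBot", "CCBot", "Mozilla/5.0"):
        print(agent, is_allowed(ROBOTS_TXT, agent, page))
```

With this sample file, the two named crawlers are refused and an ordinary browser user agent is allowed. A site operator would still need stronger controls, such as access restrictions or rate limiting, against crawlers that ignore the file.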
In February 2022, the Italian Data Protection Authority issued a €20 million fine against Clearview AI for web scraping activities. According to the Garante, the company monitored individuals and collected biometric data within Italian territory, using AI systems capable of creating individuals’ profiles based on biometric data in images and other related information. The urgency of the current initiative likely stems from the launch of LLMs (Large Language Models) that process data available on the web.
The Garante’s renewed interest in the practice of web scraping could lead to the issuance of emergency measures addressed to all operators, private and otherwise, who publish personal data on their websites. Such measures would set out specific security requirements to prevent the uncontrolled collection of data from those websites. If implemented, they might represent both a significant brake on the development of AI systems and an added protection for individuals whose personal data are published on the Internet.
The investigation covers all public and private entities, operating as data controllers, established in Italy or offering services in Italy, that make personal data freely available online. It is possible to contribute to the investigation by sending comments and input on the security measures already adopted, and on those that could be adopted, against the massive collection of personal data for algorithm training purposes, to webscraping@gpdp.it by January 21, 2024.
The timeless contrast between technological innovation and the protection of personal data rights continues to evolve in new forms. On the same topic, you can read the article “Is Artificial Intelligence Reshaping Copyright Protection?”.
Authors: Giulio Coraggio and Marco Guarna