OpenAI and Google have released guidelines for website owners who want to keep their site content from being used to train the companies’ large language models (LLMs).
While we have long advocated for the right to scrape websites (the practice of using a computer to load and analyze pages for research, journalism, and archiving), the legality of scraping to collect training data for generative AI is a separate question from whether doing so is proper, tasteful, or desirable.
As norms evolve around which kinds of scraping, and which uses of scraped data, are acceptable, it is useful to give website operators an automated way to signal their preferences to web crawlers.
Asking OpenAI and Google (and any other company that chooses to honor the preference) not to include scrapes of your site in their models is straightforward, as long as you can access your site’s file structure.
We’ve previously written about how these models use art for training, and the basic idea and process are the same for text. Researchers have long relied on datasets scraped from the internet for studies of censorship, malware, sociology, language, and many other topics, including generative AI.
Today, both academic and commercial researchers run bots that crawl the web, gathering and storing the content of the sites they encounter to build AI training data.
That data may feed text-centric tools, or systems that collect images paired with text in order to learn correlations between words and images during training. The most visible results of this work so far are chatbots like Google Bard and ChatGPT.
If you’d rather your website’s content not be used in this training, you can ask Google’s and OpenAI’s bots to skip your site. Keep in mind that this only applies to future scraping.
If Google or OpenAI have already gathered data from your site, they will not retroactively remove it. Additionally, this request does not impact other companies independently training their own large language models (LLMs).
Content you’ve posted on other platforms, such as social networks or forums, remains unaffected. Furthermore, it does not impede models trained on extensive datasets from scraped websites that are not affiliated with a specific company.
For instance, both OpenAI’s GPT-3 and Meta’s LLaMa were trained primarily on data from Common Crawl, an open-source archive of large portions of the internet maintained for research purposes. You can block Common Crawl, but doing so keeps your data out of all of its datasets, including the many that have nothing to do with AI.
Technically, bots are not obligated to honor your requests. So far, only Google and OpenAI have announced this opt-out method, so other AI companies may ignore it entirely or introduce opt-out procedures of their own.
It’s also worth recognizing that this opt-out doesn’t block the many other forms of scraping done for research or other purposes. If you generally support scraping but are uneasy about your website’s content ending up in a corporate AI training set, this method is one available recourse.
Before walking through the steps, let’s clarify exactly what you’ll be modifying.
What’s a Robots.txt?
To request that these companies refrain from scraping your site, you’ll need to modify or create a file known as “robots.txt” on your website. Robots.txt serves as a set of instructions for bots and web crawlers.
Historically, it has been used to give search engines useful information as their bots crawl the web.
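For example, a site might use robots.txt to steer search-engine crawlers away from an administrative area while pointing them at a sitemap. This is a minimal sketch, and the paths and sitemap URL here are hypothetical:

```
# Rules for all crawlers
User-agent: *
Disallow: /admin/

# Tell crawlers where to find the sitemap
Sitemap: https://www.example.com/sitemap.xml
```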
If website owners wish to instruct a specific search engine or bot not to scan their site, they can include this directive in their robots.txt file. While bots can opt to disregard these instructions, many crawling services choose to respect such requests.
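A well-behaved crawler consults robots.txt before fetching pages. Python’s standard library includes a parser that illustrates how such a check works; the rules and URLs below are hypothetical examples:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules blocking one crawler site-wide
rules = """\
User-agent: GPTBot
Disallow: /
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

# A compliant GPTBot would skip this page
print(parser.can_fetch("GPTBot", "https://www.example.com/post.html"))
# Crawlers the file doesn't mention remain allowed by default
print(parser.can_fetch("SomeOtherBot", "https://www.example.com/post.html"))
```

This is the same logic a respectful crawling service applies on its end; nothing in the file itself enforces compliance.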
Although it may sound technical, the process involves a small text file located in the root folder of your site, such as “https://www.example.com/robots.txt.” This file is accessible to anyone visiting the website. For instance, you can view The New York Times’ robots.txt, which currently restricts both ChatGPT and Bard.
If you manage your own website, you should have a means to access the file structure, either through your hosting provider’s web portal or FTP. Consult your provider’s documentation if you need assistance in locating this folder. Typically, your site will already have a robots.txt file, even if it’s empty.
However, if you need to create a file, you can do so using any plain text editor. Google offers guidance on this process here.
What to Include in Your Robots.txt
To exclude ChatGPT and Google Bard, add a few lines to your robots.txt file. You can opt out your entire site or only certain folders; the directives differ slightly for each case.
Opting Out Entirely
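At the time of writing, OpenAI’s crawler identifies itself as GPTBot, and Google uses the token Google-Extended to control whether a site’s content can be used to train its AI products. To opt your whole site out of both, add the following to your robots.txt:

```
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /
```

The lone “/” after Disallow covers every path on the site.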
Opting Out Specific Folders
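If you only want to keep certain folders out of training data, list those paths instead of blocking the whole site. The folder names below are hypothetical placeholders; substitute your own:

```
User-agent: GPTBot
Disallow: /private/
Disallow: /drafts/

User-agent: Google-Extended
Disallow: /private/
Disallow: /drafts/
```

Everything not listed under a Disallow line remains open to these crawlers.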
Considerations and Limitations
While opting out is straightforward, it has limitations. Excluding ChatGPT and Google Bard doesn’t affect other AI companies or models trained on different datasets, and not every bot will honor your requests.
Frequently Asked Questions
Q: Can I remove existing data?
A: No, exclusion only applies to future scraping. If data from your site already exists, it won’t be removed.
Q: Does this block other scraping?
A: No, it only signals exclusion to ChatGPT and Google Bard. Other scraping activities for research purposes may still occur.
Q: Are there alternatives to exclude my site?
A: Currently, only Google and OpenAI have announced exclusion options. Other AI companies may or may not provide similar features.
As a website owner, you can express your preference about whether your content is used in AI training.
By following the steps in this guide, you can exclude ChatGPT and Google Bard and add your voice to the ongoing conversation about the ethical use of scraped data.
Just remember that norms and practices around web scraping and AI training are still evolving, so it’s worth staying informed.