Guarding Content From AI: The Guardian Blocks OpenAI From Using Its Content in ChatGPT

British news outlet The Guardian confirmed Friday (September 1) that it had blocked artificial intelligence company OpenAI from using its content to power software products such as ChatGPT. The move came after the news firm's writers raised concerns and filed lawsuits accusing OpenAI of using unlicensed content to create its AI tools.

The action also follows calls from creative industries in the UK for safeguards to protect their intellectual property.

A Menace to the Creative Industry?

Generative AI technology, a range of products that generate convincing text, images, and audio from simple human prompts, has dazzled the public since a breakthrough version of OpenAI's ChatGPT chatbot launched last year.

However, there are also fears that the technology could be used to mass-produce disinformation, along with concerns about how such tools were built. The machine learning (ML) behind ChatGPT and its counterparts involves feeding in vast amounts of data culled from the open internet, including news articles, which enables the tools to predict the likeliest word or sentence to follow a user's prompt.

OpenAI has not disclosed the data that helped build the model behind ChatGPT, but announced last month it would enable website operators to block its web crawler from accessing their content. However, the move does not allow material to be removed from existing training datasets.
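For site operators curious what such a block looks like in practice, the sketch below is one illustration and not a description of The Guardian's own setup. It assumes the block is expressed as a robots.txt rule naming the GPTBot user agent, and it uses Python's standard-library robots.txt parser to check whether a given crawler may fetch a page. The helper name is_allowed, the example.com URL, and the ExampleBot agent are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Assumed robots.txt content: disallow the whole site for GPTBot only.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url.

    Hypothetical helper for illustration; it parses the rules from a
    string instead of downloading them from a live site.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/news/story"  # placeholder URL
    for agent in ("GPTBot", "ExampleBot"):
        print(agent, "allowed:", is_allowed(ROBOTS_TXT, agent, page))
```

A rule like this only asks a crawler to stay away; it relies on the crawler honoring robots.txt and, as noted above, does nothing about material already collected.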

Guarding The Guardian's Intellectual Property

A spokesperson for Guardian News & Media, publisher of The Guardian and The Observer, stressed that scraping intellectual property from its website for commercial purposes has always been contrary to its terms of service. The firm added that its commercial licensing team has many beneficial commercial relationships with developers around the world and looks forward to building more such relationships in the future.

Aside from The Guardian, a number of publishers and websites have now blocked the GPTBot crawler.

Originality.ai, a website that detects AI-generated content, listed several news sites that have blocked the GPTBot crawler, which takes data from webpages and feeds it into OpenAI's AI models. The sites include CNN, Reuters, the Washington Post, Bloomberg, the New York Times and its sports site, the Athletic.

Other sites that have blocked GPTBot include Lonely Planet, Amazon, the job listings site Indeed, the question-and-answer site Quora, and dictionary.com.

Government Action Urged

Meanwhile, the Publishers Association has urged British Prime Minister Rishi Sunak to protect the intellectual property rights of creative industries by adding the issue to the agenda of the November summit on AI safety being hosted in the UK.

In its letter to Sunak, the organization, representing publishers of digital and print books as well as research journals and educational content, asked him to clarify that intellectual property law must be respected when AI systems are being built.

Some technology companies have taken steps of their own. Elon Musk imposed limits on X, the social media platform previously known as Twitter, to address what he claimed were "extreme levels of data scraping" by AI firms building their models, which prompted the firm to deploy more servers, at a cost, to cope with the demand.

However, Musk also confirmed he would use public tweets to train models developed by his newly-announced AI business, xAI.

Google's privacy policy, for its part, now states that the company, which uses web crawlers to help find search results for users, may collect publicly available information to train models for its AI products, including the Bard chatbot.

Meanwhile, Meta, a major AI developer better known for owning Facebook and Instagram, has introduced a new policy allowing users to say if they do not want their personal information used for training AI models.

OpenAI had yet to respond to The Guardian's announcement as of this report.