AI, copyright and the legal grey zone

By Chris Fotheringham and Mike Kosowski
As AI reshapes content creation, legal frameworks struggle to keep pace with data scraping and the protection of copyright.
The digital age has revolutionised how content is created, shared and consumed, but it has also introduced unprecedented challenges in protecting intellectual property rights. Historically, disputes over automated data extraction were addressed through contract law and website terms of use, but recent court decisions (particularly in the USA) signal an international shift toward copyright law as the primary framework for regulating scraping practices. At the same time, the explosive growth of AI models trained on vast datasets has intensified concerns over unauthorised use of creative works, prompting litigation and legislative scrutiny worldwide. As these developments reshape the boundaries of fair use and content ownership, copyright holders face new challenges in safeguarding their rights against increasingly sophisticated AI-driven technologies.
In this article, we review how copyright law has evolved in response to online content, how data scraping can infringe a copyright owner's rights, and what policymakers are doing to tackle the issue.
What is digital copyright?
Digital copyright emerged alongside the rapid expansion of the internet, as traditional frameworks for protecting IP struggled to adapt to a world where content could be copied, shared and distributed instantly – irrespective of borders. Unlike print or broadcast media, digital platforms introduced both unprecedented opportunities for creators to reach global audiences and significant vulnerabilities in terms of unauthorised use. Internationally, the legal landscape has been developing with the rise of digital copyright. EU Directive 2019/790 was the EU's primary adaptation of copyright exceptions to the digital and cross-border environment. Similarly, the Copyright, Designs and Patents Act 1988, which governs copyright law in the UK, has been amended by way of the Digital Economy Act 2017, updating the relevant law to address digital technology and online infringement. However, it seems that with the rise of AI, legislators have been unable to keep up.
In the UK and EU, copyright protection does not require formal registration; it arises automatically as soon as an original work is created and fixed in a tangible medium of expression. This principle, rooted in the Berne Convention for the Protection of Literary and Artistic Works 1886, extends equally to digital creations such as blog posts, photographs, videos, software code, and even social media content. In the online environment, this means that virtually every piece of creative output, whether professional or amateur, is protected by copyright the moment it is created.
Nevertheless, enforceability is complicated by factors such as jurisdictional differences and the digital nature of the content itself. In terms of jurisdictional differences, the USA is often described as a stark contrast to the UK.
In the USA, authors are generally advised to register their work with the US Copyright Office. An unregistered work is effectively unenforceable in federal court, and criminal proceedings are not possible against infringers of unregistered copyright works. Further, statutory damages and costs are only available for copyright claims in registered works. The difference is that in the UK, registration is not possible at all: no copyright register exists, and protection arises automatically. Adding to the jurisdictional differences, enforcement is complicated further by the digital nature of the content. Digital content can be accessed, duplicated, altered or shared, all with the click of a button. For copyright owners, this means that they may simply lose sight of the sheer volume of infringement. In short, policing one's intellectual property rights can be a very challenging endeavour.
What is data scraping?
Data scraping refers to the automated process of extracting large volumes of information from websites or online platforms, typically using software tools or bots. Unlike manual collection (which is not used for AI training at scale), automated scraping systematically collects structured or unstructured data – anything from text to images and metadata – and does so at scale, often for purposes such as analytics or research. Scraped data is used to train virtually every AI model, irrespective of use case (i.e. whether it is a large language model, a text-to-speech model, a text-to-image model or a text-to-video model).
A data scraper works by using a small, non-commercial AI model (a "Bot") to automatically identify and extract relevant information from websites. It then uses all of the scraped data to create combined datasets, which are then used as the base training datasets for the final AI product. Using the Reddit example below, that would mean that the Bot would scrape all posts on Reddit to identify any useful information. It would then collate these posts into a single dataset, together with any other relevant posts found on other such platforms, and the final AI product would be trained on that data.
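By way of illustration only, the following is a minimal sketch of such a scraping pipeline in Python, using the widely adopted requests and beautifulsoup4 libraries. The URL, the "div.post" selector and the output file are hypothetical placeholders, not a description of any actual AI provider's pipeline.

```python
# A minimal, illustrative scraping sketch. The seed URL and the "div.post"
# CSS selector are invented for this example and would differ per site.
import json
import requests
from bs4 import BeautifulSoup

SEED_URLS = ["https://example.com/forum/page/1"]  # hypothetical starting pages

def scrape(urls):
    records = []
    for url in urls:
        response = requests.get(url, timeout=10)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, "html.parser")
        # Extract every post body on the page; the "post" class is assumed.
        for post in soup.select("div.post"):
            records.append({"url": url, "text": post.get_text(strip=True)})
    return records

if __name__ == "__main__":
    dataset = scrape(SEED_URLS)
    # Collate the scraped posts into a single dataset file, which would later
    # be combined with data from other platforms to form a training corpus.
    with open("scraped_dataset.json", "w", encoding="utf-8") as f:
        json.dump(dataset, f, ensure_ascii=False, indent=2)
```

A production crawler adds link discovery, deduplication and rate limiting on top of this loop, but the essential steps – fetch, extract, collate – are as shown.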
The data used in scraping for AI models is found largely on the open internet. One of the most popular AI models currently in use, GPT-5 (like many of its predecessors), was trained on websites such as Wikipedia, Pinterest, DeviantArt, Wikimedia, Tumblr and even Reddit. Although most licence arrangements between AI providers and data-holders began in or around 2024, models had been trained on those websites in the years before those licences were entered into. Those licences exist largely because of threats of copyright-infringement litigation from the data-holders.
Consider also that most of the data scraped from the above sources was taken from the original authors without their explicit consent. Take Reddit's justification, for instance. Its User Agreement provides:
When Your Content is created with or submitted to the [Reddit] Services, you grant us a worldwide, royalty-free, perpetual, irrevocable, non-exclusive, transferable, and sublicensable license to use, copy, modify, adapt, prepare derivative works of, distribute, store, perform, and display Your Content and any name, username, voice, or likeness provided in connection with Your Content in all media formats and channels now known or later developed anywhere in the world. This license includes the right for us to make Your Content available for syndication, broadcast, distribution, or publication by other companies, organizations or individuals who partner with Reddit. You also agree that we may remove metadata associated with Your Content, and you irrevocably waive any claims and assertions of moral rights or attribution with respect to Your Content.
This term of the User Agreement applies uniformly across all jurisdictions in which Reddit operates. In other words, the second a user joins Reddit and posts anything on the platform, that data can be sold to a data-scraper and used for the purposes of training an AI model. Considering that 72% of young consumers (the target audience for all of the above listed platforms) do not, or only rarely, read terms of service and user agreements, this suggests that users have not given explicit consent to data scraping and to the transformation of their content into an AI's output. This would be a question for the Court in each relevant jurisdiction. It is important to note that Reddit (and other platforms with similar terms), of course, profits from this by entering into what is, in effect, a sub-licensing agreement with the data-scraping entity.
What is fair use and is data scraping fair use?
Depending on the jurisdiction, fair use doctrines are broad defences to copyright infringement. In the UK, fair dealing applies when works are used for non-commercial research or private study, criticism or review, news reporting, or caricature, parody or pastiche. What amounts to 'fair dealing' will vary by jurisdiction; however, the UK courts have held that fair dealing is a question of degree, or of fact and impression. It was held in Hubbard v Vosper [1972] 2 Q.B. 84 and Beloff v Pressdram Ltd [1973] RPC 765 that if the relevant parties are competitors and the infringing use affects the market for the original copyright-protected work, a defence of fair dealing is unlikely to exist. Further, the quantity of the work taken is also relevant: if the infringer uses the entirety (or the majority) of a work, it is unlikely that fair dealing applies. Nevertheless, the courts will also take into account whether the work was sufficiently modified before reaching the end-consumer of the infringing goods.
In the USA, fair use is defined more broadly. For instance, a defining characteristic is whether an infringing work is sufficiently transformative. Data mining and scraping have broadly been accepted as falling within fair use in the AI context, on the basis that the end result is sufficiently transformative. Generative AI models have similarly been described as "spectacularly transformative" (although the fair use argument has been rejected in respect of non-generative AI models).
In the UK, it is unlikely that data scraping would be considered 'fair dealing'. While a company such as Reddit will likely have sub-licensing agreements in place with AI providers, it is often the case that an AI company scrapes data without such a deal. This potentially means that the copyright owner (i.e. the content creator) did not agree to the material being used, whether by explicit consent or otherwise. In fact, the initial models released by OpenAI used data, including data available on Reddit and other social media platforms, without express agreement from the data controller (and thereby without even the implicit consent of the user). This was possible because previous user agreements allowed free access, and because there was simply no guidance on whether using the data infringed the IP rights of the data controller or of the user. Those permissive policies were only reversed as recently as 2023.
In the UK, the courts have not yet provided any guidance on whether data scraping for the purposes of AI model training infringes the IP rights of the owner (in the absence of a licensing agreement). It could be argued that, as in the United States, consumer data scraped for AI purposes is sufficiently transformed to qualify for fair dealing. However, this is unlikely to succeed in England because scrapers copy entire works for commercial use, potentially harming site revenues. To date, the courts have not tested this argument.
The Copyright, Designs and Patents Act 1988 was amended in 2014 by the Copyright and Rights in Performances (Research, Education, Libraries and Archives) Regulations 2014 to include a new section 29A, which created a defence for data scraping (there termed text and data mining) where the scraping is carried out for non-commercial purposes. However, this defence clearly would not apply to AI models, as the creators of AI obviously want to commercialise the data the model uses.
The Copyright and Rights in Databases Regulations 1997 allow a fair dealing defence for data scraping only if the database is publicly available, accessed lawfully, used for teaching or research (not commercial purposes) and properly attributed. Again, this defence is unlikely to apply to data scraping for the training of generative LLMs.
Overall, it is therefore inherently unlikely that the UK's fair dealing exception would apply to data scraping in the context of generative AI.
What are the challenges for developers and content creators?
Content creators and businesses face mounting challenges as AI-driven data scraping reshapes the digital economy. The widespread use of automated bots to harvest creative works, ranging from articles and images to music and video, undermines traditional revenue models that rely on traffic, subscriptions and advertising. As AI-generated summaries and outputs increasingly replace direct engagement with original sources, creators risk losing both visibility and income. Legal recourse is impractical for smaller creators due to the cost and complexity of litigation, but also because there has been no clear guidance in jurisdictions like the UK on whether such scraping is in fact IP infringement. This leaves many reliant on technical measures such as bot-blocking tools (one common mechanism is sketched below) or restrictive licensing terms. However, these measures offer limited protection against large-scale scraping, creating a persistent imbalance between the value extracted by AI developers and the compensation received by rights holders.
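As an illustration, one widely used bot-blocking mechanism is the robots.txt file, through which a site asks crawlers not to fetch certain pages. The sketch below assumes a hypothetical site at example.com and uses Python's standard urllib.robotparser module; "GPTBot" is the user-agent string OpenAI publishes for its crawler, while "MyResearchBot" is an invented name for comparison.

```python
# A minimal sketch of how a *compliant* crawler honours robots.txt.
# The hypothetical site's robots.txt might read:
#   User-agent: GPTBot
#   Disallow: /
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()  # fetch and parse the site's robots.txt

# "GPTBot" is OpenAI's published crawler user agent; "MyResearchBot" is
# an invented name used here for comparison.
for user_agent in ("GPTBot", "MyResearchBot"):
    for url in ("https://example.com/posts/123", "https://example.com/about"):
        print(f"{user_agent} may fetch {url}: {rp.can_fetch(user_agent, url)}")
```

Crucially, compliance with robots.txt is voluntary: a non-compliant scraper can simply ignore the file, which is precisely why such measures offer only limited protection.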
AI developers face a different but equally complex set of challenges. Training state-of-the-art models requires vast amounts of data, much of which is sourced through scraping practices that sit in a legal grey area. While some jurisdictions provide limited exceptions, such as text and data mining rights in the EU, others impose strict copyright and database protections, creating a patchwork of compliance obligations. Developers must also navigate contractual restrictions in website terms of service, potential claims of copyright infringement and emerging litigation over outputs that replicate protected works or mimic distinctive styles. Beyond legal risk, reputational concerns and the demand for ethical AI practices are driving calls for greater transparency in dataset composition and licensing arrangements. Balancing innovation with respect for IP is becoming a strategic imperative for AI companies operating in this uncertain regulatory environment.
Policymakers, particularly in the UK, are tasked with reconciling two competing priorities: fostering AI innovation and safeguarding IP rights. Existing copyright frameworks, many of which predate modern AI technologies, struggle to address the scale and nature of data scraping. Proposals such as transparency obligations for training datasets, licensing schemes and remuneration models for rights holders are gaining traction, but these measures remain fragmented and often lack practical enforcement mechanisms. Policymakers must also consider adjacent issues, including privacy compliance, database rights, and the potential for AI outputs to infringe moral or publicity rights. Achieving a coherent, future-proof regulatory framework will require international coordination and a nuanced balance between innovation, fairness and accountability.
Conclusion
It is unlikely that any defences apply to data scraping in the UK from an intellectual property infringement standpoint. Nevertheless, the licensing chain which has been adopted by most data controllers (such as in the Reddit example used in this article) has arguably circumvented the need for a defence. It is noteworthy, however, that a copyright claim currently before the UK courts – Getty Images v Stability AI – concerns precisely this issue. At the time of writing, judgment has not been handed down; nevertheless, it is clear that, irrespective of the result, it will have a transformative impact on the British AI-IP landscape.
In terms of overall policy, the UK government is currently examining how data scraping for training generative AI should be regulated. While this discussion falls outside the scope of this article, it is important to note that any forthcoming changes are unlikely to amend existing data protection laws. Under the Data Protection Act 2018, user information – such as usernames and IP addresses – qualifies as personal data. Consequently, AI companies that scrape such data without consent risk breaching these obligations. Licensing deals do not circumvent this requirement, and neither do user agreements.