On the Frontlines of Copyright and AI

The New York Times Company v. Microsoft Corporation and OpenAI, Inc.

Josué David Citalán Hernández

7/9/2025 · 8 min read

Key Takeaways

  1. The New York Times is suing OpenAI and Microsoft, claiming their AI models were trained on Times articles without permission, raising major questions about copyright in the AI era.

  2. A court has ordered OpenAI to preserve all AI output data, even if users request deletion, spotlighting the clash between legal discovery and privacy rights like those under GDPR.

  3. The outcome of this case could reshape how AI companies handle both copyrighted and personal data, setting new standards for intellectual property and privacy compliance.

  4. This case is closely watched alongside the resolution of Bartz v. Anthropic: while Bartz focused on the legality of training data, the Times case tests whether AI-generated outputs themselves can infringe copyright, showing that courts are tackling both how AI learns and what it produces.

On May 13, 2025, the United States District Court for the Southern District of New York issued a preservation order in the ongoing litigation between The New York Times Company (The Times) and OpenAI, along with Microsoft and related entities. This case has quickly become a defining moment in the evolving intersection of artificial intelligence and intellectual property law. At its core, the dispute centers on allegations that OpenAI and Microsoft used millions of Times articles to train large language models without authorization. The Times contends that this is not simply a conflict between a leading news organization and a technology innovator, but a pivotal legal battle that will shape the future boundaries of copyright protection, data privacy, and the development of AI technologies.

The preservation order, issued by Magistrate Judge Ona T. Wang, requires OpenAI to preserve and segregate all output log data that would otherwise be deleted, including data subject to user deletion requests or privacy regulations, on a going-forward basis. This directive, which applies to all variants of ChatGPT, reflects the court’s recognition of the need to balance the demands of legal discovery with the privacy interests of users. The order underscores the significance of this litigation, not only for the parties involved but for the broader legal and technological landscape, as courts and companies alike grapple with the complex interplay between AI innovation, intellectual property rights, and data protection obligations.

Copyright, Competition, and the Value of Journalism

At the core of The New York Times v. Microsoft Corporation, OpenAI, et al. lies a fundamental allegation: OpenAI and Microsoft, in their quest to build ever more powerful generative models, systematically ingested and utilized the Times’ copyrighted content without authorization. The complaint is sweeping in its scope, detailing how the defendants allegedly harvested articles from the Times’ digital archives, spanning back to 1851 (whoa!), and incorporated them into the training data for models such as GPT-4. The Times asserts that this was not a mere technical oversight, but a calculated act of commercial exploitation. The models, the complaint alleges, are capable of reproducing near-verbatim copies of Times articles, sometimes even stripping away copyright management information in the process.

The Times’ complaint lays out a wide range of concerns about how its journalism has been used by OpenAI and Microsoft. At its core, the Times argues that its articles (created through years of investment in reporting and editorial work) were taken and used to build AI products without permission. The Times claims this isn’t just a technical or accidental issue, but a deliberate strategy that allowed OpenAI and Microsoft to benefit from the Times’ work without sharing in the costs or respecting the rights of the original creators.

Beyond the question of copying, the Times points to the ways this use of its content could harm its business. By making Times journalism available through AI tools, the complaint argues, OpenAI and Microsoft are drawing readers and potential subscribers away from the Times’ own platforms, threatening its ability to fund quality reporting. The Times also expresses concern that its brand and reputation could be damaged if AI-generated content is mistaken for authentic Times journalism, especially if that content is inaccurate or misleading.

A Call for Accountability and Restraint

The Times is seeking a broad range of remedies in this case, reflecting just how seriously it views the alleged misuse of its journalism. The company is asking the court for financial compensation to address the harm it believes has been done. This could include damages for each article used, repayment of any profits OpenAI and Microsoft made from using Times content, and coverage of legal costs.

However, the Times is also looking for much more than money. It wants the court to put a stop to any further unauthorized use of its work, to require the destruction of AI models that were trained using Times material, and to ensure that strong safeguards are put in place to prevent similar issues in the future.

Importantly, the New York Times does not name a specific dollar amount in its complaint. This is typical in cases of this kind, since the actual amount would depend on how many articles were involved, whether the court finds the infringement was willful, and how much profit can be traced to the use of Times content. Under U.S. copyright law, statutory damages can be substantial: up to $150,000 per work if the infringement is found to be willful. But the final figure would only be determined after a full review of the evidence.
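For a rough sense of scale, the sketch below computes statutory damages exposure under 17 U.S.C. § 504(c) for purely hypothetical counts of registered works. The per-work figures come from the statute; the work counts and the code itself are illustrative only and appear nowhere in the complaint.

```python
# Purely illustrative arithmetic under 17 U.S.C. § 504(c).
# The work counts below are hypothetical; the complaint pleads no
# specific number of registered works and no dollar figure.

STATUTORY_MIN = 750      # per work, ordinary infringement (statutory floor)
STATUTORY_MAX = 30_000   # per work, ordinary infringement (statutory ceiling)
WILLFUL_MAX = 150_000    # per work, if the infringement is found willful

def exposure_range(num_works: int, willful: bool = False) -> tuple[int, int]:
    """Return the (low, high) statutory damages range for num_works."""
    high = WILLFUL_MAX if willful else STATUTORY_MAX
    return num_works * STATUTORY_MIN, num_works * high

for works in (1_000, 10_000):  # hypothetical counts, for scale only
    low, high = exposure_range(works, willful=True)
    print(f"{works:>6} works: ${low:,} to ${high:,}")
```

Even at these invented counts, the arithmetic shows why a willfulness finding matters: the statutory ceiling is two hundred times the floor.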

As articulated in the complaint’s prayer for relief:

“An award of damages, including statutory damages, actual damages, and any profits of Defendants attributable to the infringement of The Times’s copyrights, as provided by 17 U.S.C. § 504, and all other relief allowed by law.”
— Page 75, Paragraph 1 of the Complaint (The New York Times Company v. Microsoft Corporation, OpenAI, Inc., et al., Case No. 1:23-cv-11195, S.D.N.Y.)

The Practical Impact of the Preservation Order

In practical terms, the preservation order issued by the court imposes a significant operational and compliance obligation on OpenAI. The order requires OpenAI to retain and segregate all output log data generated by its ChatGPT models that would otherwise be deleted, whether due to routine data retention policies, user-initiated deletion requests, or requirements under privacy regulations such as the GDPR or CCPA. This means that, for the duration of the litigation or until further notice from the court, OpenAI must suspend its standard data deletion processes for this category of data and ensure that such information is preserved in a manner that prevents alteration or loss.

Implementing this order necessitates technical and procedural adjustments. OpenAI must suspend or bypass its automated deletion routines for the covered logs and establish secure storage protocols to protect the integrity and confidentiality of the preserved data, given the sensitive nature of user interactions and the heightened privacy risks involved. Access to this data must be strictly controlled and documented, both to comply with the court’s directive and to mitigate potential exposure of personal or confidential information.
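As a concrete illustration of those adjustments, here is a minimal sketch of what a litigation-hold override in a log-deletion pipeline might look like: records that would normally be purged are instead copied to a segregated store, hashed for integrity, and audited on every access. Everything here is hypothetical; the class and field names are invented, and nothing is publicly known about OpenAI’s actual systems.

```python
import hashlib
import logging
from dataclasses import dataclass
from datetime import datetime, timezone

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("retention")

LEGAL_HOLD_ACTIVE = True  # flipped on while the preservation order is in force


@dataclass
class OutputLog:
    """One model output record (hypothetical schema)."""
    log_id: str
    content: str  # the full output text, not just summaries or metadata
    deletion_requested: bool = False


class PreservationStore:
    """Segregated store with integrity hashes and access auditing (sketch)."""

    def __init__(self) -> None:
        self._records: dict[str, tuple[OutputLog, str]] = {}

    def put(self, log: OutputLog) -> None:
        # Hash the content so later alteration or loss is detectable.
        digest = hashlib.sha256(log.content.encode()).hexdigest()
        self._records[log.log_id] = (log, digest)
        logger.info("preserved %s at %s sha256=%s", log.log_id,
                    datetime.now(timezone.utc).isoformat(), digest)

    def get(self, log_id: str, reviewer: str) -> OutputLog:
        # Every access is documented, consistent with the order's
        # confidentiality concerns.
        logger.info("access to %s by %s", log_id, reviewer)
        return self._records[log_id][0]


def process_expiring_log(log: OutputLog, store: PreservationStore,
                         live_db: dict[str, OutputLog]) -> None:
    """Deletion path: while the hold is active, segregate instead of delete."""
    if LEGAL_HOLD_ACTIVE:
        store.put(log)           # preserve and segregate per the order
    else:
        del live_db[log.log_id]  # normal deletion (30-day TTL or user request)
```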

The scope and intent of the order are made explicit in the court’s language:

“OpenAI is NOW DIRECTED to preserve and segregate all output log data that would otherwise be deleted on a going forward basis until further order of the Court (in essence, the output log data that OpenAI has been destroying), whether such data might be deleted at a user’s request or because of ‘numerous privacy laws and regulations’ that might require OpenAI to do so.”

— Page 2, Paragraph 2 of the May 13, 2025 Preservation Order (NYT v. Microsoft Corporation, OpenAI, Inc., et al., 25-md-3143, S.D.N.Y.)

From a legal perspective, the preservation order ensures that potentially critical evidence (such as records of specific outputs generated by ChatGPT, which may demonstrate the use or reproduction of New York Times content) remains available for discovery and adjudication. At the same time, the order highlights the tension between the requirements of U.S. litigation and the privacy rights of users, particularly those protected by international data protection regimes. OpenAI is thus placed in the position of having to navigate and reconcile these competing obligations, under close judicial supervision.

Ultimately, the preservation order serves as a concrete example of how courts are beginning to address the unique evidentiary and privacy challenges posed by AI systems. It signals to the broader industry that, in the context of litigation, data that would otherwise be ephemeral or subject to deletion may need to be retained, even in the face of user privacy requests or regulatory mandates. This development underscores the importance for AI companies to build flexible, legally compliant data management frameworks that can respond to evolving judicial and regulatory demands.

Understanding the Court’s Preservation Order

A central element of the court’s preservation order is its focus on the “30-day tables of consumer output log data.” In practice, this refers to the records OpenAI keeps of ChatGPT’s responses to users—essentially, the full outputs generated during user interactions. Under OpenAI’s standard policy, these logs are stored for 30 days and then automatically deleted to manage storage and reduce privacy risks.

The court’s order temporarily suspends this routine deletion process. OpenAI is now required to preserve and segregate all such output logs that would otherwise be deleted after 30 days, starting from the date of the order and continuing until further notice. Importantly, the order mandates the preservation of the complete content of these outputs, not just summaries or metadata. This comprehensive approach ensures that the court and parties have access to the full evidentiary record needed to assess whether any AI-generated outputs unlawfully reproduce or closely mimic New York Times articles.

Additionally, the court has instructed both parties to engage in “sampling” of these preserved logs. This means selecting and reviewing representative samples to determine whether there are material differences between the data being retained and what would have been deleted. The purpose is to inform the court’s analysis of whether deleting such data would result in the loss of important evidence, particularly in the context of potential copyright infringement.
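The order does not spell out a sampling methodology, but a minimal sketch of what such an exercise could look like follows: draw a reproducible random sample of the preserved logs, then split it into records a user had asked to delete versus records retained under the routine 30-day window, so both sides can compare the groups for material differences. The record schema and field names are assumptions for illustration.

```python
import random

# Hypothetical sketch of the sampling exercise described in the order.
# The dict schema ("log_id", "deletion_requested") is invented.

def sample_for_review(preserved_logs: list[dict], k: int = 100,
                      seed: int = 42) -> dict[str, list[dict]]:
    rng = random.Random(seed)  # fixed seed so both sides can reproduce the draw
    sample = rng.sample(preserved_logs, min(k, len(preserved_logs)))
    return {
        "would_have_been_deleted": [r for r in sample if r["deletion_requested"]],
        "routine_retention": [r for r in sample if not r["deletion_requested"]],
    }

# Example with toy records:
logs = [{"log_id": i, "deletion_requested": i % 3 == 0} for i in range(1000)]
groups = sample_for_review(logs, k=50)
print({name: len(records) for name, records in groups.items()})
```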

Interplay Between The New York Times v. Microsoft Corporation, OpenAI, et al. and Bartz v. Anthropic

The resolution of Bartz v. Anthropic PBC in the Northern District of California and the ongoing The New York Times v. Microsoft Corporation, OpenAI, et al. litigation in the Southern District of New York together represent the leading edge of judicial engagement with the intersection of copyright law and artificial intelligence. While each case arises from distinct factual circumstances and is governed by the procedural norms of its respective jurisdiction, the legal principles articulated in Bartz v. Anthropic are likely to inform, and perhaps even shape, the outcome and reasoning in the New York litigation.

The Bartz v. Anthropic decision provides one of the first clear judicial articulations that the act of training a language model on lawfully obtained copyrighted works constitutes fair use under Section 107 of the Copyright Act. The court’s analysis draws a bright line: the legitimacy of the training process is inseparable from the legitimacy of the data’s source. If the training corpus is built from lawfully acquired materials, whether purchased or digitized from owned copies, the use is protected as fair use. Conversely, the use of pirated or unlawfully obtained works, even if not directly used in training, irreparably taints the process and exposes the developer to liability. This “rotten fruit” principle, while not a formal doctrine in copyright law, is a powerful metaphor for the court’s insistence on lawful data provenance as a precondition for fair use.

In contrast, The New York Times v. Microsoft Corporation, OpenAI, et al. centers on allegations that OpenAI and Microsoft ingested Times content without authorization, raising the very question addressed in Bartz: does the use of copyrighted works for AI training constitute infringement, and if so, under what circumstances might it be excused as fair use? While Bartz v. Anthropic is fundamentally concerned with the lawfulness of the input data used to train AI models, the New York litigation is equally focused on the outputs (specifically, whether the AI models can reproduce or closely mimic protected Times content). This distinction is reflected in the court’s preservation order: the Times’ complaint alleges that the defendants did not obtain the necessary rights to use its content, and the order ensures that evidence of both the use of Times material and the nature of the AI’s outputs is maintained for judicial scrutiny.

Bartz v. Anthropic offers an elegant articulation of the limits of fair use: copyright law does not prohibit the act of learning (by either humans or machines) from copyrighted works, so long as the outputs do not reproduce or substantially mimic the originals. This principle is likely to be central in the New York case, where the Times alleges that OpenAI’s models can generate near-verbatim copies of its articles. The factual question of whether the outputs cross the line from learning to unlawful reproduction will be critical, and the California court’s reasoning provides a roadmap for how to analyze this issue.
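To make that factual question concrete, one crude screening heuristic (illustrative only, and emphatically not the legal test for substantial similarity) is the longest contiguous run of text shared between a model output and a source article. The sketch below uses Python’s standard difflib; the sample strings are invented.

```python
from difflib import SequenceMatcher

# Illustrative heuristic only: flag "near-verbatim" overlap by finding the
# longest contiguous substring shared by an article and a model output.
# This is a rough screening tool, not the legal test for infringement.

def longest_shared_run(article: str, output: str) -> str:
    matcher = SequenceMatcher(None, article, output, autojunk=False)
    m = matcher.find_longest_match(0, len(article), 0, len(output))
    return article[m.a: m.a + m.size]

article = "The quick brown fox jumps over the lazy dog near the riverbank."
output = "Summary: the quick brown fox jumps over the lazy dog, per reports."
run = longest_shared_run(article.lower(), output.lower())
print(f"{len(run.split())} shared words: {run!r}")
```

A real analysis would have to control for quotation, common phrases, and paraphrase, which is precisely why the preserved output logs matter as evidence.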

Finally, both cases underscore the importance of transparency, data provenance, and robust compliance mechanisms in the development of generative AI. The preservation order in The New York Times v. Microsoft Corporation, OpenAI, et al. and the resolution of Bartz v. Anthropic send a unified message: innovation in AI must be grounded in respect for intellectual property rights, and companies that disregard these boundaries face significant legal and financial risks.