Friday, May 24, 2024

    Summary: Here are the six key arguments raised by local newspapers in their lawsuit against OpenAI

    Eight newspaper publications in the US, including the New York Daily News and Chicago Tribune, have filed a lawsuit against OpenAI and Microsoft alleging that the two used millions of their copyrighted articles without permission or compensation. The publications claim that they spend a significant amount of time and effort investigating and reporting local stories. By copying their work without compensation, OpenAI and Microsoft are depriving them of site visits, subscriptions, and licensing revenue.

    The publications state the commercial success of OpenAI is built on copyright infringement. “One of the central features driving the use and sales of ChatGPT and its associated products is the LLM’s [large language model] ability to produce natural language text in a variety of styles,” they mention, adding that to produce such text, OpenAI has made numerous reproductions of copyrighted works owned by the publications in the course of “training” the LLM.

    The publications mention that while OpenAI has not revealed the training datasets for its models beyond GPT-2, it did publish general information about the models. This included the fact that GPT-2 was trained on an internal corpus called “WebText”, which contains 45 million links posted by users on Reddit. Of these WebText entries, 145,220 were scraped from the publications’ websites. Similarly, for GPT-3, OpenAI relied heavily on Common Crawl. The publications point out that their content makes up 124 tokens of the C4 dataset (Google’s Colossal Clean Crawled Corpus, which is Google’s version of the Common Crawl corpus). GPT-3 also relied on WebText2, which was created by scraping links from the internet over a long period of time. Given the use of WebText, WebText2, and other datasets to train the GPT models, and the shift in ChatGPT’s knowledge cutoff date, the publishers believe that their work continues to be copied.

    Key points raised in the lawsuit:

    GPT models have “memorized” the publications’ work:

    The lawsuit explains that AI models exhibit a behaviour called memorization: given the right prompt, they will repeat large portions of the materials they were trained on. According to the lawsuit, this shows that LLM parameters “encode retrievable copies of many of those training works.” The newspapers tested whether ChatGPT had memorized their work by feeding it the opening of a story, to which the AI replied with a verbatim portion of the publisher’s original article. The newspapers also tested the AI in several other ways, including:

    • asking it to summarize a story based on its publication and title
    • asking it to summarize a story first and then follow the summary up with the actual verbatim text

    This is notably similar to what the New York Times did to examine whether the AI models generated verbatim copies and summaries of its stories if prompted to do so.
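
    As a rough illustration of how such a prompt-and-compare test could be scored, the sketch below measures the longest run of consecutive words that a model's output shares with the original article. It is a minimal sketch, not the method used in the lawsuit: the function names and the sample strings are hypothetical, and no real article text or model output is used.

```python
from difflib import SequenceMatcher


def longest_verbatim_run(original: str, output: str) -> list[str]:
    """Return the longest run of consecutive words found in both texts."""
    a = original.split()
    b = output.split()
    # autojunk=False stops difflib from ignoring very frequent words.
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    match = matcher.find_longest_match(0, len(a), 0, len(b))
    return a[match.a : match.a + match.size]


def verbatim_fraction(original: str, output: str) -> float:
    """Share of the output's words covered by the longest shared run."""
    words = output.split()
    if not words:
        return 0.0
    return len(longest_verbatim_run(original, output)) / len(words)


# Hypothetical strings standing in for an article and a model response.
ARTICLE = "the council voted on tuesday to approve the new budget after a long debate"
RESPONSE = "Sure. the council voted on tuesday to approve the plan"

run = longest_verbatim_run(ARTICLE, RESPONSE)
print(" ".join(run))
print(verbatim_fraction(ARTICLE, RESPONSE))
```

    A long shared run relative to the length of the response suggests reproduction rather than paraphrase, although any threshold for what counts as "verbatim" would be a judgment call.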

    Microsoft and OpenAI contribute to end-user infringement (and know about it):

    The publications say that if Microsoft and OpenAI were to argue that the end user is the infringer when a publication’s work is generated as an output by their models, it must be noted that the two directly and materially aided in this infringement. The two companies should have known that their models would result in infringing output, given the models’ propensity to memorize training material. The publications add that the companies did in fact know that end users use their AI products to elicit copyrighted content, as shown by their own acknowledgement of the issue on OpenAI’s website.

    It has also been widely reported that users use the Browse with Bing plug-in to circumvent paywalls. Reports have mentioned that people tend to use GPT models to create disinformation, misinformation, or poor replications of newspapers’ copyrighted content on AI-generated “pink slime” news sites. Further, OpenAI’s Custom GPT store contains numerous Custom GPTs specifically designed to circumvent the publications’ paywalls. The lawsuit gives the example of the “Remove Paywall” GPT, which provides the text content needed to bypass paywalls legally. It also brings up the example of a “News Summarizer” Custom GPT that encourages users to “save on subscription costs” and “skip paywalls just using the link text or URL.”

    OpenAI can prevent users from violating its policies:

    In some instances where ChatGPT detects that a user’s query seeks to elicit output that violates OpenAI’s content policy (which requires users to comply with applicable laws), it displays a message saying “This content may violate our content policy.” The company can not only monitor violations of its policy but also terminate the accounts of those requesting copyrighted materials, as per its Terms of Use.

    The removal of copyright management information from the published content:

    The publications convey the authors’ names and their own names with their stories. They also convey their copyright management information and terms and conditions in the webpage footer. The publications allege that this information was intentionally removed during scraping, model training, and the distribution of unauthorized copies of their works. According to the lawsuit, OpenAI knew that if it did so, the copyright management information would not be displayed when its models generated copies of the publications’ works. This would conceal the companies’ own infringement and induce, enable, facilitate, or conceal end users’ infringement as well.

    The companies used content extractors (the Dragnet and Newspaper content extractors) that, by design, removed the copyright information. According to the lawsuit, OpenAI removed copyright management information to allow users to claim the publications’ works as their own. The lawsuit quotes OpenAI’s terms of use, which suggest that end users own the output, even though the output contains reproductions of the publications’ work. The lawsuit gives the example of ChatGPT producing a copy of a New York Daily News story and expressly stating that the user should “feel free” to incorporate it into their blog.
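
    To show why a main-text extractor would drop this kind of information, here is a toy sketch using only Python's standard library. It is not how Dragnet or Newspaper are implemented; the `BodyTextExtractor` class, the `extract_body` helper, and the sample page are all hypothetical. The sketch keeps only text inside paragraph tags, so the byline and the footer notice never reach the output.

```python
from html.parser import HTMLParser


class BodyTextExtractor(HTMLParser):
    """Toy extractor: keep text inside <p> tags and discard everything else."""

    def __init__(self) -> None:
        super().__init__()
        self._in_paragraph = False
        self.paragraphs: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_paragraph = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self._in_paragraph = False

    def handle_data(self, data):
        # Text outside <p> (headline, byline, footer) is silently dropped.
        if self._in_paragraph:
            self.paragraphs[-1] += data


def extract_body(html: str) -> str:
    parser = BodyTextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(p.strip() for p in parser.paragraphs if p.strip())


# Hypothetical page; the names and notice text are made up.
PAGE = """
<html><body>
<article>
  <h1>Example headline</h1>
  <span class="byline">By A. Reporter</span>
  <p>First paragraph of the story.</p>
  <p>Second paragraph of the story.</p>
</article>
<footer>Copyright 2024 Example Gazette. Terms of Service apply.</footer>
</body></html>
"""

print(extract_body(PAGE))
```

    In this sketch the story text survives while the author name and the copyright notice do not, which is the kind of outcome the lawsuit describes.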

    Wrongly attributing hallucinations to the publications:

    The publications claim that OpenAI is causing them “commercial and competitive injury” by misattributing content to them. The lawsuit gives the example that in response to a query asking which New York newspapers provided evidence to support and promote the erroneous belief that injecting disinfectants could cure COVID-19, ChatGPT responded that the New York Daily News promoted this narrative.

    The Browse with Bing plug-in and Copilot adversely impact news website views:

    The publications argue that Copilot and the Browse with Bing feature can create summaries of search result content, which makes it unnecessary for a reader to visit the news publications’ websites. These search result summaries “maintain engagement with Defendants’ [Microsoft and OpenAI] own sites and applications instead of referring users to the Publishers’ websites in the same way as organic listings of search results,” the lawsuit reads.

    These summaries go beyond what traditional search results include.

    The publications claim that the AI models sometimes output several paragraphs or the entirety of their works. Even when the responses contain links to source materials, users have less need to navigate to those sources because their expressive content is already included in the narrative result. They argue that such an indication of attribution may make users more likely to trust the summary alone and not click through to verify. A user who has already read the latest news, even (or especially) with attribution to the publishers, has less reason to visit the original source. As such, the synthetic search results (or summaries) created by Copilot misappropriate “hot news”.



    The post Summary: Here are the six key arguments raised by local newspapers in their lawsuit against OpenAI appeared first on MediaNama.
