AI News Roundup – California’s AI regulations, YouTube videos to train AI models, EU’s efforts to regulate AI, and more

To help you stay on top of the latest news, our AI practice group has compiled a roundup of the developments we are following.

    • California’s proposed wide-ranging regulations on AI technologies are moving forward in the state’s legislature, according to The Washington Post. We covered the proposed legislation in a previous AI Roundup. The most prominent bill, introduced by Scott Wiener, a Democratic state senator from San Francisco, would require AI companies to test their models for “catastrophic” risks before public release. The bill has faced opposition from tech industry leaders, who argue it could stifle innovation and put smaller startups at a disadvantage. Other AI-related bills in California aim to address issues such as bias testing, protection of children’s data and transparency in AI model development. The state’s efforts highlight the growing tension between the tech industry’s calls for regulation and its resistance to specific legislative measures, as well as the increasing role of state governments in shaping AI policy in the absence of comprehensive federal action.
    • WIRED reports that numerous tech companies, including Nvidia, Apple, Anthropic, and Salesforce, used materials from thousands of YouTube videos to train AI models, despite YouTube’s terms of service prohibiting data harvesting without permission. An investigation by Proof News found that subtitles from 173,536 YouTube videos, spanning over 48,000 channels, were incorporated into a dataset called YouTube Subtitles and were subsequently used by these tech giants to train various AI models. The content ranged from educational channels like Khan Academy to popular YouTubers such as MrBeast and PewDiePie. Many content creators were unaware their work had been used, raising concerns about consent, compensation and the potential impact on their livelihoods. The YouTube Subtitles dataset is part of a larger compilation known as the Pile, which includes other publicly available information from online sources. A spokesman for Anthropic confirmed the use of the Pile in the training of its AI chatbot Claude but noted that “YouTube’s terms cover direct use of its platform, which is distinct from use of the Pile dataset,” and directed questions about violating YouTube’s terms of service to the Pile’s creators.
    • The Financial Times reports on the European Union’s (EU) efforts to regulate AI technologies and ensure their ethical use. The EU’s Artificial Intelligence Act, set to come into effect in August 2024, aims to categorize AI systems based on risk levels and impose corresponding regulations. However, the rushed nature of the law’s development, particularly in response to the emergence of generative AI like OpenAI’s ChatGPT, has led to concerns about its implementation. Critics argue that the act lacks clarity on crucial issues such as copyright and enforcement, potentially hindering innovation in the EU’s tech sector. EU officials have faced challenges in filling regulatory gaps, hiring technical experts and balancing the desire to lead in AI regulation with the need to foster a competitive AI industry. While some view the act as a necessary step toward ensuring trustworthy AI, others fear it may put European companies at a disadvantage in the global AI race, especially against competitors in the U.S. and China.
    • OpenAI has released a new version of its GPT large language model, named GPT-4o Mini, intended to be lighter and cheaper than its full-sized counterparts, according to The Verge. The model is designed to make AI more accessible to developers and users by offering a more affordable option for building AI applications. GPT-4o Mini is reported to be more capable than OpenAI’s previous GPT-3.5 and will replace GPT-3.5 Turbo for ChatGPT users on the Free, Plus and Team plans. The new model supports text and vision inputs, with plans to handle other multimodal inputs such as video and audio in the future. OpenAI’s move is seen as a response to competing lightweight models such as Google’s Gemini 1.5 Flash and Anthropic’s Claude 3 Haiku, and it aims to encourage more widespread AI development and usage across industries and applications without heavy monetary or computational costs. (A brief sketch of calling the model through OpenAI’s developer API appears after this list.)
    • The New York Times reports on new research findings that several prominent online sources have restricted the use of their data for AI training, leading to a large drop in content available for that purpose. A study by the Data Provenance Initiative found that 5% of all data, and 25% of high-quality data, from 14,000 web domains in common AI training datasets have been restricted over the past year. This dramatic shift has been attributed to publishers and online platforms taking steps to prevent their data from being harvested, often by using “robots.txt” files that block automated web crawlers (illustrated in the sketch after this list) or by changing their terms of service. The trend poses challenges for AI companies, particularly smaller ones, and for researchers who rely on public datasets. Some major tech companies have responded by striking deals with publishers, while others are exploring synthetic data generation. The situation highlights growing tensions between AI developers and content creators over data usage and compensation, as well as the need for more nuanced tools to control data access for AI training purposes.
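
For readers who build on these models, the GPT-4o Mini item above notes that the model is aimed at developers seeking a cheaper option. The following is a minimal sketch of what a call to the model might look like using OpenAI’s Python SDK; it assumes the openai package is installed and an OPENAI_API_KEY environment variable is set, and the prompt is purely illustrative rather than drawn from the article.

```python
# Minimal sketch: calling a lightweight OpenAI model via the Python SDK.
# Assumes the `openai` package is installed and OPENAI_API_KEY is set in
# the environment; the prompt below is illustrative only.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",  # the lightweight model discussed above
    messages=[
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "Summarize the EU AI Act in one sentence."},
    ],
)

print(response.choices[0].message.content)
```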
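
The “robots.txt” mechanism mentioned in the last item is a plain-text file a site publishes to tell automated crawlers which paths they may fetch. The sketch below uses Python’s standard-library urllib.robotparser to show how such a rule can block one crawler while leaving others unaffected; the file contents, crawler names and URLs are hypothetical and not taken from any specific publisher.

```python
# Minimal sketch of how a robots.txt rule restricts an AI crawler.
# The robots.txt contents, user agents and URLs below are hypothetical.
from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# The hypothetical AI crawler is blocked from the entire site...
print(parser.can_fetch("ExampleAIBot", "https://example.com/articles/story"))  # False
# ...while other crawlers may still fetch the same page.
print(parser.can_fetch("GenericCrawler", "https://example.com/articles/story"))  # True
```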