    Reddit announces new rules to tighten control on web crawlers

    On June 25, Reddit announced plans to update its robot exclusion protocol (robots.txt file), which instructs third parties on how they are permitted to crawl Reddit’s content. Web crawlers are automated bots that download and index the content of websites across the Internet. Search engines use them to index websites and make them appear in search results, and artificial intelligence (AI) companies use them to collect training data for their models.
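
    To make the mechanism concrete, here is a minimal sketch of how a well-behaved crawler consults a robots.txt file before fetching pages, using Python's standard library. The rules and bot names are illustrative assumptions, not Reddit's actual robots.txt:

    # Minimal sketch: checking hypothetical robots.txt rules with the
    # standard-library parser before crawling.
    from urllib.robotparser import RobotFileParser

    # Hypothetical rules: one AI crawler is blocked outright; all other
    # bots are kept out of /private/ and asked to wait 10s between requests.
    EXAMPLE_RULES = [
        "User-agent: ExampleAIBot",
        "Disallow: /",
        "",
        "User-agent: *",
        "Disallow: /private/",
        "Crawl-delay: 10",
    ]

    parser = RobotFileParser()
    parser.parse(EXAMPLE_RULES)

    # A compliant crawler makes these checks before every request.
    print(parser.can_fetch("ExampleAIBot", "https://example.com/r/all"))   # False
    print(parser.can_fetch("SearchBot", "https://example.com/r/all"))      # True
    print(parser.can_fetch("SearchBot", "https://example.com/private/x"))  # False
    print(parser.crawl_delay("SearchBot"))                                 # 10

    Note that the protocol is advisory: it relies on crawlers choosing to honor these rules, which is why Reddit pairs it with server-side enforcement.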

    Reddit said that, besides making changes to its robot exclusion protocol, it will also continue rate-limiting and/or blocking unknown bots and crawlers from accessing Reddit.com. The company added that this would not impact the majority of Reddit users, including independent researchers and organizations like the Internet Archive that access Reddit content for non-commercial purposes.
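
    On the technical side, rate limiting of this kind is typically keyed to how a client identifies itself and how quickly it makes requests. Below is a minimal, assumed sketch in Python of a sliding-window limiter that throttles unknown user agents while leaving an allow-listed crawler untouched; it is purely illustrative, not Reddit's actual implementation:

    import time
    from collections import defaultdict

    ALLOWED_AGENTS = {"GoodSearchBot"}   # hypothetical allow list
    WINDOW_SECONDS = 60                  # size of the sliding window
    MAX_REQUESTS_PER_WINDOW = 10         # request budget for unknown agents

    _request_log = defaultdict(list)     # user agent -> recent request timestamps

    def should_serve(user_agent: str) -> bool:
        """Return True to serve the request, False to throttle it."""
        if user_agent in ALLOWED_AGENTS:
            return True                  # known crawlers are not limited
        now = time.monotonic()
        # Keep only the requests that fall inside the current window.
        recent = [t for t in _request_log[user_agent] if now - t < WINDOW_SECONDS]
        if len(recent) >= MAX_REQUESTS_PER_WINDOW:
            _request_log[user_agent] = recent
            return False                 # over budget: throttle or block
        recent.append(now)
        _request_log[user_agent] = recent
        return True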

    Why it matters:

    While Reddit has not specified that this decision was taken to prevent AI companies from scraping its data, the platform has historically opposed AI firms gaining free access to its database. In an interview with the New York Times last year, Reddit’s co-founder and CEO Steve Huffman said, “The Reddit corpus of data is really valuable, but we don’t need to give all of that value to some of the largest companies in the world for free.” Huffman said that what makes Reddit’s data so valuable is that people share extremely personal details of their lives on the platform, which could improve the performance of large language models.

    Notably, the company had earlier changed its application programming interface (API) terms, introducing charges for third-party applications that make a high number of data requests. It has also entered into a $60 million agreement with Google to make its content available for training Google’s AI models.

    With all this considered, the changes to the robot exclusion protocol could be an attempt to curb data scraping.

    Some context:

    This announcement comes a month after Reddit introduced its public content policy, which outlines what information its partners (which would include companies like Google) can access via a public-content licensing agreement. The policy specifies that Reddit’s partners must uphold the privacy of Redditors and their communities, which includes respecting users’ decisions to delete their content and any content Reddit removes for content policy violations. Furthermore, partners are not allowed to:

    • Use content to identify individuals or their personal information, including for ad-targeting purposes
    • Use Reddit content to conduct background checks, facial recognition, government surveillance, or help law enforcement do any of the above
    • Use Reddit content to spam or harass Redditors
    • Access Reddit content that includes adult media
