Meta’s new AI image generator was trained on 1.1 billion Instagram and Facebook photos

Lee Duna@lemmy.nz · 2 years ago


Otter@lemmy.ca · 2 years ago

So I assume they added any necessary stuff to the TOS to allow this.

My question is if there’s any legal mechanism to prevent this on other platforms? Pixelfed for example.

Companies will likely federate and pull images regardless, but can we go after them when they’re caught? Nothing prevents them from taking the images for internal R&D, but at least we can stop them from selling products with that training data

helenslunch@feddit.nl · 2 years ago

So I assume they added any necessary stuff to the TOS to allow this.

Never read it but I assume it already was. Pretty much every platform has a clause that says something along the lines of “we own all the content you submit to our service”.

phx@lemmy.ca · 2 years ago

Actually it’s usually more “you own the content, but by posting it you grant us and our partners an irrevocable right to use it”

Basically, it allows them to use the content without taking on the responsibility of owning anything inappropriate

Supercritical@lemmy.world · 2 years ago

Exactly. Instagram doesn’t claim ownership of any of your content, but Instagram’s terms of use state that the user grants Instagram a non-exclusive, fully paid and royalty-free, transferable, sub-licensable, worldwide license to use their content. Additionally, they can make money off your content without ever paying you a cut. Honestly, it’s pretty boilerplate at this point. No one should expect anything else from corporations.

maegul (he/they)@lemmy.ml · 2 years ago

My question is if there’s any legal mechanism to prevent this on other platforms? Pixelfed for example.

Good question!

I’ve been saying for a while that the fediverse is blind to this issue as everything here is completely scrapable through either the public web or by running federated servers. On top of that, being culturally inclined toward more “serious” conversation and providing content warnings and alt-text for images, we’re probably generating relatively valuable training data.

And yet everything is public as though it’s still 2012.

There are alternatives. BlueSky for instance is basically private to members only. They recently announced that content would be made public to the web and a number of users were upset.

Group chats and Discord servers are probably similar and, from what I can tell, are the “new” popular places for social activity online.

A major issue the fediverse has, IMO, is that it’s kinda stuck trying to fight Twitter and Facebook circa 2012, when that battle was lost and we’re on to new battle fronts now.

Otter@lemmy.ca · 2 years ago

Yea that’s something that’s been on my mind as well

There are benefits from that openness and verifiability in public spaces (ex. Lemmy communities), since now it’s easier to determine if there’s vote manipulation or astroturfing. But I think the fediverse needs a lot of work around privacy, and also education about what is/isn’t private on these platforms.

There should also be more of a focus on setting up legal requirements for what can be done with the information, but I’m not sure if that’s a thing just yet. We developed GPLv3 to make sure FOSS products can’t be folded into proprietary products without the source being shared, but I’m not sure how it would work for data.

ex. It should be easy to save, record, and share posts on the fediverse, such as with embeds/screenshots/news stories

But also we want to prevent abuse, misuse, and AI training

Halcyon@discuss.tchncs.de · 2 years ago

Bluesky being accessible only to members doesn’t completely prevent the content from being scraped by bots, though. Bots can be given user access on Bluesky too, and bots can read posts, create their own posts, and scrape posts and user profiles.

maegul (he/they)@lemmy.ml · 2 years ago

My main point with BlueSky was that many of the users there had gotten quite comfortable with what appeared to be their closed/private space, which, despite examples like yours, was relatively true compared to the norms of Twitter and Mastodon.

The point was that many over there seemed to like it, and, if a BlueSky competitor opened up today promising all the same stuff but closed/private with the ability to opt out and make something public, many would probably jump ship or demand the same from BlueSky.

PupBiru@kbin.social · 2 years ago

afaik activitypub/fediverse doesn’t have to be fully open… there are private messages and followers-only profiles on mastodon… sure, the server admins of any of your followers would be able to see anything you post (so with threads, for example, if you accept any follower from threads then meta can see your stuff), but this also doesn’t grant them a license to use the content

also, bluesky will eventually be the same: it only doesn’t have those issues now because they haven’t opened up their software… it’ll have federation in the future, which means it has to be somewhat programmatically open to others

Eezyville@sh.itjust.works · 2 years ago

I think in order to fight against these companies using our data for AI training, we would have to do something like watermarking our images, explicitly stating that they are not for AI training. Or we create some type of countermeasure that messes up the training.

Dkarma@lemmy.world · 2 years ago

You’re never going to get rights over the training data created from your pictures when they are freely available for anything to scan. By putting your pictures on the internet, you basically allow them to be viewed by anyone or anything, even an AI. You have never gotten to control who looks at your content after you post it.

You’re trying to make the same argument the “don’t copy my nft” bros tried to make.

Imagine going into court and saying you should get paid for all the stuff you willingly gave away for free on the Internet.

Otter@lemmy.ca · 2 years ago

Well there’s a difference between “don’t look at my work without paying me, even if it’s posted publicly” and “don’t sell my work without paying me, even if it’s posted publicly”

Like I said, there’s nothing we can do about companies using all the data they can get their hands on for private R&D. It IS possible to protect against the second case, where companies can’t sell an LLM product with copyrighted training data.

My question was about how that second case could be extended to stuff posted on the Fediverse, such as if an instance had a blanket “all rights belong to the user posting the content”.

These laws exist. If companies can use them, then so can we.

I_Has_A_Hat@startrek.website · 2 years ago

where companies can’t sell an LLM product with copyrighted training data.

If an artist learns their technique from copying other artists until they are competent enough to produce their own original works, should they be banned from selling their original work or services? After all, they used copyrighted training data to gain the skills needed to produce said work and services.

BURN@lemmy.world · 2 years ago

LLMs and generative AI do not learn like humans, and regulating them the same way would be disingenuous and completely off base.