KudoZ answers accessible for AI training
Thread poster: Philippe Locquet
Philippe Locquet
Philippe Locquet  Identity Verified
Portugal
Local time: 00:10
English to French
+ ...
Nov 2, 2023

Reading comments about the use of AI in KudoZ made me realize something. As some have said, KudoZ is a way to make their expertise shine and a nice way for language professionals to help one another.
However, here’s a potential pain point: most AI tools from big providers require tremendous amounts of training data. To minimize costs, this data is gathered with a technique called “scraping”: a program (a bot) scans the internet and grabs as much freely accessible text as it can. Many platforms (e.g. Reddit) are now blocking this by forcing anyone who wants their content to connect through an API. This way they either prevent scraping or allow it only against payment.
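
To make "scraping" a little more concrete, here is a minimal sketch in Python of the text-grabbing step only. It assumes the bot has already downloaded a page; the sample page is made up for illustration and is not real KudoZ markup, and the names `TextExtractor`, `extract_text` and `SAMPLE_PAGE` are hypothetical. Real scrapers are far more elaborate, so treat this as an illustration of the idea rather than how any particular AI provider works.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the visible text of a page, skipping script and style blocks."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0   # > 0 while inside <script> or <style>
        self.chunks = []       # pieces of visible text, in page order

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def extract_text(html: str) -> str:
    """Return the visible text of an HTML document as one string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical page, invented for this example.
SAMPLE_PAGE = """
<html><head><style>p {color: red}</style></head>
<body><h1>Term question</h1><p>Suggested answer: <b>example term</b></p>
<script>track();</script></body></html>
"""

print(extract_text(SAMPLE_PAGE))
```

Anything a browser can display without logging in can be collected this way, which is why the only real barriers are access controls or an API.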

So as technology progresses, I don’t see anything preventing KudoZ content from being harvested for AI training if it’s freely accessible to non-members.

If access from the outside is protected, KudoZ answers can’t be used to train AI, but this means less traffic and therefore less exposure for advertisement banners.
I can’t think of a solution for this; there are pros and cons, but I think we should at least be aware of it.

I tested that theory and yes, you can get the answer directly from an AI chat bot without having to visit the ProZ.com website (see the answer in bold and underlined in the screenshot below). It also gave me the correct link: https://www.proz.com/kudoz/english-to-french/medical-general/2465731-device-licence.html
Test carried out with Bing (aka Copilot).

Note that depending on the language and the answer formatting, the AI doesn’t always find the answer on ProZ.
I just thought that was worth writing about.



expressisverbis
Zea_Mays
Jennifer Levey
Maria Teresa Borges de Almeida
Marina Aleyeva
Anne Beckers
texjax DDS PhD
 
Zea_Mays
Zea_Mays  Identity Verified
Italy
Local time: 01:10
Member (2009)
English to German
+ ...
What is the difference vs search engines when asking the AI chat bot? Nov 2, 2023

Apart from training, that is, in case the AI chat bot is fed the data through a dedicated crawler?
You can ask the same in Google using "site:.proz.com" and it returns only results from ProZ (without explanations of course).

Can the AI crawlers be blocked individually when you know their name?

[Edited at 2023-11-02 17:54 GMT]
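
On blocking crawlers by name: one option on the server side is to look at the User-Agent header that each request carries and refuse the ones that match a list. Below is a minimal Python sketch of that check. GPTBot (OpenAI) and CCBot (Common Crawl) are publicly documented crawler names, but the list itself, the function name `is_blocked` and the sample header strings are assumptions made for illustration; nothing here describes what ProZ.com actually does.

```python
# Crawler names to refuse. The selection is an illustrative assumption.
BLOCKED_AGENTS = ("GPTBot", "CCBot")


def is_blocked(user_agent: str) -> bool:
    """Return True if the User-Agent header contains a blocked crawler name."""
    ua = user_agent.lower()
    return any(name.lower() in ua for name in BLOCKED_AGENTS)


# Sample header values, written for this example.
crawler = "Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)"
browser = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Firefox/119.0"

print(is_blocked(crawler))
print(is_blocked(browser))
```

A crawler can send any User-Agent it likes, so this only stops bots that identify themselves honestly.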


Tomoyuki Urabe
 
Philippe Locquet
Philippe Locquet  Identity Verified
Portugal
Local time: 00:10
English to French
+ ...
TOPIC STARTER
Demo Nov 2, 2023

Zea_Mays wrote:
Can the AI crawlers be blocked individually when you know their name?

[Edited at 2023-11-02 17:54 GMT]

That was just a simple experiment to prove "outside access". But yes, I suppose doing the same with Google would work too.

How it's technically deployed is beyond my skills but this thread on Stack Overflow is well documented:
https://stackoverflow.com/questions/3161548/how-do-i-prevent-site-scraping

And this article explains why Reddit made a specific move to prevent its data from being scraped: https://siliconangle.com/2023/04/18/reddit-charge-access-api-counter-free-data-scraping-ai-companies/

So basically, after a while, AI chat bots could give answers based on information gathered from KudoZ too (unless ProZ already has something in place against this that we don't know about).


Tomoyuki Urabe
 
Zea_Mays
Zea_Mays  Identity Verified
Italy
Local time: 01:10
Member (2009)
English to German
+ ...
Future of AI chat bots? Nov 2, 2023

Here's an article on this (I actually had thought of the robots.txt file but it involves a lot of work)
https://www.makeuseof.com/block-ai-chatbot-scraping-website/
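
For anyone curious what the robots.txt approach looks like, here is a small Python sketch. The rules are a made-up example, not ProZ.com's actual robots.txt, and the URL and the helper name `allowed` are hypothetical. The check uses Python's standard `urllib.robotparser` module to show how a well-behaved crawler would read such rules.

```python
from urllib.robotparser import RobotFileParser

# Example rules: turn away two named AI crawlers, allow everyone else.
# Illustrative only; not taken from any real site.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""

# Hypothetical page address used for the check.
PAGE_URL = "https://www.example.com/kudoz/some-question.html"


def allowed(robots_txt: str, agent: str, url: str) -> bool:
    """Return True if the given robots.txt rules let `agent` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


print(allowed(ROBOTS_TXT, "GPTBot", PAGE_URL))
print(allowed(ROBOTS_TXT, "Googlebot", PAGE_URL))
```

Note that robots.txt is only a request: it keeps out crawlers that choose to honour it and does nothing against those that ignore it.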

There are lawsuits against AI companies around the world for copyright infringement, so the public availability of their bots may be restricted in the future.


Philippe Locquet
Maria Teresa Borges de Almeida
 
Philippe Locquet
Philippe Locquet  Identity Verified
Portugal
Local time: 00:10
English to French
+ ...
TOPIC STARTER
Protecting content Nov 2, 2023

Zea_Mays wrote:
There are lawsuits against AI companies around the world for copyright infringement, so the public availability of their bots may be restricted in the future.


Yes, that's a difficult bit. What kind of protection/ownership can apply to high-quality translations resulting from skilled human effort applied to a specific context? It seems hard to define... although as translators we can clearly see the skill, effort and expertise.


Zea_Mays
 
Peter Motte
Peter Motte  Identity Verified
Belgium
Local time: 01:10
Member (2009)
English to Dutch
+ ...
AI is possible because we fed it Nov 8, 2023

If people hadn't published so much text and other material on the internet, freely accessible to all, LLMs wouldn't have been possible, let alone AI for translation. If we are pushed out of the market because of AI, we did it to ourselves. It also shows that whatever generative AI comes up with always rests on some form of copyright infringement and infringement of authors' rights.

[Edited at 2023-11-08 10:50 GMT]



 
Peter Motte
Peter Motte  Identity Verified
Belgium
Local time: 01:10
Member (2009)
English to Dutch
+ ...
Restrictions will come Nov 8, 2023

Zea_Mays wrote:

Here's an article on this (I actually had thought of the robots.txt file but it involves a lot of work)
https://www.makeuseof.com/block-ai-chatbot-scraping-website/

There are lawsuits against AI companies around the world for copyright infringement, so the public availability of their bots may be restricted in the future.


It will be restricted, because lots of people are really afraid of it (although Musk's statement 'There will come a point when no job is needed' is rubbish), especially because the decision-making powers given to AI could get out of hand.
But creative workers will also start to defend themselves against AI using their work. I've already put the statement "Use by AI not allowed. Use for training purposes of AI not allowed." on my blogs, and on articles I publish on the internet. Maybe it won't work, but they can't say they haven't been warned.

For the record:
Use by AI of this answer not allowed. Use for training purposes of this answer by AI not allowed.


 
Lingua 5B
Lingua 5B  Identity Verified
Bosnia and Herzegovina
Local time: 01:10
Member (2009)
English to Croatian
+ ...
It will fail Nov 8, 2023

AI will be a typical example of this:

A product needs to have some substance, some “meat” on it and then marketing will push it forward.

A product with no substance and no “meat”, just bones, will eventually fail, despite $billions paid in cash for marketing and advocacy to promote it aggressively. It's simply an ultimate law of business. AI falls into this category. The only place I cannot find GPT promotion these days is inside my fridge, as I was not paid a $million in cash.

Great video, this guy articulates it much better than I could ever do: https://youtu.be/ro130m-f_yk?si=pRVlMQBWiNurn1UQ


 
Philippe Locquet
Philippe Locquet  Identity Verified
Portugal
Local time: 00:10
English to French
+ ...
TOPIC STARTER
AI fridge Nov 8, 2023

Lingua 5B wrote:

The only place I cannot find GPT promotion these days is inside my fridge, as I was not paid a $million in cash.



AI fridges are coming... https://150sec.com/smart-fridges-set-to-get-smarter-with-iot-ai-functionalities/13576/

If the fridge were able to make food out of thin air "Star Trek replicator style", would a three-Michelin-star chef file for copyright infringement? Could be. I much prefer a dumb fridge: by definition it's "cool", and it's not going to ruin my groceries over a firmware update. :)


Jorge Payan
 
Lingua 5B
Lingua 5B  Identity Verified
Bosnia and Herzegovina
Local time: 01:10
Member (2009)
English to Croatian
+ ...
Yeah Nov 8, 2023

Philippe Locquet wrote:

Lingua 5B wrote:

The only place I cannot find GPT promotion these days is inside my fridge, as I was not paid a $million in cash.



AI fridges are coming... https://150sec.com/smart-fridges-set-to-get-smarter-with-iot-ai-functionalities/13576/

If the fridge were able to make food out of thin air "Star Trek replicator style", would a three-Michelin-star chef file for copyright infringement? Could be. I much prefer a dumb fridge: by definition it's "cool", and it's not going to ruin my groceries over a firmware update. :)


Sure. What I wrote was a metaphor; it was about promotion and marketing, not technology. The point being that GPT promotion is jumping at me from every corner of the Internet, and beyond.


 
Philippe Locquet
Philippe Locquet  Identity Verified
Portugal
Local time: 00:10
English to French
+ ...
TOPIC STARTER
Joke Nov 8, 2023

Lingua 5B wrote:
Sure. What I wrote was a metaphor; it was about promotion and marketing, not technology. The point being that GPT promotion is jumping at me from every corner of the Internet, and beyond.


Sure, I was just joking about it, since it seems to be creeping in everywhere. That's what happens with hype...
Take care


 
Philippe Locquet
Philippe Locquet  Identity Verified
Portugal
Local time: 00:10
English to French
+ ...
TOPIC STARTER
@Henry author protection Nov 8, 2023

Peter Motte wrote:

There are lawsuits against AI companies around the world for copyright infringement, so the public availability of their bots may be restricted in the future.
...
For the record:
Use by AI of this answer not allowed. Use for training purposes of this answer by AI not allowed.


Just wondering what kind of legal framework could be put around this, since OpenAI is now offering legal protection/support to customers who face copyright claims over their use of its AI systems.
Maybe hosting the data in the EU could help (its laws are sometimes more protective). Could data from KudoZ, glossaries etc. be protected as a product of human effort? It's a broad question, but I think it makes sense to look ahead.

[Edited at 2023-11-08 16:09 GMT]


texjax DDS PhD
 
Daryo
Daryo
United Kingdom
Local time: 00:10
Serbian to English
+ ...
What did I miss? Nov 11, 2023

Which part of "(some information) being in the public domain" is so difficult to grasp?

If something has been made available on the part of the Internet accessible to everyone, there is no technical way to prevent anyone from accessing it. So, whether you like it or not, content generated by Kudoz **can** be part of the "raw input" used for training anyone's AI.

Whether any sane "AI trainer" would want to use Kudoz "as is/tel quel" as training input is another matter altogether. There are terabytes (if not petabytes) of publicly available translations that have been triple-checked down to the last comma before being published by various international organisations (UN, EU, all sorts of specialised agencies etc.). The additional contribution from a source that randomly and unpredictably mixes well-explained, perfectly good translations with occasional pure nonsense would be rather marginal.

You propose to turn Kudoz into some kind of "members-only club"? Not sure it would do any good - for anyone. Plus, if all the various sources used for answering Kudoz questions started using the same logic of "protecting our content", what would happen? Ever thought of that?

"Protecting content" is the wrong solution to a real problem. A very real problem - never mind the marketing BS piggybacking on the current AI hype, AI is here to stay - and to keep improving. The infighting between those who want to profit right now from whatever AI is currently available is only a side-show - the genie can't be put back in the bottle. In 1769 every coachman was laughing at Cugnot's "silly invention" ... 200 years later there weren't many coachmen around and they were certainly not laughing. I'm not sure the guy from this video https://www.youtube.com/watch?v=ro130m-f_yk will still be laughing in 20 years' time (or even 10).


 

