Writing these words here, I wonder how soon the first web crawler will arrive to pull them into a Large Language Model (LLM), and how soon after that they will be used to formulate new sentences, paragraphs and articles for anyone using ChatGPT or Bard. I find it amazing that words written now will shortly affect the output seen by some user of an AI tool. I have included several relevant Wikipedia links below to save writing up the background, should you want more detail, since this post examines a different perspective.
I suspect Bard will be the first to make use of my words written here.
Google owns Bard, and I expect it will give its own web crawler priority
when capturing sites that Google itself owns. More correctly, Alphabet owns
Google. This post is being written on Blogspot, which is owned by Google.
Google may hold back the web crawlers behind ChatGPT (OpenAI, open in name
but really backed by Microsoft) and Llama (Meta, the owner of Facebook) from
getting immediate access. The whole question of allowing web crawlers access
to your websites is now the subject of a new commercial and political battle,
most of it taking place behind the scenes. As I work in the writers' and
publishers' domain, it is a subject I have chosen to follow closely, since it
is my content they are making use of without my authorisation or, in some
cases, without my approval as the legal copyright holder.
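For readers who run their own site, access for crawlers is normally expressed in a robots.txt file, which well-behaved crawlers are expected to honour voluntarily. The sketch below is only an illustration, not a description of how Blogspot or any particular crawler behaves. It uses Python's standard urllib.robotparser module to check what a sample robots.txt would permit. The robots.txt text and the example.blogspot.com address are hypothetical, and GPTBot and Google-Extended are the crawler names I understand OpenAI and Google to have published for their AI data collection at the time of writing.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: refuse two AI-related crawler tokens,
# allow everything else (including ordinary search crawlers).
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: *
Allow: /
"""


def crawler_permissions(robots_txt, agents, url):
    """Return {agent: True/False} saying whether each agent may fetch url."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


# Hypothetical post address, used purely for illustration.
POST_URL = "https://example.blogspot.com/2024/01/post.html"

PERMISSIONS = crawler_permissions(
    ROBOTS_TXT, ["GPTBot", "Google-Extended", "Googlebot"], POST_URL
)

for agent, allowed in PERMISSIONS.items():
    print(f"{agent}: {'allowed' if allowed else 'blocked'}")
```

With these sample rules the two AI crawler names are refused while the ordinary Googlebot search crawler is still allowed. Bear in mind that a robots.txt file is only a request: it relies on the crawler's owner choosing to respect it.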
It should be noted that the market is being flooded with AI generation
tools, some with their own web crawlers and some sharing with others. They
are not just text generators but can produce a variety of digital outputs
covering images, audio and video. Look out for DALL-E for image creation. It
reminds me of when the World Wide Web first started, with everything changing
overnight and you never knowing which site to view. I have used ChatGPT and
achieved some impressive outcomes.
It should also be noted that most owners of web crawlers, although they
allow use of the AI tools running over their LLMs, do not allow more open
access to the underlying data. Meta (Facebook), however, appears to be taking
a different approach with its LLM called Llama 2, allowing more open access
to what its model has captured.
To conclude for now, read this article from The Times newspaper.
The article, published in The Times on 22/01/24 and titled “Google ‘traps’
publishers in AI battle”, is by Katie Prescott, their Technology Business
Editor. A copy follows below, with copyright duly acknowledged, and should
give you an insight into the current situation.
Copyright.
The Times Newspaper.
Google ‘traps’ publishers in AI battle
Katie Prescott - Technology Business Editor
The New York Times has sued OpenAI for alleged copyright infringement,
claiming the technology company used the newspaper’s information without permission.
Publishers are complaining that Google, because of its dominance in
internet search, has them “between a rock and a hard place” over the use of
their copyrighted output to power its artificial intelligence models.
This situation potentially gives Google an enormous advantage over its
rivals, they claim, as businesses fear that blocking its AI search “crawlers”
would mean they lose out on valuable traffic.
While most publishers, such as media organisations, have blocked
OpenAI’s web crawler, a bot that sucks in their content to feed ChatGPT with
information, they worry that barring Google’s equivalent, which supplies its
Bard chatbot, would disadvantage them in the long term when it comes to making
their information findable and accessible on traditional Google.
“We don’t want to do anything that results in a situation where we get
less traffic in a world where Google combines AI and search,” one said, “so
we’ve turned off the OpenAI crawler but we haven’t turned off the Google one.
They have us between a rock and a hard place.”
Towards the end of last year, Google said it would split its crawlers,
so that publishers could choose whether to have their information scraped, or
extracted, for its AI systems or merely its search engine. However, it has a
new iteration, called search generative experience, or SGE, which is a hybrid
of generative AI paragraphs and traditional search: this is what publishers
fear will erase them from results pages should they block Google’s crawlers.
Owen Meredith, chief executive of the News Media Association, a British
trade body, said: “Individual publishers inevitably will take a commercial view
on whether to opt out or not, based on their individual business model. The
challenge for many publishers will be the interdependency and gatekeeper role
of a small number of Big Tech platforms across every part of their business,
from discoverability to advertising to operating systems. Publishers may feel
exposed about how Big Tech could react if they decide to opt out.”
Google says it is very aware of the importance of generative AI
returning traffic to content-makers and argues that the new function will
present more possibilities for people searching for information.
“As we develop LLM [large language model]-powered features, we’ll
continue to prioritise experiences that send valuable traffic to the news
ecosystem,” a spokeswoman said. “Our intent is for search generative
experiences to serve as a jumping-off point for people to explore web content
and, in fact, we are showing more links with SGE in search and links to a wider
range of sources on the results page, creating new opportunities for content to
be discovered.”
Generative AI burst into the public consciousness with the launch of
ChatGPT in November 2022. Since then a handful of players, including OpenAI,
backed by Microsoft, Google and Meta, have dominated the market, boasting the
resources and computing power needed to build the large-language models that
underpin the engines that can create everything from text to images in a
human-like way.
Creative industries and the technology companies worldwide are clashing
heads over the rights to the content used to create AI. In Britain, the
Department for Science, Innovation and Technology is expected to make a ruling
shortly. In addition, there are test cases under way in the courts. In the
United States, The New York Times has sued OpenAI for alleged copyright
infringement, claiming the technology company used the newspaper’s information
to train its artificial intelligence models without permission or compensation.
In Britain, Getty Images is suing Stability AI in the High Court over
copyright, claiming the latter had “unlawfully” scraped millions of images from
its site.
Useful links.
Wikipedia Large Language Model
https://en.wikipedia.org/wiki/Large_language_model
Wikipedia Web Crawler
http://en.wikipedia.org/wiki/Web_crawler
Wikipedia ChatGPT
https://en.wikipedia.org/wiki/ChatGPT
Wikipedia Bard
https://en.wikipedia.org/wiki/Bard_(chatbot)
Meta Llama 2
Wikipedia DALL-E
https://en.wikipedia.org/wiki/DALL-E