We were promised Strong AI, but instead we got metadata analysis

We were promised Strong AI

The late nineties dream of search engines was that they would use grand-scale Artificial Intelligence to find everything, understand most of it and help us retrieve the best of it. Not much of that has really come true.

Google has always performed a wide crawl of the entire web. But few webmasters are so naive as to assume their pages will be found this way. Even this website, which has fewer than 20 pages, has had problems with Google finding all of them. Relying solely on the general crawl has proved unworkable for most.

Google introduced the Sitemap standard in 2005 to allow webmasters to eliminate the confusion by just providing a list of all their pages. Most websites now provide sitemap files instead of relying on the general crawl.

A sitemap file is, in short, a big XML file full of links to your site's pages. I think it says something that, even with this seemingly foolproof data interchange format, Google still have to provide tooling to help webmasters debug issues. That said, it's a huge improvement compared to trying to riddle out why their general crawl did or did not find certain pages, or found them multiple times.
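As a rough illustration, here is a minimal sketch (in Python, for a hypothetical example.com site with invented URLs and dates) of generating the kind of file the Sitemaps protocol describes: a urlset of url entries, each carrying a loc and optionally a lastmod.

    # A minimal sketch: build a sitemap.xml of the kind described above.
    # The element names (urlset, url, loc, lastmod) come from the Sitemaps
    # protocol; the URLs and dates are invented.
    import xml.etree.ElementTree as ET

    PAGES = [
        ("https://example.com/", "2023-03-01"),
        ("https://example.com/about", "2023-02-14"),
    ]

    urlset = ET.Element("urlset", xmlns="http://www.sitemaps.org/schemas/sitemap/0.9")
    for loc, lastmod in PAGES:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        ET.SubElement(url, "lastmod").text = lastmod

    ET.ElementTree(urlset).write("sitemap.xml", encoding="utf-8", xml_declaration=True)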

After a search engine finds a page, the next step is to read it and understand it. How well does this work in practice? Again, relatively few websites expect Google to manage this on their own. Instead they provide copious metadata to help Google understand what a page is about and how it sits relative to other pages.

Google at some point gave up trying to work out which of two similar pages is the original. Instead there is now a piece of metadata which you add to let Google know which page is the "canonical" version. This is so they know which one to put in the search results, for example, and don't wrongly divvy up one page's "link juice" into multiple buckets.
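To make that concrete, here is a minimal sketch, using Python's standard html.parser on an invented page, of pulling out the rel="canonical" hint that search engines read:

    # A minimal sketch: extract the rel="canonical" URL from a page's HTML.
    # The sample markup and URL are invented for illustration.
    from html.parser import HTMLParser

    class CanonicalFinder(HTMLParser):
        def __init__(self):
            super().__init__()
            self.canonical = None

        def handle_starttag(self, tag, attrs):
            attrs = dict(attrs)
            if tag == "link" and attrs.get("rel") == "canonical":
                self.canonical = attrs.get("href")

    finder = CanonicalFinder()
    finder.feed('<head><link rel="canonical" href="https://example.com/article"></head>')
    print(finder.canonical)  # https://example.com/article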

Google also gave up trying to divine who the author is. While Google+ was a goer, they tried to encourage webmasters to attach metadata referring to the author's Google+ profile. Now that Google+ has been abandoned, they instead read metadata from Facebook's OpenGraph specification, particularly for things other than the main set of Google search results (for example, in the news stories they show to Android users). For other data they parse JSON-LD metadata tags, "microformats" and probably much more.
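As an illustration of the JSON-LD side of this, here is a minimal sketch: the payload below is invented, but the @context, @type and author keys follow the schema.org vocabulary that pages typically embed in a script tag of type "application/ld+json".

    # A minimal sketch: parse an (invented) JSON-LD block of the kind a page
    # embeds to declare its author and publication date.
    import json

    JSON_LD = """
    {
      "@context": "https://schema.org",
      "@type": "Article",
      "headline": "We were promised Strong AI",
      "author": {"@type": "Person", "name": "Jane Doe"},
      "datePublished": "2021-11-29"
    }
    """

    article = json.loads(JSON_LD)
    print(article["author"]["name"])   # Jane Doe
    print(article["datePublished"])    # 2021-11-29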

Google doesn't just search web documents; they also have a product search, Google Shopping (originally "Froogle"). How do Google deduce the product data for an item from the product description page? This is, after all, a really hard AI problem. The answer is that they simply don't: they require sellers to provide that information in a structured format, ready for them to consume.
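A minimal sketch of what "ready to consume" can look like, assuming an invented product: a tab-separated feed whose column names follow Google's product data specification (id, title, link, price, availability).

    # A minimal sketch: write a one-product, tab-separated feed of the sort
    # sellers upload to Google Merchant Center. The column names follow
    # Google's product data specification; the values are invented.
    import csv

    FIELDS = ["id", "title", "link", "price", "availability"]
    PRODUCTS = [
        {
            "id": "sku-0001",
            "title": "Example widget",
            "link": "https://example.com/products/widget",
            "price": "15.00 USD",
            "availability": "in stock",
        },
    ]

    with open("products.tsv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS, delimiter="\t")
        writer.writeheader()
        writer.writerows(PRODUCTS)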

Google of course do do text analysis, as they have always done, but it's often forgotten that their original leg up over other search engines was not better natural language processing but a metadata trick: using backlinks as a proxy for notability. The process is detailed in the original academic paper and in the PageRank paper.

Backlink analysis was a huge step forward, but PageRank is not about understanding what is on the page; indeed, early on Google returned pages in the search results that it had not yet even downloaded. Instead PageRank judges the merit of a page based on what other pages link to it. That is, based on metadata.
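To sketch the idea (a simplified toy, not Google's actual implementation, over an invented three-page link graph): each page's score is repeatedly recomputed from the scores of the pages that link to it, damped by the usual factor d from the PageRank paper.

    # A minimal, simplified sketch of the PageRank idea: a page's score is a
    # weighted sum of the scores of the pages linking to it. The link graph
    # is invented; dangling pages and other refinements are ignored.
    def pagerank(links, d=0.85, iterations=50):
        """links maps each page to the list of pages it links out to."""
        pages = list(links)
        n = len(pages)
        rank = {page: 1.0 / n for page in pages}
        for _ in range(iterations):
            new_rank = {page: (1.0 - d) / n for page in pages}
            for page, outlinks in links.items():
                for target in outlinks:
                    new_rank[target] += d * rank[page] / len(outlinks)
            rank = new_rank
        return rank

    graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
    print(pagerank(graph))  # "c" scores highest: both other pages link to it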

And how well, after all this, does the Artificial Intelligence do at coming up with the relevant documents in response to search queries? Not so well that showing structured data lifted from Wikipedia's infoboxes on the right-hand side wasn't a major improvement. So many searches are now resolved by the "sidebar" and "zero click results" that traffic to Wikipedia has materially fallen.

The remaining search results themselves are increasingly troubled.
