We were promised Strong AI, but instead we got metadata analysis
- by 7wData
- May 2, 2021
The late nineties dream of search engines was that they would use grand-scale Artificial Intelligence to find everything, understand most of it and help us retrieve the best of it. Not much of that has really come true.
Google has always performed a wide crawl of the entire web. But few webmasters are so naive as to assume their pages will be found this way. Even this website, which has fewer than 20 pages, has had problems with Google finding all of them. Relying solely on the general crawl has proved unworkable for most.
Google introduced the Sitemap standard in 2005 to allow webmasters to eliminate the confusion by just providing a list of all their pages. Most websites now provide sitemap files instead of relying on the general crawl.
A sitemap file is, in short, a big XML file full of links to your site's pages. I think it says something that, even with this seemingly foolproof data interchange format, Google still have to provide tooling to help webmasters debug issues. That said, it's a huge improvement compared to trying to riddle out why their general crawl did or did not find certain pages. Or found them multiple times.
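For the curious, a minimal sitemap following the sitemaps.org protocol looks roughly like this; the URLs and dates are purely illustrative:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <!-- one <url> entry per page you want the crawler to know about -->
  <url>
    <loc>https://example.com/</loc>
    <lastmod>2021-04-30</lastmod>
  </url>
  <url>
    <loc>https://example.com/about/</loc>
    <lastmod>2021-03-12</lastmod>
  </url>
</urlset>
```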
After a search engine finds a page the next step is to read it and understand it. How well does this work in practice? Again, relatively few websites expect Google to manage this on their own. Instead they provide copious metadata to help Google understand what a page is about and how it sits relative to other pages.
At some point Google gave up trying to work out which of two similar pages is the original. Instead there is now a piece of metadata you add to tell Google which page is the "canonical" version. This is so they know which one to put in the search results, for example, and don't wrongly divvy up one page's "link juice" into multiple buckets.
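The canonical hint is just a link element in the page's head; the URL here is illustrative:

```html
<!-- tells search engines which URL is the "original" version of this content -->
<link rel="canonical" href="https://example.com/blog/strong-ai-metadata/" />
```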
Google also gave up trying to divine who the author is. While Google+ was still a going concern, they tried to encourage webmasters to attach metadata referring to the author's Google+ profile. Now that Google+ has been abandoned they instead read metadata from Facebook's Open Graph specification, particularly for things other than the main set of Google search results (for example in the news stories they show to Android users). For other data they parse JSON-LD metadata tags, "microformats" and probably much more.
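As a sketch of what that looks like in practice (the property values and the author name below are made up), Open Graph tags and a schema.org JSON-LD block sit in the page's head along these lines:

```html
<!-- Facebook Open Graph properties, also read by other consumers -->
<meta property="og:title" content="We were promised Strong AI" />
<meta property="og:type" content="article" />
<meta property="og:url" content="https://example.com/blog/strong-ai-metadata/" />

<!-- schema.org metadata expressed as JSON-LD, including authorship -->
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "Article",
  "headline": "We were promised Strong AI, but instead we got metadata analysis",
  "author": { "@type": "Person", "name": "Jane Author" },
  "datePublished": "2021-05-02"
}
</script>
```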
Google doesn't just search web documents; they also have a product search, Google Shopping (originally "Froogle"). How do Google deduce the product data for an item from the product description page? This is, after all, a really hard AI problem. The answer is that they simply don't: they require sellers to provide that information in a structured format, ready for them to consume.
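One common form of that structured product data is schema.org Product markup embedded as JSON-LD, alongside the feeds sellers upload to Merchant Center; the product and price below are invented for illustration:

```html
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "Product",
  "name": "Example Widget",
  "description": "A widget, described by the seller rather than divined by AI.",
  "offers": {
    "@type": "Offer",
    "price": "19.99",
    "priceCurrency": "USD",
    "availability": "https://schema.org/InStock"
  }
}
</script>
```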
Google of course do do text analysis, as they have always done, but it's often forgotten that their original leg up over other search engines was not better natural language processing but a metadata trick: using backlinks as a proxy for notability. The process is detailed in the original academic paper and in the PageRank paper.
Backlink analysis was a huge step forward, but PageRank is not about understanding what is on the page and indeed early on Google returned pages in the search results that it had not yet even downloaded. Instead PageRank judges the merit of a page based on what other pages link to it. That is, based on metadata.
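The formulation in the original paper makes the point plainly: a page's score is computed entirely from the scores of the pages linking to it, never from its own content. With d the damping factor (the paper suggests 0.85), T_1 … T_n the pages linking to A, and C(T) the number of outbound links on T:

$$PR(A) = (1 - d) + d \left( \frac{PR(T_1)}{C(T_1)} + \cdots + \frac{PR(T_n)}{C(T_n)} \right)$$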
And how well, after all this, does the Artificial Intelligence do at coming up with relevant documents in response to search queries? Not so well that showing structured data lifted from Wikipedia's infoboxes on the right-hand side wasn't a major improvement. So many searches are now resolved by the "sidebar" and "zero-click results" that traffic to Wikipedia has materially fallen.
The remaining search results themselves are increasingly troubled.