Latent Semantic Indexing & Onsite SEO

Once upon a time (4-5 years ago), on-page SEO involved forcing your primary keyword into the title, headings and body text of the page you were trying to rank, not too dissimilar to the way Yoast SEO and most bloggers guide users nowadays. If you wanted to rank highly for a specific term, you aimed for 5-10% keyword density and made sure you put your term in the following spots in your HTML…

<b> or <strong>
<i> or <em>
<img alt="" />
<meta name="description" content="">
<meta name="keywords" content="">

That was pretty much it. I’m not looking to completely discount this methodology, as it does still work to a degree. However, it will only get you so far before you over-optimise your site and a black and white coloured animal causes you to have a very bad day (Penguin/Panda). Even though things have moved on quite a bit, I still maintain that SEO is nothing more than a labelling exercise. The game is unchanged; however, the rules have evolved and become considerably more complex.

Gone are the days where you could cram your primary keyword into your text and magically rank. Google nowadays is more focused on trying to algorithmically understand the theme (classifiers/judges) of your site by analysing the language used on the page through latent semantic indexing, latent semantic analysis and site structure (SILO’ing).

Patent History

Back in the day, Yahoo ruled the roost. Responsibility for their PPC division was given over to a company called Overture, who in turn used technology licensed from another third party (Applied Semantics) to suggest semantic variants of keywords for Overture clients to bid upon. Keen to get in on the action, Google acquired Applied Semantics in 2003 with the specific intention of utilising their technology/patents to further enhance their paid advertising offerings and organic search using techniques such as Latent Semantic Indexing. Yahoo and Overture, in the meantime, were kicked to the kerb and rapidly lost their market share.

What is Latent Semantic Indexing:

Latent semantic indexing (LSI) is an indexing and retrieval method that uses a mathematical technique called singular value decomposition (SVD) to identify patterns in the relationships between the terms and concepts contained in an unstructured collection of text. –Wikipedia

Put simply, latent semantic indexing allows the algorithm to differentiate the meaning of language used within a document. –Vin D’Eletto,

When determining LSI keywords, you need to think of terms related to a main keyword that will allow search engines to understand the context of an article.

So, in the case of a primary keyword in a document, e.g. “Apple,” LSI keywords would be “taste”, “fruit”, “flavour”, “health food”, “keep the doctor away,” etc.

If, however, language such as iPad, iPod and Mac was placed in a document in proximity to the primary keyword “Apple,” an LSI algorithm would be able to determine that the word “Apple” referred to the company, not the fruit.
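The Apple example can be sketched in a few lines of Python. This is a toy illustration only, assuming NumPy is installed: the four "documents", the labels and the helper names (term_vector, to_latent, closest_label) are all invented for the demo, and nothing here claims to be how Google actually does it. It builds a term-document matrix, runs singular value decomposition, keeps the two strongest latent dimensions and then checks which group of documents a short phrase lands nearest to.

```python
import numpy as np

# Toy corpus, invented for illustration: two "fruit" and two "company" documents.
docs = [
    "apple fruit taste flavour",
    "apple fruit health food",
    "apple ipad mac",
    "apple ipod mac ipad",
]
labels = ["fruit", "fruit", "company", "company"]

vocab = sorted({word for doc in docs for word in doc.split()})
index = {word: i for i, word in enumerate(vocab)}


def term_vector(text):
    """Count how often each vocabulary term occurs in the text."""
    vec = np.zeros(len(vocab))
    for word in text.lower().split():
        if word in index:
            vec[index[word]] += 1
    return vec


# Term-document matrix: one row per term, one column per document.
A = np.column_stack([term_vector(doc) for doc in docs])

# Singular value decomposition, truncated to the k strongest latent dimensions.
U, S, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
U_k = U[:, :k]


def to_latent(text):
    """Project a bag-of-words vector onto the k latent dimensions."""
    return term_vector(text) @ U_k


doc_coords = [to_latent(doc) for doc in docs]


def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def closest_label(text):
    """Label of the document that sits nearest to the text in latent space."""
    q = to_latent(text)
    sims = [cosine(q, d) for d in doc_coords]
    return labels[int(np.argmax(sims))]


print(closest_label("apple ipad"))
print(closest_label("apple fruit"))
```

Both phrases contain "apple", yet the surrounding word is enough to pull each one towards a different cluster, which is the whole point of building context around a keyword.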

The logical assumption is that seeding a document with LSI keywords allows the search engine to establish its theme/intent without the risk of over-optimisation. To summarise, LSI is really just about building a proper context around a given topic.

One of the biggest misconceptions is that LSI keywords are synonyms. This is not 100% correct: synonyms alone do not set the proper context for a document, so related keywords, entities and concepts should be present within the text.

Enter The Hummingbird

Google’s introduction of the Hummingbird Algorithm around August 30, 2013 was the culmination of the assorted semantic search building blocks falling into place and, according to Google search chief Amit Singhal, “the first major update of its type since 2001.”

Hummingbird considers each word, but also how the words together make up the entirety of the query: the whole sentence, conversation or meaning is taken into account, rather than particular words. The goal is that pages matching the meaning do better than pages matching just a few words.

“Simply put, it’s not just about keywords nowadays but since the Hummingbird update Google is focusing more on meaningful signals (semantics).” – Simon Eade, Eadetech Web Design

In addition to indexation, the algorithm made new strides in conversational search, leveraging natural language processing to improve the way search queries are parsed and accommodating changes in the means by which users interact with search engines through the increased adoption of mobile devices and voice commands / digital PAs.

Welcome to The New Era of SEO Copy Writing:

I will be the first to admit that “Determining context and relationships between terms / phrases and collections of the latter” is not the easiest of concepts to get your head around. Furthermore, finding worthwhile information on the subject matter is hampered by an overall lack of understanding within the SEO industry and the usual snake oil salesmen latching onto LSI as a buzz phrase/attempting to sell the uninformed masses so-called LSI keyword tools that don’t work as advertised.

“It’s acutely important to be aware of LSI and the importance of writing natural free flowing copy for the web. Spending time and effort to create the correct content will pay dividends in the push for higher rankings” – Derek Jurovich, Avitus Group.

Welcome to the new era of SEO Writing.  One of singular value decomposition and probabilistic models.  Where each document is a probability distribution over topics, each topic is a probability distribution over words and you as a writer start researching abstract concepts such as “n-grams” in an effort to stay ahead of the game.
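To make the "distribution over topics, distribution over words" idea a little less abstract, here is a minimal sketch. Every number and name in it (topic_words, doc_topics, word_probability) is made up for illustration; a real topic model would learn these values from a large corpus. It simply shows how the chance of seeing a word in a document falls out of mixing the two sets of probabilities.

```python
# Hypothetical numbers, chosen only for illustration.
# Each topic is a probability distribution over words.
topic_words = {
    "fruit": {"apple": 0.4, "taste": 0.3, "flavour": 0.3},
    "company": {"apple": 0.4, "ipad": 0.3, "mac": 0.3},
}

# Each document is a probability distribution over topics.
doc_topics = {"fruit": 0.2, "company": 0.8}


def word_probability(word, doc_topics, topic_words):
    """P(word | document) = sum over topics of P(topic | doc) * P(word | topic)."""
    return sum(
        weight * topic_words[topic].get(word, 0.0)
        for topic, weight in doc_topics.items()
    )


for word in ("apple", "ipad", "taste"):
    print(word, round(word_probability(word, doc_topics, topic_words), 2))
```

"Apple" is equally likely under both topics, so it tells the model nothing on its own; it is words like "ipad" and "taste" that shift the balance one way or the other.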

Don’t Panic:

The algorithm is not an AI; it only understands and processes numbers through a data matrix (singular value decomposition), not semantics. If you want to rank for a competitive term, you’re still going to resort to keyword stuffing.

As I said earlier, the game has not changed: on-site SEO is nothing more than a labelling exercise. The difference is that you are going to be focused on placing as many LSI keywords as possible to ensure Google classifies your site within an appropriate taxonomy. There may be additional quality checks, e.g. spelling and reading level, if the so-called phantom update a few months ago did what many claim; however, the premise of what we are doing is no different to the methodology employed 10 years ago to rank a website for a desired keyword.

Finding LSI Keywords:

Most so-called LSI keyword tools function by scraping Google’s related searches or autocomplete database, prefixing or suffixing your base seed word with characters from the alphabet. Whilst this technique has its merits, it will not yield LSI keywords, only variants and inflections.
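For the curious, the alphabet trick these tools rely on looks roughly like the sketch below. The function name alphabet_variants is my own invention, and the actual autocomplete lookup is deliberately left out; this only generates the list of seed queries such a tool would go on to submit.

```python
import string


def alphabet_variants(seed):
    """Build the seed queries a typical scraper would feed to autocomplete."""
    seed = seed.strip()
    # A letter placed before the seed word, e.g. "a apple".
    prefixed = [f"{letter} {seed}" for letter in string.ascii_lowercase]
    # A letter placed after the seed word, e.g. "apple a".
    suffixed = [f"{seed} {letter}" for letter in string.ascii_lowercase]
    return prefixed + suffixed


variants = alphabet_variants("apple")
print(len(variants))
print(variants[:3], variants[26:29])
```

Every query still contains the seed word, which is why the suggestions that come back tend to be variants of the same phrase rather than genuinely related concepts.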

Performing an organic search will occasionally yield some LSI keywords in the related searches output. The easiest and most obvious technique for finding LSI keywords is to use Google’s own PPC keyword research tool. Simply place a short-tail variant of a keyword into the tool and then add that same keyword to the negative filter. This will leave an extensive list of related terms which can then be worked into your website copy.