The problem with grand visions of the Semantic Web was that they all assumed a top-down structure. One wickedly clever set of rules to wrangle every fact. A global ontology.
It didn’t make sense. Global ontologies are like Soviet Central Planning. Rules are meant to be broken. And top-down systems are crashing and burning everywhere you look.

The search giant (Google) has constructed a bottom-up directory of meaning. The company calls the product “the knowledge graph” and the service Semantic Search.

Dan Conover, Google gives birth to the bottom-up Semantic Economy, April 16, 2012

While we’re not sure how Dan’s vision will actually work at the end of the day, we nevertheless agree that a universal, top-down, global ontology won’t be the base the semantic web is built upon. Instead it will be based on multiple, bottom-up, decentralized and highly flexible ontologies, some of them specific to just one user, others managed and maintained by groups of users sharing a common interest.

When Google goes to semantic search, it won’t be about keywords so much as about the meaning of the words you use. This might be the biggest SEO killer of all. If tuning our content for the keywords our users care about is no longer an effective strategy, what is left for SEOs?
At the moment Deep Dive is limited to topic tags, which are mostly broad terms like “Middle East and North Africa Unrest (2010- )” and “Demonstrations, Protests, and Riots.” That means that Yemen story connects with stories on protests in Egypt and Syria, not more stories about what’s going on in Yemen. Erwin said they hope in the future the system could incorporate other factors to make connections through semantic data, editorial data, or time elements. The Times’ metadata is likely the richest of any news organization.

Justin Ellis, Meet Deep Dive, the New York Times’ experimental context engine and story explorer, Jan. 23, 2012

Unless the NYT can resolve this problem the service will be - sorry for that - pretty useless, though the idea of generating a timeline around an article is interesting. As we described earlier, without a proper ontology and a full-text semantic analysis, surfacing additional content will not deliver results relevant enough for the individual reader.

The Basics (3): Content Analysis

After content has been acquired from a source it needs to be analyzed. Ideally the content comes directly from a content management system, including its associated metadata, and is thus free of any user interface components, advertising etc. Unfortunately this is most often not the case.

Depending on system capabilities and accessibility, private content sources normally deliver an entire data set consisting of raw data plus associated metadata (author, access rights, usage rights, source, references, tags …). In contrast, public content sources generally deliver data in unstructured form: metadata is either mixed into the content or missing altogether.
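To make the contrast concrete, here is a minimal Python sketch of the kind of record a CMS-backed private source might deliver. The field names are our own illustration, not any actual schema:

    # Illustrative record for a document delivered by a private,
    # CMS-backed source. Field names are assumptions for this sketch.
    from dataclasses import dataclass, field
    from typing import List, Optional

    @dataclass
    class SourceDocument:
        raw_content: str                      # article text or markup
        author: Optional[str] = None
        source: Optional[str] = None
        access_rights: Optional[str] = None
        usage_rights: Optional[str] = None
        references: List[str] = field(default_factory=list)
        tags: List[str] = field(default_factory=list)

    # A public web page, by contrast, typically arrives as nothing but
    # raw_content, with all the metadata fields above left empty.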

So in a first step, non-content data needs to be detected and eliminated. Then content elements need to be identified. Finally, the remaining content needs to be tagged, categorized and indexed.

Relevancer’s Analysis Engine performs these tasks in four steps:

Content Purification

In a first step, non-content data such as ads, user interface controls and linked content is identified and eliminated using intelligent content recognition algorithms.
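The actual recognition algorithms are beyond the scope of this post, but a crude Python sketch shows the idea, assuming HTML input and using BeautifulSoup. The tag list and class-name hints are our assumptions; a production system would be considerably smarter:

    # Minimal boilerplate-stripping sketch for HTML input.
    from bs4 import BeautifulSoup  # pip install beautifulsoup4

    NON_CONTENT_TAGS = ["script", "style", "nav", "aside", "footer", "form", "iframe"]
    AD_HINTS = ("advert", "banner", "share", "comment", "widget")

    def purify(html: str) -> str:
        soup = BeautifulSoup(html, "html.parser")
        # Detach obvious user-interface and script containers.
        for tag in soup(NON_CONTENT_TAGS):
            tag.extract()
        # Detach elements whose class names hint at ads or widgets
        # (a deliberately crude stand-in for content recognition).
        for tag in soup.find_all(attrs={"class": True}):
            classes = " ".join(tag.get("class", []))
            if any(hint in classes for hint in AD_HINTS):
                tag.extract()
        return soup.get_text(separator="\n", strip=True)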

Content Element Identification

Content elements such as headline, subhead, byline, lead, body, pictures and videos are detected, and their source, location and publishing date are registered. Finally the purified content is stored in a database together with the related metadata.
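Again, a small Python sketch illustrates this step, assuming the purified page still carries semantic HTML markup. The selectors and the Article record are illustrative, not the engine’s actual logic:

    # Minimal element-identification sketch for semantic HTML.
    from dataclasses import dataclass
    from typing import Optional
    from bs4 import BeautifulSoup

    @dataclass
    class Article:
        headline: Optional[str]
        byline: Optional[str]
        body: str
        source: str
        published: Optional[str]

    def identify_elements(html: str, source_url: str) -> Article:
        soup = BeautifulSoup(html, "html.parser")
        h1 = soup.find("h1")                      # headline, by convention
        byline = soup.find(attrs={"class": "byline"})
        time_tag = soup.find("time")              # publishing date
        body = "\n".join(p.get_text(strip=True) for p in soup.find_all("p"))
        return Article(
            headline=h1.get_text(strip=True) if h1 else None,
            byline=byline.get_text(strip=True) if byline else None,
            body=body,
            source=source_url,
            published=time_tag.get("datetime") if time_tag else None,
        )

The resulting record is what would be persisted to the database together with the metadata already collected.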

Semantic Analysis

Next, stop words are eliminated based on stop word lists and ontologies. The remaining content is then analyzed using semantic technologies: based on their relations, their positions in sentences and statistical analysis, correlations between terms are detected and represented in mathematical form.
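One simple way to put such correlations into mathematical form is a term co-occurrence count over a sliding window. The following Python sketch is our illustration of the principle, not the engine’s actual model, and the stop word list is a toy:

    # Stop-word removal plus windowed term co-occurrence counting.
    import re
    from collections import defaultdict

    STOP_WORDS = {"the", "a", "an", "and", "or", "of", "in", "on", "to", "is"}

    def cooccurrence(text: str, window: int = 5) -> dict:
        tokens = [t for t in re.findall(r"[a-z]+", text.lower())
                  if t not in STOP_WORDS]
        counts = defaultdict(int)
        for i in range(len(tokens)):
            # Count each pair of distinct terms appearing close together.
            for j in range(i + 1, min(i + window, len(tokens))):
                if tokens[i] != tokens[j]:
                    counts[tuple(sorted((tokens[i], tokens[j])))] += 1
        return counts

High counts then indicate strongly correlated terms; a real engine would normalize these statistics and combine them with sentence-position and ontology information.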

Tagging, Categorizing and Indexing

Using the results of the semantic analysis, tags are defined, the content is linked to categories, and everything is indexed for high-performance search.
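A minimal Python sketch of this final step, with all names our own: the most frequent terms left after the semantic analysis serve as simple tags, and an inverted index maps each term to the documents containing it for fast lookup:

    # Simple tagging plus an inverted index for term lookup.
    from collections import Counter, defaultdict

    def top_tags(tokens: list, n: int = 5) -> list:
        # The most frequent remaining terms serve as simple tags.
        return [term for term, _ in Counter(tokens).most_common(n)]

    class InvertedIndex:
        def __init__(self):
            self.postings = defaultdict(set)    # term -> set of doc ids

        def add(self, doc_id: str, tokens: list) -> None:
            for term in set(tokens):
                self.postings[term].add(doc_id)

        def search(self, *terms: str) -> set:
            # Return documents containing every query term.
            sets = [self.postings.get(t, set()) for t in terms]
            return set.intersection(*sets) if sets else set()

    # Usage: index two documents, then query for both terms at once.
    idx = InvertedIndex()
    idx.add("doc1", ["semantic", "search", "google"])
    idx.add("doc2", ["semantic", "ontology"])
    print(idx.search("semantic", "search"))     # {'doc1'}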

Relevancer’s Analysis Engine performs semantic analysis in 21 languages and processes up to 500,000 documents per CPU. The generated index delivers semantic search results in less than one second, even for very large data sets.