The Basics (3): Content Analysis

After content has been acquired from a source, it needs to be analyzed. Ideally, the content comes directly from a content management system together with its associated metadata and is therefore free of user interface components, advertising and the like. Unfortunately, this is most often not the case.

Depending on system capabilities and accessibility, private content sources normally deliver a complete data set consisting of raw data plus associated metadata (author, access rights, usage rights, source, references, tags …). In contrast, public content sources generally deliver data in unstructured form: metadata is either mixed into the content or missing altogether.

So, in a first step, data that is not part of the content needs to be detected and eliminated. Next, the content elements need to be identified. Finally, the remaining content needs to be tagged, categorized and indexed.

Relevancer’s Analysis Engine performs these tasks in four steps:

Content Purification

In a first step, non-content data such as ads, user interface controls and linked content is identified and eliminated by intelligent content recognition algorithms.
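The purification step can be illustrated with a minimal sketch. It is not Relevancer's algorithm: it assumes the input is HTML and simply drops everything inside a hypothetical list of non-content tags (`NON_CONTENT_TAGS`), using only the Python standard library. The names `Purifier` and `purify` are invented for this example.

```python
from html.parser import HTMLParser

# Hypothetical set of tags treated as non-content (assumption for this sketch);
# a real system would rely on much richer signals than tag names.
NON_CONTENT_TAGS = {"script", "style", "nav", "aside", "footer", "form", "button"}


class Purifier(HTMLParser):
    """Collects text that is not nested inside a non-content tag."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # > 0 while inside at least one non-content tag
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in NON_CONTENT_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in NON_CONTENT_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if self.skip_depth == 0 and text:
            self.chunks.append(text)


def purify(html):
    """Return the text of `html` with non-content regions removed."""
    parser = Purifier()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)
```

Calling `purify` on a page that contains a navigation bar, an article and a script block would return only the article text, one text fragment per line.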

Content Element Identification

Content elements such as headline, subhead, byline, lead, body, pictures and videos are detected, and the source, location and publishing date are registered. Finally, the purified content is stored in a database together with the related metadata.
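As a rough illustration of element identification and storage, the sketch below maps a few HTML tags to content elements and writes the result to an SQLite database. The tag mapping (`TAG_TO_ELEMENT`), the table layout and the function names are assumptions made for this example; the actual engine presumably detects elements by other means and covers more of them (byline, lead, pictures, videos).

```python
import sqlite3
from html.parser import HTMLParser

# Hypothetical mapping from HTML tags to content elements (assumption).
TAG_TO_ELEMENT = {"h1": "headline", "h2": "subhead", "p": "body"}


class ElementExtractor(HTMLParser):
    """Sorts text into content elements according to the enclosing tag."""

    def __init__(self):
        super().__init__()
        self.current = None  # element currently being read, if any
        self.found = {name: [] for name in TAG_TO_ELEMENT.values()}

    def handle_starttag(self, tag, attrs):
        if tag in TAG_TO_ELEMENT:
            self.current = TAG_TO_ELEMENT[tag]

    def handle_endtag(self, tag):
        if tag in TAG_TO_ELEMENT:
            self.current = None

    def handle_data(self, data):
        text = data.strip()
        if self.current and text:
            self.found[self.current].append(text)


def extract_elements(html):
    """Return a dict with one joined string per content element."""
    parser = ElementExtractor()
    parser.feed(html)
    parser.close()
    return {name: " ".join(parts) for name, parts in parser.found.items()}


def store(conn, source, published, elements):
    """Store the elements with their metadata (illustrative schema)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS documents "
        "(source TEXT, published TEXT, headline TEXT, subhead TEXT, body TEXT)"
    )
    conn.execute(
        "INSERT INTO documents VALUES (?, ?, ?, ?, ?)",
        (source, published, elements["headline"], elements["subhead"], elements["body"]),
    )
    conn.commit()
```

With `conn = sqlite3.connect(":memory:")`, the two functions can be chained: `store(conn, source, published, extract_elements(html))`.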

Semantic Analysis

Next, stop words are eliminated using stop word lists and ontologies. The remaining content is then analyzed using semantic technologies: correlations between terms are detected from their relations, their positions in sentences and statistical analysis, and are represented in mathematical form.
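One simple way to picture "correlations between terms in mathematical form" is a sentence-level co-occurrence count. The sketch below uses that technique as a stand-in; it is not the engine's actual semantic analysis, the stop word list is a tiny illustrative one, and the function names are invented for this example.

```python
import re
from collections import Counter
from itertools import combinations

# Tiny illustrative stop word list (assumption); real lists are far longer.
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "and", "in", "on", "to"}


def tokenize(sentence):
    """Lower-case the sentence and drop stop words."""
    words = re.findall(r"[a-z]+", sentence.lower())
    return [w for w in words if w not in STOP_WORDS]


def cooccurrence(text):
    """Count how often each pair of terms appears in the same sentence."""
    counts = Counter()
    for sentence in re.split(r"[.!?]+", text):
        terms = sorted(set(tokenize(sentence)))
        # Each unordered pair is counted once per sentence.
        counts.update(combinations(terms, 2))
    return counts
```

The resulting counter acts as a sparse symmetric matrix: pairs with high counts are terms that tend to occur together, which a later step could normalize or feed into a statistical model.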

Tagging, Categorizing and Indexing

Using the results of the semantic analysis, tags are defined, and the content is linked to categories and indexed for high-performance search.
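Tagging and indexing can be sketched with two very basic techniques: picking the most frequent non-stop-word terms as tags, and building an inverted index that maps each term to the documents containing it. Both are simplifications chosen for illustration, not the engine's method, and all names below are hypothetical.

```python
import re
from collections import Counter, defaultdict


def terms(text):
    """Split text into lower-case word terms."""
    return re.findall(r"[a-z]+", text.lower())


def top_tags(text, stop_words, n=3):
    """Use the n most frequent non-stop-word terms as tags."""
    counts = Counter(t for t in terms(text) if t not in stop_words)
    return [term for term, _ in counts.most_common(n)]


def build_index(docs):
    """Build an inverted index: term -> set of document ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in terms(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every query term."""
    postings = [index.get(term, set()) for term in terms(query)]
    return set.intersection(*postings) if postings else set()
```

Because a lookup only touches the posting sets of the query terms rather than scanning every document, this kind of structure is what makes fast search over large collections feasible.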

Relevancer’s Analysis Engine performs semantic analysis in 21 languages and processes up to 500,000 documents per CPU. The generated index delivers semantic search results in less than one second, even for very large data sets.