Metadata: the new challenges for wiki search engines

Daniel Mietchen, The word Metadata in Wikidata Morse code, CC0, via Wikimedia Commons

Everything used to be simple: database queries for structured data and full-text search functions were conceived separately. But that is now history. In new search engines, metadata can be searched for much more selectively. These new possibilities blur the distinction between a database query and a search function. What does this mean for the technological development of wikis?

Combined queries

In my last blog entry, I wrote that the search engine in wikis indexes all searchable content as full text. Metadata such as categories or authors is captured too and is thus searchable. With semantics, the amount of metadata that can be recorded increases dramatically.

The indexing of semantic information is, however, not a semantic search function. When we speak about semantic search in the context of Semantic MediaWiki, we generally mean that combined queries are possible. An example of a semantic search understood in this way would be: “Give me all mayors of New York after 1971”. Using a classical keyword search, one can only search for “mayor”, “New York” or “mayor of New York”.
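To make the difference tangible, here is a minimal sketch in Python of what such a combined query does. It is not how Semantic MediaWiki is implemented. The record layout, the property names (office, city, term_start) and the sample pages are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Page:
    """A wiki page with a few metadata properties (hypothetical layout)."""
    title: str
    office: str
    city: str
    term_start: int


# Placeholder records, not real data.
PAGES = [
    Page("Person A", "mayor", "New York", 1966),
    Page("Person B", "mayor", "New York", 1974),
    Page("Person C", "mayor", "Chicago", 1989),
    Page("Person D", "senator", "New York", 1990),
]


def combined_query(pages, office, city, after_year):
    """Return titles of pages that satisfy all three conditions at once."""
    return [
        page.title
        for page in pages
        if page.office == office
        and page.city == city
        and page.term_start > after_year
    ]


result = combined_query(PAGES, "mayor", "New York", 1971)
print(result)
```

A keyword search for “mayor of New York” would match on the words alone. The combined query additionally compares the year as a number, which is only possible because the year is stored as metadata.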

Real Semantics and reasoning

Strictly speaking, Semantic MediaWiki is a relatively simple way of enriching content with metadata. Real semantics only comes into play when a more formal system of “reasoning” is operated on top, i.e. one that follows relationships across several degrees of separation.
An example of reasoning is: “I am the father of David and Peter is my father. What is the relationship between Peter and David?” Real semantics could make such connections over several degrees. One can also express such connections in the SPARQL query language. A keyword search function cannot do this.
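The father example can be sketched in a few lines of Python. This is a toy illustration of following a relationship over several degrees, not a reasoner and not SPARQL. The name "Narrator" stands in for the “I” of the example, and the function and table names are my own.

```python
# Stored facts: each person points to their father.
# "Narrator" is a stand-in for the "I" in the example.
FATHER = {"David": "Narrator", "Narrator": "Peter"}

# Names for the derived relationship, by number of father links.
RELATION_BY_DEGREE = {1: "father", 2: "grandfather", 3: "great-grandfather"}


def degrees_between(ancestor, descendant, father):
    """Count the father links from descendant up to ancestor, or return None."""
    hops = 0
    current = descendant
    seen = set()
    while current in father and current not in seen:
        seen.add(current)  # guard against accidental cycles in the data
        current = father[current]
        hops += 1
        if current == ancestor:
            return hops
    return None


degree = degrees_between("Peter", "David", FATHER)
print(RELATION_BY_DEGREE.get(degree))
```

Nowhere is it stored that Peter is David's grandfather. The answer is derived by chaining two stored facts, and that chaining is exactly what a keyword search lacks.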

Even the widely used search engine Elasticsearch is not able to do this. But at the first level, as with the question about the mayors of New York after 1971, it works well with filters. Asking about the mayors of the five largest cities would not work in a single filtered query, because it requires more intelligence than can be organised with filters. In this concrete example, one needs to run a search within a search.
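As a rough sketch, the two cases could look like this as Elasticsearch request bodies, written here as Python dictionaries. The bool, filter, term, terms, range, sort and size keys belong to the Elasticsearch query DSL. The field names (office, city, term_start, population, page_type), the helper functions and the placeholder city names are assumptions of mine, since they depend entirely on how a wiki's index is mapped.

```python
def mayors_after(city, year):
    """First level: every filter value is known up front."""
    return {
        "query": {
            "bool": {
                "filter": [
                    {"term": {"office": "mayor"}},
                    {"term": {"city": city}},
                    {"range": {"term_start": {"gt": year}}},
                ]
            }
        }
    }


def largest_cities(count):
    """Inner search: the `count` city pages with the highest population."""
    return {
        "size": count,
        "sort": [{"population": {"order": "desc"}}],
        "query": {"bool": {"filter": [{"term": {"page_type": "city"}}]}},
    }


def mayors_of(cities):
    """Outer search: its filter values come from the inner search."""
    return {
        "query": {
            "bool": {
                "filter": [
                    {"term": {"office": "mayor"}},
                    {"terms": {"city": list(cities)}},
                ]
            }
        }
    }


single = mayors_after("New York", 1971)
inner = largest_cities(5)
# Placeholder for the titles the inner request would return.
inner_hits = ["City A", "City B", "City C", "City D", "City E"]
outer = mayors_of(inner_hits)
```

The first body is self-contained. The last one can only be built once the answer to the inner request is known, so the application has to send two requests and pass the result from one to the other.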

Metadata in Wikipedia – a Herculean task

We can see that the possibilities of a search engine are bounded by a certain level of complexity. We are, however, about to take a great step forward in the development of wiki-implemented search engines. Let us take the example of Wikipedia.

A good search function allows queries like “give me all famous women born in the 1980s”. In Wikipedia, this kind of metadata is held in categories. The combination of categories like “famous woman” and “born between 1980 and 1989” can be created by a search function in real time.
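Combining categories in this way boils down to set operations, as the following Python sketch shows. The category names and page titles are placeholders, and the assumption that birth years sit in one category per year is mine for the sake of the example.

```python
# Hypothetical category index: category name -> set of page titles.
CATEGORY_MEMBERS = {
    "Women": {"Page A", "Page B", "Page D"},
    "1979 births": {"Page D"},
    "1983 births": {"Page A", "Page C"},
    "1988 births": {"Page B"},
}


def born_between(members, first_year, last_year):
    """Union of all per-year birth categories in the given range."""
    pages = set()
    for year in range(first_year, last_year + 1):
        pages |= members.get(f"{year} births", set())
    return pages


def in_category_and_born_between(members, category, first_year, last_year):
    """Intersect one category with a range of birth-year categories."""
    decade = born_between(members, first_year, last_year)
    return sorted(members.get(category, set()) & decade)


matches = in_category_and_born_between(CATEGORY_MEMBERS, "Women", 1980, 1989)
print(matches)
```

The decade is never stored as a category of its own. It is assembled from ten yearly categories at query time and then intersected with the second category.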

Such queries are, however, not supported by Wikipedia. Wikimedia, the organisation behind Wikipedia and its sister projects, decided some time ago to use Elasticsearch as its search engine, and this search engine is designed for such tasks. Nevertheless, the Wikipedia implementation of Elasticsearch falls far short of the capabilities which such a search function offers. One might say that Wikipedia does not really have a sensible way of dealing with metadata.

In the case of the online encyclopaedia, the problem has many layers. The project has grown over many years, and so there are different sources that have not been harmonised. The metadata is mostly in the categories. At the same time, there is Wikidata, a platform in which metadata is collected and documented collaboratively so that it can be incorporated centrally in many Wikipedias. These different sources with their non-homogeneous references need to be collated and processed. There are duplicates and differing classification schemes, and all of this exists in completely different languages. To harmonise this is a Herculean task.

The founder of Wikipedia, Jimmy Wales, has been calling for an improved search function for a long time. In addition, a team named Discovery is working on developing the search function further.

It is, however, likely that important steps in connecting wikis with metadata will soon be taken in other fields too. Many of the problems confronting the Wikipedia universe do not arise for company wikis like BlueSpice, where the environment and the tasks are manageable. This is not a bad starting point for developing efficient search strategies with metadata.
