The semantic web: the future of search engines

An interview with Prof. Dr. Frank van Harmelen of the Vrije Universiteit Amsterdam, Faculty of Sciences, Department of Artificial Intelligence

Source: www.oryon.nl, November 2003

Let's start with the most obvious question: what does the semantic web actually entail?

Van Harmelen: 'To provide you with a clear explanation, we should look at the limitations of the current world-wide web. On the one hand, the world-wide web is a huge success - ten years ago, nobody would have foreseen that it would have such a great influence on our daily lives. However, at the same time it has a number of major limitations. Currently the web is very useful if you are able to read English or another language and able to comprehend images and photographs. People are well equipped for this task, but for computers this is another matter entirely. Computers are completely unable to handle the current web, at least as far as its content is concerned. And for this reason, computers are of very little use to us when we are looking for information on the web. In this sense computers provide us with extremely limited support. The only thing your expensive PC does is retrieve information from a single location, move it to another location and display it on your screen. However, we are left to our own devices when it comes to understanding, combining, interpreting, selecting and evaluating this information. Computers are unable to assist us in this task, because they simply do not comprehend what is said in those pages.

So how does this relate to the semantic web? Well, the idea behind the semantic web is that we can try to expand the current web by providing additional information that enables computers to actually comprehend the content of web pages. This is not to say that we should forget about or dump the current web. The semantic web will be an expansion, an extra layer on top of the existing world-wide web. In actual practice, this means that we will have to adapt part of the content of the current pages in such a way that these pages become intelligible to computers. And that is what we are currently actively involved in'.

How should we imagine the semantic web?

Van Harmelen: 'A number of specific languages have been developed that are intelligible to computers and can explain to them what a particular website is exactly about, and in a much richer way. For instance, in such a language you can indicate that there is such a thing as the Vrije Universiteit in Amsterdam, that there exists a person named Frank van Harmelen and that there is a relation between the two that can be defined as 'works for'. In addition, you can indicate that there is also a 'building', that this building 'is part of' the Vrije Universiteit, and that Frank van Harmelen 'works in' this building. Naturally, you will have to define all these relations. After all, the relation between the building and me is fundamentally different from the relation between the university and the building. However, if you define all this in one of those languages and go on to find people who 'work for' the Vrije Universiteit or 'work in' that particular building, the computer will understand what you are looking for, because you have taught it as much in advance. In this way, the computer would be able to provide you with much better support in your search for information'.
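
The relations described above can be pictured as subject-relation-object statements. The following Python is only a minimal sketch of that idea, not one of the languages Van Harmelen refers to; the building name "Science Building" and the helper `subjects` are hypothetical, introduced purely for illustration.

```python
# Hypothetical fact store: each fact is a (subject, relation, object) triple.
# "Science Building" is an invented name for the building in the example.
TRIPLES = {
    ("Frank van Harmelen", "works for", "Vrije Universiteit"),
    ("Science Building", "is part of", "Vrije Universiteit"),
    ("Frank van Harmelen", "works in", "Science Building"),
}


def subjects(relation, obj, triples=TRIPLES):
    """Return, sorted, every subject linked to `obj` by `relation`."""
    return sorted(s for s, r, o in triples if r == relation and o == obj)


# Who works for the university, and who works in the building?
print(subjects("works for", "Vrije Universiteit"))
print(subjects("works in", "Science Building"))
```

Because 'works for', 'works in' and 'is part of' are stored as distinct relations, a query for one of them does not return facts that belong to another.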

So the entire system stands or falls with the way in which you offer this information to your computer?

Van Harmelen: 'Yes, exactly. Let me give you another example: if you search for Frank van Harmelen's work address, you may not be able to find it because I have been too lazy or stupid to include it on my website. However, if the computer knows that I work at the Vrije Universiteit and it knows the address, it may be able to deduce which address it should provide on the basis of this information. However, you need to provide the computer with the necessary information first. This means that everything will depend on the quality of what we call the ontology. An ontology is a set of definitions that explains the concepts and relations between these concepts'.
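
The address example amounts to a simple rule: if a person works for an organisation and the organisation has an address, that address can serve as the person's work address. A minimal Python sketch of that deduction follows; the relation names, the function `work_address` and the placeholder address string are assumptions made for illustration.

```python
# Hypothetical facts; the address is a placeholder, not a real value.
FACTS = {
    ("Frank van Harmelen", "works for", "Vrije Universiteit"),
    ("Vrije Universiteit", "has address", "<address of the university>"),
}


def work_address(person, facts=FACTS):
    """Rule: if P works for O and O has address A, then A is P's work address."""
    for subject, relation, employer in facts:
        if subject == person and relation == "works for":
            for org, rel, address in facts:
                if org == employer and rel == "has address":
                    return address
    return None  # nothing can be deduced without both facts


print(work_address("Frank van Harmelen"))
```

The deduction only succeeds when both facts are present, which is the point made above: the computer has to be given the necessary information first.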

In other words, the quality of the ontology will determine the quality of the support that your computer can offer you.

Van Harmelen: 'Precisely. You can regard ontologies as structured ways of representing the meaning of words for a particular domain. Let's return to the example I gave you just now. You will have to explain to your computer what a university is, what an employee is, what the relation between the two is, how this relation takes shape in the actual world, etc. We refer to this type of information as metadata. And that is what we need to create the semantic web. Without this metadata, there can be no semantic web'.

And where will all this metadata come from?

Van Harmelen: 'That is the question I am asked the most during presentations (smiles). It does not result from the pens or keyboards of individual users. If we look at the origin of the world-wide web, of, say, the first hundred thousand pages, we can say that these were manually written by people who sat behind their computer and created web pages and HTML. However, for a long time this has no longer been the case. We don't get 3 billion pages on the world-wide web just like that. They are generated from databases, written with or by specific programs, etc. In the future, these databases and applications will not just generate HTML, but metadata as well. An easy example of this is Amazon.com. This website is actually a reverse database. All information is stored in the database, and the database is converted into HTML pages, which enables us to read and understand the information. The same information can also be represented in another language that computers can understand. And in this way, my personal shopping agent would be able to assist me during my search for books or music that match my predefined personal preferences. That in itself is already one source of metadata. Another important source is formed by specialised applications that are able to superficially understand natural languages, e.g. English or Dutch, and distil metadata from them. These types of applications already exist, and companies are already making money out of them. So, to a large extent, metadata will be generated automatically or semi-automatically'.
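
The idea that one database can feed both human-readable pages and machine-readable metadata can be sketched in a few lines of Python. The record, its field names and the two functions below are hypothetical, and JSON merely stands in for whichever metadata language is actually used.

```python
import json
from html import escape

# Hypothetical database record; title, author and price are invented.
BOOK = {"title": "An Example Book", "author": "A. Writer", "price_eur": 25.0}


def to_html(record):
    """Render the record for human readers."""
    return "<h1>{}</h1><p>by {}, EUR {:.2f}</p>".format(
        escape(record["title"]), escape(record["author"]), record["price_eur"]
    )


def to_metadata(record):
    """Render the same record for programs, labelled with its type."""
    return json.dumps({"type": "Book", **record}, sort_keys=True)


print(to_html(BOOK))
print(to_metadata(BOOK))
```

Both outputs come from the same stored record, so no one has to write the metadata by hand.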

But of course this metadata should be standardised so it can be exchanged and be of use.

Van Harmelen: 'You have just touched on an important topic. For instance, if you are using the term 'employee', and someone else is talking about a 'staff member', the computer would have to know the same concept is concerned in both cases. When it searches for 'employee', it should know that it must also include 'staff member' in the search query. That is precisely the efficiency we want to achieve with the semantic web. The current search engines are - maybe this is putting it a little dramatically - mainly engaged in character matching. Sure, they are a bit cleverer than that, but basically all they do is compare digits and letters. Ontologies will change all this. Not only will they define that I am an employee of the Vrije Universiteit, they will also define the concept of 'employee' and the relation with other concepts. However, this also means that when somebody else uses the word 'employee', he or she must refer to the same type of ontology, and that the two ontologies have to be linked. Only then will the computer be able to understand that two identical concepts are concerned. To do all this, you will indeed need standardised languages for metadata'.
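
The 'employee' and 'staff member' example can be illustrated with a small Python sketch in which a link between two vocabularies is an explicit equivalence, and a search expands its query accordingly. The page identifiers, the `EQUIVALENT` table and both functions are assumptions for illustration only.

```python
# Hypothetical link between two ontologies: terms in one set mean the same.
EQUIVALENT = [{"employee", "staff member"}]

# Hypothetical pages, each annotated with the concept it is about.
PAGES = [
    ("site-a/frank", "employee"),
    ("site-b/anna", "staff member"),
    ("site-c/vase", "product"),
]


def expand(term, equivalences=EQUIVALENT):
    """Return the term plus every term declared equivalent to it."""
    terms = {term}
    for group in equivalences:
        if term in group:
            terms |= group
    return terms


def search(term, pages=PAGES):
    """Match on concepts rather than on the literal characters of the query."""
    wanted = expand(term)
    return [page for page, concept in pages if concept in wanted]


print(search("employee"))
```

Without the equivalence entry, the same search would fall back to plain character matching and miss the page annotated with 'staff member'.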

Aren't computers able to identify such links by themselves? Is the technology still insufficiently advanced in this respect?

Van Harmelen: 'That is currently a hot research topic. This has already been proven possible in a number of experiments in carefully selected test domains, but it is not yet possible on the sprawling world-wide web as a whole. However, I expect a completely new commercial market to develop around this technology. Some companies are already offering ontologies. For instance, they provide large commercial ontologies with terms such as 'employer', 'employee', 'product', 'address', 'price', etc. These terms are interrelated, and you can link to these terms if you pay for this service. Moreover, these companies will make sure their ontology is linked to other ontologies. In this way, you will be provided with a kind of semantic service that allows users to find information on your pages in a much quicker and easier way'.

What will the regular Internet surfer notice of all these changes? It seems to be a back-office operation mainly. Which concrete changes will take place for regular internauts?

Van Harmelen: 'A large part of the semantic web's success will depend on it remaining invisible. All the technology we have been talking about is indeed beneath the surface. The only thing you will notice when you surf the Internet is that the quality of the results of your search engine will have improved. The current search engines are very good at recall: everything that can be found, will be found. However, the search engines do not provide such good results where precision is concerned. Apart from the required information, they will also come up with a lot of other stuff that you have no use for. I am turning this into a bit of a caricature, but it can still be said that the precision needs to be considerably improved. Of course, the way in which information is provided will change as well. For instance, if I enter my name in one of the search engines currently available, I will be presented with two types of results: results concerning me and my scientific work, but also results relating to the Dutch village of Harmelen. The problem is that the search engine does not make any distinction and presents a mishmash of this information. As the semantic web evolves, search engines will have to be capable of determining that two types of hits are concerned and that they have to be displayed separately, or ask the user what he or she is looking for: the person Frank van Harmelen or the village of Harmelen'.
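
Recall and precision, as used above, have standard definitions: recall is the share of relevant pages that the engine returns, and precision is the share of returned pages that are relevant. A minimal Python sketch follows; the page identifiers are invented to mimic the Harmelen example and are not real search results.

```python
def precision_recall(returned, relevant):
    """Return (precision, recall) for a result list against the relevant set."""
    returned, relevant = set(returned), set(relevant)
    hits = returned & relevant
    precision = len(hits) / len(returned) if returned else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query for the person: two relevant pages exist, but the
# engine also returns four pages about the village of the same name.
RELEVANT = {"person-1", "person-2"}
RETURNED = ["person-1", "person-2", "village-1", "village-2", "village-3", "village-4"]

precision, recall = precision_recall(RETURNED, RELEVANT)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this invented case everything relevant is found, so recall is perfect, while only two of the six results are wanted, which is the imbalance described above.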

So Internet searches will be simplified. Are there any other benefits?

Van Harmelen: 'An important topic that I haven't touched upon so far is personalisation. If you and I browse to a website, we will both get to see the same thing. This, however, is not an ideal situation, because you and I have different interests. Let's return to the example of Amazon.com. Wouldn't it be in the interest of this company to show us different pages, based on our personal preferences? Personalisation can even be taken so far that it will drastically reduce the flow of information. After all, things that you are not interested in do not have to be presented to you'.

How many people are currently involved in the development of the semantic web?

Van Harmelen: 'W3C is really just a small circle of people. There are many members, but the size of W3C's staff is limited. On a global scale, a few dozen people are involved, but they are also engaged in activities that are unrelated to the semantic web. The real work is done by people who are employees of the members of W3C. In workgroups centred around the semantic web you will find people from companies such as IBM, Hewlett-Packard, Sun, Nokia, etc. Although these may not be the most likely names, quite a lot of these companies will benefit from the development of the semantic web. Hewlett-Packard, for instance, sees opportunities in using the semantic web through their printers. Each printer would become a kind of self-describing device: each printer is assigned its own profile, written in a special language for the semantic web. And do you know what will happen? You walk into a building, a convention hall, for instance, and all the printers will make themselves known to your laptop or PDA. So when you want to print a page, your laptop or PDA will already be aware of the location of each printer and which printer is best suited to your specific task. A company such as Nokia, on the other hand, hopes to make all sorts of services available through their mobile phones. So it is logical that these companies would like to be involved in the development of the semantic web. Nobody wants to miss out'.

So what about the Googles, Altavistas and Yahoos of this world? To what extent are they involved in the development of the semantic web?

Van Harmelen: 'I recently spoke to some people from Google, and I was surprised to find them, shall we say, politely expectant. They were very well aware of the latest developments, but informed me that they would prefer to wait and see which way the wind blows. However, at the same time I noticed they are already experimenting with certain semantic aspects. Do you know the Open Directory? It is a project involving volunteers who manually categorise web pages. Currently Google already uses links to this gigantic database for a large number of search results. At the bottom of the page you will find a hyperlink to the category that the specific search result belongs to. In this way you can search for other results in the same category. So although Google doesn't want to admit it, it is already using semantic support. That's because it is something they simply can't do without. The popularity of a search engine is still determined by the quality of the search results'.

Are there any other practical applications?

Van Harmelen: 'W3C has developed an ontology that describes device capabilities. This ontology will teach the computer, say, the things a phone can do, what a printer is capable of and so on, and which information can be exchanged by these devices. There are also large ontologies for very specific industries. For instance, the biomedical sector already uses a relatively large number of comprehensive and well-structured ontologies with medical terms. The car industry is also at an advanced stage in this respect. Daimler-Chrysler is even an active participant in W3C workgroups. However, these are applications that regular Internet users aren't likely to come across that quickly. For the time being, the semantic web is mainly found in the business-to-business sector'.

When can we expect the first practical applications for regular Internet users, or consumers if you will?

Van Harmelen: 'Currently numerous small semantic web 'islands' are evolving within these specific industries. In the long term, I see these islands connecting, and that is when you will really get a semantic web. Consumers will not take any notice of the semantic web until then. What I do see happening in the near future, and this is an area Philips is very active in, is the creation of ontologies for media content provision. Let me give you an example. There are many websites that offer you an online television guide. These websites can only be read by people. A media content ontology will allow your computer or PDA to read such pages, match their content against your preferences and draw your attention to other interesting programs that are broadcast that same week. In this framework I can also imagine that ontologies will be created for musical or movie genres. The only thing that will be needed from you is an indication of the genre or sub-genre you are interested in. The computer will do the rest. I expect such applications to arrive in just a few years' time'.
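
The television guide example can be sketched as a small genre hierarchy plus a matching step. Everything in the Python below is hypothetical: the genres, the programme titles and the functions `is_a` and `recommend` are invented to illustrate how a sub-genre can match a broader preference.

```python
# Hypothetical genre ontology: each genre points to its broader genre.
PARENT = {"bebop": "jazz", "jazz": "music", "thriller": "film"}

# Hypothetical guide entries: (time slot, title, genre).
GUIDE = [
    ("Monday 21:00", "Late Bebop Session", "bebop"),
    ("Tuesday 20:30", "Night Train", "thriller"),
]


def is_a(genre, wanted, parent=PARENT):
    """True if `genre` equals `wanted` or is a sub-genre of it."""
    while genre is not None:
        if genre == wanted:
            return True
        genre = parent.get(genre)  # climb one level; None at the top
    return False


def recommend(guide, wanted):
    """Return the titles whose genre falls under the preferred genre."""
    return [title for _, title, genre in guide if is_a(genre, wanted)]


print(recommend(GUIDE, "jazz"))
```

A viewer who only states a preference for jazz is still offered the bebop programme, because the hierarchy records that bebop is a kind of jazz.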

Finally, a difficult question: when do you expect a major breakthrough of the semantic web?

Van Harmelen: 'That is indeed a very difficult question. It is quite hard to predict the future, and even more so where the IT industry is concerned. It so happens that I recently discussed this with Tim Berners-Lee, the architect of the world-wide web, and he used the metaphor of a bobsleigh. Initially you will have to set the bobsleigh in motion, but once it begins to gather speed, you will still have to hurry to get in if you don't want it to leave without you. As far as the semantic web is concerned, we are still in the pushing phase. We must convince the industry of the value of the semantic web. That said, there is no question in my mind that the semantic web will arrive. Tim Berners-Lee also regards the semantic web as the next major step in the history of the world-wide web. However, I'll be a bit more concrete than that. Let us say that I would be really disappointed if no visible applications of the semantic web became available to regular Internet users in two or three years' time. And in this respect I am particularly thinking of the e-commerce industry. To my mind, this industry can benefit the most from personalisation. It will take some more time to convert the entire web into a semantic web. But it will happen'.