Author: Gertjan van Heijst
Lately the world of knowledge and information management has shown a new, or renewed, interest in taxonomies. Not so long ago, many organisations abandoned these tools for structured knowledge-sharing because they were considered redundant after the emergence of full-text search engines. The manual classification of documents with taxonomies was considered too labour-intensive and therefore too expensive. Furthermore, consulting knowledge repositories required a certain expertise in the taxonomy, for which specific jobs had often been created. With the abolition of taxonomies, these jobs could be reorganised out of existence as well.
So where does this renewed interest in taxonomies come from? In our view, there are three reasons: (i) the inherent limitations of pure full-text searching, (ii) the desire to index non-textual media as well and (iii) the need for undirected searching in document repositories. In the following paragraphs we will discuss each of these factors in somewhat more detail.
Because of the nature of natural language, full-text search engines usually have low precision: they find not only the document that you are looking for, but many others as well. If a search engine returns five or ten irrelevant documents, this is perhaps acceptable, but when there are hundreds of thousands, you will be fighting a losing battle trying to identify the document you are looking for. Besides, search engines have limited recall: there is no guarantee that the search engine will return all the documents that you are looking for. For many purposes this may not be a big problem, but in a business context it could spell disaster. Imagine that the legal department of your company uses a full-text search engine to find all relevant jurisprudence for a lawsuit and that one decisive verdict slips through the net.
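To make the two notions concrete, the following minimal Python sketch computes precision (the share of returned documents that are relevant) and recall (the share of relevant documents that are returned). The function name and the document identifiers are hypothetical and serve only as an illustration of the lawsuit scenario.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one search result.

    retrieved: identifiers of the documents the engine returned
    relevant:  identifiers of the documents the user actually needs
    """
    retrieved = set(retrieved)
    relevant = set(relevant)
    hits = retrieved & relevant  # relevant documents that were found
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical case: four verdicts matter for the lawsuit; the engine
# returns ten documents, only three of which are among those verdicts.
relevant_verdicts = {"verdict-1", "verdict-2", "verdict-3", "verdict-4"}
returned = {"verdict-1", "verdict-2", "verdict-3"} | {f"other-{i}" for i in range(7)}

precision, recall = precision_recall(returned, relevant_verdicts)
print(f"precision = {precision:.2f}, recall = {recall:.2f}")
```

In this made-up case seven of the ten results are noise (low precision) and one verdict is missed (imperfect recall), which is exactly the combination described above.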
Obviously, full-text search engines can index only textual documents. As a side-effect of the increasing multimedia capabilities of PCs, it is becoming more common to distribute information in (partly) non-textual form. Indexing these non-textual documents is, for the moment at least, necessarily a manual operation, and taxonomies are useful tools to support this activity.
Search engines are excellent tools to find documents in large collections, if you know exactly what you are looking for. Often, however, you only have a vague idea of the kind of document that you need. In such cases it is handy to get a global overview of the types of documents that the collection contains, before you start searching for specific documents. Because of their hierarchical structure, taxonomies can provide such a global overview.
So much for our analysis of the reasons for the (justified) revival of interest in taxonomies. What worries us a bit, though, is the suggestion some advocates seem to make that taxonomies are a completely new concept in the world of knowledge-sharing. This suggestion is not only historically inaccurate but also dangerous, because it could lead to the neglect of the enormous amount of work in taxonomy building that has already been done. To emphasise this point, we will dedicate the remainder of this article to a presentation of the Universal Decimal Classification (UDC) and the context in which it was developed about a hundred years ago.
Let us start with the name. The UDC is a classification (or taxonomy) that is universal and decimal. Decimal means that every category in the classification may have ten subcategories at most. The decimal character, which the UDC inherited from the Dewey Decimal Classification of 1876, allows a simple numerical code to be associated with every category in the classification. Universal means that the classification (i) covers all existing knowledge areas and (ii) provides room for new knowledge areas when they emerge. Besides the main tables of categories and subcategories, which at present contain more than 65,000 knowledge areas, the UDC provides additional indexing facilities through auxiliary tables and facets. It would take us too far to go into the nitty-gritty details of these mechanisms, but you can take it from us that the indexing tools of the UDC are sufficiently powerful to provide access to the largest document collections.
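The idea of a decimal code can be illustrated with a small Python sketch. It assumes a simplified notation in which a code consists of digits only and every extra digit selects one of at most ten subcategories, so that each broader category is a prefix of the narrower one. The function names, the sample codes and their labels are illustrative assumptions, not quotations from the UDC tables, and the auxiliary tables and facets are left out entirely.

```python
def ancestors(code):
    """Return the broader categories of a digits-only code, broadest first."""
    if not code.isdigit():
        raise ValueError("this sketch only handles digits-only codes")
    return [code[:i] for i in range(1, len(code))]


def subcategory_codes(code):
    """Return the (at most ten) codes that one more digit can produce."""
    return [code + digit for digit in "0123456789"]


def is_narrower(code, other):
    """True if `code` denotes a strict subcategory of `other`."""
    return code != other and code.startswith(other)


# Illustrative labels only (assumed for this example).
labels = {"5": "science", "53": "physics", "539": "structure of matter"}

path = ancestors("539") + ["539"]
print(" > ".join(f"{c} {labels[c]}" for c in path))
```

Because the hierarchy is encoded in the code itself, broadening or narrowing a search amounts to shortening or lengthening the code, which is part of what makes such a scheme practical for manual indexing.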
Incidentally, this was exactly the purpose for which the UDC was originally developed. The ambition of the founders of the UDC, the Belgian information scientists Paul Otlet and Henri La Fontaine, was to develop a Universal Bibliographic Repertory (UBR): a card-index system with references to and abstracts of all books and periodicals published since the invention of printing. Their work on this repertory started in 1895, and by the beginning of World War I in 1914 it contained more than 11 million bibliographic entries. The Universal Bibliographic Repertory acted as a central database with references to, ultimately, all recorded human knowledge.
On top of the repertory they organised a search service that apparently generated a considerable amount of business. Users sent their requests by mail or telegraph. On receipt, staff of the UBR would translate the request into UDC terminology and then retrieve the relevant references from the repertory for copying. The copied entries were then sent back to the user by mail. This service can without any doubt be considered a 'Google avant la lettre', the only differences being that the communication between client and server was a bit slower and that the service was not free, which is perhaps the most significant dissimilarity. An interesting detail is that employees of the service would contact the client if there were more than fifty hits, to prevent any unpleasant surprises. This did not happen very often, by the way, because of the high precision of the UDC indexes.
The search service was kept alive until the 1970s. The UDC, however, has taken on a life of its own. It has been translated into more than thirty languages and is at present, either directly or in the form of a derivative classification, the most widely used indexing system worldwide for libraries and cultural-heritage repositories.