State of the Art

Indexing languages and knowledge organization

By Claudio Gnoli

Among the many kinds of signs used by humans, an interesting group is that of indexing languages, also called documentary languages in some contexts. These are codes aimed at describing knowledge contents as they are recorded in documents, like the subject of a book, the theme of a documentary film, or the topics addressed in a website. We can say generically that this bulletin is “about semiotics”, but specifying its contents precisely may not be a trivial task, especially when they have to be listed together with the contents of thousands, or millions, of other resources.

The original purpose of indexing languages is simply a pragmatic one: when, as a consequence of developments in printing and communications, librarians and bibliographers began to manage larger and larger amounts of resources, they could no longer keep in mind the place of each. They needed lists, or indices, where each entry stood for a corresponding document. (In particular, the index of all documents owned by an organization is called a catalogue.)

Let us imagine creating a list of websites that we find useful. How should we organize it? A simple option (actually the default option in computers) is alphabetical order by title. In this case we are doing descriptive indexing, as we just report descriptive information about document features (their titles, their publication dates, their authors), without any assumption concerning their subject content. However, this has some disadvantages. A title like The name of the rose does not correspond to a book about botanical nomenclature...

Also, descriptive indices assume that we already know the title or the author we are searching for, so that we can look under the appropriate point in the alphabetical order. But what if we need to know “what is there about semiotics”? We will find Semiotics: the basics by Chandler, but will fail to identify Collected papers of Charles Sanders Peirce. To be informed of all the documents available about a given subject, we need a subject index instead.
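The difference between the two kinds of index can be sketched in a few lines of Python. This is only an illustration: the record layout and the function names are assumptions made for the example, and the subject terms are supposed to have been assigned by a human indexer.

```python
# The two books named above; the "subjects" field is an assumed
# layout holding terms assigned by an indexer.
BOOKS = [
    {"title": "Semiotics: the basics", "subjects": ["semiotics"]},
    {"title": "Collected papers of Charles Sanders Peirce", "subjects": ["semiotics"]},
]


def search_titles(books, word):
    """Descriptive search: match a word against the titles only."""
    return [b["title"] for b in books if word.lower() in b["title"].lower()]


def search_subjects(books, term):
    """Subject search: match a term against the assigned subjects."""
    return [b["title"] for b in books if term in b["subjects"]]


by_title = search_titles(BOOKS, "semiotics")      # misses the Peirce volume
by_subject = search_subjects(BOOKS, "semiotics")  # retrieves both books
```

The title search depends on the word happening to appear in the title, while the subject search depends only on the indexer's decision about what the document is about.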

American librarians began to develop subject catalogues towards the end of the 19th century, especially following the principles formulated by C.A. Cutter. The Library of Congress Subject Headings became one of the most influential systems for subject indexing. These are verbal indices, i.e. they are made of words, but not of just any word of natural language. To make the system effective and predictable, it is necessary to choose a preferred form for expressing each concept, to make links from its synonyms and variant spellings, to add qualifiers distinguishing homographs, etc.

The words thus become terms, that is, special words belonging to a controlled reference vocabulary. The word “semiotics” is not the same thing as the controlled term semiotics. Also, terms are written according to some formal rules (e.g. countable nouns are always in the plural), and combined according to some standard order (e.g. place and time specifications always go at the end, as in semiotics – methodology – France – 20th century).
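The step from free words to controlled terms can be sketched as a simple lookup. The vocabulary below is a hypothetical toy and is not taken from any real subject heading list; the variant forms and the function name to_term are assumptions chosen for illustration.

```python
# Hypothetical toy vocabulary: each preferred term lists the entry
# words (synonyms, variant spellings) that should lead to it.
VOCABULARY = {
    "semiotics": ["semiology", "semeiotics"],
    "handbooks": ["handbook", "manuals"],
}

# Invert the vocabulary into a lookup table from any accepted word
# to its preferred term.
LOOKUP = {}
for preferred, variants in VOCABULARY.items():
    LOOKUP[preferred] = preferred
    for variant in variants:
        LOOKUP[variant] = preferred


def to_term(word):
    """Return the controlled term for a word, or None if uncontrolled."""
    return LOOKUP.get(word.strip().lower())
```

With this table, to_term("Semiology") gives the preferred term semiotics, while a word outside the vocabulary gives None, signalling that the indexer has to choose or create a term. A fuller sketch would also need qualifiers for homographs, which are left out here.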

The last example also shows one of the basic features of document subjects: most of the time, a document does not deal with a single topic isolated from all others; rather, it deals with a combination of topics standing in some syntactic relationship to each other, like “the spreading of A in region B during period C”, or “the influence of A on B under conditions C”. This suggests that subjects can best be specified by combining a series of elements (the facets of the subject) into a compound string. This idea was theorized by the Indian librarian S.R. Ranganathan, who provided a well-developed methodology for identifying, listing, and combining facets. Facet analysis is still a major subfield of knowledge organization today, when it is finding new applications in the management of digital resources and the information architecture of websites.
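The combination of facets into a compound string can be sketched as follows. The facet names and their order are simplifying assumptions made for this example, not Ranganathan's own categories; the only rule taken from the text is that place and time go at the end.

```python
# Assumed citation order for this sketch: place and time go last,
# as in the heading rule mentioned earlier.
CITATION_ORDER = ["topic", "aspect", "place", "time"]


def build_heading(facets, separator=" – "):
    """Combine the facets that are present into one compound string."""
    unknown = set(facets) - set(CITATION_ORDER)
    if unknown:
        raise ValueError(f"unknown facets: {sorted(unknown)}")
    parts = [facets[name] for name in CITATION_ORDER if name in facets]
    return separator.join(parts)


# The facets can be supplied in any order; the function files them.
heading = build_heading({
    "time": "20th century",
    "topic": "semiotics",
    "place": "France",
    "aspect": "methodology",
})
```

Because the order is fixed in one place, every indexer produces the same string for the same combination of facets, which is what makes the resulting headings predictable for searchers.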

Although verbal subject indexing is able to group some documents according to their content, it does not free us completely from the limitations of alphabetical order. Indeed, related subjects like semiotics and communication will fall far apart in the list. To obtain a systematic presentation, where similar topics are listed together, we have to represent each concept with a notation expressing its position in a classification scheme: say that semiotics is Q45 and communication is Q3, so that they will be filed near each other.

We have moved, in Ranganathan's terms, from the idea plane (the concept of semiotics) to the verbal plane (the term semiotics), and then to the notational plane (the notation 81'22 which stands for semiotics in the Universal Decimal Classification).

From the semiotic viewpoint, it is interesting to notice that notation can use any set of symbols as its base: letters, digits, punctuation marks, stars, icons... Many classification systems use digits, following the prototype of the Dewey Decimal Classification, where the classmarks are sorted like decimal fractions, so that more digits mean increased specificity rather than a greater value. It has also been observed that digits are more universal than letters, as they are also understood in some countries that do not use the Latin alphabet. In this respect, notation can act as a transnational language. For the same reason classification is used more often in Eastern European countries, where developing and maintaining a verbal subject heading system in the national language would be too expensive relative to the number of its potential users.
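The decimal behaviour of digit notation can be sketched as follows. The classmarks are made-up digit strings used only for illustration and are not quoted from any edition of the Dewey Decimal Classification; the function name is likewise an assumption.

```python
from fractions import Fraction

# Made-up classmarks: longer strings are more specific subdivisions.
classmarks = ["6", "51", "5", "513", "52"]


def as_decimal_fraction(mark):
    """Read a digit string as the fraction 0.<digits>, e.g. '513' -> 0.513."""
    return Fraction(int(mark), 10 ** len(mark))


# Filing order: sort as decimal fractions, so subdivisions follow their parent.
filing_order = sorted(classmarks, key=as_decimal_fraction)

# Sorting by integer value instead would scatter the subdivisions.
integer_order = sorted(classmarks, key=int)
```

In the filing order, 51 and 513 stay under 5 and before 52, while sorting by integer value would push 513 to the end, far from its parent class. For digit strings like these, plain character-by-character sorting gives the same result as the fraction reading.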

In the digital environment where modern information resources are held, it is highly advisable to build notation from symbols included in the ASCII standard code, that is, letters, digits and only a few additional symbols; the Greek letter delta, which stood for a main class in Ranganathan's Colon Classification, would instead cause problems for computer data managers.
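A check of this kind is easy to automate. The following minimal sketch, whose function name is an assumption, lists the characters of a notation that fall outside the 128 ASCII codes.

```python
def non_ascii_symbols(notation):
    """Return the characters of a notation that fall outside ASCII."""
    # ASCII covers code points 0 to 127 only.
    return [ch for ch in notation if ord(ch) > 127]


udc_mark = non_ascii_symbols("81'22")  # digits and apostrophe only
delta_mark = non_ascii_symbols("Δ")    # Greek capital delta
```

Here udc_mark is empty, because the notation uses only digits and an apostrophe, while delta_mark contains the delta, which a system designer might then replace with an ASCII substitute.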

Other relevant aspects of notation are its brevity and simplicity, which can help users to keep in mind a code seen in an index while moving to a library shelf, copying it into a request form, or communicating it by telephone; and its expressiveness, that is, its capability to reflect the hierarchies and relationships of the classification schedule. Unfortunately, brevity and simplicity on one side and expressiveness on the other tend to be inversely related, so that acceptable compromises must be devised. Classifications like Colon give priority to expressiveness, while others like Bliss give priority to brevity. Obviously the optimal choice depends on the purpose of the system: if notation has to be written and read on the spines of books, it must be simple, while if it is handled entirely by computers, which show its verbal equivalents to users, much more room can be left to expressiveness.

On the semiotic side, document indices seem to occupy an original position. While documents are signs referring to some denotatum, their indices are signs about signs. This means that there is a two-step path between indexing languages and the phenomena dealt with. A subject index should reflect both steps. In an entry like zoology – France – handbooks, the terms zoology and France refer to the phenomena dealt with, while handbooks refers to the documentary form in which they are dealt with.

We can go further, and notice that even the term zoology assumes that a given phenomenon (animals) is treated under a particular disciplinary perspective, that of zoology rather than veterinary medicine or food science. In fact, most classification schemes are based on a division of knowledge into canonical disciplines, like zoology, sociology, or philosophy. As a result, zoological documents are separated from veterinary medicine documents dealing with the same animals. Some authors, however, claim that the phenomenic dimensions (the object of knowledge) should be expressed separately from the epistemic ones (discipline, literary form, theory, method, application...). In this way, one could search for all documents about animals studied under any perspective, or for all documents applying an evolutionary perspective to any kind of phenomenon. This idea has recently been put forward in the León Manifesto.
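The effect of recording the two dimensions separately can be sketched with a small data structure. The titles, field names and values below are invented for illustration and do not come from any real catalogue or from the León Manifesto itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    title: str        # invented titles, for illustration only
    phenomenon: str   # the object of knowledge
    perspective: str  # the discipline or approach applied to it


RECORDS = [
    Record("Anatomy of mammals", "animals", "zoology"),
    Record("Diseases of cattle", "animals", "veterinary medicine"),
    Record("Evolution of languages", "languages", "evolutionary theory"),
    Record("Evolution of vertebrates", "animals", "evolutionary theory"),
]


def search(records, phenomenon=None, perspective=None):
    """Filter on either dimension; None means 'any value'."""
    return [
        r.title
        for r in records
        if (phenomenon is None or r.phenomenon == phenomenon)
        and (perspective is None or r.perspective == perspective)
    ]


about_animals = search(RECORDS, phenomenon="animals")
evolutionary = search(RECORDS, perspective="evolutionary theory")
```

Because neither dimension is subordinated to the other, the same records answer both kinds of question: everything about animals regardless of discipline, and everything evolutionary regardless of the phenomenon studied.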

Although most indexing languages were originally developed for libraries or information services, their relations with the more general philosophical problem of the order of knowledge were recognized early. Putting a concept or a discipline before another, or under one broader class rather than another (say, morals under religion or under sociology), implies philosophical assumptions. Some authors have observed that the Library of Congress Subject Headings hide a masculine view of the world, or that the Dewey classification is strongly Western-biased.

In turn, the organization of knowledge can have relevant effects on society. It is not just a matter of putting one book near another, but also of whether or not to create a given university department, or a given government ministry, devoted to a certain field of human knowledge and activities. In the words of H.E. Bliss, “this is not merely an intellectual interest but has social and economic value [...] It is not merely a bibliothecal problem, nor on a higher plane is it a problem solely scientific or philosophic. It concerns all these and also the educational interests and those of social organization”.

We thus come from indexing languages to the more general notion of knowledge organization systems (KOS). This term, taken from the titles of Bliss's books published in 1929 and 1933, has been revived since 1989 with the creation of the International Society for Knowledge Organization (ISKO), aimed at studying and discussing classification and knowledge ordering, both in their philosophical foundations and in their technical applications in information resource management. KOSs include practical tools, like verbal subject heading systems, thesauri, taxonomies, and bibliographic classification schemes, but also philosophical systems of knowledge, like those of Aristotle, Francis Bacon, John Wilkins, Auguste Comte, and modern ontologists.

What are the basic categories of reality? Can they be connected to those of linguistics (cases), or to those used in automatic information processing (part, process, agent, etc.)? It seems that much work is yet to be done. What is most needed, it seems, is not new notions, but acknowledgement of the connections between notions already developed in different fields, like psychology, philosophy, linguistics, information science, computer science... Very often the same ideas are called by different names (a paradoxical situation for people working on vocabulary control!), and specialists in each field do not know each other's work. The biennial international ISKO conferences (the next ones will be held in Montreal, August 2008, and Rome, February 2010) are attended by a wide variety of professionals, representing a wide variety of approaches. At least, the existence of such interdisciplinary fora gives good hope for progress in the integration of research and ideas.

Can semiotics be part of the company? It surely should be, although few authors have yet recognized its relevance to knowledge organization. Among them are John Sowa, who bases his knowledge representation ontology on Peirce's categories of firstness, secondness, and thirdness, and Alfredo Serrai, who wrote Italian books with titles like Indices, logic, and language. Readers are encouraged to contribute observations on more interdisciplinary connections, and to join the discussion concerning knowledge organization.

References

A.C. Foskett, The subject approach to information, 5th ed., Library Association, London 1996.

H.E. Bliss, The system of the sciences and the organization of knowledge, Philosophy of science, 2, 1935, n. 1, p. 86-103, also in JStor.

S.R. Ranganathan, Prolegomena to library classification, 3rd ed., SRELS, Bangalore 1967, also in DLIST.

B.C. Vickery, Notational symbols in classification, Journal of documentation, 8, 1952, n. 1, p. 14-32; 12, 1956, n. 2, p. 73-87; 13, 1957, n. 2, p. 72-77; 14, 1958, n. 1, p. 1-11; 15, 1959, n. 1, p. 12-16.

Knowledge organization: international journal devoted to concept theory, classification, indexing, and knowledge representation, Ergon, Würzburg.

I. Dahlberg, Knowledge organization: its scope and possibilities, Knowledge organization, 20, 1993, n. 4, p. 211-222.

R. Poli, Ontology for knowledge organization, in R. Green (ed.), Knowledge organization and change: proceedings Fourth ISKO conference, Indeks, Frankfurt, p. 313-319, also www.mitteleuropafoundation.org/Papers/RP/.

C. Gnoli (ed.), special issue on Facet analysis, Axiomathes, 18, 2008, n. 2, Springer, www.springerlink.com/content/106590/.