Note: The full version of this paper can be obtained from the current issue of The Public Journal of Semiotics.
Using visualization software, one can visualize the relations between words in a co-word map using relational (graph) analysis, and additionally distinguish meaningful components in the communication spatially by using a systems perspective.
Figure 1, for example, shows a cosine-normalized co-occurrence map of words in nine newspaper articles using the word “autopoiesis.” This visualization of a discourse illustrates our research methodology. In this figure, the names “Derrida” and “Heidegger” are related via the word “Deconstruction.”
In a similar vein, one would be able to analyze large sets of texts or discourses in terms of their visual representations. In a recent paper entitled “Content Analysis and the Measurement of Meaning: The Visualization of Frames in Collections of Messages” in the Public Journal of Semiotics (available online), we provide an online manual for the construction of semantic maps using a set of computer programs. Our intention is to guide the user through the steps of these techniques and to facilitate the process by providing routines for the transitions between packages. Our technique can be used for the analysis of any set of messages that is available electronically as text files.
We first distinguish between two possible usages: short messages (e.g., titles) or long texts (e.g., full-text articles). The two corresponding routines are called ti.exe and fulltext.exe. Ti.exe reads a file with the short messages organized as sequential lines; fulltext.exe prompts for the number of texts to be included in the analysis. Both programs require preprocessing of the set of messages into a list of word frequencies. This list can be sorted on frequencies, or one can select words on the basis of a statistical test for the significance of words (Leydesdorff & Welbers, 2011). However, the word/document matrix generated by the routines is informative about the network structure and can also be used without further statistics.
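The preprocessing step can be sketched as follows. This is only a minimal illustration in Python, not the procedure built into ti.exe or fulltext.exe: the function name `word_frequencies`, the tokenization rule, the stopword list, and the three example messages are all assumptions made for this sketch.

```python
import re
from collections import Counter


def word_frequencies(messages, stopwords=frozenset()):
    """Count how often each word occurs across all messages."""
    counts = Counter()
    for message in messages:
        # Lowercase the message and keep runs of letters as words.
        words = re.findall(r"[a-z]+", message.lower())
        counts.update(w for w in words if w not in stopwords)
    return counts


# Hypothetical input: one short message (e.g., a title) per line.
messages = [
    "Autopoiesis and the theory of social systems",
    "Deconstruction and autopoiesis in social theory",
    "The theory of deconstruction",
]
frequencies = word_frequencies(messages, stopwords={"and", "the", "of", "in"})

# Sort on frequency (descending), then alphabetically for a stable order.
word_list = sorted(frequencies.items(), key=lambda item: (-item[1], item[0]))

for word, count in word_list:
    print(f"{count}\t{word}")
```

The resulting list can then be cut off at a frequency threshold or filtered with a statistical test before it is used as the word list for the analysis.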
The program output
Both programs provide three essential output files. The file coocc.dat contains the affiliations between the words in the word list that was used as input: how often do two words co-occur within a single message? The file is in the so-called Pajek format. Pajek is a popular freeware program for network analysis and visualization, made by statisticians at the University of Ljubljana and available on the internet. In addition to visualization, this program provides network statistics, such as the density of the network and the various centrality measures of words (De Nooy et al., 2005; Hanneman & Riddle, 2005).
Furthermore, Pajek files are increasingly considered a standard for exchanging data with other freeware visualization programs such as Gephi or VOSviewer. These programs may offer additional facilities: Gephi, for example, allows for exporting the network in the GEXF format, which can be embedded in HTML for interactive display on the internet (using GEXFExplorer; Leydesdorff et al., 2011). VOSviewer allows the data to be visualized as a density map (Figure 2).
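The kind of information stored in a co-occurrence file can be sketched in Python as follows. This is an illustration under stated assumptions, not the code of ti.exe or fulltext.exe: the function names `cooccurrences` and `to_pajek`, the whitespace tokenization, and the example messages are hypothetical, and the exact layout of coocc.dat may differ from the plain Pajek network layout (a `*Vertices` section followed by an `*Edges` section) written here.

```python
from collections import Counter
from itertools import combinations


def cooccurrences(messages, words):
    """Count in how many messages each pair of listed words occurs together."""
    vocabulary = set(words)
    pairs = Counter()
    for message in messages:
        # Listed words present in this message, in alphabetical order.
        present = sorted(vocabulary & set(message.lower().split()))
        pairs.update(combinations(present, 2))
    return pairs


def to_pajek(words, pairs):
    """Render the counts as an undirected, weighted network in Pajek layout."""
    index = {word: i for i, word in enumerate(words, start=1)}
    lines = [f"*Vertices {len(words)}"]
    lines += [f'{i} "{word}"' for word, i in index.items()]
    lines.append("*Edges")
    for (a, b), weight in sorted(pairs.items()):
        lines.append(f"{index[a]} {index[b]} {weight}")
    return "\n".join(lines)


# Hypothetical word list and messages, for illustration only.
words = ["autopoiesis", "deconstruction", "theory"]
messages = [
    "autopoiesis and social theory",
    "deconstruction and autopoiesis in theory",
    "theory of deconstruction",
]
pairs = cooccurrences(messages, words)
network = to_pajek(words, pairs)
print(network)
```

Saved to a text file, a network in this layout can be opened in Pajek, where the edge weights are the co-occurrence counts.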
The file cosine.dat is the second output of the programs ti.exe and fulltext.exe. This file contains the cosine values of the angles among the vectors representing the words in a multi-dimensional space. The multi-dimensional space is generated by considering the distributions of word occurrences in messages as vectors. A matrix is first constructed in which the messages are represented as the cases (rows) and the words as the column variables. (This matrix is also available as the file “matrix.txt”.) A cosine can then be computed between each pair of word vectors.
The cosine is a similarity measure comparable to the Pearson correlation coefficient, but without the normalization to the mean. Given the large number of non-occurrences of words in messages (zeros in the matrix), the non-parametric cosine is a better measure than the parametric Pearson correlation for textual analysis (Ahlgren et al., 2003). Two words can have a very similar distribution over the file without necessarily co-occurring in any message. For example, two synonyms may never be used in the same message, but nevertheless be equivalent. These two words should thus be positioned close to each other in the map.
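The relation between the two measures can be illustrated with a small Python sketch: the Pearson correlation equals the cosine of the two vectors after each has been centered on its mean. The two occurrence vectors below are invented for illustration. Appending messages in which neither word occurs leaves the cosine unchanged but shifts the Pearson correlation, which is one way to see why the many zeros in a word/document matrix matter for the choice of measure.

```python
import math


def cosine(x, y):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0


def pearson(x, y):
    """Pearson correlation: the cosine of the mean-centered vectors."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    return cosine([a - mean_x for a in x], [b - mean_y for b in y])


# Hypothetical occurrences of two words over ten messages.
x = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y = [0, 1, 1, 0, 0, 0, 0, 0, 0, 0]

# The same two words after ten more messages containing neither word.
x_more = x + [0] * 10
y_more = y + [0] * 10

print(cosine(x, y), cosine(x_more, y_more))    # cosine is unaffected
print(pearson(x, y), pearson(x_more, y_more))  # Pearson changes
```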
One can distinguish between co-occurrences as a relational measure and cosine similarity as a positional measure (Burt, 1982). The cosine measure should not be taken over the symmetrical co-occurrence matrix, but over the asymmetrical word/document matrix (Leydesdorff & Vaughan, 2006). This matrix is the third important output file made available by both programs as the file “matrix.txt”.
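As a minimal sketch of this computation, the following Python fragment takes the word vectors as the columns of a small word/document matrix and computes the cosine for each pair of words. The matrix values and the word list are invented for illustration, and nothing is assumed here about the actual layout of matrix.txt or cosine.dat.

```python
import math


def cosine(x, y):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0


# Hypothetical word/document matrix: messages as rows, words as columns.
words = ["autopoiesis", "deconstruction", "theory"]
matrix = [
    [1, 0, 1],
    [1, 1, 1],
    [0, 1, 1],
    [0, 0, 1],
]

# Transpose so that each word is represented by its column vector.
columns = list(zip(*matrix))

cosines = {
    (words[i], words[j]): cosine(columns[i], columns[j])
    for i in range(len(words))
    for j in range(i + 1, len(words))
}

for (first, second), value in cosines.items():
    print(f"{first}\t{second}\t{value:.3f}")
```

Because the word vectors are taken from the asymmetrical matrix, the number of messages determines the length of each vector, whereas the number of words determines how many cosines are computed.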
“Matrix.txt” can be read into SPSS (or other statistics programs) for further analysis. For example, one can use matrix.txt as input for principal component, factor, or correspondence analysis. The SPSS syntax file “labels.sps” contains the words as variable names for the import into SPSS. These statistics programs allow for further analysis of the network structures. If desired, one can color the partitions in the visualizations according to the insights thus gained into the structural properties of the set. This is what we did when coloring Figure 1 above.
In summary, these routines provide the user with the input files for both visualization and further statistical analysis. After finishing the procedures, the user is asked whether to proceed with the observed/expected values, which may further improve the representations. However, this is not advised when first using this software; one should first get started and refine only thereafter. We used this software in classes of second-year students at the University of Amsterdam. It usually takes the students about a day to become fully familiar with the software.
Further assistance is provided to users by instructions and help files available in online lessons at http://www.leydesdorff.net/indicators. For example, one can further embellish the output. Pajek and the other visualization programs allow the user to export the results into vector formats (SVG), which can be used in programs such as Adobe Illustrator or the freeware program Inkscape. Alternatively, the bitmapped (BMP) file format can be used directly for illustrations in Word files.
The use of the maps
In this era of the internet, pictures are often more convincing than tables based on various statistics. However, pictures never speak for themselves. One can use the figures to illustrate an argument. Note that visualization is not a strictly analytical technique. Using the methods mentioned above, one can further inform the results, for example, by using statistics.
Visualization is always based on the projection of a multi-dimensional space onto a two-dimensional screen or sheet. This reduces the information contained in the data. There are various options for this projection, and they may lead to very different representations. The manual focuses on one of them as an example (namely, the popular algorithm of Kamada & Kawai, 1989), but the user can further explore the other options, since this reorganization of the textual data produces files that can be read directly into visualization software.
Conclusions and perspectives
Our long-term objective is to model and measure the communication dynamics of science and technology. Discursive knowledge codifies specific meanings among other possible meanings, much the same as the communication of meaning codifies some information as meaningful and other information as noise. Meaning is generated when bits of information can be related by a system (e.g., a discourse or an observer); knowledge is recursively generated when different meanings can be related and thus communicated in a system. The currently provided programs are only a first step in this longer-term measurement and modeling effort (Leydesdorff, 2011).
References
Ahlgren, P., Jarneving, B., & Rousseau, R. (2003). Requirement for a Cocitation Similarity Measure, with Special Reference to Pearson’s Correlation Coefficient. Journal of the American Society for Information Science and Technology, 54(6), 550-560.
Burt, R. S. (1982). Toward a Structural Theory of Action. New York, etc.: Academic Press.
De Nooy, W., Mrvar, A., & Batagelj, V. (2005). Exploratory Social Network Analysis with Pajek. New York: Cambridge University Press.
Hanneman, R. A., & Riddle, M. (2005). Introduction to social network methods. Riverside, CA: University of California, Riverside; at http://faculty.ucr.edu/~hanneman/nettext/.
Kamada, T., & Kawai, S. (1989). An algorithm for drawing general undirected graphs. Information Processing Letters, 31(1), 7-15.
Leydesdorff, L. (2011). “Meaning” as a sociological concept: A review of the modeling, mapping, and simulation of the communication of knowledge and meaning. Social Science Information (in press; preprint available at http://arxiv.org/abs/1011.3244).
Leydesdorff, L., Hammarfelt, B., & Salah, A. A. A. (2011). The structure of the Arts & Humanities Citation Index: A mapping on the basis of aggregated citations among 1,157 journals. Journal of the American Society for Information Science and Technology (in preparation; preprint version available at http://arxiv.org/abs/1102.1934).
Leydesdorff, L., & Vaughan, L. (2006). Co-occurrence Matrices and their Applications in Information Science: Extending ACA to the Web Environment. Journal of the American Society for Information Science and Technology, 57(12), 1616-1628.
Leydesdorff, L., & Welbers, K. (2011). The semantic mapping of words and co-words in contexts. Journal of Informetrics (in press; preprint available at http://arxiv.org/abs/1011.5209).
The articles were harvested from: The Washington Post (2), The Australian (1), Calgary Herald (1), The Herald (Glasgow) (1), The Independent Extra (1), The New York Times (1), The Observer (1), and Prince Rupert Daily News (British Columbia) (1). LexisNexis was used for the retrieval.