Berber Sardinha, Tony
University of Liverpool

Mr. A.P. Berber Sardinha
Modern Languages Building
University of Liverpool
Liverpool L69 3BX

Strand: Computational

Title: Lexical cohesion and segmentation

The concept of lexical cohesion is used in Systemic Functional Linguistics to refer to the ways in which messages are connected by semantic ties (Eggins 1994:88). Lexical cohesion does more than connect texts, though. It is argued here that a key role of lexical cohesion is to provide ways of dividing texts into coherent segments. So far, very little has been said about how texts segment naturally into sections or chapters, even though such units form an integral part of a wide range of text types (e.g. research articles, novels, business reports). The present study focuses on sections, since fewer text types have chapter divisions. The data for the study are 100 encyclopedia texts selected because they have section divisions. Lexical cohesion is computed by a program specially designed for this investigation, which identifies all lexical links (Hasan 1984; Hoey 1991) between sentences. The links are then clustered according to the similarity of their distribution. Crucial at this stage is determining the number of clusters present in the data. For this task, the Cubic Clustering Criterion statistic (Sarle 1983) is calculated for each text, and the optimum number of clusters is obtained. The clusters are then plotted and segment boundaries are drawn. The segments so obtained are compared to the existing section divisions by means of ANOVAs. The existing section divisions are also compared to both a random segmentation and a segmentation produced by TextTiling (Hearst 1994). Preliminary results indicate that segmentation by clustering lexical links provides a better approximation to the original sectioning of the texts than either random or TextTiling segmentation. The implications of the findings are discussed, including whether they can be taken as evidence that the organization of lexis can be predicted (Hoey 1991).
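
The idea of counting lexical links between sentences and reading segment boundaries off their distribution can be illustrated with a minimal Python sketch. It is not the program used in the study. It assumes that a link is simply a word type shared by two sentences (simple repetition only), it uses a hypothetical stopword list, and it replaces the clustering and Cubic Clustering Criterion steps with a much cruder heuristic: a boundary is placed at any gap between sentences that is crossed by fewer links than the gaps on either side. All function names are illustrative.

```python
import re
from itertools import combinations

# Hypothetical stopword list; the abstract does not say which words are excluded.
STOPWORDS = {"the", "a", "an", "of", "in", "and", "is", "are",
             "to", "on", "by", "from", "for"}


def content_words(sentence):
    """Return the lowercased word types of a sentence, minus stopwords."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    return {t for t in tokens if t not in STOPWORDS}


def lexical_links(sentences):
    """Map each sentence pair (i, j), i < j, to its number of shared word types.

    Assumption: a link is plain repetition of a word form. Pairs with no
    shared word are left out of the result.
    """
    words = [content_words(s) for s in sentences]
    links = {}
    for i, j in combinations(range(len(sentences)), 2):
        shared = len(words[i] & words[j])
        if shared:
            links[(i, j)] = shared
    return links


def gap_scores(n_sentences, links):
    """For the gap after each sentence g, sum the links that cross that gap."""
    return [
        sum(n for (i, j), n in links.items() if i <= g < j)
        for g in range(n_sentences - 1)
    ]


def boundary_gaps(scores):
    """Return gaps whose score is strictly lower than both neighbouring gaps.

    This stands in for the clustering step; it is a simplification,
    not the method described in the abstract.
    """
    found = []
    for g, score in enumerate(scores):
        left = scores[g - 1] if g > 0 else float("inf")
        right = scores[g + 1] if g + 1 < len(scores) else float("inf")
        if score < left and score < right:
            found.append(g)
    return found


if __name__ == "__main__":
    text = [
        "Volcanoes erupt when magma rises.",
        "Magma from volcanoes forms lava.",
        "Glaciers carve valleys slowly.",
        "Valleys carved by glaciers are deep.",
    ]
    links = lexical_links(text)
    scores = gap_scores(len(text), links)
    print(links)                  # {(0, 1): 2, (2, 3): 2}
    print(scores)                 # [2, 0, 2]
    print(boundary_gaps(scores))  # [1]
```

In the toy text, the first two sentences share two word types, as do the last two, and no link crosses the gap after the second sentence, so that gap is returned as the only boundary. Because the sketch matches exact word forms, "carve" and "carved" do not count as a link; a fuller treatment of the link types in Hasan (1984) and Hoey (1991) would need lemmatization and synonym handling.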