kpattera@sfu.ca
CHWP B.32, publ. April 1997. © Editors of CHWP 1997. [First published in TEXT Technology, 6.3 (1996), Wright State University.]
KEYWORDS / MOTS-CLÉS
letters, women writers, corpus linguistics, stylistics, literary
criticism; lettres, écrivaines, linguistique du corpus,
stylistique, critique littéraire
Designing an epistolary corpus from the material in a letter
collection presents challenges that arise both from the text-type --
personal letters -- and from the users of machine-readable corpora --
linguists and literary scholars. Stated briefly, this project
presents the opportunity to design a corpus suitable for
both literary and linguistic research, an attribute that few of
the major corpora now in use possess, with the
exception of Old English and dictionary corpora such as the Early
Modern Dictionaries Corpus.[1] Linguists and literary scholars,
generally speaking, come at language from opposite directions:
linguists from the word and phrase level up to text or discourse
level, literary scholars from the text level down to the word and
phrase level. They meet each other somewhere in the fluid regions
of discourse analysis, literary stylistics, and genre typology
and taxonomy. Rarely do they swim in the same textual waters.
Linguists work with language samples; literary scholars work with
authored texts. If a corpus is by definition a collection of
texts/samples designed for a particular "representative" function
(Oostdijk 1991: 19), then it is not surprising that corpora
infrequently meet the requirements of researchers in both
disciplines. Yet a specialized corpus designed for a variety of
uses is congruent with the research aims of some scholars in both
disciplines. First I shall describe my letter collection briefly,
and then I shall look at the proposed corpus from a linguistic
perspective and finally from a literary point of view.
The Letter Collection
The collection at present consists of approximately 2,000 letters
written by or to nineteenth-century women writers during a period
of roughly fifty years, 1820-1870. Almost all the letters are
unpublished and have been transcribed from manuscripts in
archives across Great Britain, the United States and Canada.
Around twenty-five women writers are represented, mainly British,
although a few are American; "writers" are defined as published
authors of more than one work. Well-known figures such as Harriet
Martineau, Anna Jameson, Mary Russell Mitford, Fanny Kemble,
Florence Nightingale, and Harriet Beecher Stowe are represented,
as are lesser-known women such as Anna Maria Hall, Sarah Austin,
Mary Howitt, Catharine Sedgwick, and Harriet Grote. The
collection was amassed for my doctoral dissertation, a taxonomic
study that defines and describes the epistolary form as it is
found in nineteenth-century women's letters. My interest in these
letters arose from a previous study of the letters of Elizabeth
Barrett Browning (EBB) where I discovered that her female
correspondents, apart from relatives and friends of both
Brownings, were virtually all writers, and that these women
seemed to correspond amongst themselves. Questions such as "Was
there a network of nineteenth-century women writers?" and, if so,
"What kind of relationships did they have?" prompted my search
for the letters of EBB's correspondents, and then, in snowball
fashion, of their correspondents. Thus the collection even in its
present limited form not only represents as writers or recipients
virtually all the women "intellectuals" of the period, but also
shows the many links amongst them. All the texts from my
transcriptions are presently in electronic form. Where copyright
regulations permit, the collection will be augmented by published
letters, especially in nineteenth-century editions and, as
archival work continues, by more unpublished letters. But even as
it stands, the material meets the basic criteria for a corpus in
that it is clearly demarcated by text-type, social group, and
historical period boundaries, as well as having biographical,
bibliographical and textual parameters.
The Linguistics Perspective: Corpus Linguistics
Because corpus linguistics, as Nelleke Oostdijk phrases it, "aims
at the study of actual language use" (Oostdijk 1991: 19), the
first corpora -- such as the Survey of English Usage, begun in
1959 and covering varieties of contemporary English in written and
spoken forms, the London-Lund Corpus of spoken discourse, and the
Brown Corpus of American English c1961[2] -- concentrated on
representing the diversity of uses and forms of twentieth-century
English. Jan Aarts, in his paper "Corpus Linguistics: an
Appraisal", remarked that "corpus linguistics also deals, by
necessity, with all aspects of language variation, individual,
social and regional", and challenged corpus linguists by
suggesting that "one of the aims Corpus Linguistics should set
itself is to describe the structure of texts, not merely
sentences" (Aarts 1988). Oostdijk posited that corpus linguists
had to "come to terms not only with language structure but also
with all its relevant extra-linguistic correlates" (Oostdijk
1991: 21), the concern of stylistics and sociolinguistics. The
first step, Oostdijk maintained, was to identify the sets of
text-internal, linguistic features that determine text-type as
well as the text-external features that delimit genres. Study of
these correlates is essential as a basis for the study of
language variation, but "corpora which have been compiled with
the intention of representing a cross-section of the language are
not suited for the study of linguistic variation" (Oostdijk 1991:
39). To meet these challenges, then, some corpus linguists have
moved in the direction of designing corpora with variously
defined synchronic and diachronic parameters. I shall touch on
four such projects.
The Helsinki Corpus of Historical and Dialectal English
The diachronic part of this corpus contains more than a million
and a half words of British English from 850-1720, many of which
are in epistolary texts. Its compilers, Merja Kyto and Matti
Rissanen, have said that "the primary purpose of our diachronic
corpus is to serve as a database for the variational diachronic
study of English morphology, syntax and vocabulary", contending
that "change in language can best be approached and described
through synchronic variation" (Kyto & Rissanen 1988: 169).
Because some understanding of the structures of the spoken
language is essential to the study of language change, texts that
stand in different relations to spoken language must be selected.
This relation is "determined, among other things, by the
communication situation, genre and register, and the
socio-educational status of the persons involved in the
production of the text" (Kyto & Rissanen 1988: 169-70) -- all
extra-linguistic features. Each text in the corpus, therefore, is
described by a "fairly detailed set of parameter codings" (Kyto &
Rissanen 1988: 170) that contain this information. As the focus
of study in this corpus is at the word and sentence level, genre
questions have been restricted to labelling of text-types and
registers (the type of communicative event), and the style
parameter was eventually dropped.
A Representative Corpus of Historical English Registers
(ARCHER)
The compilers, Douglas Biber and Edward Finegan, described their
corpus in 1993 as "part of a project designed to investigate the
diachronic relations among oral and literate registers of English
between 1650 and the present" (Biber, Finegan & Atkinson 1993: 1).
Facilitating both diachronic and synchronic investigations, the
corpus also enables investigation of what they describe as
"microscopic issues, including a) individual 'style' [...] and b)
variation within works" (Biber, Finegan & Atkinson 1993: 2). In
an earlier paper, they had contended that "very few linguistic
studies to date [had] analyzed the diachronic evolution of
genres" and had described why: historical linguistics was focused
on phonological and syntactic levels; sociolinguistics and
discourse analysis gave little consideration to diachronic
issues; and stylistics, where "considerable attention [had been]
given to comparative analysis of different 'period' styles", was
undertaken from rhetorical or literary perspectives and did not
"reflect much linguistic sophistication"; therefore, they employ
"analytic techniques developed for sociolinguistic analyses of
register variation to study the historical development of genres
over the last three centuries" (all Biber & Finegan 1988a: 22).
Their method is described as a multidimensional/multifunctional
approach (Biber & Finegan 1988b). Five dimensions that represent
functional parameters of variation associated with differences in
communicative situations are identified and are defined by
particular sets of linguistic features. The first dimension,
"informational/involved" or, as it might be expressed, a greater
or lesser distance in the relation between writer and reader, is
shown in Figure 1. The database is made up of texts representing
eleven written and spoken registers (including epistolary texts)
divided into ten 50-year periods from 1650-1990. Although their
letter register is very small and contains only published letters
written by men, at least for the nineteenth century, their
findings reported so far have considerable interest for my own
work. Not unexpectedly, letters on the whole show a movement from
being less involved relationally with the recipient to being more
so over the three hundred year period (Figure 2); more
surprisingly, they show in the mid-nineteenth century a drop back
to relational levels of the seventeenth century before rising by
the end of the century to early eighteenth-century levels and
then continuing to rise in the twentieth century (Figure 3). Why?
Gender relations are also interesting: for the only time in the
three centuries, nineteenth-century letters show no difference in
the relations between women and men except when women write to
men. Men writing to men, women writing to women, and men writing
to women all show the same degree of relational involvement
(Figure 4). Would comparable analyses on a larger number of
nineteenth-century epistolary texts yield the same result? Would
more detailed linguistic analysis support those results and/or be
able to describe in what way the shift in the first dimension
took place?
The Cambridge-Leeds Corpus of Early Modern English
(1600-1800)
The inspiration for this corpus comes both from the Helsinki
Corpus and ARCHER. According to Susan Wright (Wright 1994), the
corpus is
designed to reflect and accommodate the fluidity of genre
distinctions as the period progresses by focussing on
authors rather than text-types as representative of the
state of the language. However, genre remains available as
an isolating feature of a group of texts without determining
the selection of texts. To ensure compatibility with other
historical corpora, [the] texts are given a generic
characterization [...] and functional or situational criteria,
common in sociolinguistic approaches. (Wright 1994: 26)
In her illustration of why "the distinction between literary and
non-literary texts becomes particularly problematic when applied
to [the] period" (Wright 1994: 26), Wright discusses the blurred
boundaries between private and public in the personal letter, a
distinction I also query in my own work on the form. She
questions the usual designation in corpus linguistics of the
letter as an "involved" or informal form because of its literary
status, saying "there is a significant situational difference
between letters written for publication and letters written
privately" (Wright 1994: 27). But my own work on the form has led
me to believe that "written for publication" is not a very useful
distinction. Apart from letters written expressly for publication,
such as letters to the newspaper, when are letters written with
anything other than an assumption of "privacy" -- even by "famous"
authors who might well be aware, especially in the nineteenth
century with its interest in biography, that such letters may
eventually be published? To be more precise, a letter is usually
written to a designated recipient in the full awareness that it
may become public, that is, read or heard by others, unless the
recipient destroys the letter after it has been read. All
letters, because they are written documents, depend on the
discretion of the recipient for their privacy. Continuing work on
situational/functional criteria and the correlating linguistic
features derived from studies of contemporaneous idiolects may
well change generic assumptions, particularly for the letter with
its ambivalent relations with written or oral, literary or
non-literary, forms of discourse.[3]
Electronic Language
The final studies I wish to mention very briefly are those taking
place on "Electronic Language", known as EL, used in e-mail
communications, a contemporary variant of the epistolary form.
One such study by Collot and Belmore used the Survey of English
Language Corpus and a privately-collected corpus of e-mail
communications to hypothesize that, as "EL has unique
situational features", it would embody "a distinctive set of
linguistic features as well" (Collot & Belmore 1993: 42). They
report, so far, clearly different uses of comparative adjectives.
They also refer to a subsequent study on EL by Herring that
"suggests a new textual dimension, a functional continuum ranging
from 'adversarial' to 'attenuated' in which gender and type of
discussion play a major role in the distribution of linguistic
features" (Collot & Belmore 1993: 53).
Thus as smaller, specialized corpora are developed to study,
under a general umbrella of language variation studies, specific
genres, authors, and situational contexts more closely, the
research findings will have considerable relevance to literary
scholars working on either individual authors or genre and
discourse studies. The challenge in designing an epistolary
corpus appealing to linguists will be to identify texts with
coded headings covering extra-linguistic features, to supply
supporting detailed biographical information and word counts, and
to facilitate the creation of subsets including a parsed one.
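The design requirements just listed can be made concrete with a small
sketch. The Python below is only an illustration under stated
assumptions, not the coding scheme of this project or of any existing
corpus: the field names (writer_sex, source, and so on), the subset
helper, and the three sample records with their placeholder texts are
all hypothetical. It shows how a coded heading, a word count derived
from the text, and the creation of subsets might fit together. A parsed
subset is not attempted here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LetterRecord:
    """One letter plus hypothetical header codes for extra-linguistic features."""
    letter_id: str
    writer: str
    recipient: str
    year: int
    writer_sex: str       # assumed coding: "F" or "M"
    recipient_sex: str    # assumed coding: "F" or "M"
    source: str           # assumed coding: "manuscript" or "published"
    text: str

    @property
    def word_count(self) -> int:
        # Crude count: whitespace-separated tokens in the letter text.
        return len(self.text.split())


def subset(letters, **criteria):
    """Return the letters whose header fields equal every given criterion."""
    return [
        rec for rec in letters
        if all(getattr(rec, name) == value for name, value in criteria.items())
    ]


# Invented placeholder records; none of this is data from the collection.
SAMPLE = [
    LetterRecord("L001", "Writer A", "Writer B", 1841, "F", "F",
                 "manuscript", "placeholder text of five words"),
    LetterRecord("L002", "Writer B", "Writer A", 1842, "F", "F",
                 "manuscript", "a shorter placeholder"),
    LetterRecord("L003", "Writer A", "Writer C", 1855, "F", "M",
                 "published", "another placeholder text here"),
]

if __name__ == "__main__":
    for rec in subset(SAMPLE, source="manuscript"):
        print(rec.letter_id, rec.writer, rec.year, rec.word_count)
```

Keeping the codes as plain header fields means that any combination of
them, for instance letters by women to men taken from manuscript, can
define a subset without changes to the stored texts.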
The Literary Perspective
The immediate benefit from an epistolary corpus of
nineteenth-century letters would be to make available an
electronic archive of literary women's letters which, in all
probability, will never be published in scholarly editions. Many
of the letters' authors are relatively obscure, of interest
perhaps only to feminist researchers. Yet literary scholars also
need to know much more about professional women writers in the
period. The rise of the Victorian "man of letters" has been
documented, but the equivalent study of the "woman of letters" --
the women who made a career of writing in forms other than the
novel -- has as yet received only sporadic or isolated attention.
With careful tagging for indexing purposes, such a corpus would
aid research of this kind. Epistolary texts, however, offer
greater challenges than their general inaccessibility due to the
obscurity of their writers' lives. There are simply too many
letters; they are too tangled; and they are too inscrutable.
How many is too many? Unless wholesale destruction has taken
place, a single Victorian letter-writer's output may over a
lifetime run to thousands of texts. For instance, in the British
Library alone, there are over 13,000 of Florence Nightingale's
letters. And letter texts are sprinkled over the world where they
often rest in anonymity in autograph albums, private collections,
and repositories large and small. Collecting even a single
writer's letters for publication is a daunting task, and no
edition can ever with much assurance be considered complete.
Neither volume nor binding is, however, a determinant of an
electronic edition. An epistolary corpus can be flexible, and is
not necessarily teleologically driven.
How can epistolary texts be tangled? Letters are tiny texts
complete in themselves which can be studied in isolation. Yet
individual letters are also a part of a larger whole: the
correspondence between the writer and the recipient.
Interpretation, therefore, depends on the availability of the
"other half". Letter texts must be studied in pairs, pairs that
overlap, as long as the writers remain separated, each letter
being a response to a former communication as well as the
initiator of the next communication. Conventional publication
cannot very easily accommodate this Janus-like feature of the
epistolary text. The ambitious edition of the Brownings' letters,
which has recognized this important characteristic by including
both sides of the correspondence, is slated to run to forty
volumes -- a more-than-lifetime project, as only about a dozen
volumes have appeared in as many years. And although it is
carefully indexed, the massive size of the edition will make
searching for information or even tracing a single correspondence
over a number of years very cumbersome. A corpus arranged by the
correspondences between interactants can, however, capture this
feature and through subsetting, allow the user to retrieve a
single correspondence, or profile a single author, or display a
network of correspondences within a particular time period. The
tangle becomes instead a web whose strands may be examined
synchronically or diachronically.
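The three kinds of retrieval described above can be sketched in a few
lines of Python. This is a minimal illustration under assumptions that
are not part of the project itself: the Letter type, the function names
correspondence, profile and network, and the sample letters (with
invented writers and years) are all hypothetical, and a real corpus
would carry far richer headers.

```python
from collections import defaultdict
from typing import NamedTuple


class Letter(NamedTuple):
    """Hypothetical minimal header: who wrote to whom, and when."""
    writer: str
    recipient: str
    year: int


def correspondence(letters, a, b):
    """Letters exchanged between a and b in either direction, ordered by year."""
    pair = {a, b}
    matching = [l for l in letters if {l.writer, l.recipient} == pair]
    return sorted(matching, key=lambda l: l.year)


def profile(letters, author):
    """Every letter written by or addressed to a single author."""
    return [l for l in letters if author in (l.writer, l.recipient)]


def network(letters, start, end):
    """Undirected links among correspondents for start <= year <= end."""
    links = defaultdict(set)
    for l in letters:
        if start <= l.year <= end:
            links[l.writer].add(l.recipient)
            links[l.recipient].add(l.writer)
    return dict(links)


# Invented placeholder letters; not data from the collection.
SAMPLE = [
    Letter("Writer B", "Writer A", 1841),
    Letter("Writer A", "Writer B", 1840),
    Letter("Writer A", "Writer C", 1845),
    Letter("Writer C", "Writer B", 1860),
]

if __name__ == "__main__":
    print(correspondence(SAMPLE, "Writer A", "Writer B"))
    print(profile(SAMPLE, "Writer C"))
    print(network(SAMPLE, 1840, 1850))
```

Sorting a correspondence by date gives the diachronic view of one
strand, while restricting the network to a span of years gives the
synchronic view of the whole web.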
How can a letter, a written document, be inscrutable?
Interpretation of epistolary texts, as I indicated above, depends
on entire correspondences: the writer's self is constructed and
revealed in the felt presence of the absent addressee. It depends
as well on entire texts. In many nineteenth-century editions,
letters are published in truncated or censored form, and even
twentieth-century editions often omit or re-locate spatial and
temporal identifiers and subscriptions, all formal constituents
necessary for interpretation. But above all, interpretation
depends on an understanding of the form itself, an understanding
not as yet based on taxonomic or typological criticism. Those few
literary scholars who study letters as texts tend to make as many
a priori judgments as do linguists. As John Burrows points
out, "[i]f the computer is to contribute as much as it might in
literary studies, the wider-ranging studies of the future will
need to include much more work than has yet been done on genre
differences and on historical change in the language of
literature" (Burrows 1992: 188). An electronic corpus of complete
epistolary texts based where possible on manuscript
transcriptions, with textual features tagged for comparative
purposes, will aid literary scholars in their study of the form.
Linguists and literary scholars, it would appear, have much to
offer each other in the study of a long-neglected genre. A corpus
of epistolary texts such as I envision, carefully designed to
accommodate both the requirements of researchers with different
aims and the unique characteristics of the text-type, will
provide an opportunity for unusual side-by-side, perhaps even
collaborative, approaches to the nineteenth-century letter.
Notes
[1] See Lancashire & Wooldridge 1994.
[2] See Johansson 1991 for a brief description of these and other
corpora.
[3] See Wright 1994 for a discussion of some problems arising from
established typological and functional criteria in corpus-based
linguistic analysis.