The second author wishes to thank Philippe Hamon, Bernard Quemada, and Réal Ouellet, his professors at the University of Rennes, the University of Paris III, and Laval University, who instilled in him the desire to study and teach the French language and literature. He dedicates this book to his parents and especially to his wife Hoa for her continued support and encouragement in his professional endeavors.
Abbreviations   Categories       Example
adj             adjective        1026 lourd adj
adv             adverb           1071 certainement adv
conj            conjunction      528 puisque conj
det             determiner       214 votre det
intj            interjection     889 euh intj
n               noun             802 absence nf
nadj            noun/adjective   4614 insensé nadj
prep            preposition      389 parmi prep
pro             pronoun          522 lui-même pro
v               verb             1014 confirmer v
Features on categories           Example
f      feminine                  1011 armée nf
i      invariable                1324 après-midi nmi
m      masculine                 707 signe nm
pl     plural                    3654 dépens nmpl
(f)    no distinct feminine      3770 apte adj(f) capable
(pl)   no distinct plural        3901 croix nf(pl)
Introduction
The value of a frequency dictionary for French
Today French is, after English, the second most widely taught and most widely used second language in the world.
Yet, surprisingly, there is no current corpus-based frequency dictionary of the French language. The present dictionary is meant to address this shortcoming, and is part of a series that includes other highly useful dictionaries for Spanish (Davies, 2006) and Portuguese (Davies & Preto-Bay, 2008).
As such it is similar in intent, approach, structure, and content to its predecessors. As noted below, some modifications have also been made to make it more usable for English speakers, who constitute the largest group of speakers on the planet.
The purpose of this book is to prepare students of French for the words that they are most likely to encounter in the “real world”. It is meant to help alleviate a problem encountered all too often in dictionaries and language primers, where word lists are introduced based on intuitive or unverifiable notions of which words might be most useful for students to acquire, and in which order. The dictionary is designed primarily as a reference work that can be used alongside standard classroom materials or for individual study. Ideas on how to carry out this integration are given in the earlier dictionaries cited above.
Contents of the dictionary
This is first and foremost a frequency dictionary. The principal information concerns the 5,000 most frequent words in French as determined in the process described below. This information is arranged in four different formats: (i) a main frequency listing, which begins with the most frequent word (with associated information) followed by the next most frequent word, and so forth; (ii) an alphabetical index of these words; (iii) a frequency listing of the words organized by part of speech; and (iv) thematic lists grouping some of the words into related semantic classes. Each of the entries in the main frequency listing contains the word itself, its part(s) of speech (e.g. noun, verb, adjective), a context reflecting its actual usage in French, an English translation of that context, and summary statistical information about the usage of that word. Some or all of this information is likely to be useful for language learners in different settings.
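The entry layout and the listings described above can be pictured with a small data-structure sketch. The `Entry` class and its field names are hypothetical and chosen only for illustration. The sample entries reuse rank, word, and part-of-speech triples from the abbreviation table. Contexts, translations, and statistics are left empty rather than invented, and the thematic lists are not modeled.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    rank: int              # position in the main frequency listing (1 = most frequent)
    word: str              # the headword
    pos: str               # part-of-speech code, e.g. "adj", "nf", "v"
    context: str = ""      # sample French context (left empty in this sketch)
    translation: str = ""  # English translation of the context (left empty)


# Illustrative entries only, taken from the abbreviation table.
ENTRIES = [
    Entry(1026, "lourd", "adj"),
    Entry(528, "puisque", "conj"),
    Entry(802, "absence", "nf"),
    Entry(1014, "confirmer", "v"),
    Entry(707, "signe", "nm"),
]


def frequency_listing(entries):
    """Main listing: most frequent word first."""
    return sorted(entries, key=lambda e: e.rank)


def alphabetical_index(entries):
    """Alphabetical index. Plain string order is used here; accented
    headwords would need locale-aware collation."""
    return sorted(entries, key=lambda e: e.word)


def by_part_of_speech(entries):
    """Frequency listing grouped by part-of-speech code."""
    groups = defaultdict(list)
    for entry in frequency_listing(entries):
        groups[entry.pos].append(entry)
    return dict(groups)
```

All three views are derived from the same list of entries, so a single source of data can feed each listing.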
The vocabulary itself was derived from a corpus, or body, of French texts. The corpus was assembled specifically for this work and totals some 23 million words, half of them transcriptions of spoken French and the other half written French texts. Since the dictionary focuses primarily on frequency and usage, the words are not accompanied by pronunciation guides, etymological history, or domain-specific usage information. The dictionary also focuses on single words, which are a crucial but not exclusive consideration in language learning; addressing fixed expressions such as collocations and idioms extensively would be beyond its scope.
The dictionary, then, is designed as an instrument for helping students acquire a core vocabulary of French words in various ways, including by their observed frequency in recent French usage. The versatility of its organization should allow it to be used in a wide range of language learning scenarios.
Previous frequency dictionaries for French
French dictionaries are plentiful and widely varied in content, so one might wonder whether another dictionary is necessary. A short survey of existing dictionaries should suffice to illustrate why this one was developed.
Two landmark frequency dictionaries have been produced for French. One (Henmon 1924) was based on 400,000 words of text, and the other (Juilland et al. 1970) derives from a study of 500,000 words.
Information on the words contained in those lists, though, was minimal, and the ability to handle more sizable corpora has since – of course – been vastly improved with computer technology.
Other word reference lists have been developed largely for scholarly purposes and hence are not very accessible to the average learner. Brunet (1981) focuses on the development of French vocabulary over time, based on the superb Trésor de la Langue Française (Imbs 1971-1994). Beauchemin et al. (1992) focus only on the French spoken in Quebec. All of these resources require some effort to use effectively.
Some lexical resources are at the disposal of French language learners through the Internet, such as the ARTFL FRANTEXT and TLFi resources. Their subscription costs and online access methods, however, sometimes make them less practical than having a reasonably sized dictionary like this one at one’s fingertips.
Finally, some helpful recent beginner dictionaries exist, though each has its own limitations. Recent ones by Oxford University Press (2006), Living Language (Lazare 1992), and Dover Publications (Buxbaum 2001) list from 1001 to 20,000 “most useful” words but give no rationale for how they were selected. Another venerable work by Gougenheim (1958) lists 3500 basic French words with related information including definitions, but these are entirely in French and hence challenging for the beginner.
Our dictionary seeks to combine the best from this tradition of French lexical research while at the same time avoiding these shortcomings. Its presentation design and the rationale and methodology for selecting the contents reflect what we believe to be the state of the art in corpus research, text processing, and lexicography.
The corpus and its annotation
Our dictionary is derived from a corpus of some 23,000,000 French words that have been assembled from a wide variety of sources. As mentioned above, half of this total reflects a collection of transcriptions from oral or spoken French, while the other half reflects French in its textual or written form. Reflecting a desire to make our dictionary a modern representation of the French language, we have included no materials that date before the year 1950.
We did not try to proportion our data based on geographical region or demographics, but we did try to achieve some balance across genres; however, this balance is not perfect. It is also important to note that some of our content from particular sources was exhaustive whereas in other cases it was selectively or randomly sampled; in other words, only parts of the material were used because there was too much content and hence the risk of skewing coverage of particular areas.
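One way to picture the selective sampling mentioned above is to cap how many documents any single source may contribute. The sketch below is an illustration only, not the procedure actually used for this corpus; the function name, the per-source cap, and the fixed seed are all assumptions.

```python
import random


def sample_sources(sources, cap, seed=0):
    """Keep every document from small sources (exhaustive use) and
    randomly sample large sources down to `cap` documents.

    `sources` maps a source name to a list of documents. The seed is
    fixed so that the same selection can be reproduced.
    """
    rng = random.Random(seed)
    selected = {}
    for name, docs in sources.items():
        if len(docs) <= cap:
            selected[name] = list(docs)            # small source: use all of it
        else:
            selected[name] = rng.sample(docs, cap)  # large source: sample without replacement
    return selected
```

Capping by document count is the simplest choice; capping by word count would track corpus size more closely, at the cost of a slightly longer function.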
The spoken portion of the corpus was made up of approximately 11.5 million words. These words were drawn from sources such as transcripts of governmental debates and hearings, telephone calls, and face-to-face dialogues. There were also transcripts of interviews with writers, entertainment figures, business leaders, athletes, academicians, and other media personnel. Finally, we made use of movie scripts/subtitles and theatrical plays.
The written text portion of the corpus was also made up of roughly 11.5 million words. This part of the corpus was assembled from newswire stories, daily and weekly newspapers, newsletters, bulletins, business correspondence, and technical manuals. Magazines such as popular science and other technical publications were used. We also targeted different genres of literature such as fiction/nonfiction essays, memoirs, novels and more.
Table 1 gives a more detailed listing of the composition of the corpus.
Corpus standardization and annotation
Collection of the corpus involved much work in what has been called corpus standardization or text preprocessing. Because the sources were so varied, the documents came in many different file types, character encodings, and formatting conventions, including EBCDIC, MacRoman, ISO, UTF-8, and HTML. In many cases unneeded material such as images, advertisements, or templatic information had to be stripped out, a process called document scrubbing.
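As a rough illustration of this kind of standardization, the sketch below decodes raw bytes from a named encoding into Unicode and, for HTML input, keeps only the text content. It is a minimal sketch under stated assumptions, not the pipeline used for this corpus. The `scrub` function and the `CODECS` table are hypothetical, and the choice of code page 037 for EBCDIC and ISO 8859-1 for "ISO" is a guess, since the text does not say which variants were involved.

```python
from html.parser import HTMLParser

# Assumed mapping from the encoding labels in the text to Python codec names.
CODECS = {
    "ebcdic": "cp037",      # one common EBCDIC code page; others exist
    "macroman": "mac_roman",
    "iso": "iso8859_1",     # assuming ISO 8859-1 (Latin-1)
    "utf-8": "utf_8",
}


class _TextOnly(HTMLParser):
    """Collect text content, skipping the bodies of <script> and <style>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def scrub(raw: bytes, encoding: str, is_html: bool = False) -> str:
    """Decode `raw` to Unicode, drop markup if HTML, and normalize whitespace."""
    text = raw.decode(CODECS[encoding.lower()])
    if is_html:
        parser = _TextOnly()
        parser.feed(text)
        parser.close()  # flush any buffered text
        text = "".join(parser.parts)
    return " ".join(text.split())
```

For example, `scrub("été".encode("mac_roman"), "MacRoman")` returns the same string as `scrub("été".encode("utf-8"), "UTF-8")`, which is the point of the step: after scrubbing, every document is represented the same way regardless of its source format.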
Each type of transcription or text document was then processed so that the paragraphs, sentences, words, and characters were identified and encoded in a standard way to enable further processing, a process called tokenization. The scrubbing and tokenization processes involve linguistic issues that had to be addressed, such as deciding on how to break up
Table 1 Composition of 23 million word French corpus
Spoken