Conference LLcD, Sorbonne University, 10th Sept. 2024, Paris

Keywords: corpus linguistics, phraseology, discourse markers, association measures, (non-)compositionality

Motivation

There is a rich literature on complex or multiword expressions (see Bhatia et al. 2023 for a recent overview) and on phraseology (see Piirainen et al. 2020, Pastor & Mitkov 2022, Mel’čuk 2023 for different perspectives). Work in this domain aims in particular to identify stable combinations and to evaluate their degree of semantic rigidity. Similar problems arise for co-occurring elements whose main function is discursive, in the sense that they contribute to discourse organization or signal the speaker’s presence in the utterance, for instance ah+bon, non+mais+alors, donc+du coup, ah+ben+tu parles, mais+enfin (Waltereit 2007, Dostie 2013, Crible 2018, Crible & Degand 2019, Cuenca & Crible 2019, Haselow 2019, Dargnat 2022). Studying these co-occurrences raises a number of questions which, despite the different perspectives they require, all concern the status of complex discourse expressions. The present workshop will focus on the following topics.

1. Annotating the elements of co-occurrences

Some elements belong to several categories. For instance, bon can be an adjective (including in idioms like à bon escient, à bon droit), a noun, an adverb or a discourse marker. For lexically ambiguous discourse markers, probabilistic or LLM-based¹ taggers give mediocre to poor, and unstable, results on POS-recognition tasks. One can use finite-state automata to detect the category, but problems remain, most notably with elements whose discourse role is detected via unbounded dependencies. For example, après is a preposition in (a) and a concessive adverb in (b), although the first eight words of the two sentences are the same.

(a) [[après]PREP [le train qui est arrivé en retard]NP]PP [il y en avait un autre]S
    [[after]PREP [the train which was late]NP]PP [there was another one]S
    ‘After the train which was late there was another one’

(b) [après]ADV [le train qui est arrivé en retard]NP [c’est pas la faute du conducteur]S
    [after]ADV [the train which was late]NP [it’s not the driver’s fault]S
    ‘This said, for the train which was late, the driver is not responsible’

Moreover, in the absence of a preliminary clause segmentation, and given the poor performance of sentencizers, the sentence-initial or clause-initial position cannot be reliably identified in spoken corpora. It is therefore necessary to combine computer-assisted extraction methods, manual annotation and, whenever relevant, phonetic/prosodic information (shortening, pauses, contours, duration, etc.). Lexical disambiguation is crucial for the next stage.

2. Evaluating mutual attraction between elements

Association measures (Desagulier 2017, Brezina 2018) are often considered the technique of choice for evaluating the tendency of elements to cluster. These measures are mainly sensitive to two dimensions: exclusivity and frequency, that is, the tendency to occur together rather than separately and the difference between the expected and observed frequencies of the combinations. Directionality must also be taken into account. It corresponds to the possibility for an element to predict the occurrence of another element on its right or left. For instance, is ah a better predictor of a rightward bon than of another rightward marker (and conversely)? It is necessary to compare the results of various measures and to test their efficiency and stability on different types of corpora. It is also useful to compare these results to those of LLMs on fill-mask tasks, where the goal is to propose a candidate to fill a blank (the mask) inside a given sequence of words. For example, is a model able to propose bon as a filler for an incomplete sentence like ah <mask> je ne savais pas and for other similar patterns?
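To make the notions of exclusivity and directionality concrete, here is a minimal sketch in Python. It computes pointwise mutual information (PMI) and the two directional Delta P scores for an adjacent pair. These are only two of the many available measures and are not necessarily the ones the workshop has in mind. The function names and the toy token sequence are invented for illustration and do not come from any corpus.

```python
import math
from collections import Counter


def safe_ratio(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def bigram_association(tokens, left, right):
    """Score the adjacent pair (left, right) in a flat list of tokens.

    Hypothetical helper: it ignores turn and clause boundaries, which a
    real study of spoken corpora would have to take into account.
    """
    bigrams = list(zip(tokens, tokens[1:]))
    n = len(bigrams)
    pair_counts = Counter(bigrams)
    first_counts = Counter(first for first, _ in bigrams)
    second_counts = Counter(second for _, second in bigrams)

    # 2x2 contingency table over bigram slots.
    a = pair_counts[(left, right)]   # left immediately followed by right
    b = first_counts[left] - a       # left followed by something else
    c = second_counts[right] - a     # right preceded by something else
    d = n - a - b - c                # all remaining bigrams

    # Frequency expected if the two slots were independent.
    expected = safe_ratio((a + b) * (a + c), n)
    pmi = math.log2(a / expected) if a > 0 else float("-inf")

    # Delta P(right | left): does `left` predict `right` on its right?
    delta_p_forward = safe_ratio(a, a + b) - safe_ratio(c, c + d)
    # Delta P(left | right): does `right` predict `left` on its left?
    delta_p_backward = safe_ratio(a, a + c) - safe_ratio(b, b + d)

    return {
        "observed": a,
        "expected": expected,
        "pmi": pmi,
        "delta_p_forward": delta_p_forward,
        "delta_p_backward": delta_p_backward,
    }


# Invented toy sequence, for illustration only (not corpus data).
tokens = "ah bon je ne savais pas bon ben ah bon oui bon d'accord ah bon".split()
scores = bigram_association(tokens, "ah", "bon")
for name, value in scores.items():
    print(f"{name}: {value}")
```

In this toy sequence every ah is followed by bon, whereas bon also occurs after other words, so the forward Delta P is higher than the backward one. This is the kind of asymmetry that a symmetric measure such as PMI cannot show.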

3. Semantic contribution of elements in a combination

There are a priori two main questions.

Firstly, should we assume that the contribution of an element to the meaning of the combination in which it occurs is “additive”? In that case the different elements of the combination contribute separately to its meaning. This seems to be true for mais enfin, for instance. Conversely, must we, at least in some cases, consider that the combination has a specific meaning, either because it inherits only some of the features of its components or because it has a global, non-decomposable meaning, which seems to apply to pairs like ah bon?

Secondly, for markers in isolation, like ah, tu sais, du coup, tu plaisantes, etc., prosody often allows one to identify values such as surprise, irony, dissatisfaction, etc. With co-occurrences such as allez + bon, non + mais + oh, tiens + donc, are the observed prosodic contours the result of juxtaposing the contours of each constituent, or is the contour of one constituent dominant and extended to the whole co-occurrence? In the latter case, how could we describe the interaction, if any, between prosody and semantics? Does one of these two dimensions drive the combination?

4. Taking variation into account

Several variables can influence the production of discourse marker co-occurrences: the discourse genre (e.g. natural conversation, topic-controlled exchange, conference, debate, school presentation by pupils, fiction texts, etc.), the utterance situation (in particular the hierarchical relations between speakers), individual parameters (age, social status, sex, academic and professional profile, etc.), the period in which the corpus was compiled, etc. It is also important to study the short-term and long-term evolution of the co-occurrences in order to pinpoint their structure and possible idiomatization. This is relevant, for instance, when studying the emergence of du coup as a consequence marker and its combination with donc and alors, or the evolution of the verb + donc marker series (tiens donc, dis donc, va donc, coudonc in Quebec French). Such phenomena are discussed under general cover terms like grammaticalization, pragmaticalization and lexicalization (see Dostie 2004, Waltereit 2007, Heine et al. 2021).


¹ LLM = Large Language Model.

Submissions are closed
Conference Languages: English and French

Workshop organising committee: Mathilde Dargnat (ATILF, Université de Lorraine), Agnès Tutin (LIDILEM, Université Grenoble-Alpes)

Some references:
Bhatia A., Evang K., Garcia M., Giouli V., Han L., Taslimipoor S. 2023. Proceedings of the 19th Workshop on Multiword Expressions (MWE 2023). Association for Computational Linguistics, Dubrovnik, Croatia.
Brezina V. 2018. Statistics in Corpus Linguistics: A Practical Guide. Cambridge: Cambridge UP.
Crible L. 2018. Discourse Markers and (Dis)fluency: Forms and functions across languages and registers. Amsterdam: John Benjamins.
Crible L. and Degand L. 2019. « Domains and Functions: A two-dimensional account of discourse markers ». Discours 24, 35 p. (online)
Cuenca M.-J. and Crible L. 2019. « Co-occurrence of discourse markers in English: From juxtaposition to composition ». Journal of Pragmatics 140, 171-184.
Dargnat M. 2022. « Mais enfin: construction et association ». Langages 225, 49-63.
Desagulier G. 2017. Corpus Linguistics and Statistics with R. New York: Springer.
Dostie G. 2004. Pragmaticalisation et marqueurs discursifs. Analyse sémantique et traitement lexicographique. Liège: De Boeck/Duculot.
Dostie G. 2013. « Les associations de marqueurs discursifs. De la cooccurrence libre à la collocation ». Linguistik 62(5), 15-45. (online)
Haselow A. 2019. « Discourse Marker Sequences: Insights into the Serial Order of Communicative Tasks in Real-Time Turn Production ». Journal of Pragmatics 146, 1-18.
Heine B., Kaltenböck G., Kuteva T. & Long H. 2021. The Rise of Discourse Markers. Oxford: Oxford UP.
Mel’čuk I. 2023. General Phraseology, Theory and Practice. Linguisticae Investigationes Supplementa 36, Amsterdam: John Benjamins.
Pastor G. C. and Mitkov R. (eds). 2022. Proceedings of the 4th International Conference on Computation and Corpus-Based Phraseology. Cham: Springer.
Piirainen E., Filatkina N., Stumpf S. and Pfeiffer C. (eds). 2020. Formulaic Language and New Data. Theoretical and Methodological Implications. Berlin: de Gruyter.
Waltereit R. 2007. « À propos de la genèse diachronique des combinaisons de marqueurs. L’exemple de bon ben et enfin bref ». Langue française 154, 94-109.
