Title: An R Implementation for the Family of Collostructional Methods
Description: Functions and example data for collostructional or collocational analyses.
Authors: Susanne Flach
Maintainer: Susanne Flach <[email protected]>
License: GPL-2
Version: 0.2.0
Built: 2024-11-01 11:15:49 UTC
Source: https://github.com/skeptikantin/collostructions
Data set of the begin/start-to-VERB pattern in the British National Corpus (BNC), with the frequencies of the verbs in the open slot. Aggregate of beginToV and startToV, to illustrate the easiest use of collex.dist() with aggregated frequency lists, e.g., imported from outside R.
data("beginStart")
data("beginStart")
A data frame with 2,163 observations on the following 3 variables.
WORD
a factor with 2,163 verb types occurring in begin to V and/or start to V.
beginToV
numeric vector, frequencies of verbs with begin to V.
startToV
numeric vector, frequencies of verbs with start to V.
## Not run:
## Distinctive Collexeme Analysis
# load data
data("beginStart")
# perform Distinctive Collexeme Analysis (with defaults)
# see ?collex.dist() for more use cases:
x <- collex.dist(beginStart)
## End(Not run)
Data set of the begin-to-VERB construction in the British National Corpus (BNC), with the frequencies of the verbs in the open slot (CQP query: [hw="begin" & class="VERB"] [hw="to"] [pos="V.I"]).
data("beginToV")
data("beginToV")
A data frame with 1,957 observations on the following 2 variables.
WORD
A factor with levels for the verb types in the open slot of begin-to-VERB.
CXN.FREQ
A numeric vector with the construction frequencies of the V2 types.
## Not run:
data(beginToV)   # load
str(beginToV)    # inspect structure of object
head(beginToV)   # view head of object

## Calculate Simple Collexeme Analysis
# load required data
data(beginToV)   # load frequency list for construction
data(BNCverbL)   # load frequency list for verb frequencies (lemmas)
# join frequency files to create input for collex()
beginToV.in <- join.freqs(beginToV, BNCverbL, all = FALSE) # only types in cxn
beginToV.in <- join.freqs(beginToV, BNCverbL)              # all types, even if not in cxn
# calculate
beginToV.out <- collex(beginToV.in, sum(BNCverbL$CORP.FREQ))       # using logL
beginToV.out <- collex(beginToV.in, sum(BNCverbL$CORP.FREQ), "mi") # mi
# inspect result
head(beginToV.out, 20)  # view first 20 lines of calculation
tail(beginToV.out, 20)  # view last 20 lines of calculation

## Calculate Distinctive Collexeme Analysis
# load data
data(beginToV)
data(startToV)
# merge frequency lists
# the first argument to join.freqs() will be the 'default' by which output is
# sorted and Z.DIR is calculated
beginStart.in <- join.freqs(beginToV, startToV) # merge both data frames
# calculate
beginStart.out <- collex.dist(beginStart.in)
# inspect result
head(beginStart.out, 20)
## End(Not run)
A data frame with case-insensitive BNC verb lemma frequencies.
data("BNCverbL")
data("BNCverbL")
A data frame with 35,939 observations on the following 2 variables.
WORD
a factor with levels for all verb lemmas
CORP.FREQ
a numeric vector with verb lemma frequencies
Lemmas beginning with problematic characters have been removed (almost exclusively tokenization problems, e.g., single quotes, slashes, backslashes, hashes, asterisks or square brackets). If you use this data to join frequency lists, make sure your own frequency file is equally clean to avoid problems in collex().
BNCxml version (CQP query: [class="VERB"])
## Not run:
data(BNCverbL)
str(BNCverbL)
head(BNCverbL)
## End(Not run)
A dataset of the into-causative (e.g., they forced us into thinking...) from the BNC for the illustration of functions. Contains one observation/token per line.
data("causInto")
data("causInto")
A data frame with 1,426 observations on the following 3 variables.
VOICE
a factor with annotation for voice (levels active, passive, and reflexive)
V1
a factor with levels for the matrix verb, lemmatised (e.g., they forced us into thinking...)
V2
a factor with levels for the content verb, lemmatised (e.g., they forced us into thinking...)
Flach, Susanne. 2018. “What’s that passive doing in my argument structure construction?” A note on constructional interaction, verb classes and related issues. Talk at the Workshop on constructions, verb classes, and frame semantics, IDS Mannheim.
## Not run:
## E.g., in Co-Varying Collexeme Analysis:
# load data
data(causInto)
# inspect, contains more variables than needed for collex.covar()
head(causInto)
# subset
into.cca <- subset(causInto, select = c(V1, V2))
# perform CCA
into.cca <- collex.covar(into.cca)
## End(Not run)
A toy sample data set of the pronominal make-causative (this makes me feel X) from the British National Corpus (BNC) to illustrate multiple distinctive co-varying collexeme analysis (see collex.covar.mult()).
data("causMake")
data("causMake")
A data frame with 5,000 observations on the following 3 variables.
MAKE
a factor with 4 levels of different inflectional forms of make
OBJ
a factor with 7 pronoun levels (i.e., her, him, it, me, them, us, you)
V2
a factor with 633 levels, i.e., the verbs in the complement.
## Not run:
## Multiple Distinctive Collexeme Analysis
# load data
data("causMake")
# perform Multiple Distinctive Co-Varying Collexeme Analysis (with defaults)
# see ?collex.covar.mult for more use cases:
x <- collex.covar.mult(causMake)
## End(Not run)
Data based on Johannsen & Flach (2015) with frequencies of all verbs occurring in the present progressive in the Corpus of Late Modern English Texts (CLMET-3.1). Contains only occurrences with unambiguous assignment of quarter century.
data("CLMETprog.qc")
data("CLMETprog.qc")
A data frame with 5,291 observations on the following 3 variables.
WORD
a factor with levels for all lemmas occurring as types in present progressive
QUARTCENT
a factor with levels 1700-1724, 1725-1749, ..., 1900-1924
CXN.FREQ
a numeric vector with the corpus frequencies
Query (CQP version): [pos="VB[PZ]" & lemma="be"] [class="ADV|PRON"]* [pos="V.G"]
Corpus of Late Modern English Texts, CLMET-3.1 (De Smet, Flach, Tyrkkö & Diller 2015), CQP version.
De Smet, Hendrik, Susanne Flach, Jukka Tyrkkö & Hans-Jürgen Diller. 2015. The Corpus of Late Modern English (CLMET), version 3.1: Improved tokenization and linguistic annotation. KU Leuven, FU Berlin, U Tampere, RU Bochum.
Johannsen, Berit & Susanne Flach. 2015. Systematicity beyond obligatoriness in the history of the English progressive. Paper presented at ICAME 36, 27–31 May 2015, Universität Trier.
## Not run:
data(CLMETprog.qc)
## End(Not run)
Data based on Johannsen & Flach (2015) with frequencies of all verbs occurring in the simple present in the Corpus of Late Modern English Texts (CLMET-3.1). Contains only occurrences with unambiguous assignment of quarter century.
data("CLMETsimple.qc")
data("CLMETsimple.qc")
A data frame with 24,693 observations on the following 3 variables.
WORD
a factor with levels for all lemmas occurring as types in present simple
QUARTCENT
a factor with levels 1700-1724, 1725-1749, ..., 1900-1924
CORP.FREQ
a numeric vector with the corpus frequencies
Query (CQP version): [pos="VB[PZ]"] – frequencies for be were reduced by their frequency in CLMETprog.qc to avoid them being present in both data sets.
Corpus of Late Modern English Texts, CLMET-3.1 (De Smet, Flach, Tyrkkö & Diller 2015), CQP version.
De Smet, Hendrik, Susanne Flach, Jukka Tyrkkö & Hans-Jürgen Diller. 2015. The Corpus of Late Modern English (CLMET), version 3.1: Improved tokenization and linguistic annotation. KU Leuven, FU Berlin, U Tampere, RU Bochum.
Johannsen, Berit & Susanne Flach. 2015. Systematicity beyond obligatoriness in the history of the English progressive. Paper presented at ICAME 36, 27–31 May 2015, Universität Trier.
## Not run:
data(CLMETsimple.qc)
str(CLMETsimple.qc)
head(CLMETsimple.qc)
## End(Not run)
Implementation of Simple Collexeme Analysis (Stefanowitsch & Gries 2003) over a data frame with frequencies of verbs in a construction and their total frequencies in a corpus.
collex(x, corpsize = 1e+08L, am = "logl", reverse = FALSE, decimals = 5,
       threshold = 1, cxn.freq = NULL, str.dir = FALSE, p.fye = FALSE,
       delta.p = FALSE, p.adj = "none")
x
A data frame with the types in a construction in column 1 (WORD), their construction frequencies in column 2 (CXN.FREQ), and their total corpus frequencies in column 3 (CORP.FREQ).
corpsize
The size of the corpus in number of tokens (e.g., verb frequencies for all verb constructions, or total corpus size, etc.). If not given, the default is 100 million words, roughly the size of the BNC, but you should always provide the appropriate number.
am
Association measure to be calculated. Currently available, tested, and conventional in the collostruction literature: "logl" (the default), "fye", "fye.ln", "mi", "chisq", and others; see Details.
reverse
If TRUE, output is sorted in reverse order, i.e., from most strongly repelled to most strongly attracted. Default is FALSE.
decimals
Number of decimals in the output. Default is 5 decimal places.
threshold
Frequency threshold of items for which collostruction strength is to be calculated, i.e., if you want to exclude hapaxes from the output (they are not excluded from the calculation).
cxn.freq
Frequency of the construction. Use only if x does not contain all types of the construction, i.e., if the construction frequency cannot be derived from the input (see ditrdat_pub for the analogous case in collex.dist()).
str.dir
Do you want a "directed" association measure in the output? For measures that are positive for both attracted and repelled items, str.dir = TRUE adds a column in which the values of repelled items carry a negative sign.
p.fye
Should the traditional Fisher-Yates p-value be calculated? This will not have any effect unless am = "fye".
delta.p
Should deltaP be calculated? If yes, both types of deltaP will be calculated (see below).
p.adj
If an association measure is chosen that provides significance levels, the method for correcting p-values for multiple testing; takes the methods of p.adjust(). Default is "none".
Corpus size: It is highly recommended to set corpsize to a conceptually sensible value. If you do not, the default may produce anything from an error (for data from corpora much larger than the BNC) to highly inaccurate results (for small corpora or infrequent phenomena). For phenomena from the BNC, the default will probably not distort results too much (cf. Gries in multiple discussions of the "Fourth Cell(TM)").
FYE: Note to users of am = "fye": package versions up to 0.0.10 used the negative natural logarithm for the p-value transformation. Versions >= 0.1.0 use the negative decadic logarithm. If you want to continue using the natural logarithm transformation, use am = "fye.ln" as an association measure and repeat the procedure. (If you see this message, you are using a version >= 0.1.0. It will disappear in versions >= 1.0 once on CRAN.)
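A minimal sketch of the two transformations side by side, reusing the input frame from the sketch above (the choice affects the scale of the transformed p-values, not the ranking):

out.fye    <- collex(beginToV.in, sum(BNCverbL$CORP.FREQ), am = "fye")    # -log10(p), current
out.fye.ln <- collex(beginToV.in, sum(BNCverbL$CORP.FREQ), am = "fye.ln") # -ln(p), pre-0.1.0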
Association measures: "logl" is the default because, for larger datasets from larger corpora, the original "fye" easily returns Inf for the most strongly associated and/or dissociated items, which are then non-rankable (if this happens, they are ranked by frequency of occurrence).
Thresholds: The threshold argument default of 1 does not remove non-occurring items (although the logic of this argument implies as much). This is a "bug" that I decided to keep for historical reasons. If you do not want to calculate the repulsion for non-occurring items, you need to enter a frequency list that contains only the occurring items.
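A sketch of that workaround: restrict the input to attested items before calling collex(), either via join.freqs(..., all = FALSE) or by filtering (the column name beginToV below follows the name of the object passed to join.freqs()):

# option 1: join only the types attested in the construction
beginToV.in <- join.freqs(beginToV, BNCverbL, all = FALSE)
# option 2: filter a full input frame (all = TRUE) down to attested types
beginToV.all <- join.freqs(beginToV, BNCverbL)
beginToV.occ <- subset(beginToV.all, beginToV > 0)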
The output of collex() is sorted by collostructional strength, from most attracted to least attracted; for ties, ordering is by frequency (further columns depend on argument settings):
COLLEX
The collexemes.
CORP.FREQ
Frequency of collexeme in corpus.
OBS
Observed frequency of collexeme in construction.
EXP
Expected frequency of collexeme in construction.
ASSOC
Association of collexeme (attracted or repelled).
COLL.STR.AM
Value of the association measure used (AM stands for the chosen measure).
STR.DIR
Same as collostruction strength, but with a negative sign for repelled items (if str.dir = TRUE).
DP1
Association (word-to-cxn), i.e., deltaP(w|cxn), see Gries & Ellis (2015: 240).
DP2
Association (cxn-to-word), i.e., deltaP(cxn|w), see Gries & Ellis (2015: 240).
SIGNIF
Significance level.
The function will abort if your input data frame contains items with 'non-sensical' data, that is, if a collexeme has a higher frequency in the construction than in the corpus (which is logically impossible, of course). This is a surprisingly common problem with untidy corpus frequency lists derived from messy annotation, especially when the collexemes have been manually cleaned from a rather inclusive query, but the corpus frequencies rest on different/erroneous part-of-speech tagging. In Flach (2015), for example, the syntactically quirky construction in Let's go party was 'hand-cleaned', but party did not have any frequency as a verb, because it was always tagged as a noun. As of package version 0.2.0, the function aborts with a descriptive error message and prints a list of the items with non-sensical frequencies. For further input checks, see input.check().
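A constructed toy example (hypothetical frequencies) of the kind of row that triggers the abort:

# 'party' hand-coded as a verb in the construction, but never tagged
# as a verb in the corpus frequency list: CXN.FREQ > CORP.FREQ
bad <- data.frame(WORD      = c("party", "dance"),
                  CXN.FREQ  = c(12, 5),
                  CORP.FREQ = c(0, 873))
# collex(bad, 1e6)  # aborts and lists 'party' as the offending item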
Susanne Flach, [email protected]
Thanks to Anatol Stefanowitsch, Berit Johannsen, Kirsten Middeke, Volodymyr Dekalo and Robert Daugs for suggestions, debugging, and constructive complaining, and to Stefan Hartmann, who doesn't know what a complaint is, but who provided invaluable feedback when asked how the package could be improved.
Evert, Stefan. 2004. The statistics of word cooccurrences: Word pairs and collocations. U Stuttgart Dissertation. http://www.collocations.de/AM/
Flach, Susanne. 2015. Let's go look at usage: A constructional approach to formal constraints on go-VERB. In Thomas Herbst & Peter Uhrig (eds.), Yearbook of the German Cognitive Linguistics Association (Volume 3), 231-252. Berlin: De Gruyter Mouton. doi:10.1515/gcla-2015-0013.
Gries, Stefan Th. & Nick C. Ellis. 2015. Statistical measures for usage-based linguistics. Language Learning 65(S1). 228–255. doi:10.1111/lang.12119.
Stefanowitsch, Anatol & Stefan Th. Gries. 2003. Collostructions: Investigating the interaction of words and constructions. International Journal of Corpus Linguistics 8(2). 209-243.
## Not run:
#### Calculate Simple Collexeme Analysis
## Example 1: goVerb (cf. Flach 2015)
# load data
data(goVerb)
# inspect data (optional)
head(goVerb, 15)  # displays first 15 lines of data frame
# perform collex
goV.out <- collex(goVerb, 616336708)  # total words in corpus (excl. punct)
goV.out <- collex(goVerb, 93993713)   # total verbs in corpus
# inspect output
head(goV.out, 15)  # first 15 items (strongly attracted)
tail(goV.out, 15)  # last 15 items (strongly repelled)
# clear workspace (remove objects)
rm(goVerb, goV.out)

## Example 2: beginToV (also see help file for ?beginToV)
data(beginToV)   # load data for begin-to-V
data(BNCverbL)   # load a frequency list for verb string frequencies
# merge frequency lists (see ?join.freqs):
beginToV.in <- join.freqs(beginToV, BNCverbL, all = FALSE)
# perform collex
beginToV.out <- collex(beginToV.in, sum(BNCverbL$CORP.FREQ))  # using logL
# inspect output
head(beginToV.out, 30)  # first 30 most strongly associated types
tail(beginToV.out, 20)  # last 20, least strongly associated types
# clear workspace (remove objects)
rm(beginToV, BNCverbL, beginToV.in, beginToV.out)

##### SPECIAL: IN USE OVER LISTS
## collex() can be used to perform several Simple Collexeme Analyses
## in one function call. See ?collex.dist for an example of multiple
## analyses across time periods (simple vs. progressive). The procedure with
## collex() is almost identical, except that you should use Map(...),
## because you have to provide corpus frequencies for each period
## (i.e., for each iteration of collex()):
# 1. Create a numeric vector of corpus frequencies:
corpfreqs <- c(corpFreqPeriod1, corpFreqPeriod2, ...)
# 2. Pass 'corpfreqs' vector as an argument to Map() like so:
myList.out <- Map(collex, myList.in, corpsize = corpfreqs, ...)
## End(Not run)
Implementation of Covarying Collexeme Analysis (Gries & Stefanowitsch 2004, Stefanowitsch & Gries 2005) to investigate the collostructional interaction between two slots of a construction.
collex.covar(x, am = "logl", raw = TRUE, all = FALSE, reverse = FALSE,
             decimals = 5, str.dir = FALSE, p.fye = FALSE, delta.p = FALSE)
x
Input, a data frame. Two options: EITHER raw, with one observation per line and with collexeme 1 in column 1 and collexeme 2 in column 2 (in which case raw = TRUE, the default), OR an aggregated list with the combination frequencies in the last column (in which case set raw = FALSE).
am
Association measure to be calculated. Currently available, tested, and conventional in the collostruction literature: "logl" (the default), "fye", "fye.ln", "mi", "chisq", and others; see Details.
raw
Does the data frame contain raw observations, i.e., one observation per line? Default is TRUE; set to FALSE for an aggregated list.
all
Should association be calculated for all possible combinations, including unattested ones? Default is FALSE, i.e., only attested combinations are tested (see Note).
reverse
If TRUE, output is sorted in reverse order, i.e., from most strongly repelled to most strongly attracted. Default is FALSE.
decimals
Number of decimals in the output. Default is 5 decimal places.
str.dir
Do you want a "directed" collostruction strength in the output? For measures that are positive for both attracted and repelled items, str.dir = TRUE adds a column in which the values of repelled items carry a negative sign.
p.fye
If TRUE and am = "fye", the traditional Fisher-Yates p-value is also given.
delta.p
Should deltaP be calculated? If yes, both types of deltaP will be calculated (see below).
Output is ordered in descending order of association (unless reverse = TRUE):
SLOT1
type/item in slot 1 (e.g., string or lemma)
SLOT2
type/item in slot 2 (e.g., string or lemma)
fS1
Total frequency of item 1 in slot 1 of cxn
fS2
Total frequency of item 2 in slot 2 of cxn
OBS
Observed frequency of combination
EXP
Expected frequency of combination
ASSOC
Association of combination (attracted or repelled)
COLL.STR.AM
Value of the association measure used (AM stands for the chosen measure).
STR.DIR
"Directed" collostruction strength.
DP1
Association (slot1-to-slot2), i.e., deltaP(s2|s1): how predictive is slot 1 of slot 2 in the constructional context? See Gries & Ellis (2015: 240).
DP2
Association (slot2-to-slot1), i.e., deltaP(s1|s2): how predictive is slot 2 of slot 1 in the constructional context? See Gries & Ellis (2015: 240).
SIGNIF
Significance level.
If you use the function on constructions with a high type frequency, be patient when setting all = TRUE: the function needs to perform Types.In.A x Types.In.B tests. Even on a fairly powerful computer it can take about half an hour to perform ~500,000+ tests.
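A quick sketch for estimating the number of tests before committing to all = TRUE (assuming a two-column raw input frame x):

# number of tests all = TRUE will run: slot 1 types x slot 2 types
n.tests <- length(unique(x[, 1])) * length(unique(x[, 2]))
n.tests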
For Multiple Distinctive Collexeme Analysis (MDCA), where you have more than two conditions/constructions, you can use collex.covar(). The association scores of CCA (collex.covar()) and approximation-based MDCA correlate highly (e.g., for the modadv data: Pearson r = .9987841; Spearman's rho = .9999993), suggesting collex.covar() is a workable alternative to approximation.
Note to users of am = "fye": package versions up to 0.0.10 used the negative natural logarithm for the p-value transformation. Versions >= 0.1.0 use the negative decadic logarithm. If you want to continue using the natural logarithm transformation, use am = "fye.ln" as an association measure and repeat the procedure. (If you see this message, you are using a version >= 0.1.0. It will disappear in versions >= 1.0 once on CRAN.)
Susanne Flach, [email protected]
Thanks to Anatol Stefanowitsch, Berit Johannsen, Kirsten Middeke and Volodymyr Dekalo for suggestions, debugging, and constructive complaining.
Evert, Stefan. 2004. The statistics of word cooccurrences. Word pairs and collocations. Stuttgart: Universität Stuttgart Doctoral Dissertation. http://www.stefan-evert.de/PUB/Evert2004phd.pdf.
Gries, Stefan Th. & Nick C. Ellis. 2015. Statistical measures for usage-based linguistics. Language Learning 65(S1). 228–255. doi:10.1111/lang.12119.
Gries, Stefan Th. & Anatol Stefanowitsch. 2004. Covarying collexemes in the into-causative. In Michel Archard & Suzanne Kemmer (eds.), Language, culture, and mind, 225–236. Stanford, CA: CSLI.
Stefanowitsch, Anatol & Stefan Th. Gries. 2005. Covarying collexemes. Corpus Linguistics and Linguistic Theory 1(1). 1–43. doi:10.1515/cllt.2005.1.1.1.
Use reshape.cca() to transform the output of collex.covar() from the 'long' format to a 'wide' format (i.e., to cross-tabulate association scores).
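A minimal sketch, assuming the default arguments of reshape.cca() suffice for a collex.covar() output (see ?reshape.cca for its actual arguments):

into.cca  <- collex.covar(causInto[, c("V1", "V2")])
into.wide <- reshape.cca(into.cca)  # cross-tabulated association scores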
## Not run:
### Example 1: Attested combinations (only)
data(causInto)
# inspect
head(causInto)
# subset, because causInto contains more variables than needed
into.vrbs <- causInto[, c(2,3)]   # CCA between V1 and V2
into.voice <- causInto[, c(1,2)]  # CCA between VOICE and V1
# perform Co-Varying Collexeme Analysis
into.vrbs.cca <- collex.covar(into.vrbs)
into.voice.cca <- collex.covar(into.voice)
# clear workspace (remove objects)
rm(into.voice, into.vrbs.cca, into.voice.cca)

### Example 2: If you want to test all possible combinations
# Depending on your machine, this may take a while, because it needs
# to perform 199*426 = 84,774 tests for the 199 slot 1 types and the
# 426 slot 2 types (rather than the 1,076 tests for attested combinations).
into.vrbs.cca.all <- collex.covar(into.vrbs, all = TRUE)

### Example 3: An aggregated list
# set raw = FALSE, otherwise the function will abort
# (output wouldn't make any sense):
data(modadv)
head(modadv, 12)
modadv.cca <- collex.covar(modadv, raw = FALSE)
## End(Not run)
Implementation of Multiple/Distinctive Covarying Collexeme Analysis (Stefanowitsch & Flach 2020) to investigate the collostructional association between three or more slots/conditions of a construction.
collex.covar.mult(x, am = "t", raw = TRUE, all = FALSE, reverse = FALSE,
                  threshold = 1, decimals = 5)
x
A data frame with each (categorical) condition in one column. In principle, these conditions can be anything: open slots in a construction (verbs, nouns, prepositions, etc.), constructions (e.g., ditransitive/prep-dative, negation [y/n]), annotated variables, genre, time periods, etc. The function assumes a raw list by default, i.e., one observation per line. If you have an aggregated list, the last column must contain the frequencies (in which case set raw = FALSE).
am
Association measure to be calculated. Currently available (though very experimental, see below): "t" (the default) and a small set of further observed/expected-based measures such as "tmi"; see Details.
raw
Does the data frame contain raw observations, i.e., one observation per line? Default is TRUE.
all
Should association be calculated for all possible combinations? Default is FALSE, i.e., only attested combinations are evaluated (please read the notes below before changing this).
reverse
The default sorting is in descending order by positive attraction. Set to TRUE to reverse the order.
threshold
Set this to a positive integer if the output should only contain combinations with a frequency above a certain threshold. Please read the notes below.
decimals
The number of decimals in the association measure.
General: This function uses a code section from the function scfa from the R package cfa (Mair & Funke 2017) for the calculation of expected values. Multiple Covarying Collexeme Analysis is conceptually essentially Configural Frequency Analysis (CFA; von Eye 1990) in a collostructional context. If you use this method, you can cite Stefanowitsch & Flach (2020).
Note that this function can only provide measures based on the counts/observations in the data frame. That is, unlike the other collexeme functions, there is no option to supply overall (corpus) frequencies if your data frame contains only a sample of all corpus counts. For instance, if you have removed hapax combinations from your data frame, the frequencies those combinations contribute to the totals of their member types will not be included.
All combinations: While calculating association measures for all possible combinations of all conditions (all = TRUE) is necessary if you want to assess the relevance of the absence of a combination (a negative association value for most measures), you need to be aware of the consequences: for use cases with high type frequencies, this involves calculating a huge number of n-types, which can break the R session. It is doubtful whether this is linguistically relevant anyway: most high(er)-frequency combinations that are linguistically interesting will have at least a few observations. Also note that if you supply an aggregated data frame (with a frequency column, i.e., raw = FALSE), all = TRUE currently has no effect (and it is doubtful that this will be implemented). In this case, you can 'untable' your original data frame; see the examples below.
Threshold: You can restrict the output such that only combinations that occur an n number of times are shown. This might be useful if you have a large data frame.
Association measures: The implemented measures are relatively simple (they only involve observed and expected values), but they do the trick of separating idiomatic, highly associated types from less strongly associated ones. Most measures are based on Evert (2004), with the exception of tmi. As there is not yet a sufficient number of studies, it is difficult to give advice, but the t-score appears relatively robust (hence the default). However, since an observed value of 0 (if all = TRUE) would result in -Inf, 0.5 is added to the observed value of unattested combinations before the t-value is calculated.
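A sketch of the computation implied by this description (the package's internal implementation may differ in detail): t = (O - E) / sqrt(O), with O set to 0.5 for unattested combinations.

t.score <- function(obs, exp) {
  obs.adj <- ifelse(obs == 0, 0.5, obs)  # smoothing for unattested combinations
  (obs.adj - exp) / sqrt(obs.adj)
}
t.score(obs = 10, exp = 2.3)  # attested: positive t
t.score(obs = 0,  exp = 1.7)  # unattested: negative t (finite thanks to smoothing)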
The output is sorted in descending order of attraction.
CONDITION...
The first n columns are the conditions you originally entered.
OBS
Observed frequency of the combination.
EXP
Expected frequency of the combination.
AM
The chosen association measure.
Susanne Flach, [email protected]
Mair, Patrick & Stefan Funke. 2017. cfa: Configural Frequency Analysis (CFA). R package version 0.10-0. https://CRAN.R-project.org/package=cfa
Eye, Alexander von. 1990. Introduction to configural frequency analysis: The search for types and anti-types in cross-classification. Cambridge: CUP.
Stefanowitsch, Anatol & Susanne Flach. 2020. Too big to fail but big enough to pay for their mistakes: A collostructional analysis of the patterns [too ADJ to V] and [ADJ enough to V]. In Gloria Corpas & Jean-Pierre Colson (eds.), Computational Phraseology, 248–272. Amsterdam: John Benjamins.
## Not run:
## Multiple Distinctive Collexeme Analysis
## Case 1: Raw list of observations
# load data

## Case 3: You only have an aggregated list with attested values,
## but want to measure the relevance of the absence (i.e., dissociation measures)
library(reshape)
# untable (where df stands for your original data frame, minus the last
# column with the frequencies)
df_new <- untable(df[, -ncol(df)], num = df[, ncol(df)])
## End(Not run)
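If you prefer not to depend on the reshape package, an equivalent base-R sketch (same assumptions: frequencies in the last column of df):

df_new <- df[rep(seq_len(nrow(df)), times = df[[ncol(df)]]), -ncol(df), drop = FALSE]
rownames(df_new) <- NULL  # tidy up the replicated row names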
Implementation of Distinctive Collexeme Analysis (Gries & Stefanowitsch 2004) to be run over a data frame with frequencies of items in two alternating constructions OR, more generally, for keyword analysis over a data frame with frequency lists of two corpora OR, most generally, over a data frame comparing the frequencies of two conditions. Note: if you want to perform Multiple Distinctive Collexeme Analysis, use collex.covar(); see the Notes below.
collex.dist(x, am = "logl", raw = FALSE, reverse = FALSE, decimals = 5,
            threshold = 1, cxn.freqs = NULL, str.dir = FALSE, p.fye = FALSE,
            delta.p = FALSE)
x
Input, a data frame. Two options: EITHER aggregated frequencies with the types in column 1 (WORD) and their frequencies in cxns/conditions A and B in columns 2 and 3 (in which case raw = FALSE, the default), OR a raw list with one observation per line, with the cxns/conditions in column 1 and the collexemes in column 2 (in which case set raw = TRUE).
am
Association measure to be calculated. Currently available, tested, and conventional in the collostruction literature: "logl" (the default), "fye", "fye.ln", "mi", "chisq", and others; see Details.
raw
Does the input data frame contain a raw list of occurrences? Leave the default (FALSE) for aggregated frequency lists; set to TRUE for one observation per line.
reverse
If TRUE, output is sorted in reverse order, i.e., by association with cxn/condition B. Default is FALSE.
decimals
Number of decimals in the output. Default is 5 decimal places.
threshold
Frequency threshold for items you want to calculate association measures for. By default, this is 1 (= calculated for all items). Any number higher than 1 will exclude items that have a combined frequency lower than the threshold. For instance, threshold = 2 excludes all items that occur only once overall.
cxn.freqs
A numeric vector or list of length 2. This option lets you enter construction frequencies 'manually' if x does not contain all types of the two constructions, i.e., if the construction totals cannot be derived from the input (see ditrdat_pub for an example).
str.dir
Do you want a "directed" association measure in the output? For measures that are positive for attracted and repelled items, str.dir = TRUE adds a column in which the values of items associated with cxn/condition B carry a negative sign.
p.fye
If TRUE and am = "fye", the traditional Fisher-Yates p-value is also given.
delta.p
Should deltaP be calculated? If yes, both types of deltaP will be calculated (see below).
FYE: Note to users of am = "fye": package versions up to 0.0.10 used the negative natural logarithm for the p-value transformation. Versions >= 0.1.0 use the negative decadic logarithm. If you want to continue using the natural logarithm transformation, use am = "fye.ln" as an association measure and repeat the procedure. (If you see this message, you are using a version >= 0.1.0. It will disappear in versions >= 1.0 once on CRAN.)
Association measures: "logl" is the default because, for larger datasets from larger corpora, the original "fye" easily returns Inf for the most strongly associated and/or dissociated items, which are then non-rankable (if this happens, they are ranked by frequency of occurrence).
Thresholds: The threshold argument default of 1 does not remove non-occurring items (although the logic of this argument implies as much). This is a "bug" that I decided to keep for historical reasons (and to avoid problems with collex()). Use it primarily to leave out low-frequency items from the output.
The output is ordered in descending order of association with cxn/condition A (unless reverse = TRUE):
COLLEX
type/item (e.g., string, lemma...)
O.CXN1
Observed frequency in cxn/condition A
E.CXN1
Expected frequency in cxn/condition A
O.CXN2
Observed frequency in cxn/condition B
E.CXN2
Expected frequency in cxn/condition B
ASSOC
Name of the cxn/condition with which the COLLEX type is associated.
COLL.STR.AM
Value of the association measure used (AM stands for the chosen measure).
STR.DIR
"Directed" collostruction strength.
DP1
Association (word-to-cxn), i.e., deltaP(w|cxn), see Gries & Ellis (2015: 240).
DP2
Association (cxn-to-word), i.e., deltaP(cxn|w), see Gries & Ellis (2015: 240).
SIGNIF
Significance level.
SHARED
Is the item attested in both cxns/conditions (yes/no)?
Multiple Distinctive Collexeme Analysis: If you want to perform a Multiple Distinctive Collexeme Analysis (MDCA) for more than two levels of your construction, you cannot use collex.dist(), as it can only handle two levels of cxn. Instead, use collex.covar() for Co-Varying Collexeme Analysis (CCA), which can handle more than two levels of the first condition. The main difference between MDCA and CCA is conceptual (and arguably historical), but they are mathematically the same thing; the association scores of CCA and approximation-based MDCA correlate highly (e.g., for the modadv data: Pearson r = .9987841; Spearman's rho = .9999993).
Susanne Flach, [email protected]
Thanks to Anatol Stefanowitsch, Berit Johannsen, Kirsten Middeke and Volodymyr Dekalo for discussion, suggestions, debugging, and constructive complaining.
Evert, Stefan. 2004. The statistics of word cooccurrences: Word pairs and collocations. U Stuttgart Dissertation. http://www.collocations.de/AM/
Gries, Stefan Th. & Nick C. Ellis. 2015. Statistical measures for usage-based linguistics. Language Learning 65(S1). 228–255. doi:10.1111/lang.12119.
Gries, Stefan Th. & Anatol Stefanowitsch. 2004. Extending collostructional analysis: A corpus-based perspective on "alternations." International Journal of Corpus Linguistics 9(1). 97-129.
Hilpert, Martin. 2011. Diachronic collostructional analysis: How to use it and how to deal with confounding factors. In Kathryn Allan & Justyna A. Robinson (eds.), Current methods in historical semantics, 133–160. Berlin & Boston: De Gruyter.
Johannsen, Berit & Susanne Flach. 2015. Systematicity beyond obligatoriness in the history of the English progressive. Paper presented at ICAME 36, 27–31 May 2015, Universität Trier.
See freq.list() for an easy function to create frequency lists from character vectors in preparation for Distinctive Collexeme Analysis. For use with incomplete data, see the example in ditrdat_pub.
## Not run:
##### Calculate Distinctive Collexeme Analysis
## This is a little lengthy, because there are multiple ways to provide
## input to DCA (Case 1, Case 2, and Case 3). There are also use cases
## to run multiple DCAs over a list of files (see below).

### Case 1: An aggregated frequency list
## The easiest use case: words in col1, and their frequencies
## in cxns A and B in col2 and col3, respectively:
# load data
data(beginStart)
# perform DCA (no settings necessary, more as required):
beginStart.dca1 <- collex.dist(beginStart)
beginStart.dca2 <- collex.dist(beginStart, am = "fye")
beginStart.dca3 <- collex.dist(beginStart, am = "fye", str.dir = TRUE)
# inspect:
head(beginStart.dca1, 15)  # 15 most strongly attracted items to cxn A
tail(beginStart.dca1, 20)  # 20 most strongly attracted items to cxn B
# cleanup (remove objects from workspace):
rm(beginStart.dca1, beginStart.dca2, beginStart.dca3)

### Case 2: Two separate aggregated frequency lists
## Like Case 1, but with separate lists for cxns A and B that need to be combined:
# load data
data(beginToV)
data(startToV)
# I. Merge the lists
beginStart.in <- join.freqs(beginToV, startToV)
head(beginStart.in, 12)
# II. Calculate association
beginStart.out <- collex.dist(beginStart.in)
# III. Inspect
head(beginStart.out, 15)  # 15 most strongly attracted items to cxn A
tail(beginStart.out, 20)  # 20 most strongly attracted items to cxn B
# cleanup (remove objects from workspace):
rm(beginToV, startToV, beginStart.in, beginStart.out)

### Case 3: A list with one observation per line (i.e., raw = TRUE),
# where the cxns are in col1 and the collexemes are in col2:
# load & inspect the will/going-to-V alternation:
data(future)
head(future, 12)
# Calculate:
future.out <- collex.dist(future, raw = TRUE)
head(future.out, 6)
tail(future.out, 6)
# cleanup (remove objects from workspace):
rm(future, future.out)

##### IF YOU HAVE INCOMPLETE DATA SETS
## Illustrates the application of the cxn.freqs argument if you do not have all
## types; this is *not* a sample of a larger data set, but rather as if lines
## from an aggregated frequency list were unavailable. To illustrate, we'll
## recreate the dative alternation from Gries & Stefanowitsch (2004).
data(ditrdat_pub)
# The data is from Gries & Stefanowitsch (2004: 106), i.e., the top collexemes
# for the ditransitive vs. the to-dative. That is, the low-frequency items are
# not in their results table.
# So the following would lead to (linguistically) wrong results, because
# collex.dist() calculates the cxn frequencies from an incomplete data set,
# when in fact they are much higher:
collex.dist(ditrdat_pub, am = "fye")
# However, you can recreate the results table by specifying the cxn frequencies
# provided in the publication (as cxn.freqs), as a vector, where the first
# element contains the total of the first cxn and the second contains the total
# of the second cxn. You can also get the traditional Fisher-Yates p-value as
# in the original publication:
ditrdat.dca <- collex.dist(ditrdat_pub, am = "fye", p.fye = TRUE,
                           cxn.freqs = c(1035, 1919), decimals = 3)
# Inspect:
head(ditrdat.dca, 20)  # left side of Table 2 (Gries & Stefanowitsch 2004: 106)
tail(ditrdat.dca, 19)  # right side of Table 2 (Gries & Stefanowitsch 2004: 106)
# The right side of Table 2 is "upside down", because collex.dist() orders by
# the collostructional continuum. Run the above again with reverse = TRUE.
# NB: If you have a raw input file, make sure you pass a vector with the correct
# frequencies for whatever R recognizes as the first element (usually
# alphabetically if column 1 is a factor); this behavior has to be checked
# carefully in R 4.x, as R now reads in characters as characters by default.

##### IN USE OVER LISTS
## We performed several Distinctive Collexeme Analyses for (present) progressive
## vs. (simple) present over 10 25-yr periods in CLMET (Johannsen & Flach 2015).
## Note that although this uses historical data, it is quite different from
## Diachronic Distinctive Collexeme Analysis (e.g., Hilpert 2011), where periods
## are conditions in *one* DCA and thus mathematically *not* independent of
## each other. The sample data below runs one DCA *per period*, so the DCAs are
## mathematically independent of each other. The conditions are still
## two alternating constructions as in 'ordinary' DCA.
## So this means 'multiple' DCAs in the sense of 'several' DCAs, not in the
## sense of 'Multiple Distinctive Collexeme Analysis' (MDCA).
## load data
data(CLMETprog.qc)
data(CLMETsimple.qc)
### I. Prepare
# split data by time period and drop redundant period column
prog <- split(CLMETprog.qc[, c(1,3)], CLMETprog.qc$QUARTCENT)
simple <- split(CLMETsimple.qc[, c(1,3)], CLMETsimple.qc$QUARTCENT)
# combine frequencies for progressive & simple as input to collex.dist()
dist.in <- join.lists(prog, simple)
dist.in <- lapply(dist.in, droplevels)
### II. Perform several DCAs (returns a list of DCA outputs)
# if only the defaults are used, use lapply() like so:
dist.out <- lapply(dist.in, collex.dist)
# if you want to override default arguments, use Map() like so:
dist.out <- Map(collex.dist, dist.in, am = "fye")
dist.out <- Map(collex.dist, dist.in, decimals = 7)
### III. Inspect output
str(dist.out)     # structure of list
str(dist.out[1])  # structure of first item in list
### IV. Export (works if you have installed package 'openxlsx')
# will write each DCA to a separate Excel worksheet
openxlsx::write.xlsx(dist.out, "ProgSimpleDistColl_CLMET.xlsx")
## End(Not run)
A subset of the ditransitive-dative alternation in ICE-GB, taken from Gries & Stefanowitsch (2004: 106).
data("ditrdat_pub")
data("ditrdat_pub")
A data frame with 39 observations on the following 3 variables.
VERB
a factor variable with the top attracted collexemes to the ditransitive and to-dative, respectively
DITR
a numeric variable containing the frequency of VERB in the ditransitive
DAT
a numeric variable containing the frequency of VERB in the to-dative
Data to illustrate the use of collex.dist() with incomplete data sets. For full data sets, collex.dist() determines the construction totals from the files directly, but this is not possible with incomplete data sets (i.e., with types and their frequencies missing). For example, in their publication, Gries & Stefanowitsch list only data that amounts to 957 (ditransitive) and 813 (to-dative) data points, while the construction totals are 1,035 and 1,919, respectively. In such cases, where the data is incomplete but the construction totals are known, the totals need to be passed to the function; see below for an example.
Recreated from Table 2 in Gries & Stefanowitsch (2004: 106).
Gries, Stefan Th. & Anatol Stefanowitsch. 2004. Extending collostructional analysis: A corpus-based perspective on “alternations.” International Journal of Corpus Linguistics 9(1). 97–129.
## Not run:
## 1 Inspect the data: a data frame with 3 columns and the aggregated
## frequencies of verbs in the ditransitive (DITR) and to-dative (DAT).
head(ditrdat_pub)

## 2 Recreate the results in Gries & Stefanowitsch (2004: 106), with
## the construction frequencies given as the cxn.freqs argument:
# with the defaults of collex.dist(), i.e., log-likelihood (G2):
collex.dist(ditrdat_pub, cxn.freqs = c(1035, 1919), decimals = 3)
# with p.fye = TRUE, to recreate the traditional p-value (Fisher-Yates)
# (note that p.fye = TRUE will only make sense if am = "fye"):
collex.dist(ditrdat_pub, am = "fye", p.fye = TRUE, cxn.freqs = c(1035, 1919))
## See ?collex.dist() for further examples.
## End(Not run)
Sometimes it is handy to create a frequency list when working with annotated data rather than importing frequency lists from external programs. Many R functions can create frequency lists; this is a handy alternative that creates lists directly compatible with the join.freqs() function. The output is a data frame sorted in descending order of frequency, which may be one or two steps quicker than base functions such as table().
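For comparison, a rough base-R equivalent of what freq.list() does in one step (toy vector for illustration):

x  <- c("go", "see", "go", "get", "go", "see")
df <- as.data.frame(table(WORD = x))          # count types
df <- df[order(df$Freq, decreasing = TRUE), ] # sort by descending frequency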
freq.list(x, convert = TRUE, asFactor = TRUE)
x
A factor or character vector.
convert
Should the elements in x be converted to character before counting? Default is TRUE.
asFactor
Should the types in the output be converted to factor? Default is TRUE.
WORD
The types.
FREQ
The frequencies.
Can be used in preparation for join.freqs.
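A sketch of that preparation step, building two frequency lists from the raw future data and joining them for collex.dist() (column names in the joined frame follow the object names):

data(future)
freqs.will <- freq.list(future$COLLEXEME[future$CXN == "will"])
freqs.goto <- freq.list(future$COLLEXEME[future$CXN == "going.to"])
dist.in    <- join.freqs(freqs.will, freqs.goto)
dist.out   <- collex.dist(dist.in)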
## Not run:
## From a list of raw observations:
head(causInto, 12)
# so to get a frequency list of the V1 in the into-causative:
freq.list(causInto$V1)
# or from the future-time expressions:
head(future, 12)
freq.list(future$COLLEXEME)
## End(Not run)
Data set (sample) of 10,000 random tokens of the will- vs. going to-VERB alternation in the spoken section of the BNC2014, to illustrate collex.dist() with raw input.
data("future")
data("future")
A data frame with 10,000 observations on the following 2 variables.
CXN
a factor with levels going.to and will
COLLEXEME
a factor with 599 verb types in either the going to V or will cxn.
## Not run:
## Distinctive Collexeme Analysis
# load data
data("future")
# perform Distinctive Collexeme Analysis (with defaults)
# see ?collex.dist() for more use cases:
x <- collex.dist(future, raw = TRUE)
# If you do not set raw = TRUE, the function aborts:
x <- collex.dist(future)
## End(Not run)
Data set from Flach (2015) containing the construction and corpus frequencies of all verbs that occur as V2 in the go-VERB construction in the ENCOW14AX01 corpus (Schäfer & Bildhauer 2012). See Flach (2015:§4.2) for data extraction procedure.
data("goVerb")
data("goVerb")
A data frame with 752 observations on the following 3 variables.
WORD
A factor with levels for each of the 752 verbs that occur in the construction
CXN.FREQ
A numeric vector, containing the observed frequency of V2 in go-VERB
CORP.FREQ
A numeric vector, containing the string frequency of V2 in the corpus
Flach, Susanne. 2015. Let’s go look at usage: A constructional approach to formal constraints on go-VERB. In Thomas Herbst & Peter Uhrig (eds.), Yearbook of the German Cognitive Linguistics Association (Volume 3), 231-252. Berlin: De Gruyter Mouton. doi:10.1515/gcla-2015-0013.
Schäfer, Roland & Felix Bildhauer. 2012. Building large corpora from the web using a new efficient tool chain. In Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Mehmet Uğur Doğan, Bente Maegaard, Joseph Mariani, Jan Odijk & Stelios Piperidis (eds.), Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC’12), 486–493. Istanbul: ELRA. Available at http://webcorpora.org
## Not run:
data(goVerb)   # load
str(goVerb)    # inspect structure of object
head(goVerb)   # view head of object
collex(goVerb, 616113708)  # used in Flach (2015), all tokens minus punctuation
collex(goVerb, 93993713)   # could/should probably be used, all verb tokens
collex(goVerb, 93993713, "chisq")  # returning chisquare statistic
## End(Not run)
Input data from corpora can be messy. This function checks whether your input to collex(), collex.dist() and collex.covar() has one of three frequent, potential issues: (a) does it contain leading or trailing whitespace, (b) does it contain types that are identical except for case (e.g., fine and Fine), and/or (c) does it contain missing values?
Beware, though, that the input types to all functions vary a lot and errors can be plentiful, so this function is (still) somewhat experimental and may report issues where there are none (and, sadly, vice versa). Function-specific errors are usually also reflected in error messages, but do let me know if you find bugs or have suggestions for improved checks and error messages.
input.check(x, raw = FALSE)
x
The data frame that you would pass as x to collex(), collex.dist(), or collex.covar().
raw
Does x contain raw observations, i.e., one observation per line? Default is FALSE.
DISCLAIMER: this function is quite experimental, so there is of course no guarantee that your data frame is free of technical or conceptual errors if the function output does not report any issues. (In some cases, for example, case sensitivity might even be wanted.) But the function should detect the most common problems arising from messy corpus data or data processing that I have come across in my own work, teaching, and workshops.
Note also that the function does not check for problems specific to a particular collex function: for example, the check whether corpus frequencies are lower than construction frequencies for Simple Collexeme Analysis is run during the execution of collex(). If you have suggestions of how to improve the checker function, let me know.
I'd recommend that any errors are fixed in the source files, e.g., in Excel or csv files, before import into R.
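If fixing the source file is not an option, a cleanup sketch in R (assuming a frame df of the WORD/CXN/CRP shape used in the examples below; whether lower-casing is appropriate depends on your data):

df$WORD <- trimws(df$WORD)                       # strip leading/trailing whitespace
df$WORD <- tolower(df$WORD)                      # collapse case variants (if wanted)
df <- df[df$WORD != "", ]                        # drop empty types
df <- aggregate(. ~ WORD, data = df, FUN = sum)  # re-aggregate merged duplicates
                                                 # (note: drops rows with NA frequencies)
input.check(df)                                  # re-check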
Returns an attempt at diagnostics for leading/trailing whitespace, duplicate types (e.g., due to capitalisation inconsistency) and/or empty/missing frequencies and/or types.
Report bugs and suggestions to:
Susanne Flach, [email protected]
## Not run: 
## dirty:
df <- data.frame(WORD = c("Hello", "hello", "hello ", " hello", "Hi ", " hi", "HI", "", "G'day"),
                 CXN = c(23, NA, 2, 1, 30, 59, NA, 10, 3),
                 CRP = c(569, 3049, 930, 394, 2930, 87, 9, 23, 40))
df
input.check(df)

## a little dirty:
df <- data.frame(WORD = c("Hello", "Hi", "hi", "HI", "", "G'day"),
                 CXN = c(23, 12, 2, 1, 30, 59),
                 CRP = c(569, 3049, 930, 394, 2930, 28))
df
input.check(df)

## clean:
df <- data.frame(WORD = c("Hello", "Hi", "Goodmorning", "Goodafternoon"),
                 CXN = c(234, 139, 86, 74),
                 CRP = c(23402, 2892, 893, 20923))
input.check(df)
## End(Not run)
Function to merge two data frames of frequency lists into a combined data frame of frequencies.
join.freqs(x, y, all = TRUE, threshold = 1)
x
A data frame with frequencies for condition A, with types in the first column and their frequencies in the second.
y
A data frame with (i) corpus frequencies of the items (for Simple Collexeme Analysis) or (ii) frequencies of the items in a second construction (for Distinctive Collexeme Analysis), in the same two-column format.
all
Logical. If TRUE (the default), all types from both lists are included, even if they do not occur in the construction; if FALSE, only types that occur in x are kept.
threshold
Numerical. How many times must an item occur overall to be included in the joint list? Default is 1, which means all items are included. If set to a higher value, items whose combined frequency falls below the threshold are excluded.
Output suitable for collex() and collex.dist(). The column names of the output data frame (cols 2 and 3) will be identical to the names of the objects passed as x and y. The header of column one will be WORD.
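For illustration, a minimal sketch with two hypothetical toy objects, cxnA and cxnB, showing how the output columns inherit the object names:

cxnA <- data.frame(WORD = c("walk", "run"), FREQ = c(10, 5))
cxnB <- data.frame(WORD = c("run", "jump"), FREQ = c(7, 3))
join.freqs(cxnA, cxnB)   # output columns: WORD, cxnA, cxnB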
The difference to join.lists() is that join.freqs() joins two frequency lists, whereas join.lists() joins two lists (i.e., 'list', the R object type) of frequency lists, which may contain, e.g., frequency lists from different periods.
The behaviour of join.freqs() deviates from that of merge(): if you merge construction frequencies with a list of corpus frequencies with all = FALSE, and a construction item does not occur in the corpus frequency list, the item will occur in the output with a corpus frequency of 0. While this will throw an error if you use such a list in collex(), it allows you to identify faulty data rather than silently dropping it.
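A short sketch of how this can be used to catch such items before running collex(), assuming the input objects from the examples below:

begin.in <- join.freqs(beginToV, BNCverbL, all = FALSE)
# items attested in the construction but absent from the corpus list
# surface with a corpus frequency of 0 and would make collex() fail:
begin.in[begin.in[, 3] == 0, ]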
Also, you can merge a frequency list of two columns with a third frequency list, but you will then have to adjust the column headers manually afterwards.
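A sketch of such a three-way merge, with hypothetical objects listA, listB and listC; the headers are fixed manually afterwards:

ab  <- join.freqs(listA, listB)   # two two-column lists: WORD, listA, listB
abc <- join.freqs(ab, listC)      # ab already has three columns
names(abc) <- c("WORD", "listA", "listB", "listC")  # adjust headers manually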
## Not run: 
#### Example for Simple Collexeme Analysis
# Using the verb lemma frequencies from the BNC
# begin to rain, begin to think, begin to blossom...
data(beginToV)
data(BNCverbL)   # verb lemma frequencies (needed below)

## I. Prepare
# merge by lemmas, only types that occur in the construction:
begin1.in <- join.freqs(beginToV, BNCverbL, all = FALSE)
# merge by lemmas, all types, including unattested for 'negative evidence'
begin2.in <- join.freqs(beginToV, BNCverbL)

## II. Perform SCA
# second argument is taken directly from the source data
begin1.out <- collex(begin1.in, sum(BNCverbL$CORP.FREQ))
begin2.out <- collex(begin2.in, sum(BNCverbL$CORP.FREQ))

### Example for Distinctive Collexeme Analysis
# Comparing begin to rain, start to go,...
data(beginToV)  # load set 1
data(startToV)  # load set 2

## I. Prepare
beginStart <- join.freqs(beginToV, startToV)  # merge lists (all types)

## II. Perform
beginStart.out <- collex.dist(beginStart)
## End(Not run)
Merges two lists of data frames pair-wise. Both lists need to be of equal length (i.e., contain an identical number of data frames), and all data frames must have two columns (for WORD and FREQ); merging is by the items in x[,1]. All pair-wise data frames must have an identical ID column name to match by (using WORD as a column name is recommended). Returns a list of the same length with data frames of three columns, named WORD (or the name of the ID column), name.of.df1 and name.of.df2; the latter two contain the frequencies of WORD. Suitable to handle the output of split(), e.g., if a data frame was split by time period.
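A minimal sketch of the expected input/output structure, with two hypothetical lists of frequency lists split by period (p1, p2):

cxnA <- list(p1 = data.frame(WORD = c("go", "come"), FREQ = c(4, 2)),
             p2 = data.frame(WORD = c("go", "see"), FREQ = c(6, 1)))
cxnB <- list(p1 = data.frame(WORD = c("go", "take"), FREQ = c(3, 5)),
             p2 = data.frame(WORD = c("see", "go"), FREQ = c(2, 7)))
ab <- join.lists(cxnA, cxnB)  # list of two data frames: WORD + two frequency columns
lapply(ab, head)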
join.lists(x, y, all = TRUE, threshold = 1)
x
List 1 containing data frames of frequencies for construction A.
y
List 2 containing data frames of (i) corpus frequencies (for Simple Collexeme Analysis) or (ii) frequencies of a second construction (for Distinctive Collexeme Analysis).
all
Logical. If TRUE (the default), all types are included; if FALSE, only types that occur in x are kept.
threshold
Numerical. How many times must an item occur overall to be included in the joint list? Default is 1, which means all items are included (per list split, e.g., time period). If set to a higher value, items whose combined frequency falls below the threshold are excluded.
## Not run: 
### We performed a series of Distinctive Collexeme Analyses
### for (present) progressive vs. (simple) present over ten
### 25-year periods in CLMET (Johannsen & Flach 2015).
### Note that although working with historical data, this is something
### very different to Diachronic Distinctive Collexeme Analysis
### (e.g., Hilpert 2011), where periods are conditions in *one* DCA
### and thus mathematically not independent of each other.
### The sample data below runs one DCA per period, which are
### mathematically independent of each other. The conditions are still
### two alternating constructions as in 'ordinary' DCA.
### Also note that this means 'multiple' DCAs in the sense of 'several' DCAs,
### not in the sense of 'Multiple Distinctive Collexeme Analysis' (MDCA).

# Load data
data(CLMETprog.qc)
data(CLMETsimple.qc)
head(CLMETprog.qc, 10)
head(CLMETsimple.qc, 10)

### I. Prepare
## Make frequency lists (class 'list') by period, i.e.,
## split constructions by period, keep ITEM & FREQ only, droplevels
prog <- split(CLMETprog.qc[, c(1, 3)], CLMETprog.qc$QUARTCENT)
prog <- lapply(prog, droplevels)
simp <- split(CLMETsimple.qc[, c(1, 3)], CLMETsimple.qc$QUARTCENT)
simp <- lapply(simp, droplevels)

dist.in <- join.lists(prog, simp)
dist.in <- lapply(dist.in, droplevels)

# Cosmetics:
dist.in <- lapply(dist.in, setNames, c("WORD", "progressive", "simple"))

#### CALCULATE COLLEXEMES
dist.out.log <- lapply(dist.in, collex.dist)           # default (logL)
dist.out.fye <- Map(collex.dist, dist.in, am = "fye")  # FYE

### EXPORT
## Note: for this strategy, you need to install and load library(openxlsx)
write.xlsx(dist.out.log, "progCollexDistLL.xlsx")
write.xlsx(dist.out.fye, "progCollexDistFYE.xlsx")
## End(Not run)
Data set of 792 modal-adverb pairs in the BNC-BABY, such as would possibly, may well or can hardly.
data("modadv")
data("modadv")
A data frame with 792 observations on the following 3 variables.
MODAL
A factor with 11 levels for the core modals and contractions, i.e., 'd, 'll, can, could, may, might, must, shall, should, will, would.
ADVERB
A factor with 280 levels, one for each adverb type following a modal verb, e.g., certainly, essentially, even, lawfully, publicly, quickly, and well.
FREQ
The frequency of the combination.
BNC-BABY; [hw="will|would|can|could|may|might|must|shall|should" & pos="VM0"] [pos="AV0"]; cf. Flach (2020) with COCA data.
Flach, Susanne. 2020. Beyond modal idioms and modal harmony: A corpus-based analysis of gradient idiomaticity in MOD+ADV collocations. English Language and Linguistics (advance online publication).
## Not run: 
data(modadv)

## Inspect:
# This is an aggregated frequency list:
head(modadv, 12)

### Perform co-varying collexeme analysis
## ?collex.covar()
# since it's aggregated, you must set raw = FALSE, or it will make no sense.
cca.att <- collex.covar(modadv, am = "fye", raw = FALSE, all = FALSE)  # only attested combinations
cca.all <- collex.covar(modadv, am = "fye", raw = FALSE, all = TRUE)   # all combinations

## Reshape the cca output by association measure:
# ?reshape.cca
cca.wide.att <- reshape.cca(cca.att)
cca.wide.all <- reshape.cca(cca.all)
## End(Not run)
The output of collex.covar is in a so-called 'long format', where each row represents a slot1~slot2 pair (or, more generally, a cond1~cond2 pair). This function cross-tabulates the association measures into a 'wide format', where one condition forms the rows, the other the columns, and the cells contain the row~col association measure.
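Schematically, and with purely hypothetical values and simplified column names:

# long format (collex.covar):         wide format (reshape.cca):
#  SLOT1  SLOT2   COLL.STR             COND   well   hardly
#  may    well        12.3             may    12.3     -2.1
#  may    hardly      -2.1             can    -0.4      5.8
#  can    well        -0.4
#  can    hardly       5.8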
reshape.cca(x, cond = "shorter", str.dir = TRUE, value = "COLL.STR", abs.dev = FALSE, max.assoc = FALSE, sorton = "abs.dev", decimals = 5)
x
A data frame containing the output of collex.covar().
cond
Which of the two conditions in x should be cross-tabulated as columns; the default "shorter" uses the condition with the fewer types.
str.dir
Should the values in the cells indicate the direction of association? If TRUE (the default), the measures of repelled pairs carry a negative sign.
value
Which value should be cross-tabulated, i.e., put in the cells? Default is "COLL.STR". For cond1~cond2 pairs that do not occur in x, the cells contain NA.
abs.dev
The function sums all absolute association measures row-wise. Should this sum be included in the output as an extra column? The default is FALSE.
max.assoc
Should the col-condition of maximum association per row-condition be included in the output? If TRUE, an extra column names, for each row-condition, the column condition with the strongest association. Default is FALSE.
sorton
By default the output is sorted in descending order of the row-wise sum of absolute association measures ("abs.dev").
decimals
Rounding of cell values; default is 5. If this is set to a higher value than what was used for this argument in collex.covar(), the cells cannot gain precision.
The function makes most sense for a collex.covar() analysis that was run for all possible combinations. If association scores were only calculated for attested combinations, the output of reshape.cca() contains NA in the cells of unattested combinations, and it is up to the user to decide what to do with them. Both abs.dev and max.assoc can still be calculated and displayed, but they are then based on observed combinations only. Since association measures for unobserved combinations can be read as 'negative evidence', the abs.dev values will differ and the max.assoc type may be different, depending on the strength of the (potential) 'negative association'. See the examples below for the case of unattested (*) combinations.
Returns cross-tabulated association scores or observed values.
Susanne Flach, [email protected]
## Not run: 
data(modadv)

## Inspect:
# This is an aggregated frequency list:
head(modadv, 12)

### Perform co-varying collexeme analysis
## ?collex.covar()
# since it's aggregated, you must set raw = FALSE, or it will make no sense.
cca.att <- collex.covar(modadv, am = "fye", raw = FALSE, all = FALSE)  # only attested combinations
cca.all <- collex.covar(modadv, am = "fye", raw = FALSE, all = TRUE)   # all combinations

## Reshape the cca output by association measure:
# ?reshape.cca
cca.wide.att <- reshape.cca(cca.att)
View(cca.wide.att)
cca.wide.all <- reshape.cca(cca.all)
View(cca.wide.all)

#### Co-occurrence of observations
modadv.obs <- reshape.cca(cca.att, value = "OBS", str.dir = FALSE)  # you must set str.dir = FALSE in this case
# since we ran this on only the attested values, you can replace NA with 0:
modadv.obs[is.na(modadv.obs)] <- 0
View(modadv.obs)
## End(Not run)
Data set of the start-to-VERB construction in the British National Corpus (BNC), with the frequencies of the verbs in the open slot (CQP query: [hw="start" & class="VERB"] [hw="to"] [pos="V.I"]).
data("startToV")
data("startToV")
A data frame with 1168 observations on the following 2 variables.
WORD
A factor with levels of the types in the open slot of start-to-V
CXN.FREQ
A numeric vector of the frequencies in V2.
## Not run: 
data(startToV)   # load
str(startToV)    # inspect structure of object
head(startToV)   # view head of object
## End(Not run)