% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/fcm.R
\name{fcm}
\alias{fcm}
\alias{is.fcm}
\title{Create a feature co-occurrence matrix}
\usage{
fcm(x, context = c("document", "window"), count = c("frequency",
  "boolean", "weighted"), window = 5L, weights = NULL,
  ordered = FALSE, tri = TRUE, ...)
}
\arguments{
\item{x}{character, \link{corpus}, \link{tokens}, or \link{dfm} object from
which to generate the feature co-occurrence matrix}
\item{context}{the context in which to consider term co-occurrence:
\code{"document"} for co-occurrence counts within each document;
\code{"window"} for co-occurrence within a defined window of words, which
requires a positive integer value for \code{window}. Note: if \code{x} is a
dfm object, then \code{context} can only be \code{"document"}.}
\item{count}{how to count co-occurrences:
\describe{
\item{\code{"frequency"}}{count the number of co-occurrences within the
context}
\item{\code{"boolean"}}{record only whether or not features co-occur within
the context, irrespective of how many times they co-occur}
\item{\code{"weighted"}}{count a weighted function of counts, typically as
a function of distance from the target feature. Only applicable when
\code{context = "window"}.}
}}
\item{window}{positive integer value for the size of a window on either side
of the target feature, default is 5, meaning 5 words before and after the
target feature}
\item{weights}{a vector of weights applied to each distance from
\code{1:window}, strictly decreasing by default; can be a custom-defined
vector of the same length as \code{window}}
\item{ordered}{if \code{TRUE}, the number of times that a term appears
before or after the target feature is counted separately. Only applicable
when \code{context = "window"}.}
\item{tri}{if \code{TRUE}, return only the upper triangle (including the
diagonal). Ignored if \code{ordered = TRUE}.}
\item{...}{not used here}
}
\description{
Create a sparse feature co-occurrence matrix, measuring co-occurrences of
features within a user-defined context. The context can be defined as a
document or a window within a collection of documents, with an optional
vector of weights applied to the co-occurrence counts.
}
\details{
The function \code{\link{fcm}} provides a very general implementation of a
"context-feature" matrix, consisting of a count of feature co-occurrence
within a defined context. This context, following Momtazi et al. (2010),
can be defined as the \emph{document}, \emph{sentences} within documents,
\emph{syntactic relationships} between features (nouns within a sentence,
for instance), or according to a \emph{window}. When the context is a
window, a weighting function is typically applied that is a function of
distance from the target word (see Jurafsky and Martin 2018, Ch. 6), and the
ordered co-occurrence of the two features may be considered (see Church &
Hanks 1990).

\code{\link{fcm}} provides all of this functionality, returning a
\eqn{V * V} matrix (where \eqn{V} is the vocabulary size, returned by
\code{\link{nfeat}}). The \code{tri = TRUE} option returns only the upper
part of the matrix (including the diagonal).

Unlike some implementations of co-occurrence counting, \code{\link{fcm}}
counts feature co-occurrences with themselves, meaning that the diagonal
will not be zero.

\code{\link{fcm}} also provides "boolean" counting within the context of
"window", which differs from the counting within "document".

\code{is.fcm(x)} returns \code{TRUE} if and only if \code{x} is an object of
class \link{fcm}.
}
\examples{
# see http://bit.ly/29b2zOA
txt1 <- "A D A C E A D F E B A C E D"
fcm(txt1, context = "window", window = 2)
fcm(txt1, context = "window", count = "weighted", window = 3)
fcm(txt1, context = "window", count = "weighted", window = 3,
weights = c(3, 2, 1), ordered = TRUE, tri = FALSE)
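# boolean counting within a window context, as described in Details
fcm(txt1, context = "window", count = "boolean", window = 2)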
# with multiple documents
txt2 <- c("a a a b b c", "a a c e", "a c e f g")
fcm(txt2, context = "document", count = "frequency")
fcm(txt2, context = "document", count = "boolean")
fcm(txt2, context = "window", window = 2)
# from tokens
txt3 <- c("The quick brown fox jumped over the lazy dog.",
"The dog jumped and ate the fox.")
toks <- tokens(char_tolower(txt3), remove_punct = TRUE)
fcm(toks, context = "document")
fcm(toks, context = "window", window = 3)
}
\references{
Momtazi, S., Khudanpur, S., & Klakow, D. (2010).
"\href{https://www.lsv.uni-saarland.de/fileadmin/publications/SaeedehMomtazi-HLT_NAACL10.pdf}{A
comparative study of word co-occurrence for term clustering in language
model-based sentence retrieval}." \emph{Human Language Technologies: The
2010 Annual Conference of the North American Chapter of the ACL}, Los
Angeles, California, June 2010, 325-328.

Jurafsky, D. & Martin, J.H. (2018). Chapter 6, Vector Semantics. In
\emph{Speech and Language Processing: An Introduction to Natural Language
Processing, Computational Linguistics, and Speech Recognition}. Draft of
September 23, 2018. Available at \url{https://web.stanford.edu/~jurafsky/slp3/}.

Church, K. W. & P. Hanks (1990).
"\href{http://dl.acm.org/citation.cfm?id=89095}{Word association norms,
mutual information, and lexicography}." \emph{Computational Linguistics},
16(1), 22-29.
}
\author{
Kenneth Benoit (R), Haiyan Wang (R, C++), Kohei Watanabe (C++)
}