Wednesday, August 29, 2012

Plotting model fits

We all know that it is important to plot your data and explore the data visually to make sure you understand it. The same is true for your model fits. First, you want to make sure that the model is fitting the data relatively well, without any substantial systematic deviations. This is often evaluated by plotting residual errors, but I like to start with plotting the actual model fit.

Second, and this is particularly important when using orthogonal polynomials, you want to make sure that the statistically significant effects in the model truly correspond to the “interesting” (i.e., meaningful) effects in your data. For example, if your model had significant effects on higher-order terms like the cubic and quartic, you might want to conclude that this corresponds to a difference between early and late competition. Plotting the model fits with and without that term can help confirm that interpretation.

The first step to plotting model fits is getting those model-predicted values. If you use lmer, these values are stored in the eta slot of the model object. It can be extracted using m@eta, where m is the model object. Let's look at an example based on eye-tracking data from Kalénine, Mirman, Middleton, & Buxbaum (2012).
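(A quick aside, sketched with the sleepstudy data that ships with lme4 rather than the data from this post: fitted() returns the same model-predicted values as the eta slot, and it is the more future-proof idiom since later lme4 releases dropped direct slot access.)

```r
# Extracting model-predicted values from an lmer fit.
# Assumes the lme4 package is installed; sleepstudy is a built-in example dataset.
library(lme4)
m <- lmer(Reaction ~ Days + (Days | Subject), data = sleepstudy)
preds <- fitted(m)  # the same fitted values as m@eta in the lme4 version used in this post
head(preds)
```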

summary(data.ex)
##       Time           fixS           cond     obj          subj    
##  Min.   : 500   Min.   :0.0000   Late :765   T:510   21     :102  
##  1st Qu.: 700   1st Qu.:0.0625   Early:765   C:510   24     :102  
##  Median : 900   Median :0.1333               U:510   25     :102  
##  Mean   : 900   Mean   :0.2278                       27     :102  
##  3rd Qu.:1100   3rd Qu.:0.3113                       28     :102  
##  Max.   :1300   Max.   :1.0000                       40     :102  
##                                                      (Other):918  
##     timeBin        ot1              ot2               ot3        
##  Min.   : 1   Min.   :-0.396   Min.   :-0.2726   Min.   :-0.450  
##  1st Qu.: 5   1st Qu.:-0.198   1st Qu.:-0.2272   1st Qu.:-0.209  
##  Median : 9   Median : 0.000   Median :-0.0909   Median : 0.000  
##  Mean   : 9   Mean   : 0.000   Mean   : 0.0000   Mean   : 0.000  
##  3rd Qu.:13   3rd Qu.: 0.198   3rd Qu.: 0.1363   3rd Qu.: 0.209  
##  Max.   :17   Max.   : 0.396   Max.   : 0.4543   Max.   : 0.450  
##       ot4         
##  Min.   :-0.3009  
##  1st Qu.:-0.1852  
##  Median :-0.0231  
##  Mean   : 0.0000  
##  3rd Qu.: 0.2392  
##  Max.   : 0.4012  
ggplot(data.ex, aes(Time, fixS, color = obj)) + facet_wrap(~cond) + 
    stat_summary(fun.y = mean, geom = "line", size = 2)
[Figure: plot of chunk plot-data]
I've renamed the conditions "Late" and "Early" based on the timing of their competition effect: looking at fixation proportions for the related Competitor (green lines) relative to the Unrelated distractor, it looks like the “Late” condition had a later competition effect than the “Early” condition. We start by fitting the full model and plotting the model fit. For convenience, we'll make a new data frame that has the observed data and the model fit:
m.ex <- lmer(fixS ~ (ot1 + ot2 + ot3 + ot4) * obj * cond + 
    (1 + ot1 + ot2 + ot3 + ot4 | subj) + 
    (1 + ot1 + ot2 | subj:obj:cond), 
    data = subset(data.ex, obj != "T"), REML = FALSE)
data.ex.fits <- data.frame(subset(data.ex, obj != "T"), GCA_Full = m.ex@eta)
ggplot(data.ex.fits, aes(Time, fixS, color = obj)) + facet_wrap(~cond) + 
    stat_summary(fun.data = mean_se, geom = "pointrange", size = 1) + 
    stat_summary(aes(y = GCA_Full), fun.y = mean, geom = "line", size = 2) + 
    labs(x = "Time Since Word Onset (ms)", y = "Fixation Proportion")
[Figure: plot of chunk fit-full-model]
The fit looks pretty good and the model seems to capture the early-vs.-late competition difference, so now we can use the normal approximation to get p-values for the object-by-condition interaction:
coefs.ex <- as.data.frame(summary(m.ex)@coefs)
coefs.ex$p <- format.pval(2 * (1 - pnorm(abs(coefs.ex[, "t value"]))))
coefs.ex[grep("objU:cond", rownames(coefs.ex), value = TRUE), ]
##                     Estimate Std. Error t value       p
## objU:condEarly     -0.004164    0.01701 -0.2448 0.80664
## ot1:objU:condEarly  0.065878    0.07586  0.8685 0.38514
## ot2:objU:condEarly -0.047568    0.04184 -1.1370 0.25554
## ot3:objU:condEarly -0.156184    0.02327 -6.7119 1.9e-11
## ot4:objU:condEarly  0.075709    0.02327  3.2535 0.00114
There are significant object-by-condition interaction effects on the cubic and quartic terms, so that's where competition in the two conditions differed, but does that correspond to the early-vs.-late difference? To answer this question we can fit a model that does not have those cubic and quartic terms and visually compare it to the full model. We'll plot the data with pointrange, the full model with thick lines, and the smaller model with thinner lines.
m.exSub <- lmer(fixS ~ (ot1 + ot2 + ot3 + ot4) * obj + 
    (ot1 + ot2 + ot3 + ot4) * cond + 
    (ot1 + ot2) * obj * cond + 
    (1 + ot1 + ot2 + ot3 + ot4 | subj) + 
    (1 + ot1 + ot2 | subj:obj:cond), 
    data = subset(data.ex, obj != "T"), REML = FALSE)
data.ex.fits$GCA_Sub <- m.exSub@eta
ggplot(data.ex.fits, aes(Time, fixS, color = obj)) + facet_wrap(~cond) + 
    stat_summary(fun.data = mean_se, geom = "pointrange", size = 1) + 
    stat_summary(aes(y = GCA_Full), fun.y = mean, geom = "line", size = 2) + 
    stat_summary(aes(y = GCA_Sub), fun.y = mean, geom = "line", size = 1) + 
    labs(x = "Time Since Word Onset (ms)", y = "Fixation Proportion")
[Figure: plot of chunk sub-model]
Well, it sort of looks like the thinner lines have less early-late difference, but it is hard to see. It will be easier if we look directly at the competition effect size (that is, the difference between the competitor and unrelated fixation curves):
ES <- ddply(data.ex.fits, .(subj, Time, cond), summarize, 
    Competition = fixS[obj == "C"] - fixS[obj == "U"], 
    GCA_Full = GCA_Full[obj == "C"] - GCA_Full[obj == "U"], 
    GCA_Sub = GCA_Sub[obj == "C"] - GCA_Sub[obj == "U"])
ES <- rename(ES, c(cond = "Condition"))
ggplot(ES, aes(Time, Competition, color = Condition)) + 
    stat_summary(fun.y = mean, geom = "point", size = 4) + 
    stat_summary(aes(y = GCA_Full), fun.y = mean, geom = "line", size = 2) + 
    stat_summary(aes(y = GCA_Sub), fun.y = mean, geom = "line", size = 1) + 
    labs(x = "Time Since Word Onset (ms)", y = "Competition")
[Figure: plot of chunk effect-sizes]
Now we can clearly see that the full model (thick lines) captures the early-vs.-late difference, but when we remove the cubic and quartic terms (thinner lines), that difference almost completely disappears. So that shows that those higher-order terms really were capturing the timing of the competition effect.

P.S.: For those that care about behind-the-scenes/under-the-hood things, this post was created (mostly) using knitr in RStudio.

Monday, August 27, 2012

Time course of thematic and functional semantics

I am pleased to report that our paper on the time course of activation of thematic and functional semantic knowledge will be published in the September issue of the Journal of Experimental Psychology: Learning, Memory, and Cognition. This project was led by Solène Kalénine when she was a post-doc at MRRI working with Laurel Buxbaum and me. Humbly, I think this paper is pretty cool for a few different reasons.

First, and most central, we found (using eye-tracking) that the knowledge that two things are used together ("thematic relations", like toaster and bread) is activated relatively quickly and the knowledge that two things serve the same general function (like toaster and coffee maker both being used to prepare breakfast) is activated more slowly. Here are some smoothed activation curves based on our data:

I think this is interesting from the perspective of studying the dynamics of semantic cognition, and it is also nice from a methodological perspective: to my knowledge, this is the first time eye-tracking has been used to reveal on-line time course differences in the activation of semantic relations the way Allopenna et al. (1998) showed for phonological relations.

Our second cool finding was that context (sentence context, in this case) can emphasize different aspects of functional knowledge. This is a new addition to the existing (and growing) body of evidence that semantic representations are dynamic and context-sensitive, not static and self-contained.

The third thing I like about this paper is that we were able to meaningfully interpret effects on the higher-order polynomial terms in growth curve analysis. The early-vs.-late difference between thematic and function relations was on the cubic and quartic terms. These higher-order effects are usually difficult to interpret because it is hard to mentally picture what a "steeper" cubic or quartic curve would look like. We simplified that task by plotting the GCA curves with and without those higher-order terms, so their contribution became easy to see (plotting model fits will be the subject of an upcoming blog post).

Full reference for our paper: 
Kalénine, S., Mirman, D., Middleton, E. L., & Buxbaum, L. J. (2012). Temporal dynamics of activation of thematic and functional knowledge during conceptual processing of manipulable artifacts. Journal of Experimental Psychology: Learning, Memory, and Cognition, 38(5), 1274-1295. DOI: 10.1037/a0027626
Allopenna, P. D., Magnuson, J. S., & Tanenhaus, M. K. (1998). Tracking the time course of spoken word recognition using eye movements: Evidence for continuous mapping models. Journal of Memory and Language, 38(4), 419-439. DOI: 10.1006/jmla.1997.2558

Monday, August 20, 2012

The translational pipeline

Over the weekend I read yet another excellent article by Atul Gawande in the most recent issue of the New Yorker. There are many interesting things in this article and I highly recommend it, but there was one minor comment that really resonated with my own experience. Dr. Gawande mentioned that it's hard to get health care providers (doctors, nurses, clinicians of all types) to accept changes. This resistance to change is one of the obstacles in the translational research pipeline identified by my colleague John Whyte (e.g., Whyte & Barrett, 2012). The other major obstacle is using a theoretical understanding of some process, mechanism, or impairment to develop a potential treatment. 

I deeply value basic science (after all, it is most of what I do) and I recognize the importance of specialization -- the skills required for good basic science are not the same as the skills required for developing and testing treatments. Nevertheless, sometimes I worry that we basic scientists don't even speak the same language as the researchers trying to develop and test interventions. The clinical fields that border cognitive science (education, rehabilitation medicine, etc.) certainly stand to benefit from rigorous development of cognitive and neuroscience theory and this is the standard motivation given by basic scientists when applying for funding to the National Institutes of Health. 

Over the last few years I've come to realize that the benefits also run in the other direction: interventions can provide unique tests of theories. Making new, testable predictions is one of the hallmarks of a good theory, but if the new predictions are limited to much-used, highly constrained laboratory paradigms then it can feel like we're just spinning our wheels. Making predictions for interventions, or even just for individual differences, is one way to test a theory and to simultaneously expand its scope. As NIH puts more emphasis on its health mission, I hope cognitive and neural scientists will see this as an opportunity to expand the scope of our theories rather than as an inconvenient constraint.
Whyte, J., & Barrett, A. M. (2012). Advancing the evidence base of rehabilitation treatments: A developmental approach. Archives of Physical Medicine and Rehabilitation, 93(8 Suppl 2). PMID: 22683206

Thursday, August 16, 2012

Brain > Mind?

My degrees are in psychology, but I consider myself a (cognitive) neuroscientist. That's because I am interested in how the mind works and I think studying the brain can give us important and useful insights into mental functioning. But it is important not to take this too far. In particular, I think it is unproductive to take the extreme reductionist position that "the mind is merely the brain". I've spelled out my position (which I think is shared by many cognitive neuroscientists) in a recent discussion on the Cognitive Science Q&A site. The short version is that I think it is trivially true that the mind is just the brain, but the brain is just molecules, which are just atoms, which are just particles, etc., etc., and if you're interested in understanding human behavior, particle physics is of little use. In other words, when I talk about the mind, I'm talking about a set of physical/biological processes that are best described at the level of organism behavior.

The issue of separability of the mind and brain is also important when considering personal responsibility, as John Monterosso and Barry Schwartz pointed out in a recent piece in the New York Times and in their study (Monterosso, Royzman, & Schwartz, 2005). (Full disclosure: Barry's wife, Myrna Schwartz, is a close colleague at MRRI). Their key finding was that perpetrators of crimes were judged to be less culpable given a physiological explanation (such as a neurotransmitter imbalance) than an experiential explanation (such as having been abused as a child), even when the link between the explanation and the behavior was matched. That is, when participants were told that (for example) 20% of people with this neurotransmitter imbalance commit such crimes or 20% of people who had been abused as children commit such crimes, the ones with the neurotransmitter imbalance were judged to be less culpable. 

Human behavior is complex and explanations can be framed at different levels of analysis. Neuroscience can provide important insights and constraints for these explanations, but precisely because psychological processes are based in neural processes, neural processes cannot be any more "automatic" than psychological processes, nor can neural evidence be any more "real" than behavioral evidence.
Monterosso, J., Royzman, E. B., & Schwartz, B. (2005). Explaining away responsibility: Effects of scientific explanation on perceived culpability. Ethics & Behavior, 15(2), 139-158. DOI: 10.1207/s15327019eb1502_4

Friday, August 10, 2012

Treating participants (or items) as random vs. fixed effects

Connoisseurs of multilevel regression will already be familiar with this issue, but it is the single most common topic for questions I receive about growth curve analysis (GCA), so it seems worth discussing. The core of the issue is that in our paper about using GCA for eye tracking data (Mirman, Dixon, & Magnuson, 2008) we treated participants as fixed effects. In contrast, multilevel regression in general, and specifically the approach described by Dale Barr (2008), which is nearly identical to ours, treated participants as random effects. 

First, we should be clear about the conceptual distinction between "fixed" and "random" effects. This turns out not to be a simple matter because many, sometimes conflicting, definitions have been proposed (cf. Gelman, 2005). In the context of the kind of questions and data scenarios we typically face in cognitive neuroscience, I would say:
  • Fixed effects are the effects that we imagine to be constant in the population or group under study. As such, when we conduct a study, we would like to conclude that the observed fixed effects generalize to the whole population. So if I've run a word recognition study and found that uncommon (low frequency) words are processed slower than common (high frequency) words, I would like to conclude that this difference is true of all typical adults (or at least WEIRD adults: Henrich, Heine, & Norenzayan, 2010).
  • Random effects are the differences among the individual observational units in the sample, which we imagine are randomly sampled from the population. As such, these effects should conform to a specified distribution (typically a normal distribution) and have a mean of 0. So in my word recognition experiment, some participants showed a large word frequency effect and some showed a small effect, but I am going to assume that these differences reflect random, normally-distributed variability in the population.
Statistically, the difference is that fixed effect parameters are estimated independently and not constrained by a distribution. So, in the example, estimated recognition time for low and high frequency conditions can have whatever values best describe the data. Random effects are constrained to have a mean of 0 and follow a normal distribution, so estimated recognition time for a particular participant (or item, in a by-items analysis) reflects the recognition time for that individual as well as the pattern of recognition times across all other individuals in the sample. The consequence is that random effect estimates tend to be pulled toward their mean, which is called "shrinkage". So the trade-off is between independent estimation (fixed effects) and generalization (random effects).
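To make the shrinkage idea concrete, here is a small base-R sketch with simulated data. For simplicity it treats the between-subject and within-subject variances as known (a real mixed model would estimate them), so this is an illustration of the principle, not lme4 output:

```r
# Shrinkage: per-subject means get pulled toward the grand mean by a
# weight that depends on the between-subject variance (tau2) relative to
# the within-subject error variance (sigma2) per observation.
set.seed(1)
n.subj <- 6; n.obs <- 10
subj.true <- rnorm(n.subj, mean = 500, sd = 40)   # true subject effects
y <- unlist(lapply(subj.true, function(mu) rnorm(n.obs, mu, 60)))
subj <- rep(seq_len(n.subj), each = n.obs)

subj.means <- tapply(y, subj, mean)   # independent, fixed-effect-style estimates
grand <- mean(y)
tau2 <- 40^2; sigma2 <- 60^2          # variances assumed known for this sketch
w <- tau2 / (tau2 + sigma2 / n.obs)   # shrinkage weight, between 0 and 1
shrunk <- grand + w * (subj.means - grand)  # random-effect-style estimates

# every shrunken estimate lies between its subject mean and the grand mean
all(abs(shrunk - grand) <= abs(subj.means - grand))
```

The weight w goes to 1 (no shrinkage) as subjects contribute more data or differ more from each other, and toward 0 (complete pooling) as within-subject noise dominates.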

Returning to the original question: should participants (or items) be treated as random or fixed effects? In experimental cognitive science/neuroscience, we usually think of participants (or items) as sampled observations from some population to which we would like to generalize -- our particular participants are assumed to be randomly sampled and representative of the population of possible participants, our particular test items are assumed to be representative of possible items. These assumptions mean that participants (and items) should be treated as random effects. 

However, we don't always make such assumptions. Especially in cognitive neuropsychology, we are sometimes interested in particular participants that we think can demonstrate something important about how the mind/brain works. These participants are "interesting in themselves" and comprise a "sample that exhausts the population" (two proposed definitions of fixed effects, see Gelman, 2005). In such cases, we may want to treat participants as fixed effects. The decision depends on the ultimate inferential goals of the researcher, not on the research domain or sample population. For example, studies of certain idiosyncratic words (onomatopoeia, some morphological inflections) may reasonably treat items as fixed effects if the statistical inferences are restricted to these particular items (although, as with case studies, the theoretical implications can be broader). On the other side, in our recent study examining the effect of left temporo-parietal lesions on thematic semantic processing (Mirman & Graziano, 2012), we treated participants as random effects because our goal was to make general inferences about the effect of TPC lesions (assuming, as always, that our small group constituted a random representative sample of individuals with left TPC lesions).

Additional discussion of this issue can be found in a tech report (LCDL Technical Report 2012.03) that we have just added to our growth curve analysis page. Also, it is only fair to note that this is only one aspect of determining the "right" random effect structure and there remain important unanswered questions in this domain (here are some recent developments and discussions).
Barr, D. J. (2008). Analyzing ‘visual world’ eyetracking data using multilevel logistic regression. Journal of Memory and Language, 59(4), 457-474. DOI: 10.1016/j.jml.2007.09.002
Gelman A. (2005). Analysis of variance -- why it is more important than ever. Annals of Statistics, 33(1), 1-33 arXiv: math/0504499v2
Henrich J., Heine S.J., & Norenzayan A. (2010). The weirdest people in the world? The Behavioral and Brain Sciences, 33(2-3), 61-83 PMID: 20550733
Mirman D., Dixon J.A., & Magnuson J.S. (2008). Statistical and computational models of the visual world paradigm: Growth curves and individual differences. Journal of Memory and Language, 59(4), 475-494 PMID: 19060958
Mirman D., & Graziano K.M. (2012). Damage to temporo-parietal cortex decreases incidental activation of thematic relations during spoken word comprehension. Neuropsychologia, 50(8), 1990-1997 PMID: 22571932

Tuesday, August 7, 2012

Customizing ggplot graphs

There are many things I love about the R package ggplot2. For the most part, they fall into two categories:

  1. The "grammar of graphics" approach builds a hierarchical relationship between the data and the graphic, which creates a consistent, intuitive (once you learn it), and easy-to-manipulate system for statistical visualization. Briefly, the user defines a set of mappings ("aesthetics", in the parlance of ggplot) between variables in the data and graph properties (e.g., x = variable1, y = variable2, color = variable3, ...) and the visual realizations of those mappings (points, lines, bars, etc.), then ggplot does the rest. This is great, especially for exploratory graphing, because I can visualize the data in lots of different ways with just minor edits to the aesthetics.
  2. Summary statistics can be computed "on the fly". So I don't need to pre-compute sample means and standard errors, I can just tell ggplot that this is what I want to see and it will compute them for me. And if something doesn't look right, I can easily visualize individual participant data, or I can look at sample means excluding some outliers, etc. All without creating separate copies of the data tailored to each graph.
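Both points can be sketched in a few lines (using the built-in mtcars data rather than anything from my studies):

```r
# Aesthetics map data variables to graph properties; stat_summary computes
# the cell means and standard errors on the fly, with no pre-computed
# summary data frame.
library(ggplot2)
ggplot(mtcars, aes(x = factor(cyl), y = mpg, color = factor(am))) +
  stat_summary(fun.data = mean_se, geom = "pointrange")
```

Swapping the stat_summary layer for geom_point() shows the individual observations instead, without touching the data.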
This great functionality comes at a price: customizing graphs can be hard. In addition to the ggplot documentation, the R Cookbook is a great resource (their section on legends saved me today) and StackOverflow is a fantastic Q&A site. Today I also stumbled onto a very detailed page showing how to generate the kinds of graphs that are typical for psychology and neuroscience papers. These are quite far from the ggplot defaults and my hat is off to the author for figuring all this out and sharing it with the web.

Monday, August 6, 2012

Crawford-Howell (1998) t-test for case-control comparisons

Cognitive neuropsychologists (like me) often need to compare a single case to a small control group, but the standard two-sample t-test does not work for this because the case is only one observation. Several different approaches have been proposed and in a new paper just published in Cortex, Crawford and Garthwaite (2012) demonstrate that the Crawford-Howell (1998) t-test is a better approach (in terms of controlling Type I error rate) than other commonly-used alternatives. As I understand it, the core issue is that with a typical t-test, you're testing whether two means are different (or, for a one-sample t-test, whether one mean is different from some value), so the more observations you have, the better your estimate of the mean(s). In a case-control comparison you want to know how likely it is that the case value came from the distribution of the control data, so even if your control group is very large, the variability is still important -- knowing that your case is below the control mean is not enough, you want to know that it is below 95% (for example) of the controls. That is why, as Crawford and Garthwaite show, Type I error increases with control sample size for the other tests, but not for the Crawford-Howell test.

It is nice to have this method validated by Monte Carlo simulation and I intend to use it next time the need arises. I’ve put together a simple R implementation of it (it takes a single value as case and a vector of values for control and returns a data frame containing the t-value, degrees of freedom, and p-value):
CrawfordHowell <- function(case, control){
  tval <- (case - mean(control)) / (sd(control) * sqrt((length(control) + 1) / length(control)))
  degfree <- length(control) - 1
  pval <- 2 * (1 - pt(abs(tval), df = degfree))  # two-tailed p-value
  result <- data.frame(t = tval, df = degfree, p = pval)
  return(result)
}
Crawford, J. R., & Howell, D. C. (1998). Comparing an individual's test score against norms derived from small samples. The Clinical Neuropsychologist, 12(4), 482-486. DOI: 10.1076/clin.12.4.482.7241
Crawford, J. R., & Garthwaite, P. H. (2012). Single-case research in neuropsychology: A comparison of five forms of t-test for comparing a case to controls. Cortex, 48 (8), 1009-1016 DOI: 10.1016/j.cortex.2011.06.021

Friday, August 3, 2012

A lexicon without semantics?

I spend a lot of time thinking about words. The reason I am so focused on words is that they sit right at that fascinating boundary between “perception” and “cognition”. Recognizing a spoken word is essentially a (rather difficult) pattern recognition problem: there is a complex and variable perceptual signal that needs to be mapped to a particular word object. But what is that word object? Is it just an entry in some mental list of known words? Are perceptual properties preserved or is it completely abstracted from the surface form? Does the word object include the meaning of the word, like a dictionary entry? The entire contents of all possible meanings or just some context-specific subset?

At least going back to Morton’s (1961) “logogen” model, and including the work of Patterson & Shewell (1987) and Coltheart and colleagues (e.g., Coltheart et al., 2001), researchers have argued that the lexicon (or lexicons) must represent words in a way that is abstracted from the surface form and independent of meaning. In part, this argument was based on evidence that some individuals with substantial semantic impairments could nevertheless reasonably accurately distinguish real words from fake words (the “lexical decision” task).

An alternative approach, based on parallel distributed processing and emphasizing emergent representations (e.g., McClelland, 2010), argues that the “lexicon” is really just the intermediate representation between perceptual and semantic levels, so it will necessarily have some properties of both. Michael Ramscar conducted a very elegant set of experiments showing how semantic information infiltrates past-tense formation (Ramscar, 2002): given a novel verb like “sprink”, if participants were led to believe that it meant something like “drink”, they tended to say that the past tense should be “sprank”, but if they were led to believe that it meant something like “wink” or “blink”, then they tended to say that the past tense should be “sprinked”. In other words, past-tense formation is influenced both by meaning and surface similarity. Tim Rogers and colleagues (Rogers et al., 2004) showed that the apparent ability of semantically-impaired individuals to perform lexical decision was really based on visual familiarity: these individuals consistently chose the spelling that was more typical of English, regardless of whether it was correct or not for this particular word (for example, “grist” over “gryst”, but also “trist” over “tryst”; “cheese” over “cheize”, but also “seese” over “seize”).

Data like these have been enough to convince me that the PDP view is right, but there are a few counter-examples that I am not sure how to explain. Among them is a recent short case report of a patient with a severe semantic deficit (semantic dementia), but remarkably good ability to solve anagrams (Teichmann et al., 2012). She was able to solve 18 out of 20 anagrams (“H-E-T-A-N-L-E-P” --> “ELEPHANT”) without knowing what any of the 20 words meant. Neurologically intact age-matched controls solved essentially the same number of anagrams (17.4 ± 1.5) in the same amount of time. Related cases of “hyperlexia” (good word reading with impaired comprehension) have also been reported (e.g., Castles et al., 2010). I can imagine how a PDP account of these data might look, but to my knowledge, it has not been developed.

Castles, A., Crichton, A., & Prior, M. (2010). Developmental dissociations between lexical reading and comprehension: Evidence from two cases of hyperlexia. Cortex, 46(10), 1238-1247. DOI: 10.1016/j.cortex.2010.06.016
Coltheart, M., Rastle, K., Perry, C., Langdon, R., & Ziegler, J. (2001). DRC: A dual route cascaded model of visual word recognition and reading aloud. Psychological Review, 108(1), 204-256.
McClelland, J. L. (2010). Emergence in cognitive science. Topics in Cognitive Science, 2(4), 751-770. DOI: 10.1111/j.1756-8765.2010.01116.x
Morton, J. (1961). Reading, context and the perception of words. Unpublished PhD thesis, University of Reading, Reading, England.
Patterson, K., & Shewell, C. (1987). Speak and spell: Dissociations and word-class effects. In M. Coltheart, G. Sartori, & R. Job (Eds.), The cognitive neuropsychology of language (pp. 273-294). London: Erlbaum.
Ramscar, M. (2002). The role of meaning in inflection: why the past tense does not require a rule. Cognitive Psychology, 45(1), 45-94.
Rogers, T. T., Lambon Ralph, M. A., Hodges, J. R., & Patterson, K. E. (2004). Natural selection: The impact of semantic impairment on lexical and object decision. Cognitive Neuropsychology, 21(2-4), 331-352.
Teichmann et al. (2012). A mental lexicon without semantics. Neurology. DOI: 10.1212/WNL.0b013e3182635749

Thursday, August 2, 2012

Statistical models vs. cognitive models

My undergraduate and graduate training in psychology and cognitive neuroscience focused on computational modeling and behavioral experimentation: implementing concrete models to test cognitive theories by simulation and evaluating predictions from those models with behavioral experiments. During this time, good ol’ t-test was enough statistics for me. I continued this sort of work during my post-doctoral fellowship, but as I became more interested in studying the time course of cognitive processing, I had to learn about statistical modeling, specifically, growth curve analysis (multilevel regression) for time series data. These two kinds of modeling – computational/cognitive and statistical – are often conflated, but I believe they are very different and serve complementary purposes in cognitive science and cognitive neuroscience.

It will help to have some examples of what I mean when I say that statistical and cognitive models are sometimes conflated. I have found that computational modeling talks sometimes provoke a certain kind of skeptic to ask “With a sufficient number of free parameters it is possible to fit any data set, so how many parameters does your model have?” The first part of that question is true in a strictly mathematical sense: for example, a Taylor series polynomial can be used to approximate any function with arbitrary precision. But this is not how cognitive modeling works. Cognitive models are meant to implement theoretical principles, not arbitrary mathematical functions, and although they always have some flexible parameters, these parameters are not “free” in the way that the coefficients of a Taylor series are free.

On the other hand, when analyzing behavioral data, it can be tempting to use a statistical model with parameters that map in some simple way onto theoretical constructs. For example, assuming Stevens' power law holds (a power law relationship between physical stimulus magnitude and perceived intensity), one can collect data in some domain of interest, fit a power law function, and compute the exponent for that domain. However, if you happen to be studying a domain where the power law does not quite hold, the estimated exponent will not be very informative.
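To sketch what that fitting step looks like, here is a toy example with simulated data and an assumed "true" exponent of 0.5 (a log-log regression stands in for a full nonlinear fit):

```r
# If perceived intensity follows a power law y = k * x^a, then
# log(y) = log(k) + a*log(x), so the exponent a is the slope of a
# log-log regression.
set.seed(42)
x <- seq(1, 100, length.out = 50)              # physical magnitudes
a.true <- 0.5                                  # assumed exponent for the simulation
y <- 2 * x^a.true * exp(rnorm(50, sd = 0.05))  # multiplicative noise
fit <- lm(log(y) ~ log(x))
a.hat <- unname(coef(fit)[2])                  # recovered exponent, close to 0.5
```

The catch in the text applies directly: if the generating process were not a power law, a.hat would still come out of this procedure looking like a perfectly respectable number.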

In other words, statistical and computational models have different, complementary goals. The point of statistical models is to describe or quantify the observed data. This is immensely useful because extracting key effects or patterns allows us to talk about large data sets in terms of a small number of “effects” or differences between conditions. Such descriptions are best when they focus on the data themselves and are independent of any particular theory – this allows researchers to evaluate any and all theories against the data. Statistical models need to worry about number of free parameters and this is captured by standard goodness-of-fit statistics such as AIC, BIC, and log-likelihood.
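As a toy illustration of that parameter-counting trade-off (simulated, truly linear data; nothing here is from the studies discussed):

```r
# A quartic has more free parameters than a line, so on the same data its
# log-likelihood can never be worse; AIC adds a penalty per parameter so
# the extra flexibility has to earn its keep.
set.seed(7)
x <- 1:40
y <- 3 + 0.5 * x + rnorm(40, sd = 2)   # data generated by a straight line
m.lin <- lm(y ~ x)
m.qrt <- lm(y ~ poly(x, 4))
c(logLik.lin = as.numeric(logLik(m.lin)), logLik.qrt = as.numeric(logLik(m.qrt)))
c(AIC.lin = AIC(m.lin), AIC.qrt = AIC(m.qrt))  # with truly linear data, AIC will usually favor the line
```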

In contrast, cognitive models are meant to test a specific theory, so fidelity to the theory is more important than counting the number of parameters. Ideally, the cognitive model’s output can be compared directly to the observed behavioral data, using more or less the same model comparison techniques (R-squared, log-likelihood, etc.). However, because cognitive models are usually simplified, that kind of quantitative fit is not always possible (or even advisable) and a qualitative comparison of model and behavioral data must suffice. This qualitative comparison critically depends on an accurate – and theory-neutral – description of the behavioral data, which is provided by the statistical model. (A nice summary of different methods of evaluating computational models against behavioral data is provided by Pitt et al., 2006).

Jim Magnuson, J. Dixon, and I advocated this kind of two-pronged approach – using statistical models to describe the data and computational models to evaluate theories – when we adapted growth curve analysis to eye-tracking data (Mirman etal., 2008). Then, working with Eiling Yee and Sheila Blumstein, we used this approach to study phonological competition in spoken word recognition in aphasia (Mirman etal., 2011). To my mind, this is the optimal way to simultaneously maximize accurate description of behavioral data and theoretical impact of the research.