Psychology should slow down and collect more diverse, interpretable data: A conversation with Nick Michalak

So far this month, we have announced that Prolific's representative samples feature is coming soon, we've given you a glimpse into some of the tech minds behind Prolific, we've argued that open science is more robust science, and we've announced that our participants will be able to cash out instantly going forward. So many good things in just one month!

We're wrapping up October with a thought-provoking interview I had with Nick Michalak from the University of Michigan. From MTurkification to the Psychological Science Accelerator – Nick and I talked about a range of burning questions and issues in social science today. In the end, what emerged was a pretty clear consensus between the two of us: Psychology needs to slow down if it wants to collect more diverse, quality data. Let's hear why exactly that is.

Katia: Nick, you’re a PhD student in Psychology at the University of Michigan. What’s your area of research and what stage of your PhD are you at?

Nick: I’m a PhD candidate in the social psychology area working with Josh Ackerman. I’m currently in my 4th year of a 5-year graduate program. I’ll be going on the job market next year, so I’m currently trying to wrap up some projects.

K: What research questions or methodological developments have you been most excited about lately?

N: I think many research questions could be better tested with more diverse and larger samples. On that note, I’m very excited about the Psychological Science Accelerator, which is a distributed laboratory network of 350+ psychological science laboratories, representing more than 45 countries, that coordinates data collection for democratically selected studies.

I’m excited about some big research questions related to how people make trait judgments about others based on just their face. For example, can perceivers agree on who looks more trustworthy or dominant based on only people’s faces? A prominent model of face perception suggests two dimensions underlie all trait judgments based on faces: valence (best characterized by perceived trustworthiness) and dominance (defined as perceived intention to inflict harm). Nikolaas Oosterhof and Alexander Todorov originally tested this model on U.S. participants, and now a new group of researchers is using the Psychological Science Accelerator Network to test whether this two-dimensional structure replicates across the world.

I’m also interested in developing norms and standards around open practices. It’s been about 6-8 years since a handful of events sparked a new wave of interest in questionable research practices and their influence on false-positive error rates. It’s interesting to me how people discuss what “open science” means. For example, in 2012 and 2013 it seemed that researchers mostly thought: “Let’s get bigger samples and replicate popular effects!” Now people seem to be discussing more of the nuances of power, preregistration, replication, etc. For example, early on people seemed to focus on how preregistration constrains confirmatory tests, but now people seem to talk more about how it also increases transparency and simply provides valuable information about what was planned (and what was not planned). Even bad preregistrations can include useful information.

Finally, I’m excited to see more “participant-driven” or “data-driven” research. I think too often we lose sight of how much our beliefs influence every aspect and stage of our research designs (e.g., stimuli and/or manipulation materials, question wording). I think our field could benefit if at least during the initial stages of a research project we asked participants to tell us what they think, feel and do. To the extent we can measure these kinds of “participant-driven data” (i.e., add numbers to their more open-ended responses), we can use some “fancy” and increasingly popular techniques like multi-dimensional scaling, reverse-correlation, and cluster analyses to paint pictures of psychological phenomena that go a little further than mean plots.

K: How do you typically collect your data? Do you have any preferred way(s) of collecting data?

N: Every semester I get a couple hundred participants through my department’s undergraduate subject pool. So far, I’ve collected most of my PhD data via MTurk/TurkPrime. I’ve developed my own practices around online data collection. For example, for a long time I paid 50 cents for 10-minute surveys (so around $3-4 per hour), but lately I’ve started paying more. Also, given research suggesting that attention checks can affect responses to later questions, and that MTurkers are particularly good at passing them (e.g., the instructional manipulation check), I often don’t include them. Recently I’ve become concerned that MTurk data are getting harder to interpret, so I’ve been running a couple studies on Prolific (through my lab’s account) and I’m impressed by how much smoother it is!

K: You signed up to Prolific a while ago. What do you think about the platform we’ve built?

N: There are two reasons I hadn’t used Prolific initially: The cost is slightly higher than I’m used to paying on MTurk (because Prolific stipulates a minimum reward of $6.50 per hour) and in the early days, there was a lack of U.S. American participants. But this has changed in the past few years – Prolific now has about 10,000 U.S. Americans in the participant pool. This year, I gave a presentation about Prolific to the incoming psychology class at Michigan. Multiple people came to talk to me afterward about questions they had about data quality from MTurk, and they said they loved the Prolific philosophy about making both researchers and participants happy. I was talking to a postdoc in our lab who is totally sold on Prolific. For his first study, he had collected 200 participants, and all 200 participant responses were great, “No B.S. responses.” I wish I had used Prolific more in the past, and I will definitely use it more in the future.

K: What do you think about using attention checks in studies?

N: Originally, attention checks (e.g., instructional manipulation checks) were probably a good idea, but they’re now over-used, so participants know what to look out for. Besides, I’m pretty convinced that attention checks change how people think about the rest of the survey. They can serve as a speed bump and a signal to slow down. However, they’re likely to prompt participants to think that researchers are trying to trick them. This can be useful in some contexts, but I certainly don’t think they’re measuring only attention (definitely not on MTurk, though maybe they do work for undergrads). I think that on MTurk attention checks most likely measure pattern recognition more than attention.

K: What do you think are some of the biggest challenges that researchers face when recruiting participants online? It could be any issues you can think of. To give some examples, issues could be ethical in nature (e.g., how to appropriately compensate participants for their time?), technological in nature (e.g., can we build a platform that can reliably handle millions of live participants?), motivational in nature (e.g., how to motivate people to contribute to science over time without them getting bored or overexposed?), or they could be issues related to diversity and representativeness (e.g., how can we reach niche, under-represented samples in different corners of the world)?

N: From my perspective as a researcher, I feel that 1) motivational questions and 2) questions around diversity/representativeness are most important. Regarding motivation, as researchers we want participants to have fun, to be honest, and to be engaged. A participant platform that achieves that has the most promise for the future.

As I mentioned earlier, I feel strongly about sampling diversity: We don’t want a science that largely studies undergraduate women. We often make general claims about the psychology of “people”, but we often test these claims on a WEIRD subset of people.

In addition, ethical issues seem connected to motivational issues. When I was using MTurk I always thought: They’re volunteering time and not required to complete my study in any particular way. MTurk has been equated to low-wage, unethical labour (e.g., like sweat shops). I’d love to see a paper making the argument for why paying participants below minimum wage in the U.S. to take surveys online is unethical. I don’t want participants to hate me for their survey experience — that’s a key reason why I want to pay more. But I’ll admit that often I’m motivated to get the most interpretable data for the most reasonable amount of money (whatever that means).

K: You may have seen this new paper discussing the “MTurkification” of social and personality psychology. Here the authors argue that “the introduction of easy-to-use and inexpensive online participants has led to [a] shift that favors quick, easy studies at the expense of high-difficulty low-volume types of studies”. Further, they write that “[if] we disproportionately allocate resources (e.g., publications, tenure) to researchers and research domains that eschew difficult-to-conduct studies, then we impair the field’s ability to contribute to human welfare.” How do you feel about this perspective?

N: This resonates a lot with me. I have collected data too quickly in the past, and I have slowed down considerably lately. This allows me to be more thoughtful in terms of how I design my studies. I feel that with MTurk at our disposal, we’ve gotten used to shoehorning complex, contrived experiments into online survey experiments just so we can collect “large N” studies more quickly. This too often comes at the expense of external validity and (ironically) statistical power (because manipulations are weaker and measurements are “noisier”). This has made it harder for me to interpret research findings from MTurk samples.

K: I’m really curious: What do you think are some of the most profound changes currently happening to the way we do science? Are you hopeful for the future of psychological science?

N: I’m hopeful that people are becoming increasingly transparent about how they conduct research and about how messy their research projects really are. People are starting to care more about methods and statistics (perhaps too much about statistics and not enough about experimental design and causal inference from observational studies). All the norms and best practices coming along with these things are very interesting, too. Preregistration norms used to seem like the “Wild Wild West”, but I’m seeing more and more best practices emerge. For example, the Open Science Framework (OSF) has, among many other things, developed a short list of preregistration templates and best practices around preregistration more broadly. In general, they seem to have a diverse team of smart people motivated to make science better.

You can learn more about Nick and his work here. Thanks for making it all the way to the end! 😋
