
The Center for the Enhancement of Learning, Teaching, and University Assessment

Assessment Projects

Outcomes in the Majors PSY Example

January 5, 2010

To: Psychology Faculty

From: Cecilia Shore, Beth Uhler & Joe Johnson

Greetings from the committee on Assessment of Outcomes in the Major! We are writing to bring you up to date on our process and results so far, and, ultimately, ask for your guidance as to the next steps the department should take. These will be discussed at the faculty meeting January 15.

We would like to thank the other members of the Assessment Team: Len Mark, Peter Wessels, Ann Fuehrer, Paul Flaspohler, and Amanda Diekman. We would also like to thank the many faculty members and grad students who helped by allowing data collection in their classes, or by encouraging their students to participate.

Project history

In the spring of 2006, at the direction of the Liberal Education office, Psychology, along with the other social science departments, began discussing methods of assessing learning outcomes within our majors. An examination of project assignments in capstone course syllabi indicated that faculty highly value critical thinking, research proficiency and application as outcomes among our majors. Previous attempts to assess these outcomes have included using a rubric to measure research methods performance among undergraduate students presenting at departmental poster day, and using a rubric to evaluate research projects conducted by students in capstone courses. These attempts have been marred by unrepresentative samples and the lack of a common assignment to which students were responding.

The current iteration: Goals and learning outcomes to be assessed

This project brought together two initiatives, the Assessment in Majors project and the Top 25 project. A goal of both of these groups is to assess the critical thinking skills of our majors as they relate to understanding research methods. The departmental assessment team and the Top 25 teams met in a joint session in Spring 2008. We agreed that we would like to develop a common structure for assessments at several key points in our students' progress through the curriculum: before and after Introductory Psych (PSY 111), before and after Research Methods (PSY 294), and before and after capstone experiences.

We composed a list of learning objectives that we would like to assess, from a number of sources:

  • Decision making. Recognize biases and failures to consider evidence in decision making.
  • Scientific Thinking. Understand the scientific method, both in theory and in practice.
  • Conceptualization and theory-making, e.g., qualities of good theory
  • Falsifiability. Recognize the role of falsifiability in scientific theories. (short answers)
  • Recognize ethical obligations in research, and limits of contrived laboratory situations.
  • Measurement, e.g., reliability & validity
  • Case Studies and Personal 'Testimonials' (short answers)
  • Experiments, Correlation and Causation
  • Simple Causes, Interactions and Multiple Causality
  • Variability, Chance, e.g., gambler's fallacy
  • Data description, e.g., explain basic descriptive statistics, including graphical displays.
  • Data analysis, e.g., choice of appropriate statistical analysis
  • Data interpretation, e.g., recognize flaws in study (short answers)
  • Information literacy, e.g., plagiarizing vs paraphrasing.


We designed two forms of a quiz intended to measure these learning outcomes. The questions were mostly multiple choice, with a few short-answer questions. We strove to avoid jargon in the questions: we wanted to know whether students knew the concept (e.g., comparison of independent means), rather than whether they remembered the term (e.g., between-groups t-test). The two forms were intended to be parallel, i.e., the same concepts assessed with two different 'cover stories'. Each form ended up with approximately 40 questions. To enable in-class administration, each form was split into two subforms of 20 to 21 items, with an effort to distribute the learning outcomes across the two subforms. To keep things simple, we decided to run form 1 as the pretest and form 2 as the post-test in the fall semester in PSY 111, PSY 294 and PSY 410. This of course confounded form with time of administration, so we needed to confirm that forms 1 and 2 were comparable in difficulty. Practical constraints kept us from re-running the testing in all three classes in the spring with the order of the forms reversed; instead, we did so only in PSY 294.
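One way to spread learning outcomes evenly across two subforms is to deal the items out outcome by outcome, always to the shorter subform. The sketch below is only an illustration of that idea, not the procedure the committee used; the function name, the item identifiers and the number of items per outcome are all invented for the example.

```python
from collections import defaultdict


def split_into_subforms(items):
    """Split (item_id, outcome) pairs into two subforms.

    Items are dealt outcome by outcome to whichever subform is currently
    shorter (ties go to subform A), so each outcome's items alternate
    between the subforms and the subform lengths differ by at most one.
    """
    by_outcome = defaultdict(list)
    for item_id, outcome in items:
        by_outcome[outcome].append(item_id)

    subform_a, subform_b = [], []
    for outcome in sorted(by_outcome):
        for item_id in by_outcome[outcome]:
            target = subform_a if len(subform_a) <= len(subform_b) else subform_b
            target.append(item_id)
    return subform_a, subform_b


# Hypothetical item bank: 41 items over 10 outcomes (counts are made up).
outcome_counts = {
    "ethics": 3, "measurement": 4, "variability": 3, "data_analysis": 5,
    "falsifiability": 2, "experiments": 6, "interactions": 4,
    "decision_making": 5, "info_literacy": 3, "data_description": 6,
}
items = [
    (f"{outcome}-{i}", outcome)
    for outcome, count in outcome_counts.items()
    for i in range(1, count + 1)
]

subform_a, subform_b = split_into_subforms(items)
print(len(subform_a), len(subform_b))
```

With an odd number of items one subform is necessarily one item longer, which matches the 20 to 21 item range described above.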

Psy 111 and Psy 294 students were quizzed in class (111 on paper, 294 on computer). Most 410 students were quizzed in class on paper; a few were quizzed in the computer lab. All pretests were run within the first two weeks of the semester, at the convenience of the instructor, and post-tests in the last two weeks of term. Instructors were not present for the administration. Researchers administered the consent procedure, including a lottery prize structure for motivating participation and good performance. Participants were NOT anonymous, but were assured that their performance was unrelated to the class and would not be reported to their teachers. Some teachers left a sign-in sheet for students to receive extra credit for participating (on the honor system).

Short-answer questions were graded against a key; each grader received papers from all three classes, so as not to confound rater with course. Two graders were PSY 111 teachers; they did not receive papers from their own sections to grade.
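The two grading constraints described above (every grader sees all three courses, and no teacher grades his or her own section) can be expressed as a simple rotation. This is a minimal sketch under assumed data structures, not a record of how papers were actually distributed; the grader names, section labels and paper counts are hypothetical.

```python
from collections import defaultdict
from itertools import cycle


def assign_papers(papers, graders, own_sections):
    """Rotate papers among graders within each course.

    papers: list of (paper_id, course, section) tuples.
    own_sections: dict mapping a grader to the set of (course, section)
    pairs that grader teaches; such papers are passed to the next grader.
    """
    by_course = defaultdict(list)
    for paper in papers:
        by_course[paper[1]].append(paper)

    assignments = defaultdict(list)
    for course, course_papers in by_course.items():
        rotation = cycle(graders)
        for paper_id, _, section in course_papers:
            # Try each grader at most once before giving up on this paper.
            for _ in range(len(graders)):
                grader = next(rotation)
                if (course, section) not in own_sections.get(grader, set()):
                    assignments[grader].append((paper_id, course, section))
                    break
            else:
                raise ValueError(f"no eligible grader for paper {paper_id}")
    return dict(assignments)


# Hypothetical setup: three graders, two of whom teach a PSY 111 section.
graders = ["grader1", "grader2", "grader3"]
own_sections = {
    "grader1": {("PSY111", "A")},
    "grader2": {("PSY111", "B")},
}
papers = (
    [(f"111A-{i}", "PSY111", "A") for i in range(6)]
    + [(f"111B-{i}", "PSY111", "B") for i in range(6)]
    + [(f"294-{i}", "PSY294", "A") for i in range(6)]
    + [(f"410-{i}", "PSY410", "A") for i in range(6)]
)

assignments = assign_papers(papers, graders, own_sections)
for grader, graded in assignments.items():
    print(grader, sorted({course for _, course, _ in graded}))
```

The rotation keeps rater from being confounded with course, but it does not equalize workloads: a grader with no teaching conflicts picks up the papers the others must skip.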

Key points from the results (details posted below):

  1. Were the forms equivalent in difficulty? No. Form 2 was more difficult than Form 1, regardless of whether it was administered as a pretest or post-test.
  2. Are there gains in performance within semesters and across courses? Post-test scores for students in PSY111 and PSY410 were greater than their pre-test scores, but the opposite was true for students in PSY294. We're not sure why. It could be 'overthinking', or maybe something about taking the harder form 2 on computer as opposed to paper.
  3. How do students in different courses do on different learning outcomes? Items showing reasonably clear evidence of improvement with instruction: conceptualization, measurement, experiments, interactions, variability. Items showing very little benefit from instruction: using evidence, data analysis. Info literacy showed a ceiling effect.

Future steps

  • The Fall 2008 data were collected before the re-design to be consistent with Top 25 goals. It would be very interesting to simply repeat the survey in 111. We would need to include both re-designed sections and traditional sections, since another thing that has changed is the AP course cut-off.
  • The information about individual items could be used to construct a psychometrically valid standard assessment tool that could be used for future curricular design in PSY 293 and PSY 294 as well as plans to vertically integrate inquiry throughout the curriculum.
  • The short-answer items are potentially a very rich source of data; however, many students simply left them blank, perhaps because they were at the end of the quiz. It would be useful to figure out a way of readministering these items that would motivate students more, e.g., asking ONLY those questions.

More information about results

1) Were the forms equivalent in difficulty? We compared PSY 294 students in the fall (who got form 1a or 1b as the pretest and form 2a or 2b as the post-test) with students in the spring, who got the reverse. Sample sizes are below.

Between-Subjects Factors (table of sample sizes)

Graph showing percent correct on each form as pretest vs post test
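The logic of this crossover check can be sketched in a few lines: average percent correct within each form-by-position cell, then ask whether one form scores lower in both positions. The scores below are invented purely to make the example run; they are not the study data, and the function names are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def cell_means(records):
    """Average percent correct for each (form, position) cell."""
    cells = defaultdict(list)
    for form, position, pct_correct in records:
        cells[(form, position)].append(pct_correct)
    return {cell: mean(scores) for cell, scores in cells.items()}


def harder_form(means):
    """Return the form that scores lower as both pretest and post-test.

    Returns None when the direction differs between positions, in which
    case form and time of administration cannot be separated this way.
    """
    pre_gap = means[("form1", "pre")] - means[("form2", "pre")]
    post_gap = means[("form1", "post")] - means[("form2", "post")]
    if pre_gap > 0 and post_gap > 0:
        return "form2"
    if pre_gap < 0 and post_gap < 0:
        return "form1"
    return None


# Invented scores: the fall cohort took form 1 then form 2, spring the reverse.
records = [
    ("form1", "pre", 50), ("form1", "pre", 54),    # fall pretest
    ("form2", "post", 40), ("form2", "post", 44),  # fall post-test
    ("form2", "pre", 38), ("form2", "pre", 42),    # spring pretest
    ("form1", "post", 52), ("form1", "post", 56),  # spring post-test
]

means = cell_means(records)
print(means)
print(harder_form(means))
```

This only describes the direction of the cell means; deciding whether the difference is reliable still requires an inferential test.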

2) Are there gains in performance within semesters and across courses?

An analysis was performed on the pre- and post-test survey data from three courses from Fall 2008, PSY111 (N = 255), PSY294 (N = 48), and PSY410 (N = 30). Specifically, a 2 (order: pre, post) x 3 (course: PSY111, PSY294, PSY410) between-participants analysis of variance was conducted on the total percentage correct responses to the survey questions. Only students who participated in both the pre-test and the post-test were included; however, since the pre-test and post-test were different forms, a difference score would not have been meaningful.

The results indicate that post-test scores (M = 46.22%, SD = 14.28%) were significantly greater than pre-test scores (M = 39.82%, SD = 12.69%), F(1,331) = 6.81, p < .05. There was also a significant main effect of course, F(2,331) = 22.36, p < .001. Scores for students in PSY410 (M = 49.97%, SD = 15.64%) were greater, although not significantly so, than scores for students in PSY294 (M = 48.69%, SD = 14.00%). Scores for students in PSY111 (M = 40.81%, SD = 12.95%) were significantly lower than scores for students in PSY294 and PSY410. Finally, there was a significant interaction between order and course, F(2,331) = 11.61, p < .001. As shown in the figure below, post-test scores for students in PSY111 and PSY410 were greater than their pre-test scores, but the opposite was true for students in PSY294.

Graph showing estimated marginal means

3) How do students in different courses do on different learning outcomes?

Recall that there were 15 learning objectives for which we wrote items. The descriptive data from the pre-test and post-test are pasted below. Recall that pre-post comparisons are not really valid, since the post-test was a different form than the pre-test. Also recall that at both time points, there were two different subforms, which have been collapsed for these analyses. So, for example, the "ethics" question in Form A might be asking about informed consent, and the same learning objective might be covered in Form B by asking about deception or something else. These separate items were then combined under the "Ethics" learning objective in these graphs. We did not do inferential tests of these many comparisons. (In both cases, info literacy for 410 was comparable to 294, though it didn't survive copy and paste.) A very rough 'eyeballing' of the data suggests some patterns:


Items showing substantial improvements across courses

Items showing little improvement and performance under 50%

Items showing little improvement and performance above 50%


  • Conceptualization
  • Ethics
  • Experiments vs corr
  • Variability
  • Central tendency
  • Data description
  • Using evidence
  • Measurement
  • Interactions
  • Data analysis
  • Scientific method
  • Info literacy


  • Scientific method
  • Conceptualization
  • Ethics??
  • Measurement
  • Experiments
  • Interactions
  • Variability
  • Data description
  • Using evidence
  • Central tendency
  • Data analysis
  • Info literacy

A highly speculative summary might be offered:

  • Items showing reasonably clear evidence of improvement with instruction (there are gains across courses at post-test, and pretest scores hold reasonably steady at the levels obtained in the earlier post-tests): conceptualization, measurement, experiments, interactions, variability
  • Items showing perhaps some benefit from instruction: central tendency, data description
  • Items showing very little benefit from instruction: using evidence, data analysis. Info literacy showed a ceiling effect.
  • Items with puzzling patterns: ethics, scientific method.

Graph showing proportion correct, pretest

Graph showing proportion correct, posttest
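Proportions like those plotted above, collapsed across subforms, could be tallied as in the following sketch. It assumes one record per student response, tagged with course, subform, learning objective and correctness; the record layout and the sample responses are invented for illustration and are not the study data.

```python
from collections import defaultdict


def proportion_correct_by_objective(responses):
    """Proportion correct per (course, objective), ignoring subform.

    responses: iterable of (course, subform, objective, is_correct).
    Items from different subforms that target the same objective are
    pooled, mirroring the collapsing described in the text.
    """
    tallies = defaultdict(lambda: [0, 0])  # [number correct, number answered]
    for course, _subform, objective, is_correct in responses:
        tally = tallies[(course, objective)]
        tally[0] += int(is_correct)
        tally[1] += 1
    return {key: correct / total for key, (correct, total) in tallies.items()}


# Invented responses, for illustration only.
responses = [
    ("PSY111", "1a", "ethics", True),
    ("PSY111", "1b", "ethics", False),
    ("PSY111", "1a", "ethics", True),
    ("PSY111", "1b", "ethics", True),
    ("PSY294", "1a", "ethics", True),
    ("PSY294", "1b", "ethics", True),
    ("PSY111", "1a", "variability", False),
    ("PSY111", "1b", "variability", True),
]

proportions = proportion_correct_by_objective(responses)
for (course, objective), value in sorted(proportions.items()):
    print(course, objective, value)
```

Pooling across subforms assumes the paired items are of similar difficulty, which is the same caveat that applies to the graphs themselves.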