Helping schools evaluate edtech

I've been helping my kids' school evaluate the evidence for a new EdTech program. It's been such a humbling experience, seeing how evidence is presented to school administrators and teachers and how difficult it is for them to make sense of this evidence.

I wrote up the following as a few tips about how to evaluate evidence that is reported in "research reports" presented on EdTech and other curricular websites. No promises that this is complete, but thought I'd share it in case it's useful to someone else.

Beth's Tips for Evaluating Programs

A large sample size is often cited as evidence that a program works. However, while a large study provides more evidence than a small one, you need more information to determine whether the sample is relevant to your school. For example, you should ask:

Were schools with similar student backgrounds, prior test scores, and baseline curriculum to my own included in this study?

A small p-value (“statistical significance”) is often trumpeted as evidence that the program works. This is not necessarily true for two reasons:

First, a p-value is about comparisons between groups. In order for the results of this test to matter, the two groups need to be equivalent (see below).
Second, when sample sizes are large, small p-values can indicate that there are differences between two groups, but these differences may be very small and not substantively meaningful.
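
To make the second point concrete, here is a minimal sketch in Python. The effect size (0.02 standard deviations) and the sample sizes are made-up numbers, and the calculation uses a simple normal approximation rather than whatever test a given report actually ran. It shows the same tiny difference being "non-significant" in a small study and highly "significant" in a very large one.

```python
import math

def two_sided_p_value(effect_size_d, n_per_group):
    """Approximate two-sided p-value for a standardized mean difference
    between two equal-sized groups (normal approximation, large samples)."""
    # Approximate standard error of a standardized mean difference
    standard_error = math.sqrt(2.0 / n_per_group)
    z = effect_size_d / standard_error
    # erfc(|z| / sqrt(2)) is the two-sided tail area of the standard normal
    return math.erfc(abs(z) / math.sqrt(2.0))

tiny_effect = 0.02  # hypothetical: 2% of a standard deviation

p_small_study = two_sided_p_value(tiny_effect, n_per_group=100)
p_large_study = two_sided_p_value(tiny_effect, n_per_group=50_000)

print(f"n = 100 per group:    p = {p_small_study:.3f}")
print(f"n = 50,000 per group: p = {p_large_study:.4f}")
```

The difference between the groups is identical in both cases; only the sample size changed. The p-value in the large study falls well below .05 even though a 0.02 standard deviation difference is unlikely to matter in a classroom.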

The design of the study is often downplayed in a report, but it is the most important criterion you have for evaluating efficacy.

In a pre-test, post-test design, students' post-test scores are compared to their pre-test scores to determine gain scores. The difficulty here is that students would have grown without the program anyway. The question is not whether they grew, but how much they grew relative to how much they would have grown without the program.
In a correlational design, some students using the program are compared to some students who are not. The important question here is: why did some receive the program while the others did not? You want to make sure that it’s not the case that the “best” students received the program and the “worst” students did not, thus tipping the results in favor of the program from the outset. (Be skeptical if the comparison group is a “national sample”, which is often very different from the sample using the program.)
In a randomized design, students, classrooms, or schools are randomly assigned either to a group that receives the program or to a group that does not. This is the best design because random assignment makes the two groups equivalent, on average, at the beginning of the study. Thus any differences observed between the two groups at the end of the study can be attributed to the program rather than to pre-existing differences.
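
A small simulation can show why the design matters so much. This is only a sketch with invented numbers: every student is assumed to gain about 10 points in a year no matter what, and the program is assumed to add nothing at all. The pre-test, post-test comparison still shows an impressive gain, while the randomized comparison correctly shows roughly zero.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

NATURAL_GROWTH = 10.0  # hypothetical points every student gains in a year
PROGRAM_EFFECT = 0.0   # hypothetical: the program adds nothing

def simulate_student(gets_program):
    """Return (pre, post) scores for one simulated student."""
    pre = random.gauss(50.0, 10.0)
    effect = PROGRAM_EFFECT if gets_program else 0.0
    post = pre + NATURAL_GROWTH + effect + random.gauss(0.0, 5.0)
    return pre, post

program_gains, comparison_gains = [], []
for _ in range(4000):
    gets_program = random.random() < 0.5  # coin-flip assignment
    pre, post = simulate_student(gets_program)
    if gets_program:
        program_gains.append(post - pre)
    else:
        comparison_gains.append(post - pre)

# Pre-test/post-test design: looks only at the program group's gain
pre_post_gain = statistics.mean(program_gains)

# Randomized design: program group's gain relative to the comparison group's
randomized_estimate = pre_post_gain - statistics.mean(comparison_gains)

print(f"Pre/post gain in program group: {pre_post_gain:.1f} points")
print(f"Randomized estimate of effect:  {randomized_estimate:.1f} points")
```

The gain of about 10 points in the program group is real growth, but none of it is due to the program. Only the comparison with an equivalent group reveals that.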

Reports might focus on “high implementers” within the group receiving the program. This is tricky and should prompt questions. What percentage of students and classrooms are high implementers? How do high implementers differ from the rest of the students in pre-test scores and demographics? Why aren’t all students high implementers?

Analyses that focus on high implementers are often produced because analyses of all students did not show adequate effects.
When not all students are high implementers, it tells you that there may be problems with implementation of this program. It could be that the technology is difficult to use. It could be that teachers do not like the program. It could be that only the “best” students are able to implement the program well, while others cannot.
Are your school and students likely to be like these high implementers? The demographics and test scores that matter now are not those of the whole study sample, but those of the high implementers alone.

Analyses should make the two groups as equivalent as possible. Were the two groups equivalent at the beginning of the study? If not, were the differences between the groups adjusted for in the analysis? If the groups differed at the start, only the adjusted results are relevant. (Look for the ANCOVA-adjusted effects.)
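
To give a feel for what an adjustment does, here is a rough sketch of the simplest ANCOVA-style calculation: the difference in post-test means, minus the part that is explained by the groups' difference at pre-test. The scores are invented, the function name is my own, and a real analysis would use statistical software and report standard errors. The sketch assumes a single common slope relating pre-test to post-test in both groups.

```python
from statistics import mean

def ancova_adjusted_difference(pre_t, post_t, pre_c, post_c):
    """Difference in post-test means, adjusted for pre-test differences.

    Assumes one common within-group slope of post-test on pre-test.
    """
    def within_group_sums(pre, post):
        pre_mean, post_mean = mean(pre), mean(post)
        sxy = sum((x - pre_mean) * (y - post_mean) for x, y in zip(pre, post))
        sxx = sum((x - pre_mean) ** 2 for x in pre)
        return sxy, sxx

    sxy_t, sxx_t = within_group_sums(pre_t, post_t)
    sxy_c, sxx_c = within_group_sums(pre_c, post_c)
    pooled_slope = (sxy_t + sxy_c) / (sxx_t + sxx_c)

    raw_difference = mean(post_t) - mean(post_c)
    baseline_gap = mean(pre_t) - mean(pre_c)
    return raw_difference - pooled_slope * baseline_gap

# Hypothetical scores: the program group started 20 points ahead.
comparison_pre, comparison_post = [40, 50, 60, 70], [50, 60, 70, 80]
program_pre, program_post = [60, 70, 80, 90], [72, 82, 92, 102]

raw = mean(program_post) - mean(comparison_post)
adjusted = ancova_adjusted_difference(
    program_pre, program_post, comparison_pre, comparison_post
)
print(f"Raw difference at post-test: {raw} points")
print(f"Adjusted difference:         {adjusted} points")
```

In this made-up example the raw post-test difference is 22 points, but 20 of those points were already there at pre-test. The adjusted difference is 2 points, which is the number that speaks to the program.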

Reports only sometimes provide an effect size - a standardized measure that expresses the difference in growth between the program group and the comparison group as a proportion of a standard deviation. These are typically reported as Cohen’s d or Hedges’ g. These are important.

Even though these are standardized, they can still be difficult to interpret. Bloom, Hill, Black, and Lipsey (2008) provide useful metrics for comparison.
When only F-tests or t-tests are reported, you can calculate these effect sizes yourself using online calculators.
In general, Bloom and colleagues show that the following amounts of growth, in effect-size units, are typical for ELA (English language arts) by grade:

Grade:                   1     2     3     4     5
Typical 1-year growth:   .97   .60   .36   .40   .32

This means that students tend to grow more in early grades than later grades anyway. Thus for a program to be unusually efficacious in an early grade, it needs to have a very large effect size.
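
If you would rather not rely on an online calculator, the conversion from a t-test to an effect size is short enough to sketch in Python. The t statistic and group sizes below are hypothetical, the function names are my own, and the formulas are the standard textbook ones for two independent groups. The last step compares the result to the typical yearly growth figures listed above.

```python
import math

# Typical 1-year ELA growth in effect-size units, as listed above
TYPICAL_ANNUAL_GROWTH = {1: 0.97, 2: 0.60, 3: 0.36, 4: 0.40, 5: 0.32}

def cohens_d_from_t(t_stat, n_program, n_comparison):
    """Cohen's d from an independent-samples t statistic."""
    return t_stat * math.sqrt(1.0 / n_program + 1.0 / n_comparison)

def hedges_g(d, n_program, n_comparison):
    """Apply the usual small-sample correction to Cohen's d."""
    df = n_program + n_comparison - 2
    return d * (1.0 - 3.0 / (4.0 * df - 1.0))

def share_of_typical_growth(effect_size, grade):
    """Effect size as a fraction of one typical year of growth in that grade."""
    return effect_size / TYPICAL_ANNUAL_GROWTH[grade]

# Hypothetical report: t = 2.5 with 100 students per group
d = cohens_d_from_t(2.5, n_program=100, n_comparison=100)
g = hedges_g(d, n_program=100, n_comparison=100)

for grade in (1, 5):
    share = share_of_typical_growth(g, grade)
    print(f"Grade {grade}: g = {g:.2f} is {share:.0%} of a typical year")
```

With these invented numbers, the same effect size amounts to roughly a third of a typical year of growth in Grade 1 but more than a full year in Grade 5. If a report gives only an F statistic comparing two groups, its square root can be used as the t statistic, though the sign (which group did better) has to be read from the means.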

Reports often provide information on the “average” improvement across grades. This can include averaging across all K-12 grades! It is easy to imagine that a program might be effective in Grade 5 but not in kindergarten.

Ask for and examine the grade-by-grade results.
Examine the general trend - is it equally efficacious across grades? Does the effect seem to grow?
This information can help you determine the grades in which a program should be implemented. What is best for Grade 5 may not be best for kindergarten.
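
Here is a quick illustration of how an average can hide the grade-by-grade story. All of the effect sizes and student counts below are invented for the example; the point is only the arithmetic.

```python
# Hypothetical grade-level results: grade -> (effect size, number of students)
results_by_grade = {
    "K": (-0.05, 400),
    "1": (0.00, 400),
    "2": (0.05, 400),
    "3": (0.15, 300),
    "4": (0.25, 300),
    "5": (0.35, 300),
}

total_students = sum(n for _, n in results_by_grade.values())

# Sample-size-weighted average, the kind of single number a report might headline
average_effect = (
    sum(effect * n for effect, n in results_by_grade.values()) / total_students
)

weakest_grade = min(results_by_grade, key=lambda grade: results_by_grade[grade][0])
strongest_grade = max(results_by_grade, key=lambda grade: results_by_grade[grade][0])

print(f"Average effect across grades: {average_effect:.2f}")
for grade, (effect, n) in results_by_grade.items():
    print(f"  Grade {grade}: effect = {effect:+.2f} (n = {n})")
print(f"Weakest: Grade {weakest_grade}, strongest: Grade {strongest_grade}")
```

In this made-up case the headline average is a modest positive number, yet the program does nothing (or slightly worse than nothing) in the earliest grades and does quite well in Grade 5. A school deciding where to implement the program would reach very different conclusions from the average than from the table.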

Overall, be a smart, skeptical consumer. Ask questions. A well-balanced report will focus on the design of the study and will include all analyses - including those that do not result in large gains or promises of efficacy. What is not shown or reported is often as important as what is reported.