Economist Doug Allen’s new study, published earlier this month, found lower high-school graduation rates among children of same-sex couples than among those of opposite-sex couples and even single parents. (A brief and informative summary of the study is available here.) Meanwhile, most, though not all, prior studies have concluded that there are “no differences” in outcomes for children raised by same-sex couples.

Obviously, these two claims cannot both be true. So what differences between the studies lead them to such different results?

To put it bluntly, statistics is a messy science. Previous statistical studies have not established a sound consensus because almost all of them have been substantially subpar. Unfortunately, this point has rarely been conveyed in articles reporting on these studies. By and large, earlier studies have used nonrandom sampling and sample sizes that are too small to detect even large differences. Although Allen’s study is limited in some respects, it tells us more than almost all of the previous studies on the subject, because Allen is able to analyze a large number of children from a random, representative sample. In fact, when these two basic selection criteria (random sampling and a large sample) are applied, only one other study makes the cut.

Because of a scarcity of data on the subject, much of the social science around same-sex parenting has deviated from the ideal in order to glean what knowledge can be obtained from limited data sets. While this is understandable, it is nevertheless a concession. Researchers should not “settle” by prematurely accepting a consensus about what’s possible in select same-sex parenting communities rather than pursuing the truth about what is probable in the population of same-sex households at large. Whether it is probable that children of same-sex couples fare better or worse than those of opposite-sex couples still remains largely unknown because of the failings of the vast majority of prior studies.

While no statistical study is perfect, most of the previous literature falls well short of even basic criteria for making convincing arguments. In terms of random sampling and large sample size, which are the most basic of social scientific criteria, the vast majority of the previous literature simply fails. Meanwhile, among the small number of studies that pass even these foundational tests, the decision is still split.

How Big Is Big Enough?

Allen’s study has far more cases than most of its peers, with nearly 1400 young adults raised by same-sex couples. Meanwhile, among studies conducted prior to 2010 (which constitute forty-four of the fifty studies reviewed by Allen), the average sample size was just sixty-nine children raised by gay or lesbian couples. The largest of these early studies included only 475 children, making it just barely acceptable in terms of sample size.

Why is a small sample size a damning characteristic of such studies?

Suppose a researcher wishes to compare two groups based on their likelihood of exhibiting a certain behavior (in the case of Allen’s new study, completing high school). Further, suppose that the true but unknown graduation rate in the underlying population is 87% for Group A and 90% for Group B. This means that the failure rates for the two groups are 13% and 10%, respectively. Now, a 3-percentage-point difference may not seem like much, but it would represent a 30% increase in the rate at which children fail to finish high school in Group A as compared to Group B. This indicates a dramatic distinction that could have far-reaching implications for such children’s future job prospects and quality of life.
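The arithmetic behind the gap between "3 percentage points" and "30%" can be checked in a few lines. This is only an illustrative sketch; the function name is hypothetical, and the two rates are the hypothetical Group A and Group B values from the paragraph above.

```python
def failure_rate_gap(grad_rate_a, grad_rate_b):
    """Return the absolute and relative gap in non-completion rates.

    Both arguments are graduation rates expressed as fractions (0.87 = 87%).
    """
    fail_a = 1 - grad_rate_a          # share of Group A not finishing
    fail_b = 1 - grad_rate_b          # share of Group B not finishing
    absolute = fail_a - fail_b        # gap in percentage-point terms
    relative = absolute / fail_b      # gap relative to Group B's failure rate
    return absolute, relative


absolute, relative = failure_rate_gap(0.87, 0.90)
print(f"absolute gap: {absolute * 100:.1f} percentage points")
print(f"relative increase in failure rate: {relative * 100:.0f}%")
```

The same 3-point gap looks small against the graduation rates (87% versus 90%) and large against the failure rates (13% versus 10%), which is why the choice of baseline matters when describing it.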

A researcher interested in estimating group differences would collect data and then assess differences between the two groups. To detect a statistically significant difference between the two groups 80% of the time (the conventional standard for statistical power), she would need a sample of 785 persons from each of the two groups.

Put another way, with a sample of only seventy (roughly the mean for earlier studies on same-sex parenting), a researcher would expect to detect a real difference only 21% of the time. In other words, even if genuine and profound differences existed in reality, we should expect 79% of small-sample studies to document no differences. So a large body of small studies in which no differences were detected is not surprising, but it also tells us very little about what may or may not actually be the case.
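The link between sample size and the chance of detecting a real difference can be sketched with a textbook power calculation. The sketch below assumes a two-sided test of two independent proportions at the 5% significance level, using the normal approximation, with equal group sizes. The article does not say which test or settings lie behind its figures, so the numbers this sketch produces depend on these assumptions and need not match the ones quoted above. The function names are hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution, mean 0 and sd 1


def power_two_proportions(p_a, p_b, n_per_group, alpha=0.05):
    """Approximate power of a two-sided z-test comparing two proportions."""
    # Standard error of the difference between the two sample proportions.
    se = sqrt(p_a * (1 - p_a) / n_per_group + p_b * (1 - p_b) / n_per_group)
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    shift = abs(p_a - p_b) / se  # true difference in standard-error units
    # Probability that the test statistic lands in either rejection tail.
    return STD_NORMAL.cdf(shift - z_crit) + STD_NORMAL.cdf(-shift - z_crit)


def n_per_group_for_power(p_a, p_b, power=0.80, alpha=0.05):
    """Approximate per-group sample size needed to reach the target power."""
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_power = STD_NORMAL.inv_cdf(power)
    variance = p_a * (1 - p_a) + p_b * (1 - p_b)
    return ceil((z_crit + z_power) ** 2 * variance / (p_a - p_b) ** 2)


if __name__ == "__main__":
    needed = n_per_group_for_power(0.87, 0.90)
    print(f"per-group n for 80% power (under these assumptions): {needed}")
    for n in (70, 475, 1400):
        print(f"n = {n:>4} per group -> power {power_two_proportions(0.87, 0.90, n):.2f}")
```

Whatever the exact settings, the qualitative pattern is the one the text describes: for a gap of a few percentage points, power stays low at sample sizes in the tens and rises only as samples reach the hundreds or thousands.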

Thus, on the basis of sample size alone, prudence recommends skepticism toward any premature, preemptive consensus that there are “no differences” between children raised by same-sex couples and those raised by opposite-sex couples. With so few studies boasting adequate sample sizes, much more research needs to be done to establish whether children raised by same-sex couples fare better, worse, or about the same as those raised by opposite-sex couples.

The “Random” Sample as the Gold Standard

Professor Allen’s study is also better than most of its peers because he uses randomly drawn samples, a characteristic that is of paramount importance for making inferences about differences between groups. In the absence of random sampling, it often is the case that any nonrandomness in the recruitment process is also in some way related to the variables of interest (same-sex vs. opposite-sex parenting) and the outcome measure (high school graduation), thereby biasing the results. So, while nonrandom samples can still illuminate when ideal sampling is not possible, they are not the basis on which to foster a scholarly consensus. Of the reviewed studies on same-sex parenting, only seven have employed random samples, and among these the conclusion is split nearly down the middle between differences and no differences.

Random sampling, however, creates problems for researchers studying small populations. Minority groups such as gay or lesbian couples that are parenting can be quite difficult to locate randomly, even in a large population with thousands of respondents.

Most scholars of sexual orientation suggest that 2-4% of the population is gay or lesbian. Focus in on the couples among them, and then those couples that are parenting, and you quickly watch a small segment of the population get even smaller. Random samples that are large enough can therefore be extremely expensive to obtain, and budget constraints have limited much of the previous literature about same-sex parenting to nonrandom samples. Just how representative of the population these samples are is often a mystery.

For example, one longstanding data collection effort, yielding well over a dozen peer-reviewed publications declaring “no differences,” recruited children raised by lesbian couples at women’s bookstores in San Francisco, Washington, and Boston. It’s reasonable to expect that women who frequent such bookstores are likely to be better educated on average than the population as a whole, and so one ought to expect for that reason alone that their children would fare better along educational measures than the average child raised by average lesbian parents.

To claim “no differences” between a nonrandom sample and the population as a whole really doesn’t tell us much about how each group actually performs. It tells us what’s possible among a small group of people, not what’s probable or likely on a large scale. However, Professor Allen’s sample, like a small number of others, is both large and random; thus, it does not suffer from these problems.

The Allen Study: How Close to Ideal Is It?

Random sampling and large sample sizes are necessary attributes for a sound science, and Allen’s study possesses them both. Yet these tests are really only prerequisites for an ideal study. So how does Allen’s study stack up against those others that used large, random samples?

It turns out that when we impose these two very basic qualifications, only one other study remains. Allen spends considerable space in his new study documenting the methodological differences between his analyses of the Canadian census and the other study using a large, randomly selected sample: Michael J. Rosenfeld’s analysis of US Census data. Unfortunately for all of us, the US Census does not ask about same-sex coupling, relationships, or behavior. This raises the risk of miscoding in two ways: two persons of the same gender who merely share a domicile may be counted as a same-sex couple, and an opposite-sex couple may be counted as a same-sex couple when one spouse’s gender is entered incorrectly on the census form. While it is likely that Rosenfeld got the vast majority of these guesses correct, even a small amount of error in this domain can make it substantially harder to detect differences between groups. (For a more in-depth treatment of this challenge, click here.)
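A rough sketch can show why a small miscoding rate matters so much when one group is tiny. It assumes that a fixed fraction of opposite-sex couples is miscoded as same-sex, that miscoding is unrelated to the child’s outcome, and that no same-sex couples are miscoded in the other direction. Every number below is hypothetical and chosen only for illustration; none comes from Rosenfeld’s or Allen’s data.

```python
def observed_gap(true_gap, share_same_sex, miscode_rate):
    """Gap a study would observe after miscoding dilutes the small group.

    true_gap:        real difference in graduation rates between the groups
    share_same_sex:  true share of couple households that are same-sex
    miscode_rate:    fraction of opposite-sex couples recorded as same-sex
    """
    true_members = share_same_sex
    miscoded_members = (1 - share_same_sex) * miscode_rate
    # Fraction of the recorded "same-sex" group that really is same-sex.
    purity = true_members / (true_members + miscoded_members)
    # Miscoded households behave like the other group, so they shrink the gap.
    return true_gap * purity


if __name__ == "__main__":
    # Hypothetical inputs: a 3-point true gap, 0.5% of couples same-sex.
    for rate in (0.0, 0.001, 0.005):
        gap = observed_gap(0.03, 0.005, rate)
        print(f"miscoding rate {rate:.1%} -> observed gap {gap * 100:.2f} points")
```

Under these made-up inputs, a miscoding rate equal to the group’s own share of the population leaves the recorded group about half genuine and roughly halves the observable gap, which in turn raises the sample size needed to detect it.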

Professor Allen’s study has fewer limitations than most but is of course not entirely free of them. No study is. First, the Canadian census doesn’t track the same persons over time. No assessments of change—improvement or decline—are possible, nor are compelling, data-based explanations of why we see between-group differences. Like Professor Rosenfeld’s, Allen’s study is one-dimensional, meaning that it evaluates just one (albeit important) outcome: timely high school graduation. Perhaps more importantly, due to the limitations imposed by a desire to link children to existing couple-headed households in the Canadian data, Allen is only able to study those who remain at home, or whose parents claim them on the census form while they are ages 17-22. But this limitation is minor, as there is no compelling reason to believe that this differs between same-sex parents and opposite-sex ones.

Allen’s study could also be improved by changing the way in which he accounts for age. Children raised by same-sex couples in Allen’s sample are younger on average than those raised by opposite-sex couples. This need not be a problem, since the age of respondents can be accounted for using regression analysis. Unfortunately, Allen does so in a commonly accepted way that has the potential to slightly and artificially inflate differences between these children. Despite this concern, this choice alone does not give us a serious reason to dismiss Allen’s findings entirely. (Regrettably, the Canadian government allowed Dr. Allen to access the data for only a limited period of time, and so he was unable to generate alternative estimates using the methodology I find more appropriate.)

Despite its modest limitations, Professor Allen’s study, and possibly Rosenfeld’s, comes closest methodologically to the ideal among studies published on the subject to date. Allen employs an objectively verifiable outcome measure, uses random and representative data, has less trouble identifying same-sex couples, and has a large number of cases.

Although far from the final say, the study makes a significant contribution to the extremely inadequate current literature. In light of its documentation of clear differences, further study of possible differences between children raised by same-sex and opposite-sex couples is warranted, and we ought to question any premature consensus that the gender mix of parents is irrelevant to child outcomes.

In saying this, I do not wish to make normative claims about the implications of Allen’s work, only its methodology. Do his results mean that as a society we ought to discourage same-sex parents from raising children? Or that we ought to institute changes that give same-sex parents more support and less stigmatization? Or both?

That is an argument that I will leave up to you. Either way, you won’t be informed enough to argue your point if all you read are the headlines.