Predicting education from DNA?

Lady Orlando, flickr.com, CC BY 2.0

by K. Paige Harden and Daniel W. Belsky

 

“Are these predictions from DNA of how far a person gets in school strong or weak?” This question from Antonio Regalado, a science reporter at MIT Technology Review, captures how confusing new discoveries in social science genetics can be, even to experts.

In this blog post, we want to help clarify this question: What do social science geneticists mean when we say that DNA can “predict” educational attainment, and that those predictions are “strong” or “weak”? Given the history of atrocities perpetrated under the banner of eugenic ideologies, any scientific effort to connect DNA differences to social inequalities between people is bound to be controversial, to say the least. A clear understanding of what DNA measures can (and cannot) statistically predict is essential for grounding debates about how DNA measures should be used.

Two new papers have re-animated the public conversation about prediction in social science genetics. This week, Daniel Benjamin and his colleagues from the Social Science Genetic Association Consortium (SSGAC) reported an analysis of the genomes of over 1 million people that uncovered more than a thousand genetic variants associated with educational attainment.

One of the products of this giant study is an algorithm called a polygenic score. This algorithm can be applied to the genomes of people not included in the original study to predict their educational attainment and, as one of us showed in a paper published earlier this month, their career success and wealth accumulation.

The three figures below all illustrate DNA predictions of life course outcomes. Figures A and C, which are from the SSGAC study, show educational attainment in a sample called the Health and Retirement Study. Figure A is a scatterplot of each person’s educational attainment by their polygenic score, whereas Figure C is the percentage of people attaining a college degree by quintile of polygenic score. Figure B gives yet another way of visualizing polygenic associations with life course outcomes – a binned scatterplot. The outcome here is wealth, rather than educational attainment, plotted separately for people from low, medium, and high childhood socioeconomic status.

 

Figure A. Figure provided by the Social Science Genetic Association Consortium to The Atlantic. Data and analyses reported in Lee et al., 2018, Nature Genetics. Data are from the Health and Retirement Study.

 

Figure B. From Belsky et al., 2018, PNAS. Each plotted point reflects average x and y coordinates for a bin of 50 participants. The red regression lines are plotted from the raw data. The box-and-whisker plots at the bottom of the graphs show the distribution of the education polygenic score for each childhood SES category. Data are from the Health and Retirement Study.

 

Figure C. From Lee et al., 2018, Nature Genetics. Mean prevalence of college completion by polygenic score quintile. Data are mean ± 95% confidence interval.

 

Relationships with DNA seem to be stronger as you move from Figure A to Figure C, but the major difference between these figures is not the size of the effect. It’s how many people are represented by each data point in the graphs. The data points in Figure A (the small grey dots) represent single individuals. The data points in Figure B (the big blue dots) represent averages from groups of 50 people. And the data points in Figure C represent averages from groups of about 1000 people (the blue bars) or about 1700 people (the yellow bars).

The genetic variants discovered in the new study of educational attainment are highly predictive of average outcomes in large groups of people, but not very predictive of the outcome for any one individual.

This is a basic point about statistics that often gets lost: For an effect of any size, statistical models predict the average value for a group of people with much more certainty than they predict the individual value for any one person. It comes down to signal-to-noise ratio.

When you reflect on the course your life has taken, you can often identify some unique and serendipitous events and circumstances that were influential in bringing you to where you are. But serendipity is just that: Those unique and serendipitous events that might have steered your life didn’t matter in someone else’s. That’s what we mean by statistical noise.  In group averages, that noise gets cancelled out, because it’s different for each person. What’s left is the signal – what’s common between us. All three plots show us the DNA signal predicting educational attainment; the difference between the figures is in the amount of noise. Figure A shows more noise than Figure B, and Figure B shows more than Figure C.
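The way averaging cancels noise can be checked with a small simulation. What follows is a minimal sketch on made-up data, not the analysis from either paper: the true correlation of 0.3 between score and outcome, the sample size of 10,000, and the function names are all assumptions chosen for illustration. The bin sizes of 1, 50, and 1000 loosely echo how many people sit behind each data point in Figures A, B, and C.

```python
import random
import statistics


def simulate(n_people=10_000, r=0.3, seed=1):
    """Draw standardized scores and outcomes with an ASSUMED true correlation r."""
    rng = random.Random(seed)
    noise_sd = (1 - r ** 2) ** 0.5  # keeps the outcome variance at 1
    scores = [rng.gauss(0, 1) for _ in range(n_people)]
    outcomes = [r * s + rng.gauss(0, noise_sd) for s in scores]
    return scores, outcomes


def bin_means(scores, outcomes, bin_size):
    """Sort people by score, then average score and outcome within each bin."""
    pairs = sorted(zip(scores, outcomes))
    xs, ys = [], []
    for start in range(0, len(pairs) - bin_size + 1, bin_size):
        chunk = pairs[start:start + bin_size]
        xs.append(statistics.fmean([p[0] for p in chunk]))
        ys.append(statistics.fmean([p[1] for p in chunk]))
    return xs, ys


def slope_and_correlation(xs, ys):
    """Least-squares slope of y on x, and the Pearson correlation."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sxx, sxy / (sxx * syy) ** 0.5


if __name__ == "__main__":
    scores, outcomes = simulate()
    for size in (1, 50, 1000):  # one person, 50 people, 1000 people per point
        slope, corr = slope_and_correlation(*bin_means(scores, outcomes, size))
        print(f"bin size {size:>4}: slope = {slope:.2f}, correlation = {corr:.2f}")
```

In a run of this sketch, the slope (the size of the effect) should stay close to the assumed value at every bin size, while the correlation climbs as each point averages over more people. The signal is the same throughout; only the amount of noise left in each data point changes.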

So, how predictive are these DNA differences for life outcomes? It depends on the question.

Researchers are interested in averages. We want to know how patterns of educational differences in the population come about. For that question, these DNA differences are predictive enough to be useful. (Think about Figure C.) However, parents and educators might want to make predictions about an individual child – for example, in order to tailor a curriculum in a precision education intervention. For that question, these DNA differences are likely not predictive enough. A prediction based on DNA will guess wrong more often than we will feel comfortable with. Think about Figure A: pick any given value of the polygenic score, and the dots – individual human lives – are scattered up and down the full range of educational attainment.
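To get a rough feel for how often an individual-level guess goes wrong, consider the simplest possible prediction: guess that a person with an above-average score will have an above-average outcome. The sketch below simulates this on made-up data. The correlation of 0.3 is again an assumption for illustration, not a figure taken from the studies, and the function names are hypothetical. For normally distributed variables, the share of people for whom this guess is right has a known closed form, 1/2 + arcsin(r)/π, which the simulation can be compared against.

```python
import math
import random


def sign_agreement(r, n_people=20_000, seed=2):
    """Share of simulated people whose score and outcome fall on the
    same side of the population average (r is an ASSUMED correlation)."""
    rng = random.Random(seed)
    noise_sd = math.sqrt(1 - r ** 2)
    hits = 0
    for _ in range(n_people):
        score = rng.gauss(0, 1)
        outcome = r * score + rng.gauss(0, noise_sd)
        if (score > 0) == (outcome > 0):
            hits += 1
    return hits / n_people


def expected_agreement(r):
    """Closed-form agreement rate for a bivariate normal with correlation r."""
    return 0.5 + math.asin(r) / math.pi


if __name__ == "__main__":
    r = 0.3  # hypothetical value, for illustration only
    print(f"simulated: {sign_agreement(r):.3f}")
    print(f"expected:  {expected_agreement(r):.3f}")
```

Under this assumed correlation, the guess is right for roughly six people in ten, only modestly better than a coin flip. That is the scatter in Figure A expressed as a single number.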

Something else to consider when interpreting DNA predictions of educational attainment is the extent to which DNA is capturing information about people’s social environments. A paper published in Science earlier this year found that a polygenic score calculated from the portion of parents’ DNA that their children did not inherit still predicted the children’s educational attainment. As one of us wrote in a commentary on that study, DNA associations with life course outcomes “could operate through any physical or social environment woven by genetic kin – a tangled web indeed.”

Public debates regarding the value of polygenic scores vacillate between a ready embrace of their possibilities for individual prediction, such as personalized education, and an overly pessimistic dismissal of them as “next to worthless.” The reality is much more nuanced. They are useful tools for social science researchers who are interested in average trends, but specific predictions about an individual human life will be wildly uncertain.  For polygenic prediction, there is safety only in (large) numbers.
