2013-08-19

A correlation between g factor loadings and indices of heritability (h2) supports the genetic g hypothesis; on the other hand, the interpretation becomes questionable if g correlates with shared (c2) and/or nonshared (e2) environment to the same extent. The results of the present meta-analysis tend to support the hereditarian hypothesis.

Before introducing the topic, however, I should say that the EXCEL file (5 sheets) can be acquired here. It contains the calculations, formulas, references, and all other relevant numbers and information about the specific studies. The Pearson correlations are in blue and the Spearman correlations in red.

Introduction. Earlier, Rushton & Jensen (2010) cited an unpublished psychometric meta-analysis by van Bloois et al. (2009, p. 61) reporting a perfect true correlation of g with heritability (eight studies, 1512 twin pairs), that is, after correction for several substantial artifacts: (1) sampling error, (2) reliability of the vector of g loadings, (3) reliability of the vector of a specific variable of theoretical interest, (4) restriction of range of g loadings, and (5) deviation from perfect construct validity.

It is important to distinguish between a true and an observed correlation. Because correlations can be affected by so many artifacts, it is important to consider them all. Regarding the test of Spearman’s hypothesis, Jensen (1998, pp. 381-383) tells us that the method of correlated vectors (MCV) yields a somewhat attenuated correlation, owing to the same artifacts te Nijenhuis et al. (2007) listed in their own test of the hypothesis. The relevant passage of The g Factor reads:

The theoretically ideal conditions for this test of Spearman’s hypothesis unfortunately are mutually contradictory: large g loadings on the subtests and maximum variation among the subtests’ g loadings; also, large mean group differences on the subtests and maximum variation among the group differences. Within the necessarily limited range of values from zero to one for g loadings, it is impossible to maximize the mean and the standard deviation simultaneously. As is evident from the data I have analyzed, the constructors of test batteries have selected mostly subtests with fairly large g loadings in order to maximize the g of the battery as a whole, hence a necessarily restricted variation among the subtests’ g loadings. In the 149 subtests in Figure 11.6, for example, the g loadings range from +.26 to +.89, with a mean of +.60 and SD of .13; the g loadings are concentrated in the upper part of the range, as shown in the left panel of Figure 11.7. The right panel shows the distribution of the standardized mean W-B differences on 149 subtests.

Still another condition that attenuates the rgd test of Spearman’s hypothesis is the reliability of the vector of g loadings and of the vector of the group differences for a given battery. The vectors are subject to sampling error, as are the statistics on the single subtests. These two sources of sampling error, though statistically somewhat related because they are both based on one and the same subject sample, can be quite different. In large samples (as have been used to test Spearman’s hypothesis), the reliability of each of the correlated vectors is generally lower than the reliability of the factor loadings and the group differences on any single subtest.

Since we have data on the twelve subtests of the WISC-R obtained in three large independent representative samples of blacks and whites, we can determine the average correlation between the g vectors obtained for each sample and also the average correlation between the vectors of group differences (d). The average correlation is an estimate of the reliability of each vector. For the g vector it is .86; for d it is .78. For the test of Spearman’s hypothesis based on the WISC-R, if we use these reliability coefficients to disattenuate the average correlation rgd = .61, the correlation is raised to .61/√((.86)(.78)) = .74. It seems likely that if the values of rgd obtained for all the other batteries used to test Spearman’s hypothesis could be disattenuated in this way, the overall average value of rgd would be increased by some .10 to .20 points, thus putting it somewhere between .70 and .80.

Besides the attenuating effect of subject sampling error, each of the batteries from which a g factor has been extracted is also liable to psychometric sampling error. The g extracted from a given battery is not exactly the same as the g extracted from a similar battery composed of different subtests. Although each battery gives an estimate of the hypothetical “true” g (as explained in Chapter 4, p. 87), estimates of g based on different test batteries typically correlate about +.80 with each other. If one also disattenuated the grand average rgd for the effect of psychometric sampling error (in addition to subject sampling error), the fully corrected rgd would rise to about .90.
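
Jensen’s disattenuation step above is plain arithmetic and easy to verify. A minimal sketch in Python, using only the three values quoted in the passage:

import math

r_gd  = 0.61   # observed correlation between the g vector and the d vector (WISC-R)
rel_g = 0.86   # reliability of the g-loading vector
rel_d = 0.78   # reliability of the vector of group differences

# Disattenuate for the unreliability of both vectors.
r_true = r_gd / math.sqrt(rel_g * rel_d)
print(round(r_true, 2))   # 0.74, matching Jensen's figure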

Generally, meta-analytic tests of Jensen effects indeed show a non-trivial increase in r(g*d) after correction for those artifacts; see Dragt (2010). Recently, MCV has been widely used in meta-analytic tests of the Spearman hypothesis (Dragt, 2010; Smit, 2011; Repko, 2011; Metzen, 2012; te Nijenhuis, 2013). These studies show that g is not related to the secular gains, that g does not correlate positively with biological-environmental variables, and that g shows no clear-cut relationship with breastfeeding gains, since the variance explained by the artifacts is far too low. On the other hand, they show that g is related to the heritability coefficients of reaction time measures, to brain volume, to aging, and to racial differences.

The empirical fact that g and heritability are correlated has been the subject of debate about a causal (genetic) g. That debate is beyond the scope of this article, and I will likely say something about it in the near future. In any case, Dalliard and Chuck have already discussed it briefly.

Technical notes. Initially, 36 studies (42 correlations, or data points) were included. But given the lack of appropriateness of some test batteries, a decision had to be made about the inclusion criteria. Indeed, some studies included in the meta-analysis are not ideal for our purpose, because the researchers took one or two subtests from a given battery, another one or two from a different battery, and so forth. Other studies administered a complete battery, but the number of subtests was quite low (4 or 5), so the results were too unreliable. te Nijenhuis et al. (2007) chose to include batteries with a minimum of 7 subtests, and I did the same. Some may think this cutoff is purely arbitrary, but as I show below, even an IQ test with 11 subtests does not necessarily yield stable results, and I think 7 is still far too low. Also, in most of the located studies the battery had a very strong verbal flavor; in other words, the subtests were highly selected. A battery more diverse in content yields a much more reliable g. I categorized as good batteries (=1) those that appeared more or less diverse in content, and as bad batteries (=0) those that did not. Admittedly, some of the good batteries were in fact not very good, but they were surely much better than those categorized as 0. With that said, when we remove the bad batteries and those having fewer than 7 subtests, we are left with only 19 data points (N=4010) from 18 studies.

This is important because I noticed that most of the low g*h2 and high g*c2 correlations come mainly from the studies with verbally biased subtests and, perhaps coincidentally, from those having a very small number of subtests. Even in popular, commonly used intelligence tests such as the Wechsler or the MAB, the resulting g-loadings are biased in favor of crystallized subtests (Ashton & Lee, 2005, pp. 436-438). Because crystallized tests are suspected of being more informationally (or culturally) loaded (see Kan, 2011, ch. 3), it is not implausible that this artifact explains some of the positive g*c2 and low g*h2 correlations, just as it could explain the strong correlation between informational loadings and g-loadings, which I discussed previously. In contrast, Davies et al. (2011) established that fluid intelligence is more heritable than crystallized intelligence. Rushton et al. (2007) also showed a positive correlation between complexity and heritability in the Raven matrices, a purer measure of g. Keeping that in mind, the Rijsdijk et al. (2002) study, included in our meta-analysis, reveals the following:

It was also unusual to find the Raven correlating more highly with the verbal than with the performance subtests. At the same time, factor analysis of the WAIS subtests plus the Raven showed that the Raven and the verbal subtests (with the predictable exception of digit span) had the highest g-loadings (averaging 0.73) in comparison to the performance subtests whose g-loadings averaged 0.49.

It was not unusual, however. Jensen (1998, pp. 90, 120, 167) was affirmative about the very high g-loadings obtained for the Raven’s matrices when factor analyzed along with the Wechsler subtests, even though it is a known fact that, among the Wechsler subtests, the crystallized tests have larger g-loadings than the fluid tests (Kan, 2011, pp. 43-46). This makes Kan’s conclusion about g correlating with cultural loadings somewhat questionable.

Next is the question of which estimates of the variance components (h2, c2, e2, or simply ACE) we should correlate with g. In sheet #2, readers will notice that in some (but not all) cases, depending on which estimates we use, r(g*h2), r(g*c2) and r(g*e2) diverge substantially within studies, one being highly negative and the other highly positive or vice versa, making it difficult to choose among them. One likely culprit is the number of subtests, and the Dale et al. (2010) study is by far the best illustration of the point. Their Table 3 gives both the MZ and DZ intraclass correlations (from which we can derive h2, c2 and e2) and the model-fitting parameter estimates, and the resulting h2, c2 and e2 values were extremely similar across the two methods. Yet the r(g*h2) values were 0.119 and 0.635, and the r(g*c2) values 0.531 and 0.361, respectively, even though the correlation between the two h2 vectors was as high as 0.825. This is easily explained by the small number of subtests (k=4), which renders MCV completely worthless for our test of the genetic g: even one slight deviation dramatically distorts the correlation.
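
To see concretely why k=4 makes MCV so fragile, here is a minimal sketch of the method itself; the vectors below are hypothetical, not Dale et al.’s actual numbers:

import numpy as np

# Hypothetical vectors for a 4-subtest battery (purely illustrative numbers).
g  = np.array([0.75, 0.68, 0.62, 0.55])   # g-loadings
h2 = np.array([0.50, 0.42, 0.45, 0.30])   # heritability estimates

print(np.corrcoef(g, h2)[0, 1])           # the MCV correlation r(g*h2)

# With only 4 subtests, nudging a single h2 value by five hundredths
# already moves the correlation substantially.
h2_alt = h2.copy()
h2_alt[2] += 0.05
print(np.corrcoef(g, h2_alt)[0, 1])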

Speaking of which, when we look in the present analysis at the values of h2 and c2 derived from MZ and DZ twin correlations using Falconer’s formula, we sometimes notice values higher than 1 or lower than 0 (i.e., negative). In such cases, following most of the papers I have read so far (e.g., Haworth et al., 2009, Table 3), I set those values to 1 and 0, respectively, since this seems to be common practice. Rushton et al. (2007; see supplemental data), for instance, did the same thing.
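
A minimal sketch of that procedure, assuming the classical Falconer formulas (the function name and the example twin correlations are mine):

import numpy as np

def falconer(r_mz, r_dz):
    """ACE estimates from MZ and DZ intraclass correlations, truncated to [0, 1]."""
    h2 = 2.0 * (r_mz - r_dz)   # additive genetic variance
    c2 = 2.0 * r_dz - r_mz     # shared environment
    e2 = 1.0 - r_mz            # nonshared environment (plus measurement error)
    # Out-of-bounds values are set to 1 or 0, as in Haworth et al. (2009).
    return tuple(float(np.clip(x, 0.0, 1.0)) for x in (h2, c2, e2))

# Example: r_MZ = .88 and r_DZ = .35 give h2 = 1.06 -> 1.0 and c2 = -0.18 -> 0.0.
print(falconer(0.88, 0.35))   # (1.0, 0.0, 0.12)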

Another problem is the heterogeneity of the correlations, and restriction of range in the g-loadings is a good candidate explanation. It causes a serious problem because it artificially lowers the observed correlations, just as range enhancement inflates them. According to te Nijenhuis et al. (2007), “The average standard deviation of g loadings of the various Dutch and US versions of the WISC-R and the WISC-III was 0.128.” (p. 288). And Metzen (2012, pp. 49-50) shows that the range of g-loadings differs considerably across types of measures (IQ tests, achievement tests, reaction time tests). With that said, I take the value of 0.128 as the reference point for applying the correction for range restriction.
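
A minimal sketch of such a correction, assuming the standard formula for direct range restriction (Thorndike Case II); I present it only as an approximation of what the spreadsheet does, with 0.128 as the reference SD:

import math

def correct_range_restriction(r, sd_study, sd_ref=0.128):
    """Correct an observed vector correlation for restriction of range in g-loadings.

    u is the ratio of the study's SD of g-loadings to the reference SD;
    u < 1 means the study's g-loadings are restricted relative to the reference.
    """
    u = sd_study / sd_ref
    return r / math.sqrt(u**2 + r**2 * (1.0 - u**2))

# A study whose g-loadings have an SD of 0.09 instead of 0.128:
print(correct_range_restriction(0.40, 0.09))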

Another serious concern is the (un)reliability of the column vectors of g-loadings, h2, c2, and e2. Note that heritabilities and environmentalities can change across ages; these are the likely suspects for the curious numbers in the Rietveld et al. (2003) study (N=209). The authors provide estimates of h2, c2 and e2 at three different ages: 5, 7, and 10. Curiously enough, the correlations between g and those variables varied substantially across ages. Most likely this is due to the unreliability of the h2, c2 and e2 column vectors: their inter-age correlations range from -0.099 to +0.897. In contrast, the reliabilities of these vectors in the Plomin (1994) Swedish data are quite high (ranging from +0.68 to +0.96), perhaps because the subjects’ mean age was 65 (with N=223). Pedersen et al. (1992) apparently used the same sample (although larger) as Plomin. I correlated the Pedersen vectors of h2, c2 and e2, derived either from Falconer’s formula on the MZ and DZ twin correlations or from their latent factor models, with the Plomin (1994) estimates at Time #1 and Time #2, resulting in a total of 5 correlations for each variable. One subtest was absent in Pedersen, which may affect the correlations. In any case, the h2 reliability seems modest: not bad, but not good either. The c2 vector is extremely unreliable, with many negative correlations. The most likely reason is that c2 has a very small range, with many zeros, so that even a slight deviation in one number for any given subtest can dramatically flip the direction of the correlation. Finally, the reliability of the e2 vector is very high, at around 0.90. Perhaps this explains why r(g*e2) seems more stable than r(g*h2) or r(g*c2) across the 42 data points, even when we include all of the highly unreliable cognitive batteries.
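
The vector reliabilities mentioned here are simply average correlations between parallel vectors. A minimal sketch with made-up h2 vectors at three ages (the numbers are illustrative, not Rietveld’s):

import numpy as np

# Hypothetical h2 vectors for the same 7-subtest battery at ages 5, 7 and 10.
h2_by_age = np.array([
    [0.45, 0.30, 0.55, 0.40, 0.35, 0.50, 0.25],   # age 5
    [0.50, 0.28, 0.50, 0.45, 0.30, 0.55, 0.30],   # age 7
    [0.40, 0.35, 0.60, 0.35, 0.40, 0.45, 0.20],   # age 10
])

R = np.corrcoef(h2_by_age)              # 3 x 3 inter-vector correlation matrix
pairwise = R[np.triu_indices(3, k=1)]   # the three pairwise correlations
print(pairwise.mean())                  # their average estimates the vector reliability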

The heterogeneity of the correlations could also be due to the unreliability or inaccuracy of the g-loadings themselves. When the authors give the subtest intercorrelations (from which g-loadings can be computed by way of principal component analysis) and also report g or unrotated PC1 loadings, we use the latter. The problem is well illustrated by the Rietveld et al. (2000) study, which uses the RAKIT intelligence test. The g-loadings I derived from the correlation matrix given by the authors correlate poorly (around +0.40) with the estimates, taken from the test manuals, given by Woodley & Meisenberg (2013) and Dragt (2010). Generally, a correlation of 0.40 is respectable, but an interpretable MCV test requires a very high reliability of the g-loading vectors. Thus r(g*h2) and r(g*c2) in Rietveld et al. (2000) differ markedly depending on which estimates we use, and even when I took the Woodley/Dragt estimates, the correlations varied across ages. In any case, I dropped this study from the final meta-analytic correlation because of its number of subtests, which amounts to only 6.
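
When a paper gives only the subtest intercorrelation matrix, the g-loadings can be recovered as the unrotated PC1 loadings. A minimal sketch of that computation (the helper name and the toy matrix are mine):

import numpy as np

def pc1_loadings(R):
    """g-loadings as the first unrotated principal component of a correlation matrix."""
    eigvals, eigvecs = np.linalg.eigh(np.asarray(R, dtype=float))
    v, lam = eigvecs[:, -1], eigvals[-1]   # eigh sorts eigenvalues in ascending order
    loadings = v * np.sqrt(lam)
    # The sign of an eigenvector is arbitrary; reflect so the loadings come out positive.
    return loadings if loadings.sum() > 0 else -loadings

# Toy 3-subtest intercorrelation matrix (illustrative only).
R = [[1.00, 0.55, 0.45],
     [0.55, 1.00, 0.50],
     [0.45, 0.50, 1.00]]
print(pc1_loadings(R))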

In the Rietveld et al. (2000) study, again, readers may also notice (sheet #2) that female twins and male twins do not show a consistent Jensen effect: g*h2 was positive and g*c2 negative in the female sample, but both were near zero in the male sample. This is not surprising given the small sample size. If we divide the entire group into subgroups, first by gender, then by age, and finally by SES level, the inevitable outcome is that error variance increases.

If we compare the Jensen effects on h2, c2 and e2 in sheet #1, we see the g*c2 correlations jumping all over the place. I suspect the likely reason is the very small variation in c2 values among the subtests. Even with a reasonably large number of subtests, as in the WAIS (k=11), I noticed that a slight deviation can produce a significant change. Initially, there was a mistake in one number for ‘Specific C’ in the Friedman et al. (2008) study (sheet #2): it had been wrongly entered as 0.15 in our spreadsheet instead of 0.18 for the Picture Completion subtest, resulting in a ‘Total C’ of 0.17 instead of 0.20. This slight error in a single subtest made r(g*c2) as high as 0.572, whereas it was in fact 0.510 once I corrected this error of just 0.03 points. In any case, the g*c2 correlations seem to show no clear pattern at all.
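
The Friedman et al. episode can be replayed numerically. A minimal sketch with hypothetical vectors for an 11-subtest battery; only the 0.03 shift in a single c2 value mirrors the actual error:

import numpy as np

rng = np.random.default_rng(0)
g  = rng.uniform(0.4, 0.8, size=11)   # hypothetical g-loadings for a WAIS-sized battery
c2 = rng.uniform(0.0, 0.3, size=11)   # hypothetical shared-environment estimates

r_before = np.corrcoef(g, c2)[0, 1]

c2_fixed = c2.copy()
c2_fixed[5] += 0.03                   # a 0.03 correction in a single subtest

r_after = np.corrcoef(g, c2_fixed)[0, 1]
print(round(r_before, 3), round(r_after, 3))   # compare r before and after the correction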

Now, if we look at the Jensen effects on e2, we see that they were generally very consistent, almost all of them (strongly) negative, with only 4 real exceptions, even when we include all of the unreliable batteries. The most likely reason is the very high reliability of the e2 vector (generally around 0.90) compared to h2 and c2. Interestingly, the c2 vector seems to be the least reliable, and it shows the least consistent Jensen effect, not only in magnitude but also in the sign of the correlation.

With regard to unreliability, Dale et al. (2010) note that “Correlations between MZ twins are often used as ‘lower-bound’ estimates of reliability”, and that latent factor analysis, “which abstracts away from error variance, assigning it to measure-specific nonshared environment”, is preferable: by removing the error variance it yields a more stable nonshared (e2) environmental component and, by the same token, somewhat better estimates of h2 and c2. Davis et al. (2009) summarize: “This latent factor approach made it possible to conduct more powerful multivariate genetic analyses at the level of the latent factors representing reading, mathematics, language, and g because the latent factors are independent of test-specific and uncorrelated error variance associated with each method of measurement.” (p. 316). In light of this, when researchers report both intraclass correlations and estimates derived from more sophisticated techniques, such as latent factor models, the latter are used. It is not implausible that these differences in method account for some of the variance between studies, although I do not think of them as really influential in our results, because the vectors correlate highly with each other.

On reflection, the unreliability of the column vectors really constitutes one of the biggest defects of the present meta-analysis. As te Nijenhuis and colleagues (2007, pp. 287-288) made clear in their meta-analytic study of the Flynn effect, vector reliabilities increase with sample size, which is to be expected because the vectors are subject to sampling error. This highlights another serious problem with the present meta-analysis: for nearly 50% of the studies (initially included), the g-loadings were derived from the correlation matrix provided in the paper, and the underlying sample sizes are generally modest. In other words, these studies should probably be discarded from the final meta-analytic correlation, although I have not actually excluded them. Of the 42 data points, only 10 use g-loadings reported in test manuals or in high-quality studies. For 10 other data points, the g-loadings or unrotated factor loadings were directly reported in the respective papers. For the remaining 22 data points, the g-loadings were derived from the reported subtest intercorrelations.

But even a relatively high vector correlation does not ensure that the r(g*h2) values will be similar. The Olson et al. (2013) study, along with Dale et al. (2010) and Neubauer et al. (2000), illustrates this best. The authors provided the MZ and DZ twin correlations, from which I calculated the heritability and environmentality of the tests using Falconer’s formula. They also provided Mx univariate estimates of the ACE components. While the correlation between the two h2 vectors was 0.868, g*h2 was -0.161 using the twin correlations and +0.314 using the Mx univariate estimates. What accounted for this? Most likely, the Rapid Naming subtest. The annoying fact is that Olson et al. used a battery of 9 subtests, which most researchers would surely consider large enough. I disagree. Generally, Jensen’s MCV is worthless with most batteries because they are far too small: one outlier or non-trivial deviation in a single number is enough to kill the correlation. Ideally, MCV probably requires a very large battery, like the one administered in the MISTRA (k=42). But that one is exceptional.

Results. Below is a simple scatterplot I produced in SPSS, plotting r(g*h2) against N and weighting by the number of subtests, so that studies with more subtests are given more weight (a kind of coefficient of importance). It seems that r(g*h2) increases with sample size, and the relationship is even stronger without the subtest weights.

[Scatterplot: r(g*h2) plotted against sample size N, weighted by number of subtests]

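For readers without SPSS, the weighting used in the scatterplot amounts to a weighted Pearson correlation. A minimal sketch with placeholder data (the actual per-study values are in the EXCEL file):

import numpy as np

def weighted_pearson(x, y, w):
    """Pearson correlation with observation weights (here, the number of subtests)."""
    x, y, w = (np.asarray(a, dtype=float) for a in (x, y, w))
    mx, my = np.average(x, weights=w), np.average(y, weights=w)
    cov = np.average((x - mx) * (y - my), weights=w)
    sx = np.sqrt(np.average((x - mx) ** 2, weights=w))
    sy = np.sqrt(np.average((y - my) ** 2, weights=w))
    return cov / (sx * sy)

# Placeholder per-study values: sample size N, r(g*h2), and subtest count k.
N     = [120, 640, 90, 450]
r_gh2 = [0.30, 0.55, -0.10, 0.62]
k     = [7, 14, 9, 11]
print(weighted_pearson(N, r_gh2, k))
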
The trend in that scatterplot amounts to a Pearson correlation. Below is a correlation matrix showing how the Jensen effects on h2 change as N increases; I have not applied weights this time. Also, these Jensen effects are not corrected for the aforementioned artifacts, but the corrections made no large difference in the correlation with N.

[Table: correlation matrix showing the trend of r(g*h2) as N increases]

Finally, we display the descriptive statistics for g*h2, g*c2 and g*e2, before and after correction (for unreliability of the g vector, range restriction in g, and deviation from perfect construct validity). In the first table, N corresponds to the actual number of data points, because I have not applied the sample-size weights. The second table shows the weighted correlations, with N being the sum of the sample sizes across the data points. No further comment is needed; the numbers speak for themselves.

[Tables: descriptive statistics for g*h2, g*c2 and g*e2, unweighted and sample-size weighted, before and after corrections]

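The weighted correlations in the second table follow the bare-bones psychometric meta-analytic rule: each data point is weighted by its sample size. A minimal sketch, again with placeholder numbers:

import numpy as np

def weighted_mean_r(rs, Ns):
    """Sample-size-weighted mean correlation: sum(N_i * r_i) / sum(N_i)."""
    rs, Ns = np.asarray(rs, dtype=float), np.asarray(Ns, dtype=float)
    return float((Ns * rs).sum() / Ns.sum())

# Placeholder effect sizes and sample sizes for a handful of data points.
print(weighted_mean_r([0.30, 0.55, -0.10, 0.62], [120, 640, 90, 450]))
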
Readers do not need SPSS to reproduce these exact numbers. In EXCEL sheet #1, one can simply delete the rows corresponding to the studies that have not been highlighted, and the average correlations will update automatically.

By way of comparison, we can instead delete all of the highlighted studies, leaving only those using unrepresentative batteries with a small number of subtests (N=17660). Doing so, the weighted r(g*h2) is 0.349, r(g*c2) is 0.448, and r(g*e2) is -0.604; the corrected correlations are, respectively, 0.708, 1.124, and -1.284.

Discussion. One thing must be said here. The Davis et al. (2009) study has a more or less representative battery with a good number of subtests (k=14), but because of its huge sample size (N=5434, compared with N=4010 for all 19 retained data points combined), it exerts extreme leverage on the results, rendering the individual contribution of every other study essentially irrelevant. This is clear from the weighted/corrected r(g*c2), which amounts to -0.17 without the Davis study but jumps to 0.32 with it. The positive Jensen effect on shared environment thus depended solely on the Davis et al. (2009) study. I have therefore not included it.

Some corrected correlations were higher than 1 in absolute value. This can happen in the meta-analytic procedure. te Nijenhuis and colleagues (2007) found the same in their meta-analytic correlation of g-loadings with secular gains, with an estimated true correlation of -1.06. Such a value should simply be interpreted as a correlation of -1.00, that is, a perfect correlation.

A matter that needs further investigation is the non-trivial variability in effect sizes between studies. Most likely it is due to the two smallest samples, which show a negative r(g*h2). This aside, there are obviously some corrections I have not applied (e.g., the correction for sampling error using the harmonic mean of N, and for the unreliability of the second vector, namely h2, c2, or e2) before moving to the next step. In the te Nijenhuis et al. (2007) meta-analysis, all of the variance across studies was explained by the artifacts taken into account, meaning that the other dimensions on which the studies differ (e.g., age, sample IQ, test type) played no role in the differential Flynn effects. Accordingly, the SD of the effect sizes fell to almost zero after the corrections. This step is a crucial one, but it has not been completed here, partly because the reliabilities of the h2 and c2 vectors are unknown (to the best of my knowledge at least).

Although the h2 vector reliabilities displayed in the Plomin (1994) and Pedersen et al. (1992) studies are probably underestimated due to the small sample sizes, I computed their mean, arriving at a value of 0.66, and took this value as the h2-reliability correction for all the other studies. Using the same 19 data points, I re-calculated the weighted/corrected r(g*h2) by dividing the weighted correlation by the square root of 0.66*0.86, then by 0.90, and finally by the range-restriction factor specific to each data point. The resulting meta-analytic correlation between g and heritability was 0.879. For shared and nonshared environmentality, it was -0.209 and -0.525, respectively.
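
A minimal sketch of that correction chain, mirroring the order of operations just described (0.66, 0.86 and 0.90 are the values given in the text; the function name, the placeholder inputs, and the simple division for range restriction are my assumptions about what the spreadsheet does):

import math

def fully_corrected(r_weighted, u, rel_h2=0.66, rel_g=0.86, construct_validity=0.90):
    """Chain the artifact corrections on a weighted mean correlation.

    u stands in for the study-specific range-restriction factor;
    a simple division is assumed here.
    """
    r = r_weighted / math.sqrt(rel_h2 * rel_g)   # unreliability of the h2 and g vectors
    r = r / construct_validity                   # deviation from perfect construct validity
    r = r / u                                    # restriction of range in g-loadings
    return r

print(fully_corrected(0.50, u=0.85))             # placeholder inputs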

Studies considered in the meta-analysis.

1. Block, J. B. (1968). Hereditary components in the performance of twins on the WAIS.
2. Bratko, Butkovic, Chamorro-Premuzic (2010). The genetics of general knowledge: A twin study from Croatia.
3. Byrne Brian et al. (2007). Longitudinal twin study of early literacy development: Preschool through Grade 1.
4. Byrne, Coventry, Olson, Samuelsson, Corley, Willcutt, Wadsworth, DeFries (2009). Genetic and Environmental Influences on Aspects of Literacy and Language in Early Childhood: Continuity and Change from Preschool to Grade 2.
5. Dale, Harlaar, Hayiou-Thomas, Plomin (2010). The Etiology of Diverse Receptive Language Skills at 12 Years.
6. Davis, Haworth, and Plomin (2009). Learning abilities and disabilities: Generalist genes in early adolescence.
7. Friedman, Miyake, Young, DeFries, Corley, Hewitt (2008). Individual Differences in Executive Functions Are Almost Entirely Genetic in Origin.
8. Harlaar, Cutting, Deater-Deckard, DeThorne, Justice, Schatschneider, Thompson, Petrill (2010). Predicting individual differences in reading comprehension: a twin study.
9. Hart, Petrill, Thompson, Plomin (2009). The ABCs of Math: A Genetic Analysis of Mathematics and Its Links With Reading Ability and General Cognitive Ability.
10. Hart Sara A., Petrill Stephen A., Thompson Lee A. (2010). A factorial analysis of timed and untimed measures of mathematics and reading abilities in school aged twins.
11. Hayiou-Thomas ME, Kovas, Harlaar, Plomin, Bishop, & Dale (2006). Common aetiology for diverse language skills in 4½-year-old twins.
12. Jacobs, Van Gestel, Derom, Thiery, Vernon, Derom, & Vlietinck (2001). Heritability estimates of intelligence in twins: effect of chorion type.
13. Johnson Wendy et al. (2007). Genetic and environmental influences on the Verbal-Perceptual-Image Rotation (VPR) model of the structure of mental abilities in the Minnesota study of twins reared apart.
14. LaBuda, DeFries, & Fulker (1987). Genetic and environmental covariance structures among WISC-R subtests: A twin study.
15. Lemelin et al. (2007). The Genetic-Environmental Etiology of Cognitive School Readiness and Later Academic Achievement in Early Childhood.
16. Luo, D., Petrill, S. A., & Thompson, L. A. (1994). An exploration of genetic g: Hierarchical factor analysis of cognitive data from the Western Reserve Twin Project.
17. Martin Nicolas W. et al. (2009). Genetic Covariation Between the Author Recognition Test and Reading and Verbal Abilities: What Can We Learn from the Analysis of High Performance?.
18. Mather Patricia L., Black Kathryn N. (1984). Hereditary and Environmental Influences on Preschool Twins’ Language Skills.
19. Mosing Miriam A, Mellanby Jane, Martin Nicholas G, Wright Margaret J (2012). Genetic and Environmental Influences on Analogical and Categorical Verbal and Spatial Reasoning in 12-Year Old Twins.
20. Neubauer et al. (2000). Genetic and Environmental Influences on Two Measures of Speed of Information Processing and their Relation to Psychometric Intelligence Evidence from the German Observational Study of Adult Twins.
21. Olson, Hulslander, Christopher, Keenan, Wadsworth, Willcutt, Pennington, DeFries (2013). Genetic and environmental influences on writing and their relations to language and reading.
22. Owen David R., Sines Jacob O. (1970). Heritability of personality in children.
23. Pedersen et al. (1992). A Quantitative Genetic Analysis of Cognitive Abilities during the Second Half of the Life Span.
24. Petrill, Plomin, Berg, Johansson, Pederson, Ahern, McClearn (1998). The Genetic and Environmental Relationship between General and Specific Cognitive Abilities in Twins Age 80 and Older.
25. Petrill, Saudino, Wilkerson, Plomin (2001). Genetic and environmental molarity and modularity of cognitive functioning in 2-year-old twins.
26. Plomin & Vandenberg (1980). An Analysis of Koch’s (1966) Primary Mental Abilities Test Data for 5- to 7-Year-Old Twins.
27. Plomin et al. (1994). Variability and Stability in Cognitive Abilities Are Largely Genetic Later in Life.
28. Rietveld, Baal, Dolan, Boomsma (2000). Genetic Factor Analyses of Specific Cognitive Abilities in 5-Year-Old Dutch Children.
29. Rijsdijk, Vernon, & Boomsma (2002). Application of hierarchical genetic models to Raven and WAIS subtests: a Dutch twin study.
30. Samuelsson, Olson, Wadsworth, Corley, DeFries, Willcutt, & Byrne (2007). Genetic and environmental influences on prereading skills and early reading and spelling development in the United States, Australia, and Scandinavia.
31. Segal Nancy L. (1985). Monozygotic and Dizygotic Twins: A Comparative Analysis of Mental Ability Profiles.
32. Shikishima, Hiraishi, Yamagata, Sugimoto, Takemura, Ozaki, Okada, Toda, Ando (2009). Is g an entity? A Japanese twin study using syllogisms and intelligence tests.
33. Tambs, Sundet, Magnus (1984). Heritability Analysis of the WAIS Subtests: A Study of Twins.
34. Thompson, Detterman, Plomin (1991). Associations between Cognitive Abilities and Scholastic Achievement: Genetic Overlap but Environmental Differences.
35. Wainwright, Wright, Luciano, Geffen, Martin (2005). Multivariate Genetic Analysis of Academic Skills of the Queensland Core Skills Test and IQ Highlight the Importance of Genetic g.
36. Williams, F. (1975). Family resemblance in abilities: The Wechsler Scales.