Optimal sample size and data arrangement method in estimating correlation matrices with lesser collinearity: a statistical focus in maize breeding

Abstract

Information about data arrangement methodologies and the optimal sample size for estimating the Pearson correlation coefficient (r) among maize traits is still limited. Furthermore, some data arrangement methodologies currently in use may increase multicollinearity in multiple regression analysis. This study aimed to investigate the statistical behavior of r and the multicollinearity of correlation matrices among maize traits under different data arrangement scenarios and different sample sizes. Data from 45 treatments [15 simple maize hybrids (Zea mays L.) grown in three locations] were used. Eleven traits were assessed and three datasets (scenarios) were formed: (1) all sampled observations (plants), n = 900; (2) the average of five plants per plot, n = 180; and (3) the average of each treatment, n = 45. In each scenario, one thousand estimates of r were obtained for each of 60 sample sizes by bootstrap simulation with replacement, and confidence intervals (CI) were estimated. One hundred eighty correlation matrices were estimated and the condition number (CN) of each was calculated. Data based on plot averages and treatment averages overestimate r by up to 24 and 34%, respectively, resulting in increases of 24 and 131% in the matrices’ CN. Trait pairs with high r require a smaller number of plants, since the CI width is inversely proportional to the magnitude of r. Two hundred and ten plants are sufficient to estimate r with a 95% CI width below 0.30.

Key words: Average values, bootstrap, confidence intervals, sample tracking, Zea mays L.
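The resampling procedure described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code: the data are synthetic, the function names `bootstrap_r_ci` and `condition_number` are hypothetical, the CI is taken as a simple percentile interval, and the CN is assumed to be the ratio of the largest to the smallest eigenvalue of the correlation matrix.

```python
import numpy as np


def bootstrap_r_ci(x, y, sample_size, n_boot=1000, alpha=0.05, rng=None):
    """Percentile CI for Pearson's r from bootstrap resamples of a given size."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    estimates = np.empty(n_boot)
    for b in range(n_boot):
        # Draw plant indices with replacement, keeping (x, y) pairs together.
        idx = rng.integers(0, len(x), size=sample_size)
        estimates[b] = np.corrcoef(x[idx], y[idx])[0, 1]
    lower, upper = np.percentile(
        estimates, [100 * alpha / 2, 100 * (1 - alpha / 2)]
    )
    return lower, upper


def condition_number(corr):
    """Assumed definition: largest eigenvalue divided by smallest eigenvalue."""
    eigenvalues = np.linalg.eigvalsh(corr)  # returned in ascending order
    return eigenvalues[-1] / eigenvalues[0]


# Synthetic stand-in for 900 plants and three traits (NOT the study data).
rng = np.random.default_rng(42)
n_plants = 900
x = rng.normal(size=n_plants)
y = 0.6 * x + 0.8 * rng.normal(size=n_plants)  # trait correlated with x
z = rng.normal(size=n_plants)                  # unrelated trait
data = np.column_stack([x, y, z])

# CI width for two illustrative sample sizes.
widths = {}
for size in (30, 210):
    lower, upper = bootstrap_r_ci(x, y, size, rng=rng)
    widths[size] = upper - lower
    print(f"n = {size}: 95% CI width = {widths[size]:.3f}")

cn = condition_number(np.corrcoef(data, rowvar=False))
print(f"condition number = {cn:.2f}")
```

In this sketch the CI narrows as the sample size grows, which is the behavior the study uses to define a sufficient number of plants. Averaging the rows of `data` by plot or by treatment before calling `np.corrcoef` would be one way to mimic scenarios (2) and (3).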
