Link Prediction Stacking

Most real-world networks are incompletely observed. Algorithms that can accurately predict which links are missing can dramatically speed up the collection of network data and improve the validity of network models. Many algorithms now exist for predicting missing links, given a partially observed network, but it has remained unknown whether a single best predictor exists, how link predictability varies across methods and networks from different domains, and how close to optimality current methods are. We answer these questions by systematically evaluating 203 individual link predictor algorithms, representing three popular families of methods, applied to a large corpus of 548 structurally diverse networks from six scientific domains. We first show that individual algorithms exhibit a broad diversity of prediction errors, such that no one predictor or family is best, or worst, across all realistic inputs. We then exploit this diversity via meta-learning to construct a series of "stacked" models that combine predictors into a single algorithm. Applied to a broad range of synthetic networks, for which we may analytically calculate optimal performance, these stacked models achieve optimal or nearly optimal levels of accuracy. Applied to real-world networks, stacked models are also superior, but their accuracy varies strongly by domain, suggesting that link prediction may be fundamentally easier in social networks than in biological or technological networks. These results indicate that the state of the art for link prediction comes from combining individual algorithms, which achieves nearly optimal predictions. We close with a brief discussion of limitations and opportunities for further improvement of these results.
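To make the stacking idea concrete, the following is a minimal sketch under stated assumptions, not the authors' implementation. It assumes three textbook topological predictors (common neighbors, Jaccard coefficient, preferential attachment) as the individual predictors and a hand-rolled logistic regression as the meta-learner; the abstract does not say which learner the paper uses. The toy graph, the held-out pairs, and all function names are hypothetical and chosen only for illustration.

```python
import math


def build_adj(nodes, edges):
    """Adjacency sets for an undirected graph."""
    adj = {v: set() for v in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj


def pair_features(adj, u, v):
    """Scores from three simple predictors for the node pair (u, v)."""
    common = adj[u] & adj[v]
    union = adj[u] | adj[v]
    cn = float(len(common))                          # common neighbors
    jaccard = cn / len(union) if union else 0.0      # Jaccard coefficient
    pref_attach = float(len(adj[u]) * len(adj[v]))   # preferential attachment
    return [cn, jaccard, pref_attach]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def mean_log_loss(w, X, y):
    """Mean logistic loss of weights w on rows X with 0/1 labels y."""
    total = 0.0
    for x, t in zip(X, y):
        z = sum(wi * xi for wi, xi in zip(w, x))
        s = z if t == 1 else -z
        total += math.log1p(math.exp(-s))
    return total / len(X)


def fit_logistic(X, y, lr=0.5, steps=500):
    """Meta-learner (assumed): logistic regression by batch gradient descent."""
    w = [0.0] * len(X[0])
    for _ in range(steps):
        grad = [0.0] * len(w)
        for x, t in zip(X, y):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
            for j, xj in enumerate(x):
                grad[j] += (p - t) * xj / len(X)
        w = [wi - lr * g for wi, g in zip(w, grad)]
    return w


def clique(vs):
    return [(a, b) for i, a in enumerate(vs) for b in vs[i + 1:]]


# Hypothetical toy network: two 5-node communities joined by one bridge edge.
nodes = range(10)
observed = clique([0, 1, 2, 3, 4]) + clique([5, 6, 7, 8, 9]) + [(4, 5)]

# Hide a few observed edges (positives) and pick as many non-edges (negatives).
held_out = [(0, 1), (2, 3), (6, 7), (8, 9)]
non_edges = [(0, 5), (1, 6), (3, 8), (4, 9)]
train_edges = [e for e in observed if e not in held_out]
adj = build_adj(nodes, train_edges)

# Each predictor's score becomes one input feature of the stacked model.
pairs = held_out + non_edges
y = [1] * len(held_out) + [0] * len(non_edges)
raw = [pair_features(adj, u, v) for u, v in pairs]
col_max = [max(col) or 1.0 for col in zip(*raw)]     # scale features to [0, 1]
X = [[1.0] + [f / m for f, m in zip(row, col_max)] for row in raw]

w = fit_logistic(X, y)                               # bias + one weight per predictor
final_loss = mean_log_loss(w, X, y)
probs = [sigmoid(sum(wi * xi for wi, xi in zip(w, x))) for x in X]
```

After fitting, `w` holds a bias and one learned weight per predictor, and `probs` holds the stacked model's scores for the training pairs. A real evaluation would score every unobserved pair and measure accuracy on a separately held-out set of edges rather than on the training pairs used here.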

(A) On synthetic networks, the mean link prediction performance (AUC) of selected individual predictors and all stacked algorithms across three forms of structural variability: (left to right, by subpanel) degree distribution variability, from low (Poisson) to high (power law); (top to bottom, by subpanel) fuzziness of community boundaries, ranging from low to high ($\epsilon=m_{\textrm{out}}/m_{\textrm{in}}$, the fraction of a node's edges that connect outside its community); and (left to right, within subpanel) the number of communities $k$. Across settings, the dashed line represents the theoretical maximum performance achievable by any link prediction algorithm (SI Appendix, section B). In each instance, stacked models perform optimally or nearly optimally, and generally perform better when networks exhibit heavier-tailed degree distributions and more communities with distinct boundaries. Table S11 lists the top five topological predictors for each synthetic network setting, which vary considerably. (B) On real-world networks, the mean link prediction performance for the same predictors over all domains, and by individual domain. Both overall and within domains, stacked models exhibit superior performance, particularly the across-family versions, and they achieve nearly perfect accuracy on social networks. Performance varies considerably across individual domains, with biological and technological networks exhibiting the lowest link predictability. More complete results for individual topological and model-based predictors are given in SI Appendix, Figs. S8 and S9. For ease of interpretability, each panel's results are partitioned into three columns, showing ($\Box$) the performance range for selected individual predictors in each family (see legend), ($\dagger$) the results for within-family stacking, and ($\heartsuit$) the results for across-family stacking.
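The caption reports performance as AUC. As a rough sketch of what that number measures, the snippet below computes AUC as the probability that a randomly chosen true missing link receives a higher score than a randomly chosen non-link, with ties counted as one half. The function name and the example scores are hypothetical, and the exact sampling protocol used in the paper is not stated in this section.

```python
def auc(pos_scores, neg_scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    pos_scores: predictor scores for held-out true links.
    neg_scores: predictor scores for node pairs that are not links.
    Ties contribute one half, so a constant predictor gets 0.5.
    """
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Illustrative scores only: a predictor that ranks every true link first.
perfect = auc([3, 2], [1, 0])
# A predictor that cannot distinguish links from non-links.
uninformative = auc([1.0], [1.0])
```

An AUC of 1 means every true missing link outranks every non-link, while 0.5 matches random guessing, which is why the theoretical maximum in panel (A) can sit below 1 when the generating model itself makes some links indistinguishable from non-links.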

Most real-world networks are incompletely observed. Algorithms that can accurately predict which links are missing can dramatically speed up network data collection and improve network model validation. Many algorithms now exist for predicting missing links, given a partially observed network, but it has remained unknown whether a single best predictor exists, how link predictability varies across methods and networks from different domains, and how close to optimality current methods are. We answer these questions by systematically evaluating 203 individual link predictor algorithms, representing three popular families of methods, applied to a large corpus of 550 structurally diverse networks from six scientific domains. We first show that individual algorithms exhibit a broad diversity of prediction errors, such that no one predictor or family is best, or worst, across all realistic inputs. We then exploit this diversity using network-based metalearning to construct a series of “stacked” models that combine predictors into a single algorithm. Applied to a broad range of synthetic networks, for which we may analytically calculate optimal performance, these stacked models achieve optimal or nearly optimal levels of accuracy. Applied to real-world networks, stacked models are superior, but their accuracy varies strongly by domain, suggesting that link prediction may be fundamentally easier in social networks than in biological or technological networks. These results indicate that the state of the art for link prediction comes from combining individual algorithms, which can achieve nearly optimal predictions. We close with a brief discussion of limitations and opportunities for further improvements.

2020