Link Prediction Accuracy on Real-World Networks Under Non-Uniform Missing-Edge Patterns
Real-world network datasets are typically obtained in ways that fail to capture all edges, and the patterns of missing data are often non-uniform as they reflect biases and shortcomings of different data collection methods. While uniform missing data is commonly assumed, this study investigates the impact of different missing-edge patterns on link prediction accuracy. We employ 9 link prediction algorithms from 4 different families (local similarity indices, network embedding, matrix completion, and ensemble learning) to analyze 20 different missing-edge patterns categorized into 5 groups: edge-based, node-based, depth-first search (DFS), neighbor-based, and jump-based methods. Our comprehensive simulation study spans 250 real-world network datasets from 6 different domains (biological, economic, informational, social, technological, and transportation). The results reveal significant variations in performance across different algorithms and settings, with the Top-Stacking ensemble method achieving best performance in over 90% of cases. Notably, DFS-based missing-edge patterns consistently lead to lower prediction accuracy across domains, while node-based patterns show the smallest variance in social networks. This study provides a practical guide for researchers to select appropriate link prediction algorithms based on their data collection process and application domain.
AUC scores over 5 runs on each network for 9 link prediction algorithms on samples obtained by 20 different methods. The 250 networks are grouped into 6 domains (arranged vertically): Biological, Economic, Informational, Social, Technological, and Transportation. Symbols indicate mean AUCs with standard deviations shown by vertical bars. The sampling methods span edge-based (Random Edge, Random Node-Edge, Hybrid Node-Edge, Random Edge with Induction), node-based (Degree Based, Random Node, PageRank Based), DFS, neighbor-based (Diffusion, Forest Fire, various Random Walk variants, BFS), and jump-based (Random Walk with Jump, Random Node-Neighbor, Shortest Path, Loop-Erased Random Walk) categories. Results demonstrate that Top-Stacking consistently achieves the highest AUC scores across most domains and sampling patterns, while performance varies significantly depending on the network domain and missing-edge pattern.
References
2024
Link Prediction Accuracy on Real-World Networks Under Non-Uniform Missing Edge Patterns
Xie He, Amir Ghasemian, Eun Lee, and 3 more authors
Real-world network datasets are typically obtained in ways that fail to capture all edges. The patterns of missing data are often non-uniform as they reflect biases and other shortcomings of different data collection methods. Nevertheless, uniform missing data is a common assumption made when no additional information is available about the underlying missing-edge pattern, and link prediction methods are frequently tested against uniformly missing edges. To investigate the impact of different missing-edge patterns on link prediction accuracy, we employ 9 link prediction algorithms from 4 different families to analyze 20 different missing-edge patterns that we categorize into 5 groups. Our comparative simulation study, spanning 250 real-world network datasets from 6 different domains, provides a detailed picture of the significant variations in the performance of different link prediction algorithms in these different settings. With this study, we aim to provide a guide for future researchers to help them select a link prediction algorithm that is well suited to their sampled network data, considering the data collection process and application domain.
@article{he2023sampling,title={Link Prediction Accuracy on Real-World Networks Under Non-Uniform Missing Edge Patterns},author={He, Xie and Ghasemian, Amir and Lee, Eun and Schwarze, Alice and Clauset, Aaron and Mucha, Peter J},journal={PloS one},volume={19},number={7},pages={e0306883},year={2024},publisher={Public Library of Science San Francisco, CA USA},project={/projects/NonUniformSampling/},}