Link Prediction Under Non-Uniform Missing Edges

Link Prediction Accuracy on Real-World Networks Under Non-Uniform Missing-Edge Patterns

Real-world network datasets are typically obtained in ways that fail to capture all edges, and the patterns of missing data are often non-uniform as they reflect biases and shortcomings of different data collection methods. While uniform missing data is commonly assumed, this study investigates the impact of different missing-edge patterns on link prediction accuracy. We employ 9 link prediction algorithms from 4 different families (local similarity indices, network embedding, matrix completion, and ensemble learning) to analyze 20 different missing-edge patterns categorized into 5 groups: edge-based, node-based, depth-first search (DFS), neighbor-based, and jump-based methods. Our comprehensive simulation study spans 250 real-world network datasets from 6 different domains (biological, economic, informational, social, technological, and transportation). The results reveal significant variations in performance across different algorithms and settings, with the Top-Stacking ensemble method achieving best performance in over 90% of cases. Notably, DFS-based missing-edge patterns consistently lead to lower prediction accuracy across domains, while node-based patterns show the smallest variance in social networks. This study provides a practical guide for researchers to select appropriate link prediction algorithms based on their data collection process and application domain.

AUC scores over 5 runs on each network for 9 link prediction algorithms on samples obtained by 20 different methods. The 250 networks are grouped into 6 domains (arranged vertically): Biological, Economic, Informational, Social, Technological, and Transportation. Symbols indicate mean AUCs with standard deviations shown by vertical bars. The sampling methods span edge-based (Random Edge, Random Node-Edge, Hybrid Node-Edge, Random Edge with Induction), node-based (Degree Based, Random Node, PageRank Based), DFS, neighbor-based (Diffusion, Forest Fire, various Random Walk variants, BFS), and jump-based (Random Walk with Jump, Random Node-Neighbor, Shortest Path, Loop-Erased Random Walk) categories. Results demonstrate that Top-Stacking consistently achieves the highest AUC scores across most domains and sampling patterns, while performance varies significantly depending on the network domain and missing-edge pattern.

References

2024

  1. Link Prediction Accuracy on Real-World Networks Under Non-Uniform Missing Edge Patterns
    Xie He, Amir Ghasemian, Eun Lee, and 3 more authors
    PloS one, 2024