Sequential Stacking

Link prediction algorithms are indispensable tools in many scientific applications by speeding up network data collection and imputing missing connections. However, in many systems, links change over time and it remains unclear how to optimally exploit such temporal information for link predictions in such networks. Here, we show that many temporal topological features, in addition to having high computational cost, are less accurate in temporal link prediction than sequentially stacked static network features. This sequential stacking link prediction method uses 41 static network features that avoid detailed feature engineering choices and is capable of learning a highly accurate predictive distribution of future connections from historical data. We demonstrate that this algorithm works well for both partially observed and completely unobserved target layers, and on two temporal stochastic block models achieves near-oracle-level performance when combined with other single predictor methods as an ensemble learning method. Finally, we empirically illustrate that stacking multiple predictive methods together further improves performance on 19 real-world temporal networks from different domains.

A diagrammatic explanation of the sequential stacking approach for link prediction in temporal networks. We use $q$ consecutive temporal layers in a stacked feature vector to predict links in the target layer; we call $q$ the "flow variable" (blue: $q = 2$ in the diagram; $q = 3$ throughout all of our experiments here). We train the prediction (see Methods) using u layers before the target ($u > q$); we call $u$ the "search variable" (green: $u = 4$ in the diagram; $u = 6$ throughout all of our experiments here). Features are generated for sampled dyads (node pairs) in each layer and stacked across $q$ consecutive layers, with edge presence/absence labels from the following layer (green for training and red for testing). We then use standard supervised learning algorithms to train and generate link predictions in the target layer. As diagrammed here, no network information is used from the target layer, only the edge presence/absence labels (the "completely-unobserved setting"). When we consider the "partially-observed setting", the sequentially stacked features include static topological features as calculated from the partiallyobserved target layer (throughout our partially-observed experiments here we 5-fold cross-validate the target layer, uniformly sampling 80% of node pairs to predict links on the remaining 20%).

Average AUC scores for predicting links in the final layer (the target) in each of the 19 real-world networks. When the target layer is partially observed, we benchmark Top-Sequential-Stacking against Tensorial-SBM, Time-Series, and E-LSTM-D, and then also include these predictors with the stacking to compare against Ensemble-Sequential-Stacking. In the paper, we also investigated the scenario where the target layer is entirely unobserved.

Link prediction algorithms are indispensable tools in many scientific applications by speeding up network data collection and imputing missing connections. However, in many systems, links change over time and it remains unclear how to optimally exploit such temporal information for link predictions in such networks. Here, we show that many temporal topological features, in addition to having high computational cost, are less accurate in temporal link prediction than sequentially stacked static network features. This sequential stacking link prediction method uses 41 static network features that avoids detailed feature engineering choices and is capable of learning a highly accurate predictive distribution of future connections from historical data. We demonstrate that this algorithm works well for both partially observed and completely unobserved target layers, and achieves near-optimal AUC on two temporal stochastic block models. Finally, we empirically illustrate that stacking multiple predictive methods together further improves performance on 19 real-world temporal networks from different domains.

References

2024