2025-03-28

tl;dr: Maybe. The preliminary results are certainly impressive.

Introduction

Crowd and flow predictions have long been popular topics in mobility data science. Traditional forecasting methods relied on classic statistical and machine learning models such as ARIMA; these were later followed by deep learning approaches such as ST-ResNet.

More recently, foundation models for time series forecasting, such as TimeGPT, Chronos, and Lag-Llama, have been introduced. A key advantage of these models is their ability to generate zero-shot predictions, meaning that they can be applied directly to new tasks without requiring retraining for each scenario.
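To give an idea of what zero-shot means in practice, here is a minimal sketch of a TimeGPT call via Nixtla's Python client. The input file name and column layout are assumptions; the client expects a long-format frame with an id column, a timestamp column, and a target column:

```python
import pandas as pd
from nixtla import NixtlaClient

# Reads the API key from the NIXTLA_API_KEY environment variable
client = NixtlaClient()

# Hypothetical long-format input: one row per series id, timestamp, and flow value
df = pd.read_csv("bike_flows.csv", parse_dates=["ds"])  # columns: unique_id, ds, y

# Zero-shot forecast: no training step, just a single call
fcst = client.forecast(df=df, h=24, freq="H", time_col="ds", target_col="y")
```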

In this post, I want to compare TimeGPT’s performance against traditional approaches for predicting city-wide crowd flows.

Experiment setup

The experiment builds on the paper “Deep Spatio-Temporal Residual Networks for Citywide Crowd Flows Prediction” by Zhang et al. (2017). The original repo referenced on the homepage does not exist anymore. Therefore, I forked https://github.com/topazape/ST-ResNet as a starting point.

The goals of this experiment are to:

Get an impression of how TimeGPT predicts mobility time series.

Compare TimeGPT to classic machine learning (ML) and deep learning (DL) models.

Understand how different forecasting horizons impact predictive accuracy.

The paper presents results for two datasets (TaxiBJ and BikeNYC). The following experiment only covers BikeNYC.

You can find the full notebook at https://github.com/anitagraser/ST-ResNet/blob/079948bfbab2d512b71abc0b1aa4b09b9de94f35/experiment.ipynb

First attempt

In the first version, I applied TimeGPT’s historical forecast function to generate flow predictions. However, there was an issue: the built-in historical forecast function ignores the horizon parameter, making it impossible to control the forecast horizon and thus to make a fair comparison.
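For reference, this is roughly what the first attempt looked like (a sketch; in the nixtla client, in-sample predictions are requested via the add_history flag):

```python
# First attempt (sketch): request in-sample ("historical") forecasts.
# The historical part of the output is generated with a fixed internal
# step, so the h parameter does not control its horizon.
hist_fcst = client.forecast(
    df=df,
    h=1,                # only affects the out-of-sample tail
    freq="H",
    time_col="ds",
    target_col="y",
    add_history=True,   # also return predictions over the training period
)
```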

Refinements

In the second version, I therefore added backtesting with a customizable forecast horizon to evaluate TimeGPT’s forecasts over multiple time windows.
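The backtest boils down to a rolling-origin loop: repeatedly cut the series off at a moving cutoff, forecast the next `horizon` steps, and collect the predictions. A minimal sketch, reusing the client and column names from above (nixtla's built-in cross_validation method offers similar functionality):

```python
import pandas as pd

def backtest(client, df, horizon, n_windows, freq="H"):
    """Rolling-origin backtest: forecast `horizon` steps ahead from each
    of `n_windows` consecutive cutoffs covering the end of the series."""
    timestamps = pd.Index(sorted(df["ds"].unique()))
    forecasts = []
    for i in range(n_windows):
        cutoff = timestamps[-(n_windows - i) * horizon - 1]
        train = df[df["ds"] <= cutoff]
        fcst = client.forecast(df=train, h=horizon, freq=freq,
                               time_col="ds", target_col="y")
        forecasts.append(fcst)
    return pd.concat(forecasts, ignore_index=True)

# Cover the same test span with every horizon so results stay comparable
# (the 48-hour test span is an assumption for illustration)
test_span = 48
results = {h: backtest(client, df, horizon=h, n_windows=test_span // h)
           for h in [1, 12, 24]}
```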

To reproduce the original experiments as faithfully as possible, both inflows and outflows were included.
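In the ST-ResNet data layout, each timestep is a (2 × height × width) tensor holding inflow and outflow per grid cell. To feed this to TimeGPT, it needs to be reshaped into one series per grid cell and flow direction. A sketch of that conversion into the long format used above (variable names and the exact tensor convention are assumptions):

```python
import numpy as np
import pandas as pd

def flows_to_long(flows: np.ndarray, timestamps) -> pd.DataFrame:
    """Convert a (T, 2, H, W) flow tensor into long format:
    one series per grid cell and flow direction."""
    T, C, H, W = flows.shape
    frames = []
    for c, direction in enumerate(["inflow", "outflow"]):
        for row in range(H):
            for col in range(W):
                frames.append(pd.DataFrame({
                    "unique_id": f"{direction}_{row}_{col}",
                    "ds": timestamps,
                    "y": flows[:, c, row, col],
                }))
    return pd.concat(frames, ignore_index=True)
```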

I ran TimeGPT for different forecasting horizons: 1 hour, 12 hours, and 24 hours. The original paper (Zhang et al. 2017) only performs one-step-ahead (1-hour) forecasting, but it is interesting to explore the additional challenge that longer forecast horizons pose. Here’s an example of the 24-hour forecast:

The predictions pick up on the overall daily patterns but the peaks are certainly hit-and-miss.

For comparison, here are some results for the easier 1-hour forecast:

Not bad. Let’s run the numbers! (And by that I mean: let’s measure the error.)
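Concretely, “the numbers” are the RMSE between the forecasts and the held-out ground truth, matched per series and timestamp. A sketch, reusing the names from the snippets above (the nixtla client names its prediction column TimeGPT, and df holds the full observed series):

```python
import numpy as np

def rmse(actuals, forecasts, pred_col="TimeGPT"):
    """RMSE over all series and timestamps covered by the forecasts."""
    merged = forecasts.merge(actuals, on=["unique_id", "ds"], how="inner")
    return float(np.sqrt(((merged[pred_col] - merged["y"]) ** 2).mean()))

for horizon, fcst in results.items():
    print(f"{horizon:>2}h horizon: RMSE = {rmse(df, fcst):.2f}")
```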

Results

The original paper provides results (RMSE, i.e. smaller is better) for multiple traditional ML models and DL models. Adding our experiments to these results, we get:

Key takeaways

TimeGPT with a 1-hour horizon outperforms all ML and DL models.

For longer horizons, TimeGPT’s accuracy declines but remains competitive with DL approaches.

TimeGPT’s pre-trained nature means that we can make predictions immediately, without any prior training.

Conclusion & next steps

These preliminary results suggest that time series foundation models, such as TimeGPT, are a promising tool. However, a key limitation of the presented experiment remains: since the BikeNYC data has been public for a long time, it is quite possible that TimeGPT has seen this dataset during its training. This raises questions about how well it generalizes to truly unseen datasets. To address this, the logical next step would be to test TimeGPT and other foundation models on an entirely new dataset to better evaluate their robustness.

We also know that DL model performance can be improved by providing more training data. It is therefore reasonable to assume that specialized DL models will outperform foundation models once they are trained with enough data. But in the absence of large-enough training datasets, foundation models can be an option.

In recent literature, we also find more specific foundation models for spatiotemporal prediction, such as UrbanGPT https://arxiv.org/abs/2403.00813, UniST https://arxiv.org/abs/2402.11838, and UrbanDiT https://arxiv.org/pdf/2411.12164. However, as far as I can tell, none of them have published the model weights.

If you want to join forces, e.g. to add more datasets or test other time series foundation models, don’t hesitate to reach out.
