Open Summer Of Code ended a while ago already, but I still wanted to write a blogpost about the last two weeks of oSoc. I'm really happy with our final product (and Infrabel was very happy with it as well).
We ended up not using machine learning for the website, and instead opted for a simpler statistics approach. Machine learning is really cool and I still think it could be really useful for both commuters and Infrabel, but unfortunately it just wasn't really practical within the scope of our project. The process of training a model is very computationally intensive and thus time-consuming, making it very hard to optimize a model in just two weeks, especially when working on such a large dataset. Besides, the current statistics approach also allows us to show a histogram with the historical data, which we think is actually more practical for the user than just the delay predicted by a machine learning model. But we learned a lot and we do have a working model that can make actual delay predictions, which already is a very interesting proof of concept.
As mentioned in my previous post, the current model isn't very precise, although we did make some small changes since then. We also have some ideas for external data that can be added to improve the precision and already scraped the data for that. One of the first things we thought of was of course weather data, which indeed looks very promising: it boosted the Pearson correlation coefficient by roughly 20%. And of course, training the model on more data should improve the precision as well. During oSoc we trained the model on only 2 million datapoints, but the dataset has more than 70 million. But this would also mean rewriting a lot of our code or training the model on way more powerful hardware. Besides, the train IDs are not static, so the direction of train 1234 can change over the years, which would complicate things further. These problems can all be fixed, but it's not really something you could do in two weeks (and you would probably have to train it on some fancy GPU rig).
After dropping machine learning, things went very smoothly. The frontend was almost finished and just needed some small tweaks that most users probably wouldn't notice anyways. Implementing the statistics took maybe two days or so. The last week was just documenting everything and preparing for the demo day. The demo day itself went very smoothly as well, although I was pretty exhausted afterwards.
All in all, it was a really awesome experience. At the start of the project it all seemed really ambitious and I honestly wasn't sure if we could do it, but everything turned out so well, I'm really just amazed by our final product. I met so many cool people and did so much cool stuff. It hasn't even been a week since oSoc ended, but I already miss it (and judging by the Slack, I'm not the only one). And I'm already excited for next year!