Building a EuroMillions Prediction Model Using Apache Spark



Introduction

In this blog post, we walk through how we built an Apache Spark-based machine learning model to predict potential winning numbers for the EuroMillions lottery. The objective was to analyze historical draw data and build a predictive system from Random Forest classifiers.

Project Overview

The project consists of four main steps:

  • Loading historical EuroMillions data.
  • Preprocessing and feature engineering.
  • Training a Random Forest model for each number in the draw.
  • Generating multiple predictions in a single script run.

Step 1: Loading EuroMillions Data

We started by collecting EuroMillions draw data in CSV format, spanning multiple years. Each file contained one draw per row in the following semicolon-separated format:

        Date;Number 1;Number 2;Number 3;Number 4;Number 5;Star 1;Star 2
        2004-12-31;7;8;24;25;47;8;9
        2004-12-24;3;4;27;29;37;5;6
    

To load multiple CSV files efficiently into Apache Spark, we used the following script:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

spark = SparkSession.builder.appName("LoadEuroMillionsData").getOrCreate()

schema = StructType([
    StructField("Date", StringType(), True),
    StructField("Number_1", IntegerType(), True),
    StructField("Number_2", IntegerType(), True),
    StructField("Number_3", IntegerType(), True),
    StructField("Number_4", IntegerType(), True),
    StructField("Number_5", IntegerType(), True),
    StructField("Star_1", IntegerType(), True),
    StructField("Star_2", IntegerType(), True)
])

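# Read every CSV in the folder. Because an explicit schema is passed, Spark
# skips the header row and uses the schema's column names (Number_1, ...).
data_path = "EuroMillionsData/*.csv"
df = spark.read.csv(data_path, header=True, sep=";", schema=schema)

# Persist as Parquet so later steps can reload the typed data.
df.write.mode("overwrite").parquet("data/euromillions.parquet")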
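
Before handing the files to Spark, it can help to check the format on a couple of rows. The following is a minimal sketch in plain Python, using only the two sample rows shown above; `parse_draws` is a hypothetical helper name and is not part of the original project.

```python
import csv
import io

# The two sample rows from this post, including the header line.
SAMPLE = """Date;Number 1;Number 2;Number 3;Number 4;Number 5;Star 1;Star 2
2004-12-31;7;8;24;25;47;8;9
2004-12-24;3;4;27;29;37;5;6
"""

def parse_draws(text):
    """Parse semicolon-separated rows into (date, main_numbers, stars) tuples."""
    reader = csv.reader(io.StringIO(text), delimiter=";")
    next(reader)  # skip the header row
    draws = []
    for row in reader:
        if not row:
            continue  # ignore blank lines
        date, *values = row
        values = [int(v) for v in values]
        # First five values are the main numbers, last two are the stars.
        draws.append((date, values[:5], values[5:]))
    return draws

draws = parse_draws(SAMPLE)
print(draws[0])
```

A row that fails the `int` conversion raises a `ValueError` here, which makes malformed lines easy to spot before loading.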

Step 2: Preprocessing the Data

Once the data was loaded, we used Spark’s VectorAssembler to combine the seven number columns into a single feature vector column, which is the input format Spark ML estimators expect:

from pyspark.ml.feature import VectorAssembler

df = spark.read.parquet("data/euromillions.parquet")

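# Columns describing one draw: five main numbers followed by two stars.
draw_columns = ["Number_1", "Number_2", "Number_3", "Number_4", "Number_5", "Star_1", "Star_2"]

# Pack the seven columns into one vector column named "features" and keep
# the original columns so they can serve as labels in Step 3.
assembler = VectorAssembler(inputCols=draw_columns, outputCol="features")
df_prepared = assembler.transform(df).select("features", *draw_columns)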

df_prepared.write.mode("overwrite").parquet("data/euromillions_prepared.parquet")
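
Note that the feature vector built in this step contains the same seven columns that are later used as labels, so each model sees its own target among its inputs. One possible alternative, which is not what the original project does, is to use the previous draw as features and the following draw as labels. The sketch below shows that pairing in plain Python on the two sample rows from Step 1; `make_lagged_pairs` is a hypothetical helper name.

```python
def make_lagged_pairs(draws):
    """Pair each draw (features) with the draw that followed it (labels).

    `draws` must be ordered from oldest to newest.
    """
    return list(zip(draws, draws[1:]))

# Sample draws from Step 1, oldest first.
draws = [
    [3, 4, 27, 29, 37, 5, 6],   # 2004-12-24
    [7, 8, 24, 25, 47, 8, 9],   # 2004-12-31
]

pairs = make_lagged_pairs(draws)
for features, labels in pairs:
    print(f"features={features} -> labels={labels}")
```

In Spark, the same shift could be expressed with `pyspark.sql.functions.lag` over a window ordered by `Date`, provided the `Date` column is kept in the prepared data.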

Step 3: Training the Random Forest Model

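We trained a separate Random Forest classifier for each number in the draw: one for each of the five main numbers and one for each of the two stars, for a total of seven models. Each model treats the values in its column as class labels.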

from pyspark.ml.classification import RandomForestClassifier

# Hold out 20% of the draws; the fixed seed keeps the split reproducible.
train_data, test_data = df_prepared.randomSplit([0.8, 0.2], seed=42)

def train_and_save_model(label_col, model_name):
    """Fit a Random Forest that predicts label_col and save it under models/."""
    rf = RandomForestClassifier(featuresCol="features", labelCol=label_col, numTrees=50, maxDepth=5)
    model = rf.fit(train_data)
    model.write().overwrite().save(f"models/{model_name}")
    return model

# One classifier per column: five main numbers and two stars.
model_columns = ["Number_1", "Number_2", "Number_3", "Number_4", "Number_5", "Star_1", "Star_2"]
models = {col: train_and_save_model(col, f"euromillions_rf_{col}") for col in model_columns}
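
The split above produces `test_data`, which the training code shown here does not use. Below is a minimal sketch of how the held-out draws could be scored, assuming the `models` dictionary and `model_columns` list from this step. `evaluate_models` and `chance_accuracy` are hypothetical helper names, not part of the original project, and the Spark import sits inside the function only so the snippet can be loaded without a Spark installation.

```python
def evaluate_models(models, test_data, label_cols):
    """Return {label_col: accuracy} for each fitted model on held-out draws."""
    # Imported lazily (an assumption of this sketch) so the helper can be
    # defined even where PySpark is not installed.
    from pyspark.ml.evaluation import MulticlassClassificationEvaluator

    scores = {}
    for col in label_cols:
        evaluator = MulticlassClassificationEvaluator(
            labelCol=col, predictionCol="prediction", metricName="accuracy"
        )
        scores[col] = evaluator.evaluate(models[col].transform(test_data))
    return scores

def chance_accuracy(pool_size):
    """Accuracy of a uniform random guess over pool_size possible values."""
    return 1.0 / pool_size

# Rough reference points: 50 possible main numbers, 12 possible stars.
print(f"main number baseline: {chance_accuracy(50):.3f}")
print(f"star baseline:        {chance_accuracy(12):.3f}")
```

With the objects from this step it would be called as `evaluate_models(models, test_data, model_columns)`. Because the feature vector contains the label columns themselves, a high accuracy on `test_data` would not by itself say anything about future draws.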

Step 4: Generating 10 Predictions

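The script below generates 10 predictions in a single run. For each prediction, it builds a random input vector (five values from 1 to 50 and two values from 1 to 12) and passes it to all seven trained models.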

import random

from pyspark.ml.linalg import Vectors

def generate_predictions():
    for _ in range(10):
        # Random input draw: five main numbers (1-50) and two stars (1-12).
        random_input = Vectors.dense(
            [random.randint(1, 50) for _ in range(5)] + [random.randint(1, 12) for _ in range(2)]
        )
        # Build the one-row DataFrame once and reuse it for all seven models.
        input_df = spark.createDataFrame([(random_input,)], ["features"])
        predictions = {col: models[col].transform(input_df).select("prediction").first()[0] for col in model_columns}
        print(f"Predicted Ticket: {[int(predictions[col]) for col in model_columns]}")

generate_predictions()
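
Since the seven models predict independently, a generated ticket can contain the same main number twice or the same star twice. A small check like the following could filter such tickets out. This is a sketch under the assumption that a valid ticket has five distinct main numbers from 1 to 50 and two distinct stars from 1 to 12, matching the ranges used in the script above; `is_valid_ticket` is a hypothetical helper name.

```python
def is_valid_ticket(ticket, main_max=50, star_max=12):
    """Check a 7-value ticket: 5 distinct main numbers, then 2 distinct stars."""
    if len(ticket) != 7:
        return False
    main, stars = ticket[:5], ticket[5:]
    main_ok = len(set(main)) == 5 and all(1 <= n <= main_max for n in main)
    stars_ok = len(set(stars)) == 2 and all(1 <= s <= star_max for s in stars)
    return main_ok and stars_ok

print(is_valid_ticket([7, 8, 24, 25, 47, 8, 9]))   # distinct within each group
print(is_valid_ticket([7, 7, 24, 25, 47, 8, 9]))   # main number 7 repeated
```

A main number may equal a star (8 in the first example) because the two groups are drawn from separate pools.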

Final Thoughts

Throughout this project, we loaded historical draw data, assembled feature vectors, trained seven Random Forest models, and generated predictions, all with Apache Spark. While predicting lottery outcomes remains highly uncertain, the project shows how a Spark ML pipeline can be applied to numerical data.
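
To put "highly uncertain" into numbers, the count of possible tickets can be worked out directly. The sketch below assumes five main numbers chosen from 50 and two stars chosen from 12, the same ranges used in Step 4; older draws in the data set may have used a different star range.

```python
from math import comb

main_combinations = comb(50, 5)   # ways to choose 5 main numbers from 50
star_combinations = comb(12, 2)   # ways to choose 2 stars from 12
total_tickets = main_combinations * star_combinations

print(f"Possible tickets: {total_tickets:,}")
print(f"Chance that one ticket matches everything: 1 in {total_tickets:,}")
```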

 https://github.com/OnlineSolutionsGroupBV/EuroMillionsPrediction
