Introduction
In this blog post, we walk through building an Apache Spark-based machine learning pipeline that generates candidate EuroMillions lottery tickets. The goal was to analyze historical draw data and build a predictive system around machine learning models, specifically Random Forest classifiers.
Project Overview
The project consists of four main steps:
- Loading historical EuroMillions data.
- Preprocessing and feature engineering.
- Training a multi-output Random Forest model.
- Generating multiple predictions in one script execution.
Step 1: Loading EuroMillions Data
We started by collecting EuroMillions draw data in CSV format, spanning multiple years. Each file contained draw results with the following format:
Date;Number 1;Number 2;Number 3;Number 4;Number 5;Star 1;Star 2
2004-12-31;7;8;24;25;47;8;9
2004-12-24;3;4;27;29;37;5;6
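Before bringing Spark into the picture, it helps to see what one of these semicolon-separated rows contains. A minimal plain-Python sketch of parsing a single draw line:

```python
# Minimal sketch: parse one EuroMillions draw row (semicolon-separated).
row = "2004-12-31;7;8;24;25;47;8;9"
fields = row.split(";")
draw = {
    "date": fields[0],
    "numbers": [int(x) for x in fields[1:6]],  # five main numbers (1-50)
    "stars": [int(x) for x in fields[6:8]],    # two star numbers (1-12)
}
print(draw)
```

Spark's CSV reader does the same splitting and type conversion for us, driven by the schema below.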
To load multiple CSV files efficiently into Apache Spark, we used the following script:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
spark = SparkSession.builder.appName("LoadEuroMillionsData").getOrCreate()
schema = StructType([
StructField("Date", StringType(), True),
StructField("Number_1", IntegerType(), True),
StructField("Number_2", IntegerType(), True),
StructField("Number_3", IntegerType(), True),
StructField("Number_4", IntegerType(), True),
StructField("Number_5", IntegerType(), True),
StructField("Star_1", IntegerType(), True),
StructField("Star_2", IntegerType(), True)
])
data_path = "EuroMillionsData/*.csv"
df = spark.read.csv(data_path, header=True, sep=";", schema=schema)
df.write.mode("overwrite").parquet("data/euromillions.parquet")
Step 2: Preprocessing the Data
Once the data was loaded, we performed preprocessing with Spark's VectorAssembler. This transformation combines the raw number columns into a single feature vector:
from pyspark.ml.feature import VectorAssembler
df = spark.read.parquet("data/euromillions.parquet")
assembler = VectorAssembler(inputCols=["Number_1", "Number_2", "Number_3", "Number_4", "Number_5", "Star_1", "Star_2"], outputCol="features")
df_prepared = assembler.transform(df).select("features", "Number_1", "Number_2", "Number_3", "Number_4", "Number_5", "Star_1", "Star_2")
df_prepared.write.mode("overwrite").parquet("data/euromillions_prepared.parquet")
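Conceptually, VectorAssembler just concatenates the listed input columns into one vector per row. A purely illustrative sketch of that operation on a single row (the dict here is hypothetical example data, not the Spark API):

```python
# Conceptual sketch of what VectorAssembler does for one row:
# concatenate the input columns, in order, into a single feature vector.
row = {"Number_1": 7, "Number_2": 8, "Number_3": 24,
       "Number_4": 25, "Number_5": 47, "Star_1": 8, "Star_2": 9}
input_cols = ["Number_1", "Number_2", "Number_3", "Number_4",
              "Number_5", "Star_1", "Star_2"]
features = [float(row[c]) for c in input_cols]
print(features)  # -> [7.0, 8.0, 24.0, 25.0, 47.0, 8.0, 9.0]
```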
Step 3: Training the Random Forest Model
We trained a separate Random Forest model for each position in the draw: one per main number and one per star, seven models in total.
from pyspark.ml.classification import RandomForestClassifier
train_data, test_data = df_prepared.randomSplit([0.8, 0.2], seed=42)
def train_and_save_model(label_col, model_name):
    rf = RandomForestClassifier(featuresCol="features", labelCol=label_col, numTrees=50, maxDepth=5)
    model = rf.fit(train_data)
    model.write().overwrite().save(f"models/{model_name}")
    return model
model_columns = ["Number_1", "Number_2", "Number_3", "Number_4", "Number_5", "Star_1", "Star_2"]
models = {col: train_and_save_model(col, f"euromillions_rf_{col}") for col in model_columns}
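The one-model-per-output-column pattern above is not Spark-specific. As a toy illustration of the same idea, here is a trivial "most frequent value" predictor per column; the data and names are hypothetical, and this is only a sketch of the multi-output pattern, not the author's Random Forest setup:

```python
from collections import Counter

# Toy training data: each dict is one historical draw (abbreviated).
draws = [
    {"Number_1": 7, "Star_1": 8},
    {"Number_1": 3, "Star_1": 5},
    {"Number_1": 7, "Star_1": 8},
]
columns = ["Number_1", "Star_1"]

# "Train" one trivial model per output column:
# each model simply predicts that column's most frequent value.
models = {
    col: Counter(d[col] for d in draws).most_common(1)[0][0]
    for col in columns
}
print(models)  # -> {'Number_1': 7, 'Star_1': 8}
```

The dict-comprehension-over-columns structure mirrors the Spark version: one independent model per label column, collected under that column's name.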
Step 4: Generating 10 Predictions
We designed a script to generate 10 independent predictions using our trained models.
import random
from pyspark.ml.linalg import Vectors

def generate_predictions():
    for _ in range(10):
        random_input = Vectors.dense([random.randint(1, 50) for _ in range(5)] + [random.randint(1, 12) for _ in range(2)])
        predictions = {col: models[col].transform(spark.createDataFrame([(random_input,)], ["features"])).select("prediction").collect()[0][0] for col in model_columns}
        print(f"Predicted Ticket: {[int(predictions[col]) for col in model_columns]}")

generate_predictions()
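One caveat with the random input above: random.randint can repeat a value, while a real EuroMillions ticket needs five distinct main numbers (1-50) and two distinct stars (1-12), the same ranges the script already uses. A sketch of drawing a structurally valid ticket with random.sample:

```python
import random

def random_valid_ticket(seed=None):
    """Draw a structurally valid EuroMillions ticket:
    five distinct main numbers (1-50) and two distinct stars (1-12)."""
    rng = random.Random(seed)
    numbers = sorted(rng.sample(range(1, 51), 5))  # sample without replacement
    stars = sorted(rng.sample(range(1, 13), 2))
    return numbers, stars

numbers, stars = random_valid_ticket()
print(numbers, stars)
```

The same sampling could be used to build the model's random feature vector, guaranteeing that every generated input is at least a legal ticket.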
Final Thoughts
Throughout this project, we loaded, preprocessed, and modeled EuroMillions data with Apache Spark, and generated predictions from the trained models. Lottery draws are random, so no model can genuinely predict winning numbers; still, the project is a practical showcase of building an end-to-end Spark ML pipeline on numerical data.
The full source code is available at: https://github.com/OnlineSolutionsGroupBV/EuroMillionsPrediction