python – The meaningfulness of random seeds (random_state) in machine learning

This post aims to discuss the nuances of picking a random seed when splitting a dataset into subsets.

I was working on the Titanic dataset and chose the number 69 as a random seed. I know, it’s such a culturally and undeniably cool number. After running a logistic regression, the score was .79 on the training data and .81 on the test data. ‘Fair enough’, I said to myself, ‘I’m not looking to be published’.

After finishing my project, a thought crossed my mind: why not just pick a number that is meaningful to me and use it consistently throughout my career? So I picked the random seed 13 and reran the model. God, I was in for a ride! The results shifted to .81 on the training data and .27 on the test data! This overfit had me scandalized!

As you might imagine, my mind was overflowing with questions. What the hell happened there? I went on a short quest across the world wide web, only to come up empty-handed. Many of the answers had a real “eh, c’est la vie!” vibe to them, in the sense that this is exactly what randomness should be about…
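To see that this is not a Titanic quirk, here is a minimal sketch on a synthetic dataset (make_classification stands in for the real data) showing how much the same model’s scores can swing purely as a function of random_state:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# small synthetic dataset; a small n exaggerates seed sensitivity
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

for seed in (13, 69):
    xtr, xte, ytr, yte = train_test_split(X, y, test_size=0.2, random_state=seed)
    model = LogisticRegression().fit(xtr, ytr)
    print('seed={}: train={:.2f}, test={:.2f}'.format(seed, model.score(xtr, ytr), model.score(xte, yte)))

The smaller the dataset, the wider the swings, because each split leaves a different handful of hard cases in the test fold.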

Apparently, there is no answer to how to choose a random seed, because if there were one, it wouldn’t be random anymore. What? Questions still gravitate around my limited mind like Saturn’s rings:

  • Should I just accept the fact that a study · project · career I worked hard on goes to waste because of a few “bad” seeds?
  • How can I trust myself NOT to pick a “good” seed in a difficult personal context just to keep afloat (e.g., career fallout, burnout, a death in the family…)?
  • How can I trust that others have not tinkered with their own seeds (I’m not going to run seed simulations on every study I read, am I)?
  • Is there really no way of combining a systematic approach and a random one? (See the sketch right after this list.)
  • etc.
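On that last question: the closest thing I know of to a systematic-plus-random compromise is cross-validation, i.e., averaging the score over several splits instead of trusting one. Here is a minimal sketch using sklearn’s cross_val_score (again on a synthetic stand-in dataset, not the Titanic data):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=200, n_features=4, random_state=0)

# 5-fold CV under a handful of shuffling seeds: report the mean +/- std
# across folds rather than the score of any single lucky (or unlucky) split
for seed in range(5):
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    scores = cross_val_score(LogisticRegression(), X, y, cv=cv)
    print('seed={}: {:.2f} +/- {:.2f}'.format(seed, scores.mean(), scores.std()))

The seed still matters, but only through the fold assignment, and the fold-to-fold spread is reported instead of hidden.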

At the moment, I am playing with the idea of a system, but I have to admit, this whole process doesn’t feel right. It feels disingenuous. But I don’t know what to do and this is why I am turning to other minds.

All in all, I ran my initial model with 100 different seeds. The selection was made on two criteria: 1) I isolated the seeds that put the train and test set scores within a 10% range of each other, and 2) a “random” selection is made among those seeds, and “chosen” seeds are only recommended if the number of iterations respecting the above range is greater than “chance”, i.e., 50%. In my case, 65 of the 100 iterations had train/test score differences within that 10% range (the other 35 fell outside).
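For reference, criterion 1 boils down to one line against the seeds DataFrame that the code below builds (Fit_diff is the train score minus the test score; 0.11 is the bound used in the code):

within = (seeds['Fit_diff'].abs() <= 0.11).sum()  # seeds whose train/test gap stays within ~10%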

I am eager to have your input on what seems to me an excessively ambiguous phenomenon.

For those who want to reproduce the problem themselves, here is the source code:

import pandas as pd
import random
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.preprocessing import StandardScaler 
from sklearn.linear_model import LogisticRegression


class Ordinal_encoding(BaseEstimator, TransformerMixin):
    """Encode categorical (object-dtype) columns as integer codes."""
    def __init__(self):
        pass
    
    def fit(self, x, y=None):
        return self
    
    def transform(self, x, y=None):
        # map every unique category of each object column to an integer
        # (note: the mapping is rebuilt from whichever frame is passed in,
        # so train and test codes only agree when categories appear in the
        # same order in both)
        for x_col in x.columns:
            if x[x_col].dtype == object:
                x_ordmap = {key: val for val, key in enumerate(x[x_col].unique())}
                x[x_col] = x[x_col].map(x_ordmap)
        
        return x


def exploring_seeds(df):
    # build a seed dictionary containing:
    #     key: the seed used for the split
    #     values: train score, test score, difference between the two
    seed_dict = {}
    
    # make 101 different splits with seeds from 0 to 100
    for i in range(101):
        xtrain, xtest, ytrain, ytest = train_test_split(df[['embarked', 'pclass', 'sex', 'age']],
                                                        df['survived'],
                                                        test_size=0.2,
                                                        random_state=i,
                                                        stratify=df[['embarked', 'sex']])
        
        # encode categorical variables numerically
        ord_enc = Ordinal_encoding().fit(xtrain, ytrain)
        xtrain = ord_enc.transform(xtrain)
        xtest = ord_enc.transform(xtest)
        
        # impute missing values
        imputer_miss = IterativeImputer(estimator=ExtraTreesRegressor(n_estimators=10, random_state=i),
                                        max_iter=20,
                                        random_state=i)
        imputer_miss.fit(xtrain)
        xtrain = pd.DataFrame(imputer_miss.transform(xtrain), columns=xtrain.columns)
        xtest = pd.DataFrame(imputer_miss.transform(xtest), columns=xtest.columns)
        
        # standardize variables
        std = StandardScaler().fit(xtrain)
        xtrain = pd.DataFrame(std.transform(xtrain), columns=xtrain.columns)
        xtest = pd.DataFrame(std.transform(xtest), columns=xtest.columns)
        
        # run the logistic regression
        lr = LogisticRegression(C=0.1,
                                penalty='l2',
                                solver='newton-cg',
                                random_state=i)
        
        # fit and score on both subsets
        model = lr.fit(xtrain, ytrain)
        train = model.score(xtrain, ytrain)
        test = model.score(xtest, ytest)
        seed_dict[i] = (train, test, train - test)
    
    return seed_dict


# import dataset (this OpenML CSV encodes missing values as '?')
df = pd.read_csv('https://www.openml.org/data/get_csv/16826755/phpMYEkMl',
                 usecols=['embarked', 'fare', 'pclass', 'age', 'sex', 'survived'],
                 na_values='?')

# get rid of outliers (IQR rule, applied to the numeric columns only)
num_cols = df.select_dtypes('number').columns
q1 = df[num_cols].quantile(0.25)
q3 = df[num_cols].quantile(0.75)
iqr = q3 - q1

df = df[~((df[num_cols] < (q1 - 1.5 * iqr))
          | (df[num_cols] > (q3 + 1.5 * iqr))).any(axis=1)]

# explore seeds
seed_dict = exploring_seeds(df)


# turn dict into a df
seeds = pd.DataFrame.from_dict(seed_dict, orient='index',
                               columns=['Train_score', 'Test_score', 'Fit_diff'])

# get graphical info 
# print('Distribution plot:')
# sns.kdeplot(seeds.Train_score)
# sns.kdeplot(seeds.Test_score)

# describe stats 
print('Stats:')
print(seeds.describe())

# observe the best seeds
print()
print('Ordered df by test score:')
print(seeds.sort_values(by='Test_score', ascending=False))


# count and extract the seeds that keep train and test scores close together
seed_list = []
within = 0
outside = 0
for seed, row in seeds.iterrows():
    if -0.11 <= row['Fit_diff'] <= 0.11:
        within += 1
        seed_list.append(seed)
    else:
        outside += 1

# get seed info and recommend some "random" seeds
rec = []
def decision():
    if within > 50:
        # draw up to 11 distinct seeds from the "stable" ones
        # (random.choice can repeat, so rec may end up shorter)
        for i in range(11):
            pick = random.choice(seed_list)
            if pick not in rec:
                rec.append(pick)
        print('{:.0f}% of seeds fall within a 10% range of each other.'.format(100 * within / len(seeds)))
        print('Recommended seeds:')
        print(rec)
    else:
        print('Sorry, model is too unstable')

decision()