Can I train a scikit-learn model 100x faster using sk-dist?

Jan 2, 2021

Recently I read an article titled Train sklearn 100x faster, which is about an open-source Python module named sk-dist. The module implements a "distributed scikit-learn" by extending its built-in parallelisation of meta-estimators, such as pipeline.Pipeline, model_selection.GridSearchCV, feature_selection.SelectFromModel, and ensemble.BaggingClassifier, using Spark.

It was 1 AM. Wise men and women have told me not to stay up late using computers. However, my life is too sedentary for me to sleep early, I am too bored with Netflix and chill, and I am too sober to dream about the next big thing since TikTok. So, I did the next best thing: reading articles about programming and machine learning.

The article provided sample code that basically implements a scaled-down version of the MNIST digit recognition (image classification) problem using sk-dist. There was no mention of library requirements, but the imports looked as follows:

from sklearn import datasets, svm
from skdist.distribute.search import DistGridSearchCV
from pyspark.sql import SparkSession

I work with both Spark and scikit-learn quite a lot. I had a Dockerfile that builds a container image with Spark 3.0.1, Python 3.7.9, and scikit-learn 0.24.0. I extended the requirements file with sk-dist 0.1.9 and built the image. Then I wrote down the sample code and ran the command:

docker run -it -v ${PWD}:/opt/app spark3-dev python digit.py

Some debugging messages later the program failed spectacularly. I was a bit in the zone. So obviously, the ghost of Agatha Christie told me:

Everything must be taken into account. If the fact will not fit the theory — let the theory go.

So, I started my investigation. Can I really speed up scikit-learn training like the article suggests?

Does the code run?

The article was written in the Fall of 2019. It must have worked back then. I looked at the repo. The last commit with a successful build was made sometime in November 2020. The commit mentions both Spark 2.4 and Spark 3.0. So, I thought, if I build a docker image with a slightly older version of Spark, Python, and scikit-learn, it could work. After a bit of trial and error, the code worked with the following Dockerfile.

FROM python:3.7.4-stretch

RUN apt-get update && apt-get install -yq openjdk-8-jdk && apt-get clean
RUN apt-get update && apt-get install -yq ca-certificates-java && apt-get clean && update-ca-certificates -f
ENV JAVA_HOME /usr/lib/jvm/java-8-openjdk-amd64/

ENV HADOOP_VERSION 2.7
ENV APACHE_SPARK_VERSION 2.4.4
ENV APACHE_SPARK_HASH 2E3A5C853B9F28C7D4525C0ADCB0D971B73AD47D5CCE138C85335B9F53A6519540D3923CB0B5CEE41E386E49AE8A409A51AB7194BA11A254E037A848D0C4A9E5
ENV SPARK_HOME /usr/local/spark-${APACHE_SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}


RUN cd /tmp && \
        wget -q --show-progress https://archive.apache.org/dist/spark/spark-${APACHE_SPARK_VERSION}/spark-${APACHE_SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz && \
        echo "${APACHE_SPARK_HASH} *spark-${APACHE_SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz" | sha512sum -c - && \
        tar xzf spark-${APACHE_SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz -C /usr/local && \
        rm spark-${APACHE_SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz

RUN python -m pip install --upgrade pip
RUN pip install --no-cache-dir --upgrade setuptools
COPY requirements.txt /tmp/requirements.txt
RUN pip install -r /tmp/requirements.txt
WORKDIR /opt/app

The requirements.txt file includes the following modules:

pyspark==2.4.4
sk-dist==0.1.9
scikit-learn==0.23.0

This time the container ran the code successfully. I managed to produce similarly performing models using both of the following:

sklearn.model_selection.GridSearchCV
skdist.distribute.search.DistGridSearchCV

Of course, since I ran the code on a single machine, I observed no real speed-up over GridSearchCV run with 10 jobs.
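For reference, the scikit-learn side of such a comparison can be sketched as follows. This is only a minimal sketch: it uses the small digits dataset bundled with scikit-learn and an SVM, as the imports above suggest, but the parameter grid, the number of folds, and the scoring are my own assumptions, not the values from the original article.

```python
from sklearn import datasets, svm
from sklearn.model_selection import GridSearchCV

# Small 8x8 digits dataset bundled with scikit-learn (a scaled-down MNIST).
digits = datasets.load_digits()
X, y = digits.data, digits.target

# Hypothetical grid; the article's exact values are not reproduced here.
param_grid = {"C": [1, 10], "gamma": [0.001, 0.0001]}

search = GridSearchCV(
    estimator=svm.SVC(),
    param_grid=param_grid,
    cv=5,
    scoring="f1_weighted",
    n_jobs=1,  # raise this to spread the fits over more local cores
)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 4))
```

The sk-dist variant swaps GridSearchCV for DistGridSearchCV and takes a Spark context through its sc argument instead of n_jobs; it needs a running Spark session, so it is not included in the sketch.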

Can sk-dist train models faster for not so big-data?

The article Train sklearn 100x faster suggested that sk-dist is applicable to small to medium-sized data (fewer than 1 million records) and claimed that it gives better performance than both parallel scikit-learn and spark.ml. I decided to compare the run times of scikit-learn, sk-dist, and spark.ml on classifying MNIST images. However, I went for a larger dataset and a different algorithm to make the comparison a bit fairer.

I decided to run the test on Databricks, which is my go-to platform for trying out proper Spark code. The experiment setup is described below:

Cluster

The cluster contains 1 driver with 14.0 GB Memory, 4 CPU Cores and 1–8 workers with 14.0–112.0 GB Memory, 4–32 Cores. The cluster is equipped with Spark 2.4.5, Python 3.7.4, scikit-learn 0.23.0, and sk-dist 0.1.9.

Data

I used the full MNIST set, which includes 60,000 image representations for training and 10,000 for testing. It can be easily loaded from the Databricks filesystem using the following code:

import os

def get_mnist_data_databricks(is_train_data, cache):
    # `spark` is the SparkSession that Databricks notebooks provide.
    data_path = "/databricks-datasets/mnist-digits/data-001/"
    if is_train_data:
        filename = "mnist-digits-train.txt"
    else:
        filename = "mnist-digits-test.txt"
    # The files are in libsvm format with 784 pixel features per row.
    data = spark.read.format("libsvm")\
        .option("numFeatures", "784")\
        .load(os.path.join(data_path, filename))
    if cache:
        data.cache()
    return data

This code returns a PySpark DataFrame with two columns, label and features, where label holds the digit class of an image and features holds its 784-dimensional representation as a SparseVector. To convert this data to a NumPy array, I created the following helper function.

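import numpy as np

def get_numpy_array_from_sparse_vector(data):
    # `data` is expected to be a pandas Series of Spark vectors
    # (it relies on .apply and .values).
    # Densify each vector and arrange the results as an (n, 1) column.
    dataset = data\
        .apply(lambda v: np.array(v.toArray()))\
        .values.reshape(-1, 1)
    # Unpack the column into one (n_samples, n_features) array.
    ser_data = np.apply_along_axis(lambda row: row[0], 1, dataset)
    return ser_data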
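To see what the helper does without a cluster, the sketch below feeds it a pandas Series of stand-in vectors. StandInVector is a hypothetical class of mine that only mimics the toArray() method of a Spark SparseVector, and the 4-dimensional rows are made up. The np.stack line is an alternative formulation that I believe gives the same result; it is not part of the original code.

```python
import numpy as np
import pandas as pd


class StandInVector:
    """Hypothetical stand-in for a Spark SparseVector; only toArray() is mimicked."""

    def __init__(self, size, indices, entries):
        self._size = size
        self._indices = indices
        self._entries = entries

    def toArray(self):
        dense = np.zeros(self._size)
        dense[self._indices] = self._entries
        return dense


def get_numpy_array_from_sparse_vector(data):
    # Same logic as the helper in the article.
    dataset = data.apply(lambda v: np.array(v.toArray())).values.reshape(-1, 1)
    return np.apply_along_axis(lambda row: row[0], 1, dataset)


# Three tiny 4-dimensional rows instead of 784-dimensional MNIST rows.
features = pd.Series([
    StandInVector(4, [0, 2], [1.0, 3.0]),
    StandInVector(4, [1], [5.0]),
    StandInVector(4, [3], [2.0]),
])

X = get_numpy_array_from_sparse_vector(features)

# Alternative formulation (an assumption, not the author's code).
X_stacked = np.stack([v.toArray() for v in features])

print(X.shape)
print(np.array_equal(X, X_stacked))
```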

Training scikit-learn model

I trained the scikit-learn model using the following code.

import sklearn.tree
from sklearn.model_selection import GridSearchCV

sk_estimator = sklearn.tree.DecisionTreeClassifier(random_state=0)
sklearn_model = GridSearchCV(
    estimator=sk_estimator,
    param_grid={"max_depth": [2, 3, 4, 5, 6, 7, 8, 9, 10]},
    cv=10,
    scoring="f1_weighted",
    n_jobs=10
)

This model produced an f1 score of 0.8511 on the best training fold and 0.8655 on test data.
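It helps to know how much independent work this search contains, because that is what gets parallelised. The sketch below counts the fits using only the grid and the fold count from the code above. The extra refit assumes that scikit-learn's default refit=True is in effect, and my understanding that sk-dist distributes the same (candidate, fold) units across Spark executors is an assumption rather than something I verified in its source.

```python
from itertools import product

# Grid and fold count taken from the GridSearchCV call above.
param_grid = {"max_depth": [2, 3, 4, 5, 6, 7, 8, 9, 10]}
n_folds = 10

# Every combination of parameter values is one candidate model.
candidates = list(product(*param_grid.values()))

# Each candidate is fitted once per fold; these fits are independent.
cv_fits = len(candidates) * n_folds

# With refit=True (the scikit-learn default), the best candidate is
# fitted once more on the full training set.
total_fits = cv_fits + 1

print(len(candidates), cv_fits, total_fits)
```

With n_jobs=10, the 90 cross-validation fits are processed at most ten at a time, while a cluster with up to 32 worker cores can in principle run more of them concurrently.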

Training sk-dist model

I trained the sk-dist model using the following code.

import sklearn.tree
from skdist.distribute.search import DistGridSearchCV

sd_estimator = sklearn.tree.DecisionTreeClassifier(random_state=0)
skdist_model = DistGridSearchCV(
    estimator=sd_estimator,
    param_grid={"max_depth": [2, 3, 4, 5, 6, 7, 8, 9, 10]},
    cv=10,
    scoring="f1_weighted",
    sc=spark.sparkContext
)

The model gave the same score as the sklearn_model.

Training spark.ml model

To train the spark.ml model, I used the following code. The model produced an f1 score of 0.8554 on the best training fold and 0.8677 on test data. In other words, it scored about the same as the other two models.

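from pyspark.ml import Pipeline
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.feature import StringIndexer
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

def get_pyspark_validator():
    # Index the labels, then fit a decision tree on the indexed labels.
    idxr = StringIndexer(inputCol="label", outputCol="indexedLabel")
    clfr = DecisionTreeClassifier(labelCol="indexedLabel")
    estimator = Pipeline(stages=[idxr, clfr])
    # The evaluator's default metric is f1.
    mce = MulticlassClassificationEvaluator(labelCol="indexedLabel")
    # Same max depth grid as the scikit-learn and sk-dist models.
    grid = ParamGridBuilder()\
        .addGrid(clfr.maxDepth, [2, 3, 4, 5, 6, 7, 8, 9, 10])\
        .build()
    validator = CrossValidator(estimator=estimator,
                               evaluator=mce,
                               estimatorParamMaps=grid,
                               numFolds=10)
    return validator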

Comparing run times

To compare the run times, I created all three models using the code above and fitted each of them on the training dataset 50 times. I used Python’s timeit module to measure execution times.
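A timing loop of that kind could look like the sketch below. It is only an illustration: fit_once is a hypothetical stand-in workload, where the experiment would call something like model.fit on the training data, and I use 5 trials instead of 50 to keep it quick. Whether the original measurements used timeit.repeat or another timeit entry point is an assumption on my part.

```python
import timeit


def fit_once():
    # Stand-in workload; in the experiment this would be a model fit.
    return sum(i * i for i in range(10_000))


n_trials = 5  # the experiment used 50 trials

# repeat() returns one wall-clock duration per trial;
# number=1 means the workload runs once within each trial.
durations = timeit.repeat(fit_once, repeat=n_trials, number=1)

mean_seconds = sum(durations) / len(durations)
print(f"{mean_seconds:.6f} seconds per trial over {n_trials} trials")
```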

The scikit-learn model took about 261 seconds per trial, the spark.ml model took 391 seconds per trial, and the sk-dist model took 78 seconds per trial. Clearly, sk-dist gave the best performance: roughly 3.3x faster than scikit-learn and 5x faster than spark.ml. However, it is definitely not 100x faster.
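The relative speed-ups follow directly from the per-trial times reported above. The arithmetic is sketched here so that the ratios can be checked:

```python
# Per-trial run times in seconds, as reported above.
seconds_per_trial = {"scikit-learn": 261, "spark.ml": 391, "sk-dist": 78}

baseline = seconds_per_trial["sk-dist"]

# How many times longer each approach took compared with sk-dist.
ratios = {name: secs / baseline for name, secs in seconds_per_trial.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x the sk-dist run time")
```

In this setup, sk-dist is a single-digit multiple faster than the alternatives, which is a long way from 100x.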

One caveat: Databricks automatically tracks model parameters and metrics in MLflow when CrossValidator is used. I have not found a way to disable this functionality, so it adds to the spark.ml execution time, and it is not easy to figure out by how much.

Afterthoughts

Does sk-dist speed up scikit-learn training significantly? Yes.
Does sk-dist provide the speed-up that the article Train sklearn 100x faster states? Very difficult to say. The author has not provided sufficient details about the results to replicate the same experiment. Please provide more details next time!
Is it interesting to look into sk-dist? Definitely.
Is such an imperfect experiment useful? Yes. In the real world, it is difficult and often impractical to run rigorous tests. A good-enough test is much better than no test. However, the results of such a test should be taken with a grain of salt. They should open up possibilities for further tests, not provide definite conclusions.
Was this late-night/early-morning exploration, supported by orange juice and a baloney sandwich, with whatever happened in Netflix’s Marco Polo season 2 as background noise, worthwhile? Hell yeah. I increased my knowledge of Spark and scikit-learn fractionally. What an adrenaline rush!

Disclaimer

The full source code can be obtained from here. I may have made mistakes. No unreviewed code guarantees absolute accuracy. Feel free to poke holes in my work. It will be a pleasure to fix my mistakes.

Written by Misbah Uddin