Parallel computing in Python using Dask

Parallel computing is an architecture in which several processors execute or process an application or computation simultaneously. Parallel computing helps in performing extensive calculations by dividing the workload between more than one processor, all of which work through the calculation simultaneously. The primary goal of parallel computing is to increase available computation power for faster application processing and problem solving. In sequential computing, all the instructions run one after another without overlaping with each other, whereas in parallel computing instructions run in parallel to complete the given task more faster.

Applications of Parallel Computing

Following are some of the applications of parallel computing :

Database and Data Mining
Extensive Graphics Processing like Video Games, Augmented Reality
Medical Imaging & Diagnosis
Web search engines
Electronic Circuit Design

Dask is a free and open-source library to achieve parallel computing in Python. It works well with all the popular python libraries like Pandas, Numpy, scikit-learns etc. In pandas, we can’t handle very large datasets(unless we have plenty of RAM) because they use a lot of memory. It is not possible to use Pandas or Numpy for our data processing tasks when our data scales. We may try other distributed computing libraries like Apache Spark but those are not written in Python so we won’t be able to utlize pandas, numpy or other machine learning libraries. But Dask provides dataframes which can handle large datasets with minimal code changes. Its scalability with our data in terms of processing time or memory usage gives it advantage over other libraries. Dask helps us to utilize the power of distributed computing with seamless integration with other data science libraries like pandas or numpy.

Comparison to other frameworks

Dask vs Spark - Spark is a quite popular name in the domain of distributed computing. In comparison to Spark, Dask is light weight and smaller that means it has limited features. one major difference is that Spark is written in Scala and supports python using wrapper library, whereas Dask is written in Python and couples with almost all python libraries for data analysis like pandas, numpy, and scikit-learn. Spark doesn’t support for multi-dimentional arrays whereas Dask supports scalable multi-dimentional arrays. Both scale from a single node to thousand-node clusters. If you want to write your project in Python and looking to add machine learning in it then you can go with Dask. If you are a scala or SQL developer, working on JVM infrastructure and doing mostly ETL part then you can go with Spark. In terms of ease of use Dask is far easier to implement than Spark.
Dask vs Ray vs Modin - Dask and Ray differ in their scheduling mechanisms. Dask uses a centralized scheduler which handles all tasks for a cluster. Ray is decentralized, that means every machine runs its own scheduler, so that scheduled task issues are handled at individual machine level, not at the level of the whole cluster. Dask has extensive high-level collections APIs (e.g., dataframes, distributed arrays, etc), whereas Ray does not. Modin on the other hand runs on top of Dask or Ray. We can easily scale our pandas workflow through Modin by just adding a single line of code import modin.pandas as pd Dask DataFrame does not scale the entire pandas API but Modin attempts to parallelize as much of the pandas API as is possible. Dask DataFrame has row-based partitioning, similar to Spark. Modin is more of a column-store, which we inherited from modern database systems. Amongst developers Ray and Dask are more popular as they are old and mature.

Use Cases of Dask

Dask use cases are divided into two parts -

Dynamic task scheduling - which helps us to optimize our computations.
“Big Data” collections - like parallel arrays and dataframes to handle large datasets.

Dask collections are used to create Task Graph which is nothing but the structure of our data processing tasks. It is a visual representation of our tasks. Dask schedulers executes this task graph.

Dask uses parallel programming to execute the tasks. Parallel programming means to execute multiple task at the same time. This way we can properly utlize our resources and finish multiple steps simultaneously.

Let’s see some of the data collections provided by Dask.

dask.array - It uses the Numpy interface and cuts down the large array into smaller ones so that we can compute on arrays larger than our system’s memory.
dask.bag - It provides operations like map, filter, fold, and groupby on collections of generic Python objects.
dask.dataframe - distributed dataframes like pandas. It is a large parallel dataframe made of many smaller dataframes.

Installation

Dask can be installed using following command -

pip install dask[complete]

Let’s see some examples of parallel computing using this library. Dask.delayed is used to achieve parallelism in our code. It delays function calls into a task graph with dependencies.

import time
import random

def profit(x,y):
    time.sleep(random.random())
    return x + y

def loss(x,y):
    time.sleep(random.random())
    return x - y

def total(x, y):
    time.sleep(random.random())
    return x + y

%%time
x = profit(45,10)
y = loss(20,15)
z = total(x, y)
z

Output will be -

CPU times: user 1.33 ms, sys: 2.16 ms, total: 3.49 ms
Wall time: 1.29 s

60

These functions will run sequencely one after another, though they are not dependent on each other. So we can run them parallely to save time.

import dask
profit = dask.delayed(profit)
loss = dask.delayed(loss)
total = dask.delayed(total)

%%time
x = profit(45,10)
y = loss(20,15)
z = total(x, y)
z

Output will be -

CPU times: user 903 µs, sys: 0 ns, total: 903 µs
Wall time: 557 µs

Delayed('total-b0bcbf09-79e4-4dfb-9b91-0f69af46f759')

Even in this small example also we can see the improvement of running time. We can also visualize the task graph like this -

z.visualize(rankdir='LR')

It will look something like this -

task graph

That’s all for now. You should definitely try this library in your next project. Happy coding :)