Apache Spark is a framework for distributed computing that has seen rapid development in recent years, and it is now an indispensable part of modern big-data work. How does the release of Koalas address some of its remaining friction points?
While Spark is written in Scala, most data scientists interact with Spark through Python given the nature of the data science toolkit. For many data scientists, this transition hasn’t always been easy.
The fundamental data structure in Spark is the RDD (resilient distributed dataset). For many users accustomed to working with tabular data in other frameworks, the syntax required to perform operations on RDDs was unintuitive. In response, Spark introduced the DataFrame API – essentially bringing dataframes to Spark.
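The contrast can be sketched side by side. This is a minimal illustration (the data and column names are invented, and the Spark half is guarded so the snippet still runs without a Spark install):

```python
rows = [("a", 1), ("b", 2), ("a", 3)]

try:
    from pyspark.sql import SparkSession
    spark = SparkSession.builder.master("local[1]").getOrCreate()

    # RDD style: functional transformations on raw tuples, no named columns.
    rdd = spark.sparkContext.parallelize(rows)
    total = rdd.map(lambda kv: kv[1]).reduce(lambda a, b: a + b)

    # DataFrame style: tabular, and far closer to pandas/SQL habits.
    spark.createDataFrame(rdd, ["key", "value"]).groupBy("key").sum("value").show()
except Exception:
    # Without a working Spark install, the same RDD-style map/reduce in plain Python:
    from functools import reduce
    total = reduce(lambda a, b: a + b, (v for _, v in rows))

print(total)  # 6 either way
```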
However, some friction remained. While the concept wasn’t entirely foreign to data scientists, small syntactical differences (such as using
.withColumn to add or transform columns) slowed adoption among those comfortable only with pandas in Python. To address this, Databricks (the company closely affiliated with Spark) has just released Koalas: a pandas API on Apache Spark.
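That friction is easiest to see side by side. Here is a minimal sketch (values are made up for illustration; the Spark half is guarded so the snippet still runs without a Spark install):

```python
import pandas as pd

# pandas: adding a derived column is a plain assignment.
pdf = pd.DataFrame({"price": [10.0, 20.0], "qty": [2, 3]})
pdf["total"] = pdf["price"] * pdf["qty"]

# Spark DataFrame API: the same step goes through .withColumn,
# because Spark DataFrames are immutable.
try:
    from pyspark.sql import SparkSession
    spark = SparkSession.builder.master("local[1]").getOrCreate()
    sdf = spark.createDataFrame(pdf[["price", "qty"]])
    sdf = sdf.withColumn("total", sdf.price * sdf.qty)
    sdf.show()
except Exception:
    pass  # no Spark available; the pandas half above still demonstrates the contrast
```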
Besides having a cute name, this library lets you do much of what you would normally do with pandas, while continuing to take advantage of Spark’s distributed computing environment.
It can be easily installed with
pip and used comfortably within the PySpark environment:
pip install koalas
from databricks import koalas as ks
Then, proceed as if you were using pandas! If you’re operating on a Spark cluster, you should be able to perform your standard operations in a fraction of the time they would take on a single node.
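Because Koalas mirrors the pandas API, the same code can run under either library. A minimal sketch of the workflow (column names and values are invented for illustration; the import is guarded so the snippet falls back to plain pandas when Spark/Koalas is unavailable):

```python
try:
    from databricks import koalas as xd  # use Koalas when Spark is available
except Exception:
    import pandas as xd                  # identical code path in plain pandas

df = xd.DataFrame({"item": ["a", "b", "a"], "price": [10.0, 20.0, 30.0]})
df["discounted"] = df["price"] * 0.9          # pandas-style column assignment
totals = df.groupby("item")["price"].sum()    # pandas-style groupby
print(totals.sort_index())
```

The point of the try/except here is exactly the library’s pitch: the body of the code does not change between pandas and Koalas.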
This is definitely a great step in making big data more accessible for the Python data scientist, and I look forward to seeing this library in action at scale.