Introducing Koalas: Pandas API for Apache Spark

3rd May 2019


Apache Spark is a framework for distributed computing that has seen a lot of development in recent years, and it is now a core part of working with big data. How does the release of Koalas address some of the remaining friction?

While Spark is written in Scala, many data scientists interact with it through Python, since so much of the data science toolkit lives there. For many of them, the move to Spark hasn’t always been easy.

The fundamental data structure in Spark is the RDD (resilient distributed dataset). For many who were used to tidy tables in other frameworks, the syntax required to operate on these objects was difficult and unintuitive. In response, Spark introduced the DataFrame API, which gives Spark a tabular interface with named columns.

However, there was still some friction. While the concept wasn’t entirely foreign to data scientists, slight syntactical changes (such as using .withColumn to add or transform a column) slowed adoption among those comfortable only with Pandas in Python. To combat this, the company Databricks (closely affiliated with Spark) has just released Koalas: Pandas API on Apache Spark.

Besides having a cute name, this library aims to let you do in Spark much of what you would normally do with pandas, while continuing to take advantage of the distributed computing environment.
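For a concrete picture of the syntax gap Koalas is meant to close, here is a minimal sketch of one derived-column step. The `orders` table and its column names are made up for illustration. Only the pandas half runs; the PySpark half is left as a comment so the snippet works without a cluster, and it assumes a Spark DataFrame named `sdf` with the same columns.

```python
import pandas as pd

# Hypothetical order data, used only for illustration.
orders = pd.DataFrame({"price": [2.0, 3.5], "qty": [3, 2]})

# pandas: a derived column is a plain assignment.
orders["total"] = orders["price"] * orders["qty"]

# PySpark DataFrame equivalent (not run here; assumes `sdf` is a Spark
# DataFrame with the same columns). The same step goes through
# withColumn and column expressions instead of assignment:
#
#   from pyspark.sql import functions as F
#   sdf = sdf.withColumn("total", F.col("price") * F.col("qty"))

print(orders)
```

Neither form is hard on its own, but small differences like this add up when a whole pandas workflow has to be translated.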

Usage

It can be installed with pip and used within the PySpark environment.

pip install koalas

from databricks import koalas as ks

Then, proceed as if you were using Pandas! If you’re operating on a Spark cluster with data too large for one machine, you should be able to run your standard operations in a fraction of the time they would take on a single node.

df = ks.read_csv("my_data.csv")
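As a rough sketch of what "proceed as if you were using Pandas" can look like, the helper below uses only pandas-style calls (column assignment, groupby, mean). The function name, the column names, and the sample data are hypothetical. It is demonstrated on a plain pandas DataFrame so it can be tried without a cluster; passing a Koalas DataFrame instead is the intended use, but that part is an assumption and is not exercised here.

```python
import pandas as pd

def mean_pct_by_group(df):
    """Average `score` (as a percentage) per `group`.

    Written with pandas-style calls only, so the same body is meant to
    accept a Koalas DataFrame as well (assumption: not run here).
    """
    df = df.copy()                            # leave the caller's frame untouched
    df["score_pct"] = df["score"] * 100       # derived column by assignment
    return df.groupby("group")["score_pct"].mean()

# Hypothetical sample data, used only for illustration.
pdf = pd.DataFrame({"group": ["a", "a", "b"], "score": [0.5, 0.75, 0.25]})
result = mean_pct_by_group(pdf)
print(result)

# With Koalas installed and a Spark session available, the assumed
# equivalent would be:
#   from databricks import koalas as ks
#   result = mean_pct_by_group(ks.from_pandas(pdf))
```

Keeping the logic in a function that takes the DataFrame as an argument makes it easy to develop against a small pandas sample and then try the same code on the cluster.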

This is definitely a great step in making big data more accessible for the Python data scientist, and I look forward to seeing this library in action at scale.

Kyle Gallatin

A data scientist with an MS in molecular & biology. He currently helps deploy AI and technology solutions within Pfizer’s Artificial Intelligence Center of Excellence using Python and other computer science frameworks.
