Using Arrow & DuckDB to Handle Large Data

How to Use Arrow & DuckDB in RStudio
Published

May 14, 2024

1 Introduction

Many times, we deal with large datasets such that the memory (e.g., RAM) of the machine is not enough to load or analyze the data. In such circumstances, using a database like PostgreSQL comes very handy (One can go over here to learn how to set up a PostgreSQL server). However, using a database requires knowledge of setting up database server and SQL and many people are not familiar with those kinds of knowledge. In such circumstance, using Arrow and DuckDB could be very much helpful to derive the benefits of a database.

2 What is Arrow?

Apache Arrow is a development platform for in-memory analytics. It contains a set of technologies that enable big data systems to process and move data fast. According to Apache Arrow - “Apache Arrow defines a language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware like CPUs and GPUs. The Arrow memory format also supports zero-copy reads for lightning-fast data access without serialization overhead.”

Apache Arrow is used for handling big data generated by the Internet of Things and large scale applications. Its flexibility, columnar memory format and standard data interchange offers the most effective way to represent dynamic datasets. Apache Arrow performance does more than just speed up a big data project — it can handle multiple projects by acting as a common data interchange mechanism.

3 What is DuckDB

DuckDB is a high-performance analytical database system. It is designed to be fast, reliable, portable, and easy to use. DuckDB provides a rich SQL dialect, with support far beyond basic SQL. Hannes Muhleisen and Mark Raasveldt created DuckDB, with the first version released in 2019. DuckDB is a free, open-source embedded analytical database.

Using database systems was the better option in a data science role because they offer ACID properties, they offer fast and flexible query execution, and offer integrity checks. However, there are major issues when using Relational Database Management Systems (RDBMS), such as being difficult to install, set up, and maintain. Data transferring can be slow to transfer to and from the client, and they have a poor interface with the client application. DuckDB addresses all the problems that an RDBMS has so that a data scientist or data analyst can have a world without the difficulty in setup but still have faster queries.

4 Using Both Arrow and DuckDB in RStudio

I have a dataset (a csv file) containing more than 30 million (exactly 30,623,729 rows) rows and 17 variables. Now loading the dataset is memory-intensive; therefore, we can easily use arrow and DuckDB to process the dataset.

4.1 Loading Necessary R Packages

# Loading or installing necessary packages 
#install.packages('arrow')
library(arrow)
#install.packages('duckdb')
library(duckdb)
library(DBI)
library(dbplyr)
library(fs)
library(tidyverse)

4.2 Converting the Big Data Files into Arrow Object

#####################################################
# Arrow Object 
######################################################
df_arrow = arrow::open_dataset("path_to_the_file/name_of_the_file.csv",
                               format = 'csv',
                               col_types = schema(year = as.numeric())
                               )

df_arrow %>% object.size()

df_arrow %>% glimpse()

4.3 Sending the Arrow Object to DuckDB Database

#####################################################
# Using duckdb with arrow
#####################################################

# Create Directory 
duckdb_dir <- path(here::here('DATA'), "duckdb_df")
if (!dir.exists(duckdb_dir)) {
  dir.create(duckdb_dir)
}

# Create Connection on DuckDB
con_duckdb <- dbConnect(
  drv = duckdb(),
  dbdir = path(duckdb_dir, "duckdb.ddb")
)

# Sending Arrow Object in DuckDB Database
con_duckdb %>% 
  dbWriteTable("name_of_the_table_in_duckdb", collect(df_arrow), overwrite = TRUE)

## TO see all schemas in the Connection
DBI::dbListObjects(con_duckdb)

4.4 Manipulating (Extracting) Data from DuckDB Database

df_duckdb = con_duckdb %>% 
  tbl("name_of_the_table_in_duckdb")

df_duckdb %>% 
  glimpse()

df_duckdb %>% 
  count() %>% 
  pull() # 30,623,729 Observations 



df_duckdb %>% 
  select(year, cusip, ticker) %>% 
  filter(year %in% 2005:2022) %>% 
  filter(ticker == 'CBA') %>% 
  collect()

5 DIY Project using R, Arrow and DuckDB

You can download a dataset of item chekouts from Seattle Public Library. Click the “Export” button on the top right corner and save the file as csv file. The dataset contains more than 45 million rows and 12 columns. You can download the dataset and try to manipulate it using both Arrow and DuckDB.

6 Conclusion

Using Arrow and DuckDB along with R makes it far easier nowadays to manage and analyze large volume of data.