Why is DuckDB so fast and what this means for your FME workflows

Following on from our exploration of FME 2024.1's DuckDB integration, we look at exciting new developments in DuckDB that are opening up even more possibilities for data processing workflows.

In our blog post last year we showed how FME 2024.1's DuckDB integration lets you query massive datasets - such as the 53-million-record Overture Maps Points of Interest dataset - without downloading anything locally. Well, DuckDB has been busy getting even faster and more capable. The latest developments mean your FME workflows can now handle bigger datasets, run faster, and solve problems that were previously out of reach.

Recap: what is DuckDB and why should it be on my radar?

In simple terms, DuckDB is often described as 'SQLite for analytics'. SQLite is a great database for transactional operations (OLTP) on a single machine, but it's not built for complex analytical queries across millions of rows. This is where DuckDB steps in.

DuckDB is an in-process Online Analytical Processing (OLAP) database. Let's break down those terms:

  • In-Process: Like SQLite, it runs inside your application without needing a separate server. This eliminates complex setup, network latency and admin overhead.
  • OLAP: It is specifically designed for analytical queries - think aggregations (SUM, AVG), joins, and window functions over large datasets.

This unique combination means you get the power of a columnar analytical database with the simplicity and portability of a lightweight, serverless tool. For anyone working with huge datasets, this is a big win.
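To make that concrete, here's the kind of query DuckDB is built for - a minimal sketch over a hypothetical `trips` table:

```sql
-- Group-and-aggregate over a hypothetical 'trips' table:
-- the bread-and-butter OLAP query DuckDB is optimised for.
SELECT
    pickup_zone,
    COUNT(*)  AS trip_count,
    AVG(fare) AS avg_fare
FROM trips
GROUP BY pickup_zone
ORDER BY trip_count DESC;
```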

It's also worth mentioning at this point that DuckDB is open source under the MIT license, and it's already built into FME.

Where Does DuckDB Store Data?

DuckDB is very flexible - when you create a connection in FME, you have two choices:

In-Memory: DuckDB can run entirely in-memory. This is perfect for temporary, high-speed analysis where you don't need to persist the results to a database file.

File-Based: You can specify a file path (`.duckdb`). DuckDB will then create a single, portable database file that stores all your tables and data. You can easily share this file or connect to it later.

This dual-mode capability allows it to be used for both ephemeral analysis and as a persistent, serverless data store.
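A rough sketch of the difference (file and table names here are hypothetical) - the SQL is identical in both modes; only the attached database changes:

```sql
-- In-memory (the default when no file path is given): tables live in RAM
-- and disappear when the connection closes.
CREATE TABLE scratch AS
SELECT * FROM read_parquet('input.parquet');

-- File-based: attach a .duckdb file, and anything written into it persists
-- as a single portable database file you can share or reopen later.
ATTACH 'results.duckdb' AS results;
CREATE TABLE results.poi AS SELECT * FROM scratch;
```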

Why is DuckDB so fast while using so few resources?

DuckDB's speed comes down to two main design choices:

Columnar Storage: Traditional databases store data in rows. To get the value from a single column, the database often has to read the entire row, including data from columns it doesn't need. DuckDB stores all the values for a column together. When your query only needs a few columns (which is common in analytics), DuckDB only reads the data for those specific columns, reducing I/O and speeding up the query.

Vectorised Query Execution: Instead of processing data one row at a time, DuckDB processes it in 'vectors' - chunks or batches of values. This approach is far more CPU-efficient, minimising per-row overhead and allowing for much higher throughput.
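A simple illustration of the columnar benefit (table and column names are hypothetical): the query below touches only two columns of a wide table, so DuckDB reads just those column segments - in vector-sized batches - and skips everything else:

```sql
-- However many columns 'sales' has, only 'region' and 'amount'
-- are ever read from storage for this query.
SELECT region, SUM(amount) AS total_sales
FROM sales
GROUP BY region;
```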

As a result, DuckDB can process datasets roughly 10× larger than available RAM, making it a great choice for analytical tasks even on old or limited hardware. The DuckDB team even ran a test on a 2012 MacBook Pro and showed it could perform well querying tables with billions of rows totalling over 265 GB - read the blog here.

What This Means for FME Workflows

Enhanced Cloud-Native Processing

In our previous article, we showed how FME can query GeoParquet files stored on S3 without downloading them. The latest DuckDB improvements make this even more powerful:

  • Smarter Data Filtering: DuckDB now uses 'zone maps' to skip irrelevant data blocks entirely. When you're filtering datasets by date or region, DuckDB automatically ignores chunks of data that can't contain matches. As a result, queries run faster and use less bandwidth (see the sketch after this list).
  • Better Memory Management: Processing those large Overture Maps datasets is now smoother and more reliable, even when running multiple FME data requests simultaneously.
  • Improved Spatial Operations: DuckDB's spatial capabilities keep growing and performance keeps improving - spatial joins are up to 100× faster as of DuckDB 1.3.
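As a sketch of the cloud-native pattern (the bucket path is hypothetical; the columns follow the Overture Maps Places schema), note how the bounding-box filter lets zone maps do their work - row groups whose min/max statistics fall outside the filter are never even fetched:

```sql
-- httpfs lets DuckDB read Parquet straight from S3/HTTPS, no download needed.
INSTALL httpfs;
LOAD httpfs;

-- Zone maps: only row groups whose statistics overlap the filter are read,
-- cutting both runtime and bandwidth.
SELECT id, names, categories
FROM read_parquet('s3://example-bucket/overture/places/*.parquet')
WHERE bbox.xmin BETWEEN 4.7 AND 5.3
  AND bbox.ymin BETWEEN 52.3 AND 52.5;
```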

Streamlined ETL Processes

  • Pre-Processing at Scale: Use DuckDB within FME's SQLExecutor to perform complex aggregations and transformations on massive datasets. This reduces the load on expensive cloud resources and speeds up the FME Workspace.
  • Quality Control Gates: Implement data quality checks using DuckDB's analytical capabilities. Check for outliers, validate relationships between datasets, and ensure data completeness - all within your existing FME workflows (see the sketch after this list).
  • Text Analytics: DuckDB can be used for text analytics by combining keyword, full-text, and semantic search techniques - perfect for quick text analysis in an FME data pipeline using an in-memory DuckDB database. There's a great blog article on this here.
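To illustrate the quality-control idea above: a single pass in a SQLExecutor can produce a compact QC summary. A minimal sketch, assuming a hypothetical `staging_poi` table with `id`, `name`, `lon`, and `lat` columns:

```sql
-- One scan yields several quality metrics at once.
SELECT
    COUNT(*)                                            AS total_rows,
    COUNT(*) - COUNT(DISTINCT id)                       AS duplicate_ids,
    COUNT(*) FILTER (WHERE name IS NULL OR name = '')   AS missing_names,
    COUNT(*) FILTER (WHERE lon NOT BETWEEN -180 AND 180
                        OR lat NOT BETWEEN -90 AND 90)  AS invalid_coords
FROM staging_poi;
```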

Data Pipeline Acceleration

Transform your existing FME workflows:

  • Replace complex multi-step transformations with single SQL operations (see the sketch after this list).
  • Reduce data movement between systems by performing more processing in-place.
  • Implement sophisticated business logic using familiar SQL syntax.
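As an example of the first point (table and column names are hypothetical), one window-function statement can stand in for a whole chain of sorting and grouping transformers:

```sql
-- Rank stores by revenue within each region in a single statement.
SELECT
    region,
    store_id,
    revenue,
    RANK() OVER (PARTITION BY region ORDER BY revenue DESC) AS region_rank
FROM stores;
```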

One way to further accelerate FME workflows would be to have the InlineQuerier transformer work with DuckDB as an alternative to SQLite - check out Oliver Morris' idea on the FME Ideas page and maybe give it an upvote :)

When to use DuckDB in FME?

While DuckDB is powerful, it's not a replacement for every database. Knowing when and where to deploy it is key to unlocking its potential.

At Avineon Tensing, we integrate DuckDB in several key scenarios to enhance our clients' data processes:

  • Interactive Data Analysis: DuckDB offers a lightning-fast way to explore and analyse local Parquet, CSV, or JSON files using familiar SQL syntax (see the sketch after this list).
  • ETL/ELT Pipeline Enhancement: We use DuckDB as a high-speed transformation engine within larger data pipelines. It can perform complex aggregations and joins on data in-flight before it's loaded into central data warehouses.
  • Data Validation and Quality Control: Before loading data into a production system or feeding it into an AI model, we can use DuckDB to run rapid quality checks and validation queries on raw files, ensuring data integrity without complex infrastructure.
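For the interactive-analysis case, for example, DuckDB queries raw files in place and infers the schema automatically - a minimal sketch with a hypothetical CSV:

```sql
-- No import step: the CSV is queried directly, schema inferred on the fly.
SELECT category, COUNT(*) AS n
FROM read_csv_auto('field_observations.csv')
GROUP BY category
ORDER BY n DESC
LIMIT 10;
```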

Integrating Tools and AI for Smarter Processes

DuckDB is most powerful when it is used in a data workflow orchestrated by FME, where it becomes one component in a larger pipeline. By combining DuckDB's speed with FME and our capabilities in AI and machine learning, we create pipelines that are more than just fast - they're smart.

Imagine a workflow where we use DuckDB to instantly pre-process and validate geospatial data from multiple sources before feeding it into an AI model to augment and classify the data. These are the types of data pipelines we are developing.

If you're looking to modernise your data analytics capabilities and make your data processes more efficient, let's talk.