Intentando.dev

Learning C++ through DuckDB

Jumping headfirst into C++ from Python and JS. What could go wrong?

Ok, so 99% of the time I work from high-level interpreted languages. Python and JavaScript get most of my attention and they have enough tools to solve my problems. One of these tools is DuckDB. It’s fast and really ergonomic, and I’ve been wanting to mess with its internals to test some things out. However, access to the inner workings is limited to the C++ “not designed as a stable user-facing” API.

Also, I have never written any C++ from scratch. Nor set up a compiler… [1]

Well, better late than never! Let’s try to figure this out so we can use some DuckDB APIs. I’ll be noting here some steps I take to set everything up and run a “hello world” example from DuckDB’s C++ API.

(Before) getting started

Apparently there are many ways to compile C++ and each varies by system, so first things first: my system runs macOS 12.4. Further, for this tutorial(?), I’ll be using the g++ compiler. Check if you have it handy by opening a terminal and running either which g++ or g++ -v. I don’t think this came pre-installed, and the internet says you may or may not need to install Xcode (or at least its Command Line Tools, via xcode-select --install) to get g++ on a Mac. Do double-check any installation steps for your specific system and OS.

For my editor, I’m using VS Code but I haven’t prepared any fancy shortcuts to compile and run the code for us. The editor is mostly for the good ol’ syntax highlighting features. Our terminal will help us with the rest.

Create the project

Create a new folder named however you like (e.g. ‘duckdb_cpp_hello_world’). Next, open this folder in a new VS Code window. That’s it!

Getting DuckDB installed

It’s literally 3 files downloaded from the internet. There’s something relaxing about just copy-pasting files somewhere once and done, instead of hoping pip install duckdb sticks the landing [2]. Go to the DuckDB Installation page and download the stable release version for the ‘C/C++’ environment of your platform.

This download is a ZIP file which contains three files when extracted:

  1. duckdb.h
  2. duckdb.hpp
  3. libduckdb.dylib (or .so, depending on your platform)

Paste these files in the root of your project folder. That folder should look like this:

duckdb_cpp_hello_world/
├── duckdb.h
├── duckdb.hpp
└── libduckdb.dylib

The duckdb.h and duckdb.hpp files are header files for C and C++, respectively. It’s a bit confusing at first, but the convention is sort of this: the more repeated letters the extension has, the likelier it is that the file is for C++ instead of C. So, files ending with .h and .c are for the C language while .hpp, .cpp, and .cc are for use within C++. Since we are writing our script in C++, feel free to delete the C-based duckdb.h file so we don’t accidentally include it later.

This duckdb.hpp file contains declarations for the functions we call. It’s like a telephone book for everything we could possibly import and call. We’ll bring it into our script later on so the compiler knows what is available to it.

However, the library itself, the pre-compiled code that gets called, is within the libduckdb.dylib file. The “dylib” stands for “dynamic library”. Other platforms may use the ‘.so’ extension, which means “shared object”. The “dynamic” means that the library is kept separate from our final program and only loaded when we run our program. If our program were a computer, then our dynamic library is like a computer mouse: you can connect or “link” it via USB after the computer is built. A “static” library, on the other hand, is actually inside the computer case, like the hard drive or RAM. It is “linked” into the monolith when the PC is built or “compiled”. Dynamic? A computer peripheral. Static? Inside the computer case [3].

As far as “installing” this library goes, this is it. Copying the files is all that was needed. We’ll work on importing and linking the library after we write our first script.

Hello world!

Before we try importing DuckDB, let’s go over how “hello world!” is written. First, in Python, it’s a one-liner script:

# hello_world.py
print('hello world!')

In C++, however, there’s a bit more boilerplate to get going:

// hello_world.cpp
#include <iostream>

int main() {
	std::cout << "hello world!" << std::endl;
	return 0;
}

What is all this?? There are 3 pieces to this code: the #include directive, the main() function, and whatever the line with the std:: calls is doing to print “hello world!”.

The #include directive imports libraries into our scripts. Here, it imports the input/output streams (I/O streams) which are part of the C++ standard library’s <iostream> header. These are the variables that allow us to see our messages in the terminal.

For example, each part of std::cout has a meaning. The std means “standard library” and, specifically, the standard library’s namespace. The cout is short for “character output”, and it is the stream connected to what’s also known as “stdout”. When we “print” a message, we write it to the output stream that our terminal displays. So, std::cout refers to the output stream defined under the standard library’s namespace. What are namespaces? I don’t really know yet, except that they manage the scope of objects we can access. In Python, some unorthodox examples with a similar vibe could be:

  1. When importing a library
import os # 'os' is the "namespace"
os.listdir # "listdir" is the object under the 'os' "namespace"
  2. Class attributes
class my_namespace:
	some_object = 123
my_namespace.some_object # "some_object" is under "my_namespace"
  3. Dictionary key/values
my_namespace = {
	'some_object': 123
}
my_namespace['some_object']

Why do we use the std namespace instead of the <iostream> we actually included in our script? As far as I understand, it’s because they are two different things: <iostream> is a header file, while std is the namespace that the standard library’s names live in. Including <iostream> brings in only the declarations for the input/output streams, rather than everything under the standard library’s umbrella, and those declarations place cout and endl inside std.

While we’re on the topic of streams, how does this line print a message?

std::cout << "hello world!" << std::endl;

We said std::cout is the output stream which prints to our terminal. What does the double “less-than” symbol (<<) mean? It’s called the “insertion operator” which “sends bytes to an output stream object”. I like to think of it as an arrow where information flows into whatever the insertion operator is pointing towards. Here, our “hello world!” message is printed out by “inserting” it into our output stream! Finally, the endl sounds like “end line” and does two things: it inserts a newline character, so anything newer gets written on a fresh line, and it “flushes” the stream, which pushes everything inserted so far out to the terminal. So to print out two lines you could use:

std::cout << "the first line" << std::endl << "another line!" << std::endl;

The messages we insert into the output stream can sit in a “waiting room” (a buffer) until the stream is flushed. An endl flushes it, and so does the program ending normally, so closing with one is a safe habit rather than a strict requirement.

Ok, almost done! The last bit is the main() function. A barebones example would be:

int main() {
	return 0;
}

Every C++ script needs a main function!! All scripts automatically call this function when we run them. The int prefixing our function name is the return type, and for main it has to be an integer. Here we return 0, so it matches. You can return any integer but usually zero means that the program concluded successfully and without errors. In Python, it could look like this:

def main() -> int:
	return 0

main()

Whereas in Python the int annotation is optional, in C++ it is necessary. Also note we need to call main() manually in Python.

Let’s review how all the pieces fit together:

// hello_world.cpp
#include <iostream>

int main() {
	std::cout << "hello world!" << std::endl;
	return 0;
}

  1. The #include <iostream> brings in the declarations for the standard library’s output stream we’ll be printing our messages to
  2. The main() function is the entry point of our program and must return an integer
  3. std::cout and std::endl write and send our message, respectively, to the output stream
  4. The return zero successfully ends the program

That’s it! Next, we’ll go over compiling and running the code.

Compiling “Hello world!”

C++ is a “compiled” language whereas Python is an “interpreted” one. This means our computers run their scripts very differently. Python opens our script and starts running lines one-by-one, “interpreting” whatever is going on as it progresses. Meanwhile, C++ needs to “compile” everything together before anything runs at all.

If Python were a chef, it would be an impromptu cook: it would start following the steps of a recipe until it either finishes the dish or stops because it doesn’t have all the required ingredients. C++ would scrutinize the recipe first and double-check it has the necessary ingredients and equipment. As in life, planning takes more time than winging it.

Here’s how we tell C++ to compile our script:

g++ hello_world.cpp -o hello_world

Run this in a terminal located in the same folder as our project. The g++ command calls our C++ compiler (specifically, the GNU C++ Compiler; on a Mac the g++ command may actually be Apple’s clang under the hood, and g++ -v will tell you which). Next, we specify that hello_world.cpp is the script we wish to compile. Finally, -o hello_world indicates that we want our output program executable (-o) to have hello_world as its filename.

Counterintuitive but important: When this compilation goes right, it prints nothing to the terminal! Unlike Python, which runs the code as it interprets it, compilation just builds the program executable. We will only see the message when we run this compiled program. As you iterate writing your own scripts, double-check whether you are compiling or running a program when you bump into something unexpected.

If this compilation completes successfully, you can run the finished program using:

./hello_world

which prints out:

hello world!

If you use the && operator, you can combine both steps into one:

g++ hello_world.cpp -o hello_world && ./hello_world

Notice that our command is getting messy already compared to python hello_world.py. The readability of compiler commands tends to go downhill pretty quickly, and I believe it is worth your time to familiarize yourself with what different arguments of g++ do.

And with that we have hello world! Next, we move onto “hello duckdb!”

Hello duckdb!

We are going to send a “hello duckdb!” message from inside a DuckDB query. First, we need to go over how we write and run the corresponding SQL query. Then, we must tell C++ to print the query result to the terminal.

To handily test SQL queries, use either the DuckDB CLI client or the online shell. Our message will use this query:

select 'hello duckdb!' as message;
┌───────────────┐
│ message       │
╞═══════════════╡
│ hello duckdb! │
└───────────────┘

Now, running this query and printing the result in C++ is done as follows:

// hello_duckdb.cpp
#include "duckdb.hpp"

int main () {
    duckdb::DuckDB db(nullptr);
    duckdb::Connection con(db);
    auto result = con.Query("SELECT 'hello duckdb!' as message");
    result->Print();
    return 0;
}

Before doing anything, we must include the DuckDB header file we downloaded earlier. Next, we create a new database called db (passing nullptr instead of a file path makes it an in-memory database) and create a connection to it. We send our query through the connection and let C++ automatically set the result variable’s type. Finally, we ask the query result to print itself to the terminal. Then we return 0 and we’re done!

Compiling this script, however, is more complicated than our first script. We also need to tell the compiler where our DuckDB library files are located, and what version of C++ to use.

g++ -std=c++11 -L. -lduckdb hello_duckdb.cpp -o hello_duckdb

The -std=c++11 argument tells the compiler to use the C++11 standard which DuckDB uses. I believe DuckDB also supports newer versions of C++ but double-check that before attempting it. The -L. option specifies that there are library files (e.g. libduckdb.dylib) stored in the current directory (.). Last but not least, -lduckdb means that we should link the duckdb library. Specifically, it looks for a library file named libduckdb plus an extension [4]. If you are on Linux and the linker complains about undefined references, try moving -L. -lduckdb after hello_duckdb.cpp, since some linkers care about the order.

Run the program with:

./hello_duckdb
message
VARCHAR
[ Rows: 1]
hello duckdb!

If macOS complains about libduckdb.dylib with “macOS cannot verify that this app is free from malware”, then go into your Mac’s “System Preferences” > “Security & Privacy” > “General” and click on the button asking whether to allow libduckdb.dylib to run.

And remember you can use the && operator to just use one command:

g++ -std=c++11 -L. -lduckdb hello_duckdb.cpp -o hello_duckdb && ./hello_duckdb

That’s it! You’ve run and outputted a query using DuckDB’s C++ API. This should be enough to get you started. I’m planning to write a follow-up post after I experiment with these DuckDB functions more and find some fun use cases.

Footnotes

  1. They don’t have these in Python.

  2. Until we start linking libraries and it gets chaotic real quick.

  3. https://www.reddit.com/r/buildapc/comments/uy95m6/do_i_have_to_buy_an_anti_static_wrist_wrap/

  4. The extension of the library can be different depending on the system or type of library we are importing. Some extensions include .so, .dylib, and .a.