Maintaining DBI, 3/4

Summarizing the progress of 2020
Author

Kirill Müller

Published

January 20, 2021

What is DBI?

DBI stands for database interface. The DBI package connects R to database management systems (DBMS). The goal of DBI is to provide a common interface for database access, regardless of the specific underlying DBMS. DBI works with a variety of DBMS, such as Postgres, MariaDB, and SQLite, through dedicated backend packages. For first-time users I recommend starting with the new introductory tutorial.

The current version of DBI is 1.1.1. This blog post attempts to define the scope of the DBI project, summarizes recent developments in DBI and related packages, and showcases future work.

Scope of the DBI project

The DBI package is perfect for anyone looking to connect to a database, read/write entire tables, and/or execute SQL queries. DBI gives a direct access to the database driver, leaving more sophisticated data query and manipulation tasks to packages like dbplyr, dbx and rquery.

The core DBI project in R provides an interface for databases, specified in textual form and via automated tests. The DBI specification contains a detailed description of the methods provided by DBI. In summary, the interface covers:

  • Discovery of tables, also in schemas

  • Reading/writing/creating/removing tables

  • Executing queries, fetching data (with parameters)

  • Safe quoting: low-level composition of queries

  • Transactions

DBI should provide a way to ingest data of any type into R, at least in serialized form (e.g. string or blob). It should offer a robust reliable interface for dependent packages; anything beyond this scope should be left to packages that extend DBI:

  • arkdb: archival of database data

  • connection: integrate database connections with the RStudio IDE

  • dbplyr and rquery: generation of SQL queries

  • dbx: DBI extension for data manipulation

  • dittodb: mocking for databases

  • dm: relational data models (via dbplyr)

  • pool: connection pooling

  • sqlr: schema definition

and many more.

Recent developments in DBI

This section discusses:

  • the new DBI tutorials,

  • improvements for datetime data,

  • other notable changes,

  • the move to GitHub Actions.

The first three items directly affect DBI users, the last item much less so. It is nevertheless an important investment in the stability of the DBI infrastructure.

New tutorials

James Wondrasek substantially expanded the “Introduction to DBI” article and added a second article. DBI now features two tutorials. The introduction includes a walkthrough that describes connecting and querying a real database. The “Advanced DBI usage” tutorial shows more advanced examples of quoting and parameter binding. The tutorials are an important first-hand resource for new users.

Time zones

To date, it was only possible to work reliably with time zones when the database connection represented all times in UTC. This poses a few problems in practice:

  • Not all databases store timestamps as UTC or with time zone offset, often local time is assumed by the data model.

  • Other systems often use the default setting for time zone, this harms interoperability of DBI in these cases.

  • Conversion of timestamps to dates via the SQL function DATE is only correct when the session time zone is set correctly.

RMariaDB 1.1.0 and RPostgres 1.3.0 gained more robust support for datetime values. As proposed in the previous blog post, new arguments timezone and timezone_out were added. Both arguments should use Olson names such as Europe/Berlin or America/New_York, not time offsets like +01:00; the latter may change with daylight time savings season. If timezone is set to NULL, an attempt is made to detect the correct time zone on the database. Thanks to Philipp Schauberger for contributing the initial timezone argument for RMariaDB.

RSQLite does not natively support dates or times. A promising pull request is underway that implements support for treating numeric values as time offsets if the column type is declared in a specific way.

Notable changes to DBI backends

The following package versions were sent to CRAN since the last blog post:

  • DBI 1.1.0 -> 1.1.1 (NEWS)
  • RMariaDB 1.0.8 -> 1.1.0 (NEWS)
  • RPostgres 1.2.0 -> 1.3.1 (NEWS)
  • RSQLite 2.1.5 -> 2.2.2 (NEWS)

Highlights are:

  • DBI: Two new tutorials; minor improvements to dbQuoteLiteral(), this is relevant for backends that don’t provide their own implementation.

  • RMariaDB: Better handling of data types and character encoding; minor tweaks to dbBind() and dbQuoteLiteral().

  • RPostgres: The new Redshift() driver that allows downstream packages to distinguish between Postgres and Amazon RedShift (thanks Hadley Wickham); minor improvements for querying and passing date and time types, postgresWaitForNotify() contributed by Jamie Lentin.

  • RSQLite: dbAppendTable() is faster, strings and blobs can have virtually unlimited size (limit 2 GB), embedded SQLite library is now in version 3.34.

  • DBItest: understanding which tests failed is now simpler, also thanks to simpler backtraces; test_some() integrates with the dblog package and shows DBI methods called; established compatibility with testthat 3.0.0; better and more robust tests.

  • RKazam: Is now a template repository

Thanks to Jeroen Ooms for maintaining Windows versions for the database libraries.

QA and automation

Automated tests are a crucial part of modern software engineering. These are often augmented with continuous integration (CI) services that run these tests regularly or with every change to the code. When I started working on DBI, Travis CI offered excellent continuous integration services for open-source repositories. Unfortunately, this is no longer the case: the free tier introduced a limit on CI build time, rendering it effectively unusable for DBI.

GitHub Actions is a CI/CD platform tightly integrated with GitHub. It is somewhat simpler to set up, also for creating workflows that e.g. open a pull request. It is sufficient to add a YAML configuration file to a dedicated location in the repository. Each build automatically obtains a token that can be used to interact with the GitHub API. R support is provided by dedicated workflows and actions contributed by RStudio. Check status is conveniently reported in detail with each pull request, and the checks run considerably faster due to higher concurrency.

Continuous integration for all packages in the project has moved to GitHub Actions. Cross-platform checks for all backends on the major operating systems were a bit challenging, because the tests require a live database. Thanks to Andrew Kane for providing GitHub actions that install database engines on all platforms, this greatly simplified the move.

Three more parts of the infrastructure were updated as part of the move:

  1. The odbc and duckdb packages are now also checked when the DBItest package updates. This ensures that new or amended specifications do not break these packages. If you maintain a DBI backend that uses DBItest, get in touch for integrating your backend with these checks.

  2. The list of DBI backends is now continuously updated. Updates to backends are applied automatically. Every time a new backend is found, a pull request is opened.

  3. A new pull request is opened in RSQLite when a new version of the SQLite library is available. This makes it much easier to keep the bundled SQLite version up to date.

Future work

The last blog post already identified major milestones:

  • query cancellation

  • testing on remote databases

A triage of the contributed issues has identified the following additional major topics:

  • immediate argument to dbSendQuery() and dbSendStatement() for RMariaDB and RPostgres

  • performance of table import

  • reconnect if a database connection is lost

Other minor issues include:

  • SSL connections

  • authentication plugins

  • support for more data types: arrays, JSON, …

I’m planning to resolve most of the remaining issues in a final sprint. Some of these issues can be outsourced to other packages, according to the scope outlined in the previous sections, priority should be given to issues that must be resolved in the core packages. Future work might shift towards providing or improving useful extensions.

Acknowledgments

I’d like to thank James Wondrasek for creating the DBI tutorials and for a review of this blog post, Angelica Becerra for reviewing the material, and the numerous contributors to the packages in the “Maintaining DBI” project (DBI¹, RSQLite², RPostgres³, RMariaDB⁴, and DBItest⁵):

@abalter³, @alanpaulkwan², @AllenSuttonValocity³, @altay-oz³, @anderic1², @andybeet¹, @arencambre⁴, @artemklevtsov³, @bastianilso⁴, @bczernecki¹, @Byggvir⁴, @Chrisjb¹, @clementbfeyt⁴, @colearendt¹, @daattali¹, @datawookie¹, @dpprdan³⁵, @elfatherbrown⁴, @EntwicklR², @ericemc3⁴, @formix³, @fproske¹, @georgevbsantiago¹, @GitHunter0¹, @hadley²³, @hmeleiro¹, @hpages², @imlijunda³, @inferiorhumanorgans³, @jarauh⁴, @jawond¹, @jeroen³, @jimhester¹, @jjesusfilho³, @jsilve24², @kforner⁴, @kmishra9², @Kodiologist¹², @LaugeGregers³, @luispuerto², @martinstuder⁵, @matteodelucchi⁴, @MaximumV¹, @mbannert³, @mbedward³, @mgirlich², @mlamias¹, @mllg¹, @mmuurr³, @momeara³, @MonteShaffer⁴, @Mosk915⁴, @nfultz², @norquanttech³, @OMalytics³, @oriolcmp⁴, @Osc2wall⁴, @psychobas², @randyzwitch⁵, @rcfree², @rnorberg¹, @rodriguesk², @rossholmberg⁴, @Sahil308⁴, @samuel-cs4⁴, @schuemie², @shutinet², @splaisan², @Trowic⁴, @verajosemanuel⁴, @VictorYammouni¹, @vigyoyo⁴, @vikram-rawat³, @vspinu³, @warnes³, @wiligl², @ycphs⁴, and @zyxdef¹.