Building data science tools


This blog post is the third in a series introducing some of the work we have been doing in the Welsh Government’s Data Science Unit. A lot of the work we do as data scientists involves analysing data, but we also like to develop tools that help other people work with data. Here are a couple of examples of things we’ve been developing.

Developing software packages to make analysis easier

Open-source programming languages like R and Python are great tools for working with data. Most of the work we do involves using packages, which are pieces of software built by programmers to do specific tasks. R and Python packages let us do lots of incredible things, like cleaning and structuring data, making beautiful plots, or building machine learning models.

Back in January 2020, I developed an R package for downloading data from StatsWales, the Welsh Government’s statistical data repository. I gave it the very creative title statswalesr and published it on GitHub (an online platform for developing and sharing code). Plenty of software packages sit happily on GitHub, but R and Python have their own “official” networks for software packages. Getting a package published on an official network shows that it meets a certain standard, and it’s a good way to publicise it too. One of the main R networks is the Comprehensive R Archive Network (also known as CRAN). In October, with some encouragement from others in the data science team, I made some improvements to statswalesr, submitted it to CRAN, and after a few tweaks it was accepted!

Image: Some R packages we’ve been using in our development work

Building packages is quite different to the typical analysis work you do as a data scientist. For example, packages have to be built to work on multiple operating systems and have to play nicely with other packages the user has on their computer. You also have to test your code thoroughly, and try to imagine all the scenarios in which your package could fail. This is important because you want your package to give helpful error messages when something goes wrong.
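
To illustrate that last point, here is a minimal sketch of a function that checks its input and fails with a helpful message, along with the kind of test a package author might write for it. It is written in Python rather than R. The function name validate_dataset_id, the rules it enforces and the example id are all hypothetical; none of this is taken from statswalesr.

```python
def validate_dataset_id(dataset_id):
    """Return a cleaned dataset id, or raise an error that explains the problem.

    Hypothetical helper for illustration only; the rules are assumptions.
    """
    if not isinstance(dataset_id, str):
        # Tell the user what was expected and what they actually passed in.
        raise TypeError(
            f"dataset_id must be a string, not {type(dataset_id).__name__}. "
            "Did you forget the quotes?"
        )
    cleaned = dataset_id.strip()
    if not cleaned:
        raise ValueError("dataset_id is empty. Please supply a dataset id.")
    return cleaned


def test_validate_dataset_id():
    """Try to imagine the ways a user could get this wrong, and check each one."""
    # A made-up id with stray whitespace should be tidied up.
    assert validate_dataset_id("  abcd1234 ") == "abcd1234"

    # A number instead of a string should fail with a clear message.
    try:
        validate_dataset_id(1234)
    except TypeError as error:
        assert "must be a string" in str(error)
    else:
        raise AssertionError("expected a TypeError")

    # An empty string should fail too.
    try:
        validate_dataset_id("   ")
    except ValueError as error:
        assert "empty" in str(error)
    else:
        raise AssertionError("expected a ValueError")


test_validate_dataset_id()
```

The point is that the user sees a message about their own input rather than a confusing failure from deep inside the package.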

Why do we want to develop packages like this? Put simply, they speed up the time-consuming parts of working with data. The statswalesr package reduces the time you spend downloading the data you need, so you can spend more time analysing it.

I’m excited about the potential we have in the unit to build open-source packages. We’re already planning more packages for the future – the first on the list is a Python equivalent of statswalesr!

Saving time with interactive documents

In 2021 we expect data science tools to become more widely available to, and more widely used by, Welsh Government analysts. We’ve been looking at how teams can make their workflows easier using reactive documents in R. A reactive document does exactly what you’d expect: it reacts to the changes that the user makes to it. This could be something like uploading a file or clicking a button. The R package shiny lets you easily build reactive elements with some R code, which you can embed in a document built with the rmarkdown package. Your user then just opens the file, clicks “run”, and the document appears in their local browser (as long as they have R installed on their computer).
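
To make the idea of “reacting” concrete, here is a small sketch in plain Python. It is not how shiny is implemented and it uses none of shiny’s functions. The class ReactiveValue and everything else in it are hypothetical names chosen for illustration. It only shows the general pattern: when an input changes, everything that depends on it is re-run.

```python
class ReactiveValue:
    """A value that notifies its listeners whenever it changes."""

    def __init__(self, value):
        self._value = value
        self._listeners = []

    def subscribe(self, listener):
        """Register a function to call with the value, and call it once now."""
        self._listeners.append(listener)
        listener(self._value)

    def set(self, value):
        """Update the value and re-run every listener with the new value."""
        self._value = value
        for listener in self._listeners:
            listener(value)


rendered = []  # stands in for the part of the document the user sees

# The "input": imagine a dropdown menu with a year selected.
selected_year = ReactiveValue(2019)

# The "output": re-rendered every time the input changes.
selected_year.subscribe(lambda year: rendered.append(f"Showing data for {year}"))

# Simulate the user picking a different option from the dropdown.
selected_year.set(2020)

print(rendered)
```

In a real reactive document, the listener would redraw a chart or table instead of appending text to a list.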

When might a team want an interactive document? Analytical teams could replace time-consuming repetitive tasks, like writing database queries or making Excel workbooks, with things like checkboxes and dropdown menus. This could be helpful for dealing with repeated analytical requests. The other benefit of interactive R documents over Excel is that you can easily build extras on top of the analysis you’re doing, like automated charts or data quality checks.
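
As a rough sketch of replacing a hand-written database query with menu choices, the Python example below turns a dropdown selection and a set of ticked checkboxes into a parameterised query. The table area_stats, its columns and all of the figures are invented for illustration, and an in-memory SQLite database stands in for whatever database a team actually uses.

```python
import sqlite3

# Hypothetical columns a user could pick from a dropdown menu.
ALLOWED_MEASURES = {"population", "households"}


def build_query(measure, areas):
    """Turn the user's menu choices into a SQL query and its parameters."""
    # Column names cannot be passed as query parameters, so check the
    # choice against a fixed list before putting it into the SQL text.
    if measure not in ALLOWED_MEASURES:
        raise ValueError(f"Unknown measure: {measure!r}")
    if not areas:
        raise ValueError("Please tick at least one area.")
    placeholders = ", ".join("?" for _ in areas)
    sql = (
        f"SELECT area, {measure} FROM area_stats "
        f"WHERE area IN ({placeholders}) ORDER BY area"
    )
    return sql, list(areas)


# Set up a tiny in-memory database with made-up figures.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE area_stats (area TEXT, population INTEGER, households INTEGER)"
)
conn.executemany(
    "INSERT INTO area_stats VALUES (?, ?, ?)",
    [("Area A", 100, 40), ("Area B", 250, 90), ("Area C", 175, 60)],
)

# Simulate a user choosing "population" and ticking two areas.
sql, params = build_query("population", ["Area C", "Area A"])
rows = conn.execute(sql, params).fetchall()
print(rows)
```

Once the query is built from the user’s choices, the same request can be answered again and again without anyone rewriting the SQL by hand.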

We’ve already shown that automated documents built in R with rmarkdown can save us a lot of time, and adding interactivity could be a great way to make us even more efficient at what we do.

If you want to get in touch with us please email

Jamie Ralph, Data Scientist