tidyods

An R package to extract cells from ODS files in a tidy format.

About this project

Project date: 2022-Present

Project status: In development

See also

{tidyods} (tidy-oh-dee-ess) is a package for the R software language that extracts information from OpenDocument Spreadsheet (ODS) files in a “tidy” format. In addition to their values, tidyods extracts location and metadata information about cells. The package is heavily inspired by the {tidyxl} and {unpivotr} packages by Duncan Garmonsway and his book, Spreadsheet Munging Strategies.

Data science projects heavily benefit from “tidy data” but often datasets are often messy (Wickham 2014¹). However, data stored and published in spreadsheets is often “messy”, such as (but not limited to):

the table not starting in the top-left most column;
multi-level and multi-layer column headers;
other header information included in rows above the main content;
groups of related rows only having content in the lead cell for the identifying column, often with merged cells;
multiple tables of information being contained on the same sheet;
data being contained in a non-tabular format, such as pro-forma.

Conventional packages for reading spreadsheets will read in the values from these sheets but it can then require complex wrangling by an analyst to turn this into a tidy dataset.

Spreadsheet applications also allow users to make comments or notes associated with particular cells or to use formulas to calculate cell values; spreadsheet may also use cell formatting (e.g. font weight or colour, or cell background colours) to signify important information. The conventional R packages for reading spreadsheets do not provide any means for accessing this information.

Like the {tidyxl} package, {tidyods} extracts a wide range of information about cells contained in a spreadsheet and provides this back in a tidy dataset where each cell is described in its own separate row.

# Example of tidyods output
tidyods::read_ods_cells(path = "example_file.ods", sheet = 1)
# A tibble: 28 × 28
   sheet    address   row   col cell_type is_empty value_type cell_content  
   <chr>    <chr>   <dbl> <dbl> <chr>     <lgl>    <chr>      <chr>         
 1 penguins A1          1     1 cell      FALSE    string     species       
 2 penguins B1          1     2 cell      FALSE    string     female        
 3 penguins C1          1     3 cell      FALSE    string     bill_length_mm
 4 penguins D1          1     4 cell      FALSE    string     body_mass_g   
 5 penguins A2          2     1 cell      FALSE    string     Adelie        
 6 penguins B2          2     2 cell      FALSE    boolean    FALSE         
 7 penguins C2          2     3 cell      FALSE    float      40.4          
 8 penguins D2          2     4 cell      FALSE    float      4043          
 9 penguins A3          3     1 cell      FALSE    string     Adelie        
10 penguins B3          3     2 cell      FALSE    boolean    TRUE          
# ℹ 18 more rows
# ℹ 20 more variables: base_value <chr>, numeric_value <dbl>,
#   currency_symbol <chr>, boolean_value <lgl>, date_value <chr>,
#   time_value <chr>, has_formula <lgl>, formula <chr>, has_error <lgl>,
#   error_type <dbl>, has_annotation <lgl>, annotation <chr>, is_merged <lgl>,
#   merge_colspan <dbl>, merge_rowspan <dbl>, merge_shape <chr>,
#   cell_style <chr>, row_style <chr>, col_style <chr>, …

Analysts are then able to easily filter and interrogate the output to extract the information they need, or to use a package like {unpivotr} to convert it into a table of “tidy data” that can be used for analysis.

Supporting the use of ODS files

The OpenDocument Format is increasingly being adopted by governments and civic society bodies as the standard format for document publication. In particular UK government departments have started to publish their Official Statistics publications as ODS files. While there is a degree of standardisation to these types of files they are not automatically “tidy data”, they often have column headers derived from a variable (such as categories or dates) and have multiple rows of header information. Therefore, there is an emerging need for a {tidyxl} equivalent for ODS files.

Moreover, prior to the existence of {tidyods} there was only one package in the R ecosystem for reading ODS files, {readODS}. The development of {tidyods} was motivated by the discovery of a bug in how {readODS} handled a particular oddity of Microsoft Excel’s encoding of spreadsheet information into the ODS format. There are seven packages on CRAN that independently extract information from Excel files into R, meaning that if I’d found this bug in one of those packages I could have tried something else. With {readODS} being the only package for reading ODS files there is not the same redundancy and resilience in the R ecosystem, creating a secondary rationale for the development of {tidyods}².

Using tidyods

For detailed advice on using {tidyods} please refer to the documentation website.

At present {tidyods} is not on CRAN, you can install it in R using the following code.

# install the remotes package if necessary
install.packages("remotes")
remotes::install_github("mattkerlogue/tidyods")

Wickham, H (2014) Tidy Data. Journal of Statistical Software 59(10):1–23 https://doi.org/10.18637/jss.v059.i10 ↩︎
These Excel reading packages tend to have differing underlying approaches to working with Excel files, being a wrapper for different C++ or Java libraries. However, both {tidyods} and {readODS} rely on the {xml2} package at present to interact with the ODS file’s underlying XML content. ↩︎