tidyods
An R package to extract cells from ODS files in a tidy format.
About this project
{tidyods}
(tidy-oh-dee-ess) is a
package for the R software language that extracts information from OpenDocument
Spreadsheet (ODS) files in a “tidy” format. In addition to their values,
tidyods extracts location and metadata information about cells. The package is
heavily inspired by the {tidyxl}
and
{unpivotr}
packages by Duncan
Garmonsway and his book,
Spreadsheet Munging Strategies.
Data science projects heavily benefit from “tidy data” but often datasets are often messy (Wickham 20141). However, data stored and published in spreadsheets is often “messy”, such as (but not limited to):
- the table not starting in the top-left most column;
- multi-level and multi-layer column headers;
- other header information included in rows above the main content;
- groups of related rows only having content in the lead cell for the identifying column, often with merged cells;
- multiple tables of information being contained on the same sheet;
- data being contained in a non-tabular format, such as pro-forma.
Conventional packages for reading spreadsheets will read in the values from these sheets but it can then require complex wrangling by an analyst to turn this into a tidy dataset.
Spreadsheet applications also allow users to make comments or notes associated with particular cells or to use formulas to calculate cell values; spreadsheet may also use cell formatting (e.g. font weight or colour, or cell background colours) to signify important information. The conventional R packages for reading spreadsheets do not provide any means for accessing this information.
Like the {tidyxl}
package, {tidyods}
extracts a wide range of information
about cells contained in a spreadsheet and provides this back in a tidy dataset
where each cell is described in its own separate row.
# Example of tidyods output
tidyods::read_ods_cells(path = "example_file.ods", sheet = 1)
# A tibble: 28 × 28
sheet address row col cell_type is_empty value_type cell_content
<chr> <chr> <dbl> <dbl> <chr> <lgl> <chr> <chr>
1 penguins A1 1 1 cell FALSE string species
2 penguins B1 1 2 cell FALSE string female
3 penguins C1 1 3 cell FALSE string bill_length_mm
4 penguins D1 1 4 cell FALSE string body_mass_g
5 penguins A2 2 1 cell FALSE string Adelie
6 penguins B2 2 2 cell FALSE boolean FALSE
7 penguins C2 2 3 cell FALSE float 40.4
8 penguins D2 2 4 cell FALSE float 4043
9 penguins A3 3 1 cell FALSE string Adelie
10 penguins B3 3 2 cell FALSE boolean TRUE
# ℹ 18 more rows
# ℹ 20 more variables: base_value <chr>, numeric_value <dbl>,
# currency_symbol <chr>, boolean_value <lgl>, date_value <chr>,
# time_value <chr>, has_formula <lgl>, formula <chr>, has_error <lgl>,
# error_type <dbl>, has_annotation <lgl>, annotation <chr>, is_merged <lgl>,
# merge_colspan <dbl>, merge_rowspan <dbl>, merge_shape <chr>,
# cell_style <chr>, row_style <chr>, col_style <chr>, …
Analysts are then able to easily filter and interrogate the output to extract
the information they need, or to use a package like {unpivotr}
to convert it
into a table of “tidy data” that can be used for analysis.
Supporting the use of ODS files
The OpenDocument Format is
increasingly being adopted
by governments and civic society bodies as the standard format for document
publication. In particular UK government departments have started to publish
their Official Statistics publications as ODS files. While there is a degree
of standardisation to these types of files they are not automatically
“tidy data”, they often have column headers derived from a variable
(such as categories or dates) and have multiple rows of header information.
Therefore, there is an emerging need for a {tidyxl}
equivalent for ODS files.
Moreover, prior to the existence of {tidyods}
there was only one package in
the R ecosystem for reading ODS files, {readODS}
.
The development of {tidyods}
was motivated by the discovery of a bug in how
{readODS}
handled a particular oddity of Microsoft Excel’s encoding of
spreadsheet information into the ODS format. There are seven packages on
CRAN that independently extract information from Excel files into R,
meaning that if I’d found this bug in one of those packages I could have tried
something else. With {readODS}
being the only package for reading ODS files
there is not the same redundancy and resilience in the R ecosystem, creating
a secondary rationale for the development of {tidyods}
2.
Using tidyods
For detailed advice on using {tidyods}
please refer to the
documentation website.
At present {tidyods}
is not on CRAN, you can install it in R using the
following code.
# install the remotes package if necessary
install.packages("remotes")
remotes::install_github("mattkerlogue/tidyods")
Wickham, H (2014) Tidy Data. Journal of Statistical Software 59(10):1–23 https://doi.org/10.18637/jss.v059.i10 ↩︎
These Excel reading packages tend to have differing underlying approaches to working with Excel files, being a wrapper for different C++ or Java libraries. However, both
{tidyods}
and{readODS}
rely on the{xml2}
package at present to interact with the ODS file’s underlying XML content. ↩︎