It is hardly surprising that most summarised data is published in report form, usually as a Portable Document Format (PDF) file. The challenge comes when you want that data in a dynamic form, where it can be analysed, reformatted and reshaped to your liking; a requirement that is hard, if not impossible, to satisfy with data locked inside a PDF report. Trying to do so can feel like squeezing water from a rock, an exercise in futility.
The good news is that technology seems to have a mind of its own: while one end of the spectrum impedes, the other liberates. One solution for freeing static, PDF-bound data is the powerful and versatile R package `tabulizer`. The `tabulizer` package is an R wrapper for Tabula, a Java library for extracting tables from PDF documents. As long as the data is laid out in a clean, uncluttered format, the `extract_tables()` function will try to guess the delimiters and extract the data in a form that stays close to the original layout.
Installation
Both installation and use of the package depend on Java. The appropriate Java Development Kit can be downloaded straight from the Oracle website. The installation instructions are platform specific, so follow the ones for your OS. I am on Windows, so I installed Java by running the jdk-8u144-windows-x64.exe executable.
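Once Java is installed, it is worth confirming that it is visible from R before going any further. A minimal check (the version string printed will of course differ by machine):

```r
# Check that Java is installed and on the PATH; prints the installed version
system("java -version")
```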
The tabulizer package itself is installed from GitHub. Only the development version exists, so you will not find it on CRAN.
if(!require("ghit")){
install.packages("ghit")
}# on 64-bit Windows
::install_github(c("leeper/tabulizerjars",
ghit"leeper/tabulizer"),
INSTALL_opts = "--no-multiarch"
)# elsewhere
::install_github(c("leeper/tabulizerjars",
ghit"leeper/tabulizer"))
This will download and install the other Java-related packages that `tabulizer` depends on.
Demo
For demonstration purposes, I will use the report from the Central Statistical Office (CSO), Zambia, on the Zambia Census Projection 2011–2035. Below is the outline of the sample data as presented in the PDF report.
We load the tabulizer package with the following command.

```r
library("tabulizer")
```
The main function is `extract_tables()`. Its first argument is the PDF file or report containing the targeted table(s). The second argument, `pages`, specifies the page number(s) on which the data tables sit. There are further arguments, such as `area`, with which you can specify the region(s) of each page to extract, and `columns`, a list whose length matches the number of pages, giving the column positions to use on each page. The `guess` argument, `TRUE` by default, lets the function guess the location of the table(s) on each page. For a list of all the arguments, run `?extract_tables` in the R console.
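As a sketch of how `area` and `guess` fit together (the file name, page and coordinates below are hypothetical, not taken from the CSO report):

```r
# Hypothetical example: extract only the region bounded by
# (top, left, bottom, right) coordinates on page 3, with guessing disabled
tbl <- extract_tables("some_report.pdf",
                      pages = 3,
                      area = list(c(120, 60, 500, 560)),
                      guess = FALSE)
```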
By default, the data is extracted as a list. A list in R can be thought of as a vector containing other objects. We can zoom in on a particular object using double square brackets, `[[ ]]`: the first object in the list is indexed by the number 1, the second by 2, and so on. Since only one table is being extracted here, the list contains a single element, which we pull out with `cso_table[[1]]`.
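To make the indexing concrete, here is a tiny list with made-up contents, nothing to do with the census data:

```r
# A list can hold objects of different types; [[ ]] extracts one element
results <- list(matrix(1:4, nrow = 2), data.frame(x = 1:3))
results[[1]]  # the first object: the 2 x 2 matrix
results[[2]]  # the second object: the data frame
```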
Out of the box, `extract_tables()` extracts the data as a list of character matrices. This helps in cases where the data is irregular and cannot be properly coerced to a data frame (row-by-column format). To change this behaviour so that the extracted data is coerced to a data frame, we supply the `method` argument with `data.frame` as its value.
<- ("https://goo.gl/d2xMwS")
cso # This is the shortened version of the original URL.
<- extract_tables(cso, pages = 24,
cso_table method = "data.frame")
# We are going to pass the cso variable to the extract_tables() function
<- cso_table[[1]]
cso_column
# The table of interest is on page 24, the other arguments are left as defaults
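Before reshaping anything, it is worth a quick look at what came back (the exact column names depend on how Tabula parsed the page):

```r
# Inspect the structure and the first few rows of the extracted data frame
str(cso_column)
head(cso_column)
```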
From the extracted results, it can be seen that the output is not in a “tidy” format that would support any meaningful analysis. The next step is to reshape and reorder the extracted results into a neat data frame.
Tidying the data
Two approaches can be implemented here: the easy way or the hard way.
- Firstly, the easy way. We can write the data to a CSV file and clean it up in Microsoft Excel. The tool for the job is the `write.csv()` function. Its first argument is the data object. With the `file` argument you define the output file name together with the file extension; in our case a `.CSV` extension. `row.names` specifies whether to include the default index R attaches to the data, which runs the length of your data.
```r
# I have passed a relative path where I want the CSV file to be stored
write.csv(cso_column, file = "cso_data.csv",
          row.names = FALSE)
```
After cleaning the data in Excel, it can be re-imported to continue the analysis.
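A minimal re-import sketch; the cleaned file name here is an assumption, not something from the original workflow:

```r
# Read the cleaned file back into R (file name assumed for illustration)
cso_clean <- read.csv("cso_data_clean.csv", stringsAsFactors = FALSE)
```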
- Second choice, the hard way. The `tidyverse` is a suite of packages built specifically to handle such tasks. One of them, `dplyr`, implements a grammar of data manipulation: with a handful of well-crafted verbs one can transform, order, filter and otherwise reshape data with ease, as the small example after this list shows.
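To give a flavour of those verbs on a built-in dataset (`mtcars`, nothing to do with the census data):

```r
library(dplyr)

# Keep the six-cylinder cars, retain two columns, sort by fuel economy
mtcars %>%
  filter(cyl == 6) %>%
  select(mpg, hp) %>%
  arrange(desc(mpg))
```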
Welcome to the tidyverse
The first step is to clean the data, eliminating unwanted variables and title headers, in addition to transforming the data into a “tidy” format: a variable per column, an observation per row, and a value per cell. The commands below eliminate the unwanted rows (the first, second and last three rows) of the extracted data. The `tidyr` package is then used to gather the observations sitting in the year columns into rows; its `gather()` function does exactly this.
After gathering the data from columns into rows, the second task is to index each number by its corresponding province. This is achieved by replicating the province names to span the length of the numbers. Binding the provinces to their corresponding numbers completes our simple data extraction exercise.
```r
# Start from the extracted data frame (cso_column) and convert it to a tibble
cso_data <- cso_column %>%
  as_tibble()

# Keep the provincial totals and gather the year columns into rows
cso_provincial <- cso_data %>%
  filter(sex == "Total") %>%
  select(`2011`:`2035`) %>%
  gather(key = "year", value = "census_proj")

# Replicate the ten province names once for each of the six census years
province <- rep(
  c(
    "central",
    "copperbelt",
    "eastern",
    "luapula",
    "lusaka",
    "muchinga",
    "northern",
    "north.western",
    "southern",
    "western"
  ),
  6
)

# Bind the provinces to their projections and order the columns
cso_transformed <- cbind(cso_provincial, province) %>%
  select(year, province, census_proj) %>%
  as_tibble()

cso_transformed
```
```
# A tibble: 60 × 3
   year  province      census_proj
   <chr> <chr>         <chr>
 1 2011  central       1,355,775
 2 2011  copperbelt    2,143,413
 3 2011  eastern       1,628,880
 4 2011  luapula       1,015,629
 5 2011  lusaka        2,362,967
 6 2011  muchinga      749,449
 7 2011  northern      1,146,392
 8 2011  north.western 746,982
 9 2011  southern      1,642,757
10 2011  western       926,478
# … with 50 more rows
```
For the full data table view, see the table below.
```r
knitr::kable(cso_transformed, booktabs = TRUE,
             caption = "Census data per Province")
```
year | province | census_proj
---|---|---
2011 | central | 1,355,775
2011 | copperbelt | 2,143,413
2011 | eastern | 1,628,880
2011 | luapula | 1,015,629
2011 | lusaka | 2,362,967
2011 | muchinga | 749,449
2011 | northern | 1,146,392
2011 | north.western | 746,982
2011 | southern | 1,642,757
2011 | western | 926,478
2015 | central | 1515086
2015 | copperbelt | 2362207
2015 | eastern | 1813445
2015 | luapula | 1127453
2015 | lusaka | 2777439
2015 | muchinga | 895058
2015 | northern | 1304435
2015 | north.western | 833818
2015 | southern | 1853464
2015 | western | 991500
2020 | central | 1734601
2020 | copperbelt | 2669635
2020 | eastern | 2065590
2020 | luapula | 1276608
2020 | lusaka | 3360183
2020 | muchinga | 1095535
2020 | northern | 1520004
2020 | north.western | 950789
2020 | southern | 2135794
2020 | western | 1076683
2025 | central | 1979202
2025 | copperbelt | 3016344
2025 | eastern | 2344980
2025 | luapula | 1439877
2025 | lusaka | 4004276
2025 | muchinga | 1326222
2025 | northern | 1763638
2025 | north.western | 1080072
2025 | southern | 2445929
2025 | western | 1173598
2030 | central | 2254435
2030 | copperbelt | 3402007
2030 | eastern | 2655422
2030 | luapula | 1623991
2030 | lusaka | 4704135
2030 | muchinga | 1587414
2030 | northern | 2040926
2030 | north.western | 1227481
2030 | southern | 2793523
2030 | western | 1286880
2035 | central | 2565450
2035 | copperbelt | 3823642
2035 | eastern | 3001152
2035 | luapula | 1834667
2035 | lusaka | 5465775
2035 | muchinga | 1879642
2035 | northern | 2355007
2035 | north.western | 1397137
2035 | southern | 3184855
2035 | western | 1416331
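As one last touch before analysing the projections: `census_proj` was extracted as text (note the commas in some values). A minimal sketch of coercing it to numbers, assuming the `cso_transformed` tibble above:

```r
# census_proj came through as text; strip any commas and coerce to numeric
cso_numeric <- cso_transformed %>%
  mutate(census_proj = as.numeric(gsub(",", "", census_proj)))
```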
We can finally take a breather, and enjoy!
Citation
```bibtex
@online{simumba2017,
  author = {Aaron Simumba},
  title = {Breathing Life Back into {PDF} Presented {Data}},
  date = {2017-10-12},
  url = {https://asimumba.rbind.io//blog/pdf-data},
  langid = {en}
}
```