It is hardly surprising that most summarised data is published in report form, usually as a Portable Document Format (PDF) file. The challenge comes when you want that data in a dynamic form, where it can be analysed, reformatted and reshaped to your liking; a requirement that is hard, if not impossible, to satisfy with data locked inside a PDF report. Trying to do so can feel like squeezing water from a rock, an exercise in futility.
The good news is that technology seems to have a mind of its own: while one end of the spectrum impedes, the other liberates. One solution for freeing static, PDF-bound data is the powerful and versatile R package `tabulizer`. The `tabulizer` package is an R wrapper for Tabula, a Java library for extracting tables from PDF documents. As long as the data is laid out in a clean, uncluttered format, the `extract_tables()` function will try to guess the delimiters and extract the data in a form that stays close to the original layout.
Installation
Both installation and use of the package depend on Java. The appropriate Java Development Kit can be downloaded straight from the Oracle website. The installation instructions are platform specific, so follow the ones for your OS. I am on Windows, so I installed Java by running the jdk-8u144-windows-x64.exe executable.
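Once Java is installed, it is worth confirming that it is visible from R before going any further. A minimal check (the version string printed will of course differ by machine):

```r
# Check that Java is installed and on the PATH; prints the installed version
system("java -version")
```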
The tabulizer package itself is installed from GitHub. Only the development version exists, so you will not find it on CRAN.
if(!require("ghit")){
install.packages("ghit")
}# on 64-bit Windows
::install_github(c("leeper/tabulizerjars",
ghit"leeper/tabulizer"),
INSTALL_opts = "--no-multiarch"
)# elsewhere
::install_github(c("leeper/tabulizerjars",
ghit"leeper/tabulizer"))
This will download and install the other Java-related packages that `tabulizer` depends on.
Demo
For demonstration purposes, I will use the report from the Central Statistical Office (CSO), Zambia, on the Zambia Census Projection 2011–2035. Below is the outline of the sample data as presented in the PDF report.
We load the tabulizer package with the following command.

```r
library("tabulizer")
```
The main function is `extract_tables()`. Its first argument is the PDF file or report containing the targeted table(s). The second argument, `pages`, specifies the page number(s) on which the data tables sit. There are further arguments, such as `area`, with which you can specify the region(s) of each page to extract, and `columns`, a list whose length matches the number of pages, giving the column positions to use on each page. The `guess` argument, `TRUE` by default, lets the function guess the location of the table(s) on each page. For a list of all the arguments, run `?extract_tables` in the R console.
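As a sketch of how `area` and `guess` fit together (the file name, page and coordinates below are hypothetical, not taken from the CSO report):

```r
# Hypothetical example: extract only the region bounded by
# (top, left, bottom, right) coordinates on page 3, with guessing disabled
tbl <- extract_tables("some_report.pdf",
                      pages = 3,
                      area = list(c(120, 60, 500, 560)),
                      guess = FALSE)
```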
By default, the data is extracted as a list. A list in R can be thought of as a vector containing other objects. We can zoom in on a particular object using double square brackets, `[[ ]]`: the first object in the list is indexed by the number 1, the second by 2, and so on. Since only one table is being extracted here, the list contains a single element, which we pull out with `cso_table[[1]]`.
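To make the indexing concrete, here is a tiny list with made-up contents, nothing to do with the census data:

```r
# A list can hold objects of different types; [[ ]] extracts one element
results <- list(matrix(1:4, nrow = 2), data.frame(x = 1:3))
results[[1]]  # the first object: the 2 x 2 matrix
results[[2]]  # the second object: the data frame
```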
Out of the box, `extract_tables()` extracts the data as a list of character matrices. This helps in cases where the data is irregular and cannot be properly coerced to a data frame (row-by-column format). To change this behaviour so that the extracted data is coerced to a data frame, we supply the `method` argument with `data.frame` as its value.
<- ("https://goo.gl/d2xMwS")
cso # This is the shortened version of the original URL.
<- extract_tables(cso, pages = 24,
cso_table method = "data.frame")
# We are going to pass the cso variable to the extract_tables() function
<- cso_table[[1]]
cso_column
# The table of interest is on page 24, the other arguments are left as defaults
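Before reshaping anything, it is worth a quick look at what came back (the exact column names depend on how Tabula parsed the page):

```r
# Inspect the structure and the first few rows of the extracted data frame
str(cso_column)
head(cso_column)
```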
From the extracted results, it can be seen that the output is not in a “tidy” format that would support any meaningful analysis. The next step is to reshape and reorder the extracted results into a neat data frame.
Tidying the data
Two approaches can be implemented here: the easy way or the hard way.
- Firstly, the easy way. We can write the data to a CSV file and clean it up in Microsoft Excel. The tool for the job is the `write.csv()` function. Its first argument is the data object. With the `file` argument you define the output file name together with the file extension; in our case a `.CSV` extension. `row.names` specifies whether to include the default index R attaches to the data, which runs the length of your data.
```r
# I have passed a relative path where I want the CSV file to be stored
write.csv(cso_column, file = "cso_data.csv",
          row.names = FALSE)
```
After cleaning the data in Excel, it can be re-imported to continue the analysis.
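A minimal re-import sketch; the cleaned file name here is an assumption, not something from the original workflow:

```r
# Read the cleaned file back into R (file name assumed for illustration)
cso_clean <- read.csv("cso_data_clean.csv", stringsAsFactors = FALSE)
```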
- Second choice, the hard way. The `tidyverse` is a suite of packages built specifically to handle such tasks. One of them, `dplyr`, implements a grammar of data manipulation: with a handful of well-crafted verbs one can transform, order, filter and otherwise reshape data with ease, as the small example after this list shows.
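To give a flavour of those verbs on a built-in dataset (`mtcars`, nothing to do with the census data):

```r
library(dplyr)

# Keep the six-cylinder cars, retain two columns, sort by fuel economy
mtcars %>%
  filter(cyl == 6) %>%
  select(mpg, hp) %>%
  arrange(desc(mpg))
```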
Welcome to the tidyverse
The first step is to clean the data, eliminating unwanted variables and title headers, in addition to transforming the data into a “tidy” format: a variable per column, an observation per row, and a value per cell. The commands below eliminate the unwanted rows (the first, second and last three rows) of the extracted data. The `tidyr` package is then used to gather the observations sitting in the year columns into rows; its `gather()` function does exactly this.
After gathering the data from columns into rows, the second task is to index each number by its corresponding province. This is achieved by replicating the province names to span the length of the numbers. Binding the provinces to their corresponding numbers completes our simple data extraction exercise.
```r
# Start from the extracted data frame (cso_column) and convert it to a tibble
cso_data <- cso_column %>%
  as_tibble()

# Keep the provincial totals and gather the year columns into rows
cso_provincial <- cso_data %>%
  filter(sex == "Total") %>%
  select(`2011`:`2035`) %>%
  gather(key = "year", value = "census_proj")

# Replicate the ten province names once for each of the six census years
province <- rep(
  c(
    "central",
    "copperbelt",
    "eastern",
    "luapula",
    "lusaka",
    "muchinga",
    "northern",
    "north.western",
    "southern",
    "western"
  ),
  6
)

# Bind the provinces to their projections and order the columns
cso_transformed <- cbind(cso_provincial, province) %>%
  select(year, province, census_proj) %>%
  as_tibble()

cso_transformed
```
```
# A tibble: 60 × 3
   year  province      census_proj
   <chr> <chr>         <chr>
 1 2011  central       1,355,775
 2 2011  copperbelt    2,143,413
 3 2011  eastern       1,628,880
 4 2011  luapula       1,015,629
 5 2011  lusaka        2,362,967
 6 2011  muchinga      749,449
 7 2011  northern      1,146,392
 8 2011  north.western 746,982
 9 2011  southern      1,642,757
10 2011  western       926,478
# … with 50 more rows
```
For the full data table view, see the table below.
```r
knitr::kable(cso_transformed, booktabs = TRUE,
             caption = "Census data per Province")
```
year | province | census_proj
---|---|---
2011 | central | 1,355,775
2011 | copperbelt | 2,143,413
2011 | eastern | 1,628,880
2011 | luapula | 1,015,629
2011 | lusaka | 2,362,967
2011 | muchinga | 749,449
2011 | northern | 1,146,392
2011 | north.western | 746,982
2011 | southern | 1,642,757
2011 | western | 926,478
2015 | central | 1515086
2015 | copperbelt | 2362207
2015 | eastern | 1813445
2015 | luapula | 1127453
2015 | lusaka | 2777439
2015 | muchinga | 895058
2015 | northern | 1304435
2015 | north.western | 833818
2015 | southern | 1853464
2015 | western | 991500
2020 | central | 1734601
2020 | copperbelt | 2669635
2020 | eastern | 2065590
2020 | luapula | 1276608
2020 | lusaka | 3360183
2020 | muchinga | 1095535
2020 | northern | 1520004
2020 | north.western | 950789
2020 | southern | 2135794
2020 | western | 1076683
2025 | central | 1979202
2025 | copperbelt | 3016344
2025 | eastern | 2344980
2025 | luapula | 1439877
2025 | lusaka | 4004276
2025 | muchinga | 1326222
2025 | northern | 1763638
2025 | north.western | 1080072
2025 | southern | 2445929
2025 | western | 1173598
2030 | central | 2254435
2030 | copperbelt | 3402007
2030 | eastern | 2655422
2030 | luapula | 1623991
2030 | lusaka | 4704135
2030 | muchinga | 1587414
2030 | northern | 2040926
2030 | north.western | 1227481
2030 | southern | 2793523
2030 | western | 1286880
2035 | central | 2565450
2035 | copperbelt | 3823642
2035 | eastern | 3001152
2035 | luapula | 1834667
2035 | lusaka | 5465775
2035 | muchinga | 1879642
2035 | northern | 2355007
2035 | north.western | 1397137
2035 | southern | 3184855
2035 | western | 1416331
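As one last touch before analysing the projections: `census_proj` was extracted as text (note the commas in some values). A minimal sketch of coercing it to numbers, assuming the `cso_transformed` tibble above:

```r
# census_proj came through as text; strip any commas and coerce to numeric
cso_numeric <- cso_transformed %>%
  mutate(census_proj = as.numeric(gsub(",", "", census_proj)))
```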
We can finally take a breather, and enjoy!
Citation
```bibtex
@online{simumba2017,
  author = {Aaron Simumba},
  title = {Breathing Life Back into {PDF} Presented {Data}},
  date = {2017-10-12},
  url = {https://asimumba.rbind.io//blog/pdf-data},
  langid = {en}
}
```