Data Extraction from PDF/Word

I used Node.js scripts to extract research data from Word and PDF files and import it into Google Sheets. This made it easier for the Foundation for Middle East Peace (FMEP) to analyse the data and share it with journalists, researchers, and activists.
Screenshot of the Lawfare data hub website
The online data portal is a one-stop shop where journalists, researchers and activists can access all of FMEP's resources on lawfare.

For several years, FMEP has tracked efforts in the US to suppress opposition to America's support for Israeli policies. Much of the data on state and federal legislation, however, was distributed in tables that were buried inside PDF files. As a result, the information was difficult for journalists and researchers to analyse.

Screenshot of a data table inside a PDF file
The data was stored in cramped tables inside Word/PDF files.

I wrote Node.js scripts to parse and extract the data from these tables. Much of it could be extracted with basic text parsing and regular expressions. However, some important information was expressed as colour-coding within the table. To extract this, I converted the document into a Google Doc and used the googleapis library to walk through the table cells in the document structure. I then compiled the records into a CSV file that I could import into a new Google Sheet.
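A minimal sketch of that kind of table walk (not the original script) might look like the following. It assumes the document has already been fetched as JSON from the Google Docs API and reads each cell's text and background colour. The field names follow the Docs API v1 document schema; the function names and the sample document are made up for illustration.

```javascript
// Sketch: collect text and background colour for every table cell in a
// Google Docs API document resource (the JSON returned by documents.get).
// Function names and sampleDoc are hypothetical, for illustration only.

// Join the text runs inside a cell's paragraphs into one trimmed string.
function cellText(cell) {
  return (cell.content || [])
    .flatMap((block) => (block.paragraph ? block.paragraph.elements || [] : []))
    .map((el) => (el.textRun ? el.textRun.content : ''))
    .join('')
    .trim();
}

// Convert the API's 0..1 RGB floats into a hex string, or null if unset.
// A channel that is missing from rgbColor is treated as 0.
function cellColour(cell) {
  const rgb = cell.tableCellStyle?.backgroundColor?.color?.rgbColor;
  if (!rgb) return null;
  const hex = (v = 0) => Math.round(v * 255).toString(16).padStart(2, '0');
  return `#${hex(rgb.red)}${hex(rgb.green)}${hex(rgb.blue)}`;
}

// Walk every table in the document body and return one array per row.
function extractTableCells(doc) {
  const rows = [];
  for (const element of doc.body?.content || []) {
    if (!element.table) continue;
    for (const row of element.table.tableRows || []) {
      rows.push(
        (row.tableCells || []).map((cell) => ({
          text: cellText(cell),
          colour: cellColour(cell),
        }))
      );
    }
  }
  return rows;
}

// Made-up document with one table row: a red cell and an unstyled cell.
const sampleDoc = {
  body: {
    content: [
      {
        table: {
          tableRows: [
            {
              tableCells: [
                {
                  content: [{ paragraph: { elements: [{ textRun: { content: 'SB 81\n' } }] } }],
                  tableCellStyle: { backgroundColor: { color: { rgbColor: { red: 1 } } } },
                },
                {
                  content: [{ paragraph: { elements: [{ textRun: { content: 'Alabama\n' } }] } }],
                  tableCellStyle: {},
                },
              ],
            },
          ],
        },
      },
    ],
  },
};

const tableCells = extractTableCells(sampleDoc);
console.log(JSON.stringify(tableCells));
```

Once each cell carries a colour value, a lookup from colour to meaning (for example, a particular fill indicating a bill's status) can be applied as an ordinary mapping step. The actual colours and meanings depend on the source document and are not assumed here.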

A snippet of the CSV file that I generated from the source document. This file can be imported into Google Sheets or Excel.
"State","Bill Number","Companion Bill","Year","Bill Title","Initial Date (Introduced, Proposed or Filed)","Initial Date Description","Passed Into Law","Last Action","Includes Blacklist","Details","Link"
"Alabama","=HYPERLINK("""", ""SB 81"")","",2016,"Public contracts, governmental entities precluded from entering into contracts with entities that boycott certain persons or entities with whom this state enjoys open trade","2016/02/02","",true,"Passed Into law & signed by the governor 5/10/16",true,"On 7/29, the Alabama Comptroller issued guidance regarding implementation of SB 81, [here](
"Alabama","=HYPERLINK("""", ""HB 239"")","",,"","2016/02/16","",,"4/28/16 – indefinitely postponed",false,"__Official Description__: “This bill would prohibit a governmental entity from entering into certain contracts with business entities unless the contract includes a representation that the business entity is not currently engaged in, and an agreement that the business entity will not engage in, the boycott of a person or an entity based in or doing business with a jurisdiction with which this state can enjoy open trade.”
"Alabama","=HYPERLINK("""", ""SJR 6"")","",2016,"Boycott, Divestment, and Sanctions (BDS) Movement against Israel, denounced","2016/02/02","","UNCERTAIN","Adopted in Senate 2/2, adopted in House 2/9",false,"",""
"Alaska","=HYPERLINK("""", ""HB 2"")","",2023,"""An Act relating to contracts with public agencies; and relating to the State of Israel.""","2023/01/09","",false,"pre-file release 1/9/23",false,"__Official Description__: Short Title CONTRACTS: PROHIBIT ISRAEL DISCRIMINATION
"Alaska","=HYPERLINK("""", ""HB 239"")","",2022,"CONTRACTS: PROHIBIT ISRAEL DISCRIMINATION (Short title)","2022/01/07","",false,"pre-filed",false,"__Excerpt__: “A public agency shall include in a contract to acquire or dispose of services or supplies for the public agency a provision stating that the person with whom the public agency is contracting is not engaging in, and will not engage in during the contract, a business activity that…is a refusal to do business, a termination of business, or another action that is intended to limit business relations with the State of Israel, with a person doing business in or with the State of Israel, or with a person authorized, licensed, or organized to do business by the State of Israel l; and (3) is performed in (A) response to a request to withdraw from or not participate in business with the State of Israel; or (B) a manner that discriminates on the basis of nationality, national origin, or religion.” [includes exception for businesses with fewer than 10 employees and for contracts worth less than $100k)",""
"Alaska","=HYPERLINK("""", ""HB 394"")","",2022,"An Act relating to the investment of state money by public agencies; and relating to the divestment of certain investments by public agencies.","2022/02/22","",false,"Referred to State Affairs Committee 2/22/22",false,"__Description__: this appears to be a new kind of hybrid anti-boycott bill, seeking in one fell swoop to target for divestment any company that engaged in boycotts effecting Alaska’s fossil fuel industry, against Israel, Taiwan, and against pretty much anyone else the State supports [targeting: “an organization that, pursuant to a boycott or divestment campaign, including an environmental, social, or governance campaign, or similar project targeting this state, or targeting the continued existence of a foreign country or autonomous region, refuses to deal with the state, the State of Israel, Ukraine, or another foreign country, or Taiwan, or an entity that does business with these organizations or in these places…”",""
"Arizona","=HYPERLINK("""", ""SB 1250"")","",2022,"","2022/01/20","",true,"passed by Senate 2/15, passed by House 3/16, sent to governor 3/17/22",false,"__Purpose:__ To explicitly include state universities and community colleges (and their employee retirement funds) under Arizona’s existing Israel anti-boycott/divestment law.",""

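Rows in the style shown above could be produced by a small serialiser along these lines. This is a sketch under assumptions rather than the original code: the helper names are hypothetical, the URL is a placeholder, and the quoting rules (strings quoted with inner quotes doubled, numbers and booleans left bare, missing values left empty) are inferred from the sample.

```javascript
// Sketch of CSV serialisation in the style of the sample rows.
// Helper names and the example URL are hypothetical.

// Quote a single value: strings are wrapped in quotes with inner quotes
// doubled; numbers and booleans are written bare; null/undefined are empty.
function csvField(value) {
  if (value === null || value === undefined) return '';
  if (typeof value === 'number' || typeof value === 'boolean') return String(value);
  return `"${String(value).replace(/"/g, '""')}"`;
}

// Build a spreadsheet HYPERLINK formula so the bill number becomes a link.
function hyperlink(url, label) {
  const esc = (s) => String(s).replace(/"/g, '""');
  return `=HYPERLINK("${esc(url)}", "${esc(label)}")`;
}

// Serialise records (plain objects) using the given header order.
function toCsv(headers, records) {
  const lines = [headers.map(csvField).join(',')];
  for (const record of records) {
    lines.push(headers.map((h) => csvField(record[h])).join(','));
  }
  return lines.join('\n');
}

const headers = ['State', 'Bill Number', 'Year', 'Passed Into Law'];
const records = [
  {
    State: 'Alabama',
    'Bill Number': hyperlink('https://example.org/sb81', 'SB 81'),
    Year: 2016,
    'Passed Into Law': true,
  },
];

const csv = toCsv(headers, records);
console.log(csv);
```

Doubling the quotes twice (once inside the formula, once for the CSV field) is what produces the runs of `""` seen around the link labels in the sample.
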
Once I had extracted the information into a Google Sheet, I configured data validation rules to help FMEP audit and maintain the data. I created dropdown controls and colour-coded some of the columns, such as whether legislation had passed into law, to make it easier to analyse and update the information. Data was extracted from four separate documents: two on anti-boycott legislation (state, federal) and two on attempts to outlaw criticism of Israel as antisemitism (state, federal).
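Rules like these can be set by hand in the Sheets interface, and the source does not say which route was taken. For illustration only, here is a sketch of how equivalent rules could be expressed as request bodies for the Google Sheets API `spreadsheets.batchUpdate` method. The sheet ID of 0, the column index of 7 (the position of "Passed Into Law" in the CSV header), the option list, and the highlight colour are all assumptions.

```javascript
// Sketch: build Sheets API batchUpdate requests for a dropdown and a
// highlight rule. Function names, sheetId, and colours are hypothetical.

// Restrict a column (below the header row) to a fixed list of options.
function dropdownRequest(sheetId, columnIndex, options) {
  return {
    setDataValidation: {
      range: {
        sheetId,
        startRowIndex: 1, // skip the header row; no end index = to the bottom
        startColumnIndex: columnIndex,
        endColumnIndex: columnIndex + 1, // end indexes are exclusive
      },
      rule: {
        condition: {
          type: 'ONE_OF_LIST',
          values: options.map((option) => ({ userEnteredValue: option })),
        },
        strict: true, // reject values outside the list
        showCustomUi: true, // render the list as a dropdown
      },
    },
  };
}

// Colour cells in a column whose text exactly matches the given value.
function highlightRequest(sheetId, columnIndex, text, backgroundColor) {
  return {
    addConditionalFormatRule: {
      index: 0,
      rule: {
        ranges: [
          {
            sheetId,
            startRowIndex: 1,
            startColumnIndex: columnIndex,
            endColumnIndex: columnIndex + 1,
          },
        ],
        booleanRule: {
          condition: { type: 'TEXT_EQ', values: [{ userEnteredValue: text }] },
          format: { backgroundColor },
        },
      },
    },
  };
}

const requests = [
  dropdownRequest(0, 7, ['TRUE', 'FALSE', 'UNCERTAIN']),
  highlightRequest(0, 7, 'UNCERTAIN', { red: 1, green: 0.9, blue: 0.6 }),
];

console.log(JSON.stringify({ requests }, null, 2));
```

The resulting `{ requests }` object is the shape that `batchUpdate` expects as its request body. Flagging `UNCERTAIN` values is just one example of a highlight that could help with auditing.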

Screenshot of one of the Google Sheets
Four data tables were consolidated into two Google Sheets: one on anti-boycott legislation and another on attempts to outlaw criticism of Israel.

Finally, I built a small site to act as a data portal for all of their resources on lawfare. The site provides quick access to all of the new spreadsheets, as well as additional resources hosted elsewhere, such as their podcast episodes.

Screenshot of the resources in the lawfare data portal
The clean design helps users quickly find and access the resources they are looking for.

FMEP is a small team and I didn't want them to have to learn and maintain a new software service. To avoid this, the site is configured and built directly from a private Google Sheet rather than a self-hosted content management system. When they want to add a new resource, they add a new row to the spreadsheet and trigger a rebuild of the site.
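As a rough sketch of the sheet-driven approach, a build step might turn the spreadsheet rows into resource objects before rendering the pages. The example assumes the rows arrive as an array of arrays with the headers in the first row, which is the shape the Sheets API returns for a range of values. The column names (`Title`, `Type`, `URL`) and the sample rows are hypothetical, not the real sheet's layout.

```javascript
// Sketch: convert a grid of spreadsheet values (first row = headers) into
// resource objects for a site build. Column names here are hypothetical.

function rowsToResources(values) {
  const [headers = [], ...rows] = values;
  // Normalise header labels into lowercase object keys.
  const keys = headers.map((h) => String(h).trim().toLowerCase());
  return rows
    .map((row) =>
      Object.fromEntries(
        // Rows can be shorter than the header row when trailing cells are
        // empty, so missing cells fall back to an empty string.
        keys.map((key, i) => [key, String(row[i] ?? '').trim()])
      )
    )
    // Skip blank rows and rows that have no title yet.
    .filter((resource) => resource.title);
}

// Made-up rows standing in for the private configuration sheet.
const sheetValues = [
  ['Title', 'Type', 'URL'],
  ['State anti-boycott legislation', 'Spreadsheet', 'https://example.org/sheet'],
  ['Podcast episodes', 'Podcast'],
  [],
];

const resources = rowsToResources(sheetValues);
console.log(resources);
```

Because each row maps to one resource, adding a row to the sheet is all the editing needed before the next rebuild picks it up.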

This setup makes it easy for FMEP to manage the site. At the same time, the technical infrastructure ensures that the service almost never suffers from downtime or poor performance. All of the assets are hosted on fast and secure servers, either through their organisation's Google Workspace account or Netlify's global "serverless" infrastructure. Most importantly, their team won't need to rely on me or an expensive digital agency to keep their site up and running.