Liberating Milton Transit Data: Part 1
Part 1: Escape from PDF
It’s been almost a month since we reached out to Milton Transit (MT) with hopes of getting hold of their “raw” transit data. While we got a reply promising an eventual response, we can’t make progress on the Transit Project unless we have data to work with. So, we sprung into action to try and “liberate” at least some data sets, while we are hopeful to get more from MT directly.
The end-goal for the Transit Project is a mash-up app that would provide a more intuitive and interactive bus route map showing stops and integrated bus schedule. For data this means we need to extract the data from MT’s published maps & schedules to load it into a database (or structured format like XML, CSV, or perhaps Google Transit Feed GTFS), and add missing pieces (like the bus stop coordinates) by gathering the data manually if MT’s contacts don’t come through.
Step 1 to set the data free is getting the schedule data out of PDF into a more useful format. Developers would probably use tools to extract the data, and write scripts to transform and load the data into a database. I, unfortunately, am not one of those people, and tend to rely on things like Excel to manipulate and “shape” the data.
Depending on how the PDF is produced the data may be easier to “extract”, particularly if it wasn’t created from an image. MT’s new bus schedule data looked like mostly tables,
so I tried a few approaches & tools (listed on the wiki here) to get the data out:
- Basic Copy + Paste from PDF to Excel: each time table row transformed into one continuous string with each time separated from one-another by a space (or two). This may be an acceptable situation if no other method worked.
- Online PDF-to-??? / PDF-to-Excel tool like Zamzar: http://www.zamzar.com/ : I was impressed by the quality of conversion. It generated a pretty clean XLS that needed a few tweaks, like removing extra columns but otherwise was useful.
- Offline PDF to ??? / PDF-to-Word tool like SmartSoft Free PDFtoWord Converter (didn’t work, thought our PDF was password protected), or SolidDocuments Solid PDF Converter (which worked pretty well in “FLOWING” mode, table recognition “on”)
The key thing in converting from PDF to any format is consistency.
As long as you end up in a format that you can write a formula or script against to transform the data, you can then export the data to a structured format like XML or CSV or a database. In my case I wanted to get the data into a set of Excel tables that I could write a few formulas against, with consistent header / cell structure.
The end-result of Part 1 is this Excel doc (first 3 schedules took about 30 mins from start to finish), or you can see the HTML version of it. Parts 2 and 3 will deal with transforming the Excel table data into a more structured “true table” route, stop, schedule, etc. format, following the GTFS schema.
Pingback: Open Data in 5 Steps | Port 25