Monthly Archives: June 2014

Web scraping with PowerShell (Getting a package trace from a postal service)

Building an advanced function that can consume information from the web is pretty powerful, and I use it for all kinds of things.

In this post I will try to guide you through the process of building one for more or less any service, but the example will be the Swedish postal service.

I usually start with a web browser that has some developer features, for example Google Chrome. Go to the website and press Ctrl+Shift+I, select the “Network” tab and enter whatever information you need to send to the service, in this case the ID of the package I want to trace.

In this example it should look like this (I have chosen to use the English version of the website):
[Screenshot: ChromeCtrlAltI]

Press the submit button and look at the beginning of the network trace. You will usually find a GET or POST request there; in this case it is a GET request.
In this example it looks like this:
[Screenshot: ChromePackageTrace]

You can right click that row and select “Copy link address”, which in this case is “http://www.posten.se/en/Pages/Track-and-trace.aspx?search=MyPackageID”.

Now open whatever PowerShell scripting environment you prefer, for example the PowerShell ISE. Start by sending the same request from PowerShell, which can be done with Invoke-WebRequest (if you are using PowerShell v3 or higher). Begin by putting a variable where “MyPackageId” is.

For example:

$Id = "MyPackageId"
$PackageTrace = Invoke-WebRequest -Uri "http://www.posten.se/en/Pages/Track-and-trace.aspx?search=$Id" -UseBasicParsing

The “UseBasicParsing” switch is not mandatory here, but if you don’t need the returned HTML parsed into different objects, it is a bit quicker.
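
If an ID can contain spaces or other special characters, it may be worth escaping it before it goes into the query string. A minimal sketch, assuming the same URL as above; the sample ID is made up for illustration:

```powershell
# Hypothetical ID containing characters that are not safe in a query string
$Id = "AB 123&456"

# [uri]::EscapeDataString is a .NET method that percent-encodes the value
$EscapedId = [uri]::EscapeDataString($Id)
$Uri = "http://www.posten.se/en/Pages/Track-and-trace.aspx?search=$EscapedId"

# Show the resulting address; it can be passed to Invoke-WebRequest -Uri
$Uri
```

For plain alphanumeric IDs like the one in this post, the escaped value is identical to the original, so the step is harmless to leave in.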

We now need to parse the HTML code stored in the “Content” property to get what we want. This can be a bit time-consuming, but with a little help from Chrome it gets easier.

Press the magnifier button and hover the mouse over parts of the site or parts of the HTML code (if you select the “Elements” tab) and you will soon find which part of the HTML code you need.

In this example it is the table tag:
[Screenshot: FindWhatYouNeed]

Now we need to do some string manipulation to get the parts we need properly formatted. In this case we want to split the HTML to get the part between the start of the table and the end of it. What is left is the rows with all the package events. Next, find something that splits them up into nice pieces, in this case the “tr class=” tag. The first rows returned contain some table information (including a unique ID that might change) and the table columns, so we want to skip those. A one-liner that does all of this looks like:

$TraceItems = ((($PackageTrace.Content -split "<table class=`"PWP-moduleTable nttEventsTable`"")[1] -split "</TABLE>")[0]) -split "<tr class=" | Select-Object -Skip 2
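
To see what each split contributes, the same one-liner can be tried on a small piece of HTML without touching the network. The sample markup below is invented for illustration and only mimics the structure described above; the real page will differ in detail:

```powershell
# Made-up sample that mimics the structure of the tracking page
$SampleContent = @'
<html><body>
<table class="PWP-moduleTable nttEventsTable" id="x1">
<tr class="header"><th>Date</th><th>Location</th><th>Event</th></tr>
<tr class="odd"><td>2014-06-01 10:00</td><td>Stockholm</td><td>Item received</td></tr>
<tr class="even"><td>2014-06-02 08:30</td><td>Malmo</td><td>Item delivered</td></tr>
</table>
</body></html>
'@

# 1. Keep everything after the start of the events table
# 2. Keep everything before the end of the table (-split is case-insensitive)
# 3. Split into rows, then skip the table attributes and the header row
$TraceItems = ((($SampleContent -split "<table class=`"PWP-moduleTable nttEventsTable`"")[1] -split "</TABLE>")[0]) -split "<tr class=" | Select-Object -Skip 2

# Two event rows remain in this sample
$TraceItems.Count
```

Note that -split treats its pattern as a regular expression, so a marker containing characters such as “.” or “(” would need escaping, for example with [regex]::Escape.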

We can now loop through these items, parse them and build an object out of them. Each one of these items has three columns: a date, a location and a comment/tracking event. The columns are enclosed in “td” tags, so we can split them up at those.

When we have all the values we need, we create the object and send it to the pipeline. It could look something like this:

foreach ($TraceItem in $TraceItems) {

    # Split the row once; element 0 is the text before the first <td>
    $Cells = $TraceItem -split "<td>"

    # Build the object and send it to the pipeline.
    # Each value is the text before the closing </td> of its cell.
    [PSCustomObject]@{
        EventDate = ($Cells[1] -split "</td>")[0]
        Location  = ($Cells[2] -split "</td>")[0]
        Comment   = ($Cells[3] -split "</td>")[0]
        Id        = $Id
    }
}
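
The split approach returns whatever sits between the tags, so depending on the markup a value may still carry surrounding whitespace or nested tags. That is an assumption about the page rather than something observed in this example; if it happens, a small cleanup step along these lines might help (the sample cell is made up):

```powershell
# Hypothetical cell value with a nested tag and stray whitespace
$RawCell = "  <span>Item delivered</span>`r`n "

# Remove anything that looks like a tag, then trim the whitespace
$CleanCell = ($RawCell -replace '<[^>]+>', '').Trim()

$CleanCell
```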

We have now “objectified” a website and made it useful in PowerShell! Having come this far, it’s a good idea to wrap the code in an advanced function to make it really useful.

There are many good posts explaining how that is done, for example this one by Don Jones, so please refer to that if you need some help getting started.

I have made a quick example of an advanced function out of the code written in this post, which is available here.
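
As a rough idea of the shape such a function can take, here is a minimal sketch. It is not the function linked above: the name ConvertFrom-TraceHtml and its parameters are made up for illustration, and it takes the HTML as input instead of calling the website, so it can be tried offline with invented sample markup.

```powershell
function ConvertFrom-TraceHtml {
    [CmdletBinding()]
    param (
        # Raw HTML of the tracking page, e.g. the Content property
        # of an Invoke-WebRequest result
        [Parameter(Mandatory = $true, ValueFromPipeline = $true)]
        [string] $Content,

        # Package ID to attach to every returned event
        [Parameter(Mandatory = $true)]
        [string] $Id
    )

    process {
        $TableParts = $Content -split "<table class=`"PWP-moduleTable nttEventsTable`""
        if ($TableParts.Count -lt 2) {
            # No events table in the HTML, so there is nothing to return
            Write-Verbose "No events table found for '$Id'."
            return
        }

        # Same splitting as in the one-liner earlier in the post
        $Rows = ($TableParts[1] -split "</table>")[0] -split "<tr class=" |
            Select-Object -Skip 2

        foreach ($Row in $Rows) {
            $Cells = $Row -split "<td>"
            [PSCustomObject]@{
                EventDate = ($Cells[1] -split "</td>")[0]
                Location  = ($Cells[2] -split "</td>")[0]
                Comment   = ($Cells[3] -split "</td>")[0]
                Id        = $Id
            }
        }
    }
}

# Usage with made-up sample markup
$Sample = '<table class="PWP-moduleTable nttEventsTable" id="t"><tr class="h"><th>Date</th></tr><tr class="a"><td>2014-06-01</td><td>Stockholm</td><td>Item received</td></tr></table>'
$Events = $Sample | ConvertFrom-TraceHtml -Id "MyPackageId"
$Events
```

A function that takes only the ID could call Invoke-WebRequest first and pass the Content property to the same parsing code.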

This is how the function looks in PowerShell (“MyPackageId” actually seems to be a valid ID, which looks a bit odd, but the output in PowerShell matches the site):
[Screenshot: Get-PackageTrace-dump]

Good luck automating anything!

And if you want to learn more, check out my web scraping guide in this post!