Ordering pizza with PowerShell (web scraping guide) – Part 2

So, we have created our Connect-OnlinePizza function and now have access to parts of the site that are only available when logged in. But how?

Remember the Invoke-WebRequest-cmdlet in the last post?
We specified a session variable in the Global scope, and that variable contains cookies and data to keep our session with the site consistent over multiple webrequests, and that’s what we’ll use in our next function, Get-MyOnlinePizzaAccountInfo.

Get-MyOnlinePizzaAccountInfo
First of all, we need to find what page holds the information we want. In this case, the page containing the account information was located at http://onlinepizza.se/?view=andraKonto (it requires you to be logged in).

Make sure you ran the “Connect-OnlinePizza”-function first, that way the “$OnlinePizzaSession”-variable will be available and make it possible for us to reach this page and see the details of our account.

To fetch the page and load it into a variable you could do this (we save it to file because of the issue with the encoding name, see part 1 of this guide):

Invoke-WebRequest -Uri "http://onlinepizza.se/?view=andraKonto" -Method Get -WebSession $Global:OnlinePizzaSession -OutFile .\dump.htm
$AccountInfo = Get-Content .\dump.htm -Encoding UTF8

If this worked, we should start looking in the “dump.htm”-file for where the name is, use Select-String or just open the file in notepad and search for it.

When you’ve found the line, you need to figure out how to trim away all the parts of the line that you don’t want. In my case, it looks like this:

<input type=text name=namn id=namn class="input-medium" maxlength=100 value="Anders Wahlqvist"/>

I’m by no means an expert in string manipulation or regex, so there is probably a better way of doing this, but I usually use the Split-operator to get the part I want. In this case we need to split the string after value=” and before “/> (or remove it). We also need to fetch this particular line from the sites html code.

Take a look at this line:

$AccountHolderName = ((($AccountInfo | Select-String -Pattern "name=namn id=namn") -split "value=`"")[1] -split "`"/>")[0]

That might look like a complete mess, but we’ll break it down! We first need to fetch the correct line that contains the name which we can do with:

$AccountInfo | Select-String -Pattern "name=namn id=namn"

We want to do the “splitting” on the results of that, and therefore we need to put parentheses around that command before we add the “-split” operator. So let’s split that up and see what happens:

PS> ($AccountInfo | Select-String -Pattern "name=namn id=namn") -split "value=`""
input type=text name=namn id=namn class="input-medium" maxlength=100 
Anders Wahlqvist"/>

As you can see, we get two tokens back, and we need the second one. This can easily be done by putting everything in another pair of parentheses and then just specify which one we want. Since the first one will be identified as 0, and the one we want 1, we will end up with this:

PS> (($AccountInfo | Select-String -Pattern "name=namn id=namn") -split "value=`"")[1]
Anders Wahlqvist"/>

To get rid of that last part, we could either use the "replace"-operator or do another split. In this case, the "replace"-operator might be the better choice, but in my experience the split-operator will provide a more robust and consistent result. The site might change and add something else after "/> on the same line, or there might be some white space that you didn't see, so let's just do another split, wrap that up in a new set of parentheses and
and select token 0 (first one), which will get us our original line:

$AccountHolderName = ((($AccountInfo | Select-String -Pattern "name=namn id=namn") -split "value=`"")[1] -split "`"/>")[0]

Hopefully this line doesn't seem as messy anymore πŸ™‚

Now we repeat that for all the information we want, like this:

$Username = ((($AccountInfo | Select-String -Pattern "name=username id=username") -split "value=`"")[1] -split "`" />")[0]
$AccountHolderName = ((($AccountInfo | Select-String -Pattern "name=namn id=namn") -split "value=`"")[1] -split "`"/>")[0]
$AccountHolderMail = ((($AccountInfo | Select-String -Pattern "name=epost id=epost") -split "value=`"")[1] -split "`"/>")[0]
$AccountHolderStreet = ((($AccountInfo | Select-String -Pattern "name=adress1 id=adress1") -split "value=`"")[1] -split "`"/>")[0]
$AccountHolderPostalCode = ((($AccountInfo | Select-String -Pattern "name=postnummer id=postnummer") -split "value=`"")[1] -split "`"/>")[0]
$AccountHolderPhone = ((($AccountInfo | Select-String -Pattern "name=telefon id=telefon") -split "value=`"")[1] -split "`"/>")[0]

And finally, we create an object for it and send it to the pipeline:

$returnObject = New-Object System.Object
$returnObject | Add-Member -Type NoteProperty -Name Username -Value $Username
$returnObject | Add-Member -Type NoteProperty -Name Name -Value $AccountHolderName
$returnObject | Add-Member -Type NoteProperty -Name Email -Value $AccountHolderMail
$returnObject | Add-Member -Type NoteProperty -Name Address -Value $AccountHolderStreet
$returnObject | Add-Member -Type NoteProperty -Name PostalCode -Value $AccountHolderPostalCode
$returnObject | Add-Member -Type NoteProperty -Name Phone -Value $AccountHolderPhone

Write-Output $returnObject

So far so good, time to wrap this up in a function, we've already looked at that in the last post, so I'll just add the complete code here:

function Get-MyOnlinePizzaAccountInfo
{
    [cmdletbinding()]
    param()

    BEGIN {
        if ($OnlinePizzaSession -eq $null) {
            Write-Error "You must first connect using the Connect-OnlinePizza cmdlet"
            break
        }
    }

    PROCESS {

        Invoke-WebRequest -Uri "http://onlinepizza.se/?view=andraKonto" -Method Get -WebSession $Global:OnlinePizzaSession -OutFile .\dump.htm

        $AccountInfo = Get-Content .\dump.htm -Encoding UTF8

        Remove-Item .\dump.htm -Force -Confirm:$false -ErrorAction SilentlyContinue

        $Username = ((($AccountInfo | Select-String -Pattern "name=username id=username") -split "value=`"")[1] -split "`" />")[0]
        $AccountHolderName = ((($AccountInfo | Select-String -Pattern "name=namn id=namn") -split "value=`"")[1] -split "`"/>")[0]
        $AccountHolderMail = ((($AccountInfo | Select-String -Pattern "name=epost id=epost") -split "value=`"")[1] -split "`"/>")[0]
        $AccountHolderStreet = ((($AccountInfo | Select-String -Pattern "name=adress1 id=adress1") -split "value=`"")[1] -split "`"/>")[0]
        $AccountHolderPostalCode = ((($AccountInfo | Select-String -Pattern "name=postnummer id=postnummer") -split "value=`"")[1] -split "`"/>")[0]
        $AccountHolderPhone = ((($AccountInfo | Select-String -Pattern "name=telefon id=telefon") -split "value=`"")[1] -split "`"/>")[0]

        $returnObject = New-Object System.Object
        $returnObject | Add-Member -Type NoteProperty -Name Username -Value $Username
        $returnObject | Add-Member -Type NoteProperty -Name Name -Value $AccountHolderName
        $returnObject | Add-Member -Type NoteProperty -Name Email -Value $AccountHolderMail
        $returnObject | Add-Member -Type NoteProperty -Name Address -Value $AccountHolderStreet
        $returnObject | Add-Member -Type NoteProperty -Name PostalCode -Value $AccountHolderPostalCode
        $returnObject | Add-Member -Type NoteProperty -Name Phone -Value $AccountHolderPhone

        Write-Output $returnObject

    }

    END { }
}

Take a look at line 7 through 10, here we check if there is a variable called "$OnlinePizzaSession" available, if not, the user running this function probably didn't run the "Connect-OnlinePizza"-function, and this function won't work. Therefor, we write an error and exit the function. This is a pretty good method to ensure that the functions are used correctly.

So, finally time for our last function!

Get-PizzaRestaurant
Most parts of this function will be created more or less in the exact same way as the last one, so I'll just go through the differences.

First of all, we want these cmdlets to work together in a good way to give them that "module"-feeling πŸ™‚

One way of doing that is to add pipeline support, but how?

Well, this function will return a list of restaurants based on our location, and the location is based on our postal code (zip code). If you check our last function we actually return a property value called "PostalCode" which would be perfect for pipelining, and it's really easy to do!

All we need is "ValueFromPipelineByPropertyName=$true" when declaring the parameter, like this:

    param(
          [Parameter(Mandatory=$True,ValueFromPipelineByPropertyName=$true)]
          [int] $PostalCode)

And we need to verify that the property in object we output match the parameter name:
pipeline_pizza

Also, as you can see, we are declaring the parameter data type as an int, this way, no one will give as a postal code with spaces in it. If we want to, we could also validate that it really is a postal code, but again, this guide is not as much about writing advanced functions in general but has more to do with web scraping, so we'll just let it be.

Let's look at the rest of this function:

function Get-PizzaRestaurant
{
    [cmdletbinding()]
    param(
          [Parameter(Mandatory=$True,ValueFromPipelineByPropertyName=$true)]
          [int] $PostalCode)

    BEGIN {
        if ($OnlinePizzaSession -eq $null) {
            Write-Error "You must first connect using the Connect-OnlinePizza cmdlet"
            break
        }
    }

    PROCESS {

        Invoke-WebRequest -Uri "http://onlinepizza.se/postnummer/$PostalCode" -Method Get -WebSession $Global:OnlinePizzaSession -OutFile .\dump.htm

        $ResturantList = ((Get-Content .\dump.htm) -join "`n") -split "<UL>" | select -Skip 1

        Remove-Item .\dump.htm -Force -Confirm:$false -ErrorAction SilentlyContinue

        foreach ($Restaurant in $ResturantList) {

            $RestaurantName = (($Restaurant -split "<h4>")[1] -split "</h4>")[0]

            if ($RestaurantName -eq '') {
                Continue
            }

            $RestaurantStreet = (($Restaurant -split "<address>")[1] -split "</address>")[0]
            $OpeningHoursDelivery = ((($Restaurant -split "Utkörning:</strong><br />")[1] -split "<br />")[0]).Trim()
            $OpeningHoursTakeAway = ((($Restaurant -split "Avhämtning:</strong><br />")[1] -split "<br />")[0]).Trim()
            $RestaurantLink = ((($Restaurant -split "meny")[0] -split "href=`"")[1] -split "`"")[0]

            $returnObject = New-Object System.Object
            $returnObject | Add-Member -Type NoteProperty -Name RestaurantName -Value $RestaurantName
            $returnObject | Add-Member -Type NoteProperty -Name RestaurantStreet -Value $RestaurantStreet
            $returnObject | Add-Member -Type NoteProperty -Name OpeningHoursDelivery -Value $OpeningHoursDelivery
            $returnObject | Add-Member -Type NoteProperty -Name OpeningHoursTakeAway -Value $OpeningHoursTakeAway
            $returnObject | Add-Member -Type NoteProperty -Name RestaurantLink -Value $RestaurantLink

            Write-Output $returnObject

            Remove-Variable RestaurantName, RestaurantStreet, OpeningHoursDelivery, OpeningHoursTakeAway, RestaurantLink -ErrorAction SilentlyContinue
        }
    }

    END { }
}

A few more comments might be needed here, if you look at line 19, we use the opposite of split, the join-operator. Why? Well, when looking at the html-code of the site the information is spanning over multiple lines, by joining on linefeeds (`n = linefeed) we can get all the information for each restaurant as "one part" instead of multiple lines, which helps a lot!

Also, at line 32 and 33, we call a method called Trim(), this method removes all leading and trailing white-space characters from the string we're working on.

Finally, at line 45 we remove all the variables to prevent them from being "reused" on the next iteration of the loop if the next restaurants data is different or missing. Clear-Variable would work perfectly here aswell.

And that's it!

Result
We have now created functions to connect to a site, utilize functions that are only available when logged in and we have also made the functions work together in a nice way.

This is how they look in action:
finally

Pretty neat, huh? πŸ™‚

The code for all of these functions have been uploaded to PoshCode.org here.

I hope you enjoyed this little guide, and if you have any questions, feel free to ask them in the comments or drop me an e-mail!

And keep automating anything πŸ™‚

2 thoughts on “Ordering pizza with PowerShell (web scraping guide) – Part 2

  1. lazylama

    have a question, i tried the first part of this tutorial and it works, but now i’m trying to figure out once you have logged in, how do i click a link inside of that session. if it puts everything in the dump.htm is the session still open

    1. Anders Post author

      Hi,
      Thank you for reading and commenting!

      As long as you keep your “WebSession” variable specified, you should be able to reach any page using that session (this depends a bit on the site though).

      To make sure it works, you basically have to check the site reply/contents (ie. the dump.htm-file) for “proof” that you are still logged in.

      I hope this helped you, if not, please post another comment and I’ll be glad to assist you! πŸ™‚

Comments are closed.