Ordering pizza with PowerShell (web scraping guide) – Part 2

So, we have created our Connect-OnlinePizza function and now have access to parts of the site that are only available when logged in. But how?

Remember the Invoke-WebRequest cmdlet from the last post?
We specified a session variable in the Global scope. That variable holds the cookies and other data that keep our session with the site consistent across multiple web requests, and it’s what we’ll use in our next function, Get-MyOnlinePizzaAccountInfo.

Get-MyOnlinePizzaAccountInfo
First of all, we need to find what page holds the information we want. In this case, the page containing the account information was located at http://onlinepizza.se/?view=andraKonto (it requires you to be logged in).

Make sure you have run the “Connect-OnlinePizza” function first. That way the “$OnlinePizzaSession” variable will be available, which makes it possible for us to reach this page and see the details of our account.

To fetch the page and load it into a variable you could do this (we save it to file because of the issue with the encoding name, see part 1 of this guide):

Invoke-WebRequest -Uri "http://onlinepizza.se/?view=andraKonto" -Method Get -WebSession $Global:OnlinePizzaSession -OutFile .\dump.htm
$AccountInfo = Get-Content .\dump.htm -Encoding UTF8

If this worked, the next step is to find where the name appears in the “dump.htm” file. Use Select-String, or just open the file in Notepad and search for it.
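As a minimal sketch of that search, assuming a small made-up sample file instead of the real dump.htm (the file contents and the temporary path are illustrative only), Select-String can report both the line number and the text of the matching line:

```powershell
# Create a throwaway sample file; in the real workflow this would be dump.htm.
$SampleFile = [System.IO.Path]::GetTempFileName()
$SampleHtml = @(
    '<form method=post>',
    '<input type=text name=namn id=namn class="input-medium" maxlength=100 value="Anders Wahlqvist"/>',
    '</form>'
)
Set-Content -Path $SampleFile -Value $SampleHtml -Encoding UTF8

# -SimpleMatch treats the pattern as plain text rather than a regular expression.
$Hit = Select-String -Path $SampleFile -Pattern 'Anders Wahlqvist' -SimpleMatch

# LineNumber is 1-based; Line holds the full text of the matching line.
$FoundLineNumber = $Hit.LineNumber
$FoundLine = $Hit.Line

# Clean up the sample file.
Remove-Item -Path $SampleFile -Force
```

Once you know which line holds the value, you can pick a short, stable fragment of it (such as name=namn id=namn) to use as the pattern in the function.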

When you’ve found the line, you need to figure out how to trim away all the parts of the line that you don’t want. In my case, it looks like this:

<input type=text name=namn id=namn class="input-medium" maxlength=100 value="Anders Wahlqvist"/>

I’m by no means an expert in string manipulation or regex, so there is probably a better way of doing this, but I usually use the split operator to get the part I want. In this case we need to split the string after value=” and before “/> (or remove it). We also need to fetch this particular line from the site’s HTML code.

Take a look at this line:

$AccountHolderName = ((($AccountInfo | Select-String -Pattern "name=namn id=namn") -split "value=`"")[1] -split "`"/>")[0]

That might look like a complete mess, but we’ll break it down! We first need to fetch the correct line that contains the name which we can do with:

$AccountInfo | Select-String -Pattern "name=namn id=namn"

We want to do the “splitting” on the results of that, and therefore we need to put parentheses around that command before we add the “-split” operator. So let’s split that up and see what happens:

PS> ($AccountInfo | Select-String -Pattern "name=namn id=namn") -split "value=`""
<input type=text name=namn id=namn class="input-medium" maxlength=100 
Anders Wahlqvist"/>

As you can see, we get two tokens back, and we need the second one. This is easily done by putting everything in another pair of parentheses and then specifying which token we want. Since the first one is identified as 0 and the one we want as 1, we end up with this:

PS> (($AccountInfo | Select-String -Pattern "name=namn id=namn") -split "value=`"")[1]
Anders Wahlqvist"/>

To get rid of that last part, we could either use the "replace" operator or do another split. The "replace" operator might look like the obvious choice here, but in my experience the split operator gives a more robust and consistent result. The site might change and add something else after "/> on the same line, or there might be some white space that you didn't see. So let's just do another split, wrap that up in a new set of parentheses and select token 0 (the first one), which gives us the line we started with:

$AccountHolderName = ((($AccountInfo | Select-String -Pattern "name=namn id=namn") -split "value=`"")[1] -split "`"/>")[0]

Hopefully this line doesn't seem as messy anymore 🙂
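To make the steps easier to try out, here is a small sketch that runs the same two splits on a hard-coded copy of the sample line instead of the downloaded page. The regex alternative at the end is my own addition and not part of the original function; it is just one possible way to do the same thing in a single step.

```powershell
# Sample line copied from the post; in the real function it comes from dump.htm.
$Line = '<input type=text name=namn id=namn class="input-medium" maxlength=100 value="Anders Wahlqvist"/>'

# Step 1: split on value=" and keep the second token (index 1).
$AfterValue = ($Line -split 'value="')[1]

# Step 2: split on "/> and keep the first token (index 0).
$Name = ($AfterValue -split '"/>')[0]

# Alternative (not from the original post): a single regex with a capture
# group that grabs everything between value=" and the next double quote.
if ($Line -match 'value="([^"]*)"') {
    $NameFromRegex = $Matches[1]
}
```

Both $Name and $NameFromRegex end up holding the account holder's name without the surrounding markup.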

Now we repeat that for all the information we want, like this:

$Username = ((($AccountInfo | Select-String -Pattern "name=username id=username") -split "value=`"")[1] -split "`" />")[0]
$AccountHolderName = ((($AccountInfo | Select-String -Pattern "name=namn id=namn") -split "value=`"")[1] -split "`"/>")[0]
$AccountHolderMail = ((($AccountInfo | Select-String -Pattern "name=epost id=epost") -split "value=`"")[1] -split "`"/>")[0]
$AccountHolderStreet = ((($AccountInfo | Select-String -Pattern "name=adress1 id=adress1") -split "value=`"")[1] -split "`"/>")[0]
$AccountHolderPostalCode = ((($AccountInfo | Select-String -Pattern "name=postnummer id=postnummer") -split "value=`"")[1] -split "`"/>")[0]
$AccountHolderPhone = ((($AccountInfo | Select-String -Pattern "name=telefon id=telefon") -split "value=`"")[1] -split "`"/>")[0]

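Since the six lines above only differ in the field name, the repetition could be moved into a small helper. The sketch below is hypothetical: Get-InputValue is a name I made up, it uses a regex instead of the double split, and the sample HTML (including the username line) is invented for the demo rather than copied from the site.

```powershell
function Get-InputValue {
    param(
        [string[]] $Html,       # lines of HTML, e.g. the output of Get-Content
        [string]   $FieldName   # name/id of the input element, e.g. 'namn'
    )

    # Escape the field name so it is matched literally, then capture the
    # contents of the value attribute that follows on the same line.
    $Escaped = [regex]::Escape($FieldName)
    $Pattern = 'name={0} id={0}\b.*?value="([^"]*)"' -f $Escaped

    foreach ($Line in $Html) {
        if ($Line -match $Pattern) {
            # Return the first match and stop looking.
            return $Matches[1]
        }
    }
}

# Made-up sample lines standing in for the contents of dump.htm.
$SampleHtml = @(
    '<input type=text name=username id=username value="anders" />',
    '<input type=text name=namn id=namn class="input-medium" maxlength=100 value="Anders Wahlqvist"/>'
)

$Username          = Get-InputValue -Html $SampleHtml -FieldName 'username'
$AccountHolderName = Get-InputValue -Html $SampleHtml -FieldName 'namn'
```

If a field is missing from the page, the helper returns nothing, so the variable ends up as $null instead of holding a half-parsed string.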
And finally, we create an object for it and send it to the pipeline:

$returnObject = New-Object System.Object
$returnObject | Add-Member -Type NoteProperty -Name Username -Value $Username
$returnObject | Add-Member -Type NoteProperty -Name Name -Value $AccountHolderName
$returnObject | Add-Member -Type NoteProperty -Name Email -Value $AccountHolderMail
$returnObject | Add-Member -Type NoteProperty -Name Address -Value $AccountHolderStreet
$returnObject | Add-Member -Type NoteProperty -Name PostalCode -Value $AccountHolderPostalCode
$returnObject | Add-Member -Type NoteProperty -Name Phone -Value $AccountHolderPhone

Write-Output $returnObject
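As a side note, PowerShell v3 and later also offer the [pscustomobject] type accelerator, which builds the same kind of object in one expression and keeps the properties in the order they are written. The sketch below is an alternative to the Add-Member approach, not what the function in this post uses, and all the values are made-up placeholders.

```powershell
# Placeholder values; in the real function these come from the scraped page.
$Username                = 'anders'
$AccountHolderName       = 'Anders Wahlqvist'
$AccountHolderMail       = 'anders@example.com'
$AccountHolderStreet     = 'Example Street 1'
$AccountHolderPostalCode = '12345'
$AccountHolderPhone      = '0700000000'

# Build the object in one go; property order follows the order written here.
$returnObject = [pscustomobject]@{
    Username   = $Username
    Name       = $AccountHolderName
    Email      = $AccountHolderMail
    Address    = $AccountHolderStreet
    PostalCode = $AccountHolderPostalCode
    Phone      = $AccountHolderPhone
}

Write-Output $returnObject
```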

So far so good. Time to wrap this up in a function. We already looked at how to do that in the last post, so I'll just add the complete code here:

function Get-MyOnlinePizzaAccountInfo
{
    [cmdletbinding()]
    param()

    BEGIN {
        if ($OnlinePizzaSession -eq $null) {
            Write-Error "You must first connect using the Connect-OnlinePizza cmdlet"
            break
        }
    }

    PROCESS {

        Invoke-WebRequest -Uri "http://onlinepizza.se/?view=andraKonto" -Method Get -WebSession $Global:OnlinePizzaSession -OutFile .\dump.htm

        $AccountInfo = Get-Content .\dump.htm -Encoding UTF8

        Remove-Item .\dump.htm -Force -Confirm:$false -ErrorAction SilentlyContinue

        $Username = ((($AccountInfo | Select-String -Pattern "name=username id=username") -split "value=`"")[1] -split "`" />")[0]
        $AccountHolderName = ((($AccountInfo | Select-String -Pattern "name=namn id=namn") -split "value=`"")[1] -split "`"/>")[0]
        $AccountHolderMail = ((($AccountInfo | Select-String -Pattern "name=epost id=epost") -split "value=`"")[1] -split "`"/>")[0]
        $AccountHolderStreet = ((($AccountInfo | Select-String -Pattern "name=adress1 id=adress1") -split "value=`"")[1] -split "`"/>")[0]
        $AccountHolderPostalCode = ((($AccountInfo | Select-String -Pattern "name=postnummer id=postnummer") -split "value=`"")[1] -split "`"/>")[0]
        $AccountHolderPhone = ((($AccountInfo | Select-String -Pattern "name=telefon id=telefon") -split "value=`"")[1] -split "`"/>")[0]

        $returnObject = New-Object System.Object
        $returnObject | Add-Member -Type NoteProperty -Name Username -Value $Username
        $returnObject | Add-Member -Type NoteProperty -Name Name -Value $AccountHolderName
        $returnObject | Add-Member -Type NoteProperty -Name Email -Value $AccountHolderMail
        $returnObject | Add-Member -Type NoteProperty -Name Address -Value $AccountHolderStreet
        $returnObject | Add-Member -Type NoteProperty -Name PostalCode -Value $AccountHolderPostalCode
        $returnObject | Add-Member -Type NoteProperty -Name Phone -Value $AccountHolderPhone

        Write-Output $returnObject

    }

    END { }
}

Take a look at lines 7 through 10 of the function. Here we check whether a variable called "$OnlinePizzaSession" is available. If it isn't, the user running this function probably didn't run the "Connect-OnlinePizza" function first, and this function won't work. Therefore, we write an error and exit the function. This is a pretty good way to make sure that the functions are used in the right order.
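If you want to experiment with this kind of check on its own, here is a minimal sketch. Test-GlobalVariable is a hypothetical helper name, and the demo uses a throwaway variable called DemoPizzaSession so that it doesn't touch a real session; the functions in this post do the check inline instead.

```powershell
function Test-GlobalVariable {
    param(
        [string] $Name = 'OnlinePizzaSession'   # variable name, without the leading $
    )

    # -ValueOnly returns the variable's value; SilentlyContinue suppresses the
    # error Get-Variable would otherwise write when the variable doesn't exist.
    $Value = Get-Variable -Name $Name -Scope Global -ValueOnly -ErrorAction SilentlyContinue
    return ($null -ne $Value)
}

# Demo with a throwaway variable name so a real session is left alone.
Remove-Variable -Name DemoPizzaSession -Scope Global -ErrorAction SilentlyContinue
$BeforeConnect = Test-GlobalVariable -Name 'DemoPizzaSession'

$Global:DemoPizzaSession = 'placeholder session'
$AfterConnect = Test-GlobalVariable -Name 'DemoPizzaSession'

Remove-Variable -Name DemoPizzaSession -Scope Global
```

Asking for the Global scope explicitly makes sure the check looks at the same variable that the -SessionVariable parameter created, even if a local variable with the same name happens to exist.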

So, finally time for our last function!

Get-PizzaRestaurant
Most parts of this function will be created more or less in the exact same way as the last one, so I'll just go through the differences.

First of all, we want these cmdlets to work together in a good way to give them that "module"-feeling 🙂

One way of doing that is to add pipeline support, but how?

Well, this function will return a list of restaurants based on our location, and the location is based on our postal code (zip code). If you check our last function we actually return a property value called "PostalCode" which would be perfect for pipelining, and it's really easy to do!

All we need is "ValueFromPipelineByPropertyName=$true" when declaring the parameter, like this:

    param(
          [Parameter(Mandatory=$True,ValueFromPipelineByPropertyName=$true)]
          [int] $PostalCode)

We also need to verify that the property name on the object we output matches the parameter name:
[Screenshot: pipeline_pizza]

Also, as you can see, we declare the parameter's data type as an int. This way, no one can give us a postal code with spaces in it. If we wanted to, we could also validate that it really is a postal code, but this guide is less about writing advanced functions in general and more about web scraping, so we'll just let it be.
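Here is a small, self-contained sketch of how binding by property name behaves. Show-PostalCode is a made-up demo function, the account object is a placeholder, and the ValidateRange attribute is only my assumption of what a simple "is it five digits" check could look like.

```powershell
function Show-PostalCode {
    [CmdletBinding()]
    param(
        [Parameter(Mandatory = $true, ValueFromPipelineByPropertyName = $true)]
        [ValidateRange(10000, 99999)]   # assumption: five digits, no leading zero
        [int] $PostalCode
    )

    PROCESS {
        # Emit the bound value so we can inspect what the parameter received.
        Write-Output $PostalCode
    }
}

# Placeholder object with a PostalCode property, like the one our account
# function returns. Note that the value is a string here.
$Account = [pscustomobject]@{ Name = 'Anders Wahlqvist'; PostalCode = '12345' }

# The PostalCode property is matched to the -PostalCode parameter by name,
# and the string is converted to an int during binding.
$Bound = $Account | Show-PostalCode
```

Because the scraped value is a string, the conversion to int happens when the parameter is bound. A value that can't be converted, such as one with a space in it, makes the binding fail with an error instead of sending a bad request to the site.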

Let's look at the rest of this function:

function Get-PizzaRestaurant
{
    [cmdletbinding()]
    param(
          [Parameter(Mandatory=$True,ValueFromPipelineByPropertyName=$true)]
          [int] $PostalCode)

    BEGIN {
        if ($OnlinePizzaSession -eq $null) {
            Write-Error "You must first connect using the Connect-OnlinePizza cmdlet"
            break
        }
    }

    PROCESS {

        Invoke-WebRequest -Uri "http://onlinepizza.se/postnummer/$PostalCode" -Method Get -WebSession $Global:OnlinePizzaSession -OutFile .\dump.htm

        $RestaurantList = ((Get-Content .\dump.htm) -join "`n") -split "<UL>" | Select-Object -Skip 1

        Remove-Item .\dump.htm -Force -Confirm:$false -ErrorAction SilentlyContinue

        foreach ($Restaurant in $RestaurantList) {

            $RestaurantName = (($Restaurant -split "<h4>")[1] -split "</h4>")[0]

            if ($RestaurantName -eq '') {
                Continue
            }

            $RestaurantStreet = (($Restaurant -split "<address>")[1] -split "</address>")[0]
            $OpeningHoursDelivery = ((($Restaurant -split "Utkörning:</strong><br />")[1] -split "<br />")[0]).Trim()
            $OpeningHoursTakeAway = ((($Restaurant -split "Avhämtning:</strong><br />")[1] -split "<br />")[0]).Trim()
            $RestaurantLink = ((($Restaurant -split "meny")[0] -split "href=`"")[1] -split "`"")[0]

            $returnObject = New-Object System.Object
            $returnObject | Add-Member -Type NoteProperty -Name RestaurantName -Value $RestaurantName
            $returnObject | Add-Member -Type NoteProperty -Name RestaurantStreet -Value $RestaurantStreet
            $returnObject | Add-Member -Type NoteProperty -Name OpeningHoursDelivery -Value $OpeningHoursDelivery
            $returnObject | Add-Member -Type NoteProperty -Name OpeningHoursTakeAway -Value $OpeningHoursTakeAway
            $returnObject | Add-Member -Type NoteProperty -Name RestaurantLink -Value $RestaurantLink

            Write-Output $returnObject

            Remove-Variable RestaurantName, RestaurantStreet, OpeningHoursDelivery, OpeningHoursTakeAway, RestaurantLink -ErrorAction SilentlyContinue
        }
    }

    END { }
}

A few more comments might be needed here. If you look at line 19, we use the opposite of split, the join operator. Why? Well, in the HTML code of the site, the information for each restaurant spans multiple lines. By joining on linefeeds (`n = linefeed) we get all the information for each restaurant as "one part" instead of multiple lines, which helps a lot!
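To see the join-then-split idea in isolation, here is a sketch that works on a few made-up lines instead of the real page. The markup and restaurant names are invented for the demo and only loosely mimic the structure described above.

```powershell
# Made-up lines standing in for the output of Get-Content .\dump.htm.
$Lines = @(
    '<div>header</div>',
    '<UL>',
    '<h4>Pizzeria One</h4>',
    '<address>Main Street 1</address>',
    '<UL>',
    '<h4>Pizzeria Two</h4>',
    '<address>Side Street 2</address>'
)

# Join all lines into one string, split it into one block per <UL>,
# and skip the first block (everything before the first restaurant).
$Blocks = ($Lines -join "`n") -split '<UL>' | Select-Object -Skip 1

# Each block now holds all the lines for one restaurant.
$Restaurants = foreach ($Block in $Blocks) {
    $Name   = (($Block -split '<h4>')[1] -split '</h4>')[0]
    $Street = (($Block -split '<address>')[1] -split '</address>')[0]

    [pscustomobject]@{
        RestaurantName   = $Name
        RestaurantStreet = $Street
    }
}
```

Without the join, each element of $Lines would be processed separately and the name and the address of a restaurant would never end up in the same string.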

Also, at lines 32 and 33, we call a method called Trim(). This method removes all leading and trailing white-space characters from the string we're working on.

Finally, at line 45 we remove all the variables to prevent them from being "reused" on the next iteration of the loop if the next restaurant's data is different or missing. Clear-Variable would work perfectly here as well.

And that's it!

Result
We have now created functions to connect to a site, utilize functions that are only available when logged in and we have also made the functions work together in a nice way.

This is how they look in action:
[Screenshot: finally]

Pretty neat, huh? 🙂

The code for all of these functions has been uploaded here.

I hope you enjoyed this little guide, and if you have any questions, feel free to ask them in the comments or drop me an e-mail!

And keep automating anything 🙂

Ordering pizza with PowerShell (web scraping guide) – Part 1

Since I’ve gotten some positive feedback regarding the web scrape related posts on this blog, I thought I should write a guide on how to build PowerShell functions that interact with a website. To make it a bit more fun, I thought it should be about ordering pizza!

The site in question is a Swedish site called OnlinePizza.

Since the actual code won’t be used for anything serious, you will see a few shortcuts here and there. The goal is to give you an idea of how to do something similar, not to create the perfect pizza automation module. In fact, we’ll only create a few functions: one for logging in, one for checking our account information and one for listing restaurants. Hopefully, the process and steps we go through when creating these functions will teach you the basics of web scraping. Creating a whole module that can handle the complete process of ordering pizzas seems a bit overkill, but feel free to keep working on it if you want to 🙂

So with no further ado, let’s get to it…

Figuring out how the site works and logging in
The first thing we need to do is figure out how to log in. The easiest way to do this is to open a web browser with some developer features. Since I’ve used Chrome before, I thought I could use Internet Explorer in this example.

Begin by locating the login page for the site, which in this example is http://onlinepizza.se/loggain, as shown below:
[Screenshot: press_login]
After you’ve browsed to that page and filled in your username/password, you press F12 (Ctrl+Shift+i if you’re using Chrome) and select the “Network”-tab and press the “Start capturing”-button.

It should look something like this if you are using IE:
[Screenshot: network_capture_started]

Go back to the site and press the “login” button (“Logga in” in Swedish), and when it’s done, go back to the “network”-tab again and check the first request which is usually a Get or a Post request, in this case a Post-request:
[Screenshot: login_pressed]

Double click on the first row and select the “Request body”-tab. This is how the webrequest looked when it got sent to the webserver, so this is what we need to mimic from PowerShell:

[Screenshot: login_post]

This can be done in different ways. You could either download the page, manipulate the form fields and then post the form, or create a hashtable with the required keys (username, password and action) and send that. The latter is quickest since it only needs one request, so that’s what we’ll do here (this might not always work though).

To create the hashtable you can do this:

$Request = @{'username' = 'MyUsername'
             'password'= 'MyPassword'
             'action'= 'loggain'}

We now need to send it to the web server. The easiest way to do this is usually to use the “Invoke-WebRequest” cmdlet that came with PowerShell v3, so that’s what we’ll do here. There are a few scenarios where this will give you issues, as is the case with this site, so we’ll need to use a workaround. When trying to download the site with the “Invoke-WebRequest” cmdlet it will give you the error: “Invoke-WebRequest : '"UTF-8"' is not a supported encoding name.”

This leaves you two options as far as I know, you either skip the “Invoke-WebRequest”-cmdlet altogether and use the .Net WebClient Class instead, or you can fix this error by sending the output of the cmdlet to a file instead of the pipeline. We’ll do the latter here and save the .Net-method for another post.

Note: As stated above, this is mostly to give you an idea of how to create a “web scraping function” in PowerShell, so we’ll take a few shortcuts. If this were to become a serious module that would later be used in production, make sure the output file is written to a place where the user has write access and that it has a name that won’t damage (overwrite) anything.
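One way to follow that advice, shown here only as a sketch, is to let .NET create a uniquely named file in the user's temp folder and to clean it up in a finally block. The Set-Content line is a stand-in for the web request so the example can run without network access, and the sample HTML is made up.

```powershell
# Creates an empty, uniquely named file in the current user's temp folder.
$DumpFile = [System.IO.Path]::GetTempFileName()

try {
    # Stand-in for: Invoke-WebRequest ... -OutFile $DumpFile
    Set-Content -Path $DumpFile -Value '<html>sample response</html>' -Encoding UTF8

    # Read the saved response back, just like the functions in this guide do.
    $Html = Get-Content -Path $DumpFile -Encoding UTF8
}
finally {
    # Runs even if the request or the read fails, so no dump file is left behind.
    Remove-Item -Path $DumpFile -Force -ErrorAction SilentlyContinue
}

$FileStillExists = Test-Path -Path $DumpFile
```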

The command itself is pretty simple, it looks like this:

Invoke-WebRequest -Uri "https://onlinepizza.se/loggain" -Method Post -Body $Request -SessionVariable Global:OnlinePizzaSession -OutFile .\dump.htm

I’ll break it down a bit so you’ll know what each parameter does:

  1. Uri – This specifies where to send the request. It should be the URL you saw in the screenshot of the “Request body” above. This can differ from the actual login page depending on the site.
  2. Method – This should match the method you saw in the login request, in this case a “Post” request.
  3. Body – This is what the request will actually contain, in our case the hashtable we created.
  4. SessionVariable – The variable we specify here (no leading $!) will contain cookies and other data needed to keep the session consistent through the rest of the commands we will run (this will for example keep us logged in). I’ve specified it in the “Global” scope since we want to use it together with other functions later on. To make that work, it can’t be in the function’s scope (since that will be gone before the next function executes).
  5. OutFile – Our workaround. Specify a file where the output from the command (the html of the site) should be saved.

Alright, so the request is sent and we should now be logged in. You can verify this by looking in the “dump.htm” file, which will usually contain a “You have been logged in!” message of some sort. In this case that message is in Swedish though.

So, we have figured out how to log in, we now need to wrap a function around this, which will be our next step in this process!

Creating the function
To create this function we need to ask for a parameter, the only one we need in this case is the credential, which should be of the type PSCredential.

The code for defining the function name and the parameter looks like this:

function Connect-OnlinePizza
{
    [cmdletbinding()]
    param(
          [Parameter(Mandatory=$True)]
          [System.Management.Automation.PSCredential] $Credential)

As you can see, we also add the almost magical “cmdletbinding” attribute to get all the wonderful features it gives us. We also mark the Credential parameter as mandatory and define its data type, which makes the function prompt for the credential in the same way “Get-Credential” does if the user didn’t specify one.

We now need to place the credential in our request, which we can do this way:

$Username = $Credential.UserName
$Password = $Credential.GetNetworkCredential().Password

$Request = @{'username' = $Username
             'password'= $Password
             'action'= 'loggain'}
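If you want to try this part without being prompted, you can build a PSCredential object by hand. The sketch below uses the placeholder username and password from earlier in this post; storing a real password in plain text like this is only acceptable for quick experiments.

```powershell
# Placeholder credentials; -AsPlainText -Force is needed to convert a
# plain string into the SecureString that PSCredential expects.
$SecurePassword = ConvertTo-SecureString 'MyPassword' -AsPlainText -Force
$Credential = New-Object System.Management.Automation.PSCredential ('MyUsername', $SecurePassword)

# Same extraction as in the function: the username is available directly,
# the password has to be read through GetNetworkCredential().
$Username = $Credential.UserName
$Password = $Credential.GetNetworkCredential().Password

$Request = @{'username' = $Username
             'password'= $Password
             'action'= 'loggain'}
```

The hashtable now contains the same three keys as the hard-coded version, which is what gets sent as the body of the request.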

We then add our “Invoke-WebRequest”-command again:

Invoke-WebRequest -Uri "https://onlinepizza.se/loggain" -Method Post -Body $Request -SessionVariable Global:OnlinePizzaSession -OutFile .\dump.htm

And we should be logged in. When building something like this we should make sure though. This can easily be done by looking for that “You have been logged in!”-text in the dump.htm-file, for example with the “Select-String”-cmdlet.

And while we’re at it, why not delete the dump.htm-file to clean up a bit. That would look like:

$LoggedIn = Select-String -Path .\dump.htm -Pattern "inloggad som $Username" -Quiet

Remove-Item .\dump.htm -Force -Confirm:$false -ErrorAction SilentlyContinue

if ($LoggedIn) {
    Write-Verbose "You are now logged in!"
}
else {
    Write-Error "Login failed!"
}
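One thing to be aware of is that Select-String treats the pattern as a regular expression by default, so a username containing characters such as + can make the check fail even when the login worked. The sketch below shows the difference on a made-up response file. The username, the file contents and the use of -SimpleMatch are my own assumptions for the demo and are not part of the function above.

```powershell
# Hypothetical username containing a regex special character (+).
$Username = 'user+pizza@example.com'

# Made-up response file standing in for dump.htm.
$DumpFile = [System.IO.Path]::GetTempFileName()
Set-Content -Path $DumpFile -Value "<p>inloggad som $Username</p>" -Encoding UTF8

# Default behaviour: the pattern is a regex, so "user+" means "use" followed
# by one or more "r", and the literal plus sign in the file is not matched.
$RegexMatch = [bool](Select-String -Path $DumpFile -Pattern "inloggad som $Username" -Quiet)

# -SimpleMatch compares the text literally instead.
$LoggedIn = [bool](Select-String -Path $DumpFile -Pattern "inloggad som $Username" -SimpleMatch -Quiet)

Remove-Item -Path $DumpFile -Force -ErrorAction SilentlyContinue
```

The [bool] cast is there so that both variables hold a clean true or false value regardless of what the cmdlet returns when nothing matches.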

And that’s it! When put together, the code looks like this:

function Connect-OnlinePizza
{
    [cmdletbinding()]
    param(
          [Parameter(Mandatory=$True)]
          [System.Management.Automation.PSCredential] $Credential)

    $Username = $Credential.UserName
    $Password = $Credential.GetNetworkCredential().Password

    $Request = @{'username' = $Username
                 'password'= $Password
                 'action'= 'loggain'}

    Invoke-WebRequest -Uri "https://onlinepizza.se/loggain" -Method Post -Body $Request -SessionVariable Global:OnlinePizzaSession -OutFile .\dump.htm

    $LoggedIn = Select-String -Path .\dump.htm -Pattern "inloggad som $Username" -Quiet

    Remove-Item .\dump.htm -Force -Confirm:$false -ErrorAction SilentlyContinue

    if ($LoggedIn) {
        Write-Verbose "You are now logged in!"
    }
    else {
        Write-Error "Login failed!"
    }
}

And this is how it looks in action:
[Screenshot: Connect-OnlinePizza_screenshot]

That’s it for this post. In the next one, we’ll create the function for fetching our account information and one for getting a list of what restaurants are available in our location.