This is the third and final post in a three-part series.
- Part 1:
  - Useful methods on the String class
  - Introduction to Regular Expressions
  - The Select-String cmdlet
- Part 2:
  - the -split operator
  - the -match operator
  - the switch statement
  - the Regex class
- Part 3:
  - a real world, complete and slightly bigger, example of a switch-based parser
    - General structure of a switch-based parser
    - The real world example
In the previous posts, we looked at the different operators that are available to us in PowerShell.
When analyzing crashes at DICE, I noticed that some of the C++ runtime binaries were missing debug symbols. They should be available for download from Microsoft’s public symbol server, and most versions were there. However, due to some process errors at DevDiv, some builds were released publicly without available debug symbols.
In some cases, those missing symbols prevented us from debugging those crashes, and in all cases, they triggered my developer OCD.
So, to give actionable feedback to Microsoft, I scripted a debugger (cdb.exe in this case) to give a verbose list of the loaded modules, and parsed the output with PowerShell, which was also later used to group and filter the resulting data set. I sent this data to Microsoft, and 5 days later, the missing symbols were available for download. Mission accomplished!
This post will describe the parser I wrote for this task (it turned out that I had good use for it for other tasks later), and the general structure is applicable to most parsing tasks.
The example will show how a switch-based parser would look when the input data isn’t as tidy as it normally is in examples, but messy, as real world data often is.
General Structure of a switch Based Parser
Depending on the structure of our input, the code must be organized in slightly different ways.
Input may have a record start that differs by indentation or some distinct token like
Foo                  <- Record start - No whitespace at the beginning of the line
    Prop1=Staffan    <- Properties for the record - starts with whitespace
    Prop3 =ValueN
Bar
    Prop1=Steve
    Prop2=ValueBar2
If the data to be parsed has an explicit start record, it is a bit easier than if it doesn’t have one.
We create a new data object when we get a record start, after writing any previously created object to the pipeline.
At the end, we need to check if we have parsed a record that hasn’t been written to the pipeline.
The general structure of such a switch-based parser can be as follows:
$inputData = @"
Foo
    Prop1=Value1
    Prop3=Value3
Bar
    Prop1=ValueBar1
    Prop2=ValueBar2
"@ -split '\r?\n'   # This regex is useful to split at line endings, with or without carriage return

class SomeDataClass {
    $ID
    $Name
    $Property2
    $Property3
}

# map to project input property names to the properties on our data class
$propertyNameMap = @{
    Prop1 = "Name"
    Prop2 = "Property2"
    Prop3 = "Property3"
}

$currentObject = $null
switch -regex ($inputData) {

    '^(\S.*)' {
        # record start pattern, in this case a line that doesn't start with whitespace.
        if ($null -ne $currentObject) {
            $currentObject                   # output to pipeline if we have a previous data object
        }
        $currentObject = [SomeDataClass] @{  # create new object for this record
            Id = $matches[1]                 # with Id like Foo or Bar
        }
        continue
    }

    # set the properties on the data object
    '^\s+([^=]+)=(.*)' {
        $name, $value = $matches[1, 2]
        # project property names; fall back to the input name when there is no mapping
        $propName = $propertyNameMap[$name]
        if ($null -eq $propName) {
            $propName = $name
        }
        # assign the parsed value to the projected property name
        $currentObject.$propName = $value
        continue
    }
}

if ($currentObject) {
    # Handle the last object if any
    $currentObject # output to pipeline
}
ID  Name      Property2 Property3
--  ----      --------- ---------
Foo Value1              Value3
Bar ValueBar1 ValueBar2
Alternatively, we may have input where the records are separated by a blank line, but without any obvious record start.
commitId=1234                  <- In this case, a commitId is first in a record
description=Update readme.md
                               <- the blank line separates records
user=Staffan                   <- For this record, a user property comes first
commitId=1235
description=Fix bug.md

In this case the structure of the code looks a bit different. We create an object at the beginning, but keep track of whether it’s dirty or not.
If we get to the end with a dirty object, we must output it.
$inputData = @"
commit=1234
desc=Update readme.md

user=Staffan
commit=1235
desc=Bug fix
"@ -split "\r?\n"

class SomeDataClass {
    [int] $CommitId
    [string] $Description
    [string] $User
}

# map to project input property names to the properties on our data class
# we only need to provide the ones that are different. 'User' works fine as it is.
$propertyNameMap = @{
    commit = "CommitId"
    desc   = "Description"
}

$currentObject = [SomeDataClass]::new()
$objectDirty = $false
switch -regex ($inputData) {
    # set the properties on the data object
    '^([^=]+)=(.*)$' {
        # parse a name/value
        $name, $value = $matches[1, 2]
        # project property names
        $propName = $propertyNameMap[$name]
        if ($null -eq $propName) {
            $propName = $name
        }
        # assign the projected property
        $currentObject.$propName = $value
        $objectDirty = $true
        continue
    }

    '^\s*$' {
        # separator pattern, in this case any blank line
        if ($objectDirty) {
            $currentObject                           # output to pipeline
            $currentObject = [SomeDataClass]::new()  # create new object
            $objectDirty = $false                    # and mark it as not dirty
        }
    }
    default {
        Write-Warning "Unexpected input: '$_'"
    }
}

if ($objectDirty) {
    # Handle the last object if any
    $currentObject # output to pipeline
}
CommitId Description      User
-------- -----------      ----
    1234 Update readme.md
    1235 Bug fix          Staffan
The Real World Example
I have adapted this sample slightly so that I get the loaded modules from a running process instead of from my crash dumps. The format of the output from the debugger is the same.
The following command launches a command line debugger on notepad, with a script that gives a verbose listing of the loaded modules, and quits:
# we need to muck around with the console output encoding to handle the trademark chars
# imagine no encodings
# it's easy if you try
# no code pages below us
# above us only sky
[Console]::OutputEncoding = [System.Text.Encoding]::GetEncoding("iso-8859-1")

$proc = Start-Process notepad -PassThru
Start-Sleep -Seconds 1
$cdbOutput = cdb -y 'srv*c:\symbols*http://msdl.microsoft.com/download/symbols' -c ".reload -f;lmv;q" -p $proc.Id

The output of the command above is here for those who want to follow along but aren’t running Windows or don’t have cdb.exe installed.
The (abbreviated) output looks like this:
Microsoft (R) Windows Debugger Version 10.0.16299.15 AMD64
Copyright (c) Microsoft Corporation. All rights reserved.

*** wait with pending attach

************* Path validation summary **************
Response Time (ms) Location
Deferred srv*c:\symbols*http://msdl.microsoft.com/download/symbols
Symbol search path is: srv*c:\symbols*http://msdl.microsoft.com/download/symbols
Executable search path is:
ModLoad: 00007ff6`e9da0000 00007ff6`e9de3000 C:\Windows\system32\notepad.exe
...
ModLoad: 00007ffe`97d80000 00007ffe`97db1000 C:\WINDOWS\SYSTEM32\ntmarta.dll
(98bc.40a0): Break instruction exception - code 80000003 (first chance)
ntdll!DbgBreakPoint:
00007ffe`9cd53050 cc int 3
0:007> cdb: Reading initial command '.reload -f;lmv;q'
Reloading current modules
.....................................................
start end module name
00007ff6`e9da0000 00007ff6`e9de3000 notepad (pdb symbols) c:\symbols\notepad.pdb\2352C62CDF448257FDBDDA4081A8F9081\notepad.pdb
    Loaded symbol image file: C:\Windows\system32\notepad.exe
    Image path: C:\Windows\system32\notepad.exe
    Image name: notepad.exe
    Image was built with /Brepro flag.
    Timestamp: 329A7791 (This is a reproducible build file hash, not a timestamp)
    CheckSum: 0004D15F
    ImageSize: 00043000
    File version: 10.0.17763.1
    Product version: 10.0.17763.1
    File flags: 0 (Mask 3F)
    File OS: 40004 NT Win32
    File type: 1.0 App
    File date: 00000000.00000000
    Translations: 0409.04b0
    CompanyName: Microsoft Corporation
    ProductName: Microsoft??? Windows??? Operating System
    InternalName: Notepad
    OriginalFilename: NOTEPAD.EXE
    ProductVersion: 10.0.17763.1
    FileVersion: 10.0.17763.1 (WinBuild.160101.0800)
    FileDescription: Notepad
    LegalCopyright: ??? Microsoft Corporation. All rights reserved.
...
00007ffe`9ccb0000 00007ffe`9ce9d000 ntdll (pdb symbols) c:\symbols\ntdll.pdb\B8AD79538F2730FD9BACE36C9F9316A01\ntdll.pdb
    Loaded symbol image file: C:\WINDOWS\SYSTEM32\ntdll.dll
    Image path: C:\WINDOWS\SYSTEM32\ntdll.dll
    Image name: ntdll.dll
    Image was built with /Brepro flag.
    Timestamp: E8B54827 (This is a reproducible build file hash, not a timestamp)
    CheckSum: 001F20D1
    ImageSize: 001ED000
    File version: 10.0.17763.194
    Product version: 10.0.17763.194
    File flags: 0 (Mask 3F)
    File OS: 40004 NT Win32
    File type: 2.0 Dll
    File date: 00000000.00000000
    Translations: 0409.04b0
    CompanyName: Microsoft Corporation
    ProductName: Microsoft??? Windows??? Operating System
    InternalName: ntdll.dll
    OriginalFilename: ntdll.dll
    ProductVersion: 10.0.17763.194
    FileVersion: 10.0.17763.194 (WinBuild.160101.0800)
    FileDescription: NT Layer DLL
    LegalCopyright: ??? Microsoft Corporation. All rights reserved.
quit:
The output starts with info that I’m not interested in here. I only want to get the detailed information about the loaded modules. It is not until the line
start end module name
that I care about the output.
Also, at the end there is a line that we need to be aware of:
quit:
that is not part of the module output.
To skip the parts of the debugger output that we don’t care about, we have a boolean flag initially set to true.
If that flag is set, we check if the current line, $_, is the module header, in which case we flip the flag.
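$inPreamble = $true
switch -regex ($cdbOutput) {

    # \s+ tolerates any amount of spacing between the header words
    { $inPreamble -and $_ -match '^start\s+end\s+module name' } { $inPreamble = $false; continue }

    # ... the cases that handle the module records follow here
}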
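The flag technique can be tried in isolation. The following is a minimal sketch, assuming made-up input lines instead of real debugger output; it goes one step further than the excerpt above and drops every line seen while the flag is still set, which is a variation and not what the full parser in this post does.

```powershell
# Hypothetical input: two lines of noise, the header, then the lines we want
$lines = @(
    'banner text'
    'some other noise'
    'start end module name'
    'first record'
    'second record'
)

$inPreamble = $true
$records = switch -regex ($lines) {
    { $inPreamble } {
        # still in the preamble: flip the flag when we see the header, and drop the line
        if ($_ -match '^start\s+end\s+module name') { $inPreamble = $false }
        continue
    }
    default { $_ }   # everything after the header is passed through
}

$records
```

The script block condition is evaluated for every line, so once the flag is cleared the first case never fires again and the remaining cases see all subsequent input.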
I have made the parser a separate function that reads its input from the pipeline. This way, I can use the same function to parse module data, regardless of how I got the module data. Maybe it was saved to a file, or came from a dump or a live process. It doesn’t matter, since the parser is decoupled from the data retrieval.
After the sample, there is a breakdown of the more complicated regular expressions used, so don’t despair if you don’t understand them at first.
Regular Expressions are notoriously hard to read, so much so that they make Perl look readable in comparison.
# define a class to store the data
class ExecutableModule {
    [string]   $Name
    [string]   $Start
    [string]   $End
    [string]   $SymbolStatus
    [string]   $PdbPath
    [bool]     $Reproducible
    [string]   $ImagePath
    [string]   $ImageName
    [DateTime] $TimeStamp
    [uint32]   $FileHash
    [uint32]   $CheckSum
    [uint32]   $ImageSize
    [version]  $FileVersion
    [version]  $ProductVersion
    [string]   $FileFlags
    [string]   $FileOS
    [string]   $FileType
    [string]   $FileDate
    [string[]] $Translations
    [string]   $CompanyName
    [string]   $ProductName
    [string]   $InternalName
    [string]   $OriginalFilename
    [string]   $ProductVersionStr
    [string]   $FileVersionStr
    [string]   $FileDescription
    [string]   $LegalCopyright
    [string]   $LegalTrademarks
    [string]   $LoadedImageFile
    [string]   $PrivateBuild
    [string]   $Comments
}

<#
.SYNOPSIS Runs a debugger on a program to dump its loaded modules
#>
function Get-ExecutableModuleRawData {
    param ([string] $Program)
    $consoleEncoding = [Console]::OutputEncoding
    [Console]::OutputEncoding = [System.Text.Encoding]::GetEncoding("iso-8859-1")
    try {
        $proc = Start-Process $Program -PassThru
        Start-Sleep -Seconds 1  # sleep for a while so modules are loaded
        cdb -y 'srv*c:\symbols*http://msdl.microsoft.com/download/symbols' -c ".reload -f;lmv;q" -p $proc.Id
        $proc.Close()
    }
    finally {
        [Console]::OutputEncoding = $consoleEncoding
    }
}

<#
.SYNOPSIS Converts verbose module data from windows debuggers into ExecutableModule objects.
#>
function ConvertTo-ExecutableModule {
    [OutputType([ExecutableModule])]
    param (
        [Parameter(ValueFromPipeline)]
        [string[]] $ModuleRawData
    )
    begin {
        $currentObject = $null
        $preamble = $true
        $propertyNameMap = @{
            'File flags'      = 'FileFlags'
            'File OS'         = 'FileOS'
            'File type'       = 'FileType'
            'File date'       = 'FileDate'
            'File version'    = 'FileVersion'
            'Product version' = 'ProductVersion'
            'Image path'      = 'ImagePath'
            'Image name'      = 'ImageName'
            'FileVersion'     = 'FileVersionStr'
            'ProductVersion'  = 'ProductVersionStr'
        }
    }
    process {
        switch -regex ($ModuleRawData) {

            # flip the preamble flag when we get to our sentinel line
            # (\s+ tolerates any amount of spacing between the header words)
            { $preamble -and $_ -match '^start\s+end\s+module name' } { $preamble = $false; continue }

            #00007ff6`e9da0000 00007ff6`e9de3000 notepad (deferred)
            #00007ffe`9ccb0000 00007ffe`9ce9d000 ntdll (pdb symbols) c:\symbols\ntdll.pdb\B8AD79538F2730FD9BACE36C9F9316A01\ntdll.pdb
            '^([0-9a-f`]{17})\s([0-9a-f`]{17})\s+(\S+)\s+\(([^\)]+)\)\s*(.+)?' {
                # see breakdown of the expression later in the post
                # on record start, output the currentObject, if any is set
                if ($null -ne $currentObject) {
                    $currentObject
                }
                $start, $end, $module, $pdbKind, $pdbPath = $matches[1..5]
                # create an instance of the object that we are adding info from the current record into.
                $currentObject = [ExecutableModule] @{
                    Start        = $start
                    End          = $end
                    Name         = $module
                    SymbolStatus = $pdbKind
                    PdbPath      = $pdbPath
                }
                continue
            }
            '^\s+Image was built with /Brepro flag\.' {
                $currentObject.Reproducible = $true
                continue
            }
            '^\s+Timestamp:\s+[^\(]+\((?<timestamp>.{8})\)' {
                # see breakdown of the regular expression later in the post
                # Timestamp: Mon Jan 7 23:42:30 2019 (5C33D5D6)
                $intValue = [Convert]::ToInt32($matches.timestamp, 16)
                $currentObject.TimeStamp = [DateTime]::new(1970, 01, 01, 0, 0, 0, [DateTimeKind]::Utc).AddSeconds($intValue)
                continue
            }
            '^\s+TimeStamp:\s+(?<value>.{8}) \(This' {
                # Timestamp: E78937AC (This is a reproducible build file hash, not a timestamp)
                $currentObject.FileHash = [Convert]::ToUInt32($matches.value, 16)
                continue
            }
            '^\s+Loaded symbol image file: (?<imageFile>[^\)]+)' {
                $currentObject.LoadedImageFile = $matches.imageFile
                continue
            }
            '^\s+Checksum:\s+(?<checksum>\S+)' {
                $currentObject.Checksum = [Convert]::ToUInt32($matches.checksum, 16)
                continue
            }
            '^\s+Translations:\s+(?<value>\S+)' {
                $currentObject.Translations = $matches.value.Split(".")
                continue
            }
            '^\s+ImageSize:\s+(?<imageSize>.{8})' {
                $currentObject.ImageSize = [Convert]::ToUInt32($matches.imageSize, 16)
                continue
            }
            '^\s{4}(?<name>[^:]+):\s+(?<value>.+)' {
                # see breakdown of the regular expression later in the post
                # This part is any 'name: value' pattern
                $name, $value = $matches['name', 'value']
                # project the property name
                $propName = $propertyNameMap[$name]
                $propName = if ($null -eq $propName) { $name } else { $propName }
                # note the dynamic property name in the assignment
                # this will fail if the object doesn't have a member with the specified name
                $currentObject.$propName = $value
                continue
            }
            'quit:' {
                # ignore and exit
                break
            }
            default {
                # When writing the parser, it can be useful to include a line like the one below to see the cases that are not handled by the parser
                # Write-Warning "missing case for '$_'. Unexpected output format from cdb.exe"

                continue # skip lines that don't match the patterns we are interested in, like the start/end/modulename header and the quit: output
            }
        }
    }
    end {
        # this is needed to output the last object
        if ($null -ne $currentObject) {
            $currentObject
        }
    }
}

Get-ExecutableModuleRawData Notepad |
    ConvertTo-ExecutableModule |
    Sort-Object ProductVersion, Name |
    Format-Table -Property Name, FileVersionStr, ProductVersion, FileDescription
Name            FileVersionStr                         ProductVersion FileDescription
----            --------------                         -------------- ---------------
PROPSYS         7.0.17763.1 (WinBuild.160101.0800)     7.0.17763.1    Microsoft Property System
ADVAPI32        10.0.17763.1 (WinBuild.160101.0800)    10.0.17763.1   Advanced Windows 32 Base API
bcrypt          10.0.17763.1 (WinBuild.160101.0800)    10.0.17763.1   Windows Cryptographic Primitives Library
...
uxtheme         10.0.17763.1 (WinBuild.160101.0800)    10.0.17763.1   Microsoft UxTheme Library
win32u          10.0.17763.1 (WinBuild.160101.0800)    10.0.17763.1   Win32u
WINSPOOL        10.0.17763.1 (WinBuild.160101.0800)    10.0.17763.1   Windows Spooler Driver
KERNELBASE      10.0.17763.134 (WinBuild.160101.0800)  10.0.17763.134 Windows NT BASE API Client DLL
wintypes        10.0.17763.134 (WinBuild.160101.0800)  10.0.17763.134 Windows Base Types DLL
SHELL32         10.0.17763.168 (WinBuild.160101.0800)  10.0.17763.168 Windows Shell Common Dll
...
windows_storage 10.0.17763.168 (WinBuild.160101.0800)  10.0.17763.168 Microsoft WinRT Storage API
CoreMessaging   10.0.17763.194                         10.0.17763.194 Microsoft CoreMessaging Dll
gdi32full       10.0.17763.194 (WinBuild.160101.0800)  10.0.17763.194 GDI Client DLL
ntdll           10.0.17763.194 (WinBuild.160101.0800)  10.0.17763.194 NT Layer DLL
RMCLIENT        10.0.17763.194 (WinBuild.160101.0800)  10.0.17763.194 Resource Manager Client
RPCRT4          10.0.17763.194 (WinBuild.160101.0800)  10.0.17763.194 Remote Procedure Call Runtime
combase         10.0.17763.253 (WinBuild.160101.0800)  10.0.17763.253 Microsoft COM for Windows
COMCTL32        6.10 (WinBuild.160101.0800)            10.0.17763.253 User Experience Controls Library
urlmon          11.00.17763.168 (WinBuild.160101.0800) 11.0.17763.168 OLE32 Extensions for Win32
iertutil        11.00.17763.253 (WinBuild.160101.0800) 11.0.17763.253 Run time utility for Internet Explorer
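Because a pipeline parser only ever sees lines on its input, the source of those lines can be swapped freely. The sketch below illustrates that with a deliberately tiny, hypothetical function (ConvertTo-Record is not part of this post's parser) that uses the same begin/process/end layout and is fed first from memory and then from a temporary file.

```powershell
function ConvertTo-Record {
    # Hypothetical helper: turns blank-line separated 'name=value' lines into objects
    param (
        [Parameter(ValueFromPipeline)]
        [string[]] $Line
    )
    begin { $current = [ordered]@{} }
    process {
        switch -regex ($Line) {
            '^([^=]+)=(.*)$' { $current[$matches[1]] = $matches[2]; continue }
            '^\s*$' {
                # a blank line ends the record, if we have collected anything
                if ($current.Count) {
                    [pscustomobject] $current
                    $current = [ordered]@{}
                }
            }
        }
    }
    end {
        # output the last record, if any
        if ($current.Count) { [pscustomobject] $current }
    }
}

$raw = 'commit=1234', 'desc=Update readme.md', '', 'commit=1235', 'desc=Bug fix'

# the same parser, fed from two different sources
$fromMemory = $raw | ConvertTo-Record

$file = New-TemporaryFile
$raw | Set-Content -Path $file.FullName
$fromFile = Get-Content -Path $file.FullName | ConvertTo-Record
Remove-Item -Path $file.FullName

$fromMemory
$fromFile
```

The state lives in variables created in the begin block, the switch runs once per incoming line in the process block, and the end block flushes whatever is left, which mirrors how ConvertTo-ExecutableModule is organized.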
Regex pattern breakdown
Here is a breakdown of the more complicated patterns, using the ignore pattern whitespace modifier x:
^([0-9a-f`]{17})\s([0-9a-f`]{17})\s+(\S+)\s+\(([^\)]+)\)\s*(.+)?

# example input: 00007ffe`9ccb0000 00007ffe`9ce9d000 ntdll (pdb symbols) c:\symbols\ntdll.pdb\B8AD79538F2730FD9BACE36C9F9316A01\ntdll.pdb

(?x)              # ignore pattern whitespace
^                 # the beginning of the line
([0-9a-f`]{17})   # capture expression like 00007ff6`e9da0000 - any hex digit or backtick, and exactly 17 of them
\s                # a space
([0-9a-f`]{17})   # capture expression like 00007ff6`e9da0000 - any hex digit or backtick, and exactly 17 of them
\s+               # skip one or more spaces
(\S+)             # capture until we get a space - this would match the 'ntdll' part
\s+               # skip one or more spaces
\(                # start parenthesis
    ([^\)]+)      # capture anything but end parenthesis
\)                # end parenthesis
\s*               # skip zero or more spaces
(.+)?             # optionally capture any symbol file path
Breakdown of the name-value pattern:
^\s{4}(?<name>[^:]+):\s+(?<value>.+)

# example input:     File flags: 0 (Mask 3F)

(?x)              # ignore pattern whitespace
^                 # the beginning of the line
\s{4}             # require exactly four whitespace characters
(?<name>[^:]+)    # capture anything that is not a `:` into the named group "name"
:                 # require a colon
\s+               # require one or more spaces
(?<value>.+)      # capture everything until the end into the named group "value"
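To see the name-value pattern and the property name projection work together, here is a small sketch. The two input lines are copied from the debugger output above; the $target object and the one-entry map are stand-ins chosen for illustration, not the class used by the parser.

```powershell
# Stand-in for the data object; the real parser uses the ExecutableModule class
$target = [pscustomobject] @{ FileFlags = $null; CompanyName = $null }

# Only names that differ from the property names need an entry
$propertyNameMap = @{ 'File flags' = 'FileFlags' }

$lines = '    File flags: 0 (Mask 3F)', '    CompanyName: Microsoft Corporation'
foreach ($line in $lines) {
    if ($line -match '^\s{4}(?<name>[^:]+):\s+(?<value>.+)') {
        # indexing $matches with several keys returns the values in that order
        $name, $value = $matches['name', 'value']
        $propName = $propertyNameMap[$name]
        if ($null -eq $propName) { $propName = $name }
        $target.$propName = $value   # dynamic property name
    }
}

$target
```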
Breakdown of the timestamp pattern:
^\s+Timestamp:\s+[^\(]+\((?<timestamp>.{8})\)

# example input:     Timestamp: Mon Jan 7 23:42:30 2019 (5C33D5D6)

(?x)                  # ignore pattern whitespace
^                     # the beginning of the line
\s+                   # require one or more spaces
Timestamp:            # The literal text 'Timestamp:'
\s+                   # require one or more spaces
[^\(]+                # one or more of anything but an open parenthesis
\(                    # a literal '('
(?<timestamp>.{8})    # 8 characters of anything, captured into the group 'timestamp'
\)                    # a literal ')'
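The commented form is not only documentation: with (?x) at the start, such a pattern can be used as it is. A minimal sketch, assuming the pattern is kept in a here-string and applied to the example line from the breakdown:

```powershell
# The pattern, written out with (?x) so whitespace and # comments are ignored
$timestampPattern = @'
(?x)
^\s+                  # leading whitespace
Timestamp:\s+         # the literal text, followed by whitespace
[^\(]+                # the human readable date, up to the open parenthesis
\(
(?<timestamp>.{8})    # eight characters, captured into the group timestamp
\)
'@

$line = '    Timestamp:        Mon Jan  7 23:42:30 2019 (5C33D5D6)'

if ($line -match $timestampPattern) {
    # the captured value is the number of seconds since 1970-01-01 UTC, in hex
    $seconds = [Convert]::ToInt32($matches.timestamp, 16)
    $stamp = [DateTime]::new(1970, 1, 1, 0, 0, 0, [DateTimeKind]::Utc).AddSeconds($seconds)
    $stamp   # 2019-01-07 22:42:30 UTC
}
```

The debugger printed the local time (23:42:30), while the converted value is in UTC, hence the one hour difference.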
Gotchas – the Regex Cache
Something that can happen if you are writing a more complicated parser is the following:
The parser works well. You have 15 regular expressions in your switch statement and then you get some input you haven’t seen before, so you add a 16th regex.
All of a sudden, the performance of your parser tanks. WTF?
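The .NET regex implementation has a cache of recently used regexes. When the number of patterns in play exceeds the cache size, patterns are evicted from the cache and have to be constructed again the next time they are used. You can check the size of it like this: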
PS> [regex]::CacheSize
15

# bump it
[regex]::CacheSize = 20
And now your parser is fast(er) again.
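Instead of hard-coding a number, the cache size can be derived from the number of patterns the parser uses. This is only a sketch: the $patterns array is a hypothetical stand-in for the patterns of your own switch statement, and the headroom of 5 is an arbitrary choice.

```powershell
# Hypothetical list of 20 patterns, standing in for the cases of a switch statement
$patterns = 1..20 | ForEach-Object { "^field${_}:\s+(.+)" }

# Grow the cache if it cannot hold all patterns, leaving a little headroom
if ([regex]::CacheSize -lt $patterns.Count) {
    [regex]::CacheSize = $patterns.Count + 5
}

[regex]::CacheSize
```

The setting is static and applies to the whole process, so it only needs to be done once before the parser runs.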
Bonus tip
I frequently use PowerShell to write (generate) my code:
Get-ExecutableModuleRawData pwsh |
    Select-String '^\s+([^:]+):' |     # this pattern matches the module detail fields
    Foreach-Object { $_.matches.groups[1].value } |
    Select-Object -Unique |
    Foreach-Object -Begin   { "class ExecutableModuleData {" } `
                   -Process { "    [string] $" + ($_ -replace "\s.", { [char]::ToUpperInvariant($_.Groups[0].Value[1]) }) } `
                   -End     { "}" }
The output is
class ExecutableModuleData {
    [string] $LoadedSymbolImageFile
    [string] $ImagePath
    [string] $ImageName
    [string] $Timestamp
    [string] $CheckSum
    [string] $ImageSize
    [string] $FileVersion
    [string] $ProductVersion
    [string] $FileFlags
    [string] $FileOS
    [string] $FileType
    [string] $FileDate
    [string] $Translations
    [string] $CompanyName
    [string] $ProductName
    [string] $InternalName
    [string] $OriginalFilename
    [string] $ProductVersion
    [string] $FileVersion
    [string] $FileDescription
    [string] $LegalCopyright
    [string] $Comments
    [string] $LegalTrademarks
    [string] $PrivateBuild
}
It is not complete – I don’t have the fields from the record start, some types are incorrect and when run against some other executables a few other fields may appear.
But it is a very good starting point. And way more fun than typing it.
Note that this example is using a new feature of the -replace operator (using a ScriptBlock to determine what to replace with) that was added in PowerShell Core 6.1.
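On Windows PowerShell 5.1, where -replace does not accept a ScriptBlock, a similar effect can be sketched with the static [regex]::Replace method and a script block converted to a MatchEvaluator delegate. The variable names below are made up for the example.

```powershell
$pattern = '\s.'   # a whitespace character followed by any character

# PowerShell 6.1 and later only ($_ is the current Match object):
# 'File version' -replace $pattern, { [char]::ToUpperInvariant($_.Value[1]) }

# Also works on Windows PowerShell 5.1
$evaluator = [System.Text.RegularExpressions.MatchEvaluator] {
    param ($match)
    # ' v' becomes 'V': drop the whitespace and upper-case the letter
    $match.Value.Trim().ToUpperInvariant()
}
$propertyName = [regex]::Replace('File version', $pattern, $evaluator)

$propertyName   # FileVersion
```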
Bonus tip #2
A regular expression construct that I often find useful is non-greedy matching.
The example below shows the effect of the ? modifier, which can be used after * (zero or more) and + (one or more).
# greedy matching - match to the last occurrence of the following character (>)
if ("<Tag>Text</Tag>" -match '<(.+)>') { $matches }

Name                           Value
----                           -----
1                              Tag>Text</Tag
0                              <Tag>Text</Tag>

# non-greedy matching - match to the first occurrence of the following character (>)
if ("<Tag>Text</Tag>" -match '<(.+?)>') { $matches }

Name                           Value
----                           -----
1                              Tag
0                              <Tag>
See Regex Repeat for more info on how to control pattern repetition.
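A related option, and the one the parser in this post uses for the parenthesized symbol status, is a negated character class: instead of making the quantifier lazy, the pattern simply excludes the terminating character. A short sketch comparing the two on the same input:

```powershell
$text = '<Tag>Text</Tag>'

# lazy quantifier: stop at the first '>'
$nonGreedy = if ($text -match '<(.+?)>') { $matches[1] }

# negated character class: match anything that is not '>'
$negated = if ($text -match '<([^>]+)>') { $matches[1] }

$nonGreedy   # Tag
$negated     # Tag
```

Both give the same capture here; which one reads better is mostly a matter of taste.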
Summary
In this post, we have looked at how the structure of a switch-based parser could look, and how it can be written so that it works as a part of the pipeline.
We have also looked at a few slightly more complicated regular expressions in some detail.
As we have seen, PowerShell has a plethora of options for parsing text, and most of them revolve around regular expressions.
My personal experience has been that the time I’ve invested in understanding the regex language was well invested.
Hopefully, this gives you a good start with the parsing tasks you have at hand.
Thanks to Jason Shirk, Mathias Jessen and Steve Lee for reviews and feedback.
Staffan Gustafsson, @StaffanGson, github
Staffan works at DICE in Stockholm, Sweden, as a Software Engineer and has been using PowerShell since the first public beta.
He was most seriously pleased when PowerShell was open sourced, and has since contributed bug fixes, new features and performance improvements.
Staffan is a speaker at PSConfEU and is always happy to talk PowerShell.
The post Parsing Text with PowerShell (3/3) appeared first on Powershell.