All Entries Tagged With: "multiple"
Joining Up
From: COMMAND LINE KUNG FU: PaulDotCom, Ed Skoudis, Hal Pomeranz, byte_bucket
Hal fields a question from IRC
Mr. Bucket passed along the following query from the PaulDotCom IRC channel:
What functionality is available to loop through multiple files, and write the output to a single file with some values on the same line? Ex: If one program gives me the hash of a file, and the other program outputs the name/size/etc of a file, can I output to the same file HASH-FileName-Size
I couldn’t resist chortling with glee when this question came up, because it’s another one of those “easy for Unix, hard for Windows” kinds of tasks. I never can resist sharing these “learning experiences” with my fellow co-authors.
First let’s review our inputs. I’m going to use the openssl utility for generating checksums, since it’s fairly generic to lots of different flavors of Unix at this point:
$ openssl sha1 *
SHA1(001.jpg)= a088531884ee5eb520e98b3e9e18283f29e13d25
SHA1(002.jpg)= 77febb1498b2926ee6a988c97f3457e38736456d
SHA1(003.jpg)= 922bcb001d025d747c2ee56328811a4270b62079
...
As you can see, it’s pretty easy to generate a set of checksums over my directory of image files, but there’s a bunch of cruft around the filename that’s not really helpful. So let me get rid of that with some quick sed action:
$ openssl sha1 * | sed -r 's/SHA1\((.*)\)= (.*)/\1 \2/'
001.jpg a088531884ee5eb520e98b3e9e18283f29e13d25
002.jpg 77febb1498b2926ee6a988c97f3457e38736456d
003.jpg 922bcb001d025d747c2ee56328811a4270b62079
...
That’s better! In the sed expression I’m using the “(.*)” sub-expressions to match the file name and the checksum in each line, and the substitution operator is replacing the original line with just the values of the sub-expressions. Slick.
Now that we’ve got the checksums, how do we produce the file sizes? I could just use “ls -l” of course. But since the questioner seems to only want “HASH-FileName-Size”, I may as well just use “wc -c” to produce simpler output:
$ wc -c *
4227504 001.jpg
4600982 002.jpg
4271719 003.jpg
...
Now that I know what my inputs are going to be, the question is how to stitch them together? Luckily, Unix includes the join command for putting files together on arbitrary fields (we last saw the join command back in Episode #43). Now I could save the checksum output and the file sizes to separate files and then join the contents of the two files, but bash actually gives us a cooler way to handle this:
$ join -1 1 -2 2 <(openssl sha1 * | sed -r 's/SHA1\((.*)\)= (.*)/\1 \2/') <(wc -c *)
001.jpg a088531884ee5eb520e98b3e9e18283f29e13d25 4227504
002.jpg 77febb1498b2926ee6a988c97f3457e38736456d 4600982
003.jpg 922bcb001d025d747c2ee56328811a4270b62079 4271719
...
See the “<(…)” syntax? That’s a little bit of bash file descriptor magic that allows us to substitute the output of a command in a place where a program would normally be looking for a file name. In this case it saves us the hassle of having to create intermediate output files to join together. The join command itself is pretty simple. We’re telling the program to join the output of the two commands using the file names in the first field of input #1 and the second field of input #2. The only problem is that the join command isn’t producing the “HASH-FileName-Size” output that the original questioner wanted. That’s because join always outputs the joined field first, followed by the remaining fields from the first input (the checksum in this case), followed by the remaining fields from the second input (the file size). We’ll have to use a little awk fu to re-order the fields:
$ join -1 1 -2 2 <(openssl sha1 * | sed -r 's/SHA1\((.*)\)= (.*)/\1 \2/') <(wc -c *) \ | awk '{print $2 " " $1 " " $3}' a088531884ee5eb520e98b3e9e18283f29e13d25 001.jpg 4227504 77febb1498b2926ee6a988c97f3457e38736456d 002.jpg 4600982 922bcb001d025d747c2ee56328811a4270b62079 003.jpg 4271719
… Mmmm, that’s a tasty little bit of shell magic, isn’t it? Let’s see what Ed and Tim are cooking up.
Ed retorts snidely:
Choosing a topic just because you think it’s hard for us Windows guys, huh, Hal? Well, aren’t you just a big ball of sunshine, a command-line Scrooge this holiday season? When I first read this one, I though… “Ugh… this is gonna be hard.” Perhaps I was psyched out by your juvenile trash talk. Or, maybe I’ve just been hanging around in cmd.exe too long, and have gotten used to hard problems.
But, this one turned out to be surprisingly straight-forward and even non-ugly (well, beauty is in the eye of the beholder, I suppose). Here’s the skinny:
C:\> FOR /f "tokens=1-2" %a in (name-hash.txt) do @for /f "tokens=1,2" %m
in (length-name.txt) do @if %a==%n echo %b %a %m
a088531884ee5eb520e98b3e9e18283f29e13d25 001.jpg 4227504
77febb1498b2926ee6a988c97f3457e38736456d 002.jpg 4600982
922bcb001d025d747c2ee56328811a4270b62079 003.jpg 4271719
I’m assuming that name-hash.txt contains, well, names and hashes, one pair per line. Likewise, length-name.txt contains lengths and names, again one pair per line.
As we know, FOR /F loops can parse through all kinds of crap, including the contents of files. I use a FOR /F loop with two tokens (giving me two variables) of %a (for the file name) and %b (allocated automagically, holding the hash). For each of the files described in name-hash.txt, I then construct the body of my FOR loop. It contains another FOR /F loop, again with two variables (the original question mentioned “etc” for extra stuff there… if you have more stuff, just up the number of tokens and echo the proper variables at the end). My inner FOR /F loop iterates through the length-name.txt file, placing its values in the variables %m (length) and %n (name).
Now, if I just echoed out %a %b %m %n, I’d be making all of the possible combinations of every pair of two lines in the original files. But, we want to pare that down. We only want to generate some output if the name from name-hash.txt (%a) matches the name from length-name.txt (%n). We do this with a little IF operation comparing the two variables. If they match, we then echo out hash (%b), name (%n), and size (%m).
Admittedly, the performance of this little command isn’t great, as I have to run through every line of name-hash.txt, comparing the name by running through the entirety of length-name.txt. I don’t stop when I’ve found a match, because, well, there could be another match somewhere. Also, if there is no match of the name between the two files, my command ignores that name, not issuing any output. But, I think that makes sense given what the questioner asks.
So, Tim… does PowerShell have a nifty little built-in or something to make this easier than running through a couple of FOR loops? Inquiring minds what to know.
Tim tags in for Ed:
For loops! We don’t need no stinking For loops!
The first thing to do is import the files. Since there is a space between the columes we can use Import-CSV with a delimiter of the space character. Also, there is no header information so we have to specify it.
PS C:\> Import-Csv length.txt,hash.txt -Delimiter " " -Header File,Data
File Data
---- ----
001.jpg 4227504
002.jpg 4600982
003.jpg 4271719
001.jpg a088531884ee5eb520e98b3e9e18283f29e13d25
002.jpg 77febb1498b2926ee6a988c97f3457e38736456d
003.jpg 922bcb001d025d747c2ee56328811a4270b62079
...
We have all the data, so now it can be grouped by the file name using Group-Object (alias group).
PS C:\> Import-Csv length.txt,hash.txt -Delimiter " " -Header File,Data | group file
Count Name Group
----- ---- -----
2 001.jpg {@{File=001.jpg; Data=4227504}, @{File=001.jpg; Data=a088531884ee5eb520e98b3e9e18283f29e13d25}}
2 002.jpg {@{File=002.jpg; Data=4600982}, @{File=002.jpg; Data=77febb1498b2926ee6a988c97f3457e38736456d}}
2 003.jpg {@{File=003.jpg; Data=4271719}, @{File=003.jpg; Data=922bcb001d025d747c2ee56328811a4270b62079}}
...
We have the data grouped like we want, but we still need to massage it a bit so we can get the formate we want.
PS C:\> Import-Csv length.txt,hash.txt -Delimiter " " -Header File,Data |
group file | Select @{Name="Hash";Expression={$_.Group[1].Data}}, Name,
@{Name="Length";Expression={$_.Group[0].Data}}
Hash Name Length
---- ---- ------
a088531884ee5eb520e98b3e9e18283f29e13d25 001.jpg 4227504
77febb1498b2926ee6a988c97f3457e38736456d 002.jpg 4600982
922bcb001d025d747c2ee56328811a4270b62079 003.jpg 4271719
...
The Select-Object (alias select) cmdlet allows for custom expressions which was used to get the hash and the length. The “Group” object contains multiple items and each can be access by its index value, 0 is the length and 1 is the hash.
Fileless PowerShell
The initial task was to get the file name, length, and hash from separate files and combine them in to one. Let’s try this again without using files.
This would be very easy if powershell just had a hashing cmdlet, but it doesn’t. However, we can do hashing by using the .NET library and some very ugly PowerShell. Maybe in v3 we will get a Get-Hash cmdlet, but it seems as likely as the addition of Get-Unicorn or Get-MillionDollars.
So we need some hash, but not the kind that is illegal in 49 states, we need the hash of a file. Here is how we get it.
PS C:\> PS C:\> gci 001.jpg | % { (New-Object System.Security.Cryptography
.SHA1CryptoServiceProvider).ComputeHash($_.OpenRead()) }
We use the SHA1CryptoServiceProvider .NET class, but it adds another bump since it doesn’t take files as input and will only take a stream. It isn’t hard to get the stream though, all we need to use is the OpenRead method of our file object. If that wasn’t enough, there is another problem, the output.
PS C:\> PS C:\> gci 001.jpg | % { (New-Object System.Security.Cryptography
.SHA1CryptoServiceProvider).ComputeHash($_.OpenRead()) }
160
136
83
24
...
The result is an array of bytes. So we have to convert that to hex and combine it together.
PS C:\> gci 001.jpg | % {$hash=""; (New-Object System.Security.Cryptography
.SHA1CryptoServiceProvider).ComputeHash($_.OpenRead()) | % { $hash += $_.ToString("X2") }; $hash}
a088531884ee5eb520e98b3e9e18283f29e13d25
We use the ToString method with the format string X2 to convert each byte to hex. The X converts it to hex, and the 2 will make sure the output is two characters wide (0A vs A). We then use the variable $hash to stitch our bytes together to get the full hash.
Now let’s see the full command.
PS C:\> gci *.* | select @{Name="Hash";Expression={$hash=""; (New-Object
System.Security.Cryptography.SHA1CryptoServiceProvider).ComputeHash($_.OpenRead()) |
% { $hash += $_.ToString("X2") }; $hash}}, name, length
Hash Name Length
---- ---- ------
a088531884ee5eb520e98b3e9e18283f29e13d25 001.jpg 4227504
77febb1498b2926ee6a988c97f3457e38736456d 002.jpg 4600982
922bcb001d025d747c2ee56328811a4270b62079 003.jpg 4271719
...
The first thing we do is get all the files in the currect directory using Get-ChildItem (aliased as gci or dir). That is piped in to Select-Object (aliased as select) to get the hash, filename, and size. The Select-Object cmdlet allows us to get properties of the pipeline object as well as creating a custom expression. In our case we will use the custom expression to calculate the hash.
Our results are in object form and can be piped to a file with Out-File or Out-Csv.
So the task is complete, but let’s pretend for a second we had the fictional Get-Hash cmdlet. If we had our leprachaun our command might look something like this:
PS C:\> gci *.* | select @{Name="Hash";Expression={Get-Hash $_ sha1}, name, length
If only getting hash was easier in Windows.
Download and unzip a file to multiple directories.
From Nukeit.org
This script was written because some plugins used on more than one blog get updated every day, yet don’t have a public source control. To use, create a file called dirfile.txt in the same directory as you save the script. Add the full paths to each of your target directories, and save. Run the script and paste in the url to the zip you want to download and extract. That’s it!
Download & unzip a file to multiple directories. (62)Replacing Strings in Multiple Files
From: COMMAND LINE KUNG FU: PaulDotCom, Ed Skoudis, Hal Pomeranz, byte_bucket
Hal Starts Off:
Wow, our last several Episodes have been really long! So I thought I’d give everybody a break and just show you a cool little sed idiom that I use all the time:
# sed -i.bak 's/foo/bar/g' *
Here we’re telling sed to replace the all instances string “foo” with the string “bar” in all files in the current directory. The useful trick is the “-i.bak” option which causes sed to make an automatic backup copy of each file as .bak before doing the global search and replace.
By the way, you can even do this across an entire directory structure, with a little help from the find and xargs commands:
# find . -type f | xargs sed -i.bak 's/foo/bar/g'
Of course, you could use other search criteria than just “-type f” if you wanted to be more selective about which files you ran sed against.
Oh dear, I hope this isn’t one of those “easy for Unix, hard for Windows” things again. Ed gets so grumpy when I do that.
Ed jumps in:
You nailed it, Hal, with that characterization. Unfortunately, cmd.exe doesn’t include the ability to do find and replace of strings within lines of a file using a built-in command. We can search for strings using the find command, and even process regex with findstr. But, the replacement part just doesn’t exist there.
Thus, most reasonable people will either rely on a separately installed tool to do this, or use Powershell.
For a separately installed tool, my first approach would be use Cygwin, the free Linux-like environment for Windows, and then just run the sed command Hal uses above. Nice, easy, and sensical.
Alternatively, you could download and install a tool called replace.exe.
Or, there’s another one called Find And Replace Text, which, as you might guess, is called FART for short.
To do this in Powershell efficiently, I asked Tim Medin, our go-to guy for Powershell, to comment.
Tim (our Powershell Go-To Guy) says:
This morning when Ed asked me to do a “quick” write up for Powershell, I thought to myself, “This won’t be too bad…” I was wrong.
By default there are aliases for many of the command in Powershell, so I’ll show both the long and short version of the commands (yes, even the short command is long relative to sed).
The Long Way
PS C:\> Get-ChildItem -exclude *.bak | Where-Object {$_.Attributes -ne "Directory"} |
ForEach-Object { Copy-Item $_ "$($_).bak"; (Get-Content $_) -replace
"foo","bar" | Set-Content -path $_ }
The Short Way (using built in aliases)
PS C:\> gci -ex *.bak | ? {$_.Attributes -ne "Directory"} | % { cp $_ "$($_).bak";
(gc $_) -replace "foo","bar" | sc -path $_ }
This command is rather long, so let’s go through it piece by piece.
gci -ex *.bak | ? {$_.Attributes -ne "Directory"}
The first portion gets all files that don’t end in .bak. Without this exclusion, it will process file1.txt and the new file1.txt.bak. Processing file1.txt.bak results in file1.txt.bak.bak, but it doesn’t do this endlessly, just twice.
The Where-Object (with an alias of ?) ensures that we only work with files and not directories because Get-Content on a directory throws an error.
ForEach-Object { Copy-Item $_ "$($_).bak"; (Get-Content $_) -replace "foo","bar" |
Set-Content -path $_ }
Once we get the files, not directories, we want, we then act on each file with the ForEach-Object (alias %). For those of you haven’t yet fallen asleep, I’ll further break down the inner portion of the ForEach-Object:
Copy-Item $_ "$($_).bak"
First, we copy the file to our backup .bak file. We have to use the $() in order to use our variable in a string so we can append .bak.
Finally, we get to the search and replace (and it’s about time, too!).
(Get-Content $_) -replace "foo","bar" | Set-Content -path $_
Get-Content (gc) gets the contents of the file. We wrap it in parentheses so we can act on its output in order to do our replace. The output is then piped to Set-Content (sc) and written back to our file.
We could make this work a little better if we used variables, but then we are more in script-land instead of shell-land which probably violates the almighty laws of this blog. The use of variables turn this more into a scripting exercise instead of shell (OK, we may already be there). For kicks, I’ll show you how we can use variables show you so you can add it to your big bloated belt of windows-fu.
$a = (gci | ? {$_.Attributes -ne "Directory"}); $a | % { cp $_ "$($_).bak";
(gc $_) -replace "foo","bar" | sc -path $_ }
The difference between our original command and this command is that the $a variable grabs a snapshot of the directory before we copy files, so we won’t operate on the new .bak files.
After all this work we have done the same thing as the mighty sed. Sadly even the power of Powershell is no match for efficiency of sed.
Ed closes it out:
Thanks for that, Tim. Nice stuff!





