开发者

bash/awk script extracts html metadata, need to remove whitespace and write to file

开发者 https://www.devze.com 2023-03-28 13:28 Source: web

I've got a directory of HTML files, courtesy of wget, and I need to extract the title tag and all metadata from each file, but separately, so I can copy and paste them into a spreadsheet (OK, if I were better at scripting this wouldn't be a requirement). My script has two problems: the extracted text contains lots of extra whitespace, and when I tried to write the output to a file, the file grew to 600 GB (no kidding, good thing I routed it to my external drive). I'm open to any solution native to *NIX. TIA for any help.

    #!/bin/bash
    for LINE in `cat htmllist.txt`
    do
       awk 'BEGIN{IGNORECASE=1;FS="<title>|</title>";RS=EOF} {print $2}' $LINE
    done


First off, you should get rid of all the lines that contain only whitespace. You can do this with awk like so:

    cat <file> | awk '{ if (NF > 0) printf("%s\n", $0); }'

In your case, you could just pipe the output of your last awk command into this one. You could also use awk to collapse runs of consecutive whitespace. Since whitespace is awk's default field separator, printing the fields one by one with a single space between them does this:

    cat <file> | awk '{
        for (i = 1; i <= NF; i++) {
            printf("%s ", $i);   # $i is the i-th field; a bare i would print the index
        }
        printf("\n");
    }'
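Putting the pieces together, here is a rough sketch of how the title extraction, the whitespace cleanup and the write to a file might be combined. It is not a drop-in fix: the function name `extract_title` and the output file `titles.tsv` are made up for illustration, and it assumes a plain `<title>` tag with no attributes and one file path per line in `htmllist.txt`. The redirect is placed once, after the loop, and the output file should not itself appear in the list.

```shell
#!/bin/bash

# Hypothetical helper: print "filename<TAB>title" for one HTML file.
# Assumes the tag is written as a plain <title> with no attributes.
extract_title() {
    awk '
        { doc = doc " " $0 }                  # join every line into one string
        END {
            lower = tolower(doc)              # lowercase copy for a case-insensitive search
            start = index(lower, "<title>")
            stop  = index(lower, "</title>")
            title = ""
            if (start > 0 && stop > start)
                title = substr(doc, start + 7, stop - start - 7)
            gsub(/[ \t\r\n]+/, " ", title)    # collapse each whitespace run to one space
            gsub(/^ | $/, "", title)          # trim the leading and trailing space
            printf "%s\t%s\n", FILENAME, title
        }' "$1"
}

# Read the list line by line and redirect once, outside the loop.
# titles.tsv is an assumed output name; keep it out of htmllist.txt.
if [ -f htmllist.txt ]; then
    while IFS= read -r file; do
        extract_title "$file"
    done < htmllist.txt > titles.tsv
fi
```

The tab between the file name and the title means the result can be pasted straight into a spreadsheet as two columns. Using `tolower` and `index` instead of `IGNORECASE` keeps the sketch independent of gawk, since `IGNORECASE` is a gawk extension.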