Using AWK, how do I remove these kinds of duplicates?


I am new to AWK and have only a basic understanding of it. I want to remove duplicates in a file, for example:

    0008.ASIA. NS AS2.DNS.ASIA.CN.
    0008.ASIA. NS AS2.DNS.ASIA.CN.
    ns1.0008.asia. NS AS2.DNS.ASIA.CN.
    www.0008.asia. NS AS2.DNS.ASIA.CN.
    anish.asia NS AS2.DNS.ASIA.CN.
    ns2.anish.asia NS AS2.DNS.ASIA.CN
    ANISH.asia. NS AS2.DNS.ASIA.CN.

This is a sample file. Running this command on it, I got output like this:

awk 'BEGIN{IGNORECASE=1}/^[^ ]+asia/ { gsub(/\.$/,"",$1);split($1,a,".")} length(a)==2{b[$1]++;}END{for (x in b)print x}'

0008.ASIA.
anish.asia.
ANISH.asia

But I want output like this:

  0008.ASIA
  anish.asia

or

0008.ASIA
ANISH.asia

How do I remove these kinds of duplicates?

Thanks in advance, Anish kumar.V
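A note on the attempt above: in gawk, IGNORECASE only affects pattern matching, not array subscripts, so 0008.ASIA and anish.asia/ANISH.asia still end up as distinct keys in b. A minimal sketch of a fix is to key the array on the lowercased, dot-stripped first field; here "file" is just a placeholder for the input, and the names come out lowercased:

awk 'BEGIN{IGNORECASE=1} /^[^ ]+asia/ { key=tolower($1); sub(/\.$/,"",key); if (split(key,a,".")==2) b[key]++ } END{for (x in b) print x}' file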

Thanks for your immediate response. Actually, I wrote a complete script in bash and I am now in the final stage. How do I invoke Python in that? :-(

#!/bin/bash

current_date=`date +%d-%m-%Y_%H.%M.%S`
today=`date +%d%m%Y`
yesterday=`date -d 'yesterday' '+%d%m%Y'`
RootPath=/var/domaincount/asia/
MainPath=$RootPath${today}asia
LOG=/var/tmp/log/asia/asiacount$current_date.log

mkdir -p $MainPath
echo Intelliscan Process started for Asia TLD $current_date 

exec 6>&1 >> $LOG

#################################################################################################
## Using Wget Downloading the Zone files it will try only one time
if ! wget --tries=1 --ftp-user=USERNAME --ftp-password=PASSWORD ftp://ftp.anish.com:21/zonefile/anish.zone.gz
then
    echo Download Not Success Domain count Failed With Error
    exit 1
fi
###The downloaded file in Gunzip format from that we need to unzip and start the domain count process####
gunzip -c asia.zone.gz > $MainPath/$today.asia

###### It will start the Count #####
awk '/^[^ ]+ASIA/ && !_[$1]++{print $1; tot++}END{print "Total",tot,"Domains"}' $MainPath/$today.asia > $RootPath/zonefile/$today.asia
awk '/Total/ {print $2}' $RootPath/zonefile/$today.asia > $RootPath/$today.count

a=$(< $RootPath/$today.count)
b=$(< $RootPath/$yesterday.count)
c=$(awk 'NR==FNR{a[$0];next} $0 in a{tot++}END{print tot}' $RootPath/zonefile/$today.asia $RootPath/zonefile/$yesterday.asia)

echo "$current_date Count For Asia TlD $a"
echo "$current_date Overall Count For Asia TlD $c"
echo "$current_date New Registration Domain Counts $((c - a))"
echo "$current_date Deleted Domain Counts $((c - b))"

exec >&6 6>&-
cat $LOG | mail -s "Asia Tld Count log" 07anis@gmail.com

In that script,

 awk '/^[^ ]+ASIA/ && !_[$1]++{print $1; tot++}END{print "Total",tot,"Domains"}' $MainPath/$today.asia > $RootPath/zonefile/$today.asia

it is only in this part that I am now trying to work out how to get the distinct values, so any suggestions using AWK would be better for me. Thanks again for your immediate response.
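One possible sketch (not the only option) of folding case-insensitive de-duplication into that line, keeping only second-level .asia names and the Total line the rest of the script expects; the paths are the ones already used in the script, and the printed names come out lowercased:

awk '{
    key = tolower($1); sub(/\.$/, "", key)                      # case-fold and drop any trailing dot
    if (split(key, parts, ".") == 2 && parts[2] == "asia" && !seen[key]++) {
        print key; tot++                                         # first occurrence of this domain
    }
}
END { print "Total", tot, "Domains" }' $MainPath/$today.asia > $RootPath/zonefile/$today.asia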


kent$  cat a
0008.ASIA. NS AS2.DNS.ASIA.CN.
0008.ASIA. NS AS2.DNS.ASIA.CN.
ns1.0008.asia. NS AS2.DNS.ASIA.CN.
www.0008.asia. NS AS2.DNS.ASIA.CN.
anish.asia NS AS2.DNS.ASIA.CN.
ns2.anish.asia NS AS2.DNS.ASIA.CN
ANISH.asia. NS AS2.DNS.ASIA.CN.


kent$  awk -F' NS' '{ gsub(/\.$/,"",$1);split($1,a,".")} length(a)==2{b[tolower($1)]++;}END{for (x in b)print x}' a
anish.asia
0008.asia

BTW, it is interesting that I gave you a solution at http://www.unix.com/shell-programming-scripting/167512-using-awk-how-its-possible.html, then you added something new to your file, and I added the tolower() function here. :D


By putting your AWK script into a separate file, you can tell what's really going on. Here's a simple approach to your "filter out the duplicates" problem:

# For each line in the file
{

  # Decide on a unique key (eg. case insensitive without trailing period)
  unique_key = tolower($1)
  sub(/\.$/, "", unique_key)

  # If this line isn't a duplicate (it hasn't been found yet)
  if (!(unique_key in already_found)) {

    # Mark this unique key as found
    already_found[unique_key] = "found"

    # Print out the relevant data
    print($1)
  }
}

You can run AWK files by passing the -f option to awk.
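For example, if the script above were saved as dedupe.awk and the zone data were in zonefile.txt (both names are just placeholders), it could be run as:

awk -f dedupe.awk zonefile.txt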

If the above script isn't recognizable as an AWK script, here it is in inline form:

awk '{ key = tolower($1); sub(/\.$/, "", key); if (!(key in found)) { found[key] = 1; print($1) } }'


Or, just use the shell:

echo '    0008.ASIA. NS AS2.DNS.ASIA.CN.
    0008.ASIA. NS AS2.DNS.ASIA.CN.
    ns1.0008.asia. NS AS2.DNS.ASIA.CN.
    www.0008.asia. NS AS2.DNS.ASIA.CN.
    anish.asia NS AS2.DNS.ASIA.CN.
    ns2.anish.asia NS AS2.DNS.ASIA.CN
    ANISH.asia. NS AS2.DNS.ASIA.CN.' |
while read domain rest; do
    domain=${domain%.}
    case "$domain" in
        (*.*.*) : ;;
        (*.[aA][sS][iI][aA]) echo "$domain" ;;
    esac
done |
sort -fu

produces

0008.ASIA
anish.asia


Don't use AWK. Use Python

import sys

# Collect the unique, lower-cased first fields that mention "asia".
result = set()
for line in sys.stdin:
    words = line.split()
    if words and "asia" in words[0].lower():
        result.add(words[0].lower())
for name in result:
    print(name)

That might be easier to work with than AWK. Yes. It's longer. But it may be easier to understand.
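To tie this back to the bash script above, one sketch of invoking it is to save the snippet as, say, count_asia.py (a made-up name) and feed the unzipped zone file to it on stdin:

python count_asia.py < $MainPath/$today.asia > $RootPath/zonefile/$today.asia

Note that this only replaces the de-duplication step; the later awk '/Total/ {print $2}' line would need adjusting, since the Python version does not print a Total line.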


Here is an alternative solution: let sort create your case-folded and unique list (and it will be sorted!).

  {
   cat - <<EOS
   0008.ASIA. NS AS2.DNS.ASIA.CN.
   0008.ASIA. NS AS2.DNS.ASIA.CN.
   ns1.0008.asia. NS AS2.DNS.ASIA.CN.
   www.0008.asia. NS AS2.DNS.ASIA.CN.
   anish.asia NS AS2.DNS.ASIA.CN.
   ns2.anish.asia NS AS2.DNS.ASIA.CN
   ANISH.asia. NS AS2.DNS.ASIA.CN.

EOS
 } |   awk '{
      #dbg print "$0=" $0
      if (NF == 0) next   # skip the blank line in the here-doc so it does not reach the output
      targ=$1
      sub(/\.$/, "", targ)
      n=split(targ,tmpArr,".")
      #dbg print "n="n
      if (n > 2) targ=tmpArr[n-1] "." tmpArr[n]
      print targ 
     }' \
 | sort -f -u

output

0008.ASIA
anish.asia

Edit: fixed sort -i -u to sort -f -u. Many other Unix utilities use '-i' to indicate 'ignore case'. My test showed me I needed to fix it, but I forgot to fix it in the final posting.
