生命物理科学研究室・高須昌子のHP

List of class taught by takasu thermodynamics perl code to analyze engine access by takasu

検索エンジンのアクセスを解析する
perlスクリプト: engine.pl

            by 高須    97.8.8    → ホームページ

   Perlのページ  コンピュータのページ

１。機能

検索エンジンからホームページを参照している場合、
「どのような検索語を使っているか？」
「よく使われる検索エンジンは何か？」の
統計をとります。

・検索エンジンリストは、URLをつけてhtml化してます。

・日本語の検索語の場合、
検索した人の環境により、JIS, SJIS, EUCの様々なコードで
入力されてますが、jcode.plを使ってなるべく同じコードに変換してから、
統計をとってます。

但し、SJISの半角カナは判別が一意的でないので、
字化けしてます。
EUCの半角カナと判断できるものは、とりあえず全角に変換してますが、
副作用がひょっとしたらあるかもしれません。

２。準備

(1)referer_logの出力方法を調べる。

私の使っているサーバーでは、例えば次のように出力されます。

http://www.goo.ne.jp/?MT=%A5%E2%A5%D0%A5%A4%A5%EB%A5%AE%A5%A2&SM=
MC&WTS=ntt&DE=2&DC=100&AT0=words&AM0=MC&AW0=&AT1=words&AM1=MC
&AW1=&DV=6&DU=months&DY=96&DM=1&DD=1&DR=newer&date=WH&FS=&OPs=
MDT&RG=JP&_v=2&act.search.x=43&act.search.
y=12 ->/~takasu/link/mo.html == [12/Jul/1997:06:30:59 +0900]

この場合、-> で切れることがわかります。
他の出力方法の場合は、それに合わせて、
プログラムを変えます。

(2)referer_logから、どんな検索エンジンが来ているか、把握する。

私のホームページに来ている、７種類のエンジンのみ
統計をとってます。

他のエンジンが来ている場合は、プログラムに
つけ加えて下さい。

変更する場所は、サブリーチン&initialの@grep、@start, @addの３か所です。

@grepは、検索エンジンを区別するのに使うアドレス、
@startは検索語の始まりを示す記号、@addはその字数です。
検索語の始まりの記号は、エンジンによって違うので、
指定する必要があります。

(3)referer_logの検索語がコード化されているかを見る。

検索語が、

%A5%E2%A5%D0%A5%A4%A5%EB%A5%AE%A5%A2&SM

などの記号になっている場合、日本語に変換する必要があります。
検索語の日本語変換をする時は、
jcode.plを用意します。
jcode.plの入っているディレクトリを、
メインプログラム中のrequire文で指定します。
$NIHONGO_HENKAN = "yes"にします。

すでに、日本語になっている場合は、

    $NIHONGO_HENKAN = "no"

にします。こうしないと、字化けします。

(4)referer_log からいらない行を除いておく。

例えば、私の場合は、

 cat referer_log |grep July |grep takasu > ref5

などとして、いる所だけとり出しています。

３。実行

以下のソースを例えば、engine.pl というファイルにセーブし、

      chmod 755 engine.pl

によって、実行可能にする。
入力ファイルref5
を同じディレクトリに入れておく。

     engine.pl > out1

により、出力はファイルout1に出ます。

４。結果の例

５。スクリプトの問題点

・EUCの半角カナはほぼ変換できますが、
SJISの半角カナは変換できません。
これは、コードが重なっているためです。

６。スクリプト本体

コメント行の色分けは、perlスクリプトcolor.plを使いました。

#!/usr/bin/perl
#   engine.pl               by M. Takasu,  August 8, 1997                     
#
#   analyze referer_log and get the search words and engine statistics
#   also works when kanji words are in ascii. 
#                 e.g., %A5%E2%A5%D0%A5%A4%A5%EB%A5%AE%A5%A2&SM
####   user input   
$NIHONGO_HENKAN = "yes";    # yes if henkan is necessary. no otherwise
$NIHONGO_CODE   = "euc";  # jis, euc or sjis (small letter) for output code   

$filename = "ref5";

##### don't change below unless necessary

if($NIHONGO_HENKAN eq "yes"){
   require '/home/usr1/takasu/bin/jcode.pl';
}

$find = 0;
&initial(*engine_grep, *engine_start, *engine_add, *engine_end
         ,*engine_name, *engine_url);

open(inputfile, $filename) || die "can not open file \n";
while (<inputfile>) {

   ($where, $file) = split(/->/, $_);

    for ($i = 0 ; $i < @engine_grep  ; $i++){

       next if index($where, $engine_grep[$i]) <0  ;  # try next engine

       $checka = index($where, $engine_start[$i]);
       last if $checka < 0 ;   # engine came but no search word

               # found engine and search word
       $count_engine{$engine_grep[$i]}++ ;    # counter for engines

       $longword = substr($where, $checka + $engine_add[$i] , 200);
       ($word1, $tail) = split(/$engine_end[$i]/, $longword);

       $worda = &check_all($word1, $longword, $i);

       $count_word{$worda}++;     # counter for words
       last;
    }
}

&sort_engines(*count_engine, *engine_grep, *engine_name, *engine_url);
&sort_words(*count_word);


### subroutines

##############
# initialize arrays for engine
sub initial{
    local (*grep, *start, *add, *end, *name, *url) = @_;

    @grep   = ("goo.ne.jp", "infoseek",  "odin.cgi", "hole-in-one",
           "kuamp.kuamp.kyoto-u.ac.jp",
           "altavista.digital.com",  "jp.excite.com");

    @start = ("MT=", "qt=","key=", "query=","key=", "&q=","&s=");
                                    # primary start words

    @add   = (3, 3, 4, 6, 4, 3, 3);     # number of letters in @start

    @end  = ("&[_asSW]", "&[cqs]", "&f", "&", "&am=", "&", "&c=");


    @name = ("goo",  "infoseek", "odin", "hole-in-one", 
         "rcaau", "altavista", "excite.com");
                  # name to put on a list         
    @url = ("www.goo.ne.jp",    "japan.infoseek.com",
       "kichijiro.c.u-tokyo.ac.jp/odin",   "hole-in-one.com",
        "www.kuamp.kyoto-u.ac.jp/labs/infocom/mondou",
        "www.altavista.digital.com", "jp.excite.com");



    (*grep, *start, *add, *end, *name, *url) ;

}
#  check and convert the word to Japanese
#
sub check_all{
     local($word1, $longword, $engine) = @_;
     local(@words, $worda);     
           
     if ($NIHONGO_HENKAN eq "yes"){
       $word1 = &cleanup_ascii($word1);   # don't use this for kanji
     }

     $words[0]= $word1;

     if ($engine == 0){    # for goo
        @words = &check_goo($word1, $longword);  
     }

     @words = &check_multi(*words);  

     if ($NIHONGO_HENKAN eq "yes"){    # for ascii input with %
        foreach $item (@words){
           $item = &get_nihongo($item);
         }
     }

     foreach $item (@words){
        $item = &cleanup($item);
      }
           
     $worda = "";
     foreach $item (@words){
         $worda = "$worda $item";
     }
     $worda = &cleanup($worda);
}

########
#  clean up a word.  Don't use this for kanji
sub cleanup_ascii{
    local($word1) = @_;
     $word1 =~ s/\+/ /g;    # could be dangerous after converted to kanji
     $word1 =~ s/%26/ /g;   # get rid of +(SJIS hankaku). could be dangerous   
     $word1 =~ s/\>/ /g;   
     $word1 =~ s/\&/ /g;   
     $word1;
}

### convert code to nihongo
## take care of euc hankaku
## sjis hankaku are not converted 
sub get_nihongo{

  local($word) =@_;
  local($check1) = &check_euc_hankaku($word);   # 

  $word =~ s/%(..)/pack("c", hex($1))/ge;   
  if( $check1 eq "euc_hankaku" ){   #EUC half width katakana
          &jcode'h2z_euc(*word);  
          &jcode'convert(*word, $NIHONGO_CODE);
  }
  else{
        &jcode'convert(*word, $NIHONGO_CODE);  
  }
  $word;    

}
# check EUC hankaku
#     codes should be %8E etc
sub check_euc_hankaku{
   local($w1) =@_;
   local($result) = "";   # default value
   local($out) = 0;
   local($find) = 0;
   local($sub1);

    for ($i = 0;  $out == 0 ; $i = $i+6){
        $sub1 = substr($w1, $i, 3);

      if( $sub1 eq ''){   # end of code. go out
          $out = 2;                           
      }
      elsif ( $sub1 ne '%8E' ){  
          $out = 1;                           
      }    
      else{
         $find = 1;    # find %8E
      }
    }   
    if ($find ==1 && $out == 2){  # at least one EUC hankaku then out
        $result = "euc_hankaku";
    }
    $result;
}

#### multiple words devided by spaces
#  
sub check_multi{
   local(*input) = @_;
   local($num, @words, @temp);

   $num = 0;

   foreach $item (@input){

     if($item =~ /\s+/){
         @words = split(/\s+/, $item);
         foreach $word (@words){
            $temp[$num]  = &cleanup($word);
            $num++ ;

         }
     }
     else{

        $temp[$num] = $item;
        $num ++;
     }
   }

   @temp;
}

###   for goo: check 2nd and 3rd words 
#
sub check_goo{

    local($first, $input) = @_;
    local(@goo) = ("AW0=", "AW1=", "AW2=", "AW3=", "AW4=");
    local($length) = 4;
    local($index, $next, $tail);
    local(@result) = ($first) ;    # initialize

# start

     foreach $item (@goo){
        $index = index($input, $item);

        last if ($index < 0);    # end of AW's

        $next =  substr($input, $index + $length, 200);
        ($next, $tail) = split(/&/,$next);     # word should end with &
        $next = &cleanup($next);
        @result = (@result, $next);
     }
    @result;
}

####    clean up a word
sub cleanup{
   local($w1) = @_;
   $w1 =~ s/^\"//;   # omit " at the beginning
   $w1 =~ s/^\+//;   # omit + at the beginning
   $w1 =~ s/^\s+//;  # omit space at the beginning
   $w1 =~ s/\+$//;   # omit +     at the end
   $w1 =~ s/\s+$//;  # omit space at the end
   $w1;
}

######  sort engines
sub sort_engines{

  local(*count_engine, *grep, *name, *url) =@_;
  local(%url_table, %name_table);
  local ($html1, $html2, $html3, $nword);

  $html1 = "<a href=\"http://";
  $html2   = "\">";
  $html3   = "</a>";

#---- get table

  for ($i = 0; $i < @grep ; $i++){  #make table
     $url_table{$grep[$i]} = $url[$i];
     $name_table{$grep[$i]} = $name[$i];
  }
#---- start

  @sorted = sort by_counter_engine keys %count_engine;

  $nword = 0;
  foreach $item (@sorted){
     printf "%6d    ",  $count_engine{$item};
     print "  ", $html1, $url_table{$item}, $html2, 
                 $name_table{$item}, $html3, "\n";
     $nword = $nword + $count_engine{$item};
  }
   printf "%6d    total  \n\n",  $nword;

}
########  sort words
sub sort_words{
   local(*count_word) = @_;
   local($iaccum);
   @sorted = sort by_counter_word keys %count_word;

   foreach $w1 (@sorted){
     print "          ", $count_word{$w1}, "  ", $w1, " \n";
   }

   $iaccum = scalar(keys(%co));   # number of different words
   print "   total words,  different words = ", $nword, "  ", $iaccum, "  \n";
}
############# sort definitions
##
sub by_counter_engine {
   ($count_engine{$b} <=> $count_engine{$a}) 
}
##
sub by_counter_word{
    ($count_word{$b} <=> $count_word{$a}) || ($a cmp $b); 
}

   Perlのページ  コンピュータのページ

検索エンジンのアクセスを解析する perlスクリプト: engine.pl