Citation Statistics for Ingram Olkin

B. Narasimhan
Department of Statistics
Stanford University
Stanford, CA 94305
1997/02/21

Introduction

Ingram Olkin wanted the following statistics for the period 1985-1994.

  1. The number of single author papers,
  2. The number of multiple author papers with names in alphabetic order, and
  3. The number of multiple author papers with names not in alphabetic order.

We present a quick-and-dirty Perl hack to compute them.

The Perl Program

The program is structured as follows.

<*>=
#! /usr/local/bin/perl
<Copyright>
<Local variables>
<Process files>
<Print results>

Copyright

<Copyright>= (<-U)
#
# $Revision: 1.1 $   
#
# Copyright (C) 1997, B. Narasimhan (naras@stat.Stanford.EDU)
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program; if not, write to the Free Software
# Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
#
#

Local Variables

We need to know where the CIS (Current Index to Statistics) files are located. Let $cis_directory denote the directory and $cis_ext the extension of the file names.

<Local variables>= (<-U) [D->]
$cis_directory = '/usr/local/lib/cis/cis95/files';
$cis_ext = '.v95';
Defines $cis_directory, $cis_ext.

The years in question.

<Local variables>+= (<-U) [<-D->]
@years = (85..94);
Defines @years.

Some variables for counts, which we initialize to zero.

  $sj   Number of single author journal articles.
  $sp   Number of single author proceedings articles.
  $mj   Number of multiple author journal articles.
  $mp   Number of multiple author proceedings articles.
  $mja  Number of multiple author journal articles with author names in alphabetic order.
  $mpa  Number of multiple author proceedings articles with author names in alphabetic order.

<Local variables>+= (<-U) [<-D]
$sj = 0; # No of single author journal articles.
$sp = 0; # No of single author proceedings articles.
$mj = 0; # No of multiple author journal articles.
$mja = 0; # No of multiple author alphabetic journal articles.
$mp = 0;  # No of multiple author proceedings articles.
$mpa = 0;  # No of multiple author alphabetic proceedings articles.
Defines $mj, $mja, $mp, $mpa, $sj, $sp.

Processing the Files

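For each year, we check that the corresponding file exists and is a text file, open it, and process it one line (one record) at a time. A missing file is skipped silently; a file that cannot be opened is reported on STDERR and skipped.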

<Process files>= (<-U)
YEAR:
  foreach $year (@years) {
    $filename = $cis_directory . '/' . 'cis' . $year . $cis_ext;
    next YEAR unless -T $filename;
    if (!open(FH, $filename)) {
      print STDERR "Can't open $filename---continuing...\n";
      next YEAR;
    }
    while (<FH>) {
      <Split fields>
      <Skip irrelevant stuff>
      <Determine no of authors>
      <Update counts>
    }
  }
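
The file name construction in the loop above can be tried on its own. The following is only a sketch: the helper name cis_filename is hypothetical and does not appear in the program, and the directory and extension are simply the values assigned to $cis_directory and $cis_ext earlier.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the original program): build the path of
# one yearly CIS file the same way the loop does, i.e. directory, a slash,
# the prefix 'cis', the two-digit year, and the extension.
sub cis_filename {
    my ($dir, $year, $ext) = @_;
    return $dir . '/' . 'cis' . $year . $ext;
}

# The directory and extension are the values of $cis_directory and $cis_ext.
my @files = map { cis_filename('/usr/local/lib/cis/cis95/files', $_, '.v95') }
            (85 .. 94);

print "$_\n" for @files;
```

Each resulting path is what the program hands to the -T test and to open; on a machine without the CIS files, every year would simply be skipped.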

Splitting the fields is trivial once we know the format of the records: the fields are separated by '#', and the piece before the first '#' is discarded as $null. Notice that we ignore the trailing 'garbage' in the records.

<Split fields>= (<-U)
($null,$field1,$title,$authors) = split('#');
Defines $authors, $field1, $null, $title.
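
To see what the split produces, here is a small sketch run on a made-up record. The record layout is an assumption inferred from the chunk above (a leading '#', a first field ending in a citation-type letter, a title, the authors, then further fields); it is not real CIS data.

```perl
use strict;
use warnings;

# Made-up record, NOT real CIS data. The layout is an assumption: a leading
# '#', a first field whose last character is the citation type, a title,
# the author field, and trailing fields that the program ignores.
my $record = "#0001J#A Made-Up Title#Author, A.;Writer, B.#extra#fields\n";

# Same split as in the program, but on an explicit string instead of $_.
# Only the first four pieces are kept; the rest is dropped.
my ($null, $field1, $title, $authors) = split('#', $record);

print "field1:  $field1\n";
print "title:   $title\n";
print "authors: $authors\n";
```

Because the record starts with the separator, the first piece is the empty string, which is why the program assigns it to a throwaway variable.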

Skipping irrelevant stuff: the last character of the first field indicates the type of citation, and we skip books, electronic publications and other administrative records.

<Skip irrelevant stuff>= (<-U) [D->]
$cite_ind = substr($field1,-1,1);
# Skip books, electronic literature or administrative records
next if ($cite_ind =~ /[Bb]/) || ($cite_ind =~ /C/) 
  || ($cite_ind =~ /Z/);
Defines $cite_ind.

Oh! There are also a bunch of entries for reviews of books, which we should skip too.

<Skip irrelevant stuff>+= (<-U) [<-D]
# Skip reviews of books.            
next if ($authors =~ /\(Rev\)/);
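
The two skip rules can be combined into one predicate, which makes them easy to try on sample values. This is a sketch only: the name skip_record is hypothetical, the sample field values are made up, and the three separate matches of the program are folded into a single character class, which should be equivalent because $cite_ind is a single character.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the original program): return 1 if a
# record should be skipped, 0 otherwise.
sub skip_record {
    my ($field1, $authors) = @_;

    # The citation indicator is the last character of the first field.
    my $cite_ind = substr($field1, -1, 1);

    # Same letters as in the program: B, b, C and Z.
    return 1 if $cite_ind =~ /[BbCZ]/;

    # Reviews of books carry the marker "(Rev)" in the author field.
    return 1 if $authors =~ /\(Rev\)/;

    return 0;
}

# Made-up sample values, not real CIS data.
print skip_record('0001B', 'Author, A.'), "\n";
print skip_record('0001J', 'Author, A.'), "\n";
print skip_record('0001J', 'Author, A. (Rev)'), "\n";
```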

Multiple authors are separated by semicolons in the author field.

<Determine no of authors>= (<-U)
@authors = split(';', $authors);
$no_authors = $#authors + 1;
Defines $no_authors.
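
A small sketch of the author count follows, using made-up author fields. The helper name count_authors is hypothetical; scalar(@authors) gives the same value as $#authors + 1 in the chunk above.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the original program): count the authors
# in a semicolon-separated author field.
sub count_authors {
    my ($authors) = @_;
    my @authors = split(';', $authors);
    return scalar(@authors);    # same value as $#authors + 1
}

# Made-up author fields, not real CIS data.
print count_authors('Author, A.;Writer, B.;Scribe, C.'), "\n";
print count_authors('Solo, S.'), "\n";
print count_authors(''), "\n";
```

The last call illustrates an edge case: an empty author field yields a count of zero. Since the program tests only whether the count exceeds one, such a record, if any exist in the files, would be tallied as a single author paper.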

So we are now down to the last part: updating counts. To detect whether the authors are listed alphabetically, we sort the @authors array and join it back into a string with semicolons, just as the field would look if the authors were listed in alphabetic order. If the newly constructed string matches the field obtained from the file, the authors are indeed in alphabetic order. We only need to take care to update the relevant counters.

<Update counts>= (<-U)
if ($no_authors > 1) {
  $sorted_authors = join(';', sort @authors);
}
if ($cite_ind =~ /[Jj]/) { # article in Journal
  if ($no_authors > 1) {
    $mj++;
    if ($sorted_authors eq $authors) {
      $mja++;
    }
  } else {
    $sj++;
  }
} else { # article in proceedings or edited book
  if ($no_authors > 1) {
    $mp++;
    if ($sorted_authors eq $authors) {
      $mpa++;
    }
  } else {
    $sp++;
  }
}
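
The sort-and-compare test can be isolated in a few lines. This is a sketch with a hypothetical helper name, is_alphabetic, and made-up author names. It relies on Perl's default sort, which compares strings character by character and is therefore case-sensitive.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the original program): true if the
# semicolon-separated author field is already in sorted order.
sub is_alphabetic {
    my ($authors) = @_;
    my @authors = split(';', $authors);
    # Rebuild the field from the sorted names and compare with the original.
    return join(';', sort @authors) eq $authors;
}

# Made-up author fields, not real CIS data.
for my $field ('Author, A.;Writer, B.',
               'Writer, B.;Author, A.',
               'de Alpha, A.;Beta, B.') {
    printf "%-25s %s\n", $field, is_alphabetic($field) ? 'alph' : 'non-alph';
}
```

The third field shows a consequence of the default comparison: uppercase letters sort before lowercase ones, so a name starting with a lowercase particle is placed after all capitalized names, and the field is reported as not alphabetic even though a case-insensitive reading would accept it.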

Printing Results

This is straightforward.

<Print results>= (<-U)
print "Statistics from CIS for years @years \n";
printf "Single author papers in Journals: %d\n", $sj;
printf "Multiple author papers in Journals (alph): %d\n", $mja;
printf "Multiple author papers in Journals (Non-alph): %d\n", $mj - $mja;
printf "Single author papers in Proceedings: %d\n", $sp;
printf "Multiple author papers in Proceedings (alph): %d\n", $mpa;
printf "Multiple author papers in Proceedings (Non-alph): %d\n", $mp - $mpa;
