\documentclass[11pt]{article}
\setlength{\oddsidemargin}{0.0truein}
\setlength{\evensidemargin}{0.0truein}
\setlength{\textwidth}{6.5truein}
\setlength{\topmargin}{0.0truein}
\setlength{\textheight}{9.0truein}
\setlength{\headsep}{0.0truein}
\setlength{\headheight}{0.0truein}
\setlength{\topskip}{10.0pt}
\setlength{\parskip}{5mm}
\usepackage{url}
\usepackage{amsmath}
\usepackage{amssymb}
\pagestyle{empty}
\begin{document}

\begin{center}
\textbf{\Large{\textsc{STANFORD UNIVERSITY}}}\\[5pt]
\textbf{\Large{\textsc{DEPARTMENT OF STATISTICS}}}\\[5pt]
\Large{\textsc{DEPARTMENTAL SEMINAR}}
\end{center}

% In the following statements, replace "Time of talk",
% "Weekday", and "Date of talk". An example is provided.
% If you are not sure about this, just skip this part.
\begin{center}
4:15 p.m., Tuesday, November 13, 2007\\
%% Example: 4:15 p.m., Tuesday, February 13, 2007\\
Sequoia Hall Room 200\\
(Cookies at 3:45 in 1st Floor Lounge)
\end{center}

% In the following statements, replace "Name of the speaker" with your
% name, "Department Affiliation" with your department affiliation, and
% "University Affiliation" with your university affiliation.
\begin{center}
\textsl{Carrie Grimes} \\
Google\\
\end{center}

% In the following statements, replace "Title of the talk"
% with your title of the talk.
\begin{center}
\subsection*{Estimation of Web Page Change Rates}
\end{center}

% In the following statements, replace "Abstract of the talk"
% with your abstract.
\noindent
Search engines strive to maintain a ``current'' repository of all pages on the web to index for user queries. However, crawling all pages all the time is costly and inefficient: many small websites don't support that much load, and while some pages change very rapidly, others don't change at all. As a result, estimated frequency of change is often used to decide how often a web page needs to be crawled.
Here we consider a Poisson process model for the number of state changes of a page, in which a crawler samples the page at some known time interval and observes whether or not the page has changed during that interval. A Maximum Likelihood Estimator (MLE) of the change rate is calculated from these observations. We examine the performance of the MLE in a practical setting where new pages are introduced to an ongoing crawl rather than starting with a fixed test set. We demonstrate that the handling of edge cases, where no changes or only changes have been observed, is critical to correct estimation over time. We also propose adaptations to the initial estimation and search path to optimize the freshness of the corpus over a series of crawl samples.

\end{document}