Perl

搜尋腳本忽略 robots.txt

  • August 3, 2015

我正在使用完美搜尋,它似乎忽略了 robots.txt 文件。有什麼想法有什麼問題嗎?

我收到相同的錯誤消息。我需要 10 個信譽點才能發布 2 個以上的連結。所以我會記下連結的位置

$$ $$:

Using DB_File...
Checking for old temp files...
Building string of special characters...
Loading 'no index' regular expressions:

[listing of files not to be indexed]

Loading stopwords...371 stopwords loaded.
Starting crawler...
Note: I will not visit more than $HTTP_MAX_PAGES=4000 pages.
Loading [direct link to main domain]/robots.txt]
Ignoring '[direct link to main domain]/robots.txt': content-type 'text/plain'
Not using any robots.txt.
Ignoring '[direct link to main domain]': content-type 'text/html'

Crawler finished: indexed 0 files, 0 terms (0 different terms).
Ignored 0 files because of conf/no_index.txt
Ignored 0 files because of robots.txt

Calculating weight vectors: 
0%  10%  20%  30%  40%  50%  60%  70%  80%  90%  100%
|----|----|----|----|----|----|----|----|----|----|
>
Removing unused db files:
[listing of files]

Renaming newly created db files...
[listing of files]

Indexer finished.

NOTE: ACTUAL PATHS ARE DETAILED IN BRACKETS [] OR ACTUAL URL W/ '...' INSERTED TO CHANGE TO A 'NON-LINK'


# Perlfect Search configuration file
#$rcs = ' $Id: conf.pl,v 1.74 2007/03/30 22:55:03 gzervas Exp $' ;

# NOTE: Whenever you change one of the options that's marked with [re-index]
# you need to run indexer.pl again to make the change take effect.

###########################################################################
### basic configuration
### You'll have to adapt these values if you didn't use setup.pl

# Where do you want the indexer to start on your disk?
# ** Note ** : If your files are generated dynamically (e.g. via PHP)
# you should set $HTTP_START_URL (see below), otherwise users
# will be able to see your pages' source code using the
# "highlight matches" link.
# [re-index]
$DOCUMENT_ROOT = '/home1/shamaror/public_html/emet/';

# The base url of your site (normally that's the URL which
# corresponds to $DOCUMENT_ROOT).
$BASE_URL = '[to main domain]';

# The url in which Perlfect Search is located (usually somewhere in cgi-bin/).
$CGIBIN = 'http: ... //emetnews.org/ ... cgi-bin/perlfect/search/';

# The full-path of the directory where Perlfect Search is installed.
$INSTALL_DIR = '/home1/shamaror/public_html/emet/cgi-bin/perlfect/search/';

# Only files with these extensions should be indexed (case-sensitive).
# This is only relevant for file system indexing, when you index files via
# http you need to set @HTTP_CONTENT_TYPES instead. [re-index]
@EXT = ("htm","html","php","PDF","pdf");

###########################################################################
### http configuration
### You only need this if you want to index your pages via http

# Where you want the indexer to start via http. Leave empty if
# you want to index the files in the filesystem ($DOCUMENT_ROOT).
# ** WARNING **: Do not use for foreign servers! It might use too many
# resources on other people's servers. [re-index]
# example: $HTTP_START_URL = 'http://localhost/';
$HTTP_START_URL = 'http://emetnews.org/';

# The indexer might not notice if it runs into an endless loop. To void
# that, set this to the maximum number of pages that will be visited
# (this can be bigger than the number of pages indexed). [re-index]
$HTTP_MAX_PAGES = 4000;

# The web server's document root. Normally that's the same as $DOCUMENT_ROOT,
# it differs if you're only using Perlfect Search on a subdirectory. [re-index]
$HTTP_SERVER_ROOT = $DOCUMENT_ROOT;

# Limit crawling to these URL pattern. This is an important setting so
# the script doesn't run out of control.
# ** WARNING **: The default ($HTTP_START_URL) should not be changed,
# otherwise you risk the script to crawl on remote servers. For example,
# the robots.txt file will only be used on the $HTTP_START_URL server!
# [re-index]
@HTTP_LIMIT_URLS = ($HTTP_START_URL);

# Comment this out if you want to ignore robots.txt (only do that if
# you really know what you are doing):
$ROBOT_AGENT = 'perlfectsearch';

# Should the indexer follow links that are commented out?
$HTTP_FOLLOW_COMMENT_LINKS = 1;

# Only if indexing via http: the content types to index.
# Add 'application/msword' for for MS-Word,
# 'application/pdf' for PDF. [re-index]
#@HTTP_CONTENT_TYPES = ('text/html', 'text/plain');

# Set to 1 to get verbose output during indexing. [re-index]
$HTTP_DEBUG = 1;

###########################################################################
### advanced configuration
### You only need this if you want to adapt advanced features

# Programs that convert other formats to ascii text.
# The name of the file to be filtered is passed as FILENAME, and the command
# must print out ascii (or latin1) text.
# pdftotext is part of xpdf, available at

# NOTE: You also have to set @EXT or @HTTP_CONTENT_TYPES accordingly.
# If there's a problem with pdftotext, try a new version or hand over
# the -raw option to pdftotext.
# [re-index]
%EXT_FILTER = (
      "pdf" => "/usr/bin/pdftotext FILENAME -",
      "doc" => "/usr/bin/antiword FILENAME"
);

# How many results should be shown per page.
$RESULTS_PER_PAGE = 10;

# Limit the number of results. 0 = no limit.
$MAX_RESULTS = 0;

# Enable the "highlight matches" feature that displays the original
# pages, but with the search terms highlighted. See the README on
# restrictions of this feature.
$HIGHLIGHT_MATCHES = 1;

# A "highlight matches" link does only work for HTML files, so only
# offer such a link for files with these suffixes.
# ** Note **: If $HTTP_START_URL is not set, the highlighting
# will load the file from disk so that the user might find
# passwords in the highlightes file! So don't set this to include
# dynamic files, unless you are using $HTTP_START_URL.
@HIGHLIGHT_EXT = ("html", "htm");

# Perlfect Search can highlight the search terms in the matching
# document. These are the colors that will be used for the background
# of the terms (the browser must support CSS for this). If the last color
# is used, the first one will be used again if there are still terms left.
@HIGHLIGHT_COLORS = ('#4fafea', '#e5b547', '#aaaaaa', '#ee77ee');

# Show the ranking in percent, with the first document = 100%.
$PERCENTAGE_RANKING = 1;

# Do you want to index numbers? If so set $INDEX_NUMBERS to 1. [re-index]
$INDEX_NUMBERS = 0;

# If you don't have enough memory, set this to 1. This will slow down
# indexer.pl by a factor of about 2. Searching is not affected.
$LOW_MEMORY_INDEX = 1;

# How much of the document should be put in the index? With this option,
# the context of the match is shown on the results page. This only works
# if the match was in the first $CONTEXT_SIZE bytes of the document.
# Warning: Using this option will generate a very big index file.
# Set to 0 to disable, set to -1 for no limit. [re-index]
$CONTEXT_SIZE = 0;

# If $CONTEXT_SIZE is enabled, how many occurences of every term should be shown
# on the results page?
$CONTEXT_EXAMPLES = 2;

# If $CONTEXT_SIZE is enabled, how many words should be used to show the context
# of a term?
$CONTEXT_DESC_WORDS = 12;

# How many words should be used from the <BODY> of an html document as a
# description for the document in case there is no <META description> tag
# available and $CONTEXT_SIZE is 0. [re-index]
$DESC_WORDS = 25;

# The minimum length of a word. Any word of smaller size is not indexed.
# [re-index]
$MINLENGTH = 3;

# If you have umlauts or accents etc. in your text, enable this.
# With this option accented characters will be indexed as the characters
# they are based on (e.g. è -> e, ü -> u), without this option they will
# be filtered out completely (you don't want that). [re-index]
$SPECIAL_CHARACTERS = 1;

# The largest acceptable word size. Reducing this saves space but decreases
# result accuracy. Setting the variable to 0 ignores stemming alltogether.
# [re-index]
$STEMCHARS = 0;

# Add URLs to the index, so one can search for them? Note that special
# characters will be ignored, just as in normal text. [re-index]
$INDEX_URLS = 0;

# You can completely ignore certain parts of your documents if you put these
# HTML comments around them. [re-index]
$IGNORE_TEXT_START = '<!--ignore_perlfect_search-->';
$IGNORE_TEXT_END = '<!--/ignore_perlfect_search-->';

# The maximum length of <title> elements, everything longer than this
# will be cut off. [re-index]
$MAX_TITLE_LENGTH = 80;

# How much more important are words found in the title, in the meta values
# (author, description, keywords), and in the headlines compared to normal
# text in the body? This influences the ranking of the results.
# Use any integer (0 = ignore that text completely) [re-index]
$TITLE_WEIGHT = 5;
$META_WEIGHT = 5;
$H_WEIGHT{'1'} = 5; # headline <h1>...</h1>
$H_WEIGHT{'2'} = 4;
$H_WEIGHT{'3'} = 3;
$H_WEIGHT{'4'} = 1;
$H_WEIGHT{'5'} = 1;
$H_WEIGHT{'6'} = 1; # headline <h6>...</h6>

# If you want to log the queries to an extra file, set this to 1.
# Every use of search.pl will then be logged to data/log.txt. That file
# has to exist and must be writable for the webserver. The line format is:
# REMOTE_HOST;date;terms;matches;current page;(time to search in seconds);
# NOTE: You'll have to comment in two lines at the top of search.pl to get the
# time value (see the comment there).
# NOTE: if you have many queries, this file will grow quite fast.
$LOG = 0;

# This will increase the score of results that contain more than one of
# the searched terms. Queries with only one term will not be affected.
# The number given here is a factor that multiplies the score (even
# several times, if there are more than two terms). 0 turns it off.
$MULTIPLE_MATCH_BOOST = 0;

# Date format for the result page. %Y = year, %m = month, %d = day,
# %H = hour, %M = minute, %S = second. On a Unix system use
# 'man strftime' to get a list of all possible options.
$DATE_FORMAT = "%Y-%m-%d";

# Date format for the "Latest Index update" information on the result page.
$INDEX_DATE_FORMAT = "%Y-%m-%d %H:%M";

# Directory with templates (normally you don't have to modify this).
$TEMPLATE_DIR = $INSTALL_DIR.'templates/';

# What's the default language. This is the language that's used if no lang
# parameter is passed to the script or if the parameter is invalid.
$DEFAULT_LANG = 'en';

# The result templates for several languages.
$SEARCH_TEMPLATE{'en'} = $TEMPLATE_DIR.'search.html';
$SEARCH_TEMPLATE{'de'} = $TEMPLATE_DIR.'search_de.html';
$SEARCH_TEMPLATE{'fr'} = $TEMPLATE_DIR.'search_fr.html';
$SEARCH_TEMPLATE{'it'} = $TEMPLATE_DIR.'search_it.html';
$NO_MATCH_TEMPLATE{'en'} = $TEMPLATE_DIR.'no_match.html';
$NO_MATCH_TEMPLATE{'de'} = $TEMPLATE_DIR.'no_match_de.html';
$NO_MATCH_TEMPLATE{'fr'} = $TEMPLATE_DIR.'no_match_fr.html';
$NO_MATCH_TEMPLATE{'it'} = $TEMPLATE_DIR.'no_match_it.html';
# This is the template for using search.pl via command line:
$SEARCH_TEMPLATE{'text'} = $TEMPLATE_DIR.'search.txt';
$NO_MATCH_TEMPLATE{'text'} = $TEMPLATE_DIR.'no_match.txt';
# This is the template for using the test cases (development only):
$SEARCH_TEMPLATE{'qa'} = $INSTALL_DIR.'qa/search_qa.txt';
$NO_MATCH_TEMPLATE{'qa'} = $INSTALL_DIR.'qa/no_match_qa.txt';

# The text for the "Next Page" link in several languages.
$NEXT_PAGE{'en'} = 'Next';
$NEXT_PAGE{'de'} = 'nächste Seite';
$NEXT_PAGE{'fr'} = 'Suivant';
$NEXT_PAGE{'it'} = 'Successiva';

# The text for the "Previous Page" link in several languages.
$PREV_PAGE{'en'} = 'Previous';
$PREV_PAGE{'de'} = 'vorige Seite';
$PREV_PAGE{'fr'} = 'Précédent';
$PREV_PAGE{'it'} = 'Precedente';

# Text of the link that shows a colored backround for matched terms:
$HIGHLIGHT_TERMS{'en'} = 'highlight matches';
$HIGHLIGHT_TERMS{'de'} = 'Treffer hervorheben';

# The text for the "too common" warning. <WORDS> will be replaced with
# a list of the ignored words. If there are no ignored words, this text
# will not appear.
$IGNORED_WORDS{'en'} = '<p>The following words are either too short or very common and were
   not included in your search: <strong><WORDS></strong></p>';
$IGNORED_WORDS{'de'} = '<p>Folgende Wörter sind zu kurz oder kommen sehr häufig vor und wurden
   daher in Ihrer Suchanfrage ignoriert: <strong><WORDS></strong></p>';
$IGNORED_WORDS{'fr'} = '<p>Les mots suivants sont trop courts ou très courants et n\'ont
   pas été inclus dans votre recherche: <strong><WORDS></strong></p>';
# fixme: "too short" missing:
$IGNORED_WORDS{'it'} = '<p>Le seguenti parole sono molto comuni e non
   saranno incluse nella vostra ricerca: <strong><WORDS></strong></p>';

###########################################################################
### You shouldn't have to edit anything below this line.

# Various paths (do NOT use system-wide /tmp for security reasons!)
$TMP_DIR  = $INSTALL_DIR.'temp/';
$DATA_DIR = $INSTALL_DIR.'data/';
$CONF_DIR = $INSTALL_DIR."conf/";
$STOPWORDS_FILE = $CONF_DIR.'stopwords.txt';
$NO_INDEX_FILE = $CONF_DIR.'no_index.txt';
$LOGFILE = $DATA_DIR.'log.txt';
$SEARCH = 'search.pl';
$SEARCH_URL = $CGIBIN.$SEARCH;
$UPDATE_FILE = $DATA_DIR.'update';

# Paths to the database files.
$INV_INDEX_DB_FILE = $DATA_DIR.'inv_index';
$DOCS_DB_FILE      = $DATA_DIR.'docs';
$URLS_DB_FILE      = $DATA_DIR.'urls';
$SIZES_DB_FILE     = $DATA_DIR.'sizes';
$TERMS_DB_FILE     = $DATA_DIR.'terms';
$DF_DB_FILE        = $DATA_DIR.'df';
$TF_DB_FILE        = $DATA_DIR.'tf';
$CONTENT_DB_FILE   = $DATA_DIR.'content';
$DESC_DB_FILE      = $DATA_DIR.'desc';
$TITLES_DB_FILE    = $DATA_DIR.'titles';
$DATES_DB_FILE     = $DATA_DIR.'dates';

# Paths to the temporary database files.
$INV_INDEX_TMP_DB_FILE = $DATA_DIR.'inv_index_tmp';
$DOCS_TMP_DB_FILE      = $DATA_DIR.'docs_tmp';
$URLS_TMP_DB_FILE      = $DATA_DIR.'urls_tmp';
$SIZES_TMP_DB_FILE     = $DATA_DIR.'sizes_tmp';
$TERMS_TMP_DB_FILE     = $DATA_DIR.'terms_tmp';
$CONTENT_TMP_DB_FILE   = $DATA_DIR.'content_tmp';
$DESC_TMP_DB_FILE      = $DATA_DIR.'desc_tmp';
$TITLES_TMP_DB_FILE    = $DATA_DIR.'titles_tmp';
$DATES_TMP_DB_FILE     = $DATA_DIR.'dates_tmp';

# Official version number.
$VERSION = "3.37";
1;

我將只發布一小段,因為它都是一樣的。它在 16 頁後停止。兩件事情:

  • 它找到一個頁面(以數字開頭列出,例如 1: , 2: )然後執行它應該忽略的所有項目(它以前從未這樣做過。)
  • 另請注意,在 conf.pl 文件中,就在該內容類型的正上方,它說:“# 僅當通過 http 進行索引時:要索引的內容類型。” 我不通過 http 建立索引。我通過終端執行此操作,然後直接在站點上執行索引器。
   Starting crawler...
   Note: I will not visit more than $HTTP_MAX_PAGES=4000 pages.
   Loading http: ... //emetnews.org/ ... robots.txt...
   Fetched  'http: ... //emetnews.org/ ... robots.txt', 734 bytes
       1: http: ... //emetnews.org/ ...  (33.09 KB)
   Ignoring 'http: ... //emetnews.org/ ... style/cssNews.css': content-type 'text/css'
   Ignoring 'http: ... //emetnews.org/ ... javascript/greybox/gb_styles.css': content-type 'text/css'
   Ignoring 'http: ... //emetnews.org/ ... style/tabbox_ie.css': content-type 'text/css'
   Ignoring 'http: ... //emetnews.org/ ... ens.ico': content-type 'image/vnd.microsoft.icon'
   Ignoring 'http: ... //emetnews.org/ ... apple-touch-icon.png': content-type 'image/png'
   Ignoring 'http: ... //emetnews.org/ ... ': already visited
   Fetched  'http: ... //emetnews.org/ ... about.php', 19049 bytes
       2: http: ... //emetnews.org/ ... about.php (18.60 KB)
   Ignoring 'http: ... //emetnews.org/ ... style/cssNews.css': content-type 'text/css'
   Ignoring 'http: ... //emetnews.org/ ... style/print.css': content-type 'text/css'
   Ignoring 'http: ... //emetnews.org/ ... ens.ico': content-type 'image/vnd.microsoft.icon'
   Ignoring 'http: ... //emetnews.org/ ... ': already visited
   Fetched  'http: ... //emetnews.org/ ... contact.php', 14410 bytes
   'http: ... //emetnews.org/ ... contact.php': META tags forbid indexing
   Ignoring 'http: ... //emetnews.org/ ... style/cssNews.css': content-type 'text/css'
   Ignoring 'http: ... //emetnews.org/ ... ': already visited
   Ignoring 'http: ... //emetnews.org/ ... about.php': already visited
   Fetched  'http: ... //emetnews.org/ ... index.php', 33785 bytes
       3: http: ... //emetnews.org/ ... index.php (32.99 KB)
   Ignoring 'http: ... //emetnews.org/ ... style/cssNews.css': content-type 'text/css'
   Ignoring 'http: ... //emetnews.org/ ... javascript/greybox/gb_styles.css': content-type 'text/css'
   Ignoring 'http: ... //emetnews.org/ ... style/tabbox_ie.css': content-type 'text/css'
   Ignoring 'http: ... //emetnews.org/ ... ens.ico': content-type 'image/vnd.microsoft.icon'
   Ignoring 'http: ... //emetnews.org/ ... apple-touch-icon.png': content-type 'image/png'
   Ignoring 'http: ... //emetnews.org/ ... ': already visited
   Ignoring 'http: ... //emetnews.org/ ... about.php': already visited
   Ignoring 'http: ... //emetnews.org/ ... contact.php': already visited
   Ignoring 'http: ... //emetnews.org/ ... opensearch.xml': content-type 'text/xml'
   Fetch

tools.pl包含:

if( $response->headers_as_string =~ m/^Content-Type:\s*(.+)$/im ) {
 my $content_type = $1;
 $content_type =~ s/^(.*?);.*$/$1/;          # ignore possible charset value
 if( ! grep(/^$content_type$/i, @HTTP_CONTENT_TYPES) ) {
   debug("Ignoring '$url': content-type '$content_type'\n");
   return;
 }
}

@HTTP_CONTENT_TYPES是一個空數組。! grep(/^$content_type$/i, @HTTP_CONTENT_TYPES)總是返回true

##@HTTP_CONTENT_TYPES = ('text/html', 'text/plain');您的中刪除並重conf.pl試。

所有這些消息都來自http-indexer。如果要索引文件系統中的文件,請在以下位置$HTTP_START_URL留空conf.pl$HTTP_START_URL='';

引用自:https://unix.stackexchange.com/questions/219797