
URLFilter Class Reference

Parses URL strings and provides related URL-handling functions.

#include <urlfilter.h>

Public Methods

 URLFilter (bool rs)
const char * DeindexURL (const char *anurl)
const char * CompressURL (const char *anurl)
void ParseURL (const char *anurl, char *schemebuf, char *netlocbuf, char *querybuf, char *paramsbuf, char *pathbuf)
 Decomposes a URL into its components for analysis.

void NormalizeURLPath (char *apath)
 Fixes the path in case the document is of type index.html.

ContentType ClassifyURLPath (const char *path)
 Classifies a file according to its extension.

const char * FormatURL (const char *anurl, int anurl_len, URLComponents *baseurl, ContentType *foundtype) throw (domain_error)

Private Attributes

char scratchbuf0 [STRINGBUF_LEN0+1]
char scratchbuf1 [STRINGBUF_LEN2+1]
char scratchbuf2 [STRINGBUF_LEN2+1]
char scratchbuf3 [STRINGBUF_LEN2+1]
char scratchbuf4 [STRINGBUF_LEN3+1]
char scratchbuf5 [STRINGBUF_LEN1+1]
char scratchbuf6 [STRINGBUF_LEN1+1]
char comp_scratchbuf [STRINGBUF_LEN2+1]
char parse_scratchbuf [STRINGBUF_LEN1+1]
char deindex_scratchbuf [STRINGBUF_LEN2+1]
struct {
   bool   remove_html_suffix
   bool   rearrange_components
} flags


Detailed Description

Parses URL strings and provides related URL-handling functions.

Definition at line 47 of file urlfilter.h.


Constructor & Destructor Documentation

URLFilter::URLFilter (bool rs)
 

Definition at line 28 of file urlfilter.cc.

References flags.


Member Function Documentation

ContentType URLFilter::ClassifyURLPath (const char *path)
 

Classifies a file according to its extension.
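As a rough illustration of extension-based classification, the sketch below maps a few file extensions to content types. The enum is a reduced, hypothetical copy of the ContentType constants referenced by this function, and the extension-to-type table (including the use of CONTENT_GOOGLE_OTHER as the fallback) is an assumption; the actual mapping in urlfilter.cc is not documented here.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical subset of the ContentType constants; values are illustrative.
enum ContentType {
    CONTENT_TEXT_HTML,
    CONTENT_TEXT_PLAIN,
    CONTENT_APPLICATION_PDF,
    CONTENT_IMAGE,
    CONTENT_GOOGLE_OTHER
};

// Hypothetical stand-in for URLFilter::ClassifyURLPath().
// The extension table below is assumed, not taken from the source.
ContentType classify_url_path(const std::string& path) {
    const std::string::size_type slash = path.find_last_of('/');
    const std::string::size_type dot = path.find_last_of('.');
    // No extension, or the last dot belongs to a directory name.
    if (dot == std::string::npos ||
        (slash != std::string::npos && dot < slash)) {
        return CONTENT_GOOGLE_OTHER;
    }
    std::string ext = path.substr(dot + 1);
    // Compare extensions case-insensitively.
    std::transform(ext.begin(), ext.end(), ext.begin(), [](unsigned char c) {
        return static_cast<char>(std::tolower(c));
    });
    if (ext == "html" || ext == "htm") return CONTENT_TEXT_HTML;
    if (ext == "txt") return CONTENT_TEXT_PLAIN;
    if (ext == "pdf") return CONTENT_APPLICATION_PDF;
    if (ext == "gif" || ext == "jpg" || ext == "png") return CONTENT_IMAGE;
    return CONTENT_GOOGLE_OTHER;
}
```

Under these assumptions, classify_url_path("a/b/index.html") yields CONTENT_TEXT_HTML, while a path without an extension falls through to CONTENT_GOOGLE_OTHER.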

Definition at line 261 of file urlfilter.cc.

References CONTENT_APPLICATION_MS_POWERPOINT, CONTENT_APPLICATION_MSWORD, CONTENT_APPLICATION_PDF, CONTENT_APPLICATION_POSTSCRIPT, CONTENT_APPLICATION_XGZIP, CONTENT_AUDIO_MP3, CONTENT_GOOGLE_OTHER, CONTENT_IMAGE, CONTENT_TEXT_HTML, CONTENT_TEXT_PLAIN, CONTENT_TEXT_RTF, and ContentType.

const char * URLFilter::CompressURL (const char *anurl)
 

This function compresses a URL, whose characters are guaranteed to fit within seven bits, by removing forward slashes, which are the most commonly used character. Every time a slash is removed, the *preceding* character has its eighth bit set. A slash is not removed if the previous character already has its eighth bit set.

The compressed URL is always located in the special buffer comp_scratchbuf[]. The string anurl is not modified.
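The slash-folding scheme described above can be sketched as follows. This is a minimal illustration, not the code in urlfilter.cc: the names compress_url and decompress_url are hypothetical, the sketch returns a std::string instead of writing into comp_scratchbuf[], it assumes a leading slash (with no preceding character) is kept as-is, and the inverse function is added only to show that the encoding is unambiguous.

```cpp
#include <string>

// Hypothetical stand-in for URLFilter::CompressURL().
// Input is assumed to contain only 7-bit characters.
std::string compress_url(const std::string& url) {
    std::string out;
    out.reserve(url.size());
    for (char c : url) {
        const bool prev_is_plain =
            !out.empty() &&
            (static_cast<unsigned char>(out.back()) & 0x80) == 0;
        if (c == '/' && prev_is_plain) {
            // Drop the slash and set the eighth bit of the preceding character.
            out.back() = static_cast<char>(
                static_cast<unsigned char>(out.back()) | 0x80);
        } else {
            // Ordinary character, or a slash that cannot be folded.
            out.push_back(c);
        }
    }
    return out;
}

// Hypothetical inverse (not part of the documented class).
std::string decompress_url(const std::string& packed) {
    std::string out;
    for (char c : packed) {
        const unsigned char uc = static_cast<unsigned char>(c);
        if (uc & 0x80) {
            // Marked character: restore it and re-emit the folded slash.
            out.push_back(static_cast<char>(uc & 0x7F));
            out.push_back('/');
        } else {
            out.push_back(c);
        }
    }
    return out;
}
```

For "http://a/b", the first slash is folded into the colon, the second slash is kept because the colon is already marked, and the third is folded into "a", so the result is two bytes shorter than the input.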

Definition at line 107 of file urlfilter.cc.

References comp_scratchbuf.

Referenced by GraphBuilder::FindLeafNodeKey(), GraphBuilder::FindWebNode(), GraphBuilder::NodeSetURL(), and GraphBuilder::TrieInsertLinkURL().

const char * URLFilter::DeindexURL (const char *anurl)
 

This function takes a standardized URL (see NormalizeURLPath()) and removes the trailing string /index.htm(l). This compacts the string before it is added to the trie: in a trie, common prefixes are harmless, but common suffixes waste space. If the remove_html_suffix flag is set, other common HTML endings are also tokenized to reduce space requirements.

Note that this operation is irreversible (we cannot reinsert the suffix /index.html reliably in all cases). The string anurl is not modified.
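A minimal sketch of the suffix removal, assuming only the two endings named above are stripped. The function name deindex_url is hypothetical, the result is returned as a std::string rather than placed in deindex_scratchbuf[], and the tokenization enabled by remove_html_suffix is not modeled because its details are not documented here.

```cpp
#include <string>

// Hypothetical stand-in for URLFilter::DeindexURL() with
// remove_html_suffix disabled. The input string is not modified.
std::string deindex_url(const std::string& url) {
    // Check the longer suffix first so ".html" is not left with a stray "l".
    static const std::string suffixes[] = {"/index.html", "/index.htm"};
    for (const std::string& s : suffixes) {
        if (url.size() >= s.size() &&
            url.compare(url.size() - s.size(), s.size(), s) == 0) {
            return url.substr(0, url.size() - s.size());
        }
    }
    return url;  // No index suffix: return the URL unchanged.
}
```

Both "http://a/b/index.html" and "http://a/b/index.htm" reduce to "http://a/b", which also shows why the operation cannot be reversed reliably.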

Definition at line 46 of file urlfilter.cc.

References deindex_scratchbuf, and flags.

Referenced by GraphBuilder::FindLeafNodeKey(), GraphBuilder::FindWebNode(), GraphBuilder::NodeSetURL(), and GraphBuilder::TrieInsertLinkURL().

const char * URLFilter::FormatURL (const char *anurl, int anurl_len, URLComponents *baseurl, ContentType *foundtype) throw (domain_error)
 

This function formats anurl into a standard form. Its most important use is as a completion mechanism for URL fragments as can be found in anchor tags. The URL is completed relative to the baseurl, which typically is the current document's url.

The return value is always a pointer to one of the scratch buffers, so copy the returned string before formatting another URL.
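The following sketch illustrates why the returned pointer should be copied. ScratchFormatter is a hypothetical class that imitates the scratch-buffer return convention with a single internal buffer; it is not the real FormatURL(), which may use several buffers.

```cpp
#include <cstdio>
#include <string>
#include <utility>

// Hypothetical class that returns pointers into its own scratch buffer,
// imitating the convention described above.
class ScratchFormatter {
public:
    const char* Format(const char* text) {
        std::snprintf(buf_, sizeof buf_, "%s", text);
        return buf_;  // Valid only until the next call to Format().
    }
private:
    char buf_[256];
};

// Returns {safe copy of the first result, what the stale pointer reads
// after a second call}.
std::pair<std::string, std::string> format_twice(const char* first,
                                                 const char* second) {
    ScratchFormatter f;
    const char* stale = f.Format(first);
    std::string copy(stale);   // Copy before the buffer is reused.
    f.Format(second);          // Overwrites the scratch buffer.
    return {copy, std::string(stale)};
}
```

In this sketch the copy keeps the first result, while the uncopied pointer now reads the second string.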

Definition at line 333 of file urlfilter.cc.

References ContentType, and NULL.

Referenced by GraphBuilder::FormatURL(), and GraphBuilder::NodeSetURL().

void URLFilter::NormalizeURLPath (char *apath)
 

Fixes the path in case the document is of type index.html.

This function maps paths of the form xxx/ and xxx/index.htm to the standard form xxx/index.html.

WARNING: This function modifies the string apath. It is assumed that apath has STRINGBUF_LEN1 storage available.
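The mapping can be sketched as below. The name normalize_url_path is hypothetical, and the sketch takes and returns a std::string by value for safety, whereas the documented function rewrites a caller-supplied char buffer in place. Leaving an empty path unchanged is an assumption.

```cpp
#include <string>

// Hypothetical stand-in for URLFilter::NormalizeURLPath().
std::string normalize_url_path(std::string path) {
    const std::string htm = "/index.htm";
    if (!path.empty() && path.back() == '/') {
        // "xxx/" becomes "xxx/index.html".
        path += "index.html";
    } else if (path.size() >= htm.size() &&
               path.compare(path.size() - htm.size(), htm.size(), htm) == 0) {
        // "xxx/index.htm" becomes "xxx/index.html".
        path += 'l';
    }
    return path;  // Any other path is returned unchanged.
}
```

Both rules only append characters, which is why the in-place version needs spare room in the caller's buffer.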

Definition at line 239 of file urlfilter.cc.

void URLFilter::ParseURL (const char *anurl, char *schemebuf, char *netlocbuf, char *querybuf, char *paramsbuf, char *pathbuf)
 

Decomposes a URL into its components for analysis.

Each of the supplied buffers must be STRINGBUF_LEN1 long. This function does not modify anurl.

If flags.rearrange_components is true, the network location and file path are rearranged so that the suffix is placed first.
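To make the five components concrete, here is a minimal sketch of such a decomposition, assuming the common layout scheme://netloc/path;params?query. The struct UrlParts and the function parse_url are hypothetical, results are returned as std::string members rather than written into caller buffers, and the rearrangement controlled by flags.rearrange_components is not modeled.

```cpp
#include <string>

// Hypothetical result type; the documented function fills five char buffers.
struct UrlParts {
    std::string scheme, netloc, path, params, query;
};

// Hypothetical stand-in for URLFilter::ParseURL(). The input is not modified.
UrlParts parse_url(const std::string& url) {
    UrlParts p;
    std::string rest = url;
    const std::string::size_type sep = rest.find("://");
    if (sep != std::string::npos) {
        p.scheme = rest.substr(0, sep);
        rest = rest.substr(sep + 3);
        // The network location runs up to the first '/', ';' or '?'.
        const std::string::size_type end = rest.find_first_of("/;?");
        p.netloc = rest.substr(0, end);
        rest = (end == std::string::npos) ? std::string() : rest.substr(end);
    }
    // Split off the query string, then the parameters; the remainder is the path.
    const std::string::size_type q = rest.find('?');
    if (q != std::string::npos) {
        p.query = rest.substr(q + 1);
        rest.erase(q);
    }
    const std::string::size_type s = rest.find(';');
    if (s != std::string::npos) {
        p.params = rest.substr(s + 1);
        rest.erase(s);
    }
    p.path = rest;
    return p;
}
```

Under these assumptions, "http://www.example.com/a/b.html;v=1?x=2" splits into scheme "http", network location "www.example.com", path "/a/b.html", parameters "v=1", and query "x=2". A relative URL such as "docs/page.html" yields only a path.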

Definition at line 150 of file urlfilter.cc.

References parse_scratchbuf.

Referenced by GraphBuilder::NodeSetURL().


Member Data Documentation

char URLFilter::comp_scratchbuf[STRINGBUF_LEN2+1] [private]
 

Definition at line 70 of file urlfilter.h.

Referenced by CompressURL().

char URLFilter::deindex_scratchbuf[STRINGBUF_LEN2+1] [private]
 

Definition at line 72 of file urlfilter.h.

Referenced by DeindexURL().

struct { ... } URLFilter::flags [private]
 

Referenced by DeindexURL(), and URLFilter().

char URLFilter::parse_scratchbuf[STRINGBUF_LEN1+1] [private]
 

Definition at line 71 of file urlfilter.h.

Referenced by ParseURL().

bool URLFilter::rearrange_components [private]
 

Definition at line 76 of file urlfilter.h.

bool URLFilter::remove_html_suffix [private]
 

Definition at line 75 of file urlfilter.h.

char URLFilter::scratchbuf0[STRINGBUF_LEN0+1] [private]
 

Definition at line 62 of file urlfilter.h.

char URLFilter::scratchbuf1[STRINGBUF_LEN2+1] [private]
 

Definition at line 63 of file urlfilter.h.

char URLFilter::scratchbuf2[STRINGBUF_LEN2+1] [private]
 

Definition at line 64 of file urlfilter.h.

char URLFilter::scratchbuf3[STRINGBUF_LEN2+1] [private]
 

Definition at line 65 of file urlfilter.h.

char URLFilter::scratchbuf4[STRINGBUF_LEN3+1] [private]
 

Definition at line 66 of file urlfilter.h.

char URLFilter::scratchbuf5[STRINGBUF_LEN1+1] [private]
 

Definition at line 67 of file urlfilter.h.

char URLFilter::scratchbuf6[STRINGBUF_LEN1+1] [private]
 

Definition at line 68 of file urlfilter.h.


Generated on Wed May 29 11:37:28 2002 for MarkovPR by doxygen 1.2.15