
GraphBuilder Class Reference

Builds the web link graph as an object of type WebLinkGraph.

#include <graphbuilder.h>

Public Methods

 GraphBuilder (int smem, int jmem, int nmem, int lmem, bool sl)
 ~GraphBuilder ()
void NodeInitialize (uint32 idno)
 Creates a new WebNode on the heap and assigns curdoc as its handle.

void NodeSetURL (const char *docurl, const char *aliasurl)
 Copies the current document's URL into another_url, and places it into the trie.

const char * NodeGetURL ()
const char * NodeGetAlias ()
const char * NodeGetURL_ ()
const char * NodeGetAlias_ ()
const uint32 NodeGetID ()
const uint16 NodeGetDate ()
void NodeSetDate (unsigned short aDate)
 Sets curdoc's date.

void NodeInsertLinks ()
 Inserts the anchor links contained in linkset into curdoc's fromlinks array.

void NodeLaunch ()
 Places the WebNode handled by curdoc into the graph and clears curdoc.

URLComponents * NodeGetURLParts ()
WebLinkGraph * UndockWebGraph ()
 Walks through all the nodes and rewrites their links so that they no longer require the trie.

const char * FormatURL (const char *anurl, int anurl_len, ContentType *t)
void TrieInsertLinkURL (const char *url)
WebNodePtr FindWebNode (const char *url)
 Finds an existing WebNode from a URL. Returns NULL if not found.

void SetupLeafTable ()
void AddLeaf (ptrdiff_t key, LeafNodePtr leaf)
const ptrdiff_t FindLeafNodeKey (const char *url)
 Finds the leaftable key of an existing LeafNode from a URL. Returns -1 if not found.

void UpdateLeafLinks ()
void StatisticsMem (ostream &o)
void StatisticsGraph (ostream &o)
uint32 LowestID ()
uint32 HighestID ()
uint16 LowestDate ()
uint16 HighestDate ()

Public Attributes

struct {
   bool   show_links
   int   leaftable_memory
} flags

Private Attributes

WebNode * curdoc
URLComponents curdoc_baseurl
char doc_url [STRINGBUF_LEN2+1]
char doc_alias [STRINGBUF_LEN2+1]
const char * docurl__
const char * aliasurl__
WebLinkGraph * graph
bool graph_is_docked
Trie * trie
SimpleWebNodePtrHashTable * nodetable
SimpleLeafNodePtrHashTable * leaftable
URLFilter * urlfilter
RawLinkSet * linkset
struct {
   uint32   heap_used_webnodes
   uint32   cumulative_tolinks
   uint32   cumulative_fromlinks
   uint32   cumulative_leaflinks
   uint32   cumulative_dangling
   uint32   nodetable_insertions
   uint32   nodetable_alias_insertions
   uint32   lowid
   uint32   highid
   uint16   lowdate
   uint16   highdate
} stats

Detailed Description

Builds the web link graph as an object of type WebLinkGraph.

Every WebNode is a separate document, and GraphBuilder handles the connection of fromlinks and tolinks, and the discarding of dangling links.

To do this, GraphBuilder must construct a trie containing all known document URLs. Once the WebLinkGraph is built, the trie is "undocked" from the list. This allows the (substantial) memory taken by the URL strings to be regained, at the cost of no longer being able to identify a WebNode by its document URL.
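
The docking idea can be sketched in a few lines. This is an illustration under stated assumptions, not the MarkovPR implementation: DemoNode and DemoBuilder are hypothetical names, and a std::map stands in for the Trie. While docked, links are keys into the URL store; undocking rewrites them as direct pointers, counts the dangling links it drops, and releases the URL store.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-ins; none of these names come from graphbuilder.h.
struct DemoNode {
    unsigned id;
    std::vector<std::ptrdiff_t> raw_links;  // docked: keys into the URL store
    std::vector<DemoNode*> links;           // undocked: direct node pointers
};

class DemoBuilder {
public:
    // Returns a stable key for a URL, inserting the URL if it is new.
    std::ptrdiff_t Key(const std::string& url) {
        auto it = keys_.find(url);
        if (it != keys_.end()) return it->second;
        std::ptrdiff_t k = static_cast<std::ptrdiff_t>(keys_.size());
        keys_.emplace(url, k);
        return k;
    }

    // Registers a document node under its URL.
    void AddNode(const std::string& url, DemoNode* node) {
        assert(node != nullptr);
        by_key_[Key(url)] = node;
    }

    // Rewrites every raw link as a pointer, drops links whose target was
    // never registered, and releases the URL store.
    // Returns the number of dangling links dropped.
    std::size_t Undock(std::vector<DemoNode*>& graph) {
        std::size_t dangling = 0;
        for (DemoNode* n : graph) {
            for (std::ptrdiff_t k : n->raw_links) {
                auto it = by_key_.find(k);
                if (it != by_key_.end()) n->links.push_back(it->second);
                else ++dangling;
            }
            n->raw_links.clear();
        }
        keys_.clear();    // the URL strings are gone from here on
        by_key_.clear();
        return dangling;
    }

    // False once the URL store has been released (or was never filled).
    bool CanLookupByURL() const { return !keys_.empty(); }

private:
    std::map<std::string, std::ptrdiff_t> keys_;
    std::map<std::ptrdiff_t, DemoNode*> by_key_;
};
```

After Undock() returns, the sketch can no longer map a URL to a node, which mirrors the trade-off described above.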

Definition at line 50 of file graphbuilder.h.


Constructor & Destructor Documentation

GraphBuilder::GraphBuilder (int smem, int jmem, int nmem, int lmem, bool sl)
 

This constructor allocates approximately (smem+jmem) MB for the trie and nmem MB for the nodetable.

Definition at line 30 of file graphbuilder.cc.

References URLComponents::Clear(), curdoc, curdoc_baseurl, doc_alias, doc_url, flags, graph, graph_is_docked, kint32max, kuint16max, linkset, Mb, nodetable, NULL, RawLinkSet, SimpleWebNodePtrHashTable, stats, trie, and urlfilter.

GraphBuilder::~GraphBuilder ()
 

Definition at line 72 of file graphbuilder.cc.

References leaftable, linkset, nodetable, and trie.


Member Function Documentation

void GraphBuilder::AddLeaf (ptrdiff_t key, LeafNodePtr leaf)
 

Definition at line 358 of file graphbuilder.cc.

References LeafNode::Date(), SimpleHashTable< LeafNodePtr >::Find(), SimpleHashTable< LeafNodePtr >::Insert(), leaftable, and stats.

Referenced by Talker::LoadLeaves().

const ptrdiff_t GraphBuilder::FindLeafNodeKey (const char * url)
 

Finds the leaftable key of an existing LeafNode from a URL. Returns -1 if not found.

Definition at line 347 of file graphbuilder.cc.

References URLFilter::CompressURL(), URLFilter::DeindexURL(), Trie::FindURL(), trie, and urlfilter.

Referenced by Talker::LoadLeaves().

WebNode * GraphBuilder::FindWebNode (const char * url)
 

Finds an existing WebNode from a URL. Returns NULL if not found.
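
The reference lists for FindWebNode() and FindLeafNodeKey() suggest a two-step lookup: the URL is first resolved to a key, and the key is then looked up in a hash table. The sketch below shows that pattern and its two failure values (-1 and a null pointer). It is a guess at the shape of the lookup, not the actual code: UrlIndex and LookupNode are hypothetical names, and std::unordered_map stands in for both the Trie and SimpleHashTable.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>

struct LookupNode { unsigned id; };  // hypothetical stand-in for WebNode

class UrlIndex {
public:
    // Registers a node under a URL; a URL that is already known is left untouched.
    void Add(const std::string& url, LookupNode* node) {
        assert(node != nullptr);
        std::ptrdiff_t key = static_cast<std::ptrdiff_t>(keys_.size());
        if (keys_.emplace(url, key).second) nodes_[key] = node;
    }

    // Step 1: URL -> key, or -1 when the URL was never inserted.
    std::ptrdiff_t FindKey(const std::string& url) const {
        auto it = keys_.find(url);
        return it == keys_.end() ? -1 : it->second;
    }

    // Step 2: key -> node, or nullptr when either step fails.
    LookupNode* FindNode(const std::string& url) const {
        std::ptrdiff_t key = FindKey(url);
        if (key < 0) return nullptr;
        auto it = nodes_.find(key);
        return it == nodes_.end() ? nullptr : it->second;
    }

private:
    std::unordered_map<std::string, std::ptrdiff_t> keys_;
    std::unordered_map<std::ptrdiff_t, LookupNode*> nodes_;
};
```

Callers in this sketch must test the result before using it, just as callers of FindWebNode() must test for NULL.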

Definition at line 335 of file graphbuilder.cc.

References URLFilter::CompressURL(), URLFilter::DeindexURL(), SimpleHashTable< WebNodePtr >::Find(), Trie::FindURL(), nodetable, NULL, trie, and urlfilter.

Referenced by Talker::BuildTags(), and Talker::LoadLeaves().

const char* GraphBuilder::FormatURL (const char * anurl, int anurl_len, ContentType * t) [inline]
 

Definition at line 72 of file graphbuilder.h.

References ContentType, curdoc_baseurl, URLFilter::FormatURL(), and urlfilter.

Referenced by GraphParseHandler::AddAnchor().

uint16 GraphBuilder::HighestDate () [inline]
 

Definition at line 94 of file graphbuilder.h.

References stats, and uint16.

Referenced by Talker::ProcessCommand().

uint32 GraphBuilder::HighestID () [inline]
 

Definition at line 89 of file graphbuilder.h.

References stats, and uint32.

uint16 GraphBuilder::LowestDate () [inline]
 

Definition at line 92 of file graphbuilder.h.

References stats, and uint16.

Referenced by Talker::ProcessCommand().

uint32 GraphBuilder::LowestID () [inline]
 

Definition at line 87 of file graphbuilder.h.

References stats, and uint32.

Referenced by Talker::ProcessCommand().

const char * GraphBuilder::NodeGetAlias ()
 

Definition at line 246 of file graphbuilder.cc.

References doc_alias.

Referenced by Ripper::RipRepository().

const char * GraphBuilder::NodeGetAlias_ ()
 

Definition at line 238 of file graphbuilder.cc.

References aliasurl__.

Referenced by Ripper::RipRepository().

const uint16 GraphBuilder::NodeGetDate ()
 

Definition at line 255 of file graphbuilder.cc.

References curdoc, WebNode::Date(), and uint16.

Referenced by Ripper::RipRepository().

const uint32 GraphBuilder::NodeGetID ()
 

Definition at line 250 of file graphbuilder.cc.

References curdoc, WebNode::ID(), and uint32.

Referenced by Ripper::RipRepository().

const char * GraphBuilder::NodeGetURL ()
 

Definition at line 242 of file graphbuilder.cc.

References doc_url.

Referenced by GraphParseHandler::AddAnchor(), and Ripper::RipRepository().

const char * GraphBuilder::NodeGetURL_ ()
 

Definition at line 234 of file graphbuilder.cc.

References docurl__.

Referenced by Ripper::RipRepository().

URLComponents* GraphBuilder::NodeGetURLParts () [inline]
 

Definition at line 67 of file graphbuilder.h.

References curdoc_baseurl.

void GraphBuilder::NodeInitialize (uint32 idno)
 

Creates a new WebNode on the heap and assigns curdoc as its handle.

Definition at line 109 of file graphbuilder.cc.

References URLComponents::Clear(), curdoc, curdoc_baseurl, doc_alias, doc_url, stats, and uint32.

Referenced by Ripper::RipRepository().

void GraphBuilder::NodeInsertLinks ()
 

Inserts the anchor links contained in linkset into curdoc's fromlinks array.

Note that at this stage, all the links are pointer differences into the trie.
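
To make "pointer differences" concrete, here is a minimal sketch, assuming a flat character buffer as a hypothetical stand-in for the trie's storage (UrlArena is an illustrative name, and the real Trie layout is not documented here). A link is stored as the offset of the target string from the start of the buffer rather than as a pointer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical arena standing in for the trie's storage.
class UrlArena {
public:
    explicit UrlArena(std::size_t capacity) : buf_(capacity), used_(0) {}

    // Appends a NUL-terminated copy of url and returns its offset from the
    // start of the arena, or -1 if there is not enough room left.
    std::ptrdiff_t Insert(const char* url) {
        std::size_t n = std::strlen(url) + 1;
        if (n > buf_.size() - used_) return -1;
        char* dst = buf_.data() + used_;
        std::memcpy(dst, url, n);
        used_ += n;
        return dst - buf_.data();  // a pointer difference, not a pointer
    }

    // Turns a stored offset back into a string.
    const char* At(std::ptrdiff_t key) const {
        assert(key >= 0 && static_cast<std::size_t>(key) < used_);
        return buf_.data() + key;
    }

private:
    std::vector<char> buf_;  // fixed size: never grows after construction
    std::size_t used_;
};
```

Offsets of this kind are only meaningful while the backing storage exists, which is why links in this form still depend on the trie.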

Definition at line 158 of file graphbuilder.cc.

References curdoc, WebNode::InsertRawLinks(), and linkset.

Referenced by Ripper::RipRepository().

void GraphBuilder::NodeLaunch ()
 

Places the WebNode handled by curdoc into the graph and clears curdoc.

The WebNode is pushed onto the front of the graph. Because node IDs are assigned sequentially in increasing order, the graph holds its nodes in decreasing ID order. This ordering must not be disturbed, because Talker relies on it.
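
A minimal sketch of why front insertion yields decreasing IDs, assuming a std::forward_list as a hypothetical stand-in for WebLinkGraph (LaunchOrder is an illustrative name, not part of the class):

```cpp
#include <cassert>
#include <forward_list>
#include <vector>

// Launches `count` nodes with IDs first_id, first_id + 1, ... and returns
// the IDs in the order a front-to-back traversal of the graph sees them.
std::vector<unsigned> LaunchOrder(unsigned first_id, unsigned count) {
    // The sketch assumes the ID range does not wrap around.
    assert(count == 0 || first_id + (count - 1) >= first_id);
    std::forward_list<unsigned> graph;
    for (unsigned i = 0; i < count; ++i)
        graph.push_front(first_id + i);  // the newest, largest ID goes first
    return std::vector<unsigned>(graph.begin(), graph.end());
}
```

Launching IDs 10, 11, 12 in that order leaves the traversal order 12, 11, 10.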

Definition at line 131 of file graphbuilder.cc.

References URLComponents::Clear(), curdoc, curdoc_baseurl, WebNode::Date(), doc_alias, doc_url, graph, linkset, NULL, and stats.

Referenced by Ripper::RipRepository().

void GraphBuilder::NodeSetDate (unsigned short aDate)
 

Sets curdoc's date.

Definition at line 150 of file graphbuilder.cc.

References curdoc, and WebNode::SetDate().

Referenced by GraphParseHandler::AddHeader().

void GraphBuilder::NodeSetURL (const char * docurl, const char * aliasurl)
 

Copies the current document's URL into another_url, and places it into the trie.

The copy is necessary so that later calls to FormatURL() can use another_url to complete anchor link URLs, whenever those are incomplete.
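
As a rough illustration of what "completing" an incomplete anchor URL against the document's URL can mean, consider the sketch below. It is a deliberate simplification and not URLFilter::FormatURL(): BaseURL and CompleteURL are hypothetical names, the struct only loosely mirrors the scheme, netloc, and path fields of URLComponents, and the code ignores "..", queries, and fragments.

```cpp
#include <cassert>
#include <string>

// Hypothetical, loosely modeled on the scheme/netloc/path fields.
struct BaseURL {
    std::string scheme;  // e.g. "http"
    std::string netloc;  // e.g. "example.org"
    std::string path;    // e.g. "/docs/index.html"
};

// Completes href against base using three simplified cases.
std::string CompleteURL(const BaseURL& base, const std::string& href) {
    assert(!base.netloc.empty());  // a base without a host cannot complete anything
    if (href.find("://") != std::string::npos) return href;   // already absolute
    const std::string root = base.scheme + "://" + base.netloc;
    if (!href.empty() && href[0] == '/') return root + href;  // host-relative
    // Document-relative: keep the directory part of the base path.
    const std::string::size_type slash = base.path.rfind('/');
    const std::string dir =
        (slash == std::string::npos) ? "/" : base.path.substr(0, slash + 1);
    return root + dir + href;
}
```

The point of keeping a copy of the document URL is visible in the last two cases: neither can be resolved from the anchor text alone.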

Definition at line 167 of file graphbuilder.cc.

References aliasurl__, URLFilter::CompressURL(), ContentType, curdoc, curdoc_baseurl, URLFilter::DeindexURL(), doc_alias, doc_url, docurl__, SimpleHashTable< WebNodePtr >::Find(), flags, URLFilter::FormatURL(), SimpleHashTable< WebNodePtr >::Insert(), Trie::InsertURL(), URLComponents::netloc, nodetable, NULL, URLComponents::params, URLFilter::ParseURL(), URLComponents::path, URLComponents::query, URLComponents::scheme, stats, trie, and urlfilter.

Referenced by GraphParseHandler::NewDocument().

void GraphBuilder::SetupLeafTable ()
 

Definition at line 351 of file graphbuilder.cc.

References SimpleHashTable< LeafNodePtr >::Clear(), flags, leaftable, Mb, and SimpleLeafNodePtrHashTable.

Referenced by Talker::LoadLeaves().

void GraphBuilder::StatisticsGraph (ostream & o)
 

Definition at line 95 of file graphbuilder.cc.

References graph, and stats.

Referenced by Talker::PrintStatisticsGraph().

void GraphBuilder::StatisticsMem (ostream & o)
 

Definition at line 80 of file graphbuilder.cc.

References MemoryPooled< WebNodeStruct >::FreeBlocks(), MemPool< S >::FreeBlocks1(), MemPool< S >::FreeBlocks2(), nodetable, SimpleHashTable< WebNodePtr >::Size(), Trie::Statistics(), stats, and trie.

Referenced by Talker::PrintStatistics(), and Ripper::PrintStatistics().

void GraphBuilder::TrieInsertLinkURL (const char * url)
 

Definition at line 371 of file graphbuilder.cc.

References Trie::bigs, URLFilter::CompressURL(), URLFilter::DeindexURL(), Trie::InsertURL(), linkset, trie, and urlfilter.

Referenced by GraphParseHandler::AddAnchor().

WebLinkGraph * GraphBuilder::UndockWebGraph ()
 

Walks through all the nodes and rewrites their links so that they no longer require the trie.

Definition at line 261 of file graphbuilder.cc.

References graph, graph_is_docked, nodetable, and stats.

Referenced by Ripper::PublishWebGraph().

void GraphBuilder::UpdateLeafLinks ()
 

Definition at line 310 of file graphbuilder.cc.

References graph, leaftable, and stats.

Referenced by Talker::LoadLeaves().


Member Data Documentation

const char* GraphBuilder::aliasurl__ [private]
 

Definition at line 111 of file graphbuilder.h.

Referenced by NodeGetAlias_(), and NodeSetURL().

uint32 GraphBuilder::cumulative_dangling [private]
 

Definition at line 130 of file graphbuilder.h.

uint32 GraphBuilder::cumulative_fromlinks [private]
 

Definition at line 128 of file graphbuilder.h.

uint32 GraphBuilder::cumulative_leaflinks [private]
 

Definition at line 129 of file graphbuilder.h.

uint32 GraphBuilder::cumulative_tolinks [private]
 

Definition at line 127 of file graphbuilder.h.

WebNode* GraphBuilder::curdoc [private]
 

Definition at line 104 of file graphbuilder.h.

Referenced by GraphBuilder(), NodeGetDate(), NodeGetID(), NodeInitialize(), NodeInsertLinks(), NodeLaunch(), NodeSetDate(), and NodeSetURL().

URLComponents GraphBuilder::curdoc_baseurl [private]
 

Definition at line 106 of file graphbuilder.h.

Referenced by FormatURL(), GraphBuilder(), NodeGetURLParts(), NodeInitialize(), NodeLaunch(), and NodeSetURL().

char GraphBuilder::doc_alias[STRINGBUF_LEN2+1] [private]
 

Definition at line 108 of file graphbuilder.h.

Referenced by GraphBuilder(), NodeGetAlias(), NodeInitialize(), NodeLaunch(), and NodeSetURL().

char GraphBuilder::doc_url[STRINGBUF_LEN2+1] [private]
 

Definition at line 107 of file graphbuilder.h.

Referenced by GraphBuilder(), NodeGetURL(), NodeInitialize(), NodeLaunch(), and NodeSetURL().

const char* GraphBuilder::docurl__ [private]
 

Definition at line 110 of file graphbuilder.h.

Referenced by NodeGetURL_(), and NodeSetURL().

struct { ... } GraphBuilder::flags
 

Referenced by GraphParseHandler::AddAnchor(), GraphBuilder(), NodeSetURL(), and SetupLeafTable().

WebLinkGraph* GraphBuilder::graph [private]
 

Definition at line 115 of file graphbuilder.h.

Referenced by GraphBuilder(), NodeLaunch(), StatisticsGraph(), UndockWebGraph(), and UpdateLeafLinks().

bool GraphBuilder::graph_is_docked [private]
 

Definition at line 116 of file graphbuilder.h.

Referenced by GraphBuilder(), and UndockWebGraph().

uint32 GraphBuilder::heap_used_webnodes [private]
 

Definition at line 126 of file graphbuilder.h.

uint16 GraphBuilder::highdate [private]
 

Definition at line 136 of file graphbuilder.h.

uint32 GraphBuilder::highid [private]
 

Definition at line 134 of file graphbuilder.h.

SimpleLeafNodePtrHashTable* GraphBuilder::leaftable [private]
 

Definition at line 120 of file graphbuilder.h.

Referenced by AddLeaf(), SetupLeafTable(), UpdateLeafLinks(), and ~GraphBuilder().

int GraphBuilder::leaftable_memory
 

Definition at line 99 of file graphbuilder.h.

RawLinkSet* GraphBuilder::linkset [private]
 

Definition at line 123 of file graphbuilder.h.

Referenced by GraphBuilder(), NodeInsertLinks(), NodeLaunch(), TrieInsertLinkURL(), and ~GraphBuilder().

uint16 GraphBuilder::lowdate [private]
 

Definition at line 135 of file graphbuilder.h.

uint32 GraphBuilder::lowid [private]
 

Definition at line 133 of file graphbuilder.h.

SimpleWebNodePtrHashTable* GraphBuilder::nodetable [private]
 

Definition at line 119 of file graphbuilder.h.

Referenced by FindWebNode(), GraphBuilder(), NodeSetURL(), StatisticsMem(), UndockWebGraph(), and ~GraphBuilder().

uint32 GraphBuilder::nodetable_alias_insertions [private]
 

Definition at line 132 of file graphbuilder.h.

uint32 GraphBuilder::nodetable_insertions [private]
 

Definition at line 131 of file graphbuilder.h.

bool GraphBuilder::show_links
 

Definition at line 98 of file graphbuilder.h.

struct { ... } GraphBuilder::stats [private]
 

Referenced by AddLeaf(), GraphBuilder(), HighestDate(), HighestID(), LowestDate(), LowestID(), NodeInitialize(), NodeLaunch(), NodeSetURL(), StatisticsGraph(), StatisticsMem(), UndockWebGraph(), and UpdateLeafLinks().

Trie* GraphBuilder::trie [private]
 

Definition at line 118 of file graphbuilder.h.

Referenced by FindLeafNodeKey(), FindWebNode(), GraphBuilder(), NodeSetURL(), StatisticsMem(), TrieInsertLinkURL(), and ~GraphBuilder().

URLFilter* GraphBuilder::urlfilter [private]
 

Definition at line 121 of file graphbuilder.h.

Referenced by FindLeafNodeKey(), FindWebNode(), FormatURL(), GraphBuilder(), NodeSetURL(), and TrieInsertLinkURL().


Generated on Wed May 29 11:37:25 2002 for MarkovPR by doxygen 1.2.15