Appeared in Journal of Web Engineering , Vol. 13, No. 56, ISSN 1540-9589, pp. 507-524, 201 4 [613914]

Appeared in Journal of Web Engineering , Vol. 13, No. 5&6, ISSN 1540-9589, pp. 507-524, 201 4

WEB PAGE PREDICTION ENHANCED WITH CONFIDENCE MECHAN ISM
ARPAD GELLERT
Computer Science and Electrical Engineering Departm ent, Lucian Blaga University of Sibiu, Romania
[anonimizat]
ADRIAN FLOREA
Computer Science and Electrical Engineering Departm ent, Lucian Blaga University of Sibiu, Romania
[anonimizat]

In this work we comparatively present and evaluate different prediction techniques used to anticipate and
prefetch web pages and files accessed via browsers. The goal is to reduce the delays necessary to load the
web pages and files visited by the users. We have i ncluded into our analysis Markov chains, Hidden
Markov Models and graph algorithms. We have enhance d all these predictors with confidence mechanism
which classifies dynamically web pages as predictab le or unpredictable. A prediction is generated only if
the confidence counter attached to the current web page is in a predictable state, improving thus the
accuracy. Based on the results we have also develop ed a hybrid predictor consisting in a Hidden Markov
Model and a graph-based predictor. The experiments show that this hybrid predictor provides the best
prediction accuracies, an average of 85.45% on the “Ocean Group Research” dataset from the University
of Boston and 87.28% on the dataset collected from the educational web server of our university, being
thus the most appropriate to efficiently predict an d prefetch web pages.
Key words : Web page prediction, prefetching, Markov chains, Hi dden Markov Models, graph
algorithms, browser extension .
Communicated by : (to be filled by the JWE editorial)

1 Introduction
Nowadays the access to the Internet becomes more pr evalent worldwide and often there are requests for
higher bandwidths. According with Yahoo [2], today, there are over 130 million Web servers and the
Web is the largest data repository (estimated to 10 0 billion pages). By the end of 2011 the Internatio nal
Data Corporation predicted that the amount of infor mation in the world will exceed 1.8 zettabytes (2 70 ).
The amount of archived web pages exceeds two petaby tes (2 50 ) and it increases at a rate of 20 terabytes
(2 40 ) each month. These statistics are expressed in oth er words by Eric Schmidt, the executive chairman
of Google, namely, the digital world “now produces as much information in two days as it did all the

2 Web Page Prediction Enhanced with Confidence Mechan ism

way along from the dawn of man up to the year 2003” [25]. Also, current economic companies rely on
technology and Internet communications, and informa tion came to travel and appear at a fantastic rate.
However, only a small part of the information conve yed is relevant to the user at a time. Therefore,
creating a user profile based on recent history of visited pages can reduce waiting times.
Clients are frequently confronted with delays in ac cessing web pages, especially those ones that are
limited by low-bandwidth modems or wireless routers . Many latency-tolerant techniques have been
developed during the last years, the most important being caching and prefetching. Caching exploits th e
temporal locality principle by keeping the accessed pages and files in a cache structure, whereas
prefetching anticipates the next accesses and loads the corresponding web pages or files into the cach e.
If the user accesses a web page or a file which is available in the cache, the browser can load it wit hout
any delays. Thus, when the users have long browsing sessions, prediction-based prefetching can be
very effective by minimizing the access latencies.
In this paper we analyze comparatively different we b page prediction techniques based on Markov
chains, Hidden Markov Models (HMM) and graphs, all of them enhanced with a confidence
mechanism. We propose also a hybrid predictor. The general applicability of the techniques considered
confers a greater portability to our methods. We ev aluate the mentioned predictors in terms of
prediction accuracy. The goal is to find the most a ppropriate prediction technique for anticipating an d
prefetching web pages and to integrate it as an ext ension into browsers.
The organization of the rest of this paper is as fo llows. In Section 2, we are reviewing the related
work in web page prefetching and also in prediction algorithms. Section 3 is describing our proposed
web page predictors. In Section 4 we are presenting the experimental results. Conclusions and further
work directions are included in Section 5.
2 Related Work
In our previous works [30, 11] we already implement ed multi-layer perceptrons, Markov chains and
HMMs for next location prediction, to anticipate th e movements of employees within an office building
by predicting the next rooms based on the history o f visited rooms. The goal of the research was to
design some smart doorplates that are able to direc t visitors to the current location of an office own er
based on a location-tracking system and predict if the office owner is coming back soon. These
predictors were evaluated on some movement sequence s of real persons, acquired from the Smart
Doorplates project developed at the University of A ugsburg [23]. The experimental results showed that
the accuracy in the next location prediction reache s up to 92%.
In his work [24] Rabiner showed how HMMs can be app lied in speech recognition. The author
presented the theory of HMMs from the simplest conc epts (discrete Markov chains) to the most
sophisticated models (variable duration, continuous density models, etc.). He also illustrated some
applications of the theory of HMMs to simple proble ms in speech recognition, and pointed out how the
techniques were applied to more advanced speech rec ognition problems. In [31] we used, among other
metrics, a HMM-based prediction algorithm to evalua te the random degrees of some difficult to predict

A. Gellert and A. Florea 3

branches and we showed that these branches have int rinsic random behavior, being generated by very
complex program structures. In [16] the authors use d HMMs for semantic-based web page prefetching.
In contrast, we do not use web semantics, we predic t web pages based only on the history of links,
exploiting pattern repetition. We also use a confid ence mechanism in order to increase accuracy.
In [14] the authors proposed a continuous-time Mark ov model for web page prediction. They used
Kolmogorov’s backward equations to compute the tran sition probabilities and to predict which web
page will be visited by a certain user and when. Th eir method can also estimate the total number of
visits to a web page by all the people at a certain interval. Link prediction based on Markov chains w as
presented also in [33, 7, 26]. In [17] the author a pplied the Markov model together with the k-nearest
neighbor classifier algorithm to enhance the perfor mance of traditional Markov chains in predicting
web pages. He obtained lower consumed memory, quite similar build time and evaluation time and
higher accuracy. In [18, 19] clustering, associatio n rules and lower order Markov models were
combined providing an improved prediction accuracy and state space complexity. Clustering is used to
group homogeneous user sessions, low order Markov c hains are built on these clustered sessions and
when Markov models cannot make clear predictions th e association rules are used. The proposed
integrated model has less state complexity and is m ore accurate than a higher order Markov model. In
[20] the authors presented a survey of web page ran king for web personalization. They concluded that
low order Markov models have higher accuracy and lo wer coverage, whereas the higher order models
have a number of limitations associated with: highe r state complexity, reduced coverage and sometimes
even worse prediction accuracy.
Graphs were also used for prediction through some s pecific algorithms. In [15] Huang explored link
prediction based on graph topology. Hasan addressed link prediction in social networks by using
supervised learning and classification algorithms su ch as decision tree, multi-layer perceptron, suppor t
vector machines [13]. His results were reported on BIOBASE and DBLP, two datasets that have
information about different research publications i n biology and computer science, respectively. Unlik e
Hasan’s work, we have simulated on the “Ocean Group Research” benchmarks from the University of
Boston (BU) [6] and on the dataset collected from t he educational web server of “Lucian Blaga”
University of Sibiu (LBUS) and our algorithms are n ot applied strictly on social networks but are
useful for predicting any web pages. We have also u sed a confidence mechanism in order to increase
accuracy.
Another approach in link prediction of social netwo rks was proposed in [22]. The author’s solution
aims at predicting links based on weighted proximit y measures of social networks relying on the
structural properties (topology) of a given network . There are some similarities between our work and
[22], namely: the prediction is based on weighted g raphs and we reached the same conclusion that
considering the recently used link can be more appr opriate in next page prediction. However, the
calculation of weighted graphs and the data sets ar e different. Also, the solution from [22] has a
particular character being suitable only for open a nd dynamic online social networks.
In [12] the authors presented a web page prediction method based on conditional random fields,
used for labeling and segmenting sequential data. C onditional random fields have the ability to model

4 Web Page Prediction Enhanced with Confidence Mechan ism

the dependencies among observations and they can al so incorporate various features from observation
sequences to increase the prediction accuracy. They concluded that conditional random fields
outperformed Markov chains and HMMs but for a great number of labels their training may become
very expensive and even intractable. Their results also showed that the second order Markov chains and
HMMs are better than the first order ones. In contr adiction, our experiments show that the first order
Markov chains and HMMs outperform the second order ones. An explanation for this difference can be
the different data sets, as they used msnbc , whereas we used the BU and LBUS datasets.
In [10] the authors proposed a hybrid web page pred iction method which combines support vector
machines, association rule mining and Markov chains in order to enhance the efficiency. Their
experimental results showed that the proposed hybri d predictor outperformed the individual predictors.
In [28] the authors presented an n-gram based model to utilize path profiles of users from very large
web log to predict future requests. They showed tha t the proposed method achieves a reasonable
balance between precision and applicability.
In [4] the authors proposed a page rank algorithm t o predict the next page that will be visited by a
web surfer. For the first set of sessions they appl ied the page rank algorithm which provides the rank s
for web pages. For each web page their method deter mines to which pages the user can navigate and,
using the page ranks, it computes the probability o f visiting them by dividing each rank to the sum of
ranks, and the number of links to the total number of links, respectively. The authors evaluated the
proposed algorithm on the NASA dataset with an accu racy improvement from 19.75% to 32.5% and
also on a log file from a commercial web site with an accuracy improvement from 74.66% to 77.77%.
In [3] Canali et al. proposed adaptive algorithms t hat combine predictive and social information and
dynamically adjust their parameters according to th e continuously changing workload characteristics.
Their experimental results showed that such adaptiv e algorithms can achieve performance close to
theoretical ideal algorithms.
Some of the algorithms proposed in the literature c onsider a training period before making
predictions. The training period can improve or eve n decrease performance, since, if it is too long, i t
can involve a high amount of resources. In [9] the authors analyzed how this training affects the
prediction accuracy.
In [32] Wan et al. proposed an approach based on a vector space model, called random indexing, for
the discovery of the intrinsic characteristics of w eb user activities. The underlying factors are then used
for clustering individual user navigational pattern s. The clustering results are used to predict and
prefetch web requests for grouped users.
In [29] Temgire et al. presented a review on web pr efetching techniques. The huge variety of
prefetching algorithms hinder to a certain extent t he performance evaluation of new techniques and
their correct comparison with existing methods. The refore, in [8] the authors proposed a free
environment for implementation and efficient evalua tion of prefetching techniques. Their framework
combines real and simulated behavior of web users a nd servers.

A. Gellert and A. Florea 5

3 Prediction of Web Pages
The general prediction mechanism consists in antici pating future contexts based on current and
previous context information, recovering the correc t context if the speculation fails and updating the
predictor to improve future prediction accuracy. Pr ediction can be very useful if the availability of
some data in advance allows to reduce waiting times , improving thus the efficiency. Obviously, the
prediction must be accurate, because in some applic ations a misprediction leads to certain costs due t o
the necessity of correct state recovery.
For predicting or anticipating future situations, s ome learning techniques like Markov chains,
HMMs, graph algorithms, bayesian networks, time ser ies or neural networks are obvious candidates.
The challenge is to adapt such algorithms to work w ith context information.
As Figure 1 illustrates, our proposed browser exten sion, when activated, collects the links accessed
by the user which are then preprocessed: each link is codified with a unique number, ports and relativ e
parts are eliminated from links, if there are two c onsecutive accesses of the same link only one is
considered, links having the extension .gif and .xbm , which are usually small images, are also
eliminated. The browser extension keeps a certain h istory of the accessed links. When the current link
is accessed, the next link is predicted on the basi s of the history of the previously visited links. T he
predicted web page or file is prefetched in the cac he in order to be available if the user accesses it .
………..
www….
www….
www….
……….. Client
Data
Collector Preprocessor Predictor Prefetcher Browser Extension ………..
www….
www….
www….
……….. Client
Data
Collector Preprocessor Predictor Prefetcher Browser Extension

Figure 1. The structure of the application.
In order to improve the prediction accuracy, the pr edictor can be enhanced with a confidence
mechanism which consists in saturating counters, at tached to all the links kept in the history, which are
incremented on correct prediction and decremented o n misprediction. A prediction is generated only if
the confidence counter of the current link is in a predictable state. A possible confidence automaton is
presented in Figure 2.

6 Web Page Prediction Enhanced with Confidence Mechan ism

-1
Unpredictable 0
Predictable
Misprediction Misprediction Correct Prediction
Correct Prediction
-2
Unpredictable
Misprediction Correct Prediction
1
Predictable
Misprediction Correct Prediction
-1
Unpredictable 0
Predictable
Misprediction Misprediction Misprediction Correct Prediction
Correct Prediction Correct Prediction
-2
Unpredictable
Misprediction Correct Prediction
1
Predictable
Misprediction Correct Prediction

Figure 2. Confidence automaton with 4 states: 2 unp redictable and 2 predictable.
By using such a confidence mechanism, the number of predictions is lower but the prediction accuracy
is significantly higher.
The next sections tackle with the way of we have ad apted Markov chains, HMMs and graph
algorithms to predict and prefetch web pages.
3.1 Markov Predictors
A first order discrete Markov process may be descri bed at any time as being in one of a set of N distinct
states }…, ,,{21 NSSSS= [24], as illustrated in Figure 3.
S1 S2
S3a12
a21 a22
a32 a23
a33 a13
a31 a11
S1 S2
S3a12
a21 a22
a32 a23
a33 a13
a31 a11

Figure 3. A Markov chain with 3 states.
The full probabilistic description of a discrete Ma rkov chain requires the specification of the curren t
state as well as all the predecessor states (the cu rrent state in a sequence depends on all the previo us
states). For the special case of a discrete, first order, Markov chain, this probabilistic description is
truncated to just the current and the predecessor s tate (the current state depends only on the previou s
state): ] […] , , [1 2 1 i tjt k ti tjt SqSqP SqSqSqP = == = = =− − − , where tq is the state at time t.
Thus, for a first order Markov chain with N states, the set of transition probabilities betwee n states Si

A. Gellert and A. Florea 7

and Sj is }{ij aA=, where ] [1itjt ij SqSqPa = ==−, Nji≤≤,1 , having the properties 0≥ij a and
1
1=∑
=N
jij a. For a Markov chain of order R, the probabilistic description is truncated to the current and
R previous states (the current state depends on R previous states).
The Markov chains can be used to predict the next v alue in a sequence based on a particular stored
pattern. The prediction supposes searching for the context within the value sequence and determining
which value followed the context with the highest f requency, that value being the predicted one. In a
Markov chain of order R, the context consists in the last R values of the sequence, and, therefore, the
search is done using this pattern of R values. A transition-table-based implementation is also possible,
but it is inefficient for a high number of observat ion symbols. Theoretically, Markov chains can be us ed
to predict any stochastic repetitive sequences.
The following example shows the necessity of using superior order Markov models. If the sequence
of states is AAABCAAABCAAA, the Markov models of or der 1 and 2 mispredict A, and only a
Markov model of order 3 predicts the next state B c orrectly.
In our multi-page prediction approach we prefetch a ll the pages that appeared in the history after the
considered context. The predictor can be enhanced w ith confidence mechanism (a prediction is
generated only if the confidence counter of the cur rent link is in a predictable state, see Figure 2).
3.2 HMM-Based Predictors
For the prediction of the next accessed web page we consider HMMs like those developed in [24, 11].
A HMM is a doubly embedded stochastic process with a hidden stochastic process that can only be
observed through another set of stochastic processe s that generate the sequence of observable symbols.
A generic HMM is illustrated in Figure 4, where qt is the hidden state at time t, Ot is the observation at
time t, A is the matrix of transition probabilities between hidden states, and B is the matrix of
observation probabilities within each hidden state.
q1q2q3 qTA A A A
B B B B
O1O2 OT O3Hidden State Sequence (Q):
Observation Sequence (O): q1q2q3 qTAA AA AA AA
BB BB BB BB
O1O2 OT O3Hidden State Sequence (Q):
Observation Sequence (O):
Figure 4. Hidden Markov Model.
HMM predictors are very powerful adaptive stochasti c models. Our hypothesis is that HMMs could
compensate lack of relevant information through its underlying stochastic process that is not
observable. A superior order HMM consists of the fo llowing elements [11]:

8 Web Page Prediction Enhanced with Confidence Mechan ism

R – the order of HMM (a combination of R primitive hidden states form a so called super-sta te).
N – the number of primitive hidden states (belongin g to a HMM of order 1), with
}…, ,,{21RNSSSS= being the set of hidden super-states and Sqt∈the hidden super-state at
time t. The current super-state determines the transition into the next one based on a super-state
transition matrix with restrictions [11]. N will be varied in order to obtain the optimal valu e.
M – the number of observable states, with }…, ,,{21 MVVVV= the set of observable states
(symbols), and tO the observable state at time t.
A = }{ij a – the transition probabilities between the hidden super-states iS and jS, where
R
itj t ij Nji SqSqPa ≤≤ ===+ ,1], [1 .
B = )} ({kbj – the probabilities of the observable states kV, considering the current hidden super-state
jS, where ] [)(jtkt j SqVOPkb === ,RNj≤≤1 , Mk≤≤1 .
π = }{iπ– the initial hidden super-state probabilities, whe re ][1i i SqP==π , RNi≤≤1 .

The following variables are also defined [11, 24]:

) ,… () (21 λ αitt t SqOOOPi = = – the forward variable, representing the probabili ty of the partial
observation sequence until time t, and hidden state iS at time t, given the model ),,(π λBA= .
), … () (21 λ βitT tt t SqOOOPi = =++ – the backward variable, representing the probabil ity of the
partial observation sequence from t+ 1 to the end T, given the hidden state iS at time t and the
model ),,(π λBA= .
),… ,(), (21 1 λ ξT j tit t OOOSqSqPji = ==+ – the probability of being in hidden state iS at time t,
and in hidden state jS at time t+ 1, given the model ),,(π λBA= and the observation sequence.
),… () (21 λ γT it t OOOSqPi == – the probability of being in hidden state iS at time t, given the
model ),,(π λBA= and the observation sequence.
H – the history (the number of observations used in the prediction process). In [24] and [27] the enti re
observation sequence is used in the prediction proc ess ( H=T ), but in some practical applications
the observation sequence increases continuously, th erefore its limitation is necessary [11].
I – the maximum number of iterations in the adjust ment process. Usually the adjustment process ends
when the probability of the last H observations does not increase anymore, but for a faster
adjustment, the number of iterations is limited, as in [11].

The HMM-based prediction algorithm consists in the following steps:

A. Gellert and A. Florea 9

1) T=H (T is the length of the observation sequence);
2) c=0 (c is the number of current iteration, its max imum value is given by I);
3) Initialize ),,(π λBA= ;
4) Compute 1…, , 0, 1 …, , 0,…, , 1), (), , (), (), ( − =− = =R R
t tt t Nj NiTtijiii γξβα ;
5) Adjust the model ),,(π λBA= ;
6) c=c+1;
7) If )(λOP increases and c<I, go to 4.);
8) At time T, the next observation symbol 1+TO is predicted, using the adjusted model
),,(π λBA= :
8.1) choose the current hidden state iS, by maximizing ) (iTα, 1…, , 0− =RNi ;
8.2) choose the next hidden state jS, by maximizing ij a,
1 ) mod (…, ,) mod (1 1−+⋅ ⋅ =− −NNNiNNijR R;
8.3) predict the next symbol kV, by maximizing )(kbj, 1…, , 0− =Mk .

Before a prediction, the model ),,(π λBA= is randomly initialized, as shown in [11], and aft er that it
is repeatedly adjusted based on the last H observations T HTHT O OO …, ,,2 1 +−+− (the entire observation
sequence if H=T ), in order to increase the probability of the obse rvation sequence
)… (2 1 λT HTHT O OOP+−+− . In this work we use the Baum-Welch iterative algo rithm introduced in [1]
– identical with the Expectation Maximization (EM) method for this particular problem – which
improves iteratively an initial model. The adjusted model ),,(π λBA= is then used to predict the next
observation symbol 1+TO.
In the multi-page prediction approach we prefetch t he links whose probability is higher than 1/M.
The predictor can be enhanced with confidence mecha nism (like that from Figure 2).
3.3 Graph-Based Predictors
The graphs are data structures used in many types o f applications to model different processes. A grap h
consists in vertices which can be connected by edge s. We denote a graph G=(V, E), where V is the set
of vertices and E is the set of edges. We also deno te a path P={v 1, v 2, …, v k}, where (v i, v i+1 ) is an
edge, 1, 1−=ki .

10 Web Page Prediction Enhanced with Confidence Mechan ism

1
2
34 www.ulbsibiu.ro
www.csac.ro
webmail.ulbsibiu.ro acaps.ulbsibiu.ro
31 26
17 1
2
34 www.ulbsibiu.ro
www.csac.ro
webmail.ulbsibiu.ro acaps.ulbsibiu.ro
31 26
17

Figure 5. Modeling web page accesses using graphs.

As Figure 5 depicts, in our application we use undi rected weighted graphs whose vertices represent the
accessed links, an edge between two vertices meanin g at least one transition from a link to another,
whereas the weights are keeping the number of trans itions.
Three graph algorithms have been analyzed from the web page prediction perspective. The weight-
based and Dijkstra algorithms are statistical centr ic whereas the common-neighbors-based algorithm is
topological centric. These algorithms can be enhanc ed with confidence mechanism (see Figure 2).
The weight-based algorithm predicts the next link a s being the vertex whose edge with the current
vertex has the highest weight. In the multi-page pr ediction approach all the adjacent vertices are
prefetched. The multi-page prediction algorithm, en hanced with the confidence mechanism, is
presented below:

c = current vertex (link)
if c.isPredictable() then
predict c.getPairs()
update()

where getPairs returns a list with all the adjacent vertices and update adapts the confidence of the
current link (by incrementing it on correct predict ion or decrementing it on misprediction) and adds t he
new link to the graph.
The Dijkstra-based algorithm determines the highest path (having the highest sum of weights) and
predicts the vertices from that path. This path is determined on the grounds of the well-known Dijkstr a
algorithm, presented in [5], which finds the paths with the lowest costs between a given source vertex
and the other vertices from a weighted graph. Since we are interested in the highest path, we apply th e
Dijkstra algorithm with inversed weights ( 1−w). In the multi-page prediction approach all the ve rtices
from that path are prefetched. The multi-page predi ction algorithm, enhanced with the confidence
mechanism, is presented below:

A. Gellert and A. Florea 11

c = current vertex (link)
pmax = Dijkstra(c)
if c.isPredictable() then
predict pmax.getVertices()
update()

where getVertices returns a list with all the vertices from a path a nd update adapts the confidence of the
current link and adds the new link to the graph.
The common-neighbors-based algorithm [21] discovers the vertex having the highest number of
common neighbors with the current vertex and predic ts these common neighbors. The multi-page
prediction algorithm, enhanced with confidence mech anism, is given by the following pseudocode:

c = current vertex (link)
max = -1
foreach vertex v do
neighbors = c.commonNeighbors(v)
if neighbors.getSize() > max then
max = neighbors.getSize()
maxneighbors = neighbors
if c.isPredictable() then
predict maxneighbors.getVertices()
update()

where commonNeighbors returns the list of vertices which are common neig hbors for two given
vertices, getVertices returns all the vertices from a list and update adapts the confidence of the current
link and adds the new link to the graph.
3.4 Hybrid Predictors
Since we apply multi-page prediction, it is natural a hybrid prediction scheme which returns the
predictions of all the components. All these predic ted web pages are prefetched. In the next section w e
evaluate the predictors presented in the previous s ections and we use the most accurate predictors as
hybrid prediction components.
4 Experimental Results
The first step of our research is to comparatively analyze the proposed algorithms, from the predictio n
accuracy point of view, by varying their input para meters. The prediction accuracies presented in this
section have been obtained through multi-page predi ction. Such a study can be better highlighted on a
set of log files. Therefore, the above presented al gorithms have been implemented in Java and
evaluated on two datasets.

12 Web Page Prediction Enhanced with Confidence Mechan ism

The BU benchmark set was generated by the “Ocean Gr oup Research” from Boston University [6]
and consists in log files collected during 7 months on 37 workstations, spanning the timeframe from 21
November 1994 to 8 May 1995. Each log file name con tains a user ID number, the machine on which
the session took place and the Unix timestamp when the session started. Each line in a log corresponds
to a single URL requested by the user; it contains the machine name, the timestamp when the request
was made, the URL, the size of the document (includ ing the overhead of the protocol) and the object
retrieval time in seconds.
The evaluations have been performed also on the new er LBUS dataset collected from the
educational web server of “Lucian Blaga” University of Sibiu, during 31 October 2006 – 10 November
2006. The log file contains about 210,000 web acces ses, each line corresponding to an URL requested
by the user.
First, we have evaluated the Markov predictor on th e above mentioned datasets, by varying the
pattern size (R). We used a history of 1000 links a nd 4-state confidence counters. The results are
presented in Figure 6, where con112 – con99 are the BU log files.
020 40 60 80 100
con112 con133 con160 con172 con195 con221 con228 con316 con325 con337 con338 con347 con357 con405 con408
con57 con60
con603
con68 con71 con76 con99
Average BU
LBUS Benchmarks Prediction Accuracy [%] R=1
R=2
R=3

Figure 6. Web page prediction using Markov chains ( H = 1000, 4-state confidence).
As Figure 6 shows, the first order Markov chain (R= 1) was the most efficient on the evaluated data
sets, which is in concordance with [20]. However, o n some benchmarks (con172, con221, con57,
con60, con603 and con99) superior order Markov chai ns performed better. The average prediction
accuracy with a first order Markov chain was 80.07% on the BU dataset and 70.58% on LBUS.
Then, we have evaluated the HMM predictor on the sa me datasets, by varying the number of hidden
states (N). We used a first order HMM with a histor y of 1000 links and 4-state confidence counters.
The experimental results are presented in Figure 7.

A. Gellert and A. Florea 13

020 40 60 80 100
con112 con133 con160 con172 con195 con221 con228 con316 con325 con337 con338 con347 con357 con405 con408
con57 con60
con603
con68 con71 con76 con99
Average BU
LBUS Benchmarks Prediction Accuracy [%] N=1
N=2
N=3

Figure 7. Web page prediction using HMMs (H=1000, R =1, 4-state confidence).
As the results show, the first order HMM with one h idden state (N=1) performed better on all
benchmarks. The average prediction accuracy was 81. 77% on BU and 78.29% on the LBUS dataset.
We have evaluated also superior order HMMs (R=2, 3) , obviously with higher number of hidden states,
but the prediction accuracy was lower, which is con sistent with what is reported in [20].
Finally, we have evaluated the graph based predicto rs. Figure 8 shows comparatively the prediction
accuracies obtained with the weight-based predictor , the Dijkstra-based one and the common-
neighbors-based (topological) one, using a history of 1000 links and 4-state confidence counters.
020 40 60 80 100
con112 con133 con160 con172 con195 con221 con228 con316 con325 con337 con338 con347 con357 con405 con408
con57 con60
con603
con68 con71 con76 con99
Average BU
LBUS Benchmarks Prediction Accuracy [%] Weight-
Based
Dijkstra-
Based
Topology-
Based

Figure 8. Web page prediction using graphs (H=1000, 4-state confidence).

14 Web Page Prediction Enhanced with Confidence Mechan ism

As it can be observed, the weight-based predictor o utperforms both the Dijkstra-based and topology-
based algorithms on all the evaluated benchmarks. T he average prediction accuracy, obtained with the
weight-based graph predictor, was 83.07% on the BU dataset (but exceeding 90% on some
benchmarks) and 75.06% on LBUS.
020 40 60 80 100
con112 con133 con160 con172 con195 con221 con228 con316 con325 con337 con338 con347 con357 con405 con408
con57 con60
con603
con68 con71 con76 con99
Average BU
LBUS Benchmarks Prediction Accuracy [%] Markov
HMM
Graph

Figure 9. Comparing web page predictors (H=1000, 4-state confidence).
020 40 60 80 100
con112 con133 con160 con172 con195 con221 con228 con316 con325 con337 con338 con347 con357 con405 con408
con57 con60
con603
con68 con71 con76 con99
Average BU
LBUS Benchmarks Prediction Accuracy [%] 4-state
conf.
2-state
conf.
Without
conf.

Figure 10. Comparing hybrid web page predictors wit h 4-state confidence, 2-state confidence and withou t confidence.
Figure 9 presents comparatively the best configurat ions of our evaluated predictors, all of them using
the same history of 1000 links and 4-state confiden ce counters. The results show that on the BU datase t
the weight-based graph predictor outperforms the ot her predictors, however, on some benchmarks

A. Gellert and A. Florea 15

(con160, con221, con228, con316, con357, con60 and con603) the HMM predictor showed the better
result. On the LBUS dataset the HMM predictor was b etter. These results suggest that a hybrid
predictor, consisting in a HMM and a graph-based pr edictor, could provide better results. Figure 10
shows comparatively the accuracies obtained using s uch hybrid predictors with 4-state confidence, 2-
state confidence and without confidence, respective ly. The hybrid predictor with 4-state confidence
outperforms, in prediction accuracy, all the evalua ted component predictors. Its average accuracy was
85.45% on BU and 87.28% on the LBUS dataset. It sli ghtly outperforms the 2-state confidence based
hybrid predictor and significantly outperforms the hybrid predictor which is not using any confidence
mechanism. As Figure 10 shows, the confidence mecha nism can increase the accuracy by avoiding
prediction when the confidence in a certain context is low.
00.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
con112 con133 con160 con172 con195 con221 con228 con316 con325 con337 con338 con347 con357 con405 con408
con57 con60
con603
con68 con71 con76 con99
Average BU
LBUS Benchmarks Prediction Rate 4-state
conf.
2-state
conf.

Figure 11. The prediction rates of the hybrid web p age predictors with 4-state confidence and 2-state confidence.

Figure 11 depicts how the prediction rate (the numb er of predicted web pages divided to the total
number of web pages) is influenced by the selectivi ty of the 4-state and 2-state confidence mechanisms
in the case of the hybrid predictor. The prediction rate of the predictor without confidence is 1.0
because a prediction is always performed. It can be observed that as more selective confidence
mechanism we have used as higher accuracy and lower prediction rate we obtained. An exception from
this rule is only the LBUS log file where the predi ction rate is lower with 2-state confidence than wi th
4-state confidence, at almost the same accuracy. Fo r this dataset the faster transitions from predicta ble
to unpredictable states, due to the lower number of predictable states in the 2-state confidence, is
unfavorable. Thus, on the LBUS dataset a 4-state co nfidence mechanism is better from both the
accuracy and prediction rate viewpoints.
The best predictors have been also integrated into the Mozilla Firefox browser as a JavaScript
extension and have been evaluated in a real environ ment with good results.

16 Web Page Prediction Enhanced with Confidence Mechan ism

Summary
In this section we have chosen the best Markov pred ictor (Figure 6), the best HMM predictor (Figure
7) and the best graph-based predictor (Figure 8), i n terms of prediction accuracy. A comparison
between these optimal predictors has been presented in Figure 9, which shows that the weight-based
graph predictor outperforms the other evaluated pre dictors, but on some benchmarks, including the
LBUS dataset, the HMM-based predictor has proved be tter. Finally, we have presented in Figure 10 the
results obtained with a hybrid predictor having as components the HMM and the weight-based graph
predictor and using different confidence mechanisms . Figures 10 and 11 also depict the benefit of the
confidence mechanism.
However, the hybrid predictor with 4-state confiden ce provided the best prediction accuracy:
85.45% at average on BU and 87.28% on the LBUS data set.
5 Conclusions and Further Work
In this study, we have presented three web page pre diction methods in a comparative manner. We have
included into our analysis Markov chains, Hidden Ma rkov Models and graph algorithms. We have
enhanced all these predictors with a confidence mec hanism for higher prediction accuracy. The goal of
the proposed predictive prefetching methods is to r educe the delays necessary to load the web pages
and files visited by the users. The experimental re sults performed on the LBUS dataset have shown that
the HMM predictor provided the best prediction accu racy, 78.29%. On the BU dataset the weight-
based graph predictor provided the best prediction accuracy, 83.07% at average, but on some BU
benchmarks the HMM predictor was better than the gr aph-based one, which suggests that a hybrid
predictor, consisting in a HMM and a graph-based pr edictor, could provide better results. With such a
hybrid predictor we obtained the best accuracy, 85. 45% on BU and 87.28% on LBUS, being thus the
most appropriate to predict and prefetch web pages efficiently. The evaluations have also shown that
the first order Markov chain outperforms the superi or order ones, as well as the first order HMM
outperforms the superior order HMMs. Thus, since su perior order predictors imply higher complexity
but lower accuracy in both cases, in our opinion, m ore simple predictors are more suitable for web pag e
prefetching. This is verified also in the case of g raphs because the simpler weight-based predictor is
more efficient than the Dijkstra-based one whose co mplexity is considerably higher.
The confidence mechanism has a high benefit since i t can significantly increase the accuracy by
avoiding predictions from low-confidence contexts.
A further work direction is the analysis of using n eural networks and also other graph algorithms for
web page prediction. Another further work direction consists in developing and evaluating different
hybrid predictors. Another one could be the latency evaluation because it would be useful to show how
retrieval times are reduced through prefetching. Fi nding and exploiting similarity among users is a
research challenge in itself.

A. Gellert and A. Florea 17

Acknowledgements
We express our gratitude to our MSc student Ionut-M adalin Ghimis for providing his useful and
competent help in implementing the graph algorithms presented in this work.
References
1. Baum, L.E. An Inequality and Associated Maximizatio n Technique in Statistical Estimation for
Probabilistic Functions of Markov Processes. Inequa lities, Vol. 3, 1-8, 1972.
2. Cambazoglu, B.B. Algorithmic Techniques for Reducin g Energy Consumption of Commercial
Web Search engines. Yahoo! Research Barcelona, Spai n, Training School, University of Balearic
Islands, Palma de Mallorca, 2012.
3. Canali, C., Colajanni, M., Lancellotti, R. Adaptive algorithms for efficient content management in
social network services. 10 th International Conference on Computer and Informati on Technology,
68-75, 2010.
4. Ciobanu, D., Dinuca, C.E. Predicting the next page that will be visited by a web surfer using Page
Rank algorithm. International Journal of Computers and Communications, Issue 1, Vol. 6, 60-67,
2012.
5. Cormen, T., Leiserson, C., Rivest, R., Stein, C. In troduction to Algorithms. MIT Press, Third
Edition, 2009.
6. Cunha, C. A., Bestavros, A., Crovella, M. E. Charac teristics of WWW Client Traces. Boston
University Department of Computer Science, Technica l Report TR-95-010, 1995.
7. Deshpande, M., Karypis, G. Selective Markov Models for Predicting Web-Page Accesses. ACM
Transactions on Internet Technology, Vol. 4, Issue 2, 163-184, 2004.
8. Domènech, J., Pont, A., Sahuquillo, J., Gil, J. A. An experimental framework for testing web
prefetching techniques. The 30th EUROMICRO Conferen ce, 214-221, 2004.
9. Domènech, J., Sahuquillo, J., Pont, A., Gil, J. A. How current web generation affects prediction
algorithms performance. Proceedings of SoftCOM Inte rnational Conference on Software,
Telecommunications and Computer Networks, 2005.
10. Dubey, S., Mishra, N. Web Page Prediction using Hyb rid Model. International Journal on
Computer Science & Engineering, Vol. 3 Issue 5, 217 0-2176, 2011.
11. Gellert, A., Vintan, L. Person Movement Prediction Using Hidden Markov Models. Studies in
Informatics and Control, Vol. 15, No. 1, National I nstitute for Research and Development in
Informatics, Bucharest, 17-30, 2006.
12. Guo, Y. Z., Ramamohanarao, K., Park, L. A. F. Web P age Prediction Based on Conditional
Random Fields. The 18th European Conference on Arti ficial Intelligence, 251-255, 2008.
13. Hasan, M. A., Chaoji, V., Salem, S., Zaki, M. Link Prediction using Supervised Learning.
Proceedings of SDM 06 Workshop on Link Analysis, Co unterterrorism and Security, 2006.
14. Huang, Q, Yang, Q., Huang, J. Z., Ng, M. K. Mining of Web-Page Visiting Patterns with
Continuous-Time Markov Models. Springer-Verlag Berl in Heidelberg, 549-558, 2004.
15. Huang, Z. Link Prediction Based on Graph Topology: The Predictive Value of Generalized
Clustering Coefficient. Proceedings of the Workshop on Link Analysis: Dynamics and Static of
Large Networks, 2006.

18 Web Page Prediction Enhanced with Confidence Mechan ism

16. Jin, X., Xu, H. An Approach to Intelligent Web Pre- fetching Based on Hidden Markov Model.
Proceedings of the 42nd Conference on Decision and Control, Maui, Hawaii, USA, 2954-2958,
Vol. 3, 2003.
17. Kaushal, P. Hybrid Markov model for better predicti on of web page. International Journal of
Scientific and Research Publications, Vol. 2, Issue 8, 2012.
18. Khalil, F., Li, J., Wang, H. Integrating Recommenda tion Models for Improved Web Page
Prediction Accuracy. Proceedings of the 31 st Australasian Conference on Computer Science, Vol.
74, 91-100, 2008.
19. Khalil, F., Li, J., Wang, H. An Integrated Model fo r Next Page Access Prediction. Internation
Journal of Knowledge and Web Intelligence, Vol. 1, No. 1/2, 48-80, 2009.
20. Khanchana, R., Punithavalli, M. Web Page Prediction for Web Personalization: A Review. Global
Journal of Computer Science and Technology, Vol. 11 , Issue 7, 39-44, 2011.
21. Liben-Nowell, D., Kleinberg, J. The Link-Prediction Problem for Social Networks, Journal of the
American Society for Information Science and Techno logy, Vol. 58, Issue 7, pages 1019-1031,
2007.
22. Murata, T., Moriyasu, S. Link Prediction of Social Networks Based on Weighted Proximity
Measures. IEEE/WIC/ACM International Conference on Web Intelligence, 85-88, 2007.
23. Petzold, J. Augsburg Indoor Location Tracking Bench marks. Technical Report 2004-9, Institute of
Computer Science, University of Augsburg, Germany, 2004.
24. Rabiner, L.R. A Tutorial on Hidden Markov Models an d Selected Applications in Speech
Recognition. Proceedings of the IEEE, Vol 77, No. 2 , 257-286, 1989.
25. Schmidt, E. Every 2 days we create as much informat ion as we did up to 2003. Lake Tahoe, CA.
URL http://techcrunch.com/2010/08/04/schmidt-data/, 2010.
26. Singhai, N., Nigam R.K. A Novel Technique to Predic t Oftenly Used Web Pages from Usage
Patterns. International Journal of Emerging Trends & Technology in Computer Science, Vol. 1,
Issue 4, 49-55. (2012)
27. Stamp, M. A Revealing Introduction to Hidden Markov Models.
http://www.cs.sjsu.edu/faculty/stamp/RUA/HMM.pdf, 2 004.
28. Su, Z., Yang, Q., Zhang, H. J. A Prediction System for Multimedia Pre-fetching in Internet.
Proceedings of the eighth ACM international confere nce on Multimedia, 3-11, 2000.
29. Temgire, S., Gupta. P. Review on Web Prefetching Te chniques. International Journal of
Technology Enhancements and Emerging Engineering Re search, Vol. 1, Issue 4, 100-105, 2013.
30. Vintan, L., Gellert, A., Petzold, J., Ungerer, T. P erson Movement Prediction Using Neural
Networks. Proceedings of the KI2004 International W orkshop on Modeling and Retrieval of
Context (MRC 2004), Vol-114, Ulm, Germany, 2004.
31. Vintan, L., Florea, A., Gellert, A. Random Degrees of Unbiased Branches. Proceedings of the
Romanian Academy, Series A, No. 3, 259-268, 2008.
32. Wan, M., Jönsson, A., Wang, C., Li, L., Yang, Y Web user clustering and Web prefetching using
Random Indexing with weight functions. Knowledge an d Information Systems, Vol. 33, Issue 1,
89-115, 2012.
33. Zhu, J., Hong, J., Hughes, J. G. Using Markov Chain s for Link Prediction in Adaptive Web Sites.
Springer-Verlag Berlin Heidelberg, 60-73, 2002.

Similar Posts