University of Pitești
Faculty of Mathematics and Computer Sciences

Advanced Techniques for Information Processing

Master Thesis

Website indexing with NoSQL storage
using .NET technologies

Scientific leader: Conf. Univ. Dr. Doru Anastasiu POPESCU
Graduate: Andrei ILIE

Pitești
2017
Content

INTRODUCTION
1 THEORETICAL FOUNDATION
  1.1 Elements of non-relational database theory
    1.1.1 Introduction
    1.1.2 Characteristics of NoSQL
    1.1.3 NoSQL Storage types
    1.1.4 Advantages of using NoSQL
  1.2 MongoDB
  1.3 .NET Framework
    1.3.1 CLR Features
    1.3.2 Class Library Features
    1.3.3 The Common Type System (CTS)
    1.3.4 The Common Language Infrastructure (CLI)
    1.3.5 The Common Language Specification (CLS)
    1.3.6 Classes
    1.3.7 Namespaces
    1.3.8 Assemblies
    1.3.9 Intermediate language (IL)
    1.3.10 Managed execution
    1.3.11 Manifests, Metadata and Attributes
    1.3.12 Object Orientation in the .NET Framework
    1.3.13 Rapid Development and Reuse
    1.3.14 Choosing a Language
  1.4 Design Patterns
    1.4.1 Factory Pattern
    1.4.2 Singleton Pattern
  1.5 JSON
  1.6 Topshelf
2 SYSTEM SPECIFICATIONS AND ARCHITECTURE
3 DESIGN APPLICATION
  3.1 Windows Service
    3.1.1 Services
    3.1.2 Services.Utils
    3.1.3 Services.Bridge
  3.2 Website Indexing
    3.2.1 JSON Configuration File
    3.2.2 Crawling
4 Conclusions
Bibliography

INTRODUCTION

The application aims to index a website in a manner as generic as possible. Any
website contains a large amount of information that can be downloaded automatically,
simply by configuring the application. The application collects this data and stores
it in the database.

By setting the website type and creating a mapper (configuration file) for
each website, a large amount of information can be stored quickly, with a minimum of
configuration, and a scheduled task can run the indexing of the selected website,
collecting only new data.

1 THEORETICAL FOUNDATION

In this chapter we review the introductory aspects of the components used
in the implementation of the application.

1.1 Elements of non-relational database theory
1.1.1 Introduction

A non-relational database is any database that does not follow the relational
model provided by traditional relational database management systems. This
category of databases, also referred to as NoSQL databases, has seen steady
adoption growth in recent years with the rise of Big Data applications.

Non-relational databases have grown in popularity because they were
designed to overcome the limitations of relational databases in dealing with Big
Data demands. Big Data refers to data that is growing and moving too fast, and is
too diverse in structure, for conventional technologies to handle.

While NoSQL technologies vary greatly, these databases are typically
more scalable and flexible than their relational counterparts. Non-relational
databases have evolved from relational technology in these ways:

 Data models: Unlike relational models, which require a predefined schema,
NoSQL databases offer flexible schema design that makes it much easier
to update the database to handle changing application requirements.
 Data structure: Non-relational databases are designed to handle
unstructured data that doesn't fit neatly into rows and columns. This
matters because most of the data generated today is unstructured.
 Scaling: You can scale your system horizontally by taking advantage of
cheap, commodity servers.
 Development model: NoSQL databases are typically open source, which
means you don't have to pay any software licensing fees upfront.
1.1.2 Characteristics of NoSQL

NoSQL-based solutions provide answers to most of the challenges outlined above:

 Schema flexibility: Column-oriented databases store data as columns,
as opposed to rows in an RDBMS. This allows the flexibility of adding one or more
columns as required, on the fly. Similarly, document stores that allow storing
semistructured data are also good options.

 Complex queries: NoSQL databases do not have support for
relationships or foreign keys. There are no complex queries. There are no
JOIN statements. Is that a drawback? How does one query across tables? It is
a functional drawback, definitely. To query across tables, multiple queries
must be executed. A database is a shared resource, used across application
servers, and must be released from use as quickly as possible. The options
involve a combination of simplifying the queries to be executed, caching data, and
performing complex operations in the application tier. A lot of databases provide
built-in entity-level caching. This means that as and when a record is
accessed, it may be automatically cached, transparently, by the database. The
cache may be an in-memory distributed cache, for performance and scale.
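The lookup-then-fetch pattern described above, where the application tier replaces a SQL JOIN with multiple queries, can be sketched in a few lines of Python. The collections and field names are invented for illustration; a real system would issue two separate database queries instead of in-memory lookups:

```python
# Two "collections" with no foreign-key support; the application joins them.
orders = [
    {"_id": "o1", "user_id": "u1", "total": 40},
    {"_id": "o2", "user_id": "u2", "total": 15},
]
users = {u["_id"]: u for u in [
    {"_id": "u1", "name": "Ana"},
    {"_id": "u2", "name": "Radu"},
]}

# Query 1 fetched the orders; query 2 (here a dict lookup) fetches each
# referenced user. This two-step fetch replaces a single SQL JOIN.
report = [{"order": o["_id"], "user": users[o["user_id"]]["name"]}
          for o in orders]
print(report)  # [{'order': 'o1', 'user': 'Ana'}, {'order': 'o2', 'user': 'Radu'}]
```

In practice the second lookup is often served by the entity-level cache mentioned above, which is why the extra round trip is usually acceptable.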

 Data update: Data update and synchronization across physical
instances are difficult engineering problems to solve. Synchronization across
nodes within a datacenter has a different set of requirements compared to
synchronizing across multiple datacenters. One would want the latency to be
within a couple of milliseconds, or tens of milliseconds at worst. NoSQL
solutions offer great synchronization options. MongoDB, for example,
allows concurrent updates across nodes, synchronization with conflict
resolution and, eventually, consistency across the datacenters within an
acceptable time of a few milliseconds. As such, MongoDB has
no concept of isolation. Note that because the complexity of managing
the transaction may be moved out of the database, the application will have
to do some hard work. An example of this is a two-phase commit while
implementing transactions. Do not worry or get scared. A plethora of
databases offer multiversion concurrency control (MVCC) to achieve
transactional consistency. Surprisingly, eBay does not use transactions at all. Well, as
Dan Pritchett, Technical Fellow at eBay, puts it, eBay.com does not use
transactions. Note that PayPal does use transactions.

 Scalability: NoSQL solutions provide greater scalability for obvious
reasons. A lot of the complexity that is required for a transaction-oriented RDBMS
does not exist in ACID-noncompliant NoSQL databases. Interestingly, since
NoSQL databases do not provide cross-table references, no JOIN queries are
possible, and because one cannot write a single query to collate data across
multiple tables, one simple and logical solution is to, at times, duplicate
the data across tables. In some scenarios, embedding the information within
the primary entity, especially in one-to-one mapping cases, may be a great
idea.
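The embedding idea above can be contrasted with referencing using two illustrative documents (shown as Python dicts; the entities and field names are invented, not taken from the thesis application):

```python
# Referencing: two records; retrieving the profile needs a second lookup,
# since there is no JOIN.
user_ref = {"_id": "u1", "name": "Ana", "profile_id": "p1"}
profile = {"_id": "p1", "bio": "Developer from Pitesti"}

# Embedding: the one-to-one data is denormalized into the primary entity,
# so a single query returns everything.
user_embedded = {
    "_id": "u1",
    "name": "Ana",
    "profile": {"bio": "Developer from Pitesti"},
}

print(user_embedded["profile"]["bio"])  # Developer from Pitesti
```

The trade-off is the usual one for duplication: embedded data is cheap to read but must be kept consistent by the application when it changes.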

1.1.3 NoSQL Storage types

There are various storage types available in which content can be
modeled for NoSQL databases. In the subsequent sections, we will explore the
following storage types:
 Column-oriented
 Document store
 Key-value store
 Graph

Column-oriented databases

Column-oriented databases store data as columns, as opposed to the rows that
are prominent in RDBMS. A relational database shows the data as two-dimensional
tables comprising rows and columns, but stores, retrieves, and processes it one
row at a time, whereas a column-oriented database stores data as columns.

For example, assume that the following data is to be stored (an ID, first name,
last name, age, and salary for each employee):

ID    First name  Last name  Age  Salary
SM1   Anuj        Sharma     45   10000000
MM2   Anand                  34   5000000
T3    Vikas       Gupta      39   7500000
E4    Dinesh      Verma      32   2000000

In an RDBMS, the data may be serialized and stored internally as follows:
SM1,Anuj,Sharma,45,10000000
MM2,Anand,,34,5000000
T3,Vikas,Gupta,39,7500000
E4,Dinesh,Verma,32,2000000

However, in a column-oriented database, the data will be stored internally as
follows:
SM1,MM2,T3,E4
Anuj,Anand,Vikas,Dinesh
Sharma,,Gupta,Verma
45,34,39,32
10000000,5000000,7500000,2000000
Online transaction processing (OLTP) focused relational databases are
row-oriented. Online analytical processing (OLAP) systems that require processing of
data need column-oriented access. Having said that, OLTP operations may also
require column-oriented access when working on a subset of columns and
operating on them. Data access to these databases is typically done by using either
a proprietary protocol, in the case of commercial solutions, or an open standard
binary protocol (for example, Remote Method Invocation). The transport protocol
is generally binary.
Some of the databases that fall under this category include:
o Oracle RDBMS Columnar Expression
o Microsoft SQL Server 2012 Enterprise Edition
o Apache Cassandra
o HBase
o Google BigTable (available as part of the Google App Engine branded
Datastore)

Most of the solutions, such as Apache Cassandra, HBase, and Google
Datastore, allow adding columns over time without having to worry about filling in
default values for the existing rows for the new columns. This gives flexibility in
model and entity design, allowing one to account for new columns in the future for
unforeseen scenarios and new requirements.
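The difference between the two layouts shown above can be modeled in a few lines of Python. This only illustrates the logical layout; real column stores add compression, indexing, and on-disk formats:

```python
# The same employee records laid out row-wise and column-wise.
rows = [
    ("SM1", "Anuj", "Sharma", 45, 10000000),
    ("MM2", "Anand", "", 34, 5000000),
    ("T3", "Vikas", "Gupta", 39, 7500000),
    ("E4", "Dinesh", "Verma", 32, 2000000),
]

# Row-oriented storage: one record at a time, all fields together.
row_store = [",".join(str(field) for field in row) for row in rows]

# Column-oriented storage: one column at a time, all values of a field together.
columns = list(zip(*rows))
column_store = [",".join(str(value) for value in column) for column in columns]

# A column scan (for example, "average age") touches a single entry of the
# column store, but every record of the row store.
ages = [int(v) for v in column_store[3].split(",")]
average_age = sum(ages) / len(ages)
print(row_store[0])      # SM1,Anuj,Sharma,45,10000000
print(column_store[0])   # SM1,MM2,T3,E4
print(average_age)       # 37.5
```

This is exactly why the OLAP workloads mentioned above favor column-oriented access: aggregates over one field read contiguous data.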

Document store

Also referred to as a document-oriented database, a document store allows the
inserting, retrieving, and manipulating of semi-structured data. Most of the
databases available under this category use XML, JSON, BSON, or YAML, with
data access typically over the HTTP protocol using a RESTful API, or over the
Apache Thrift protocol for cross-language interoperability. Compared to an RDBMS,
the documents themselves act as records (or rows); however, the data is semi-structured,
as opposed to the rigid structure of an RDBMS. For example, two records may have
completely different sets of fields or columns. The records may or may not adhere to a
specific schema (like the table definitions in RDBMS). For that matter, the database may
not support a schema or validating a document against the schema at all. Even
though the documents do not follow a strict schema, indexes can be created and
queried.

Document-oriented databases provide this flexibility: dynamic or
changeable schemas, or even schemaless documents. Because of the limitless
flexibility provided in this model, this is one of the more popular models
implemented and used. Some of the popular databases that provide document-oriented
storage include:
• MongoDB
• CouchDB
• Jackrabbit
• Lotus Notes
• Apache Cassandra

The most prominent advantage, as evident in the preceding examples, is that
content is schemaless, or at best loosely defined. This is very useful in web-based
applications where there is a need for storing different types of content that may
evolve over time. For example, for a grocery store, information about the users,
inventory, and orders can be stored as simple JSON or XML documents. Note that
a "document store" is not the same as a "blob store", where the data cannot be indexed.
Based on the implementation, it may or may not be possible to retrieve or update a
record partially. If it is possible to do so, there is a great advantage. Note that stores
based on XML, BSON, JSON, and YAML would typically support this. XML-based
BaseX can be really powerful while integrating multiple systems working
with XML, given that it supports XQuery 3.0 and XSLT 2.0. Searching across
multiple entity types is far simpler than doing so in a traditional RDBMS,
or even in column-oriented databases. Because there is now no concept of
tables (which are essentially nothing more than a schema definition), one can query
across the records irrespective of the underlying content or schema; in other
words, the query is directly against the entire database. Note that these databases
allow for the creation of indexes (using common parameters or otherwise), which can
evolve over time.
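The schema flexibility described above can be illustrated with two JSON documents for the hypothetical grocery store, using plain Python and the standard json module. The field names are invented for illustration:

```python
import json

# Two records in the same logical collection, with different sets of fields.
# A document store accepts both; an RDBMS table would need a schema change.
product = {
    "_id": "p1",
    "type": "product",
    "name": "Apples",
    "price": 3.5,
    "tags": ["fruit", "fresh"],          # arrays are first-class values
}
order = {
    "_id": "o1",
    "type": "order",
    "items": [{"product_id": "p1", "qty": 2}],   # embedded documents
    "customer": {"name": "Jane", "city": "Pitesti"},
}

collection = [product, order]

# Serialized form, as it might be stored or transferred.
serialized = [json.dumps(doc, sort_keys=True) for doc in collection]

# Querying "across the database" without a table schema: match on any field.
orders = [doc for doc in collection if doc.get("type") == "order"]
print(len(orders))                      # 1
print(orders[0]["customer"]["city"])    # Pitesti
```

Note how the embedded `items` array and `customer` sub-document keep related data in a single record, which is the property MongoDB exploits in section 1.2.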

Key-value store

A key-value store is very closely related to a document store: it allows the
storage of a value against a key. Similar to a document store, there is no need for a
schema to be enforced on the value. However, there are a few constraints that are
enforced by a key-value store:
o Unlike a document store, which can create a key when a new document is
inserted, a key-value store requires the key to be specified.
o Unlike a document store, where the value can be indexed and queried, for
a key-value store the value is opaque and, as such, the key must be known
to retrieve the value. If you are familiar with the concept of maps or
associative arrays, or have worked with hashes, then you have already
worked with an in-memory key-value store. The most prominent use of
a key-value store is as a cache, in-memory, distributed, or otherwise.
However, implementations do exist to provide persistent storage.
A few of the popular key-value stores are:
o Redis (in-memory, with dump or command-log persistence)
o Memcached (in-memory)
o MemcacheDB (built on Memcached)
o Berkeley DB
o Voldemort (open source implementation of Amazon Dynamo)

Key-value stores are optimized for querying against keys. As such, they
serve well as in-memory caches. Memcached and Redis support expiry for the
keys, sliding or absolute, after which the entry is evicted from the store. At
times, one can generate the keys smartly, say, as bucketed UUIDs, and can query
against ranges of keys. For example, Redis allows retrieving a list of all the keys
matching a glob-style pattern.

Though key-value stores cannot query on the values, they can still
understand the type of a value. Stores like Redis support different value types:
strings, hashes, lists, sets, and sorted sets. Based on the value types, advanced
functionalities can be provided. Some of them include atomic increment,
setting/updating multiple fields of a hash (the equivalent of partially updating a
document), and intersection, union, and difference while working with sets.
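The key-value semantics above (caller-supplied keys, opaque values, absolute expiry, glob-style key matching) can be modeled in a few lines of Python. This is a toy in-process cache for illustration, not a substitute for Redis or Memcached:

```python
import fnmatch
import time

class KeyValueCache:
    """Toy in-memory key-value store with absolute per-key expiry and
    Redis-style glob key matching."""

    def __init__(self):
        self._data = {}  # key -> (value, expires_at or None)

    def set(self, key, value, ttl_seconds=None):
        # The caller must supply the key (unlike a document store, which
        # can generate one on insert).
        expires_at = (time.monotonic() + ttl_seconds
                      if ttl_seconds is not None else None)
        self._data[key] = (value, expires_at)

    def get(self, key, default=None):
        # The value is opaque: the only access path is the exact key.
        entry = self._data.get(key)
        if entry is None:
            return default
        value, expires_at = entry
        if expires_at is not None and time.monotonic() >= expires_at:
            del self._data[key]  # lazy eviction on access
            return default
        return value

    def keys(self, pattern="*"):
        # Glob-style matching over keys, like the Redis KEYS command.
        return [k for k in self._data if fnmatch.fnmatch(k, pattern)]

cache = KeyValueCache()
cache.set("session:1", {"user": "andrei"}, ttl_seconds=30)
cache.set("session:2", {"user": "doru"})
cache.set("page:home", "<html>...</html>")
print(cache.get("session:1"))           # {'user': 'andrei'}
print(sorted(cache.keys("session:*")))  # ['session:1', 'session:2']
```

The bucketed-key trick mentioned above corresponds to choosing key prefixes like `session:` here, so that related entries can be retrieved by pattern.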

Graph store

Graph databases represent a special category of NoSQL databases where
relationships are represented as graphs. There can be multiple links between two
nodes in a graph, representing the multiple relationships that the two nodes share.
The relationships represented may include social relationships between people,
transport links between places, or network topologies between connected systems.

Graph databases can be considered special-purpose NoSQL databases
optimized for relation-heavy data. If there is no relationship among the entities,
there is no use case for graph databases. The one advantage that graph databases
have is the easy representation, retrieval, and manipulation of relationships between
the entities in the system. It is not uncommon to store data in a document store and
relationships in a graph database.
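A minimal way to picture the model: nodes plus labeled edges, where the same pair of nodes may share several distinct relationships. This is a sketch in plain Python (the node and relation names are invented), not how a real graph database stores or indexes edges:

```python
# Labeled, directed edges; two nodes may share several edges,
# one per relationship type.
edges = [
    ("alice", "follows", "bob"),
    ("alice", "works_with", "bob"),   # second relationship, same pair
    ("bob", "follows", "carol"),
]

def neighbors(node, relation):
    """All nodes reachable from `node` over edges labeled `relation`."""
    return [dst for src, rel, dst in edges if src == node and rel == relation]

def relations(a, b):
    """All relationship labels between a and b (multiple links allowed)."""
    return [rel for src, rel, dst in edges if src == a and dst == b]

print(neighbors("alice", "follows"))  # ['bob']
print(relations("alice", "bob"))      # ['follows', 'works_with']
```

Traversals like these (and multi-hop variants) are the operations a graph database optimizes, which is why relation-heavy data is its natural fit.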

1.1.4 Advantages of using NoSQL

For a quarter of a century, the relational database (RDBMS) has been the
dominant model for database management. But today, non-relational, "cloud," or
"NoSQL" databases are gaining mindshare as an alternative model for database
management.

1. Elastic scaling
For years, database administrators have relied on scale-up (buying bigger
servers as database load increases) rather than scale-out (distributing the
database across multiple hosts as load increases). However, as transaction rates and
availability requirements increase, and as databases move into the cloud or onto
virtualized environments, the economic advantages of scaling out on commodity
hardware become irresistible.

RDBMS might not scale out easily on commodity clusters, but the new breed
of NoSQL databases is designed to expand transparently to take advantage of new
nodes, and they're usually designed with low-cost commodity hardware in mind.

2. Big data
Just as transaction rates have grown out of recognition over the last decade,
the volumes of data that are being stored have also increased massively. O'Reilly
has cleverly called this the "industrial revolution of data." RDBMS capacity has
been growing to match these increases, but, as with transaction rates, the constraints
on the data volumes that can be practically managed by a single RDBMS are becoming
intolerable for some enterprises. Today, the volumes of "big data" that can be
handled by NoSQL systems, such as Hadoop, outstrip what can be handled by the
biggest RDBMS.
3. Goodbye DBAs
Despite the many manageability improvements claimed by RDBMS vendors
over the years, high-end RDBMS systems can be maintained only with the
assistance of expensive, highly trained DBAs. DBAs are intimately involved in the
design, installation, and ongoing tuning of high-end RDBMS systems.

NoSQL databases are generally designed from the ground up to require less
management: automatic repair, data distribution, and simpler data models lead to
lower administration and tuning requirements, in theory. In practice, it's likely
that rumors of the DBA's death have been slightly exaggerated. Someone will
always be accountable for the performance and availability of any mission-critical
data store.
4. Economics
NoSQL databases typically use clusters of cheap commodity servers to
manage the exploding data and transaction volumes, while RDBMS tends to rely
on expensive proprietary servers and storage systems. The result is that the cost per
gigabyte or transaction/second for NoSQL can be many times less than the cost for
RDBMS, allowing you to store and process more data at a much lower price point.
5. Flexible data models
Change management is a big headache for large production RDBMS. Even
minor changes to the data model of an RDBMS have to be carefully managed and
may necessitate downtime or reduced service levels.

NoSQL databases have far more relaxed, or even nonexistent, data
model restrictions. NoSQL key-value stores and document databases allow an
application to store virtually any structure it wants in a data element. Even the more
rigidly defined BigTable-based NoSQL databases (Cassandra, HBase) typically
allow new columns to be created without too much fuss.

The result is that application changes and database schema changes do not
have to be managed as one complicated change unit. In theory, this will allow
applications to iterate faster, though, clearly, there can be undesirable side effects if
the application fails to manage data integrity.

The promise of the NoSQL database has generated a lot of enthusiasm, but
there are many obstacles to overcome before NoSQL databases can appeal to
mainstream enterprises. Here are a few of the top challenges.

1. Maturity
RDBMS systems have been around for a long time. NoSQL advocates will
argue that their advancing age is a sign of their obsolescence, but for most CIOs,
the maturity of the RDBMS is reassuring. For the most part, RDBMS systems are
stable and richly functional. In comparison, most NoSQL alternatives are in
pre-production versions with many key features yet to be implemented.

Living on the technological leading edge is an exciting prospect for many
developers, but enterprises should approach it with extreme caution.
2. Support
Enterprises want the reassurance that, if a key system fails, they will be able
to get timely and competent support. All RDBMS vendors go to great lengths to
provide a high level of enterprise support.

In contrast, most NoSQL systems are open source projects, and although
there are usually one or more firms offering support for each NoSQL database,
these companies are often small start-ups without the global reach, support
resources, or credibility of an Oracle, Microsoft, or IBM.
3. Analytics and business intelligence
NoSQL databases have evolved to meet the scaling demands of modern Web
2.0 applications. Consequently, most of their feature set is oriented toward the
demands of these applications. However, data in an application has value to the
business that goes beyond the insert-read-update-delete cycle of a typical Web
application. Businesses mine information in corporate databases to improve their
efficiency and competitiveness, and business intelligence (BI) is a key IT issue for
all medium to large companies.

NoSQL databases offer few facilities for ad-hoc query and analysis. Even a
simple query requires significant programming expertise, and commonly used BI
tools do not provide connectivity to NoSQL.

Some relief is provided by the emergence of solutions such as Hive or Pig,
which can provide easier access to data held in Hadoop clusters and, perhaps
eventually, other NoSQL databases. Quest Software has developed a product,
Toad for Cloud Databases, that can provide ad-hoc query capabilities to a variety
of NoSQL databases.

4. Administration
The design goals for NoSQL may be to provide a zero-admin solution, but
the current reality falls well short of that goal. NoSQL today requires a lot of skill
to install and a lot of effort to maintain.

5. Expertise
There are literally millions of developers throughout the world, and in every
business segment, who are familiar with RDBMS concepts and programming. In
contrast, almost every NoSQL developer is in learning mode. This situation will
resolve naturally over time, but for now it's far easier to find experienced RDBMS
programmers or administrators than a NoSQL expert.
1.2 MongoDB

MongoDB is a powerful, flexible, and scalable general-purpose database. It
combines the ability to scale out with features such as secondary indexes, range
queries, sorting, aggregations, and geospatial indexes.

MongoDB is a document-oriented database, not a relational one. The
primary reason for moving away from the relational model is to make scaling out
easier, but there are some other advantages as well. A document-oriented database
replaces the concept of a "row" with a more flexible model, the "document." By
allowing embedded documents and arrays, the document-oriented approach makes it
possible to represent complex hierarchical relationships with a single record. This
fits naturally into the way developers in modern object-oriented languages think
about their data. There are also no predefined schemas: a document's keys and
values are not of fixed types or sizes. Without a fixed schema, adding or removing
fields as needed becomes easier. Generally, this makes development faster, as
developers can quickly iterate. It is also easier to experiment. Developers can try
dozens of models for the data and then choose the best one to pursue.
Data set sizes for applications are growing at an incredible pace. Increases in
available bandwidth and cheap storage have created an environment where even
small-scale applications need to store more data than many databases were meant
to handle. A terabyte of data, once an unheard-of amount of information, is now
commonplace.

As the amount of data that developers need to store grows, developers face a
difficult decision: how should they scale their databases? Scaling a database comes
down to the choice between scaling up (getting a bigger machine) or scaling out
(partitioning data across more machines). Scaling up is often the path of least
resistance, but it has drawbacks: large machines are often very expensive, and
eventually a physical limit is reached where a more powerful machine cannot be
purchased at any cost. The alternative is to scale out: to add storage space or
increase performance, buy another commodity server and add it to your cluster.
This is both cheaper and more scalable; however, it is more difficult to administer a
thousand machines than it is to care for one. MongoDB was designed to scale out.
Its document-oriented data model makes it easier for it to split up data across
multiple servers. MongoDB automatically takes care of balancing data and load
across a cluster, redistributing documents automatically and routing user requests to
the correct machines. This allows developers to focus on programming the
application, not scaling it. When a cluster needs more capacity, new machines can
be added and MongoDB will figure out how the existing data should be spread to
them.
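The idea of splitting documents across servers can be sketched with a toy hash-based router in Python. MongoDB's actual balancer partitions data into chunks by shard key (ranged or hashed) and migrates them between shards; this sketch only illustrates the "route each document to one of several servers" principle, with invented shard names:

```python
import hashlib

# Hypothetical shard names for illustration.
SHARDS = ["shard-a", "shard-b", "shard-c"]

def shard_for(doc, key="_id"):
    """Deterministically map a document to a shard by hashing its shard key."""
    digest = hashlib.md5(str(doc[key]).encode()).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

docs = [{"_id": i} for i in range(6)]
placement = {d["_id"]: shard_for(d) for d in docs}
print(placement)
```

Because the mapping is a pure function of the key, any router node can compute where a document lives without consulting the others, which is what lets user requests be sent to the correct machine.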
MongoDB is intended to be a general-purpose database, so aside from
creating, reading, updating, and deleting data, it provides an ever-growing list of
unique features:

Indexing: MongoDB supports generic secondary indexes, allowing
a variety of fast queries, and provides unique, compound, geospatial, and full-text
indexing capabilities as well.

Aggregation: MongoDB supports an "aggregation pipeline" that allows you
to build complex aggregations from simple pieces and allow the database to
optimize them.

Special collection types: MongoDB supports time-to-live collections for data
that should expire at a certain time, such as sessions. It also supports fixed-size
collections, which are useful for holding recent data, such as logs.

File storage: MongoDB supports an easy-to-use protocol for storing large
files and file metadata.

Some features common to relational databases are not present in MongoDB,
notably joins and complex multirow transactions. Omitting these was an
architectural decision to allow for greater scalability, as both of those features are
difficult to provide efficiently in a distributed system.
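To make the aggregation pipeline idea concrete, here is a toy evaluator in plain Python over an in-memory list. Real pipelines are JSON-like stage documents executed by the MongoDB server and support many more stages; this sketch covers only `$match` and a `$sum`-based `$group`, with the output field hard-wired to `total` for brevity:

```python
collection = [
    {"item": "apple", "qty": 2, "status": "shipped"},
    {"item": "apple", "qty": 5, "status": "pending"},
    {"item": "pear", "qty": 7, "status": "shipped"},
]

def run_pipeline(docs, stages):
    """Evaluate a tiny subset of MongoDB-style pipeline stages in order."""
    for stage in stages:
        if "$match" in stage:
            criteria = stage["$match"]
            docs = [d for d in docs
                    if all(d.get(k) == v for k, v in criteria.items())]
        elif "$group" in stage:
            spec = stage["$group"]
            key_field = spec["_id"].lstrip("$")
            sum_field = spec["total"]["$sum"].lstrip("$")
            groups = {}
            for d in docs:
                groups[d.get(key_field)] = (
                    groups.get(d.get(key_field), 0) + d[sum_field])
            docs = [{"_id": k, "total": v} for k, v in groups.items()]
    return docs

pipeline = [
    {"$match": {"status": "shipped"}},
    {"$group": {"_id": "$item", "total": {"$sum": "$qty"}}},
]
result = run_pipeline(collection, pipeline)
print(sorted(result, key=lambda d: d["_id"]))
# [{'_id': 'apple', 'total': 2}, {'_id': 'pear', 'total': 7}]
```

The point of the stage-list shape is that the server can reorder and optimize stages (for example, pushing `$match` before `$group`), which is exactly the optimization opportunity the text mentions.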

Incredible performance is a major goal for MongoDB and has shaped much
of its design. MongoDB adds dynamic padding to documents and preallocates data
files to trade extra space usage for consistent performance. It uses as much RAM
as it can as its cache and attempts to automatically choose the correct indexes for
queries. In short, almost every aspect of MongoDB was designed to maintain high
performance. Although MongoDB is powerful and attempts to keep many features
from relational systems, it is not intended to do everything that a relational database
does. Whenever possible, the database server offloads processing and logic to the
client side (handled either by the drivers or by a user's application code).
Maintaining this streamlined design is one of the reasons MongoDB can achieve
such high performance.

1.3 .NET Framework

The .NET Framework is a Windows component that supports the building and
running of Windows applications and XML Web services. The purpose of the
component is to provide the user with a consistent object-oriented programming
environment, whether the code is stored locally or remotely.

It aims to minimize software deployment and versioning conflicts and also to
promote safe execution of code, including code executed by trusted third parties. It
is directed towards eliminating the performance problems of scripted or interpreted
environments. The effort is to make the developer experience consistent across a
variety of applications and platforms, and to create communication standards that
help .NET Framework applications integrate with all other web-based applications.

The .NET Framework has two major components: the Common Language
Runtime (CLR) and the Class Library.

The CLR is the foundation upon which the .NET Framework has been built.
The runtime manages code at execution time and provides all the core services, such
as memory management, thread management, and remoting. It also enforces strict
type safety and ensures code accuracy in order to provide security and robustness to
applications. This capability to manage code at runtime is the distinguishing
feature of the CLR. All code that is managed by the CLR is known as managed
code, while other code is known as unmanaged code.

1.3.1 CLR Features

1. The CLR manages memory, thread execution, code execution, compilation,
code safety verification, and other system services.
2. For security reasons, managed code is assigned varying degrees of trust
based on origin. This prevents or allows the managed component from
performing file access operations, registry access operations, or other
sensitive functions, even within the same active application.
3. The runtime enforces code robustness by implementing a strict type and code
verification infrastructure called the Common Type System (CTS). The CTS
ensures that all managed code is self-describing and that all code generated by
Microsoft or third-party language compilers conforms to the CTS. This enables
managed code to consume other managed types and enforces strict type
fidelity and type safety.
4. The CLR eliminates many common software issues, like the handling of object
layout, references to objects, and garbage clearance. This type of memory
management prevents memory leaks and invalid memory references.
5. The CLR also accelerates developer productivity. The programmer is free to
choose the language of the application without worrying about compatibility
and integration issues. He is also enabled to take advantage of the runtime
and the class library of the .NET Framework, and to harvest components
from other applications written in different languages by different
developers. This implicitly eases the process of migration.
6. Though the CLR aims to be futuristic software, it lends support to existing
applications. The interoperability between managed and unmanaged
code makes this process extremely simple.
The design of the CLR is geared towards enhancing performance. Just-in-time
(JIT) compiling enables managed code to run in the native machine
language of the system executing it. During the process, the memory manager
removes the possibilities of fragmented memory and increases memory
locality-of-reference to enhance performance.
7. Finally, server-side applications can host the runtime. High-performance servers
like Microsoft SQL Server and Internet Information Services can host the
CLR, and the infrastructure so provided can be used to write business logic
while enjoying the best benefits of enterprise server support.

The Class Library is an object-oriented collection of reusable types. It is
comprehensive, and the types can be used to develop command-line applications or
GUI applications such as Web Forms, or XML Web services. Unmanaged
components that load the CLR into their processes can be hosted by the .NET
Framework to initiate the execution of managed code. This creates a software
environment that exploits both managed and unmanaged code. The .NET
Framework also provides a number of runtime hosts and supports third-party
runtime hosts.

1.3.2 Class Library Features

1. The class library is a collection of reusable types that integrate with the CLR.
2. It is object oriented and provides types from which user-defined types can
derive functionality. This makes for ease of use and saves time.
3. Third-party components can be integrated seamlessly with classes in the
.NET Framework.
4. It enables a range of common programming tasks such as string
management, data collection and file access.

5. It supports a variety of specialized development scenarios such as console
application development, Windows GUI applications, ASP.NET
applications, and XML Web services.

1.3.3 The Common Type System (CTS)

A number of types are supported by the CLR and are described by the CTS.
Both value types and reference types are supported. The primitive value types
include Byte, Int16, Double and Boolean, while reference types include arrays,
classes, and the object and string types. Reference types are types that store a
reference to the location of their values. The value is stored as part of a defined
class and is referenced through a class member on an instance of the class.
User-defined value types and enumerations are derived from the value types
mentioned above.
Language compilers implement types using their own terminology.
The process of converting a value type to a reference type and vice versa is
called boxing and unboxing. The implicit conversion of a value type to a reference
type is referred to as boxing. The explicit conversion of an object type into a
specific value type is referred to as unboxing.
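A minimal C# sketch of the two conversions (the variable names are illustrative):

```csharp
using System;

class BoxingDemo
{
    static void Main()
    {
        int value = 42;            // a value type (Int32)
        object boxed = value;      // boxing: implicit conversion to a reference type
        int unboxed = (int)boxed;  // unboxing: explicit conversion back to Int32

        Console.WriteLine(boxed);            // prints 42
        Console.WriteLine(unboxed == value); // prints True
    }
}
```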

1.3.4 The Common Language Infrastructure (CLI)

The CLI is a subset of the .NET Framework. It includes the functionality
of the Common Language Runtime and specifications for the Common Type
System, metadata and the Intermediate Language. A subset of the Framework
Class Library incorporates the base class library, a Network library, a Reflection
library, an XML library, and the Floating Point and Extended Array libraries.
The shared-source implementation of the CLI is available for both the FreeBSD
and Windows operating systems.

1.3.5 The Common Language Specification (CLS)

The CLR supports the CLS, which is a subset of it: a set of rules that
language and compiler designers follow. It provides robust interoperability
between the .NET languages and the ability to inherit classes written in one
language in any other .NET language. Cross-language debugging also becomes a
possibility in this scenario. It must be noted that the CLS rules apply only to the
publicly exposed features of a class.

1.3.6 Classes

A class is a blueprint of an object. All definitions of how a particular
object will be instantiated at runtime, its properties and methods, and its storage
structures are defined in the class. Developers create instances of a class at
runtime using the keyword "New" (in VB.NET) or "new" (in C#).

1.3.7 Namespaces

Namespaces are a key part of the .NET Framework. They provide scope for
both preinstalled framework classes and custom-developed classes. VB.NET uses
the "Imports" keyword to enable the use of member names from the declared
namespace; C# uses the "using" keyword. In both cases, once the System
namespace is imported, output can be written to the console window without
explicitly referring to System.Console.
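A minimal sketch combining the two ideas above: a class as a blueprint, instantiated with the "new" keyword, inside a namespace brought into scope by a "using" directive (all names here are illustrative):

```csharp
using System; // importing System lets us write Console instead of System.Console

namespace Demo
{
    // The class is the blueprint: properties and methods are defined here.
    class Student
    {
        public string Name { get; set; }
        public string Describe() => "Student: " + Name;
    }

    class Program
    {
        static void Main()
        {
            // An instance of the class is created at runtime with "new".
            var s = new Student { Name = "Andrei" };
            Console.WriteLine(s.Describe()); // prints "Student: Andrei"
        }
    }
}
```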

1.3.8 Assemblies

Assemblies are also known as managed DLLs. They are the fundamental unit
of deployment for the .NET platform; the .NET Framework itself is made up of a
number of assemblies. An assembly contains the Intermediate Language generated
by the language compiler, an assembly manifest, type metadata and resources.
Assemblies can be private or public. They are self-describing, and hence different
versions of the same assembly can be run simultaneously.

1.3.9 Intermediate language (IL)

This is a processor-independent representation of executable code. It is
similar to assembly code and specific to the CLR, and it is generated by the
language compilers that target the CLR. At runtime, the CLR just-in-time
compiles the IL to native code for execution. The tool ngen.exe, which is part of
the .NET Framework, pre-compiles assemblies to native code at install time and
caches the precompiled code to disk.

1.3.10 Managed execution

This refers to code whose execution is managed by the CLR, including
memory management, access security, cross-language integration for debugging,
exception handling and so on. Metadata must be generated for the code and the
assemblies so that the CLR can manage the execution of the code.

1.3.11 Manifests, Metadata and Attributes

Metadata and manifests are key aspects of managed code execution. The
portion of an assembly that contains descriptive information about the types
contained in the assembly, the members exposed by the assembly and the resources
required by the assembly is called the manifest. Metadata is contained within the
manifest. This metadata describes the assembly; some of it is generated by the
language compiler at compile time, and other metadata may be added by the
developer at design time. Declarations added to the code to describe or modify
some aspect of the code's behavior at runtime are known as attributes. These are
stored with an assembly as metadata and serve many useful purposes in the .NET
Framework.

1.3.12 Object Orientation in the .NET Framework

Objects are the core of object-oriented programming. Classes are blueprints
of objects and contain all the methods and properties of the object. Encapsulation,
inheritance and polymorphism are attributes of an object. Encapsulation is the
ability of an object to hide its internal data from outside view and allow access
only to the data that is publicly available. Inheritance is the ability to derive one
class from another.
New classes can be created from existing classes; the new class inherits
all the properties and methods of the old class, and new methods and events can be
added to the new class. This is useful when users want to create specialized classes.
Polymorphism is the ability of multiple classes derived from the same base class to
expose methods with the same name, regardless of the underlying implementation.

1.3.13 Rapid Development and Reuse

The object orientation of the .NET Framework provides for faster
development and deployment of applications. The use of classes and derived
classes to provide common functionality has gone a long way in reducing
development time. Object orientation is also the crucial element in the development
of the code-behind concept and the later code-beside concept. Code-behind allows
developers to separate executable code from the HTML markup of the user
interface. The executable code is placed in a module called the code-behind file.
This file contains a class that inherits from the Page class. The ASP.NET page
inherits from the code-behind class, and the two are compiled at runtime into a
single executable assembly.
.NET Framework 2.0 added a number of further features that aid rapid
development.

1.3.14 Choosing a Language

An important aspect of the .NET Framework is that developers can continue
to use the language of their choice in application development. The cross-language
interoperability in .NET makes it possible to create an application in any
.NET-supported language, as all languages work together smoothly through the
CLR, which compiles every language to the common Intermediate Language.

1.4 Design Patterns
In software engineering, a design pattern is a general, repeatable solution to a
commonly occurring problem in software design. A design pattern is not a finished
design that can be transformed directly into code; it is a description or template for
how to solve a problem that can be used in many different situations.
Design patterns can speed up the development process by providing tested,
proven development paradigms. Effective software design requires considering
issues that may not become visible until later in the implementation. Reusing
design patterns helps to prevent subtle issues that can cause major problems, and
improves code readability for coders and architects familiar with the patterns.
Often, people only understand how to apply certain software design
techniques to certain problems. These techniques are difficult to apply to a broader
range of problems. Design patterns provide general solutions, documented in a
format that doesn't require specifics tied to a particular problem.
In addition, patterns allow developers to communicate using well-known,
well-understood names for software interactions. Common design patterns can be
improved over time, making them more robust than ad-hoc designs.
Creational design patterns are all about class instantiation. They can be
further divided into class-creation patterns and object-creation patterns. While
class-creation patterns use inheritance effectively in the instantiation process,
object-creation patterns use delegation effectively to get the job done.

1.4.1 Factory Pattern

Factory Method is to creating objects what Template Method is to
implementing an algorithm. A superclass specifies all standard and generic
behavior (using pure virtual "placeholders" for the creation steps), and then
delegates the creation details to subclasses that are supplied by the client.
Factory Method makes a design more customizable and only a little more
complicated. Other design patterns require new classes, whereas Factory Method
only requires a new operation.
People often use Factory Method as the standard way to create objects, but it
is not necessary if the class that is instantiated never changes, or if instantiation
takes place in an operation that subclasses can easily override (such as an
initialization operation). Factory Method is similar to Abstract Factory but without
the emphasis on families. Factory Methods are routinely specified by an
architectural framework, and then implemented by the user of the framework.
Rules of thumb
 Abstract Factory classes are often implemented with Factory Methods, but
they can also be implemented using Prototype.
 Factory Methods are usually called within Template Methods.
 Factory Method: creation through inheritance. Prototype: creation through
delegation.
 Often, designs start out using Factory Method (less complicated, more
customizable, subclasses proliferate) and evolve toward Abstract Factory,
Prototype, or Builder (more flexible, more complex) as the designer
discovers where more flexibility is needed.
 Prototype doesn't require subclassing, but it does require an Initialize
operation. Factory Method requires subclassing, but doesn't require Initialize.
 The advantage of a Factory Method is that it can return the same instance
multiple times, or can return a subclass rather than an object of that exact
type.
 Some Factory Method advocates recommend that, as a matter of language
design (or, failing that, as a matter of style), absolutely all constructors should
be private or protected. It is no one else's business whether a class
manufactures a new object or recycles an old one.
 The new operator can be considered harmful. There is a difference between
requesting an object and creating one. The new operator always creates an
object and fails to encapsulate object creation. A Factory Method enforces
that encapsulation and allows an object to be requested without inextricable
coupling to the act of creation.
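As a hedged illustration of "creation through inheritance" (all type names below are invented for the sketch): the creator defines the generic behavior, and the subclass decides which concrete object to create.

```csharp
using System;

abstract class Document
{
    public abstract string Render();
}

class Report : Document
{
    public override string Render() => "report body";
}

// The creator specifies the generic behavior and delegates creation to subclasses.
abstract class DocumentCreator
{
    // The factory method: the "virtual placeholder" for the creation step.
    protected abstract Document CreateDocument();

    public string Publish()
    {
        Document doc = CreateDocument(); // creation through inheritance
        return "published: " + doc.Render();
    }
}

class ReportCreator : DocumentCreator
{
    protected override Document CreateDocument() => new Report();
}

class FactoryDemo
{
    static void Main()
    {
        DocumentCreator creator = new ReportCreator();
        Console.WriteLine(creator.Publish()); // prints "published: report body"
    }
}
```

Note that adding a new document kind only requires a new creator subclass with a new CreateDocument operation; the Publish algorithm stays untouched.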

1.4.2 Singleton Pattern

The singleton pattern is one of the best-known patterns in software
engineering. Essentially, a singleton is a class which only allows a single instance
of itself to be created, and usually gives simple access to that instance. Most
commonly, singletons don't allow any parameters to be specified when creating the
instance, as otherwise a second request for an instance with a different parameter
could be problematic. (If the same instance should be accessed for all requests
with the same parameter, the factory pattern is more appropriate.) This section
deals only with the situation where no parameters are required. Typically, a
requirement of singletons is that they are created lazily, i.e. the instance isn't
created until it is first needed.
There are various ways of implementing the singleton pattern in C#, ranging
from the most commonly seen version, which is not thread-safe, up to a fully
lazily-loaded, thread-safe, simple and highly performant version.
All these implementations share four common characteristics, however:

 A single constructor, which is private and parameterless. This prevents other
classes from instantiating it (which would be a violation of the pattern). Note
that it also prevents subclassing: if a singleton can be subclassed once, it can
be subclassed twice, and if each of those subclasses can create an instance,
the pattern is violated. The factory pattern can be used if you need a single
instance of a base type but the exact type isn't known until runtime.
 The class is sealed. Strictly speaking this is unnecessary, due to the above
point, but it may help the JIT to optimise things.
 A static variable which holds a reference to the single created instance, if
any.
 A public static means of getting the reference to the single created instance,
creating one if necessary.
Note that all of these implementations also use a public static property Instance as
the means of accessing the instance. In all cases, the property could easily be
converted to a method, with no impact on thread-safety or performance.
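One fully lazy, thread-safe variant can be sketched as follows; it exhibits all four characteristics listed above, relying on the CLR's guarantee that static fields are initialized lazily and in a thread-safe way:

```csharp
public sealed class Singleton
{
    // Private, parameterless constructor: prevents outside instantiation
    // and, together with "sealed", prevents subclassing.
    private Singleton() { }

    // Public static access point; the instance is created lazily on first use.
    public static Singleton Instance => Holder.instance;

    private static class Holder
    {
        // The CLR initializes this static field in a thread-safe, lazy manner.
        internal static readonly Singleton instance = new Singleton();
    }
}
```

Every read of Singleton.Instance returns the same reference, so object.ReferenceEquals(Singleton.Instance, Singleton.Instance) is always true.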
1.5 JSON
JSON (JavaScript Object Notation) is a format for the representation and
interchange of data between computer applications. It is a human-readable text
format used to represent objects and other data structures, and it is especially used
to transmit structured data over the network in serialized form. JSON is simpler
and easier to use than XML. The elegance of the JSON format comes from the fact
that it is a subset of the JavaScript language and is used in conjunction with that
language.
JSON is built on two structures:
 a collection of name/value pairs. In different languages, this is realized as an
object, record, structure, dictionary, hash table, keyed list, or associative array;
 an ordered list of values. In most languages, this is realized as an array,
vector, list, or sequence.

An object is an unordered set of name/value pairs. An object begins with
{ (left brace) and ends with } (right brace). Each name is followed by : (colon),
and the name/value pairs are separated by , (comma).

{"firstName": "Popescu", "lastName": "Ion"}

An array is an ordered collection of values. An array begins with [ (left
bracket) and ends with ] (right bracket). Values are separated by , (comma).

"students": [
    {"firstName": "Popescu", "lastName": "Ion"},
    {"firstName": "George", "lastName": "Radulescu"}
]
A value can be a string in double quotes, a number, true or false, null, an
object, or an array. These structures can be nested.

{
    "id": 1,
    "name": "Adrian",
    "flag": true,
    "tags": ["student", "facultate"]
}
A string is a sequence of zero or more Unicode characters enclosed in double
quotes. A character is represented as a string of one character.

{
    "name": "Adrian",
    "character": "A"
}

A number is very similar to a C or Java number, except that octal and
hexadecimal formats are not used.
{
"id": 8
}
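In .NET code, such a document is typically mapped onto a class. The sketch below assumes the Json.NET (Newtonsoft.Json) package is referenced; the Person class is invented for the example:

```csharp
using Newtonsoft.Json; // assumes the Json.NET (Newtonsoft.Json) package is referenced

class Person
{
    public string firstName { get; set; }
    public string lastName { get; set; }
}

class JsonDemo
{
    static void Main()
    {
        string json = "{\"firstName\":\"Popescu\",\"lastName\":\"Ion\"}";
        Person p = JsonConvert.DeserializeObject<Person>(json);
        System.Console.WriteLine(p.lastName); // prints "Ion"
    }
}
```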

1.6 Topshelf
Topshelf is a framework for hosting services written using the .NET
Framework. It simplifies the creation of services, allowing developers to create a
simple console application that can be installed as a service using Topshelf. The
reason for this is simple: it is far easier to debug a console application than a
service. Once the application is tested and ready for production, Topshelf makes it
easy to install the application as a service.
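A minimal Topshelf host, sketched from the library's documented HostFactory API (the IndexingService class and the service names below are invented for this example):

```csharp
using Topshelf; // assumes the Topshelf NuGet package is referenced

public class IndexingService
{
    public void Start() { /* open connections, load modules */ }
    public void Stop()  { /* release resources */ }
}

public class Program
{
    public static void Main()
    {
        HostFactory.Run(cfg =>
        {
            cfg.Service<IndexingService>(s =>
            {
                s.ConstructUsing(name => new IndexingService());
                s.WhenStarted(svc => svc.Start());
                s.WhenStopped(svc => svc.Stop());
            });
            cfg.RunAsLocalSystem();
            cfg.SetServiceName("WebsiteIndexing");
            cfg.SetDisplayName("Website Indexing Service");
        });
    }
}
```

Run as a plain console application this executes directly; run with the `install` argument Topshelf registers the same executable as a Windows service.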

2 SYSTEM SPECIFICATIONS AND ARCHITECTURE

The project originated 1-2 years before the Windows 10 operating system
appeared, when the notion of a personal assistant did not exist in the Microsoft
Windows ecosystem; it existed only on mobile devices running Windows 8.1, and
there were other implementations on game consoles.
With the desire to ease day-to-day tasks, the project has been designed as a
Windows service that can be extended with various functionalities added through
DLLs.
Basic architecture consists of:
 Windows Service:
It was created using the TopShelf library, and provides a console
interface that allows the installation and uninstallation of the application and
also facilitates the development process. This service has 3 modules:
 Services (the executable itself)
 Services.Utils (a collection of helpers)
 Services.Bridge

 Quartz Scheduler Server: handles tasks at certain times specified by the
system.
 MongoDB Server: stores the records generated by the application.

In order to provide a good example for this application, I have chosen a
torrent website because it contains a more complex structure and a large amount of
information. The information within these types of sites is divided into categories,
and also split across paginated pages.
Other types of sites that contain a lot of information, hierarchically divided
into categories and other areas, are movie sites (e.g. IMDB).

3 APPLICATION DESIGN
3.1 Windows Service
Windows Service contains 3 modules:
 Services
 Services.Utils
 Services.Bridge

3.1.1 Services

The Services module provides the functionality of installing the service in
Windows, as well as uninstalling it. In order to run, it needs a database server
(MongoDB) and a server that allows scheduling certain tasks, similar to Task
Scheduler. The TopShelf library allows us to install/uninstall the application in a
simpler and faster way, letting us specify the sets of instructions that are executed
at various points of the installation/uninstallation process and of the startup and
shutdown of the service.
Before installing itself as a service, the application checks whether the
running system meets the required prerequisites: a Quartz server and a MongoDB
server. If the dependency verification test does not pass, the application continues
by downloading and installing the necessary services. Before the startup process
begins, the "Modules" directory is scanned for DLL files, and if a module is
registered as an extension of the service, it is loaded automatically by the
application. The service also opens a connection to the Quartz Scheduler server.
The test that validates whether a DLL module extends the functionality of
the service is performed using the .NET Reflection methods. If the DLL library
implements the "IModule" interface, the test passes, the module is validated, and
the service incorporates its functionality. After the module has been loaded, the
service also checks through Reflection whether the module requires a Quartz job
to be registered.
When the Windows service is started, a file system watcher is attached to the
"Modules" folder, tracking the addition, deletion, or modification of DLLs. If the
watcher detects any of these events, the service shuts down and then restarts in
order to use the new features.
When the service is stopped by the user, the watcher no longer listens for the
changes that appear in the "Modules" folder.
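A sketch of such a watcher using the standard FileSystemWatcher class (the restart callback is left abstract, as the actual restart mechanism is not specified in the thesis):

```csharp
using System;
using System.IO;

public static class ModuleWatcher
{
    // Watches the "Modules" folder; any DLL change triggers a service restart.
    public static FileSystemWatcher Watch(string modulesPath, Action restartService)
    {
        var watcher = new FileSystemWatcher(modulesPath, "*.dll")
        {
            NotifyFilter = NotifyFilters.FileName | NotifyFilters.LastWrite
        };
        watcher.Created += (s, e) => restartService();
        watcher.Deleted += (s, e) => restartService();
        watcher.Changed += (s, e) => restartService();
        watcher.EnableRaisingEvents = true; // set to false to stop listening
        return watcher;
    }
}
```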

3.1.2 Services.Utils

Services.Utils is the library that provides methods for verifying whether the
system meets the necessary requirements, and for installing the dependencies when
the system requirements are not met. It also provides the functionality through
which the service communicates with the Quartz Scheduler server for recurring
jobs.

3.1.3 Services.Bridge

Services.Bridge defines the interface that any module created to extend the
service's functionality needs to implement. The interface exposes the module
registration features, which may include initiating a connection to a third-party
service, loading a configuration file, and exposing a general logon property.
3.2 Website Indexing
The main theme of the project is website indexing.
When the Windows service is running and has the Website Indexing
functionality enabled, it updates the new records within the database for a specific
site, within a predefined and configurable timeframe.
The primary requirement for this module to work is a JSON file that contains
a mapping of the fields to analyze within the site, and of how to interpret them.

3.2.1 JSON Configuration File

The main example from which the module started is a site that allows us
to download and view movies. One of the most important configuration sections is
the URL structure. This allows us to specify the base URL of the site, and then
create patterns for the related pages and details.

{
    "base": "http://baseURL.ro",
    "browse": "$ref[base]/browse.php",
    "category": "$ref[browse]?cat={0}&includedead=0",
    "details": "$ref[base]/details.php?id={0}",
    "download": "$ref[base]/download.php?id={0}&file={1}",
    "pagination": "$ref[browse]?cat={0}&page={1}&includedead=0",
    "loginUrl": "$ref[base]/login.php",
    "loginPostData": "$ref[base]/takelogin.php"
}

As can be seen in this example, the structure is very simple and easy to
understand. Once the base address of the site has been specified, the rest of the
pages can be built using the $ref[NAME] key as a template, where NAME can be
any of the previously specified keys.
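The $ref[NAME] expansion can be sketched as a small recursive substitution; this helper is an illustration, not the thesis' actual implementation:

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class UrlTemplates
{
    // Expands every $ref[NAME] token against the other keys of the URL section.
    public static string Resolve(IDictionary<string, string> urls, string key)
    {
        string value = urls[key];
        return Regex.Replace(value, @"\$ref\[(\w+)\]",
            m => Resolve(urls, m.Groups[1].Value)); // recursive expansion
    }
}
```

With the base/browse/category entries from the example above, Resolve(urls, "category") yields "http://baseURL.ro/browse.php?cat={0}&includedead=0", into which string.Format then inserts the category id.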

Some sites require authentication to access the content. Therefore, the
configuration file also contains the Auth section, which requires us to enter
encrypted authentication credentials and field names.
The base site from which the module's idea started has its content divided
into categories. In order to track the existing categories as easily as possible, the
"Categories" section must be configured. It requires, through the "Path" property,
an XPath value that represents exactly where the data set is found in the DOM
structure. Because not everyone is interested in tracking the same categories, there
is also the "Exclusions" property, a set of name and id structures that tell the
indexing algorithm that entries matching those values should not be saved.

Sites whose content is too large to be displayed on one page use paging; this
is covered by the Pagination section of the configuration. It requires the "Path"
property, an XPath address that specifies where the paginator is located on the
page.
The "Record" section of the JSON file describes the records on the page, and
can contain a multitude of fields that provide the information needed for better
cataloguing. Each cell in "Record" contains a "Use" value that tells the parser
whether or not to store the corresponding information. There are also the
following cells (properties):
 Path
 Image
 Title (which features 3 more properties besides "Use": url, text (the movie
title), and genre (the movie type))
 Download Link (this section gives us the download address of the movie)
 Files (gives us information about the number of files present in the package
to be downloaded)
 Comments
 Date_Added (a mandatory field, because when the parser runs it checks
whether the current date has already been registered in the database, and
based on this, indexing continues or stops; besides the properties listed so
far, this section offers an extra property called "Format", which allows us to
specify how the DateTime structure is displayed within the site)
 Size (allows us to analyze, normalize and store the size of the package that
could be downloaded)
 Downloaded (indicates how many times the package was downloaded)
 Seeds
 Leechers
 Uploader (the name of the one who created the package)
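As a purely hypothetical illustration of how such a section might look (the XPath expressions, property casing and date format below are invented for the sketch, not taken from the real configuration file):

```json
{
  "Record": {
    "Path": "//table[@class='torrents']//tr",
    "Title": {
      "Use": true,
      "url": "td[2]//a/@href",
      "text": "td[2]//a/b",
      "genre": "td[2]/span"
    },
    "Date_Added": {
      "Use": true,
      "Path": "td[5]",
      "Format": "yyyy-MM-dd HH:mm:ss"
    },
    "Size": { "Use": true, "Path": "td[6]" },
    "Comments": { "Use": false }
  }
}
```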

3.2.2 Crawling

When the Windows service has been started and the Indexing module has
been registered, the module searches its own directory called Config for all JSON
files with the previously mentioned structure. For each detected file, an indexing
engine is instantiated in a dictionary that has the name of the file as its key. For
each detected engine, the module signals to the Windows service that it is
necessary to register 2 new jobs in the Quartz service, namely: indexing categories
and indexing movies (or packages).
When the Quartz Scheduler service's internal monitoring mechanism detects
that one of the two jobs mentioned above has to be run, the required crawler is
initialized. So, if the category indexing crawler is to run, the Quartz service calls
the Execute method exposed by the IJob interface implemented by the "Category
Scrapper" class. For each configuration file found in the directory, an instance of
the indexing engine is generated, and if the site requires authentication, the
authentication instructions are executed using the credentials in the module
configuration file. After authentication has been successfully completed, the
category parser is instantiated and the results are obtained (Engine.GetCategories).
The method that gets the categories downloads the HTML source of the page and
looks for the container that was specified in the JSON file in the Categories
section. For each child found at the XPath address provided by the configuration
file, the values (id and name) are taken, and before saving them to memory it is
checked whether they are in the Exclusions list.
In the case of the other Quartz job, when the Execute method is called, the
name of the engine to be instantiated is passed as a parameter. After initializing the
engine, it is checked again whether authentication is required for accessing the
content, and each category that the site presents is taken. While iterating through
the category collection, the current category is transmitted as a parameter to the
Proceed method exposed by the IEngine interface. To avoid overloading the server
with multiple requests, a 1-hour delay has been implemented.
The Proceed method in the IEngine interface accesses the site page using the
pattern declared in the URL section of the configuration file, and loads the HTML
source into memory. It detects the number of pages available in the category by
calling the GetTotalPages method within the PaginationParser parser (analyzing
each link present in the page, extracting the id attribute from each record,
de-duplicating by checking whether the value is already present within an internal
list, adding it if not, and finally returning the maximum of that list, which is the
maximum number of pages for the category). Having the total number of pages per
category, we can continue by accessing the URLs of the paginated categories as
they were described in the URL section of the configuration file. The process is as
follows:
o An instance of the TorrentRecord parser is created, and the GetRecords
method is called. This method searches the page already loaded in memory
for all records that meet the criteria defined in the configurator. For each
child found in the record container, a Torrent object is initialized and
assigned a unique identifier (Guid). Then, using the "Use" flag value of each
property, the parser determines the set of information it will store in the
database (in the generated object).
o After all the properties in the configurator have been parsed, the record is
completed with details about the package provider and the category it
belongs to. Before it is saved, a check is made that there is no entry in the
indexing history of that category containing the same value for the
Date_Added property. If this criterion is not met, the collection of records
continues; once the condition is satisfied, the algorithm stops for the current
category and continues with the next one.
o Finally, all data is stored in the MongoDB server.

4 Conclusions

The application presented set out to collect data as easily and as
automatically as possible. With the technologies used, I have been able to create an
app that is easy to use for any type of website; only the JSON configuration file
needs to be adapted to the type of website.
Given that, at the moment, sites are a must for everything and data is
required for any type of activity, a method is needed to store that data as easily as
possible.
Deployment as a Windows service has been chosen because it makes it
easier for something to run automatically; we simply provide the configuration,
without having to start the application by hand. The user will be able to use the
application very easily (existing site types work out of the box, and others can be
configured) using only the configuration file that needs to be set up.


Bibliography

[1] Gaurav Vaish – Getting Started with NoSQL, 2013
[2] Brad Dayley – NoSQL with MongoDB in 24 Hours, 2014
[3] Thuan L. Thai – .NET Framework Essentials, 2001
[4] Joe Duffy – Professional .NET Framework 2.0, 2006
[5] Eric Freeman, Elisabeth Robson – Head First Design Patterns, O'Reilly, 2014
[6] http://topshelf-project.com/
[7] https://docs.mongodb.com/
[8] http://en.wikipedia.org/wiki/JSON
[9] http://json.org/
