„BABEȘ-BOLYAI” UNIVERSITY
FACULTY OF MATHEMATICS AND COMPUTER SCIENCE
SPECIALIZATION COMPUTER SCIENCE
Diploma thesis
SQL/NOSQL DATABASE
SYNCHRONIZATION
Scientific coordinator Author
Dr. POP Emilia ISTRATE Cătălin-Flaviu
Cluj-Napoca
2018
Contents
Abstract .......... 4
Introduction .......... 5
    Background .......... 5
    Purpose .......... 6
    Definition of terms .......... 7
Problem description and analysis .......... 8
    Problem statement .......... 8
    SQL database characteristics .......... 9
        Relational model .......... 9
        Keys .......... 10
        Relationships .......... 10
        Index .......... 11
        Stored procedures .......... 12
    Document oriented NOSQL database characteristics .......... 12
    Entity correspondence between relational/non-relational databases .......... 14
Related work .......... 15
    DBMS history .......... 16
    NOSQL history .......... 16
    Notable individuals .......... 16
    Existing database synchronization methods .......... 16
Proposed Solution .......... 16
    Approach .......... 16
        Solution main flow .......... 16
        Structure of the input data .......... 17
    Configurations .......... 24
        Last run configuration .......... 24
        Couchbase connection .......... 24
        SQL connection .......... 25
        Scheduled task .......... 25
    Database Operations .......... 25
        SQL operations .......... 25
        Couchbase operations .......... 27
    Genericity .......... 29
    Data freshness comparison .......... 29
    Task scheduling .......... 29
    Relevant technologies .......... 29
        Core technologies .......... 29
        Database technologies .......... 29
    Running options .......... 29
Result .......... 30
    Run examples .......... 30
    Performance .......... 30
    Limitations .......... 30
Future development .......... 31
    Code improvements .......... 31
    Functionality improvements .......... 31
    New functionalities .......... 31
Conclusions .......... 32
Bibliography .......... 33
Abstract
SQL/NOSQL DATABASE SYNCHRONIZATION
by Cătălin-Flaviu Istrate
July 2018
When an application is rewritten on top of a new database technology, the old and the new system often run side by side for a while, and their databases must hold the same data even though changes are performed in both of them. This thesis presents the design and implementation of a stand-alone API that keeps a relational Microsoft SQL Server database synchronized with a non-relational Couchbase Server database. Entities of the two databases are matched through generic, JSON-based mappings, changed data is detected through insert and update timestamps maintained on every record, and conflicting modifications are merged so that the freshest copy wins. The synchronization runs periodically as a scheduled task; it is designed to support about 50,000 modifications per hour and to propagate any change to the other database within 10 minutes.
This work is the result of my own activity. I have neither given nor received unauthorized assistance on this work.
Introduction
Background
In a world of technology advancing at an exponential pace, and in a century characterized by
lack of time and a desire for speed, computer software must adapt in order to completely satisfy
the users' needs. With the increase in accessibility of the Internet and the availability of cheap
storage, huge amounts of structured, semi-structured, and unstructured data are captured and
stored for a variety of applications. Such data is commonly referred to as Big Data [1].
Considering the fact that “The data volumes are exploding; more data has been created in the past
two years than in the entire previous history of the human race.” [2], software developers are
looking for more dynamic, flexible and easily scalable database solutions, and the current trend is
to opt for NOSQL technologies, which are better than SQL technologies at retrieving and generally
working with data, especially with static data. Processing this vast amount of information requires
speed, flexible schemas, and distributed architectures. NoSQL databases became the preferred
solution for operating on Big Data, being the closest to satisfying these requirements.
One of the NOSQL technologies that became popular is Couchbase Server. Couchbase
Server, originally known as Membase, is an open-source, distributed (shared-nothing
architecture) multi-model NoSQL document-oriented database software package that is optimized
for interactive applications. [4] It performs better than SQL technologies and some other NOSQL
technologies for most of the operations. (Figure 1)
Couchbase is not only one of the fastest non-relational database systems for document
retrieval, but it also performs very well for data writing when compared to other popular systems.
(Figure 2)
Figure 1 – Time for reading (ms) [5]
Figure 2 – Time for writing (ms) [5]
Purpose
Some applications are already implemented using SQL technologies and other technologies
that might be obsolete for today's requirements, and their owners choose to rewrite them using the
latest technologies, possibly including a NOSQL database. Considering the fact that most of the IT
development industry is based on Agile methodologies (~71% by 2018) [3], which include continuous
integration, at some point the old application and an initial version of the new one could be running
simultaneously, offering the users the option to switch between them whenever they want. In this
case, the old database and the new one must be kept synchronized, containing the same data, even
though operations are performed in both of them and the structure of the data is not similar (it
cannot be, considering the fact that entities specific to SQL and NOSQL have no exact
correspondence). This could be done using connections from both apps to both databases, but this is
not desired because it requires implementation for the old app to work also with NOSQL and for
the new app to work with SQL. This solution is also not really good because it doubles
the time needed for all database operations. A better solution would be to have a single API that
has connections to both of the databases, checks for changed data and merges the changes
periodically. The only downside of this solution is the time between synchronizations, when
changes from one app do not yet reflect in the other one, but this is not a problem if the period is
short enough, e.g. a matter of minutes.
The purpose of my work is to design and implement such a way of synchronization between
one of the most used SQL technologies, Microsoft SQL Server, and one of the most
powerful NOSQL systems, as described previously, Couchbase Server. In order to be truly useful,
the solution must be created as a completely stand-alone API, with generic workflows and an easy
way of configuring the correspondence between data entities in both databases. By achieving
this, there will be no time needed to be spent on implementing methods to use both databases in
the case of rewriting an application, no time spent on designing the proper correspondence between
entities, and only a little time for the configurations needed to insert input data in the API that will map
the old database to the new database. Even though the migration will be a step-by-step process, the
API must handle this, and again, only input mappings have to be added with each further step of the
migration.
Definition of terms
Problem description and analysis
Problem statement
Two databases with different internal structures are given; they must keep
exactly the same data, each using its own structure, and also synchronize when changes are
performed in one of them. The first database (source) is a relational Microsoft SQL Server 2008.
The second database (destination) is a non-relational Couchbase Server 4.x.
The initial state of the databases is the following: the source is heavily populated, meanwhile
the destination is empty. It is desired that initially an import of selected data from source be performed
into destination, in order to have a “moment 0” in the synchronization timeline. Considering the fact
that destination does not support or have any correspondence for programmability features (stored
procedures, functions, triggers, etc.), there is no way of replicating them; hence, these will not be
needed in the initial import or in any other further synchronization. The main focus of the
synchronization will be entities.
After the initial import of data, a large number of alterations are performed independently
in both source and destination. We must exchange data between databases in such a way that only
the freshest is kept. Breaking down the problem into smaller issues, we get the following to be
solved:
a) Match existing entities in source to new ones in destination as much as possible, in order
not to change the logic and purpose behind them
b) Bringing data from source to destination when a certain entry was modified only at source
c) Bringing data from destination to source when a certain entry was modified only at destination
d) Merging data using the most recent modification as criterion and saving the result to the
less fresh side when a certain entry was modified on both sides
The merging process and any kind of chronological comparison between pieces of data will
be done using the update timestamp and the insert timestamp, which are available on every set of
information. In the source database they are updated via triggers, and in the destination database
they are updated via operations performed within a Java application. Both timestamps must always
be available, and the update timestamp is populated and equal to the insert timestamp at the creation of
an instance of an entity. The timestamps are independent of time zone, being declared as a
long integer which represents the number of milliseconds since the epoch (1st January 1970).
The synchronization process must be efficient enough to support about
50,000 modifications per hour. The synchronization does not have to be instant, but after a modification
is performed, there should be no more than 10 minutes before it reflects in the other database.
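To make the merge rule concrete, here is a minimal sketch of the freshness comparison, assuming both sides expose their update timestamps as epoch milliseconds; the class and method names are illustrative, not taken from the actual implementation:

// Decides where an entry that exists on both sides must be written,
// implementing cases b), c) and d) above; timestamps are epoch millis,
// so the comparison is independent of time zones.
enum SyncAction { COPY_TO_DESTINATION, COPY_TO_SOURCE, NONE }

final class FreshnessComparator {

    SyncAction decide(long srcUpdateTs, long dstUpdateTs, long lastRunTs) {
        boolean srcChanged = srcUpdateTs > lastRunTs;
        boolean dstChanged = dstUpdateTs > lastRunTs;
        if (srcChanged && !dstChanged) return SyncAction.COPY_TO_DESTINATION; // case b)
        if (!srcChanged && dstChanged) return SyncAction.COPY_TO_SOURCE;      // case c)
        if (srcChanged && dstChanged) {                                       // case d): freshest wins
            return srcUpdateTs >= dstUpdateTs
                    ? SyncAction.COPY_TO_DESTINATION
                    : SyncAction.COPY_TO_SOURCE;
        }
        return SyncAction.NONE; // untouched on both sides since the last run
    }
}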
SQL database characteristics
Relational model
Use a simple data structure: The Table
• simple to understand
• useful data structure (capture many situations)
• leads to a useful, not too complex query language
Use mathematics to describe and represent records and collections of records: The Relation
• can be understood formally
• leads to formal query languages
• properties can be explained and proven [6]
The relational model (RM) for database management is an approach to managing data using
a structure and language consistent with first-order predicate logic, where all data is represented in
terms of tuples, grouped into relations. A database organized in terms of the relational model is a
relational database. The purpose of the relational model is to provide a declarative method for
specifying data and queries: users directly state what information the database contains and what
information they want from it, and let the database management system software take care of
describing data structures for storing the data and retrieval procedures for answering queries.
Most relational databases use the SQL data definition and query language; these systems
implement what can be regarded as an engineering approximation to the relational model. A table
in an SQL database schema corresponds to a predicate variable; the contents of a table to a relation;
key constraints, other constraints, and SQL queries correspond to predicates. However, SQL
databases deviate from the relational model in many details. In the relational data model the data is
organized into tables; these tables are called relations. [7]
Keys
Each row in a table has its own unique key. Rows in a table can be linked to rows in other
tables by adding a column for the unique key of the linked row (such columns are known as foreign
keys). Data relationships of arbitrary complexity can be represented by a simple set of concepts.
Part of this processing involves consistently being able to select or modify one and only
one row in a table. Therefore, most physical implementations have a unique primary key (PK) for
each table. When a new row is written to the table, a new unique value for the primary key is
generated; this is the key that the system uses primarily for accessing the table. System performance
is optimized for PKs. Other, more natural keys may also be identified and defined as alternate keys
(AK). Often several columns are needed to form an AK (this is one reason why a single integer
column is usually made the PK). Both PKs and AKs have the ability to uniquely identify a row
within a table. Additional technology may be applied to ensure a unique ID across the world, a
globally unique identifier, when there are broader system requirements.
The primary keys within a database are used to define the relationships among the tables.
When a PK migrates to another table, it becomes a foreign key in the other table. When each cell
can contain only one value and the PK migrates into a regular entity table, this design pattern can
represent either a one-to-one or one-to-many relationship. Most relational database designs resolve
many-to-many relationships by creating an additional table that contains the PKs from both of the
other entity tables; the relationship becomes an entity, the resolution table is then named
appropriately and the two FKs are combined to form a PK. The migration of PKs to other tables is
the second major reason why system-assigned integers are normally used as PKs; there is seldom
efficiency or clarity in migrating a group of columns of other types.
Relationships
Relationships are logical connections between different tables, established on the basis of
interaction among these tables.
One-to-one
In a relational database, a one-to-one relationship exists when one row in a table may be
linked with only one row in another table and vice versa. It is important to note that a one-to-one
relationship is not a property of the data, but rather of the relationship itself. A list of mothers and
their children may happen to describe mothers with only one child, in which case one row of the
mothers table will refer to only one row of the children table and vice versa, but the relationship
itself is not one-to-one, because mothers may have more than one child, thus forming a one-to-many
relationship. [8]
One-to-many
In a relational database, a one-to-many relationship exists when one row in table A may be
linked with many rows in table B, but one row in table B is linked to only one row in table A. It is
important to note that a one-to-many relationship is not a property of the data, but rather of the
relationship itself. A list of authors and their books may happen to describe books with only one
author, in which case one row of the books table will refer to only one row of the authors table, but
the relationship itself is not one-to-many, because books may have more than one author, forming
a many-to-many relationship. The opposite of one-to-many is many-to-one. [8]
Many-to-many
In a relational database management system, such relationships are usually implemented
by means of an associative table (also known as junction table or cross-reference table), say, AB
with two one-to-many relationships A -> AB and B -> AB. In this case the logical primary key for
AB is formed from the two foreign keys (i.e. copies of the primary keys of A and B). [8]
Index
An index is one way of providing quicker access to data. Indexes can be created on any
combination of attributes on a relation. Queries that filter using those attributes can find matching
tuples directly using the index, without having to check each tuple in turn. This is analogous to
using the index of a book to go directly to the page on which the information you are looking for
is found, so that you do not have to read the entire book to find what you are looking for. Relational
databases typically supply multiple indexing techniques, each of which is optimal for some
combination of data distribution, relation size, and typical access pattern. Indices are usually
implemented via B+ trees, R-trees, and bitmaps. Indices are usually not considered part of the
database, as they are considered an implementation detail, though indices are usually maintained
by the same group that maintains the other parts of the database. The use of efficient indexes on
both primary and foreign keys can dramatically improve query performance. This is because B-tree
indexes result in query times proportional to log(n), where n is the number of rows in a table,
and hash indexes result in constant time queries (no size dependency as long as the relevant part of
the index fits into memory). [9]
Stored procedures
A stored procedure (also termed proc, storp, sproc, StoPro, StoredProc, StoreProc, sp, or
SP) is a subroutine available to applications that access a relational database management system
(RDBMS). Such procedures are stored in the database data dictionary.
Uses for stored procedures include data validation (integrated into the database) or access-control
mechanisms. Furthermore, stored procedures can consolidate and centralize logic that was
originally implemented in applications. To save time and memory, extensive or complex
processing that requires execution of several SQL statements can be saved into stored procedures,
and all applications call the procedures. One can use nested stored procedures by executing one
stored procedure from within another.
Stored procedures may return result sets, i.e., the results of a SELECT statement. Such
result sets can be processed using cursors, by other stored procedures, by associating a result-set
locator, or by applications. Stored procedures may also contain declared variables for processing
data and cursors that allow them to loop through multiple rows in a table. Stored-procedure
flow-control statements typically include IF, WHILE, LOOP, REPEAT, and CASE statements, and
more. Stored procedures can receive variables, return results or modify variables and return them,
depending on how and where the variable is declared. [9]
Document oriented NOSQL database characteristics
A document-oriented database, or document store, is a computer program designed for
storing, retrieving and managing document-oriented information, also known as semi-structured
data. Document-oriented databases are one of the main categories of NoSQL databases, and the
popularity of the term "document-oriented database" has grown [10] with the use of the term
NoSQL itself.
Document-oriented databases are inherently a subclass of the key-value store, another
NoSQL database concept. The difference lies in the way the data is processed; in a key-value store,
the data is considered to be inherently opaque to the database, whereas a document-oriented system
relies on internal structure in the document in order to extract metadata that the database engine
uses for further optimization. Although the difference is often moot due to tools in the systems (to
the point that document-oriented and key-value systems can often be interchanged in operation),
conceptually the document store is designed to offer a richer experience with modern programming
techniques.
The central concept of a document-oriented database is the notion of a document. While
each document-oriented database implementation differs on the details of this definition, in general,
they all assume documents encapsulate and encode data (or information) in some standard format
or encoding. Encodings in use include XML, YAML, JSON, and BSON, as well as binary forms
like PDF and Microsoft Office documents (MS Word, Excel, and so on).
Documents in a document store are roughly equivalent to the programming concept of an
object. They are not required to adhere to a standard schema, nor will they have all the same
sections, slots, parts, or keys. Generally, programs using objects have many different types of
objects, and those objects often have many optional fields. Every object, even those of the same
class, can look very different. Document stores are similar in that they allow different types of
documents in a single store, allow the fields within them to be optional, and often allow them to be
encoded using different encoding systems.
The structure and text and other data inside the document are usually referred to as the
document's content and may be referenced via retrieval or editing methods. Unlike a
relational database, where every record contains the same fields, leaving unused fields empty, a
document has no empty 'fields'. This approach allows new
information to be added to some records without requiring that every other record in the database
share the same structure.
Document databases typically provide for additional metadata to be associated with and
stored along with the document content. That metadata may be related to facilities the datastore
provides for organizing documents, providing security, or other implementation specific features.
Documents are addressed in the database via a unique key that represents that document.
This key is a simple identifier (or ID), typically a string, a URI, or a path. The key can be used to
retrieve the document from the database. Typically, the database retains an index on the key to
speed up document retrieval, and in some cases the key is required to create or insert the document
into the database.
Entity correspondence between relational/non-relational databases
Document databases contrast strongly with the traditional relational database (RDB).
Relational databases generally store data in separate tables that are defined by the programmer, and
a single object may be spread across several tables. Document databases store all information for
a given object in a single instance in the database, and every stored object can be different from
every other. This makes mapping objects into the database a simple task, normally eliminating
anything similar to an object-relational mapping. This makes document stores attractive for
programming web applications, which are subject to continual change in place, and where speed
of deployment is an important issue.
The first step is to understand how data is modeled in a document database. In a relational
database, data is typically stored flat in a table and it is given structure with primary and foreign
keys. As a simple example, let's consider a relational database for a web site that has users and
credit cards, which are in a one-to-many relationship. (Figure 3)
Figure 3 – SQL one to many relationship example
In a document database, data is stored as keys and values. A Couchbase bucket contains
documents; each document has a unique key and a JSON value. There are no foreign keys (or,
more accurately, there are no foreign key constraints). Here is a basic correspondence between
SQL/Couchbase features and entity naming. (Figure 4)
SQL Server Couchbase Server
Server Cluster
Database Bucket
Row(s) from table(s) Document
Column JSON key/value
Primary Key Document Key
Figure 4 – Naming correspondence between SQL Server and Couchbase
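To make this correspondence concrete, the users and credit cards example from Figure 3 could be stored in Couchbase as one document per user, with the credit cards nested inside it. The document below is illustrative only; its exact field names are assumptions, not taken from the actual mappings:

{
  "type": "usr",
  "id": 123,
  "firstName": "John",
  "creditCards": [
    { "lastDigits": "1234", "expiration": "122020" },
    { "lastDigits": "5678", "expiration": "062021" }
  ]
}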
Related work
DBMS history
NOSQL history
Notable individuals
Existing database synchronization methods
Proposed Solution
Approach
Solution main flow
The input data, which is stored in resources/mappings.json, is used in order to map entities
from source and destination. Input data is loaded into memory when the DBS module is started.
The main flow of the application works based on the following steps (a skeleton of the whole
cycle is sketched after this list):
1. A timestamp is loaded from a file stored on disk named scheduleConfigs.json, which
represents the last time a sync was run on a certain entity.
2. A select is performed in source in order to get all the entities that have the update
timestamp greater than the last run timestamp.
3. A similar select is performed in destination.
4. Data retrieved from source is parsed, and instances of entities that have the insert timestamp
greater than the last run are moved to a different list that keeps new elements from source.
5. Data retrieved from destination is parsed, and instances of entities that have the insert
timestamp greater than the last run are moved to a different list that keeps new elements from
destination.
6. The remaining sets of data are then compared, and the instances that are present in both of
them are filtered out and stored in separate lists (now we have the following lists: new
items at source, new items at destination, items updated only at source, items updated
only at destination, and two other lists which contain the items updated on both sides after
the last run timestamp).
7. The last 2 lists are merged and only the freshest corresponding instances are kept.
8. The items remaining in the source and destination lists are added to the items to update on
the other side.
9. New items at source are added to destination.
10. New items at destination are added to source.
11. Updated items at source are updated at destination.
12. Updated items at destination are updated at source.
13. The file scheduleConfigs.json is updated with last run timestamp = the time when the
process started.
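A minimal Java skeleton of one such cycle for a single mapping could look as follows; the collaborator names (lastRunRepository, sqlDao, cbDao) and the helper methods are illustrative assumptions, not the actual implementation:

// One synchronization cycle for a single mapping, following steps 1-13.
// Entities are handled as generic Map<String, Object> rows, never as typed objects.
void syncMapping(Mapping mapping) {
    long startTs = System.currentTimeMillis();
    Long lastRun = lastRunRepository.load(mapping.mappingName);                         // step 1

    List<Map<String, Object>> srcChanged = sqlDao.selectChangedSince(mapping, lastRun); // step 2
    List<Map<String, Object>> dstChanged = cbDao.selectChangedSince(mapping, lastRun);  // step 3

    List<Map<String, Object>> srcNew = extractInsertedAfter(srcChanged, lastRun);       // step 4
    List<Map<String, Object>> dstNew = extractInsertedAfter(dstChanged, lastRun);       // step 5

    // step 6: split the remaining rows into "updated only here" and "updated on both sides"
    Conflicts conflicts = splitByPresenceOnBothSides(srcChanged, dstChanged);
    conflicts.keepFreshestCopies();                                                     // step 7

    cbDao.insert(mapping, srcNew);                                                      // step 9
    sqlDao.insert(mapping, dstNew);                                                     // step 10
    cbDao.update(mapping, conflicts.toWriteAtDestination());                            // steps 8, 11
    sqlDao.update(mapping, conflicts.toWriteAtSource());                                // steps 8, 12

    lastRunRepository.save(mapping.mappingName, startTs);                               // step 13
}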
Structure of the input data
Based on the fact that there is no perfect correspondence between Couchbase and SQL Server
entities, there is no way to match them perfectly; for example, a Couchbase bucket is similar to both
a SQL Server database and a table. As a solution, SQL Server tables will be matched with documents in
Couchbase (which have JSON format) and they must include an attribute called “type” which will
identify the matching SQL Server table. For each document type we will have a so-called “mapping” in
the document “mappings.json”.
Here is an example of a one-to-one mapping:
{"mappingName":"Users","mappingType":"one -to-one",
"source": {"server":"MSSQL","database":"#DBName",
"additionalSyncCondition":" and not exists (select 1 from ExternalSystemReference esr (nolock) where esr.ext ernalsystemid=42 and esr.externalsystemreferencetypeid=68 and esr.fkey =
u.userid )"
,"additionalConditionIsUsedInInitialImport":false,
"tables":[ {"table":"Users", "alias":"u","editable":true,"joinType":"inner", "mainTable":true, "insertable":false, "pk":"userId" },
{"table":"LANGUAGE","joinType":"left","alias":"l", "joinColumn":"languageId", "insertable":false},
{"table":"UserConfiguration", "joinType":"left","editable":true, "insertable":true, "alias":"uc","pk":"userId", "safeInsertFi eld":"userId", "joinColumn":"userId","fk":"userId","fkTable":"Users"
,"additionalJoinConditions":"userconfigurationtypeid=4"}
] , "columnForDestinationId": "ngDocId", "timestampColumn":"u.updateTimestamp", "insertTimeColumn": "insertTimeStamp", "pkC olumn":"u.userId"},
"destination": {"server":"CB","database":"users","tables":[{"table":"usr"}], "timestampColumn":"updTime", "insertTimeColumn": "insTime","typeColumnName":"type", "pkColumn":"id"},
"columnMappings": [
{"srcColumn":"userId","destCo lumn":"id" , "srcAlias":"u"},
{"srcColumn":"userId","destColumn":"ccm_user_id" ,"destType":"long", "srcAlias":"u"},
{"srcColumn":"firstName","destColumn":"firstName", "srcAlias":"u"},
{"srcColumn":"lastName","destColumn":"lastName", "srcAlias": "u"},
{"srcColumn":"titleId","destColumn":"titleId","destType":"long", "srcAlias":"u"},
{"srcColumn":"nationalityId","destColumn":"nationalityId","destType":"long", "srcAlias":"u"},
{"srcColumn":"languageId","destColumn":"languageId","valueMapp ings":{"25":"25","31":"31","34":"34","82":"82","90":"90"},"destType":"long", "defaultValue":"25", "insertOnly":true,
"srcAlias":"u"},
{"srcColumn":"countryId","destColumn":"userCountryId","destFormat":"cou_$$$", "srcAlias":"u"},
{"srcColumn":"secon daryEmail","destColumn":"secondaryEmail", "srcAlias":"u"},
{"srcColumn":"address1","destColumn":"address1", "srcAlias":"u"},
{"srcColumn":"address2","destColumn":"address2", "srcAlias":"u"},
18
{"srcColumn":"CompanyId","destColumn":"companyId","de stType":"long", "srcAlias":"u"},
{"srcColumn":"email","destColumn":"primaryEmail", "srcAlias":"u"},
{"srcColumn":"emailconfirmation","destColumn":"emailNotification","destType":"boolean", "srcAlias":"u"},
{"srcColumn":"PHONEMOBILE","destColumn" :"mobileNumber", "srcAlias":"u"},
{"srcColumn":"PHONE","destColumn":"phoneNumber", "srcAlias":"u"},
{"srcColumn":"SMSCONFIRMATION","destColumn":"smsNotifications","destType":"boolean", "srcAlias":"u"},
{"srcColumn":"POSTCODE","destColumn":"zipc ode", "srcAlias":"u"},
{"srcColumn":"LOCATIONSEARCHUNIT","destColumn":"measurementUnit","valueMappings":{"1":"mi","0":"km"}, "defaultValue":"km", "s rcAlias":"u"},
{"srcColumn":"username","destColumn":"userName", "srcAlias":"u"},
{"srcColumn":"i sActive","destColumn":"isActive", "destType":"long","srcType":"boolean", "srcAlias":"u"},
{"srcColumn":"updateTimeStamp","destColumn":"updTime", "srcType":"datetime", "destType":"long", "srcAlias":"u"},
{"srcColumn":"insertTimeStamp","destColumn":" insTime", "srcType":"datetime", "destType":"long", "srcAlias":"u"},
{"srcColumn":"updateBy","destColumn":"modifiedBy", "srcAlias":"u"},
{"srcColumn":"insertBy","destColumn":"createdBy", "srcAlias":"u"},
{"srcColumn":"usergroupid","destColumn":"userGroupId","destType":"long", "srcAlias":"u"},
{"srcColumn":"DefaultTravellerTypeId","destColumn":"defaultTravelerTypeId","destType":"long", "srcAlias":"u"},
{"srcColumn":"HideInfoBox","destColumn":"hi deInfoBox", "srcType":"boolean", "destType":"boolean", "srcAlias":"u"},
{"srcColumn":"EnableMapAutoRefresh","destColumn":"enableMapAutoRefresh", "srcType":"boolean", "destType":"boolean", "srcAlias ":"u"},
{"srcColumn":"NUMBEROFHOTELLISTITEMS","dest Column":"numberOfHotelListItems", "destType":"long", "srcAlias":"u"},
{"srcColumn":"IsEnabled","destColumn":"syncOutlook", "srcType":"boolean", "destType":"boolean", "srcAlias":"uc"},
{"srcColumn":"UserConfigurationTypeId","value":"4", "srcAlias":" uc"},
{"srcColumn":"abbreviation","destColumn":"languageCode","valueMappings":{"en":"en","fi":"fi","fr":"fr","sv":"sv","de":"de","e s":"es","it":"it","nl":"nl","pt":"pt"}, "defaultValue":"en",
"insertOnly":true, "srcAlias":"l"}
]
}
The mapping has the following attributes:
• mappingName: It is used to identify a mapping
• mappingType: It describes the type of the mapping and it can be one-to-one or one-to-many
• source: It corresponds to a DataSource object and it contains info about the source data
configs
• destination: It corresponds to a DataSource object and it contains info about the destination
data configs
• columnMappings: It corresponds to a list of ColumnMapping objects and contains information
about the correspondence between fields and their specific attributes
A sketch of how these attributes can be represented as Java objects follows.
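Since the mappings are loaded into memory at startup, they can be deserialized into plain Java objects. This is a minimal sketch using Jackson, assuming the file holds an array of mapping objects and showing only the attributes described above; the class layout is an assumption, not the actual implementation:

import com.fasterxml.jackson.annotation.JsonIgnoreProperties;
import java.util.List;
import java.util.Map;

// Top-level entry of mappings.json, restricted to the attributes described above.
@JsonIgnoreProperties(ignoreUnknown = true)
class Mapping {
    public String mappingName;                 // identifies the mapping
    public String mappingType;                 // "one-to-one" or "one-to-many"
    public DataSource source;                  // source database configs
    public DataSource destination;             // destination database configs
    public List<ColumnMapping> columnMappings; // field correspondences
}

@JsonIgnoreProperties(ignoreUnknown = true)
class DataSource {
    public String server;                      // "MSSQL" or "CB", informative only
    public String database;
    public List<Map<String, Object>> tables;   // Table attributes, kept generic here
    public String timestampColumn;
    public String insertTimeColumn;
    public String pkColumn;
}

@JsonIgnoreProperties(ignoreUnknown = true)
class ColumnMapping {
    public String srcColumn;
    public String destColumn;
    public String srcAlias;
    public String destType;                    // absent means string
    public Map<String, String> valueMappings;
    public String defaultValue;
    public String destFormat;
}

// Loading at startup, e.g.:
// Mapping[] mappings = new ObjectMapper().readValue(new File("resources/mappings.json"), Mapping[].class);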
DataSource attributes:
• server: it is the name of the database technology and it has no further use; it is just for the
user's ease of finding his way around when “source” and “destination” are too ambiguous
names
• database: for source it has no use, and for destination it represents the bucket setting that
should be used. Here is an example of yml config data for destination:
couchbase:
  host: localhost
  buckets:
    users:
      name: Users
      password: somepass
    destination:
      name: Destination
      password: somepass
  connection:
    timeout: 30
The bucket keys from this config (users and destination above) are the values that should be used as database.
• tables: It corresponds to a list of Table objects. For destination it only contains one item
with only one attribute, which will represent the type of the document. For source it is
possible to have any number of tables (at least one). The first table in the list will be considered
the main table if no other table has the mainTable attribute.
• columnForDestinationId: it is a column name in source, in the main table, and represents
the id of the corresponding destination entity. It is used only when the ids are different in
source and destination (for example, in SQL it is an auto-incremented int and in Couchbase it
is a UUID).
• timestampColumn: represents the field name where information about the entity update time
is stored. For destination it is simply the column name, while in source it should have the
form tableAlias.ColumnName.
• pkColumn: the identifier field name of the whole entity
• typeColumnName: It is used only for destination and it represents the name of the field
that holds the type. Its value is specified in the previously mentioned list of tables.
• insertTimeColumn: represents the field name where information about the entity creation
time is stored.
• includeTypeInId: tells if an id should have the document type and an underscore as prefix.
It is used only for destination and only when ids from destination and source
are the same. If it is missing, it is implicitly set to false. (Source id = 123456, destination id
= ucc_123456)
• nestedColumn: used only for one-to-many mappings and for destination. In the case of
one-to-many mappings, the entity in destination is a nested document and this field represents
the name of the highest level field in the nesting hierarchy. Here is an example of a nested
document (the nestedColumn here is creditCards):
{
  "type": "ucc",
  "updTime": null,
  "creditCards": [
    {
      "owner": "ss",
      "zipCode": null,
      "address": null,
      "insTime": 1439546160657,
      "creditCardTypeId": "93",
      "stateId": null,
      "description": "Coutts Visa Corporate",
      "cityId": null,
      "uuid": "BA0EC805-3E9C-4B50-9DCA-690E5F0D0A53",
      "countryId": null,
      "createdBy": "usercc",
      "updTime": null,
      "lastDigits": "1234",
      "expiration": "122020",
      "modifiedBy": "null, null / null_null"
    }
  ]
}
• additionalSyncCondition: used only for source; it is a way of customizing the sync by
manually adding an extra where condition to the select that fetches changed data (as in step 2
from chapter 1.2)
• additionalConditionIsUsedInInitialImport: a boolean that represents whether the
additionalSyncCondition should be used in the initial import (as in chapter 1.1, second paragraph)
Table attributes:
• table: it is the name of the table for SQL and the type of the document in destination
NOTE: the rest of the attributes of Table are only for source.
• alias: an alias used for the table in the SQL query. It should be unique for every table.
• joinType: the type of join to be used in the SQL query. It is implicitly set to inner.
• editable: a boolean that tells if updates can be performed in a table. It is implicitly set
to false.
• additionalJoinConditions: a customizable way to add more conditions to a join.
• mainTable: it tells which one is the main table. It should only be present on one table.
That table will be the first one in SQL queries and joins will be performed starting from it.
• joinColumn: it is used as the main condition of the join between the table that has it and the
main table.
• insertable: a boolean that tells if inserts can be performed in a table. It is implicitly set
to false.
• deleteType: tells what kind of delete is performed for a certain entity. It is implicitly set to
hard, meaning that the entity is removed completely; otherwise a boolean field is set to a
value that marks the entity as inactive. If it is set to soft, the softDeleteColumn attribute
must also be specified.
• softDeleteColumn: it represents the column name that tells if an entity is active or not in
the case of the soft deleteType.
• safeInsertField: a source column name whose value is checked to exist before any operation.
• fk: when a row in SQL has a foreign key constraint present in other tables, it must be
specified with this attribute in order to be checked before performing operations. For
example, when inserting a User entity we must insert first a row in Users, then, after the pk
of Users is generated, insert in UserConfiguration with User's pk as fk.
• pk: the pk of that table
• fkTable: in case fk is specified, fkTable must also be specified in order to know in
which table the fk is
ColumnMapping attributes:
• srcColumn: the name of the column in source
• destColumn: the name of the field in destination
• destType: the data type of the destination field (boolean, long, etc.). If it is string, this should not
be specified.
• srcType: the data type of the source field (boolean, datetime, etc.). If it is string, this should
not be specified.
• srcAlias: the alias of the table that contains srcColumn. It must be one of the aliases from tables.
• value: used only when a mapping is missing destColumn and the srcColumn has a fixed
value. For example: {"srcColumn":"UserConfigurationTypeId","value":"4",
"srcAlias":"uc"}. Here we consider only UserConfiguration with type 4.
• destFormat: specifies a custom formatting of the destination value. The custom format
is a string that must contain “$$$”, which will be replaced with the actual value. For
example:
{"srcColumn":"countryId","destColumn":"userCountryId","destFormat":"cou_$$$",
"srcAlias":"u"}. Here userCountryId in destination will be something like “cou_123”.
• parent: used only for one-to-many mappings; shows if the mapped destColumn is part of
the lower level of nesting or not. In the following example description,
loyaltyCardTypeId, uuid and cardNumber have as parent "loyaltyCards", but type and
updTime have no parent:
{
  "loyaltyCards": [
    {
      "description": "Accor Hotels Card",
      "loyaltyCardTypeId": 15,
      "uuid": "941D9CCF-CD62-4308-A680-A4B40FD08A2B",
      "cardNumber": "30123456136193736"
    }
  ],
  "type": "ulc",
  "updTime": 13144234234223
}
• nestedPkColumn: a boolean that marks the column used as the identifier of every nested element
• destColumnAlias: when some columns have the same destColumn name, the destColumn
name is changed to something unique and destColumnAlias will hold the initial value, which
is the value that will be stored in destination.
• nestedTimestamp: each nested element has a specific update timestamp, which is
held by this attribute.
• nestedInsertTimestamp: each nested element has a specific insert timestamp, which
is held by this attribute.
• srcIsNullColumnInSelect: this attribute tells what value to take from source when
srcColumn is null (does not have any value)
• valueMappings: value mappings are correspondences between values in source and
destination. For example:
{"srcColumn":"LOCATIONSEARCHUNIT","destColumn":"measurementUnit","valueMappings":{"1":"mi","0":"km"}, "defaultValue":"km", "srcAlias":"u"}. In this case, when 1
is found in source, “mi” will appear in destination, and the same for the tuple (0,”km”).
• defaultValue: when the value found in source or destination is not found in any
valueMappings tuple, the default value is used. For the example above, when there is -1 in
source, “km” will be placed in destination, and when something like “m” is found in
destination, 0 will be placed in source.
• destExcluded: DEPRECATED; it should not be used in new mappings and should be removed
from old ones. Use insertOnly instead.
• insertOnly: a boolean that tells if a field should be synchronized only when the item is
considered new and an insert operation is performed. For any other operation, it
is excluded and keeps its old value.
Configurations
Last run configuration
The file that contains the last run timestamp of every mapping is called
scheduleConfigs.json.
It is stored on the disk of every machine that runs an instance of DBS. The path of the file
is expected to be available in the Spring Environment using the property
“sync.last-run-path”.
A template of this document is available in the resources. When the initial import is
performed, the timestamp should be set to null.
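The template itself is not reproduced in this thesis; a plausible shape, assuming one entry per mapping name, would be the following (illustrative only, with null meaning that the initial import has not yet run for that mapping):

{
  "Users": 1530000000000,
  "UserCreditCards": null
}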
Couchbase connection
The Couchbase connection is done using Spring and the Couchbase SDK. The connection is opened and
managed by the bean CouchbaseConfiguration.java. At the initialization of the bean, the connection
to the cluster is done first using the address kept in the Spring Environment under
"couchbase.host". Every bucket is then opened using lazy initialization and kept open for subsequent
connections; hence, once a bucket is opened, it remains open while the instance of DBS is running.
The bucket opening and retrieval is done by the method getBucketByName, which takes the bucket
name as argument, opens the bucket if it was not open yet and returns it. There must also be
present a property with the path couchbase.connection.timeout that represents the number of
seconds before connection timeout.
Each bucket is closed when the application is being shut down.
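A minimal sketch of such a lazily initialized bucket cache, assuming the Couchbase Java SDK 2.x API (the surrounding class structure is illustrative, not the actual CouchbaseConfiguration bean):

import com.couchbase.client.java.Bucket;
import com.couchbase.client.java.Cluster;
import com.couchbase.client.java.CouchbaseCluster;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Opens each bucket at most once and keeps it open for the lifetime of the application.
class CouchbaseBuckets {
    private final Cluster cluster = CouchbaseCluster.create("localhost"); // from couchbase.host
    private final Map<String, Bucket> openBuckets = new ConcurrentHashMap<>();

    Bucket getBucketByName(String name, String password) {
        // computeIfAbsent gives the lazy initialization: the bucket is opened
        // on the first request and the cached instance is returned afterwards
        return openBuckets.computeIfAbsent(name, n -> cluster.openBucket(n, password));
    }

    void shutdown() {
        // called when the application is shut down
        openBuckets.values().forEach(Bucket::close);
        cluster.disconnect();
    }
}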
SQL connection
The SQL connection is managed by Spring and the following structure must be present as properties:
spring:
  datasource:
    driver-class-name: com.microsoft.sqlserver.jdbc.SQLServerDriver
    initial-size: 5
    max-active: 50
    password: sbora123
    test-on-borrow: true
    test-while-idle: true
    url: jdbc:sqlserver://localhost;databaseName=Licenta;autoReconnect=true
    username: dbsync
    validation-query: SELECT 1
The host (localhost) and the database (Licenta) are specified within the connection string.
Scheduled task
The scheduled task cron expression is a property found at the path sync.schedule. It
represents the frequency with which the full sync of all mappings runs.
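A minimal sketch of how such a property-driven schedule can be wired with Spring (the class name is assumed, and @EnableScheduling must be active in the application configuration):

import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

// Triggers a full sync of all mappings at the frequency configured
// in the sync.schedule property, e.g. "0 0/10 * * * *" for every 10 minutes.
@Component
class SyncScheduler {

    @Scheduled(cron = "${sync.schedule}")
    public void runFullSync() {
        // iterate over all loaded mappings and run one sync cycle for each
    }
}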
Database Operations
SQL operations
SELECT
The select is performed in order to fetch the entities that were modified after the last run
timestamp. It is a join of all tables specified in the mapping, starting from the main table. The
result is never mapped to objects but stored in Java collections like Map and List in order to keep
the implementation generic. Here is an example of a select:
SELECT u.userid,
       u.userid,                                  ← tableAlias.srcColumn names
       u.firstname,
       u.insertby,
       u.usergroupid,
       u.defaulttravellertypeid,
       u.hideinfobox,
       u.enablemapautorefresh,
       u.numberofhotellistitems,
       uc.isenabled,
       l.abbreviation
FROM   users AS u WITH (nolock)                   ← main table
       LEFT JOIN language AS l WITH (nolock)
              ON u.languageid = l.languageid      ← other tables
       LEFT JOIN userconfiguration AS uc WITH (nolock)
              ON u.userid = uc.userid
                 AND userconfigurationtypeid = 4
WHERE  1 = 1
       AND NOT EXISTS (SELECT 1
                       FROM   externalsystemreference esr (nolock)  ← additional conditions
                       WHERE  esr.externalsystemid = 42
                              AND esr.externalsystemreferencetypeid = 68
                              AND esr.fkey = u.userid)
       AND u.updatetimestamp > :from              ← last run timestamp condition
INSERT
The insert is performed only on the tables that have insertable set to true. When
safeInsertField is present, a select is performed first, and if the entity exists, an update is
performed instead of an insert. The insert operation is performed in batches and a query is generated
for every table.
UPDATE
The update is performed similarly to the insert, but only on editable tables. A check is also
performed before each update to be sure that the entity that must be updated already exists. If it
does not exist, it is created using the INSERT operation. Updates also happen in batches.
DELETE
The delete operation has 2 flows, depending on the type of the delete, soft or hard. In the case of
the hard delete, the entities are removed using a delete query with batches. In the case of the soft
delete operation, an update is performed on the softDelete field and its value is negated. The soft delete is
also performed in batches. Both flows are safe, meaning that an “IF EXISTS” check is performed first.
The delete operation is performed only on tables that allow it.
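As an illustration of the batched writes described above, here is a minimal sketch using Spring's NamedParameterJdbcTemplate; the class and the query are illustrative assumptions, since the real statements are generated per table from the mapping:

import org.springframework.jdbc.core.namedparam.MapSqlParameterSource;
import org.springframework.jdbc.core.namedparam.NamedParameterJdbcTemplate;
import org.springframework.jdbc.core.namedparam.SqlParameterSource;
import java.util.List;

// Executes one parameterized statement for a whole batch of rows at once.
class SqlBatchWriter {
    private final NamedParameterJdbcTemplate jdbc;

    SqlBatchWriter(NamedParameterJdbcTemplate jdbc) {
        this.jdbc = jdbc;
    }

    void updateBatch(List<MapSqlParameterSource> rows) {
        // hypothetical statement; one named parameter per mapped column
        String sql = "update Users set firstName = :firstName where userId = :userId";
        jdbc.batchUpdate(sql, rows.toArray(new SqlParameterSource[0]));
    }
}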
Couchbase operations
SELECT
The logic behind the select is the same as in SQL, but instead of tables, we fetch based on
document type. Here is an example:
SELECT meta().id,
       loyaltyCards.loyaltyCardTypeId,
       loyaltyCards.cardNumber,
       loyaltyCards.uuid,
       t.updTime updTimeGlobal,
       loyaltyCards.description
FROM   Users t
       LEFT UNNEST t.loyaltyCards
WHERE  t.type = 'ulc'
       AND t.updTime > $from
In this example, the document is nested, so we use UNNEST to make it flat. t.type is the
place where the type of the document is specified. The other condition is made on the timestamp
of the document using $from, which is the last run timestamp.
INSERT
The insert is performed by first generating a JSON document and using the Couchbase SDK
interface for adding documents. It is done asynchronously and it has a retry system which consists
of 3 retries for every document after it fails to insert. The operation is actually an upsert, in order
to be safe in case the document already exists for some reason.
UPDATE
The update is performed differently based on the mapping type. The one-to-one mapping benefits
from an update query which updates the fields present in the column mappings. The updates are made
one by one.
For one-to-many we perform a fetch before the update, and the whole document is
recreated as JSON before an upsert is performed on it.
DELETE
The delete is performed based on the type of the delete, hard or soft. It is not performed using
a query; instead, the documents are recreated and an upsert is performed.
Genericity
Data freshness comparison
Task scheduling
Relevant technologies
Core technologies
Java 8
SpringBoot
Spring RESTful webservice
Database technologies
Microsoft SQL Server 2008
JDBC
Couchbase Server
Couchbase Server® Java SDK
Scheduling relies on Spring cron expressions. A Spring cron expression has six space-separated fields: second, minute, hour, day of month, month and day of week. For example, “0 0/10 * * * *” fires at second 0 of every 10th minute, which keeps the delay between databases within the required 10 minutes.
Running options
Result
Run examples
Performance
Limitations
Future development
Code improvements
Functionality improvements
New functionalities
Conclusions
Bibliography
[1] Warden, P., Big Data Glossary, O'Reilly Media, September 2011
[2] Marr, B., Big Data: 20 Mind-Boggling Facts, https://www.forbes.com/sites/bernardmarr/
[3] Project Management Institute, Pulse of the Profession, https://www.pmi.org/-/media/pmi/documents/public/pdf/learning/thought-leadership/pulse/pulse-of-the-profession-2017.pdf
[4] Harris, D., "Couchbase goes 2.0, pushes SQL for NoSQL", GigaOm, July 2011
[5] Li, Y., Manoharan, S., A Performance Comparison of SQL and NoSQL Databases, Communications, Computers and Signal Processing (PACRIM), August 2013
[6] Korth, H.F., Silberschatz, A., Database System Concepts, McGraw-Hill Book Company, 1986
[7] Codd, E.F., The Relational Model for Database Management, Addison-Wesley, 1990, p. 371-388
[8] Date, C.J., Darwen, H., Databases, Types and the Relational Model: The Third Manifesto, Addison-Wesley, p. 21-25
[9] Chaterjee, J., Introduction to RDBMS, OODBMS and ORDBMS, January 3, 2005, http://www.aspfree.com/c/a/Database/Introduction-to-RDBMS-OODBMS-and-ORDBMS/
[10] DB-Engines, https://db-engines.com/en/ranking_categories