2014-09-29

Course Code: MCS-043
Course Title: Advanced Database Management Systems
Assignment Number: MCA(4)/041/Assign/2014-15
Maximum Marks: 100
Weightage: 25%
Last Dates for Submission: 15th October, 2014 (For July 2014 Session)
                           15th April, 2015 (For January 2015 Session)

This assignment has five questions carrying 80 marks. Answer all questions. The remaining 20 marks are for viva voce. You may use illustrations and diagrams to enhance the explanations. Please go through the guidelines regarding assignments given in the Programme Guide.

Question: 1
List and explain the two integrity rules of relational model with the help of a suitable example for each.

Solution:-

INTEGRITY RULES OF RELATIONAL MODEL:
> Integrity rules are needed to inform the DBMS about certain constraints in the real world.

> Specific integrity rules apply to one specific database.

Example: part weights must be greater than zero.

> General integrity rules apply to all databases.

Two general rules will be discussed: one dealing with primary keys and one with foreign keys.

PRIMARY KEYS
> Primary key is a unique identifier for a relation.

> There may be several candidate keys, each of which must satisfy two properties:
1. uniqueness

2. minimality
> From the set of candidate keys, one is chosen to be the primary key. The others become alternate keys.

EXAMPLE: The relation R(ID, SSN, License_Number, NAME) has several candidate keys: ID, SSN and License_Number.

If we select ID to be the primary key, then the other candidate keys become alternate keys.

1. Entity Integrity
Entity Integrity is the mechanism the system provides to maintain primary keys. The primary key serves as a unique identifier for rows in the table. Entity Integrity ensures two properties for primary keys:

The primary key for a row is unique; it does not match the primary key of any other row in the table.

The primary key is not null; no component of the primary key may be set to null.

The uniqueness property ensures that the primary key of each row uniquely identifies it; there are no duplicates. The second property ensures that the primary key has meaning, has a value; no component of the key is missing. The system enforces Entity Integrity by not allowing operations (INSERT, UPDATE) to produce an invalid primary key. Any operation that creates a duplicate primary key or one containing nulls is rejected.
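As a minimal sketch (the table and data are hypothetical), the following SQL shows how a DBMS enforces Entity Integrity through the PRIMARY KEY declaration:

CREATE TABLE Student (
    ID   INT NOT NULL PRIMARY KEY,   -- unique, non-null identifier
    Name VARCHAR(30)
);

INSERT INTO Student VALUES (101, 'Anita');   -- accepted
INSERT INTO Student VALUES (101, 'Ravi');    -- rejected: duplicate primary key
INSERT INTO Student VALUES (NULL, 'Sunil');  -- rejected: null primary key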

2. Referential Integrity

Referential Integrity is the mechanism the system provides to maintain foreign keys. The definition of a foreign key must specify the table whose primary key is being referenced. Referential Integrity ensures only one property for foreign keys:

A valid foreign key value must always reference an existing primary key or contain a null

A foreign key may contain a null; it does not need to reference an existing primary key (actually, it can’t reference any primary key, since primary keys cannot contain nulls). In FirstSQL, foreign keys may optionally be defined as not allowing nulls. To accomplish this with other DBMSs, each component of the foreign key must be separately defined as not allowing nulls.

While the Referential Integrity property looks simpler than those for Entity Integrity, the consequences are more complex since both primary and foreign keys are involved. The rule for foreign keys is:

No operation (INSERT, UPDATE) can create a non-null foreign key unless a corresponding primary key exists.

Any operation that produces a non-null foreign key value without a matching primary key value is rejected. Primary keys are also constrained by Referential Integrity:

No operation (UPDATE, DELETE) can remove or change a primary key while referencing foreign keys exist.

The Referential Integrity rule for primary keys can be enforced in several ways; FirstSQL supports a complete set of options: Restrict, Null, and Cascade.
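As an illustration (hypothetical tables; the exact spelling of the referential actions varies slightly between DBMSs), both rules can be declared and exercised in SQL as follows:

CREATE TABLE Department (
    DeptID   INT NOT NULL PRIMARY KEY,
    DeptName VARCHAR(30)
);

CREATE TABLE Employee (
    EmpID  INT NOT NULL PRIMARY KEY,
    Name   VARCHAR(30),
    DeptID INT,                                      -- foreign key; may be null
    FOREIGN KEY (DeptID) REFERENCES Department (DeptID)
        ON DELETE SET NULL                           -- the "Null" option for deletes
);

INSERT INTO Department VALUES (10, 'Sales');
INSERT INTO Employee VALUES (1, 'Ravi', 10);   -- accepted: Department 10 exists
INSERT INTO Employee VALUES (2, 'Asha', 99);   -- rejected: no Department 99
DELETE FROM Department WHERE DeptID = 10;      -- Employee 1's DeptID is set to NULL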

Question: 2
Define Hash join and explain the process and cost calculation of Hash join with the help of an example.

Solution: -

HASH joins are the usual choice of the Oracle optimizer when the memory is set up to accommodate them. In a HASH join, Oracle accesses one table (usually the smaller of the joined results) and builds a hash table on the join key in memory. It then scans the other table in the join (usually the larger one) and probes the hash table for matches to it. Oracle uses a HASH join efficiently only if the parameter PGA_AGGREGATE_TARGET is set to a large enough value. If MEMORY_TARGET is used, the PGA_AGGREGATE_TARGET is included in the MEMORY_TARGET, but you may still want to set a minimum.

If you set the SGA_TARGET, you must set the PGA_AGGREGATE_TARGET as the SGA_TARGET does not include the PGA (unless you use MEMORY_TARGET as just described). The HASH join is similar to a NESTED LOOPS join in the sense that there is a nested loop that occurs—Oracle first builds a hash table to facilitate the operation and then loops through the hash table. When using an ORDERED hint, the first table in the FROM clause is the table used to build the hash table.

HASH joins can be effective when the lack of a useful index renders NESTED LOOPS joins inefficient. The HASH join might be faster than a SORT-MERGE join in this case because neither row source needs to be sorted, and it could possibly be faster than a NESTED LOOPS join because probing a hash table in memory can be faster than traversing a b-tree index.

As with SORT-MERGE joins and CLUSTER joins, HASH joins work only on equijoins. As with SORT-MERGE joins, HASH joins use memory resources and can drive up I/O in the temporary tablespace if the sort memory is not sufficient (which can cause this join method to be extremely slow).

Finally, HASH joins are available only when cost-based optimization is used (which should be 100 percent of the time for your application running on Oracle 11g).

The example below illustrates a query for which the optimizer would typically choose a HASH join.
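A minimal sketch (hypothetical emp and dept tables; the USE_HASH hint explicitly requests the method):

SELECT /*+ USE_HASH(e d) */ e.ename, d.dname
  FROM dept d, emp e                   -- dept, the smaller input, builds the hash table
 WHERE e.deptno = d.deptno;            -- equijoin condition is required

-- Rough cost picture: each table is scanned once, so if dept occupies b_d blocks
-- and emp occupies b_e blocks and the hash table on dept fits in memory, the join
-- costs roughly b_d + b_e block reads (for example 100 + 5,000 = 5,100 reads).
-- If the hash table does not fit, both inputs are first partitioned to the
-- temporary tablespace and re-read, roughly tripling the I/O.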


Although they are relatively easy to implement, hashing strategies are an important aspect of hash joins. In the literature, several modifications of a standard hashing algorithm have been proposed, giving a variety of methods with different time and space requirements. The general problems with hashing are non-uniform key value distributions and dynamic reorganization if the number of records in a hash table increases or decreases significantly. Most of the proposed strategies assume that the hash method is used to create a primary access path on a disk file. For hash joins, neither is the hash key unique nor are the records stored in disk blocks. While it is a disadvantage to have duplicate key values, having all data in main memory may help make the implementation efficient.

Question: 3

i) List the features of a semantic database
ii) Explain clustering in data mining
iii) Explain the characteristics of a mobile database. Also give an application of a mobile database.
iv) How is an audit trail maintained in a database? How is it related to database security?

Solution:-

1. List the features of a semantic database

A Relational Model uses SQL data and query syntax; a primary key value is searched with SQL. A Hierarchical Model uses XML data and a unique attribute-key search with XPath. A Graph Model uses RDF/XML data and an RDFS/OWL semantic search with SPARQL.

The data is organized in such a way that it can be interpreted meaningfully without human intervention. Semantic data has a history dating back to the 1970s and is currently used in a wide variety of data management systems and applications.

Data is organized based on binary models of objects, usually in groups of three parts: two objects and their relationship. For example, to represent a cup sitting on a table, the data might be organized as: CUP SITTING-ON TABLE. The objects (cup and table) are interpreted with regard to their relationship (sitting on). The data is organized linearly, telling the software that since CUP comes first, it is the object that acts; in other words, the position of the word tells the software that the cup is on the table, not that the table is sitting on the cup. Databases designed around this concept have greater applicability and are more easily integrated into other databases.

2. Explain clustering in data mining

Two common data mining techniques for finding hidden patterns in data are clustering and classification analyses. Although classification and clustering are often mentioned in the same breath, they are different analytical approaches. This section describes the similarities and differences between these related, but distinct, approaches.

Imagine a database of customer records, where each record represents a customer’s attributes. These can include identifiers such as name and address, demographic information such as gender and age, and financial attributes such as income and amount spent. Clustering is an automated process to group related records together. Related records are grouped together on the basis of having similar values for attributes. This approach of segmenting the database via clustering analysis is often used as an exploratory technique because it is not necessary for the end-user/analyst to specify ahead of time how records should be related. In fact, the objective of the analysis is often to discover segments or clusters, and then examine the attributes and values that define the clusters or segments. As such, interesting and surprising ways of grouping customers together can become apparent, and this in turn can be used to drive marketing and promotion strategies to target specific types of customers.

There are a variety of algorithms used for clustering, but they all share the property of iteratively assigning records to a cluster, calculating a measure (usually similarity, and/or distinctiveness), and re-assigning records to clusters until the calculated measures don’t change much indicating that the process has converged to stable segments. Records within a cluster are more similar to each other, and more different from records that are in other clusters. Depending on the particular implementation, there are a variety of measures of similarity that are used (e.g. based on spatial distance, based on statistical variability, or even adaptations of Condorcet values used in voting schemes), but the overall goal is for the approach to converge to groups of related records.
Classification is a different technique than clustering. Classification is similar to clustering in that it also segments customer records into distinct segments called classes. But unlike clustering, a classification analysis requires that the end-user/analyst know ahead of time how classes are defined. For example, classes can be defined to represent the likelihood that a customer defaults on a loan (Yes/No). It is necessary that each record in the dataset used to build the classifier already have a value for the attribute used to define classes. Because each record has a value for the attribute used to define the classes, and because the end-user decides on the attribute to use, classification is much less exploratory than clustering. The objective of a classifier is not to explore the data to discover interesting segments, but rather to decide how new records should be classified — i.e. is this new customer likely to default on the loan?

Classification routines in data mining also use a variety of algorithms, and the particular algorithm used can affect the way records are classified. A common approach for classifiers is to use decision trees to partition and segment records. New records can be classified by traversing the tree from the root through branches and nodes to a leaf representing a class. The path a record takes through a decision tree can then be represented as a rule, for example: “If Income < $30,000 and Age < 25 and Debt = High, then Default = Yes.” However, the sequential way a decision tree splits records (the most discriminative attribute-values, e.g. Income, appear early in the tree) can make the tree overly sensitive to the initial splits. Therefore, in evaluating the goodness of fit of a tree, it is important to examine the error rate for each leaf node (the proportion of records incorrectly classified). A nice property of decision tree classifiers is that, because paths can be expressed as rules, measures for evaluating the usefulness of rules, such as Support, Confidence and Lift, can also be used to evaluate the usefulness of the tree.
To conclude, although clustering and classification are often used for segmenting data records, they have different objectives and achieve their segmentation in different ways. Knowing which approach to use is important for decision-making.

3. Characteristics of mobile database

A mobile database is a database that can be connected to by a mobile computing device over a wireless mobile network. Mobile databases:
1. are physically separate from the central database server;
2. reside on mobile devices;
3. are capable of communicating with a central database server or with other mobile clients from remote sites;
4. can handle local queries without connectivity.

Application of mobile database
As already noted, the mobile 3D database presented here is built on the code base of the 3D server database system. PSEPro® is used as the underlying mobile object-oriented database system. PSEPro is capable of running on constrained devices (PDAs) and provides a single-user, fully featured database system. Therefore we tested our mobile database also on a PDA.
The mobile database includes the core 3D geometry (0D, 1D, 2D, 3D simplexes) and the object model (simplicial complexes). It is extended to store further version management information for the whole database and each local object. Therefore, each local database object knows its associated design object, version and workspace in the server database. Furthermore, the internal index structure of the objects is modified for better build-up times. To keep the build-up time as small as possible, the mobile database uses an octree implementation as its spatial index instead of the R* tree used in the server database. This way the build-up time is reduced by a factor of 5, at the cost of longer search times. As the most complex geometric operations (e.g. 3D intersection computations) are heavily dependent on the efficiency of the internal index, they are computed on the server.

4. AUDIT TRAILS
Audit trails maintain a record of system activity both by system and application processes and by user activity of systems and applications.  In conjunction with appropriate tools and procedures, audit trails can assist in detecting security violations, performance problems, and flaws in applications.  This bulletin focuses on audit trails as a technical control and discusses the benefits and objectives of audit trails, the types of audit trails, and some common implementation issues.

An audit trail is a series of records of computer events, about an operating system, an application, or user activities. A computer system may have several audit trails, each devoted to a particular type of activity. Auditing is a review and analysis of management, operational, and technical controls. The auditor can obtain valuable information about activity on a computer system from the audit trail. Audit trails improve the auditability of the computer system.

Audit trails may be used as a support for regular system operations, as a kind of insurance policy, or as both. As insurance, audit trails are maintained but are not used unless needed, such as after a system outage. As a support for operations, audit trails are used to help system administrators ensure that the system or resources have not been harmed by hackers, insiders, or technical problems.

Audit trails are a technical mechanism that help managers maintain individual accountability.  By advising users that they are personally accountable for their actions, which are tracked by an audit trail that logs user activities, managers can help promote proper user behavior. Users are less likely to attempt to circumvent security policy if they know that their actions will be recorded in an audit log.

For example, audit trails can be used in concert with access controls to identify and provide information about users suspected of improper modification of data (e.g., introducing errors into a database).  An audit trail may record “before” and “after” versions of records. (Depending upon the size of the file and the capabilities of the audit logging tools, this may be very resource-intensive.)  Comparisons can then be made between the actual changes made to records and what was expected.  This can help management determine if errors were made by the user, by the system or application software, or by some other source.
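As a simple sketch of how this can be done inside the database (the Emp table and audit columns are hypothetical; the trigger syntax shown is typical of Oracle PL/SQL), an audit trail for salary changes can record the "before" and "after" values together with the user and time:

CREATE TABLE Emp_Audit (
    EmpID      INT,
    Old_Salary NUMBER,
    New_Salary NUMBER,
    Changed_By VARCHAR2(30),
    Changed_At DATE
);

CREATE OR REPLACE TRIGGER trg_emp_salary_audit
AFTER UPDATE OF Salary ON Emp
FOR EACH ROW
BEGIN
    INSERT INTO Emp_Audit
    VALUES (:OLD.EmpID, :OLD.Salary, :NEW.Salary, USER, SYSDATE);
END;

Reviewing Emp_Audit later supports the security uses described above: it shows who changed which row, when, and what the values were before and after the change.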

Question: 4
i) How does OLAP support query processing in a data warehouse?
ii) Differentiate between embedded SQL and dynamic SQL. Give an example of embedded SQL.

Solution:-

i) How does OLAP support query processing in a data warehouse?

OLAP (Online Analytical Processing) [21] is a data warehouse model widely used in industry. Recently, this model has been extended to work with uncertain and imprecise data [3, 13, 15]. In OLAP, the top-k query is the dominant type of query. Although the semantics of top-k queries for uncertain tuples have been discussed by several authors (e.g., [23]), how to process top-k queries in uncertain OLAP is still unexplored. This work presents a solution to the problem of answering top-k queries in uncertain OLAP. Its contributions are summarized as follows:

• An extension of top-k query semantics for uncertain OLAP.

• An algorithm to effectively process top-k queries under the proposed semantics.

• A method to materialize data to speed up searching.

• An experimental evaluation to evaluate the proposed model.


Data warehouse and OLAP tools are based on a multidimensional data model. This model views data in the form of a data cube. A data cube allows data to be modeled and viewed in multiple dimensions. Dimensions are perspectives or entities with respect to which an organization wants to keep records. Each dimension has a table associated with it, called a dimension table, which further describes the dimension. A multidimensional data model is typically organized around a central theme. This theme is represented by a fact table. The fact table contains the names of the facts, or measures, as well as keys to each of the related dimension tables.
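To make the data cube idea concrete, here is a minimal sketch (the Sales fact table and the Time_Dim and Item_Dim dimension tables are hypothetical) of the kind of roll-up query OLAP tools issue against a warehouse, using the standard SQL grouping extension:

SELECT t.year, i.category, SUM(s.amount) AS total_sales
  FROM Sales s
  JOIN Time_Dim t ON s.time_key = t.time_key
  JOIN Item_Dim i ON s.item_key = i.item_key
 GROUP BY ROLLUP (t.year, i.category);

-- ROLLUP computes the detail rows, the subtotals per year and the grand total
-- in a single pass, which is exactly the pre-aggregated, multidimensional view
-- that OLAP front ends expect the warehouse to answer efficiently.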

ii) Differentiate between embedded SQL and dynamic SQL. Give an example of embedded SQL.

Differentiate between embedded SQL and dynamic SQL

Embedded SQL: The statements are hard-coded in the program itself; only the values in the host variables can change. The access path is generated at BIND time, and table authorizations are also checked at BIND time (if VALIDATE(BIND) is used).

Dynamic SQL: The columns and tables used can change during execution of the program, and it does not contain host variables. The access path is generated at run time, and table authorizations can only be checked at run time.
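By way of contrast with the embedded SQL program below, a minimal dynamic SQL sketch (Oracle PL/SQL syntax assumed; the table name is hypothetical) builds the statement text at run time, so the access path can only be determined when the statement executes:

DECLARE
    v_table VARCHAR2(30) := 'Orders';   -- table name chosen at run time
    v_count NUMBER;
BEGIN
    EXECUTE IMMEDIATE 'SELECT COUNT(*) FROM ' || v_table INTO v_count;
    DBMS_OUTPUT.PUT_LINE('Rows: ' || v_count);
END;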

Embedded SQL Example
The following code is a simple embedded SQL program, written in C. The program illustrates many, but not all, of the embedded SQL techniques. The program prompts the user for an order number, retrieves the customer number, salesperson, and status of the order, and displays the retrieved information on the screen.

#include <stdio.h>
#include <stdlib.h>

int main() {

    EXEC SQL INCLUDE SQLCA;

    EXEC SQL BEGIN DECLARE SECTION;
        int  OrderID;           /* Order number (from user)      */
        int  CustID;            /* Retrieved customer ID         */
        char SalesPerson[10];   /* Retrieved salesperson name    */
        char Status[6];         /* Retrieved order status        */
    EXEC SQL END DECLARE SECTION;

    /* Set up error processing */
    EXEC SQL WHENEVER SQLERROR GOTO query_error;
    EXEC SQL WHENEVER NOT FOUND GOTO bad_number;

    /* Prompt the user for an order number */
    printf("Enter order number: ");
    scanf("%d", &OrderID);

    /* Execute the SQL query */
    EXEC SQL SELECT CustID, SalesPerson, Status
             INTO :CustID, :SalesPerson, :Status
             FROM Orders
             WHERE OrderID = :OrderID;

    /* Display the results */
    printf("Customer number: %d\n", CustID);
    printf("Salesperson: %s\n", SalesPerson);
    printf("Status: %s\n", Status);
    exit(0);

query_error:
    printf("SQL error: %ld\n", sqlca.sqlcode);
    exit(1);

bad_number:
    printf("Invalid order number.\n");
    exit(1);
}

Question: 5

Explain the following with the help of an example:

i) Application of Data grid
ii) XML and HTML
iii) Data-marts
iv) Security classes
v) Concurrency control
vi) Deductive database
vii) Significance of creating data dictionary in database
viii) Datagrid
ix) Query optimization
x) Heterogeneous database.

Solution:-

i) Application of Datagrid
A data grid is an in-memory distributed database designed for fast access to large volumes of data and scalability. Data grids commonly provide a complementary layer to the relational database and the application.

Key characteristics of a data grid:

In-memory, distributed caching

Map/reduce

Elastic scalability

Advanced querying

Data replication

Processing for streaming data

Transaction capabilities

ii) XML and HTML

XML

XML (Extensible Markup Language) is a flexible way to create common information formats and share both the format and the data on the World Wide Web, intranets, and elsewhere. For example, computer makers might agree on a standard or common way to describe the information about a computer product (processor speed, memory size, and so forth) and then describe the product information format with XML.

HTML
HyperText Markup Language (HTML) is a markup language used for creating and visually representing a webpage. It is the external appearance of every webpage. In essence, HTML adds “markup” to standard English text. “Hyper Text” refers to links that connect Web pages to one another, making the World Wide Web what it is today. By creating and uploading Web pages to the Internet, you become an active participant, a producer and not just a consumer. Apart from visual images, it supports other media as well. More precisely, HTML is the language that describes the structure and the semantic content of a web document. Content within a web page is tagged with HTML elements such as <img>, <title>, <p>, <div>, <picture>, and so forth. These elements form the building blocks of a website.

HTML offers authors several mechanisms for specifying lists of information. All lists must contain one or more list elements. Lists may contain:

Unordered information.

Ordered information.

Definitions.

The previous list, for example, is an unordered list, created with the UL element.

An ordered list, created using the OL element, should contain information where order should be emphasized, as in a recipe:

Mix dry ingredients thoroughly.

Pour in wet ingredients.

Mix for 10 minutes.

Bake for one hour at 300 degrees.

Definition lists, created using the DL element, generally consist of a series of term/definition pairs (although definition lists may have other applications). Thus, when advertising a product, one might use a definition list:

Lower cost
The new version of this product costs significantly less than the previous one!
Easier to use
We’ve changed the product so that it’s much easier to use!
Safe for kids
You can leave your kids alone in a room with this product and they won’t get hurt (not a guarantee).

In HTML, such a list is defined with the DL element, with each term marked up as a DT element and each definition as a DD element.

iii) Data-marts
A data mart is a simple form of a data warehouse that is focused on a single subject (or functional area), such as Sales, Finance, or Marketing. Data marts are often built and controlled by a single department within an organization. Given their single-subject focus, data marts usually draw data from only a few sources. The sources could be internal operational systems, a central data warehouse, or external data.
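For example, as a hedged sketch (the warehouse table, columns and predicate are hypothetical; CREATE TABLE ... AS SELECT is supported by most DBMSs), a Sales department could populate its own data mart from the central warehouse with a simple extract:

CREATE TABLE Sales_Mart AS
SELECT s.time_key, s.item_key, s.store_key, s.amount
  FROM Warehouse_Sales s
 WHERE s.region = 'NORTH';    -- only the subject area this department needs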

iv)  Security Class
The Security Class contains methods that help you create a secure application, processing input data for security.

A typical use is to pass incoming data through the class’s XSS filter before it is processed. If you want the filter to run automatically every time it encounters POST or COOKIE data, you can enable global XSS filtering by opening your application/config/config.php file and turning the corresponding option on.

v) Concurrency control
The internal concurrency model in Apache Jackrabbit is fairly complex, and a number of deadlock issues have been reported and fixed over the Jackrabbit 1.x release cycle. The notes below are the result of a design and code review targeted at proactively preventing other similar issues.

These notes are about the internal concurrency and synchronization model in Jackrabbit, not about the JCR locking feature. Note that the review behind them targeted concurrency control at an architectural level and did not focus much on issues like thread-safety of individual classes or components. The review is based on Jackrabbit version 1.5 in its default configuration.

In terms of concurrency control, the Jackrabbit architecture can roughly be divided into five main layers:

Cluster

Repository

Workspace

Session

Transaction



vi) Deductive database
A deductive database is a database system that can make deductions (i.e., conclude additional facts) based on rules and facts stored in the (deductive) database. Datalog is the language typically used to specify facts, rules and queries in deductive databases; for example, the rules ancestor(X, Y) :- parent(X, Y) and ancestor(X, Y) :- parent(X, Z), ancestor(Z, Y) derive all ancestor facts from stored parent facts. Deductive databases have grown out of the desire to combine logic programming with relational databases, to construct systems that support a powerful formalism and are still fast and able to deal with very large data sets. Deductive databases are more expressive than relational databases but less expressive than logic programming systems. In recent years, deductive databases based on Datalog have found new applications in data integration, information extraction, networking, program analysis, security, and cloud computing.
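These rules have a direct analogue in recursive SQL, shown here as a hedged sketch (hypothetical Parent(parent, child) table; WITH RECURSIVE syntax as in PostgreSQL):

WITH RECURSIVE Ancestor (anc, descn) AS (
    SELECT parent, child FROM Parent      -- ancestor(X, Y) :- parent(X, Y).
    UNION ALL
    SELECT p.parent, a.descn              -- ancestor(X, Y) :- parent(X, Z), ancestor(Z, Y).
      FROM Parent p
      JOIN Ancestor a ON a.anc = p.child
)
SELECT * FROM Ancestor;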

vii) Significance of creating data dictionary in database
The term “data dictionary” is used by many, including myself, to denote a separate set of tables that describes the application tables. The Data Dictionary contains such information as column names, types, and sizes, but also descriptive information such as titles, captions, primary keys, foreign keys, and hints to the user interface about how to display the field.

A super-simple beginning Data Dictionary might look like this:
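For instance, a hypothetical sketch in SQL (the column set is illustrative, not a fixed standard):

CREATE TABLE Data_Dictionary (
    Table_Name  VARCHAR(30),
    Column_Name VARCHAR(30),
    Data_Type   VARCHAR(20),
    Col_Size    INT,
    Is_PK       CHAR(1),         -- 'Y' if the column is part of the primary key
    FK_Table    VARCHAR(30),     -- referenced table, if the column is a foreign key
    Caption     VARCHAR(60),     -- title/hint shown by the user interface
    PRIMARY KEY (Table_Name, Column_Name)
);

INSERT INTO Data_Dictionary
VALUES ('Orders', 'OrderID', 'INT', 4, 'Y', NULL, 'Order Number');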


viii) Datagrid
The DataGrid is the central component of many applications due to its effective and usable presentation of tabular data. In this tutorial we will look at how to define a grid’s layout and discuss the scrolling mechanism the DataGrid uses.

ix) Query optimization
Indexes improve the efficiency of read operations by reducing the amount of data that query operations need to process. This simplifies the work associated with fulfilling queries within MongoDB.

Example:
An application queries the inventory collection on the type field. The value of the type field is user-driven.

To improve the performance of this query, add an ascending, or a descending, index to the inventory collection on the type field. In the mongo shell, you can create such an index with the db.collection.ensureIndex() method, for example db.inventory.ensureIndex( { type: 1 } ).

This index can prevent the above query on type from scanning the whole collection to return the results.
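The same idea applies in a relational database; as a hedged sketch (hypothetical inventory table), an index on the queried column lets the optimizer avoid a full scan:

CREATE INDEX idx_inventory_type ON inventory (type);

SELECT * FROM inventory WHERE type = 'food';
-- With the index in place, this selective query can be answered by an index
-- scan instead of reading every row of the table.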

x) Heterogeneous database
Easy management of, and fast access to, corporate database software are critical for successful business operations. HiT Software® products provide support for data management and real-time access to all major databases, such as Oracle, MS SQL Server, MySQL, Informix, Sybase and IBM DB2 (AS400, LUW/UDB, mainframe) with minimal configuration and excellent ROI. Take advantage of native database tracking tools such as transaction log reading and high-performance middleware connectivity products with HiT Software, the leader in Data Integration, Data Replication, Change Data Capture (CDC) and Data Movement solutions for all major relational databases regardless of platform.

This example illustrates using Streams to replicate data between four databases. The environment is heterogeneous because three of the databases are Oracle databases and one is a Sybase database. DML and DDL changes made to tables in the hr schema at the dbs1.net Oracle database are captured and propagated to the other two Oracle databases. Only DML changes are captured and propagated to the dbs4.net database, because an apply process cannot apply DDL changes to a non-Oracle database. Changes to the hr schema occur only at dbs1.net. The hr schema is read-only at the other databases in the environment.
