
About this book

This easy-to-read textbook/reference presents a comprehensive introduction to databases, opening with a concise history of databases and of data as an organisational asset. As relational database management systems are no longer the only database solution, the book takes a wider view of database technology, encompassing big data, NoSQL, object and object-relational, and in-memory databases. The text also examines the issues of scalability, availability, performance and security encountered when building and running a database in the real world.
Topics and features:
- Presents review and discussion questions at the end of each chapter, in addition to skill-building, hands-on exercises
- Introduces the fundamental concepts and technologies in database systems, placing these in an historic context
- Describes the challenges faced by database professionals
- Reviews the use of a variety of database types in business environments
- Discusses areas for further research within this fast-moving domain

Table of Contents

Databases in Context

Frontmatter

1. Data, an Organisational Asset

Abstract
Data has always been at the centre of an organisation's operations and decision making. Today an organisation depends on data for its very survival. Data is used to make strategic decisions about the future direction an organisation will take, and for that the data must be both current and accurate. Because data is an asset to a company, it can be given a value and it can be traded. This chapter will look at data as an organisational asset, tracing its importance in recording, analysing and planning as organisations grew and became more sophisticated. The impact of industrialisation on data requirements will be examined, along with the emergence of data as an important component of a company's assets following the advent of electronic data storage and processing. Finally there will be a discussion of big data and the current issues surrounding privacy when everything that can be recorded is recorded.
Peter Lake, Paul Crowther

2. A History of Databases

Abstract
In Chap. 1 we discussed data as an organisational asset. We saw that data, usually in the form of records, has been with us since at least ancient Egyptian times. We also saw that the big drivers of the need to keep detailed records were trade and taxation. Electronic computers provided an accurate and fast way of recording and calculating data. The first commercial computers appeared in the early 1950s; with these, data could be processed much faster and with fewer errors than with manual systems, and a greater range of more up-to-date reports could be produced. Initially, processing required whole files to be read from start to finish (sequential processing), but with advancing hardware technologies and falling prices, direct access to data made it possible to update individual records without having to process an entire file. It also meant data needed to be structured to avoid duplication and make efficient use of electronic storage. Initially this was in the form of hierarchical and network structures. In 1970 Edgar Codd provided the theoretical framework which was developed into relational data structures. This became the basis of many of the systems which are in use today. Relational databases were designed to optimise disk usage and performance, but the very nature of a hard disk meant there were speed limitations. Now memory-based systems, which require data to be structured differently, are appearing. The evolution of databases is continuing.
Peter Lake, Paul Crowther

3. Physical Storage and Distribution

Abstract
In this chapter we consider how we actually store data. Turning information into a series of 1s and 0s is at the heart of every current database system, so physical storage and distribution are important concepts to understand. We explore the ways in which database architectures are built up from the building blocks of bytes and data blocks, and the way that physical elements such as RAM and disk are used by the database system. We illustrate the way that these elements interact by walking through a simple read operation. For most databases permanency is of critical importance, and we examine how data is stored and how memory structures are used to attempt to minimise the comparatively slow process of reading data from disk. Most users, however, interact with a modern database system by using logical concepts such as tables and rows, and so we also explain how physical storage can map to logical objects. Database professionals deal with data on a daily basis. We look at the ways that this raw material can be moved between disparate systems using tools like SQL Loader and Import/Export. In addition we look at how professionals can use the supporting files, such as redo logs and alert logs. Finally we discuss placement of the data, and when it may be appropriate to distribute the stored data amongst a number of servers.
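As a brief illustration of the idea that logical rows end up packed into fixed-size data blocks, the following minimal Python sketch places serialised records into 4 KB blocks. The block size and record format are illustrative assumptions for this sketch, not the layout of any particular database product.

```python
import struct

BLOCK_SIZE = 4096  # assumed block size; real systems vary (typically 2 KB to 32 KB)

def pack_row(row_id: int, name: str) -> bytes:
    """Serialise one logical row into a simple length-prefixed byte record."""
    name_bytes = name.encode("utf-8")
    return struct.pack("<IH", row_id, len(name_bytes)) + name_bytes

def fill_blocks(rows):
    """Place serialised rows into fixed-size blocks, starting a new block
    when the next row no longer fits (a simplified heap-file layout)."""
    blocks, current = [], b""
    for row in rows:
        record = pack_row(*row)
        if len(current) + len(record) > BLOCK_SIZE:
            blocks.append(current.ljust(BLOCK_SIZE, b"\x00"))
            current = b""
        current += record
    if current:
        blocks.append(current.ljust(BLOCK_SIZE, b"\x00"))
    return blocks

blocks = fill_blocks((i, f"customer-{i}") for i in range(10_000))
print(f"{len(blocks)} blocks of {BLOCK_SIZE} bytes")
```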
Peter Lake, Paul Crowther

Database Types

Frontmatter

4. Relational Databases

Abstract
No discussion of relational databases would be complete without a reference to Edgar Codd's 1970 paper "A Relational Model of Data for Large Shared Data Banks". This was a mathematical description of what we now call relational databases and the operations to manipulate them. This relational algebra organises data into relations (tables) consisting of tuples (records), each with a consistent set of attributes. The tables can be linked or joined by using relationships based on a unique identifier of each record, the primary key. As well as describing the structure of the data, a method called normalisation was developed to assist with the design of relational databases. An alternative approach to developing a logical design of a database is to identify entities in the system and then map the relationships between them. Codd also described a number of operations which could be performed on relations. These included selection, projection and joins between tables. Selection defines conditions on the rows to be retrieved, for example people with an age greater than 18; projection restricts the result to chosen attributes; and joins combine related tables. Ultimately these operations were formalised into the Structured Query Language (SQL), the first commercial implementation of which was released in 1979 by what was to become Oracle Corporation.
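A minimal sketch of selection, projection and a join, expressed in SQL and run through Python's built-in sqlite3 module. The customer/orders schema and the data are invented for illustration only.

```python
import sqlite3

# Illustrative schema: the table and column names are invented for this sketch.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT,
        age         INTEGER
    );
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customer(customer_id),
        total       REAL
    );
    INSERT INTO customer VALUES (1, 'Ann', 17), (2, 'Bob', 34);
    INSERT INTO orders   VALUES (10, 2, 99.50);
""")

# Selection: restrict the rows returned (WHERE age > 18).
# Projection: restrict the attributes returned (only name and total).
# Join: combine the tables on the primary key / foreign key relationship.
rows = conn.execute("""
    SELECT c.name, o.total
    FROM customer c
    JOIN orders o ON o.customer_id = c.customer_id
    WHERE c.age > 18
""").fetchall()
print(rows)   # [('Bob', 99.5)]
```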
Peter Lake, Paul Crowther

5. NoSQL Databases

Abstract
The NoSQL movement is still relatively new. Databases which store data without schemas and which do not necessarily provide transactional security may seem like a bad idea to experienced relational database practitioners, but these tools certainly have their place in today's data-rich society. We examine the area in general and then look at specific examples of a column-based and a document-based database, with hands-on tutorials for each, using Cassandra and MongoDB as examples. Today's sales battlegrounds are often based around scalability and performance, and it is precisely these two aspects that tend to be pushed by NoSQL tool providers as they claim their products outperform and out-scale traditional RDBMSs. Naturally the traditional vendors are fighting back and making counter claims. Equally naturally the real position is somewhere between the two, with most answers starting with "It depends …". So we look at the difference in philosophy between the transactional ACID databases and the NoSQL approach. The concept of "eventual consistency" is explored. The concept of a schema-less database is discussed and then turned into reality with the tutorial material. We then review some of the strengths and weaknesses of these new approaches and examine the business case for a non-relational approach.
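A brief sketch of what schema-less storage looks like in practice, using the pymongo driver. It assumes pymongo is installed and a MongoDB server is running locally on the default port; the database, collection and field names are invented for illustration.

```python
from pymongo import MongoClient  # assumes pymongo is installed and a local mongod is running

client = MongoClient("localhost", 27017)
collection = client["bookshop"]["customers"]   # illustrative database/collection names

# Schema-less storage: two documents in the same collection with different fields,
# with no schema declared up front.
collection.insert_one({"name": "Ann", "email": "ann@example.com"})
collection.insert_one({"name": "Bob", "loyalty_points": 120, "tags": ["trade", "priority"]})

# Querying does not require a predeclared schema either.
for doc in collection.find({"name": "Bob"}):
    print(doc)
```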
Peter Lake, Paul Crowther

6. Big Data

Abstract
If you look at many leading database vendors' web sites, you will see that we are in the Big Data era. We explore what this actually means and, using a tutorial, review one of the key concepts in this era—that of MapReduce. We discover that Big Data is not just about volume. We consider the idea that Big Data needs to have four key attributes to be really called Big Data: Volume, Variety, Velocity and Veracity. We see how Big Data depends greatly upon Cloud Computing and will need to become very mobile-device driven. Experts in Big Data are typically known as Data Scientists. The Harvard Business Review recently described this as The Sexiest Job of the 21st Century. We see what skills go to make up a good data scientist, and review some of the tools available for the job. MapReduce has become a core technique for data scientists. Recognising that analysing high volumes of potentially unstructured data that may be spread across many data nodes is not something that the relational model would be good at handling, tools like Hadoop and MongoDB have come to the forefront. We provide a tutorial which builds on the MongoDB skills learned in Chap. 5 to allow readers to attempt some MapReduce exercises.
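A minimal, single-process sketch of the MapReduce idea, written in plain Python rather than Hadoop or MongoDB. The word-count example and document strings are illustrative; the point is only to show the independent map phase, the grouping (shuffle) by key, and the reduce phase that distributed frameworks parallelise across data nodes.

```python
from collections import defaultdict

documents = [
    "big data is not just about volume",
    "volume variety velocity veracity",
]

# Map phase: each document is turned into (key, value) pairs independently,
# which is what allows the work to be spread across many data nodes.
def map_phase(doc):
    return [(word, 1) for word in doc.split()]

# Shuffle: group the intermediate pairs by key.
grouped = defaultdict(list)
for doc in documents:
    for key, value in map_phase(doc):
        grouped[key].append(value)

# Reduce phase: combine the values for each key.
def reduce_phase(key, values):
    return key, sum(values)

counts = dict(reduce_phase(k, v) for k, v in grouped.items())
print(counts["volume"])   # 2
```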
Peter Lake, Paul Crowther

7. Object and Object Relational Databases

Abstract
One of the issues with data is that it has become more and more complex. As we saw in Chap. 2, there was a transition from early file-based systems to more complex database systems which could be accessed using a query language. The format of data in relational systems is primarily alphabetic, numeric or alphanumeric, with some special types such as date. In a relational database one is able to search, manipulate and select data with a powerful query language, SQL. As seen in Chap. 4, relational databases became the de facto standard for this type of data. However, there is a growing demand to store complex objects and retrieve them. Databases are also being increasingly used in systems written in object-oriented programming languages such as C++ and Java, which do not fit well with SQL. To overcome these problems, standards for object-oriented databases have been proposed. As a compromise solution, some vendors, including Oracle, have developed object-relational systems where object structures are incorporated into the relational model.
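A small sketch of the mismatch between application objects and relational tables, using a Python class and sqlite3. The Customer class, table names and mapping code are invented for illustration; the hand-written flattening of a nested collection into a second table is the kind of work that object and object-relational databases aim to reduce.

```python
import sqlite3
from dataclasses import dataclass

# An application-side object: illustrative only.
@dataclass
class Customer:
    customer_id: int
    name: str
    phone_numbers: list   # a nested collection with no direct single-column relational equivalent

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE phone    (customer_id INTEGER, number TEXT);
""")

def save(c: Customer):
    """Mapping code the programmer must write by hand: the object is
    flattened into two tables."""
    conn.execute("INSERT INTO customer VALUES (?, ?)", (c.customer_id, c.name))
    conn.executemany("INSERT INTO phone VALUES (?, ?)",
                     [(c.customer_id, n) for n in c.phone_numbers])

save(Customer(1, "Ann", ["0114 000000", "0797 000000"]))
print(conn.execute("SELECT COUNT(*) FROM phone").fetchone())   # (2,)
```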
Peter Lake, Paul Crowther

8. In-Memory Databases

Abstract
Disk-based database technology has influenced database design since the inception of electronic databases. One of the issues with disk-based media is that the physical design of systems tries to speed up processing by reducing disk access; in other words, disk input/output (I/O) is a limiting factor which needs to be optimised. In-memory databases have been described as a disruptive technology, or a disruptive tipping point, because they provide a significant improvement in performance and use of system resources.
In-memory database systems are database management systems where the data is stored entirely in main memory. In-memory systems have been shown to be up to 50,000 times faster than disk-based systems. This increase in speed, along with the falling cost of computer memory and developments in non-volatile memory, has led to greater interest in in-memory databases since the mid-2000s. Preimesberger (2013) coins the term 'data half-life', referring to data being most valuable in real time and diminishing in value over time. Processing speed that gives real-time insights which can be acted on immediately is therefore important. In-memory databases allow real-time online analytical processing (OLAP) on data from online transaction processing (OLTP) systems.
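A minimal sketch of OLAP-style analysis over freshly written OLTP-style rows with no disk I/O on the query path, using SQLite's :memory: mode purely as a stand-in for a dedicated in-memory DBMS. The sales table, region names and row counts are invented for illustration.

```python
import sqlite3, random, time

# SQLite's :memory: mode stands in here for a dedicated in-memory DBMS;
# it simply illustrates analysis of transactional data without touching disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")

# OLTP-style inserts.
regions = ["north", "south", "east", "west"]
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [(random.choice(regions), random.uniform(1, 500)) for _ in range(100_000)],
)

# OLAP-style aggregate over the same in-memory data, available immediately.
start = time.perf_counter()
for region, total in conn.execute(
        "SELECT region, SUM(amount) FROM sales GROUP BY region"):
    print(region, round(total, 2))
print(f"query time: {time.perf_counter() - start:.4f}s")
```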
Peter Lake, Paul Crowther

What Database Professionals Worry About

Frontmatter

9. Database Scalability

Abstract
Once you have designed your database, especially when supporting a web- or cloud-based solution, you need to be sure that it can grow if the business that the application supports is successful. Scalability is about ensuring that you can cope with many concurrent users, or huge amounts of data, or both. This is becoming ever more important in the era of Big Data. This chapter is in two main parts. We first examine strategies for coping with increases in the number of users of a system, and then look at the alternative approaches to handling the growth of data being stored and queried. As always there are balances to be found. Some scalability solutions can cause performance issues, and can be quite expensive, for example. We look at the difference between standalone, client/server, and distributed approaches to scalability, and explore the design constraints that each approach imposes upon database professionals. We also review the difference between horizontal and vertical scaling, and the Shared Nothing or Shared Everything approaches.
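As a small illustration of one horizontal-scaling (shared-nothing) strategy, the following Python sketch routes rows to nodes by hashing a key. The node names and customer keys are invented, and the comment notes the trade-off that motivates alternatives such as consistent hashing.

```python
import hashlib

# Four shared-nothing nodes: each would hold only its own slice of the data.
NODES = ["db-node-0", "db-node-1", "db-node-2", "db-node-3"]

def node_for(key: str) -> str:
    """Route a row to a node by hashing its key (simple hash sharding).
    Trade-off: adding a node changes most keys' placement, which is why
    real systems often use consistent hashing or range partitioning instead."""
    digest = hashlib.sha1(key.encode("utf-8")).hexdigest()
    return NODES[int(digest, 16) % len(NODES)]

for customer_id in ["C1001", "C1002", "C1003", "C1004"]:
    print(customer_id, "->", node_for(customer_id))
```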
Peter Lake, Paul Crowther

10. Database Availability

Abstract
Once your system is built, you need to be able to have it available for use permanently (or as close to permanently as can be achieved within the financial resources at your disposal). We review key concepts such as back-up, recovery, and disaster recovery.
The holy grail for availability is often thought to be five nines, that is, having the database available for 99.999% of the time. To achieve this a database would need to be on-line for all but a few minutes a year, and, unless organisations are willing to spend heavily, this is an almost impossible target, as we explain in the chapter. We look at the things that do go wrong in database systems to try and understand how we can get as close to the target as possible.
As we see, we can’t rely on the database platform being error free. To safeguard against unexpected data loss we need to take copies of the data known as Back-ups. If you should lose any data, then you use the copies stored as a backup to replace the lost data. We review alternative approaches to Backup and Recovery, and then go on to discuss Disaster Recovery. Cloud computing does change the way we think of Availability and we examine a variety of differences that Cloud brings.
Peter Lake, Paul Crowther

11. Database Performance

Abstract
For a DBA the dreaded phone call is "my report is running very slowly". For a start, what is meant by "slowly"? What is the user used to? Then there is the problem of how you establish where the problem is: is it hardware related? Or network related? At the server or client end? We see that performance problems can be hard to diagnose because of the complexity of the tiers involved (physical, OS, network, servers, etc.). Moreover, even if we do find the cause and resolve the problem, there is no guarantee that it will remain solved because, as we add data to our system, the data distribution may make some of our previous design assumptions invalid. The solution may be indexes, or partitions, or a variety of other performance-enhancing techniques that we review in this chapter. We include some tutorial material which explores some performance management tools. We look at some of the important elements of a database's performance, discussing issues like throughput, workload and contention. We note that there are many levels at which we can apply performance tools, including the physical level (increasing RAM, for example) and the schema level. However, much of our focus is on SQL optimisation, and we provide examples of using tools like Explain Plan to assist with the tuning of SQL.
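A brief sketch of inspecting a query plan before and after adding an index, using SQLite's EXPLAIN QUERY PLAN as a lightweight stand-in for tools such as Oracle's Explain Plan. The orders table, data and index name are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(i, i % 1000, i * 1.5) for i in range(50_000)])

query = "SELECT total FROM orders WHERE customer_id = 42"

# Before indexing: the optimiser has to scan the whole table.
print(conn.execute("EXPLAIN QUERY PLAN " + query).fetchall())

# Add an index on the filtered column and compare the plan.
conn.execute("CREATE INDEX idx_orders_customer ON orders(customer_id)")
print(conn.execute("EXPLAIN QUERY PLAN " + query).fetchall())
```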
Peter Lake, Paul Crowther

12. Security

Abstract
Data is one of an organisation's most important assets, so steps need to be taken to protect it. Security generally has three aspects to it: physical security, software security and procedures. With physical security the main issue is how to protect the servers and infrastructure from damage or loss. This may range from a catastrophic event affecting the data centre through to someone stealing the actual server. It also includes how to recover operations if there is a major incident. A more recent physical security threat has emerged from the increase in use of mobile devices. Loss and theft of these devices, which may contain sensitive information or be able to connect automatically to sensitive systems, has been a growing problem for some time. Software security is the more important security consideration on a day-to-day basis. Many databases are accessible from the internet, although usually through another server, and this is where most threats come from. Keeping virus checking and firewall software up to date will prevent most external attacks, but internal abuse of user privileges also needs to be considered. The final aspect is the procedures which are in place. These will also have an impact on how physical and software security is implemented and maintained. Questions to ask are: what procedures are in place, are they adequate, and how are they audited? A major component of this is a risk register, which needs to be regularly updated and reviewed.
Peter Lake, Paul Crowther