Weeks 7 & 8 CST 363: Map-Shuffle-Reduce and the FINAL

During Week 7, we learned about MapReduce (or Map-Shuffle-Reduce), a model used to process large amounts of data across distributed systems.

During the mapping stage, relevant data is selected from tables on each worker node. Then, during the shuffle stage, this data is redistributed and sorted into temporary tables on each node. These two stages may be repeated any number of times, depending on how many joins are necessary to get the desired results. Finally, we enter the reduce stage, which combines the results sent from every worker node to the controller.

While the concept of MapReduce is easy to understand, it took a while for me to wrap my head around how it should be implemented in our Python model.

However, I ultimately realized that every temporary table created in the map-shuffle stages is equivalent to a SELECT subquery in the FROM clause in MySQL. The reduce stage is equivalent to the main query that uses these subqueries and JOINs to get the desired result. Once I understood this, crafting programs that return result tables using MapReduce became incredibly easy.
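To make this concrete, here is a toy sketch in Python (the table names and rows are made up, and this is not our actual assignment code). The map stage keeps only the needed columns on each "node," the shuffle stage regroups rows by the join key into temporary structures, and the reduce stage combines each group, much like a main query joining SELECT subqueries in its FROM clause.

from collections import defaultdict

# Toy "tables" standing in for data spread across worker nodes (made-up rows).
students = [{"sid": 1, "name": "Ana", "dept": "CS"},
            {"sid": 2, "name": "Ben", "dept": "Math"}]
grades = [{"sid": 1, "course": "CST 363", "grade": "A"},
          {"sid": 2, "course": "CST 363", "grade": "B"},
          {"sid": 1, "course": "CST 311", "grade": "A"}]

# Map: each node keeps only the columns it needs (like a subquery's SELECT list).
mapped_students = [(row["sid"], row["name"]) for row in students]
mapped_grades = [(row["sid"], row["grade"]) for row in grades]

# Shuffle: regroup everything by the join key into temporary tables.
shuffled = defaultdict(lambda: {"names": [], "grades": []})
for sid, name in mapped_students:
    shuffled[sid]["names"].append(name)
for sid, grade in mapped_grades:
    shuffled[sid]["grades"].append(grade)

# Reduce: combine each group into a result row (like the main query's JOIN/aggregation).
result = [{"sid": sid, "name": group["names"][0], "grades": len(group["grades"])}
          for sid, group in shuffled.items()]
print(result)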

I look forward to learning more about utilizing cluster computing in the future.

Entering week 8, I only have the final to worry about, and I feel extremely prepared. I have read over the questions and I have no doubt that I will do well. I have learned a great deal in this class, and I look forward to integrating my knowledge of MySQL into Web Programming this summer.

Week 6 CST 363: ACID Properties

This week we focused on the importance of maintaining the ACID properties of databases.

Before understanding what these properties entail, one must understand that a database management system's (DBMS) transactions are composed of groups of tasks, and that a task is the minimum processing unit that cannot be broken down further.

ACID

A: Atomicity

Each transaction must be treated as an atomic unit. That is, either ALL of its tasks are executed and committed or NONE.

C: Consistency

The database must remain in a consistent state after any transaction.

I: Isolation

Concurrent transactions (e.g. in the case of multiple users accessing the database at once) should be executed as if they are the ONLY transaction in the system. Transactions should not affect one another.

D: Durability

The database should retain its latest updates even if the system fails or restarts. The system should hold on to committed changes and finish writing them out once it comes back online.
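To make atomicity concrete, here is a minimal sketch of how a transaction might look in Python with the mysql-connector-python driver (the table, credentials, and amounts are placeholders, not from our assignments):

import mysql.connector

# Placeholder connection details for illustration only.
conn = mysql.connector.connect(user="user", password="pass", database="bank")
cursor = conn.cursor()
try:
    conn.start_transaction()
    # Two tasks that must succeed or fail together (atomicity):
    cursor.execute("UPDATE accounts SET balance = balance - 100 WHERE id = 1")
    cursor.execute("UPDATE accounts SET balance = balance + 100 WHERE id = 2")
    conn.commit()      # both changes are made permanent (durability)
except mysql.connector.Error:
    conn.rollback()    # on any failure, neither change is applied
finally:
    cursor.close()
    conn.close()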

I feel pretty good about solving these past two assignments. They were challenging, but doable and enjoyable. On a lighter note, here is a picture of me drawn by one of my high school computer science students:

Week 5 CST 363: B-Trees

This week we learned about B-Trees and B+ Trees as a way of more efficiently reading and writing data from a hard disk drive (HDD).

In order to understand B-Trees, it is important to understand an HDD's structure and how data is read from and written to it. An HDD is divided into tracks and sectors, the two pieces of information used in block addresses. In addition to track and sector, we also need to know the offset (from byte 0) at which to read or write. To change sectors, the HDD platter spins; to change tracks, the HDD's head shifts.

Once we understand an HDD's structure, we can understand how data, such as a database, is stored on it. Database records are rows within a table. The amount of space a row takes up on the HDD depends on how many bytes each column has been allocated. We can then use that information to determine how many records (rows) we can store per block. Block size is commonly 512 bytes, but it depends on the HDD manufacturer. For example, if we create a table with 5 columns: eid (10 bytes), name (50 bytes), dept (10 bytes), section (8 bytes), and address (50 bytes), each record would have a size of 128 bytes. We could then calculate that we can store up to 4 records per block (of 512 bytes), and that we would need 25 blocks to store a table with 100 rows.

Considering a table of this size, in the worst case we would have to access all 25 blocks on the HDD to return the results of a query. To make this process more efficient, we use indexing. An index stores a key (a column from the table) and a pointer to the record on the HDD. This reduces the number of blocks to access during a search, because an index entry (key plus pointer) is smaller than an entire record. Smaller entries mean fewer blocks of storage and fewer blocks to search through. If each index entry takes up only 16 bytes, for instance, we would need only 4 blocks to store the index entries for 100 records. This means that, at most, we would need to access 5 blocks (all 4 index blocks plus the block to which the pointer points).
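The arithmetic above is easy to sanity-check with a few lines of Python (the 512-byte block and 16-byte index entry are the assumed figures from the example; real systems vary):

import math

BLOCK_SIZE = 512                                   # assumed bytes per block
record_size = 10 + 50 + 10 + 8 + 50                # eid + name + dept + section + address = 128
records_per_block = BLOCK_SIZE // record_size      # 4 records fit in one block
blocks_for_table = math.ceil(100 / records_per_block)   # 25 blocks for 100 rows

index_entry_size = 16                              # assumed key + pointer size
entries_per_block = BLOCK_SIZE // index_entry_size      # 32 index entries per block
blocks_for_index = math.ceil(100 / entries_per_block)   # 4 blocks for 100 entries

worst_case_full_scan = blocks_for_table            # 25 block reads without an index
worst_case_with_index = blocks_for_index + 1       # 4 index blocks + 1 data block = 5
print(records_per_block, blocks_for_table, blocks_for_index, worst_case_with_index)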

But simple indexing is not the most efficient approach. Sparse indexing builds upon simple dense indexing by adding higher-level layers of indexing on top of the lower-level index table previously mentioned. These higher-level index entries point to index blocks in the lower-level table, so a single entry could point to a block of 32 entries. This is the notion of multilevel indexing.

Once we understand multilevel indexing, we can understand M-Way Trees. The m stands for the degree and represents the maximum number of children a node can have; the maximum number of keys a node can hold is m − 1. To use M-Way Trees for searching, the keys must be kept in sorted order, from smallest to largest. A binary search tree is an example of an M-Way Tree with a degree of 2. M-Way search trees have child pointers associated with their keys; when used for multilevel indexing, they also have record pointers. This is the node structure of a B-Tree.

B-Trees are M-Way trees with rules. These rules include:

  1. Each node must be at least half full, i.e. have at least ⌈m/2⌉ children, before a new node is created.
  2. The root must have at least 2 children (unless it is a leaf).
  3. All leaf nodes must be at the same level.
  4. The creation process is bottom-up: nodes split and keys are pushed upward as nodes fill.

The difference between a B-Tree and a B+ Tree is that B+ Trees do not have record pointers in every node — just the leaf nodes. Therefore, each key in an internal node is also copied into one of the leaves. The leaves are linked together like a linked list, so the leaf level forms a dense index, while the upper node levels are sparse indexes. For an illustration of how B+ Trees work, use the B+ Tree Simulator!
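To keep the node structures straight, here is a rough Python sketch of what each kind of node stores (layout only, with no insertion or splitting logic; the field names are my own):

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class BTreeNode:
    # In a B-Tree, every key carries a record pointer alongside the child pointers.
    keys: List[int] = field(default_factory=list)              # at most m - 1 keys
    record_ptrs: List[int] = field(default_factory=list)       # block/offset of each record
    children: List["BTreeNode"] = field(default_factory=list)  # at most m children

@dataclass
class BPlusLeaf:
    # In a B+ Tree, only the leaves point at records; leaves are chained like a linked list.
    keys: List[int] = field(default_factory=list)
    record_ptrs: List[int] = field(default_factory=list)
    next_leaf: Optional["BPlusLeaf"] = None                    # the leaf level is a dense index

@dataclass
class BPlusInternal:
    # Internal B+ Tree nodes form a sparse index: keys only guide the search downward.
    keys: List[int] = field(default_factory=list)
    children: List[object] = field(default_factory=list)       # BPlusInternal or BPlusLeaf nodes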

Week 4 CST 363

This has been a busy week! We had a midterm and the final iteration of our DW/BI project due. I have learned quite a bit about how data warehouses create simple, understandable systems for end users to query a database. I have gained insight into how to continue with my Staff Directory System at my school site, and into how to structure different relational databases so their data can be extracted into a central data warehouse that supports different applications for different user purposes.

There are so many needs that my school has — teacher location, monitoring student success, preventing teacher burnout, ensuring equity — and I am beginning to see how an OLAP system can be applied to the decision making processes that make a school successful.

Week 3 CST 363: OLTP vs. OLAP

During our third week, we have been finishing up Part 1 of our first project, which has us designing a database with data capture, or OLTP (online transaction processing), in mind. This initial design relies primarily on what we learned last week about normalizing databases to reduce data redundancy. By contrast, this week's readings have focused on OLAP (online analytical processing), which serves a different purpose than OLTP. My understanding of these differences is outlined below:

  • Purpose: OLTP is for operational record keeping; OLAP is for analytical decision making and the evaluation of performance (answering how/what/why questions).
  • Workload: OLTP is optimized to process transactions quickly, usually one at a time; OLAP must handle up to hundreds of thousands of transactions at a time.
  • History: OLTP is updated to maintain the current state, so no historical records are kept; OLAP depends on historical records to provide context for evaluating performance over time.

OLTP and OLAP differ fundamentally because the goals of the end user are different. While OLTP systems typically use relational databases that are normalized (usually to third normal form) to reduce data redundancy and process single transactions quickly, OLAP makes use of dimensional modeling to deliver data that is understandable and appealing to the end user.

Kimball and Ross (2013) use a restaurant metaphor to describe one form of OLAP architecture — Kimball's Data Warehouse/Business Intelligence (DW/BI) architecture.

Figure 2 in Ross (2004)

In the restaurant metaphor, the kitchen is the staging area of the DW/BI system. Just as kitchen staff must ensure that the ingredients they procure are high quality and prepared in a manner that suits the diner's palate, we who design the data staging area are concerned with cleaning the data (e.g. fixing misspellings or formatting inconsistencies) that we extract from multiple sources. This process is handled by the ETL (Extract, Transform, and Load) system.

Just as we would not expect diners to wander into the back kitchen, we would not expect users to have access to the staging area / ETL. Diners belong in the dining room, where they are presented with prepared food. A DW/BI user’s domain is in the presentation area, where they should be presented with data in a meaningful and easy-to-understand format.


References

Kimball, R. and Ross, M. (2013). The Data Warehouse Toolkit: The Definitive Guide to Dimensional Modeling (3rd Ed.). Indianapolis, IN: John Wiley and Sons.

Ross, M. (2004). Differences of Opinion. Kimball Group. https://www.kimballgroup.com/2004/03/differences-of-opinion/

Week 2 CST 363

This week we learned about grouping results (summaries) and subqueries before moving on to database design.

Summaries

When creating summary queries, you use one or more aggregate functions: AVG, SUM, MIN, MAX, and COUNT (or COUNT(*)). A SELECT clause that contains an aggregate function CANNOT include a non-aggregate column UNLESS that column is used to group the results in a GROUP BY clause. As an example, you can display a SUM of invoice totals (aggregate column) GROUPED BY vendor name (non-aggregate column).

Syntax order is very important when crafting summary queries. The GROUP BY clause must come before the search condition in the HAVING clause and the sort list in the ORDER BY clause for the query to produce the intended results.
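For example, a summary query along these lines might look like the following (a sketch in the style of the Murach examples; the vendors/invoices tables and column names are assumed and may differ from your schema):

# SUM of invoice totals (aggregate) grouped by vendor name (non-aggregate),
# with the clauses in the required order: GROUP BY, then HAVING, then ORDER BY.
summary_query = """
    SELECT vendor_name, SUM(invoice_total) AS invoice_sum
    FROM vendors
        JOIN invoices ON vendors.vendor_id = invoices.vendor_id
    GROUP BY vendor_name                 -- group the non-aggregate column first
    HAVING SUM(invoice_total) > 500      -- then filter the groups
    ORDER BY invoice_sum DESC            -- then sort the summary rows
"""
# The string would then be run through a cursor, e.g. cursor.execute(summary_query).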

Subqueries

Most JOINs can be restated as a subquery and vice versa.

Subqueries can be coded in an outer SELECT statement in a:

  • WHERE clause as a search condition
  • HAVING clause as a search condition
  • FROM clause as a table specification
  • SELECT clause as a column specification

Subqueries can return a:

  • a single value (generally used in a WHERE or HAVING clause)
  • a list of values, i.e. one column (generally used with IN in a WHERE or HAVING clause, or in a SELECT clause)
  • a table of values, i.e. multiple columns (generally used in a FROM clause; each case is sketched below)
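Here are quick sketches of each case, written as Python query strings and again using assumed Murach-style tables and columns:

# Single value, used as a search condition in a WHERE clause:
above_average = """
    SELECT invoice_number, invoice_total
    FROM invoices
    WHERE invoice_total > (SELECT AVG(invoice_total) FROM invoices)
"""

# List of values (one column), used with IN:
vendors_without_invoices = """
    SELECT vendor_name
    FROM vendors
    WHERE vendor_id NOT IN (SELECT DISTINCT vendor_id FROM invoices)
"""

# Table of values (multiple columns), used as a table specification in FROM:
totals_by_vendor = """
    SELECT vendors.vendor_name, t.invoice_sum
    FROM (SELECT vendor_id, SUM(invoice_total) AS invoice_sum
          FROM invoices
          GROUP BY vendor_id) AS t
        JOIN vendors ON vendors.vendor_id = t.vendor_id
    ORDER BY t.invoice_sum DESC
"""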

JOINs versus Subqueries

JOINs:
  • The SELECT clause of a join can include columns from both tables.
  • More intuitive when an existing relationship (primary and foreign keys) links the tables.

Subqueries:
  • Can pass an aggregate value to the main query.
  • More intuitive when forming an ad hoc relationship between tables.
  • Long or complex queries are sometimes easier to code and read.

Both:
  • Both are used for queries that interact with two or more tables.
  • They can usually be translated back and forth.

Database Design

An important aspect of database design is normalization: separating data into separate but related tables to reduce data redundancy. However, in the real world, databases are not completely normalized. Of the seven normal forms outlined in the Murach text, only the first, second, and third normal forms are generally used in practice.
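As a tiny illustration (made-up data, not our actual design), normalization replaces a flat table that repeats teacher details on every schedule row with separate but related tables:

# Unnormalized: the teacher's details repeat on every schedule row.
flat_schedule = [
    {"teacher": "Lee", "email": "lee@example.org", "course": "CS A", "room": "101", "period": 1},
    {"teacher": "Lee", "email": "lee@example.org", "course": "CS B", "room": "101", "period": 2},
]

# Normalized: teacher details are stored once and referenced by key.
teachers = {1: {"name": "Lee", "email": "lee@example.org"}}
schedule = [
    {"teacher_id": 1, "course": "CS A", "room": "101", "period": 1},
    {"teacher_id": 1, "course": "CS B", "room": "101", "period": 2},
]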

Below is the first draft of my team’s database design for a system that will be able to give users the ability to search for a teacher’s schedule (the classes they teach, where they teach them, and when they teach them).

EER Model for the Staff Directory System

CST 363 Week 1

Today marks the first official day of CST 363, a Databases course taught with MySQL and Python 3. Because I am returning to work next week and because the professor posted our weekly work in advance, I made sure to get a head start to manage my time effectively. Therefore, I have already completed all of Week 1’s readings and the assignments.

I can remember the term CGI from as far back as when I started using the Internet in the mid-to-late 90s on AOL. I recall seeing member pages with a directory named cgi-bin. Now I actually know what CGI (Common Gateway Interface) is and what the cgi-bin directory is for. The Common Gateway Interface allows web servers to execute applications and generate web pages dynamically. The cgi-bin directory is where such programs must be stored (on a web server) in order to be executed correctly by the web server.

This week, I was able to create a web-based form simulating a user attempting either to log in or to register on a server. First, I ran a Python script (provided by the professor) that started an HTTP server on my local machine. An HTTP server allows you to access an HTML file in your browser via a GET request to that server:

127.0.0.1 - - [09/Jan/2019 18:02:09] "GET /login.html HTTP/1.1" 200 -

As previously mentioned, our assignment was to create a web-based form to allow a user to login or register. A user’s login information (userid and password) had to be checked against a MySQL database in order to determine if the credentials were valid for either registration (userid can’t already exist) or logging in (password must match for an existing userid).

The HTML <form> tag requires two attributes: action and method. The action is the name of the file or program to which the HTTP server should send the form data. The method specifies whether the form data is sent with a GET or POST request. In this case, our web form sent a POST request encoding key-value pairs (userid: ‘your_user_id’ and password: ‘your_password’) to a Python 3 program that used this data to authenticate the credentials against our MySQL database (or insert them, in the case of a valid registration).

127.0.0.1 - - [09/Jan/2019 18:53:04] "POST /cgi-bin/login.py HTTP/1.1" 200 -
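For the curious, here is a stripped-down sketch of what the receiving cgi-bin script roughly looks like (not my actual assignment code; the table name, credentials, and responses are placeholders, and a real system would hash passwords rather than compare them in plain text):

#!/usr/bin/env python3
import cgi
import mysql.connector

form = cgi.FieldStorage()                  # parses the POSTed key-value pairs
userid = form.getfirst("userid", "")
password = form.getfirst("password", "")

# Placeholder connection details for illustration only.
conn = mysql.connector.connect(user="user", password="pass", database="users_db")
cursor = conn.cursor()
cursor.execute("SELECT password FROM users WHERE userid = %s", (userid,))
row = cursor.fetchone()

print("Content-Type: text/html")           # CGI response header, then a blank line
print()
if row is not None and row[0] == password:
    print("<p>Login successful.</p>")
else:
    print("<p>Invalid userid or password.</p>")

cursor.close()
conn.close()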