W4121: Non-graded Quiz, Wu’s Lectures

Lec2: Data Modeling

The point of data modeling is to decide which characteristics of the data your application will need.

Q1: Data Independence

What is data independence and why does it matter?
What are the two main reasons why the hierarchical data model is a bad idea?
Give an example that illustrates these problems.
When could it make sense to use a hierarchical data model?
What did the graph model try to solve? What limitations did it still have?

Q2: Queries

Suppose our data adheres to the following schema:

    animals(id, species, name, gender, keeper_id)
    keepers(id, name, birthday)
    feedings(keeper_id, animal_id, date, time, food, pounds)

Describe in English what each of the following operations will do:

filter(feedings, time < 12pm)
filter(feedings, time < 12pm and pounds > 10)
project(animals, [name])
project(animals, [species])
join(feedings, keepers, keeper_id = id)
join(animals, feedings, keeper_id = id)
groupby(feedings, [food], SUM(pounds), COUNT(pounds))

What are the operator symbols (Greek symbols) for the above operations?
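
To experiment with the operations above, here is a minimal Python sketch that runs them over lists of dictionaries. Everything in it is an assumption made for illustration: the sample rows are made up, `time` is stored as an integer hour (0 to 23), and the helper names `filter_rows`, `project`, `join`, and `groupby_sum_count` are hypothetical, not part of any library or of the lecture material. Note that this `project` keeps duplicate rows; whether duplicates are kept or removed is a design choice worth thinking about when answering the questions.

```python
from collections import defaultdict

# Made-up sample rows that follow the feedings and keepers schemas.
# Assumption: time is an integer hour of the day (0-23).
feedings = [
    {"keeper_id": 1, "animal_id": 10, "date": "2024-01-01", "time": 9, "food": "hay", "pounds": 12},
    {"keeper_id": 2, "animal_id": 11, "date": "2024-01-01", "time": 14, "food": "fruit", "pounds": 4},
    {"keeper_id": 1, "animal_id": 11, "date": "2024-01-02", "time": 10, "food": "fruit", "pounds": 6},
]
keepers = [
    {"id": 1, "name": "Ann", "birthday": "1990-05-01"},
    {"id": 2, "name": "Bob", "birthday": "1985-11-23"},
]


def filter_rows(table, predicate):
    """Keep only the rows for which predicate(row) is true."""
    return [row for row in table if predicate(row)]


def project(table, columns):
    """Keep only the named columns of every row (duplicates are kept)."""
    return [{c: row[c] for c in columns} for row in table]


def join(left, right, left_key, right_key):
    """Pair every left row with every right row whose keys match."""
    out = []
    for l_row in left:
        for r_row in right:
            if l_row[left_key] == r_row[right_key]:
                merged = dict(l_row)
                # Prefix right-hand columns with "r_" to avoid name clashes.
                merged.update({f"r_{k}": v for k, v in r_row.items()})
                out.append(merged)
    return out


def groupby_sum_count(table, key, value):
    """Group rows by `key`, then compute SUM and COUNT of `value` per group."""
    groups = defaultdict(lambda: {"sum": 0, "count": 0})
    for row in table:
        group = groups[row[key]]
        group["sum"] += row[value]
        group["count"] += 1
    return dict(groups)


morning = filter_rows(feedings, lambda r: r["time"] < 12)
foods = project(feedings, ["food"])
fed_by = join(feedings, keepers, "keeper_id", "id")
totals = groupby_sum_count(feedings, "food", "pounds")
```

Swapping in your own rows and predicates is a quick way to check your written answers against what the operations actually return.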

Describe in English what the following operations do:

Π_{food}(σ_{time=12pm}(feedings))
Π_{name}(σ_{time=12pm}(feedings) ⋈_{id=animal_id} σ_{species=gorilla}(animals))

Q3: Logging

We have seen that having a schema is helpful because it makes data easy to query. Before the solution described in the paper, Twitter’s event messages DID adhere to a schema of (category, message).

Even though the events had a schema, why did data scientists still have trouble analyzing logging data?
- What was the source of this trouble?
- Why do you think Twitter let this happen? Couldn’t they have prevented it from the very beginning?
What was the main solution to address the above troubles?

Lec 3/4: Query Execution