Line 33: | Line 33: | ||

Suppose we have the following database | Suppose we have the following database | ||

− | + | <math>R \\ | |

− | R \\ | + | |

\begin{array}{| c c |} | \begin{array}{| c c |} | ||

\hline | \hline | ||

Line 44: | Line 43: | ||

5 & 5 \\ | 5 & 5 \\ | ||

\hline | \hline | ||

− | \end{array} | + | \end{array}</math><math>S \\ |

− | + | ||

− | + | ||

− | S \\ | + | |

\begin{array}{| c |} | \begin{array}{| c |} | ||

\hline | \hline | ||

Line 53: | Line 49: | ||

7 \\ | 7 \\ | ||

\hline | \hline | ||

− | \end{array} | + | \end{array}</math> |

− | + | ||

the query $Q(x, y) \leftarrow R(x, y), R(y, 5), S(y)$ wants to retrieve all pairs $(x, y)$ s.t. | the query $Q(x, y) \leftarrow R(x, y), R(y, 5), S(y)$ wants to retrieve all pairs $(x, y)$ s.t. | ||

Line 110: | Line 105: | ||

* CQs can be translated to [[First Order Logic]] expressions | * CQs can be translated to [[First Order Logic]] expressions | ||

* for query $q(x_1, ..., x_n) = A_1(...), \ ..., \ A_n(...)$ | * for query $q(x_1, ..., x_n) = A_1(...), \ ..., \ A_n(...)$ | ||

− | * FOL expressions is $\{ x_1, ..., x_n \ | + | * FOL expressions is $\{ \, x_1, ..., x_n \ \mid \ \exists \ y_1, ..., y_m : A_1(...) \ \land \ ... \ \land \ A_n(...) \, \}$ |

** $x_1, ..., x_n$ - distinguished variables, and $y_1, ..., y_m$ are existential | ** $x_1, ..., x_n$ - distinguished variables, and $y_1, ..., y_m$ are existential | ||

Line 185: | Line 180: | ||

** this is a substitution of $Q_2$ into $D$! (we fist applied the homomorphism and then the matching - and got the substitution) | ** this is a substitution of $Q_2$ into $D$! (we fist applied the homomorphism and then the matching - and got the substitution) | ||

** since $h$ is a homomorphism, $h(\text{body}_2) \subseteq \text{body}_2$ (by def of homomorphism) | ** since $h$ is a homomorphism, $h(\text{body}_2) \subseteq \text{body}_2$ (by def of homomorphism) | ||

− | ** $\to$ $f(h(\text{body}_2)) \subseteq f(\text{body}_1) \subseteq D$ | + | ** $\to$ $f\big(h(\text{body}_2)\big) \subseteq f(\text{body}_1) \subseteq D$ |

** in other words, $f \circ h$ is a matching of $Q_2$ into $D$ | ** in other words, $f \circ h$ is a matching of $Q_2$ into $D$ | ||

* hence | * hence | ||

− | ** $f(h(\text{body}_2)) \in Q_2(D)$ | + | ** $f\big(h(\text{body}_2) \big) \in Q_2(D)$ |

− | ** and finally $t = f(\text{head}_1) = f(h(\text{head}_2)) \in Q_2(D)$ | + | ** and finally $t = f(\text{head}_1) = f\big(h(\text{head}_2) \big) \in Q_2(D)$ |

Proof of $\Rightarrow$ (only if) | Proof of $\Rightarrow$ (only if) | ||

Line 234: | Line 229: | ||

* [[Database Systems Architecture (ULB)]] | * [[Database Systems Architecture (ULB)]] | ||

* Database Systems Architecture lecture notes #2 by S. Vansummeren [https://dl.dropboxusercontent.com/sh/r0zvy3zaycbevx8/U0XnqCSwGZ/lect2-notes-conjunctive.pdf] | * Database Systems Architecture lecture notes #2 by S. Vansummeren [https://dl.dropboxusercontent.com/sh/r0zvy3zaycbevx8/U0XnqCSwGZ/lect2-notes-conjunctive.pdf] | ||

− | * Web Data Management book | + | * [[Web Data Management (book)]] |

− | + | ||

[[Category:Relational Databases]] | [[Category:Relational Databases]] |

A *Conjunctive Query* (CQ) is

- a First Order Logic expression without negations and disjunctions

A CQ is an expression of the following form:

$\underbrace{Q(x_1, ..., x_n)}_{\text{head}} \leftarrow \underbrace{R(a_1, ..., a_m), ..., S(c_1, ..., c_k)}_{\text{body}}$

- it consists of two parts: head and body
- $V = (a_1, ..., a_m, ..., c_1, ..., c_k)$ - variables and constants from the body
- $X = (x_1, ..., x_n)$ is a set of variables from the head
- $X \subseteq V$
- the variables from $X$ are called
*distinguished variables*(also*free*or*placeholder*variables)

- $Y = V - X$ - variables from the body which aren't in the head
- $Y \subseteq V$
- variables from $Y$ are called
*existential variables*(also*bound*variables)

- note that the head doesn't necessarily have to contain only variables, it might as well contain constants
- these constants will be returned in the results

$R(a_1, ..., a_m)$ is called an *atom*

- if an atom does not contain any variables, only constants, it's a
*fact* - so we can view a database as a set of facts

Alternative vector form:

- $Q( \vec{x} ) \leftarrow A_1(\vec{u}_1), \ ..., \ A_k(\vec{u}_k)$

Suppose we have the following database

[math]R \\ \begin{array}{| c c |} \hline 1 & 2 \\ 2 & 3 \\ 2 & 5 \\ 6 & 7 \\ 7 & 5 \\ 5 & 5 \\ \hline \end{array}[/math][math]S \\ \begin{array}{| c |} \hline 2 \\ 7 \\ \hline \end{array}[/math]

the query $Q(x, y) \leftarrow R(x, y), R(y, 5), S(y)$ wants to retrieve all pairs $(x, y)$ s.t.

- $(x, y)$ occurs in $R$
- $y$ then occurs in $R$ together with 5
- and that $y$ occurs as a value in $S$

**def**: a *substitution* $f$ of $Q$ into database $D$ is

- a function that maps all variables from $Q$ to constants from $D$

**def**: a substitution $f$ of $Q$ into a database $D$ is a *matching* if

- $f(\text{body}) \subseteq D$, i.e. if we evaluate $f$ on the body of the query $Q$, we will get a subset of $D$
- so when we apply $f(\text{head})$ we get a tuple that is a part of the result of evaluating $Q$ on $D$

Then the result of a conjunctive query can be shown as:

- $Q(D) \leftarrow \{ f(\text{head}) | f \text{ is a matching of $Q$ to $D$ } \}$

Let's consider the previous example: $Q(x, y) \leftarrow \underbrace{R(x, y)}_{(1)}, \underbrace{R(y, 5)}_{(2)}, \underbrace{S(y)}_{(3)}$ for the same database

Example with matching

- $(1)$ gets a tuple form $R$
- say it's (1, 2)
- so it assigns $x \mapsto 1, y \mapsto 2$

- so our substitution so far is $f: x \mapsto 1, y \mapsto 2$

- $(2)$ looks for a tuple in $R$ with $y = 2$
- because we established the matching $f$ with $y \mapsto 2$
- second value of this tuple must be a constant 5
- we find such tuple: it's (2, 5)
- nothing gets assigned at this step

- $(3)$ looks for a value in $S$ with $y = 2$
- same reason: we have $y \mapsto 2$
- it finds a fact $S(2)$ in the database

- so it returns a matching pair (1, 2)
- we say there is a matching (1, 2) in the database $D$

Example without matching

- $(1)$ we try another tuple from $R$, say (7, 5)
- we assign $x \mapsto 7, y \mapsto 5$

- $(2)$ now we try to find a tuple ($y$, 5) = (5, 5) in $R$
- no such tuple

- therefor (7, 5) is not a matching and will not be returned by $Q$

So this way we try all values of our database table and return only matching ones

- for this particular example the query returns two tuples: (1, 2) and (6, 7)
- i.e. $Q(D) = \{ (1, 2), (6, 7) \}$

- CQs can be translated to First Order Logic expressions
- for query $q(x_1, ..., x_n) = A_1(...), \ ..., \ A_n(...)$
- FOL expressions is $\{ \, x_1, ..., x_n \ \mid \ \exists \ y_1, ..., y_m : A_1(...) \ \land \ ... \ \land \ A_n(...) \, \}$
- $x_1, ..., x_n$ - distinguished variables, and $y_1, ..., y_m$ are existential

- CQs can be translated to a subset of RA expressions - Select-Project-Join Expressions
- see Conjunctive Query/Translation

CQs have an interesting property

- The query containment problem is undecidable for SQL and Relational Algebra (see Logical Query Plan Optimization), but it is decidable for Conjunctive Queries
- The decidability of containment is NP-complete problem, but usually CQs are not big, so it is acceptable

This makes CQs very suitable for Logical Query Plan Optimization, namely, for removing redundant joins

$Q_1$ *is contained in* $Q_2$ (denoted as $Q_1 \subseteq Q_2$) if

- for any database $D$ holds $Q_1(D) \subseteq Q_2(D)$
- i.e. the result of $Q_1$ is always a subset of the result of $Q_2$, no matter what database $D$ they are evaluated on

$Q_1$ *is equivalent to* $Q_2$ (denoted as $Q_1 \equiv Q_2$)

- iff $Q_1 \subseteq Q_2 \land Q_2 \subseteq Q_1$

Example

- $A(x, y) \leftarrow R(x, w), G(w, z), R(z, y)$
- $B(x, w) \leftarrow R(x, w), G(w, w), R(w, y)$
- $B \subseteq A$

To show that we use the definition

- let $D$ be an arbitrary DB and
- let $t \in B(D)$ (one arbitrary result of $B$ evaluated on $D$)
- there exists a matching $f$ of $B$ into $D$ s.t. $t = (f(x), f(y))$ (by definition of matching)
- so we need to show that $t = (f(x), f(y)) \in A(D)$

- let $h$ be a substitution:
- $h: x \mapsto f(x), y \mapsto f(y), w \mapsto f(w), z \mapsto f(w)$

- then $h$ is a matching of $A$ into $D$
- $t = (f(x), f(y)) = (h(x), h(y)) \in A$

It is also possible to show that containment is decidable.

**def**: a *homomorphism* of $Q_2$ to $Q_1$ is

- a function $h$ that maps each variable in $Q_2$ to either
- a variable from $Q_1$ or
- a constant form $Q_1$

- s.t.
- $h(\text{head}_2) = \text{head}_1$
- $h(\text{body}_2) \subseteq \text{body}_1$
- i.e. the values returned by queries are always the same, and body of one query is a subset of the other's query body

Examples

**Thm**: $Q_1 \subset Q_2 \iff $ there's a homomorphism from $Q_2$ to $Q_1$

Proof of $\Leftarrow$ (if)

- let $h: Q_2 \to Q_1$ be a homomorphism
- let $D$ be a database
- if we fix an arbitrary tuple $t \in Q_1(D)$, we need to prove that $t \in Q_2(D)$
- since $t \in Q_1(D)$ we know that
- $t = f(\text{head}_1)$
- with a matching $f$ of $Q_1$ into $D$
- (by semantics of CQs)

- let's consider a composition $f \circ h$ (composition of $f$ with the homomorphism $h$)
- this is a substitution of $Q_2$ into $D$! (we fist applied the homomorphism and then the matching - and got the substitution)
- since $h$ is a homomorphism, $h(\text{body}_2) \subseteq \text{body}_2$ (by def of homomorphism)
- $\to$ $f\big(h(\text{body}_2)\big) \subseteq f(\text{body}_1) \subseteq D$
- in other words, $f \circ h$ is a matching of $Q_2$ into $D$

- hence
- $f\big(h(\text{body}_2) \big) \in Q_2(D)$
- and finally $t = f(\text{head}_1) = f\big(h(\text{head}_2) \big) \in Q_2(D)$

Proof of $\Rightarrow$ (only if)

- suppose that $Q_1 \subseteq Q_2$
- let's consider variables in $Q_1$ as constants
- we can view $\text{body}_1$ as a mini-database $D_{Q_1}$
- (this database is called a
*canonical database*of $Q_1$)

- the identity function is a matching of $Q_1$ to $D_{Q_1}$
- hence $\text{head}_1 \in Q_1(D_{Q_1})$

- since $Q_1 \subseteq Q_2$ we know that $\text{head}_1 \in Q_2(D_{Q_1})$
- by construction of our database $D_{Q_1}$

- so there exists a matching $f$ from $Q_2$ to $D_{Q_1}$ s.t.
- $\text{head}_1 = f(\text{head}_2)$ and
- $f(\text{body}_2) \subseteq D_{Q_1} = \text{body}_1$

- $f$ is a homomorphism of $Q_2$ to $Q_1$ by the definition (by considering variables again as variables)

$\blacksquare$

From the second part of the proof we may get a way of checking for containment: the Golden Method

To decide whether $Q_1 \subseteq Q_2$

- evaluate $Q_2$ on a canonical database $D_{Q_1}$ (which is a body of $Q_1$, see #Containment Theorem)
- check if the head of $Q_1$ is in the results

QueryContainment($Q_1$, $Q_2$)

- input
- $Q_1(\vec{x}) \leftarrow g_1(\vec{x}_1), ..., g_n(\vec{x}_n)$
- $Q_2(\vec{y}) \leftarrow h_1(\vec{y}_1), ..., h_m(\vec{y}_m)$

- freeze $Q_1$: construct a canonical database $D_{Q_1} \equiv \{ g_i \big( v( \vec{x}_i ) \big) \}$
- with $v$ being a matching

- if $v(\vec{x}) \in Q_2(D_{Q_1})$ return
**yes**, otherwise**no**

- Database Systems Architecture (ULB)
- Database Systems Architecture lecture notes #2 by S. Vansummeren [1]
- Web Data Management (book)