An Introduction to Relational Algebra Using PLRelational
We recently announced PLRelational, our framework for storing and working with data that is based on relational algebra. This raises the question: what exactly is relational algebra? Most of the material out there is either highly academic, or focused on SQL databases. This article will give an overview of the basics of relational algebra in a way that programmers can easily understand, aimed at explaining the foundations that PLRelational is built on. Terminology will match what PLRelational uses.
Relational algebra looks a lot like set theory, with some extra constraints and specialized operations. If "set theory" scares you, not to worry: for our purposes, just think of Swift's Set
type. We'll be dealing with the same things here: an unordered collection of unique items which you can iterate over, ask about membership, and combine in various ways. PLRelational even uses Set
to represent certain relations.
Terminology
Let's build up terminology from the simplest elements. The most basic component is a value, which PLRelational represents using the RelationValue
type. Values represent your data, such as a username, a timestamp, or a boolean flag. Conceptually a value can be anything that can be checked for equality, but practically we need some limits on what they can be. PLRelational mimics SQLite's value types, and allows values to be 64-bit integers, double-precision floating-point numbers, strings, blobs (raw data expressed as a byte array), and null.
Another basic component is an attribute, which PLRelational represents with the Attribute
type. This acts like a dictionary key and is essentially a string. Each value is stored under an attribute.
Values and attributes are combined into rows. Rows are essentially dictionaries, with attributes as the keys and values as the values. In fact, PLRelational's Row
type originally stored its contents as [Attribute: RelationValue]
. The current implementation is more sophisticated, but the functionality is the same.
A relation, represented with the Relation
type, is conceptually a set of rows. All rows within a given relation have the same set of attributes, which is called a scheme.
To summarize:
- Relation: a set of unique rows, all of which have the same scheme.
- Row: a bunch of values, each associated with a unique attribute.
- Scheme: the set of attributes in a row.
- Attribute: a string describing the meaning or purpose of a value.
- Value: a primitive chunk of data, holding a string, integer, or similar.
Example
Let's quickly look at a concrete example. We'll track a list of employees, where each employee has a name and an ID. We can set this up in PLRelational with a little bit of code:
let employees = MemoryTableRelation(scheme: ["name", "id"])
employees.asyncAdd(["name": "Jane Doe", "id": 1])
employees.asyncAdd(["name": "John Smith", "id": 2])
This example uses MemoryTableRelation
, which as the name indicates stores the data directly in memory. This same code could easily use a different backing store, such as an SQLite database or a property list file, just by changing MemoryTableRelation
to the appropriate type. We can also use the MakeRelation
function as a convenient shorthand for creating a MemoryTableRelation
without having to type attribute names over and over:
let employees = MakeRelation(
["name", "id"],
["Jane Doe", 1],
["John Smith", 2])
When pretty-printed, it looks like this:
id name
1 Jane Doe
2 John Smith
The employees
variable holds a relation. Its scheme consists of the attributes "name"
and "id"
. It holds two rows, both of which have those attributes. The row values hold the employees' names and IDs.
Basic Set Operations
Relations are sets of rows. What sorts of operations can you do on them? To start with, you can do the same things you can do to sets.
To start with something really simple, you can union two relations. The result contains everything that was in either original relation. In PLRelational, you can create a new relation that represents this operation by using the union
method:
let allEmployees = oldEmployees.union(newEmployees)
This can also be done using the +
operator:
let allEmployees = oldEmployees + newEmployees
If oldEmployees
contains this:
id name
2 John Smith
1 Jane Doe
And newEmployees
contains this:
id name
3 Tim S
4 Susan Johnson
Then allEmployees
contains all entries from both:
id name
2 John Smith
1 Jane Doe
3 Tim S
4 Susan Johnson
When it comes to PLRelational, it's important to note that creating allEmployees
like this does not actually perform any work on the data! It just creates a new relation object which represents the union of the others. The actual work of gathering the data and unioning it together only happens when you ask allEmployees
(or some other relation derived from it) for its contents. This is true for all relation operators: you build up new relation objects representing the operations applied to the given operands, and work only happens when you request data.
The difference operation works in much the same way. It produces only the rows contained in the first operand, but not rows also contained in the second. Similar to union
, you can use the difference
method to make a new relation representing the operation:
let managementEmployees = allEmployees.difference(frontlineEmployees)
As with union, you can also use an operator:
let managementEmployees = allEmployees - frontlineEmployees
As an example, if these are the frontlineEmployees
:
id name
2 John Smith
4 Susan Johnson
And allEmployees
is as shown above, then managementEmployees
contains this:
id name
1 Jane Doe
3 Tim S
There's also an intersection operation, which produces only the rows contained in both operands. The intersection
method produces a relation representing an intersection:
let nightManagers = nightEmployees.intersection(managementEmployees)
And there's also an operator, although this one is difficult to type:
let nightManagers = nightEmployees ∩ managementEmployees
For an example here, if nightEmployees
contains this:
id name
1 Jane Doe
4 Susan Johnson
Then nightManagers
contains this:
id name
1 Jane Doe
Select
A select is a filter operation on a relation. It takes a relation and a predicate and produces a relation containing only the rows where the predicate is true.
In PLRelational, predicates are values which conform to the SelectExpression
protocol. A SelectExpression
is something that can take a row and produce a RelationValue
. If the result of a SelectExpression
is an integer zero, it's considered to be false. All other values are considered to be true.
PLRelational provides a bunch of built-in SelectExpression
types. Simple values like String
and Int64
conform, and they're implemented to ignore the passed-in row and produce their value as a RelationValue
. Attribute
also conforms, and it produces the row's value for that attribute.
It also provides a bunch of operators such as equality, comparison, and logical AND/OR. Because these operators build expressions rather than producing results immediately, they are prefixed with a *
to distinguish them from the standard operators like ==
or <
.
To filter a relation in PLRelational, call the select
method with a select expression:
let employeeFour = employees.select(Attribute("id") *== 4)
let earlyEmployees = employees.select(Attribute("id") *<= 10)
As before, this creates a relation which will perform the given operation on demand, but doesn't do any filtering work until then.
Project and Rename
Sometimes it's useful to manipulate the attributes in a relation without manipulating the underlying data.
Rename is a really simple operation: it takes a list of attribute pairs, and produces a new relation where the first attribute in each pair is renamed to the second one. For example, imagine that for some reason we need our employees to have "employee_name"
and "employee_id"
instead of just "name"
and "id"
. In PLRelational, you can call the renameAttributes
method and tell it to make those changes:
let renamedEmployees = employees.renameAttributes(["name": "employee_name",
"id": "employee_id"])
The result looks like this:
employee_id employee_name
1 Jane Doe
2 John Smith
A project lets you remove unneeded attributes. For example, if you just wanted a list of employee IDs but not their names, you can eliminate the names by projecting onto the "id"
attribute:
let employeeIDs = employees.project("id")
id
1
2
Note that relations are always sets, and each row is unique. If a projection creates multiple identical rows due to eliminating the attribute that makes them unique (in this example, that would be two employees who somehow have the same ID but different names) then those rows are coalesced.
Joins
A join combines two relations with different schemes, and produces a new relation whose scheme is the combination of the two. The contents of the relation come from matching up values within the rows on each side.
The fundamental operation is called an equijoin. An equijoin takes two relations and a list of attributes to match. It then produces rows by gluing together rows from the operands where the values of those attributes are the same on both sides.
Let's look at a quick example. Here's a relation containing equipment registered to our employees:
let equipment = MakeRelation(
["owner_id", "serial_number", "comment"],
[1, "88842", "Computer"],
[1, "123", "Mouse"],
[2, "X427A", "Car"],
[2, "FFG77", "Cordless drill"],
[2, "7", "Seven"])
We have each owner's ID, but not their name. We can include the name by using the equijoin
method and telling it to match id
in employees
to owner_id
in equipment
:
let employeeEquipment = employees
.equijoin(equipment, matching: ["id": "owner_id"])
Pretty-printing this, we get:
comment id name owner_id serial_number
Cordless drill 2 John Smith 2 FFG77
Seven 2 John Smith 2 7
Car 2 John Smith 2 X427A
Mouse 1 Jane Doe 1 123
Computer 1 Jane Doe 1 88842
Note that the values in id
and owner_id
will always be identical here. We can eliminate this redundancy with a project:
let employeeEquipment = employees
.equijoin(equipment, matching: ["id": "owner_id"])
.project(dropping: ["owner_id"])
comment id name serial_number
Seven 2 John Smith 7
Cordless drill 2 John Smith FFG77
Computer 1 Jane Doe 88842
Car 2 John Smith X427A
Mouse 1 Jane Doe 123
A join is a special case of an equijoin, where the matching attributes are those attributes that both sides have in common. For example, we could replicate the above result by renaming "owner_id"
to "id"
and then doing a join:
let employeeEquipment = employees
.join(equipment.renameAttributes(["owner_id": "id"]))
comment id name serial_number
Seven 2 John Smith 7
Cordless drill 2 John Smith FFG77
Computer 1 Jane Doe 88842
Car 2 John Smith X427A
Mouse 1 Jane Doe 123
Joins act a lot like a select, where the predicate involves matching values between two relations rather than a constant value in the predicate expression. Joins can be really useful for tracking a selection. For example, let's say you have a relation which contains the ID of the employee currently selected in your app's UI:
let selectedEmployeeID = MakeRelation(["id"], [1])
You can get all information about the selected employee by joining this relation with employees
:
let selectedEmployee = selectedEmployeeID.join(employees)
You can then project that relation down to just the name so that you can bind it to a UI control that will show the currently selected employee's name:
let selectedEmployeeName = selectedEmployee.project("name")
Pretty-printing this on our test data produces:
name
Jane Doe
Wrapping Up
A relation is a set of rows. A row is effectively a string-to-value dictionary, where the keys are attributes. All rows in a given relation have the same attributes. That set of attributes is called the relation's scheme.
Since relations are sets, they support basic set operations like union, intersection, and difference. They also support filtering in the form of the select operation.
Relation attributes can be manipulated by renaming them, and attributes can be removed by projecting the relation. This is useful to get different forms of a relation into different parts of your program.
Joins allow combining two relations with different schemes. They produce new rows by gluing together rows from the relations where those rows have matching values. The equijoin operation allows matching arbitrary pairs of attributes, while the join operation handles the common case of matching the attributes that exist in both relations.
PLRelational provides all of these operations, and more, as methods on the Relation
type. These methods don't perform the work immediately, but rather produce a new Relation
that represents the operation. The work is performed only when data is requested. To see it in action, check out PLRelational's tests, in particular the RelationTests.swift
file, which has extensive examples of the various operations.
Need help? Plausible Labs offers consulting services for software engineering. If you'd like some professional assistance, whether with PLRelational or for something entirely different, consider us. More information can be found on our consulting page.