
Introducing CTEs
Any DML statement (such as SELECT, INSERT, UPDATE, and DELETE) can be enhanced by placing a WITH predicate before the query. Any sub-query appearing in the WITH clause can be thought of as a materialized result set over the data, or as a temporary table that exists throughout the duration of a statement. CTEs have many applications—the primary one being the ability to split a complex query into a set of smaller queries. A CTE can be made up of the following two parts:
- A main query, also known as a top-level statement.
- One or more sub-queries, which materialize data for the duration of a main query. Each sub-query is also known as an auxiliary statement.
Every auxiliary statement has a name that is used to reference it from the main query or other sub-queries. It may also have a list of returned column names. The template of a CTE, therefore, is as follows:
WITH -- sub-queries sub-query_1 AS ( <query> ) , sub-query_2 AS ( <query> ) -- main query <query referencing sub-query_1 and sub-query_2>
The following listing shows a very simple CTE that selects all file names from a sub-query that, in turn, select specific hash values:
testdb=> WITH get_files_by_hash AS (
SELECT pk, f_name FROM files WHERE f_hash like 'abc%' )
SELECT f_name FROM get_files_by_hash;
There are a few rules to bear in mind relating to the use of CTEs. The first important rule is that PostgreSQL does not optimize any query outside of a WITH boundary. This means that each auxiliary statement is considered on its own by the optimizer, and that constraints are not pushed outside of sub-query boundaries.
Another important rule is that every auxiliary statement of a CTE is evaluated no more than once. This means that CTEs can be used to wrap complex computations that will be executed once, no matter how many times the result is requested in the main query. If the CTE auxiliary statement is read-only (if it is a SELECT statement, for example), it will be executed only once it is referenced. This means that it is possible to insert a CTE auxiliary statement that has never been executed. Similarly, if the CTE auxiliary statement is writable (if it contains any INSERT, UPDATE, or DELETE statements), it will only be executed once and without any regard to whether the auxiliary statement has been referenced in the main query.
Let's take a look at an example to make this clearer. As you can see in the following listing, the CTE defines two read-only sub-queries, but only get_random_number is referenced across the whole CTE. Therefore, the get_current_time auxiliary statement will never be executed. On the other hand, the delete_files auxiliary statement, as it is writable, will be executed exactly once, even if it is not referenced anywhere:
testdb=> WITH get_current_time AS ( SELECT now()::time ) ,
get_random_number AS ( SELECT random() ) ,
delete_files AS ( DELETE FROM files )
SELECT r, s FROM get_random_number r, generate_series( 1, 3 ) s;
If you execute the statement in the preceding snippet, the output should be as shown in the following listing. From there, we can clearly see that the random number generated by the get_random_number auxiliary statement is always the same, confirming that the auxiliary statement has been invoked exactly once:
r | s ---------------------+--- (0.0959695233032107) | 1 (0.0959695233032107) | 2 (0.0959695233032107) | 3
We can also get a better overview of the execution using the EXPLAIN command to get the plan, as shown in the following listing:
QUERY PLAN -------------------------------------------------------------------------------- Nested Loop (cost=18.82..38.84 rows=1000 width=28) CTE get_random_number -> Result (cost=0.00..0.01 rows=1 width=8) CTE delete_images -> Delete on files (cost=0.00..18.80 rows=880 width=6) -> Seq Scan on files (cost=0.00..18.80 rows=880 width=6) -> CTE Scan on get_random_number r (cost=0.00..0.02 rows=1 width=24) -> Function Scan on generate_series s (cost=0.00..10.00 rows=1000 width=4)