PostgreSQL Profiling Queries with SQL Explain Analyze

Using the query planner in PostgreSQL is the equivalent of using the debugger tools in a browser. By using EXPLAIN, with options for formatting and for actually executing the query with ANALYZE, we can see how our query performs.

Maybe we need to implement our query differently: perhaps by adding an index, using a sub-query instead of another table, or one of dozens of other options.

Query Performance and Optimization

EXPLAIN SELECT * FROM employees;
Seq Scan on employees  (cost=0.00..22.12 rows=1012 width=64)

This returns an execution plan generated by Postgres. For the command above, the plan shows how the table referenced by the statement will be scanned.

The most important thing to note is the cost.

How Cost Works

The cost output shows two values, written as startup cost..total cost.

Startup cost: the estimated cost before the first row can be returned. Total cost: the estimated cost to return all of the rows from our query. Note that these are not times; costs are expressed in arbitrary planner units, where the baseline unit is roughly the cost of one sequential page read.
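As a sketch of where the 22.12 above comes from: a sequential scan is costed as (pages read * seq_page_cost) + (rows scanned * cpu_tuple_cost). Assuming the employees table occupies 12 disk pages, the default settings give (12 * 1.0) + (1012 * 0.01) = 12 + 10.12 = 22.12. The settings themselves are visible with SHOW:

SHOW seq_page_cost;   -- 1 by default: the cost unit is one sequential page read
SHOW cpu_tuple_cost;  -- 0.01 by default: the CPU cost of processing one row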

We can get as much info as possible by adding VERBOSE to our EXPLAIN. Some of the additional info we get here is the schema the table lives in and the table's columns.

explain verbose select * from employees;
Seq Scan on public.employees  (cost=0.00..22.12 rows=1012 width=64)
"  Output: employee_id, first_name, last_name, email, hire_date, department, gender, salary, region_id"

With this select (explain select * from employees;), we are performing a sequential scan. This means the statement will loop over every single row, one after another, until the end of the table.

There’s another scan, which is typically faster, called the index scan. Because our employees table has a primary key, an index is automatically created on employee_id, and the planner will use an index scan when we look up a row by that column.

EXPLAIN SELECT * FROM employees WHERE employee_id = 252;
Index Scan using employees_pkey on employees  (cost=0.28..8.29 rows=1 width=64)
  Index Cond: (employee_id = 252)

As noted, an index scan is typically faster, as we can see here in the total cost to return the row. However, it has a startup cost (0.28 here) and can actually be slower than a sequential scan in some cases. As we run our queries through the planner, we'll see that each aggregation, each where clause, and every join we end up using adds to the cost.
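If you want to see how the planner costs the alternative path for a given query, you can temporarily discourage one scan type for your session. A sketch using the real enable_seqscan setting (the salary filter is just an illustrative predicate):

SET enable_seqscan = off;  -- discourage sequential scans for this session
EXPLAIN SELECT * FROM employees WHERE salary > 50000;
SET enable_seqscan = on;   -- restore the default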

EXPLAIN SELECT first_name, email, department, salary,
       nth_value(salary,5) OVER(PARTITION BY department ORDER BY salary DESC) AS NTH_VALUE_OF_SAL
FROM employees;
WindowAgg  (cost=72.63..92.87 rows=1012 width=46)
  ->  Sort  (cost=72.63..75.16 rows=1012 width=42)
"        Sort Key: department, salary DESC"
        ->  Seq Scan on employees  (cost=0.00..22.12 rows=1012 width=42)

With just EXPLAIN, the planner breaks the query down and estimates as much as it can; the row counts and costs are guesses based on table statistics, not measurements.
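Those guesses come from table statistics. If the estimates look stale (for example, after a large data load), refreshing them with the standalone ANALYZE command, not to be confused with EXPLAIN ANALYSE, can make them more accurate:

ANALYZE employees;  -- recompute planner statistics for this table
EXPLAIN SELECT * FROM employees;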

Now if we actually run the query with EXPLAIN ANALYSE:

EXPLAIN ANALYSE SELECT first_name, email, department, salary,
       nth_value(salary,5) OVER(PARTITION BY department ORDER BY salary DESC) AS NTH_VALUE_OF_SAL
FROM employees;
WindowAgg  (cost=72.63..92.87 rows=1012 width=46) (actual time=1.773..2.719 rows=1000 loops=1)
  ->  Sort  (cost=72.63..75.16 rows=1012 width=42) (actual time=1.760..1.835 rows=1000 loops=1)
"        Sort Key: department, salary DESC"
        Sort Method: quicksort  Memory: 113kB
        ->  Seq Scan on employees  (cost=0.00..22.12 rows=1012 width=42) (actual time=0.013..0.252 rows=1000 loops=1)
Planning Time: 0.074 ms
Execution Time: 2.795 ms

We now get the actual values alongside the estimates, and we are told which algorithms were used (here, quicksort) and the resources they consumed (113kB of memory).
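To dig further into resource usage, EXPLAIN also accepts a BUFFERS option alongside ANALYSE, which reports how many shared buffer pages were hit in cache or read from disk. A quick sketch:

EXPLAIN (ANALYSE, BUFFERS) SELECT first_name, salary FROM employees;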

One of the biggest debates when profiling SQL queries is when to use which option. Should you use a GROUP BY? Should you use a DISTINCT? Would a sub-query work better here, or possibly a temporary table?
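The honest answer is to profile both. For example, to settle GROUP BY versus DISTINCT for a simple case, just run each form through the planner, a minimal sketch against our employees table:

EXPLAIN ANALYSE SELECT DISTINCT department FROM employees;
EXPLAIN ANALYSE SELECT department FROM employees GROUP BY department;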

While some principles, like adding indexes, usually hold, there is no silver bullet. Every table differs in row count, constraints, foreign keys, and so on, and each option needs to be run through the profiler to find the best one.
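For instance, to test whether an index would help a filter on department, we can compare the plans before and after creating one; the index name and filter value below are made up for illustration:

EXPLAIN ANALYSE SELECT * FROM employees WHERE department = 'Sales';
CREATE INDEX idx_employees_department ON employees (department);
EXPLAIN ANALYSE SELECT * FROM employees WHERE department = 'Sales';
DROP INDEX idx_employees_department;  -- drop it again if the plan didn't improve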

Now let us examine two queries which return the same set of values; subtracting one result set from the other with EXCEPT returns zero rows, confirming the match.

SELECT first_name, department, (SELECT COUNT(*) FROM employees e2 WHERE e1.department = e2.department)
FROM employees e1
GROUP BY department, first_name
EXCEPT
SELECT first_name, department,
       count(*) OVER(PARTITION BY department)
FROM employees;
 first_name | department | count 
------------+------------+-------
(0 rows)

If we use EXPLAIN ANALYSE on each of these queries, we get drastically different results.

EXPLAIN ANALYSE SELECT first_name, department, (SELECT COUNT(*) FROM employees e2 WHERE e1.department = e2.department)
FROM employees e1
GROUP BY department, first_name;
                                                       QUERY PLAN                                                        
-------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=27.18..23427.74 rows=945 width=24) (actual time=0.610..117.820 rows=999 loops=1)
   Group Key: e1.department, e1.first_name
   ->  Seq Scan on employees e1  (cost=0.00..22.12 rows=1012 width=16) (actual time=0.008..0.097 rows=1000 loops=1)
   SubPlan 1
     ->  Aggregate  (cost=24.74..24.75 rows=1 width=8) (actual time=0.117..0.117 rows=1 loops=999)
           ->  Seq Scan on employees e2  (cost=0.00..24.65 rows=37 width=0) (actual time=0.005..0.114 rows=42 loops=999)
                 Filter: ((e1.department)::text = (department)::text)
                 Rows Removed by Filter: 958
 Planning Time: 0.131 ms
 Execution Time: 117.914 ms
(10 rows)
EXPLAIN ANALYSE SELECT first_name, department,
       count(*) OVER(PARTITION BY department)
FROM employees;
                                                      QUERY PLAN                                                       
-----------------------------------------------------------------------------------------------------------------------
 WindowAgg  (cost=72.63..90.34 rows=1012 width=24) (actual time=0.941..1.533 rows=1000 loops=1)
   ->  Sort  (cost=72.63..75.16 rows=1012 width=16) (actual time=0.920..0.978 rows=1000 loops=1)
         Sort Key: department
         Sort Method: quicksort  Memory: 82kB
         ->  Seq Scan on employees  (cost=0.00..22.12 rows=1012 width=16) (actual time=0.014..0.208 rows=1000 loops=1)
 Planning Time: 0.067 ms
 Execution Time: 1.602 ms
(7 rows)

1st query:

Planning Time: 0.131 ms
Execution Time: 117.914 ms

2nd query:

Planning Time: 0.067 ms
Execution Time: 1.602 ms

I could rerun these multiple times and the exact timings would shift a bit from run to run. The point is, there is no silver bullet: make sure you try all of your options when you're profiling your queries.

To get a more pictorial representation, the same plans can be viewed graphically:

[Figure: graphical plan for Query 1]

[Figure: graphical plan for Query 2]
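Plan visualizers generally consume EXPLAIN's structured output rather than the text form. You can request it with the FORMAT option (JSON, XML, and YAML are all supported); for example, for Query 2:

EXPLAIN (ANALYSE, FORMAT JSON) SELECT first_name, department,
       count(*) OVER(PARTITION BY department)
FROM employees;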

Conclusion

As we observe from the above results, the way one uses aggregate functions, one plan with a HashAggregate plus a correlated sub-plan and the other with a WindowAgg, makes close to a 100X difference in execution time, and an even larger gap in estimated cost. And this is for a sample table with a small number of records; imagine the difference on production systems with tens of millions of rows.

It can be hard to predict which option will yield the lowest cost, because every table is different. To find the best option, run each candidate through the profiler.
