Don't Forget The Elephant

A Review of Modern PostgreSQL

Presented by Keith Fiske / @keithf4

Database Administrator @ OmniTI



Follow along at
http://slides.keithf4.com/dontforget

OmniTI, Inc

  • Full-stack support for high-traffic websites & applications
    • Millions+ of users
    • Terabytes+ of data
    • Gilt, Etsy, Ora.TV, Freelotto
  • Surge Conference
  • We're hiring!

What is PostgreSQL?

  • Open Source RDBMS
  • Started at UC Berkley 1986, open sourced in 1996
  • BSD-type License
  • Follows SQL Standard very closely
  • Yearly major version releases (x.x)
    • Minor (x.x.x) patch releases as needed
    • Generally patch all supported versions at one time
  • Third-party Plugin Support
    • Procedural Languages
      • C, Java, Python, Perl, JavaScript, PHP, R, Ruby, etc
    • Extensions
    • Background Workers
  • Mailing Lists, IRC, Wiki, planet.postgresql.org, Conferences & User Groups

A Brief History of Releases

Version Released Final Version EOL
PostgreSQL 8.3 Feb 2008 8.3.23 Feb 2013
PostgreSQL 8.2 Dec 2006 8.2.23 Dec 2011
PostgreSQL 8.1 Nov 2005 8.1.23 Oct 2010
PostgreSQL 8.0 Jan 2005 8.0.26 Oct 2010
PostgreSQL 7.4 Nov 2003 7.4.30 Oct 2010
PostgreSQL 7.3 Nov 2002 7.3.21 Nov 2007
PostgreSQL 7.2 Feb 2002 7.2.8 Feb 2007
PostgreSQL 7.1 Apr 2001 7.1.3 Apr 2006
PostgreSQL 7.0 May 2000 7.0.3 May 2005
PostgreSQL 6.5 June 1999 6.5.3 June 2004
PostgreSQL 6.4 Oct 1998 6.4.2 Oct 2003
PostgreSQL 6.3 March 1998 6.3.2 March 2003

8.0 Features

(Released Jan 2005 - EOL Oct 2010)
  • Native Microsoft Windows release
  • Savepoints
  • Point-In-Time Recovery (PITR)
  • Tablespaces
  • Change column types

8.1 Features

(Released Nov 2005 - EOL Nov 2010)
  • Two-phase commit
  • Role system simplified separate user/group
  • Greatly improved memory management for increased concurrency performance
  • Auto-vacuum moved into main server (was contrib module)

8.2 Features

(Released Dec 2006 - EOL Dec 2011)
  • Concentrated on user requested features for easier administration and better performance
    • INSERT/UPDATE/DELETE ... RETURNING ...
    • Concurrent index creation
    • Many query planner improvements
    • More efficient locking for better concurrency
    • More efficient vacuuming

8.3 Features

(Released Feb 2008 - EOL Feb 2013)
  • Full text search integrated into core system
  • ENUM & UUID types
  • Concurrent autovacuum allowed spawning of multiple vacuum jobs automatically
  • Updatable cursors

8.4 Features

Released July 2009 - EOL July 2014
Final Version 8.4.22

Window Functions

  • Performs a calculation across a set of table rows that are somehow related to the current row
  • The OVER clause determines exactly how the rows of the query are split up for processing by the window function

SELECT salary, sum(salary) OVER () FROM empsalary;

 salary |  sum  
--------+-------
   5200 | 47100
   5000 | 47100
   3500 | 47100
   4800 | 47100
   3900 | 47100
   4200 | 47100
   4500 | 47100
   4800 | 47100
   6000 | 47100
   5200 | 47100
(10 rows)


SELECT salary, sum(salary) OVER (ORDER BY salary) FROM empsalary;

 salary |  sum  
--------+-------
   3500 |  3500
   3900 |  7400
   4200 | 11600
   4500 | 16100
   4800 | 25700
   4800 | 25700
   5000 | 30700
   5200 | 41100
   5200 | 41100
   6000 | 47100
(10 rows)
                        

Window Functions

  • The PARTITION BY list within OVER specifies dividing the rows into groups, or partitions, that share the same values of the PARTITION BY expression(s).

SELECT depname, empno, salary, avg(salary) 
OVER (PARTITION BY depname) 
FROM empsalary;

  depname  | empno | salary |          avg          
-----------+-------+--------+-----------------------
 develop   |    11 |   5200 | 5020.0000000000000000
 develop   |     7 |   4200 | 5020.0000000000000000
 develop   |     9 |   4500 | 5020.0000000000000000
 develop   |     8 |   6000 | 5020.0000000000000000
 develop   |    10 |   5200 | 5020.0000000000000000
 personnel |     5 |   3500 | 3700.0000000000000000
 personnel |     2 |   3900 | 3700.0000000000000000
 sales     |     3 |   4800 | 4866.6666666666666667
 sales     |     1 |   5000 | 4866.6666666666666667
 sales     |     4 |   4800 | 4866.6666666666666667
(10 rows)


SELECT depname, empno, salary, rank() 
OVER (PARTITION BY depname ORDER BY salary DESC) 
FROM empsalary;

  depname  | empno | salary | rank 
-----------+-------+--------+------
 develop   |     8 |   6000 |    1
 develop   |    10 |   5200 |    2
 develop   |    11 |   5200 |    2
 develop   |     9 |   4500 |    4
 develop   |     7 |   4200 |    5
 personnel |     2 |   3900 |    1
 personnel |     5 |   3500 |    2
 sales     |     1 |   5000 |    1
 sales     |     4 |   4800 |    2
 sales     |     3 |   4800 |    2
(10 rows)

                        

Common Table Expression

  • WITH provides a way to write auxiliary statements for use in a larger query
  • These statements, referred to as Common Table Expressions or CTEs, can be thought of as defining temporary tables that exist just for one query
  • WITH queries are useful because they are evaluated only once per execution of the parent query, even if they are referred to more than once by the parent query or sibling WITH queries.

WITH regional_sales AS (
        SELECT region, SUM(amount) AS total_sales
        FROM orders
        GROUP BY region
     ), top_regions AS (
        SELECT region
        FROM regional_sales
        WHERE total_sales > (SELECT SUM(total_sales)/10 FROM regional_sales)
     )
SELECT region,
       product,
       SUM(quantity) AS product_units,
       SUM(amount) AS product_sales
FROM orders
WHERE region IN (SELECT region FROM top_regions)
GROUP BY region, product;
                        

Common Table Expressions

Allow recursive queries

WITH RECURSIVE t(n) AS (
    VALUES (1)
  UNION ALL
    SELECT n+1 FROM t WHERE n < 100
)
SELECT sum(n) FROM t;

 sum  
------
 5050
(1 row)

                        

Parallel pg_restore

  • Prior versions, each object was restored serially one after the other
  • Parallel option tells it to always keep restoring that many objects at all times until complete.
  • Significantly reduced downtime for major version upgrades

Additional 8.4 Features

  • Default & variadic parameters for functions
  • Column permissions
  • Per-database locale settings
  • SSL certificate authentication
  • Automatic sizing of Free Space Map (eliminated significant administrative headache)
  • Live editing of functions (\ef in psql)
  • New contrib modules: pg_stat_statements, auto_explain, btree_gin

9.0 Features

Released Sept 2010 - EOL Sept 2015
Final Version 9.0.23

Streaming Replication

  • Prior replication methods required significant setup and relied on third-party tools (commands given to archive_command)
  • Simpler, TCP-based connection method
  • Significantly reduced lag time between master & slave systems (less than 1 second)

Hot Standby

  • Allow readonly connections to slave servers
  • Offset load on master systems by sending non-write queries to slave(s)
  • Better security: allow access to data but not to a system that permits writes

pg_upgrade

  • Previously, all major version (x.x) upgrades required a full dump & restore
  • Can either copy existing files to new data directory or can use hard links to reuse old files
  • Link method can allow very quick upgrade of large databases (ex: 1.2TB in < 2min; does not include stats update)
  • Minimum version to use: 8.4.7
    • Anything older still requires dump & restore

Additional 9.0 Features

  • Enhanced permissions management
    • Grant to all objects in schema with single command
    • Alter initial, default privileges of objects created by specific roles
  • Better VACUUM FULL
  • Windows 64-bit support
  • Anonymous code blocks (DO statement)
  • Deferrable unique constraints
  • Enhanced EXPLAIN plans. Export in JSON, XML or YAML

Friendly Reminder

  • If you are running any version <= 9.0, you are no longer receiving support
  • 9.0 support ended Sept 2015
  • No security updates & no data corruption fixes!
  • Significant performance increases from 8.4 and up
  • Getting to new version now will make future upgrades much easier (pg_upgrade)
  • We can help! - omniti.com

9.1 Features

Released Sept 2011 - EOL Sept 2016
Current verison 9.1.21

Foreign Data Wrappers

  • SQL/MED (Management of External Data) part of SQL Standard
  • Connect to remote databases/filesystems/services as if they were local tables inside PostgreSQL
  • Oracle, MySQL, SQL Server, SQLite, MongoDB, Redis, CSV Files, Twitter, Blackhole, etc
  • Joins & conditions pushed to remote system
  • Still performance issues joining local tables w/ remote
  • Ease the migration to PostgreSQL from different databases

Extensions

  • Package related objects into a managed bundle
  • Versioned
    • Ensure all systems have the same code
    • Controlled upgrades & downgrades
  • All contrib modules are now extensions
  • Examples: PostGIS, pg_repack, pg_partman

Writable CTE

  • WITH statements can now contain INSERT, UPDATE & DELETE
  • Use RETURNING clause to do fun things

WITH moved_rows AS (
    DELETE FROM products
    WHERE
        "date" >= '2010-10-01' AND
        "date" < '2010-11-01'
    RETURNING *
)
INSERT INTO products_log
SELECT * FROM moved_rows;
                        

Writable CTE

  • Simplify transactional logic that contains multiple, related insert/update/delete statements
  • As a side-effect, significantly improves performance!
  • 
    BEFORE:
    BEGIN;
    INSERT INTO users (name, email) VALUES (?,?) RETURNING userid;
    
    INSERT INTO addresses (userid, address, city, state, zip) 
    VALUES (?,?,?,?,?) RETURNING addressid;
    
    INSERT INTO user_history (userid, addressid, action) VALUES (?,?,?) RETURNING historyid;
    COMMIT;
    
    
    AFTER:
    WITH userdata AS (
      INSERT INTO users (name, email) VALUES (?,?)
      RETURNING userid
    ), addressdata AS (
      INSERT INTO addresses (userid, address, city, state, zip)
      SELECT userid,?,?,?,?
      FROM userdata
      RETURNING addressid 
    ), historydata AS (
      INSERT INTO user_history (userid, addressid, action)
      SELECT userid, addressid,?
      FROM userdata, addressdata 
      RETURNING historyid
    )
    SELECT userid, addressid, historyid 
    FROM userdata, addressdata, historydata;
                                

Additional 9.1 Features

  • Synchronous Replication
  • True serializable transaction isolation
  • Unlogged tables

9.2 Features

Released Sept 2012 - EOL Sept 2017
Current version 9.2.16

Range Data Type

  • Data type representing a range of values of some element type (time, id, custom)
  • Scheduling, probablilty, intersection of ordered data

-- includes 3, does not include 7, and does include all points in between
SELECT '[3,7)'::int4range;

-- does not include either 3 or 7, but includes all points in between
SELECT '(3,7)'::int4range;

-- includes only the single point 4
SELECT '[4,4]'::int4range;

-- includes no points (and will be normalized to 'empty')
SELECT '[4,4)'::int4range;


CREATE TABLE reservation (room int, during tsrange);
INSERT INTO reservation VALUES
    (1108, '[2010-01-01 14:30, 2010-01-01 15:30)');


-- Containment
SELECT int4range(10, 20) @> 3;
 ?column? 
----------
 f

-- Overlaps
SELECT numrange(11.1, 22.2) && numrange(20.0, 30.0);
 ?column? 
----------
 t

-- Extract the upper bound
SELECT upper(int8range(15, 25));
 upper 
-------
    25

-- Compute the intersection
SELECT int4range(10, 20) * int4range(15, 25);
 ?column? 
----------
 [15,20)

                        

Range Types

  • Practical Example: Buying a car in budget (Thank you Jonathan Katz)
  • Without range type, table below would have had to have a min_price & max_price column to track its possible range of values


test=# SELECT * FROM cars ORDER BY lower(cars.price_range);
 id  | name                | price_range
-----+---------------------+-----------------
 2   | Buick Skylark       | [2000,4001)
 3   | Pontiac GTO         | [5000,7501)
 4   | Chevrolet Camero    | [10000,12001)
 5   | Ford Mustang        | [11000,15001)
 6   | Lincoln Continental | [12000,14001)
 7   | BMW M3              | [35000,42001)
 8   | Audi RS4            | [41000,45001)
 9   | Porsche 911         | [47000,58001)
 10  | Lamborghini LP700   | [385000,400001)

                        

Range Type

  • Budget of $13,000 - $15,000
  • Find all cars that have any values in that price range
  • With old min_price / max_price columns, you write fun queries like this:

SELECT *
FROM cars
WHERE
    (
        cars.min_price ≤ 13000 AND
        cars.min_price ≤ 15000 AND
        cars.max_price ≥ 13000 AND
        cars.max_price ≤ 15000
    ) OR
    (
        cars.min_price ≤ 13000 AND
        cars.min_price ≤ 15000 AND
        cars.max_price ≥ 13000 AND
        cars.max_price ≥ 15000
    ) OR
    (
        cars.min_price ≥ 13000 AND
        cars.min_price ≤ 15000 AND
        cars.max_price ≥ 13000 AND
        cars.max_price ≤ 15000
    ) OR
    (
        cars.min_price ≥ 13000 AND
        cars.min_price ≤ 15000 AND
        cars.max_price ≥ 13000 AND
        cars.max_price ≥ 15000
    )
ORDER BY cars.min_price;
                        

Range Types

  • Range types & operators make this just slightly easier

SELECT *
FROM cars
WHERE cars.price_range && int4range(13000, 15000, '[]')
ORDER BY lower(cars.price_range);

 id | name                | price_range
----+---------------------+---------------
 5  | Ford Mustang        | [11000,15001)
 6  | Lincoln Continental | [12000,14001)
                        

Cascading Replication

  • Database can be a streaming slave of another slave
  • Every streaming slave off of a master causes additional load for reading its WAL stream
  • Distribute slave load among several systems in large, multi-cluster setups

Additional 9.2 Features

  • Index-Only scan
  • JSON Data Type (limited)

9.3 Features

Released Sept 2013 - EOL Sept 2018
Current Version 9.3.12

Reduce System V shared memory

  1. Install OS
  2. Install PostgreSQL package
  3. Do tuning adjustments to fix your shared_buffers (32mb is enough for anyone right?)
  4. Restart to put new settings into place
  5. Uh oh...
IpcMemoryCreate: shmget(key=5432001, size=415776768, 03600) failed: Invalid argument 

This error usually means that PostgreSQL's request for a shared memory 
segment exceeded your kernel's SHMMAX parameter. You can either 
reduce the request size or reconfigure the kernel with larger SHMMAX. 
To reduce the request size (currently 415776768 bytes), reduce 
PostgreSQL's shared_buffers parameter (currently 50000) and/or 
its max_connections parameter (currently 12).
                        

Custom Background Worker

  • PostgreSQL already used distinct background processes for its own inner workings (postmaster, writer, logger, autovacuum, stats collector, etc)
  • Now able to program your own workers and let PostgreSQL manage them for you
  • Auto-start & stop along with the database
  • Use custom configuration options in postgresql.conf
  • Example: pg_partman now has a BGW scheduler so separate cronjob is no longer required for partition maintenance

Additional 9.3 Features

  • JSON Operators - Actually useful!
  • LATERAL queries/joins
  • Writable Foreign Data Wrappers
  • An actual PostgreSQL FDW!
  • Parallel pg_dump
  • Materialized Views (kind of useless)
  • Checksum data pages

9.4 Features

Released Dec 2014 - EOL Dec 2019
Current Version 9.4.7

Binary JSON data type

  • Even more useful! Possibly replace MongoDB.
  • Same data type validation as before
  • Stored in format that does not require reparsing original text
  • Allows indexing and more efficient storage
  • New operators and functions for better interaction with JSON data
  • Fields are de-dupped. If you need dupped fields in tact, use normal json type
  • Great blogpost with JSONB examples: Unleash the Power of Storing JSON in Postgres

Binary JSON data type


CREATE TABLE cards (
  id integer NOT NULL,
  board_id integer NOT NULL,
  data jsonb
);
 
INSERT INTO cards VALUES (1, 1, '{"name": "Paint house", "tags": ["Improvements", "Office"], "finished": true}');

SELECT data->>'name' AS name FROM cards;

SELECT count(*) FROM cards WHERE data->>'finished' = 'true';

Aggregate (cost=335.12..335.13 rows=1 width=0) (actual time=4.421..4.421 rows=1 loops=1) -> Seq Scan on cards (cost=0.00..335.00 rows=50 width=0) (actual time=0.016..3.961 rows=4938 loops=1) 
    Filter: ((data ->> 'finished'::text) = 'true'::text) 
    Rows Removed by Filter: 5062 
Planning time: 0.071 ms 
Execution time: 4.465 ms

CREATE INDEX idxfinished ON cards ((data->>'finished'));

Aggregate (cost=118.97..118.98 rows=1 width=0) (actual time=2.122..2.122 rows=1 loops=1) -> Bitmap Heap Scan on cards (cost=4.68..118.84 rows=50 width=0) (actual time=0.711..1.664 rows=4938 loops=1) 
    Recheck Cond: ((data ->> 'finished'::text) = 'true'::text) 
    Heap Blocks: exact=185 
    -> Bitmap Index Scan on idxfinished (cost=0.00..4.66 rows=50 width=0) (actual time=0.671..0.671 rows=4938 loops=1) 
        Index Cond: ((data ->> 'finished'::text) = 'true'::text) 
Planning time: 0.084 ms 
Execution time: 2.199 ms
                         

Replication Slots

  • Allows master servers to be aware of slave replication state
  • Master keeps WAL in pg_xlogs for known slaves if they disconnect
  • Previously had to either set wal_keep_segments high or have secondary WAL storage to avoid slave rebuild
  • Ties in with upcoming logical replication features

Additional 9.4 Features

  • ALTER SYSTEM command to dynamically change postgresql.conf within the database
  • Materialzed view refresh no longer blocks reads.
    • Actually useful! Still not auto-refreshed
  • Dynamic loading/unloading of Background Workers
  • pg_prewarm
    • Allows immediate loading of relation data during cluster start into shared_buffers to "warm" them up
  • Logical decoding in WAL stream. First steps to native logical replication & multi-master.
  • Distinct output of Planning & Execution time from EXPLAIN ANALYZE output

9.5 Features

Released Jan 2016- EOL Jan 2021
Current Version 9.5.2

UPSERT

  • INSERT ... ON CONFLICT ...
    • ON CONSTRAINT ... (better to name columns)
    • DO NOTHING
    • DO UPDATE SET ... [WHERE ...]
  • "... optimistic variant of regular insertion that first does a pre-check for existing tuples and then attempts an insert."
  • Not quite the same as MERGE. This is an actual upsert.

Row Level Security

  • Control access to data by conditions on the values of that data
  • Adds to already extensive privilege controls (database, schema, table, column)
  • Superusers and object owners bypass RLS. Unless...
  • New GUC - row_security (ON, OFF, FORCE)
  • pg_dump sets row_securty to OFF (--enable-row-security)
  • Multiple policies per table (Results are OR'ed)
  • BYPASSRLS/NOBYPASSRLS - new role property to allow individual roles to bypass RLS
  • Row level security shows up in the \d results
  • Talk on RLS from Joe Conway @ 3:00pm in Room 303

BRIN Indexes

  • Block Range Index / MinMax Index
  • Stores only the bounds-per-heap-page vs BTree method of one index entry for each value
  • Configure how many heap pages contribute to index entry
  • Significantly faster index creation, update & smaller size
  • Sometimes slower index searches than BTree
  • Works better with statically ordered data
  • Really good explanation & examples at python sweetness blog
  • Talk on using BRIN with geospatial data from Giuseppe Broccolo @ 2:15pm in Room 303

Additional 9.5 Features

  • Sorting Speed Improvements (Abbreviated Keys)
  • checkpoint_segments replaced by min_wal_size & max_wal_size
  • pg_rewind - no longer have to rebuild failed master from scratch
  • Add IF NOT EXISTS clause to CREATE INDEX, CREATE TABLE AS, CREATE SEQUENCE, and CREATE MATERIALIZED VIEW
  • Additional GROUP BY options (GROUPING SETS, CUBE, ROLLUP)
  • JSONB improvements
  • Foreign Table Inheritance - child tables on remote databases

9.6 Features

Current Status - Development (not even alpha)

Upcoming 9.6 Features

  • Parallel Sequential Scan, Aggregation & Some Joins
  • Built in uni-directional logical replication (merge in pglogical extension)
  • Vacuum progress reporting
  • More details about lock waits in pg_stat_activity
  • Support multiple -c & -f options for psql
  • %n log_line_prefix value for unix epoch