/********************************************************************************
-- Title : [PGS 9.x] PostgreSQL Tips & Tricks - 4
-- Reference : postgres.cz
-- Key word : 
********************************************************************************/

Slow queries in stored procedures

Sometimes users report that a query is slow inside a stored procedure while direct execution of the same query is fast. The reason is simple: stored procedures use prepared statements, and prepared statements are planned without knowledge of the real parameter values. This behaviour protects against SQL injection, but it also brings a problem: the planner is blind. The issue is not very common, and the solution is simple - use dynamic queries (a sketch follows at the end of this section). But we must not forget to carefully guard them against SQL injection. Problems of this kind can be found by analyzing the prepared statements:

t2=# PREPARE pp(integer) AS SELECT count(*) FROM foo WHERE a BETWEEN $1 + 100 AND $1 + 200; 
t2=# EXPLAIN ANALYZE execute pp(100);
                                                      QUERY PLAN                                                       
-----------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=438.60..438.61 rows=1 width=0) (actual time=16.149..16.150 rows=1 loops=1)
   ->  Bitmap Heap Scan on foo  (cost=13.39..437.35 rows=500 width=0) (actual time=1.391..5.458 rows=1033 loops=1)
         Recheck Cond: ((a >= ($1 + 100)) AND (a <= ($1 + 200)))
         ->  Bitmap Index Scan on fx  (cost=0.00..13.26 rows=500 width=0) (actual time=1.131..1.131 rows=1033 loops=1)
               Index Cond: ((a >= ($1 + 100)) AND (a <= ($1 + 200)))
 Total runtime: 16.340 ms
(6 rows)
t2=# EXPLAIN ANALYZE SELECT count(*) FROM foo WHERE a BETWEEN 100 + 100 AND 100 + 200; 
                                                      QUERY PLAN                                                       
-----------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=432.21..432.22 rows=1 width=0) (actual time=15.930..15.932 rows=1 loops=1)
   ->  Bitmap Heap Scan on foo  (cost=22.21..429.78 rows=971 width=0) (actual time=1.179..13.338 rows=1033 loops=1)
         Recheck Cond: ((a >= 200) AND (a <= 300))
         ->  Bitmap Index Scan on fx  (cost=0.00..21.97 rows=971 width=0) (actual time=0.917..0.917 rows=1033 loops=1)
               Index Cond: ((a >= 200) AND (a <= 300))
 Total runtime: 16.058 ms
(6 rows)

In this case the execution plans are nearly identical, so we can keep using the prepared statement.

PostgreSQL 9.2 has a redesigned implementation of prepared statements, so the performance issue described above should be history.
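
A minimal sketch of the dynamic-query workaround in PL/pgSQL, written against the same foo(a) table as above. The function name count_foo is illustrative; for an integer parameter the quoting is strictly unnecessary, but quote_literal shows the injection-safe pattern:

CREATE OR REPLACE FUNCTION count_foo(p integer)
RETURNS bigint AS $$
DECLARE result bigint;
BEGIN
  -- EXECUTE builds and plans the query with the real value,
  -- so the planner is not blind as with a generic prepared plan
  EXECUTE 'SELECT count(*) FROM foo WHERE a BETWEEN '
          || quote_literal(p::text) || '::int + 100 AND '
          || quote_literal(p::text) || '::int + 200'
     INTO result;
  RETURN result;
END; $$ LANGUAGE plpgsql;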

Slow DISTINCT

The current implementation of DISTINCT needs a sort, which can be slow. We can use a GROUP BY clause instead - it can use hash aggregation, which is significantly faster than DISTINCT.

postgres=# select count(*) from (select distinct i from g) a;
 count 
-------
 10001
(1 row)

Time: 1563,109 ms

postgres=# select count(*) from (select i from g group by i) a;
 count 
-------
 10001
(1 row)

Time: 594,481 ms

This topic applies to PostgreSQL 8.2 and older. PostgreSQL 8.3 and higher can use hash aggregation for DISTINCT as well.

Taking quarter from interval

Interval arithmetic is a little bit of alchemy. Internally the interval type carries days, months and years. The input value isn't normalized; it stays in the entered units (i.e. 200 days are stored as 0 years, 0 months and 200 days). Some functions don't work well with such values (e.g. EXTRACT(QUARTER ...) returns the first quarter). We have to normalize first with the justify_interval function (so an interval of 200 days becomes 0 years, 6 months and 20 days). From that value we get the correct quarter:

postgres=# SELECT interval '200 days', EXTRACT(QUARTER FROM interval '200 days');
 interval | date_part 
----------+-----------
 200 days |         1
(1 row)
postgres=# SELECT justify_interval(interval '200 days'),
                  EXTRACT(QUARTER FROM justify_interval('200 days'));
 justify_interval | date_part 
------------------+-----------
 6 mons 20 days   |         3
(1 row)

Author: Tom Lane from pgsql-bugs

Select random row

Hubert Lubaczewski showed in his [blog] the problem of selecting a random row from a table. During a database audit I found the typical but slow solution (from the FAQ): ORDER BY random() LIMIT 1. This method never uses an index, and every such query does a complete scan and a sort of the table. The next possible method, WHERE id >= (random()*C+1)::int LIMIT 1, is much faster, but when the id series has gaps it prefers the first rows after a gap. I have used with success a method based on a set of random numbers. The disadvantage of this method is the risk of returning zero rows (then we have to run the statement again); a fuller sketch with the retry loop follows below:

SELECT id, ...
   FROM data
  WHERE id = ANY(ARRAY(
                       SELECT (random()*max_id)::int 
                          FROM generate_series(1,20)))
  LIMIT 1;
-- max_id is constant equal max(id) from table + 20% reserve
-- id is primary key of table data
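
A fuller sketch of the same idea with the retry loop wrapped in a function. The table data(id integer primary key, ...) and the 20% reserve are the assumptions stated above; the function also assumes the table is not empty:

CREATE OR REPLACE FUNCTION random_row_id()
RETURNS integer AS $$
DECLARE result integer;
BEGIN
  LOOP
    SELECT id INTO result
      FROM data
     WHERE id = ANY(ARRAY(
                    SELECT (random() * (SELECT max(id) FROM data) * 1.2)::int
                      FROM generate_series(1,20)))
     LIMIT 1;
    -- the random set can miss all existing ids; then try again
    EXIT WHEN result IS NOT NULL;
  END LOOP;
  RETURN result;
END; $$ LANGUAGE plpgsql;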

Use exception carefully

Trapping exceptions has some overhead, so don't use it too often. In the majority of cases this overhead is insignificant, but that can change when the code runs inside a loop. The next two procedures add a new item into a table of unique records. If the item already exists, they return false, otherwise true.

CREATE OR REPLACE FUNCTION addnew1(integer) 
RETURNS boolean AS $$
BEGIN
  -- test(a) .. primary key
  INSERT INTO test(a) VALUES($1);
  RETURN true;
EXCEPTION WHEN OTHERS THEN
  RETURN false;
END; $$ LANGUAGE plpgsql;

CREATE OR REPLACE FUNCTION addnew2(integer)
RETURNS boolean AS $$
BEGIN
  IF NOT EXISTS(SELECT a FROM test WHERE a = $1) THEN
    INSERT INTO test(a) VALUES($1);
    RETURN true;
  END IF;
  RETURN false;
END; $$ LANGUAGE plpgsql;

Procedure addnew1 is about 2 times slower than addnew2, but addnew1 is more robust (it also survives a concurrent insert of the same value). Every protected block in PL/pgSQL creates a new subtransaction, and in this case the time for creating the subtransaction plus the INSERT is greater than the SELECT plus the INSERT.
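
On PostgreSQL 9.5 and later the same task can be solved without the subtransaction and without the race window of addnew2 by using INSERT ... ON CONFLICT; a minimal sketch over the same test(a) table (the function name addnew3 is illustrative):

CREATE OR REPLACE FUNCTION addnew3(integer)
RETURNS boolean AS $$
BEGIN
  -- requires a unique or primary key constraint on test(a)
  INSERT INTO test(a) VALUES($1) ON CONFLICT (a) DO NOTHING;
  RETURN FOUND;  -- true only when a row was really inserted
END; $$ LANGUAGE plpgsql;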

Fast compare variables NEW and OLD in trigger's body

We can make a trigger cheap by running its body only when some columns have actually changed. This test can be simple and fast with the row operator IS DISTINCT FROM (new in 8.2).

IF ROW(NEW.*) IS DISTINCT FROM ROW(OLD.*) THEN
  -- continue only if some is changed
  IF NEW.a1 IS DISTINCT FROM OLD.a1 
     OR NEW.a2 IS DISTINCT FROM OLD.a2 THEN
     -- continue only if important columns are changed
     ...
  END IF;
END IF;
RETURN NEW;

or, more efficiently:

IF ROW(NEW.a1, NEW.a2) IS DISTINCT FROM ROW(OLD.a1, OLD.a2) THEN
  -- important columns are changed
  ...
END IF;
RETURN NEW;

From 8.4 you can compare record values simply:

IF NEW IS DISTINCT FROM OLD THEN
  -- continue only if some is changed
  ...
END IF;
RETURN NEW;
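
For completeness, a minimal sketch of the full wiring around the last fragment; the table t, its columns a1 and a2, and the changed column set by the trigger are illustrative assumptions only:

CREATE OR REPLACE FUNCTION t_before_update() RETURNS trigger AS $$
BEGIN
  IF ROW(NEW.a1, NEW.a2) IS DISTINCT FROM ROW(OLD.a1, OLD.a2) THEN
    -- react only when the important columns really changed
    NEW.changed := current_timestamp;
  END IF;
  RETURN NEW;
END; $$ LANGUAGE plpgsql;

CREATE TRIGGER t_before_update_trg
  BEFORE UPDATE ON t
  FOR EACH ROW EXECUTE PROCEDURE t_before_update();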

Fast first n rows removing

PostgreSQL doesn't support LIMIT in UPDATE or DELETE statements. Sometimes we would like to use it (for queue emulation, ...). The classic solution with IN and a subquery isn't very efficient:

postgres=# explain analyze delete from del where ctid  in (select ctid from del limit 10 offset 0);
                                                      QUERY PLAN                                                      
----------------------------------------------------------------------------------------------------------------------
 Merge IN Join  (cost=804.79..854.94 rows=10 width=6) (actual time=63.392..63.464 rows=10 loops=1)
   Merge Cond: (public.del.ctid = "IN_subquery".ctid)
   ->  Sort  (cost=804.39..829.39 rows=10000 width=6) (actual time=63.205..63.217 rows=11 loops=1)
         Sort Key: public.del.ctid
         Sort Method:  quicksort  Memory: 646kB
         ->  Seq Scan on del  (cost=0.00..140.00 rows=10000 width=6) (actual time=0.045..27.366 rows=9970 loops=1)
   ->  Sort  (cost=0.41..0.43 rows=10 width=6) (actual time=0.172..0.187 rows=10 loops=1)
         Sort Key: "IN_subquery".ctid
         Sort Method:  quicksort  Memory: 17kB
         ->  Limit  (cost=0.00..0.14 rows=10 width=6) (actual time=0.021..0.127 rows=10 loops=1)
               ->  Seq Scan on del  (cost=0.00..140.00 rows=10000 width=6) (actual time=0.016..0.030 rows=10 loops=1)
 Total runtime: 63.661 ms
(12 rows)

Tom Lane's proposal is based on using an array:

postgres=# explain analyze delete from del where ctid  = any (array(select ctid from del limit 10));
                                                    QUERY PLAN                                                    
------------------------------------------------------------------------------------------------------------------
 Tid Scan on del  (cost=0.14..40.37 rows=10 width=6) (actual time=0.200..0.237 rows=10 loops=1)
   TID Cond: (ctid = ANY ($0))
   InitPlan
     ->  Limit  (cost=0.00..0.14 rows=10 width=6) (actual time=0.046..0.139 rows=10 loops=1)
           ->  Seq Scan on del  (cost=0.00..140.00 rows=10000 width=6) (actual time=0.040..0.065 rows=10 loops=1)
 Total runtime: 0.376 ms

The core of this trick is avoiding the join, which in this case is less efficient than a simple scan with a filter. The trick is usable in other cases as well. Generally, for small N (N < 100) the predicate = ANY performs better than the predicate IN (on PostgreSQL 8.2 and 8.3), but the real efficiency depends on the indexes.

Getting list of comma separated items

D. Dante Lorenso sent a note about building a comma separated list from the content of a table:

SELECT a.id, a.name,
    array_to_string(ARRAY(
      SELECT b.name
         FROM b
        WHERE b.id = a.id
        ORDER BY b.name ASC
    ), ',') AS b_names
   FROM a
  ORDER BY a.id ASC;

[table a]
 id | name
----+------
1   | one
2   | two
3   | three
4   | four

[table b]
 id | name
----+------
1   | pizza
1   | hot dog
2   | gorilla
2   | monkey
3   | apple
4   | cheese
4   | milk
4   | eggs

--result
 id | name  | b_names
----+-------+---------
1   | one   | pizza,hot dog
2   | two   | gorilla,monkey
3   | three | apple
4   | four  | cheese,milk,eggs
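
On PostgreSQL 9.0 and later the same list can be built with the string_agg aggregate; a minimal sketch over the same tables a and b:

SELECT a.id, a.name,
       (SELECT string_agg(b.name, ',' ORDER BY b.name)
          FROM b
         WHERE b.id = a.id) AS b_names
  FROM a
 ORDER BY a.id;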

Using functions pgstattuple and pgstatindex

You can find the nice contrib package pgstattuple in PostgreSQL 8.1 and higher. This contrib module contains diagnostic functions for detecting the fraction of dead tuples in a table and the fragmentation of an index. Using these functions is more practical with the next two table-generating functions:

CREATE OR REPLACE FUNCTION print_table_dead_tp(OUT table_name varchar, OUT tuple_count int8, OUT table_len varchar, OUT free_space varchar,
                                               OUT dead_tuple_percent numeric(5,2), OUT dead_tuple_count integer) 
RETURNS SETOF RECORD AS $$
DECLARE r record; s record; 
BEGIN 
  FOR r IN SELECT c.oid, n.nspname || '.' || c.relname as "tablename"
              FROM pg_catalog.pg_class c
                   LEFT JOIN 
                   pg_catalog.pg_namespace n 
                   ON n.oid = c.relnamespace
             WHERE c.relkind = 'r'
               AND n.nspname NOT IN ('pg_catalog', 'pg_toast')
               AND pg_catalog.pg_table_is_visible(c.oid)
             ORDER BY 2
  LOOP
    s := pgstattuple(r.oid);
    table_name := r.tablename;
    tuple_count := s.tuple_count; dead_tuple_percent :=  s.dead_tuple_percent; 
    dead_tuple_count := s.dead_tuple_count;
    table_len :=  pg_size_pretty(s.table_len);
    free_space := pg_size_pretty(s.free_space);
    RETURN NEXT; 
  END LOOP; 
  RETURN;
END; $$ LANGUAGE plpgsql;

CREATE OR REPLACE FUNCTION print_indexes_info(OUT index_name varchar, OUT indexsize varchar, 
                                              OUT leaf_pages integer, OUT deleted_pages integer, 
                                              OUT leaf_fragmentation double precision, OUT avg_leaf_density double precision) 
RETURNS SETOF RECORD AS $$
DECLARE r record; o pgstatindex_type;
BEGIN
  FOR r IN SELECT schemaname || '.' || indexname AS indexname FROM pg_indexes LOOP
    BEGIN
      o := pgstatindex(r.indexname);
      index_name := r.indexname; 
      indexsize := pg_size_pretty(o.index_size); deleted_pages := o.deleted_pages;
      leaf_pages := o.leaf_pages; avg_leaf_density := o.avg_leaf_density;
      leaf_fragmentation := o.leaf_fragmentation; 
  
      RETURN NEXT;
    EXCEPTION WHEN OTHERS THEN
      indexsize := NULL; deleted_pages := NULL; leaf_fragmentation := NULL;
      index_name := r.indexname;
      RETURN NEXT;
    END;
  END LOOP;
  RETURN;
END; $$ LANGUAGE plpgsql;
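
Typical usage of the two helper functions defined above (this assumes the pgstattuple contrib module is installed):

SELECT * FROM print_table_dead_tp() ORDER BY dead_tuple_percent DESC;
SELECT * FROM print_indexes_info() ORDER BY leaf_fragmentation DESC;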

Diacritic removing

We can simply remove diacritics by calling the to_ascii function. There is one small problem: this function doesn't support UTF8 encoding, so we have to use the convert function first.

postgres=# select to_ascii(convert('Příliš žlutý kůň' using utf8_to_iso_8859_2),'latin2');
     to_ascii     
------------------
 Prilis zluty kun
(1 row)

Note for 7.4: the name of the conversion is "utf_8_to_iso_8859_2". For 8.3 and later (including 9.0) we have to use a small workaround:

CREATE FUNCTION to_ascii(bytea, name) 
RETURNS text AS 'to_ascii_encname' LANGUAGE internal;

SELECT to_ascii(convert_to('Příliš žlutý kůň', 'latin2'),'latin2');

Predicate IN optimization

I found a note on pg_performance about a possible optimization of the IN predicate when the list of values is longer than about eighty numbers. For longer lists it is better to create a constant subquery using multiple VALUES:

SELECT * 
   FROM tab 
  WHERE x IN (1,2,3,..n); -- n > 70

-- faster case
SELECT * 
   FROM tab
  WHERE x IN (VALUES(10),(20));

Using VALUES is faster only for a bigger number of items, so don't use it for a small set of values.

BAR code validation

SQL guru Joe Celko described on DBAzine an interesting technique for barcode validation using multi-row VALUES.

CREATE FUNCTION barcode_sum(text)
RETURNS bigint AS 
$$
  SELECT ABS(SUM(CAST(SUBSTRING($1 FROM Weights.seq FOR 1) AS INTEGER) * Weights.wgt))
     FROM (VALUES (CAST(1 AS INTEGER), CAST(-1 AS INTEGER)),
                  (2, +1), (3, -1), (4, +1), (5, -1), 
                  (6, +1), (7, -1), (8, +1), (9, -1), (10, +1), 
                  (11,-1), (12, +1)) AS weights(seq, wgt)
$$ LANGUAGE sql;

Our function can be used like this:

CREATE TABLE products(
  barcode char(13) NOT NULL
  CONSTRAINT valid_checkdigit
    check (MOD(barcode_sum(barcode),10) = CAST(substring(barcode from 13 for 1) AS integer))
  CONSTRAINT all_numeric_checkdigit CHECK(barcode NOT SIMILAR TO '%[^0-9]%')
);

INSERT INTO products VALUES('2837232811227');

This trick is based on the behaviour of the char type: text shorter than 13 characters is padded with spaces, so the NOT SIMILAR TO test also ensures the correct number of digits.

Using hash functions for ensuring uniqueness of texts

A classic unique index isn't a good solution for longer texts. It is natural to use a hash function instead. The output of such a function (e.g. md5) is a 32 character hexadecimal string, so we can shorten it by decoding it into the bytea type (16 bytes). Attention: a hash doesn't ensure 100% uniqueness.

root=# CREATE TABLE test(a text);
CREATE TABLE

root=# CREATE UNIQUE INDEX uidx ON test((decode(md5(a),'hex')));
CREATE INDEX
root=# CREATE UNIQUE INDEX uidx ON test((decode(md5(lower(replace(a,' ',''))),'hex')));
CREATE INDEX
root=# INSERT INTO test VALUES('příliš žluťoučký kůň');
INSERT 0 1
root=# INSERT INTO test VALUES('příliš žluťoučký kůň ');
ERROR:  duplicate key violates unique constraint "uidx"
root=# INSERT INTO test VALUES('Příliš žluťoučký kůň');
ERROR:  duplicate key violates unique constraint "uidx"

-- less risk, using two hash functions
-- CREATE UNIQUE INDEX uidx ON TEST((decode(md5(a),'hex')),(hashtext(a)));

Getting OID of function

Every database object in PostgreSQL has an OID - a unique identifier. We can find it in the system catalogs, or we can get it with a cast:

CREATE OR REPLACE FUNCTION a(integer, varchar) RETURNS void AS 
$$
  BEGIN
  END;
$$ LANGUAGE plpgsql;

root=# SELECT 'a'::regproc::int;
ERROR:  more than one function named "a"
root=# SELECT 'a(integer, varchar)'::regprocedure::int;
 int4  
-------
 57507
(1 row)

root=# SELECT 57507::regprocedure;
         regprocedure         
------------------------------
 a(integer,character varying)
(1 row)

A table's OID can be obtained with a cast to the regclass type:

SELECT 'oo'::regclass::int;

General array sort

A very nice exhibition of PostgreSQL's features is this general sort procedure by David Fetter.

CREATE OR REPLACE FUNCTION array_sort (ANYARRAY)
RETURNS ANYARRAY LANGUAGE SQL
AS $$
SELECT ARRAY(
    SELECT $1[s.i] AS "foo"
    FROM
        generate_series(array_lower($1,1), array_upper($1,1)) AS s(i)
    ORDER BY foo
);
$$;

a little bit faster and shorter modernised variant:

CREATE OR REPLACE FUNCTION array_sort (ANYARRAY)
RETURNS ANYARRAY LANGUAGE SQL
AS $$
SELECT ARRAY(SELECT unnest($1) ORDER BY 1)
$$;

or

-- by Craig Ringer
CREATE OR REPLACE FUNCTION array_sort (ANYARRAY)
RETURNS ANYARRAY LANGUAGE SQL
AS $$
SELECT array_agg(x ORDER BY x) FROM unnest($1) x;
$$;

If you need a really fast sort of integer arrays, look at the intarray contrib module.

Array transformation in PL/Perl

PL/Perl doesn't support OUT arrays. We can use the undocumented function _plperl_to_pg_array:

CREATE OR REPLACE FUNCTION foo(OUT a integer[], OUT b integer[]) AS $$
  return { a=> _plperl_to_pg_array([1,2,3]),
           b=> _plperl_to_pg_array([4,5,6]) };
$$ LANGUAGE plperl;

UPC code validation

This is a sample validation of a code with a check digit in Perl (PL/Perl). I used this function to compare with PL/pgSQL and C implementations: Perl is about 20% faster than PL/pgSQL, but C is 10 times faster than Perl.

CREATE OR REPLACE FUNCTION ean_perl (
    TEXT
) RETURNS boolean AS $_$
    my $ean = length $_[0] == 12 ? "0$_[0]" : $_[0];
    # Make sure we really have an EAN.
    return 'false' unless $ean =~ /^\d{13}$/;
    my @nums = split '', $ean;
    return 10 - (
        # Sum even numerals.
        (   (   $nums[1] + $nums[3] + $nums[5] + $nums[7] + $nums[9]
                    + $nums[11]
            ) * 3 # Multiply total by 3.
        # Add odd numerals except for checksum (12).
        ) + $nums[0] + $nums[2] + $nums[4] + $nums[6] + $nums[8] + $nums[10]
    # Compare to the checksum.
    ) % 10 == $nums[12] ? 'true' : 'false';
$_$ LANGUAGE plperl immutable;

I found an interesting solution by Joe Celko in an article. Rewritten for PostgreSQL:

SELECT (
   SELECT mod(
      sum(
         CAST(substring('2837232811227' from s.seq for 1) AS integer) * 
         CASE mod(s.seq,2) WHEN 0 THEN 1 ELSE -1 END),
      10) FROM generate_series(1,12) as s(seq)) =
   CAST(substring('2837232811227' FROM 13 FOR 1) AS integer);

Show only every n-th row

When we want to skip some rows from a table, we can use a temporary sequence and the modulo operator:

  CREATE TEMP SEQUENCE number; 
  SELECT * FROM ( SELECT *, nextval('number') AS number FROM datatable ) foo 
    WHERE foo.number % 5 = 0;

With 8.4 this can be done without a temporary sequence:

  SELECT * FROM ( SELECT *, row_number() over () AS number FROM datatable ) foo 
    WHERE foo.number % 5 = 0;

Array to table

We can simply unpack an array with the generate_series function:

 SELECT (ARRAY[1,3,4,6,7])[idx.n] FROM generate_series(1,7) idx(n);

CREATE OR REPLACE FUNCTION unpack(anyarray)
RETURNS SETOF anyelement AS $$ 
SELECT $1[i] 
   FROM generate_series(array_lower($1,1), 
                        array_upper($1,1)) g(i);
$$ LANGUAGE sql STRICT IMMUTABLE;

postgres=# select unpack(array['a','b','c']);
 unpack 
--------
 a
 b
 c
(3 rows)


With 8.4, we can use the unnest()-function:

test=*# select unnest(array['a','b','c']);
 unnest
--------
 a
 b
 c
(3 rows)

Additionally, we can extend the unnest() function to return one row with a default value if the array is empty or null:

create or replace function unnest(anyarray, anyelement)
returns setof anyelement as $$
select unnest(case when array_upper($1, 1) > 0 then $1 else array[$2] end);
$$ language sql immutable;

gp=> create temp table foo (x text[], y int) distributed by (y);
CREATE TABLE
gp=> insert into foo values
(array['a','b','c'], 1),
(null, 2),
('{}'::text[],3);
INSERT 0 3

gp=> select *,array_upper(x,1) as size from foo;
    x    | y | size
---------+---+------
 {}      | 3 |
         | 2 |
 {a,b,c} | 1 |    3
(3 rows)

gp=> select unnest(x),y from foo;
 unnest | y
--------+---
 a      | 1
 b      | 1
 c      | 1
(3 rows)

gp=> select unnest(x,'empty'),y from foo;
 unnest | y
--------+---
 empty  | 3
 empty  | 2
 a      | 1
 b      | 1
 c      | 1
(5 rows)

LIKE optimization

PostgreSQL uses an index for LIKE only when:

  • the pattern doesn't start with the % or _ symbols,
  • the database cluster is initialized with the C locale.

The last constraint can be worked around with a special index:

CREATE INDEX like_index ON people(surname varchar_pattern_ops);

When we work with a pattern like %some (e.g. for searching by internet domain) we can use this trick:

  • create a functional index on the mirrored values,
  • search with the mirrored pattern.

PostgreSQL (before 9.1) doesn't provide a reverse (mirror) function. We can use the rvrs function from the Orafunc package, or we can use PL/Perl.

CREATE OR REPLACE FUNCTION reverse(varchar) RETURNS varchar AS $$ 
  $reversed = reverse $_[0]; 
  return $reversed; 
$$ LANGUAGE plperlu IMMUTABLE;

CREATE INDEX rev_email ON users( (reverse(email) ) varchar_pattern_ops );
SELECT * FROM users WHERE reverse(email) LIKE reverse('%.cz');

Without plperl:

CREATE OR REPLACE FUNCTION reverse(TEXT) RETURNS TEXT AS $$
 SELECT 
    array_to_string( 
      ARRAY
        ( SELECT substring($1, s.i,1) FROM generate_series(length($1), 1, -1) AS s(i) ), 
      '');
$$ LANGUAGE SQL IMMUTABLE;

Note: PostgreSQL 9.1 has a built-in function for reversing a string.

Deferred constraints

PostgreSQL checks all constraints immediately. The SQL standard knows deferred constraints, which allow the checks to be delayed until just before commit. PostgreSQL doesn't have full support for them: only referential integrity constraints can be deferred. But PostgreSQL also supports (little documented) deferred constraint triggers, so we can use those. The syntax is similar to a trigger definition:

  CREATE CONSTRAINT TRIGGER sc
  AFTER INSERT 
  ON fieldtrip_students
  INITIALLY DEFERRED
  FOR EACH ROW EXECUTE PROCEDURE field_trip();
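
The trigger function itself is not shown in the original example; a purely hypothetical sketch of a deferred check follows (the tables fieldtrip_students and fieldtrips and their columns trip_id and capacity are invented for illustration):

CREATE OR REPLACE FUNCTION field_trip() RETURNS trigger AS $$
BEGIN
  -- runs only at commit time thanks to INITIALLY DEFERRED
  IF (SELECT count(*) FROM fieldtrip_students s WHERE s.trip_id = NEW.trip_id)
     > (SELECT t.capacity FROM fieldtrips t WHERE t.id = NEW.trip_id) THEN
    RAISE EXCEPTION 'trip % is overbooked', NEW.trip_id;
  END IF;
  RETURN NULL;  -- the return value of an AFTER trigger is ignored
END; $$ LANGUAGE plpgsql;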

Removing of duplicate rows

Any table created without the WITHOUT OIDS parameter has a unique row identifier, OID. We can use it for removing duplicate rows:

 DELETE FROM people WHERE 
   NOT oid IN (SELECT MIN(oid) FROM people GROUP BY name, surname);

But new tables are usually created without OIDs nowadays, so we can't rely on OID. Every PostgreSQL row has another unique identifier - ctid. There are no aggregate functions defined for the ctid type, so we have to use a correlated subquery. I found this trick in a presentation by Greg Mullane.

 DELETE FROM lidi WHERE
   NOT ctid = (SELECT ctid FROM lidi l WHERE 
     lidi.prijmeni=l.prijmeni AND lidi.jmeno=l.jmeno LIMIT 1);
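
On 8.4 and later the correlated subquery can be avoided with a window function, combined with the ctid = ANY(ARRAY(...)) trick shown earlier; a sketch over the same lidi table:

DELETE FROM lidi
 WHERE ctid = ANY(ARRAY(
         SELECT ctid
           FROM (SELECT ctid,
                        row_number() OVER (PARTITION BY prijmeni, jmeno) AS rn
                   FROM lidi) s
          WHERE s.rn > 1));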

Select first n rows from group

   SELECT * FROM people WHERE id IN (
     SELECT id FROM people s 
       WHERE people.category = s.category 
       ORDER BY age LIMIT 2) 
     ORDER BY category, age;

Another solution without a correlated subquery is a self join. We have to look at the problem from a different perspective and ask for the persons for which at most one person in the same group is older.

SELECT s1.* 
   FROM people s1
        LEFT JOIN
        people s2
        ON s1.category = s2.category AND s1.age < s2.age
  GROUP BY s1.id, s1.category
  HAVING COUNT(s2.id) <= 1
  ORDER BY s1.category, COUNT(s2.id);
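
From 8.4 the same "first two per group" result can be written with a window function; a sketch assuming the same people table:

SELECT id, category, age
  FROM (SELECT p.*,
               row_number() OVER (PARTITION BY category ORDER BY age) AS rn
          FROM people p) s
 WHERE s.rn <= 2
 ORDER BY category, age;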

Cast to varchar

Every data type in PostgreSQL has input and output functions. We can use these functions to emulate a cast to varchar (if the cast is missing). Don't use this trick on PostgreSQL 8.3 and higher.

  testdb011=# SELECT '(1,1)'::point::varchar;
  ERROR:  cannot cast type point to character varying

but

 testdb011=# SELECT textin(point_out('(1,1)'::point))::varchar;
  textin
 --------
  (1,1)
 (1 row)

Create Aggregates

I needed to be able to do a fast

SELECT count(*) FROM sometable WHERE field='x'

instead of a simple

SELECT count(*) FROM sometable

Both of the above queries initiate sequential scans on the table (which takes too much time). Here is what I created:

CREATE TABLE aggregate AS 
     (SELECT result,count(result) AS count FROM original_table 
      GROUP BY result ORDER BY count DESC);

Further, my data contains many one-offs which I don't want to put into my aggregate table (I assume that if a value isn't in my aggregate table that it's too small to be useful)... here is what I do to leave those values out:

CREATE TABLE aggregate AS 
     (SELECT * FROM 
          (SELECT result,count(result) AS count FROM original_table 
           GROUP BY result ORDER BY count DESC) 
     AS countquery WHERE count > 3);

This method is fast enough. Because I have an index on the "result" field, it runs in under a minute against 3 million records. It takes approximately the same amount of time to generate counts for all the result values as it does to count the total number of records.

Now I can do a simple (very fast) query to get a count:

SELECT * FROM aggregate where result='xxxxx'