Skip to content

Segfault with TPC-DS query #6

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
RyanMarcus opened this issue May 9, 2018 · 9 comments
Closed

Segfault with TPC-DS query #6

RyanMarcus opened this issue May 9, 2018 · 9 comments
Assignees
Labels

Comments

@RyanMarcus
Copy link

I apologize in advance for this messy bug report.

Executing the following SQL query, an instance of a TPC-DS template, occasionally causes a segmentation fault in PG10.3 when pg_query_state is called. Here is the query:

select  *
from
 (select count(*) h8_30_to_9
 from store_sales, household_demographics , time_dim, store
 where ss_sold_time_sk = time_dim.t_time_sk   
     and ss_hdemo_sk = household_demographics.hd_demo_sk 
     and ss_store_sk = s_store_sk
     and time_dim.t_hour = 8
     and time_dim.t_minute >= 30
     and ((household_demographics.hd_dep_count = 0 and household_demographics.hd_vehicle_count<=0+2) or
          (household_demographics.hd_dep_count = 2 and household_demographics.hd_vehicle_count<=2+2) or
          (household_demographics.hd_dep_count = 4 and household_demographics.hd_vehicle_count<=4+2)) 
     and store.s_store_name = 'ese') s1,
 (select count(*) h9_to_9_30 
 from store_sales, household_demographics , time_dim, store
 where ss_sold_time_sk = time_dim.t_time_sk
     and ss_hdemo_sk = household_demographics.hd_demo_sk
     and ss_store_sk = s_store_sk 
     and time_dim.t_hour = 9 
     and time_dim.t_minute < 30
     and ((household_demographics.hd_dep_count = 0 and household_demographics.hd_vehicle_count<=0+2) or
          (household_demographics.hd_dep_count = 2 and household_demographics.hd_vehicle_count<=2+2) or
          (household_demographics.hd_dep_count = 4 and household_demographics.hd_vehicle_count<=4+2))
     and store.s_store_name = 'ese') s2,
 (select count(*) h9_30_to_10 
 from store_sales, household_demographics , time_dim, store
 where ss_sold_time_sk = time_dim.t_time_sk
     and ss_hdemo_sk = household_demographics.hd_demo_sk
     and ss_store_sk = s_store_sk
     and time_dim.t_hour = 9
     and time_dim.t_minute >= 30
     and ((household_demographics.hd_dep_count = 0 and household_demographics.hd_vehicle_count<=0+2) or
          (household_demographics.hd_dep_count = 2 and household_demographics.hd_vehicle_count<=2+2) or
          (household_demographics.hd_dep_count = 4 and household_demographics.hd_vehicle_count<=4+2))
     and store.s_store_name = 'ese') s3,
 (select count(*) h10_to_10_30
 from store_sales, household_demographics , time_dim, store
 where ss_sold_time_sk = time_dim.t_time_sk
     and ss_hdemo_sk = household_demographics.hd_demo_sk
     and ss_store_sk = s_store_sk
     and time_dim.t_hour = 10 
     and time_dim.t_minute < 30
     and ((household_demographics.hd_dep_count = 0 and household_demographics.hd_vehicle_count<=0+2) or
          (household_demographics.hd_dep_count = 2 and household_demographics.hd_vehicle_count<=2+2) or
          (household_demographics.hd_dep_count = 4 and household_demographics.hd_vehicle_count<=4+2))
     and store.s_store_name = 'ese') s4,
 (select count(*) h10_30_to_11
 from store_sales, household_demographics , time_dim, store
 where ss_sold_time_sk = time_dim.t_time_sk
     and ss_hdemo_sk = household_demographics.hd_demo_sk
     and ss_store_sk = s_store_sk
     and time_dim.t_hour = 10 
     and time_dim.t_minute >= 30
     and ((household_demographics.hd_dep_count = 0 and household_demographics.hd_vehicle_count<=0+2) or
          (household_demographics.hd_dep_count = 2 and household_demographics.hd_vehicle_count<=2+2) or
          (household_demographics.hd_dep_count = 4 and household_demographics.hd_vehicle_count<=4+2))
     and store.s_store_name = 'ese') s5,
 (select count(*) h11_to_11_30
 from store_sales, household_demographics , time_dim, store
 where ss_sold_time_sk = time_dim.t_time_sk
     and ss_hdemo_sk = household_demographics.hd_demo_sk
     and ss_store_sk = s_store_sk 
     and time_dim.t_hour = 11
     and time_dim.t_minute < 30
     and ((household_demographics.hd_dep_count = 0 and household_demographics.hd_vehicle_count<=0+2) or
          (household_demographics.hd_dep_count = 2 and household_demographics.hd_vehicle_count<=2+2) or
          (household_demographics.hd_dep_count = 4 and household_demographics.hd_vehicle_count<=4+2))
     and store.s_store_name = 'ese') s6,
 (select count(*) h11_30_to_12
 from store_sales, household_demographics , time_dim, store
 where ss_sold_time_sk = time_dim.t_time_sk
     and ss_hdemo_sk = household_demographics.hd_demo_sk
     and ss_store_sk = s_store_sk
     and time_dim.t_hour = 11
     and time_dim.t_minute >= 30
     and ((household_demographics.hd_dep_count = 0 and household_demographics.hd_vehicle_count<=0+2) or
          (household_demographics.hd_dep_count = 2 and household_demographics.hd_vehicle_count<=2+2) or
          (household_demographics.hd_dep_count = 4 and household_demographics.hd_vehicle_count<=4+2))
     and store.s_store_name = 'ese') s7,
 (select count(*) h12_to_12_30
 from store_sales, household_demographics , time_dim, store
 where ss_sold_time_sk = time_dim.t_time_sk
     and ss_hdemo_sk = household_demographics.hd_demo_sk
     and ss_store_sk = s_store_sk
     and time_dim.t_hour = 12
     and time_dim.t_minute < 30
     and ((household_demographics.hd_dep_count = 0 and household_demographics.hd_vehicle_count<=0+2) or
          (household_demographics.hd_dep_count = 2 and household_demographics.hd_vehicle_count<=2+2) or
          (household_demographics.hd_dep_count = 4 and household_demographics.hd_vehicle_count<=4+2))
     and store.s_store_name = 'ese') s8
;

I apologize for the length... I am not sure what elements of the query causes the failure.

Here are the steps to reproduce:

  1. Download PostgresQL 10.3
  2. Download the PG10 branch
  3. Patch and build:
cd postgresql-10.3
patch -p1 < /home/postgres/pg_query_state/custom_signals.patch
patch -p1 < /home/postgres/pg_query_state/runtime_explain.patch
./configure --prefix=/home/postgres/local/
make -j 2

make install
 
export PATH=$PATH:/home/postgres/local/bin
 
cd /home/postgres/pg_query_state
make install USE_PGXS=1
  1. Enable the extension:
# wherever your postgresql.conf is...
echo "shared_preload_libraries = 'pg_query_state'">> /media/data/pg/postgresql.conf
  1. Start the DB and load TPC-DS data (scale factor of 1GB will work)
  2. Start a PSQL session, get the backend PID. Use this session to execute the above query over and over again
  3. In another PSQL section, run select * from pg_query_state($PID); while the query is running. After 1-5 tries, Postgres has a segfault.

Is there a way I can build PG to provide more useful info about what is going on? Are there log files I can provide?

@maksm90
Copy link
Collaborator

maksm90 commented May 9, 2018

Hi @RyanMarcus ! Thanks for your bug report.

You could catch a core dump and provide stack trace of failed postgres backend https://wiki.postgresql.org/wiki/Getting_a_stack_trace_of_a_running_PostgreSQL_backend_on_Linux/BSD#Getting_a_trace_from_a_randomly_crashing_backend .

Anyway I'll try to reproduce your case to deal with it.

@RyanMarcus
Copy link
Author

Ok, I've got the coredump. I had to turn my shared_buffer size down to 512MB in order to make it a reasonable size.

I compiled Postgres using --enable-cassert --enable-debug CFLAGS="-fno-omit-frame-pointer". Here is the stack trace from the core dump with GDB:

[root@archlinux vagrant]# coredumpctl gdb 1562
           PID: 1562 (postgres)
           UID: 1001 (postgres)
           GID: 1001 (postgres)
        Signal: 11 (SEGV)
     Timestamp: Mon 2018-05-14 16:15:00 UTC (1min 40s ago)
  Command Line: postgres: tpcds tpcds 10.0.2.2(42330) SELECT
    Executable: /home/postgres/local/bin/postgres
 Control Group: /user.slice/user-1000.slice/session-c8.scope
          Unit: session-c8.scope
         Slice: user-1000.slice
       Session: c8
     Owner UID: 1000 (vagrant)
       Boot ID: 9030a36a31884eae9d35574b5df78f0f
    Machine ID: e0ba44ce944440869a9a9bde320000a0
      Hostname: archlinux
       Storage: /var/lib/systemd/coredump/core.postgres.1001.9030a36a31884eae9d35574b5df78f0f.1562.1526314500000000.lz4
       Message: Process 1562 (postgres) of user 1001 dumped core.
                
                Stack trace of thread 1562:
                #0  0x00007f14e3eb0691 n/a (/home/postgres/local/lib/pg_query_state.so)

GNU gdb (GDB) 8.1
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-pc-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /home/postgres/local/bin/postgres...done.
[New LWP 1562]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib/libthread_db.so.1".
Core was generated by `postgres: tpcds tpcds 10.0.2.2(42330) SELECT       '.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007f14e3eb0691 in pg_query_state () from /home/postgres/local/lib/pg_query_state.so
(gdb) bt
#0  0x00007f14e3eb0691 in pg_query_state () from /home/postgres/local/lib/pg_query_state.so
#1  0x00005618646b2589 in ExecMakeTableFunctionResult ()
#2  0x00005618646be179 in FunctionNext ()
#3  0x00005618646b1a3a in ExecScan ()
#4  0x00005618646b0298 in ExecProcNodeInstr ()
#5  0x00005618646abb1b in standard_ExecutorRun ()
#6  0x00007f14e3eaf86d in qs_ExecutorRun () from /home/postgres/local/lib/pg_query_state.so
#7  0x00005618647e0ecb in PortalRunSelect ()
#8  0x00005618647e232e in PortalRun ()
#9  0x00005618647de301 in exec_simple_query ()
#10 0x00005618647dfad2 in PostgresMain ()
#11 0x000056186477095c in ServerLoop ()
#12 0x00005618647717d5 in PostmasterMain ()
#13 0x00005618645273f6 in main ()
(gdb) quit

I see that while I built PG with debug symbols, it doesn't look like I build the pg_query_state module with debug symbols, so I will attempt to do that now and get another coredump...

Also, I am used to GDB backtraces including source file names and line numbers... any idea why I'm not seeing those here? Do I need additional compile options?

Let me know if it would be useful to pack up my postgres executable and the coredump for you to examine.

@RyanMarcus
Copy link
Author

Here is a backtrace after rebuilding the extension.

[root@archlinux vagrant]# coredumpctl gdb 2400
           PID: 2400 (postgres)
           UID: 1001 (postgres)
           GID: 1001 (postgres)
        Signal: 11 (SEGV)
     Timestamp: Mon 2018-05-14 16:30:23 UTC (29s ago)
  Command Line: postgres: tpcds tpcds 10.0.2.2(42668) SELECT
    Executable: /home/postgres/local/bin/postgres
 Control Group: /user.slice/user-1000.slice/session-c8.scope
          Unit: session-c8.scope
         Slice: user-1000.slice
       Session: c8
     Owner UID: 1000 (vagrant)
       Boot ID: 9030a36a31884eae9d35574b5df78f0f
    Machine ID: e0ba44ce944440869a9a9bde320000a0
      Hostname: archlinux
       Storage: /var/lib/systemd/coredump/core.postgres.1001.9030a36a31884eae9d35574b5df78f0f.2400.1526315423000000.lz4
       Message: Process 2400 (postgres) of user 1001 dumped core.
                
                Stack trace of thread 2400:
                #0  0x00007f5172f1cfa5 n/a (/home/postgres/local/lib/pg_query_state.so)
                #1  0x000055d15ea19589 n/a (/home/postgres/local/bin/postgres)

GNU gdb (GDB) 8.1
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-pc-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /home/postgres/local/bin/postgres...done.
[New LWP 2400]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib/libthread_db.so.1".
Core was generated by `postgres: tpcds tpcds 10.0.2.2(42668) SELECT       '.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  pg_query_state (fcinfo=0x7ffe02334eb0) at pg_query_state.c:643
643			stack_frame	*frame = (stack_frame *) lfirst(p_state->frame_cursor);
(gdb) bt
#0  pg_query_state (fcinfo=0x7ffe02334eb0) at pg_query_state.c:643
#1  0x000055d15ea19589 in ExecMakeTableFunctionResult ()
#2  0x000055d15ea25179 in FunctionNext ()
#3  0x000055d15ea18a3a in ExecScan ()
#4  0x000055d15ea17298 in ExecProcNodeInstr ()
#5  0x000055d15ea12b1b in standard_ExecutorRun ()
#6  0x00007f5172f1bf97 in qs_ExecutorRun (queryDesc=0x55d1600fa068, direction=ForwardScanDirection, count=0, 
    execute_once=1 '\001') at pg_query_state.c:317
#7  0x000055d15eb47ecb in PortalRunSelect ()
#8  0x000055d15eb4932e in PortalRun ()
#9  0x000055d15eb45301 in exec_simple_query ()
#10 0x000055d15eb46ad2 in PostgresMain ()
#11 0x000055d15ead795c in ServerLoop ()
#12 0x000055d15ead87d5 in PostmasterMain ()
#13 0x000055d15e88e3f6 in main ()

@maksm90
Copy link
Collaborator

maksm90 commented May 15, 2018

Thanks for the share of stacktrace.

Also, I am used to GDB backtraces including source file names and line numbers... any idea why I'm not seeing those here? Do I need additional compile options?

You could rewrite CFLAGS as follows CFLAGS='-ggdb -O0 -g3 -fno-omit-frame-pointer'

Here is a backtrace after rebuilding the extension.

In your presented stacktrace it's apparent that the problem lies on the side of client who calls pg_query_state function while extracting state from external backend. I'll figure out with this issue.

@RyanMarcus
Copy link
Author

Thanks for looking into this! When I next get time, I'll attempt to prepare a VM that I can send to you so you can reproduce it for yourself.

I will also try those additional options, and I'll reproduce the problem several times to ensure the traces are similar. Would a valgrind log be useful? If so, I can do that too.

@maksm90
Copy link
Collaborator

maksm90 commented May 16, 2018

I would like you share the initial sql script to prepare DB to execute your failed query. It would take too much time if I try to prepare DB from specification TPC-DS myself.

Would a valgrind log be useful? If so, I can do that too.

It's not necessary yet. Thanks for the offer to do it.

@ildus ildus assigned ildus and unassigned ildus Oct 25, 2018
@ildus ildus added the bug label Oct 25, 2018
@ildus ildus assigned ildus and CherkashinSergey and unassigned ildus Oct 29, 2018
@ildus
Copy link
Collaborator

ildus commented Nov 1, 2018

@RyanMarcus should be fixed in latest release, thank you for this bug report.

@ildus ildus closed this as completed Nov 1, 2018
@ildus ildus reopened this Nov 2, 2018
@ildus
Copy link
Collaborator

ildus commented Nov 2, 2018

We found another problems here, so I reopened the issue.

@ololobus
Copy link
Contributor

It seems that all bugs related to this issue have been fixed a while ago. Finally, we have recently merged a PR #17, which integrated pg_query_state testing under TPC-DS load, so I am closing this issue.

@RyanMarcus, thanks again for your report! Feel free to rise new issues if needed :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

5 participants