PostgreSQL loading practices
Feedback welcomed on my approach to loading a PostgreSQL database with pwalk output. Here is what I'm doing.
For starters, I found that I had to postprocess pwalk output before I could run
copy pwalk (${PWALK_COLUMN_NAMES}) from stdin csv header escape \'\\\'
More specifically, I preprocess with this bash function/perl one-liner:
function pppwalk {
## PURPOSE: pre-process pwalk output for postgresql csv:
## - Parse csv into columns on-the-cheap using a Perl negative lookbehind assertion
##   so as not to split on escaped commas embedded in filenames (which, alas, exist).
##   PERHAPS no longer needed? cf. 2.6.4 "postgres CSV fixes"
## - Filter out rows without the expected number of columns, reporting each as an anomaly
## - Convert unix epochs to timestamps
perl -F'(?<!\\),' -lane '
if (14 ne $#F) {warn "PPPWALK\t$.\tSKIPPED\t$#F columns\tPossible commas in filename?\t$_"; next};
@F[10..12] = map {scalar localtime $_} @F[10..12]; $_=join(q{,},@F);
print;
' "$@"
}
export -f pppwalk
I'd love to get rid of this filter, and/or I welcome any suggestions for improvements.
Data is getting loaded into:
CREATE TABLE pwalk (
root character varying NOT NULL,
inode bigint NOT NULL,
"parentInode" bigint NOT NULL,
"directoryDepth" int NOT NULL,
filename character varying NOT NULL,
"fileExtension" character varying,
"UID" bigint NOT NULL,
"GID" bigint NOT NULL,
st_size bigint NOT NULL,
st_blocks bigint NOT NULL,
st_mode bigint NOT NULL,
atime timestamp with time zone NOT NULL,
mtime timestamp with time zone NOT NULL,
ctime timestamp with time zone NOT NULL,
file_count bigint NOT NULL,
size_sum bigint NOT NULL,
--
basename character varying NOT NULL DEFAULT '', -- basename(filename,'\.'||"fileExtension",'/'::bpchar),
dirname character varying NOT NULL DEFAULT '' -- DEFAULT dirname(filename,'/'::bpchar)
);
With basename and dirname being set on load by triggers:
CREATE OR REPLACE FUNCTION pwalk_before_iu_function() RETURNS trigger AS $$
BEGIN
  NEW.basename := basename(NEW.filename::text,'\.'||NEW."fileExtension"::text,'/'::bpchar);
  NEW.dirname  := dirname(NEW.filename,'/'::bpchar);
  RETURN NEW;
END;
$$ language plpgsql;
CREATE TRIGGER pwalk_before_iu_trigger
BEFORE INSERT OR UPDATE OF filename ON pwalk
FOR EACH ROW
EXECUTE PROCEDURE pwalk_before_iu_function();
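The basename() and dirname() calls above are not built-in PostgreSQL functions, so they appear to be user-defined helpers not shown in the post. A minimal sketch of what they might look like, purely as an assumption about their signatures and behavior (NULL fileExtension handling omitted):

```sql
-- Hypothetical sketch: path helpers assumed by the trigger above.
-- dirname(path, sep): everything before the last separator, or '' if there is none.
CREATE OR REPLACE FUNCTION dirname(path text, sep bpchar) RETURNS text AS $$
  SELECT CASE
           WHEN strpos(path, sep::text) = 0 THEN ''
           ELSE regexp_replace(path, sep::text || '[^' || sep::text || ']*$', '')
         END;
$$ LANGUAGE sql IMMUTABLE;

-- basename(path, ext_pattern, sep): last path component with the extension pattern stripped.
CREATE OR REPLACE FUNCTION basename(path text, ext_pattern text, sep bpchar) RETURNS text AS $$
  SELECT regexp_replace(
           regexp_replace(path, '^.*' || sep::text, ''),  -- drop the directory part
           ext_pattern || '$', '');                       -- drop the trailing extension pattern
$$ LANGUAGE sql IMMUTABLE;
```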
What do you think?
Recently I've been experimenting with loading an ltree column with the parsed filename. The filename has to be sanitized first, and the ltree index is pretty big, but it sure provides for some fast and expressive queries.
alter table pwalk drop if exists filename_ltree;
alter table pwalk add filename_ltree ltree;
update pwalk
set filename_ltree = text2ltree(regexp_replace(TRIM(BOTH '.' FROM regexp_replace(REPLACE(REPLACE(filename,'.','_'),'/','.'),'[^[:alnum:]_.]','','g')),'(\.){2,}','.','g')) ;
CREATE INDEX IF NOT EXISTS pwalk_idx_gist_filename_ltree ON pwalk USING gist (filename_ltree);
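For example (the label path below is hypothetical, not from the thread), subtree questions become a single GiST-indexed predicate:

```sql
-- Requires the ltree extension: CREATE EXTENSION IF NOT EXISTS ltree;
-- Everything at or below /home/alice/project1, after the sanitization above
-- ('/' becomes '.', '.' becomes '_', stray characters and duplicate dots removed):
SELECT filename, st_size
FROM   pwalk
WHERE  filename_ltree <@ 'home.alice.project1'::ltree;

-- Pattern-style matching with lquery also works, e.g. any path containing a 'scratch' label:
SELECT count(*) AS files, sum(st_size) AS bytes
FROM   pwalk
WHERE  filename_ltree ~ '*.scratch.*'::lquery;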
I'm also exploring loading a jsonb column (sketched below) with project/science-related metadata extracted from:
- YAML, where the filename is project.yml (a convention some of our researchers use)
- regular-expression parsing of pathnames (according to a few different conventions used by various cores)
- (coming soon) selected YAML front-matter from markdown files (.md and .Rmd)
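A minimal sketch of that jsonb idea; the column name, index name, and metadata keys here are assumptions for illustration, not the actual schema:

```sql
-- Hypothetical jsonb column for per-project metadata pulled from project.yml etc.
ALTER TABLE pwalk ADD COLUMN IF NOT EXISTS project_meta jsonb;
CREATE INDEX IF NOT EXISTS pwalk_idx_gin_project_meta ON pwalk USING gin (project_meta);

-- Example lookup with the jsonb containment operator (key/values assumed):
SELECT filename, st_size
FROM   pwalk
WHERE  project_meta @> '{"pi": "jdoe", "core": "genomics"}'::jsonb;
```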
I've also been looking at recursive queries after indexing inode and parent-inode, but don't (yet) have a good use case for them.
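For reference, a recursive walk down from one directory looks something like this; the starting inode is hypothetical, and this assumes a single filesystem so inode/"parentInode" pairs are unambiguous:

```sql
-- Hypothetical: count files and bytes in the subtree rooted at inode 12345,
-- assuming indexes on inode and "parentInode". UNION (not UNION ALL) guards
-- against cycles such as a root directory whose parent is itself.
WITH RECURSIVE subtree AS (
    SELECT inode, "parentInode", filename, st_size
    FROM   pwalk
    WHERE  inode = 12345              -- hypothetical starting directory
  UNION
    SELECT p.inode, p."parentInode", p.filename, p.st_size
    FROM   pwalk p
    JOIN   subtree s ON p."parentInode" = s.inode
)
SELECT count(*) AS file_count, sum(st_size) AS total_bytes
FROM   subtree;
```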
Our code is all bash, and messy, and here.
Our methodologies:
In the DB I've kept things simple by using the TEXT data type everywhere. Also, I define times as DOUBLE PRECISION and convert to timestamps in the queries. I see now that that is stupid, as pwalk is outputting long ints for times. Gotta change that!
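A minimal example of that query-time conversion, assuming the epoch seconds live in a column named atime on a table named pwalk_tbl (both names assumed):

```sql
-- Convert stored epoch seconds to timestamptz at query time.
SELECT filename, to_timestamp(atime) AS atime_ts
FROM   pwalk_tbl
LIMIT  10;
```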
I am also adding two columns: owner and fs_id. Our namespace crosses two file systems, so we have to deal with duplicate inodes. Each entry is keyed by fs_id (server+export) and inode. We have folder owners that have very little to do with UID/GID. I supply these values at the beginning of each crawl.
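A sketch of those two extra columns and the (fs_id, inode) keying described above; the table name, types, and index name are assumptions:

```sql
-- Hypothetical: add the crawl-supplied columns and index each entry by filesystem + inode.
ALTER TABLE pwalk_tbl ADD COLUMN IF NOT EXISTS fs_id text;  -- server + export
ALTER TABLE pwalk_tbl ADD COLUMN IF NOT EXISTS owner text;  -- folder owner (not UID/GID)
CREATE INDEX IF NOT EXISTS pwalk_tbl_fs_id_inode_idx ON pwalk_tbl (fs_id, inode);
```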
We are using our slurm cluster to crawl. Each "top level" folder is run as a separate job, and then pwalk parallelizes from there. Probably less efficient for very small/empty folders.
For the commands... I am using csvquote and other tools like this:
csvquote pwalk_output.csv |
uconv -s -i |
sort -t, -k 4,4 -u |
awk -F, -v fs_id="$fs_id" -v owner="$owner" '{print fs_id","owner","$0}' |
csvquote -u |
psql $db_conn_str -c "COPY pwalk_tbl(fs_id,owner,${PWALK_COLS}) FROM STDIN WITH csv ESCAPE '\'"
I am using csvquote to allow awk to handle the columns, and uconv to remove invalid character bytes. I need to do the sort by filename as the uconv removal has resulted in identically named files, and I want to use filename as a primary key (yes, we just lose the second file).
I am not doing any triggers as we are trying to streamline the import as much as possible. We are currently crawling about a petabyte in just over 500 million files each night. Takes 4-8 hours depending.
I believe the binary import will work faster, and should just pass invalid character bytes through, eliminating the need for all of the above. As it is, the psql COPY command seems to choke on the invalid bytes.
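On the SQL side, binary COPY is just a different FORMAT option; the work is in producing the binary stream, which generally means a client library rather than a text pipeline. A hypothetical invocation (column list omitted for brevity):

```sql
-- Hypothetical: load a stream already encoded in PostgreSQL's binary COPY format.
COPY pwalk_tbl FROM STDIN WITH (FORMAT binary);
```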
Next steps for us include:
- folder hinting - I have a long tail of about 8 folders that take more than 50% of the time, and many small or empty folders that could be run in the same job
- partition the table - by owner probably, with COPYs going to partitions, not the parent table (see the sketch after this list)
- postgreSQL binary format - for speed and to eliminate the csvquote/sort/awk/csvquote pipeline above
- tree - recursive queries on inode=parent_inode work, but are slow - also want to store foldernames once only (basename and dirname may be what pwalk should output?)
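A minimal sketch of the partitioning idea from the list above, using declarative LIST partitioning (PostgreSQL 10+); the table name, trimmed column list (TEXT everywhere, per the note above), and owner values are hypothetical:

```sql
-- Hypothetical: parent table partitioned by owner; COPY can target a partition directly.
CREATE TABLE pwalk_part (
    fs_id    text NOT NULL,
    owner    text NOT NULL,
    inode    text NOT NULL,
    filename text NOT NULL,
    st_size  text NOT NULL
) PARTITION BY LIST (owner);

CREATE TABLE pwalk_part_alice PARTITION OF pwalk_part FOR VALUES IN ('alice');
CREATE TABLE pwalk_part_bob   PARTITION OF pwalk_part FOR VALUES IN ('bob');

-- Each owner's nightly crawl then loads straight into its own partition:
-- COPY pwalk_part_alice FROM STDIN WITH (FORMAT csv, ESCAPE '\');
```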