Question: Download ALL the wallpapers from wallbase #13
Also, could I do this, or execute the script 10 times and play with the ranges I mentioned above?
The script does skip not-found images, and it should work with a wp range. Running it multiple times works if you run it in different directories; if you run it in the same directory, the temp files would override themselves or get suffixes like tmp.1 and so on. But I'm not sure you really want to download all the wallpapers, there are over 3 million ;)
@macearl Yeah, I optimized the script a bit (brought it down to less than 200 lines) and I'm running it with GNU parallel. I split the 3 million image range into 60 scripts (each script downloads 50k images) and they are all executing at once. There are extra optimizations that could be done, like dropping wget for a faster downloader (axel or aria2). The URL-finding section could also be optimized (I would like to have it reversed so I can run another 60 scripts going backwards), and we could use parallel to task wget X times over to download quicker without running the entire URL-generating section. Also, a shared cache file (the downloaded.txt file) would aid this substantially if we are going to use parallel and task wget with fetching stuff.
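The range-splitting described above can be sketched roughly like this (the 60-block layout comes from the numbers in this thread; `wallbase.sh` is a hypothetical name for the download script):

```shell
#!/bin/bash
# Sketch: carve the ~3 million wallpaper IDs into 60 blocks of 50k,
# printing one "start stop" pair per line; each pair would seed one
# of the forward scripts described above.
print_blocks() {
    local total=3000000 block=50000 start
    for (( start=1; start<=total; start+=block )); do
        echo "$start $(( start + block - 1 ))"
    done
}

# Hypothetical usage with GNU parallel, one job per range:
#   print_blocks | parallel --colsep ' ' ./wallbase.sh {1} {2}
print_blocks | head -n 2
```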
Well, it certainly sounds interesting. I never used GNU parallel or any other download managers; I would like to take a look at your version, if you don't mind sharing it ;)
Sure thing, I'll write up a little text explaining what I did also. |
@macearl Here is the script:

```shell
#!/bin/bash
# See Section 13
# Enter your Username
USER="user"
# Enter your password
PASS="pass"
PURITY=111
# For accepted values of topic see Section 4
TOPIC=23
# For download location see Section 7
LOCATION=/home/wallbase/dl/10
# For Types see Section 9
TYPE=1
# See Section 15
CATEGORIZE=1
# See Section 16
WP_RANGE_START=1800001
WP_RANGE_STOP=1850000

# If wished, categorize the downloads by their
# PURITY (nsfw, sfw, sketchy) and TOPIC (manga, hd, general)
if [ "$CATEGORIZE" -gt 0 ]; then
    LOCATION="$LOCATION/$PURITY/$TOPIC"
fi

if [ ! -d "$LOCATION" ]; then
    mkdir -p "$LOCATION"
fi
cd "$LOCATION" || exit 1

login() {
    # Check parameters -> if not ok, print error and exit script
    if [ $# -lt 2 ] || [ -z "$1" ] || [ -z "$2" ]; then
        echo "Please check the needed Options for NSFW/New Content (username and password)"
        echo ""
        echo "For further Information see Section 13"
        echo ""
        echo "Press any key to exit"
        read -r
        exit 1
    fi
    nice -n -20 wget --keep-session-cookies --save-cookies=cookies.txt --referer=http://wallbase.cc/home http://wallbase.cc/user/login
    # Cut the csrf token and the ref value out of the login page
    csrf="$(grep 'name="csrf"' login | sed 's .\{44\} ' | sed 's/.\{2\}$//')"
    ref="$(rawurlencode "$(grep 'name="ref"' login | sed 's .\{43\} ' | sed 's/.\{2\}$//')")"
    nice -n -20 wget --load-cookies=cookies.txt --keep-session-cookies --save-cookies=cookies.txt --referer=http://wallbase.cc/user/login --post-data="csrf=$csrf&ref=$ref&username=$USER&password=$PASS" http://wallbase.cc/user/do_login
}

rawurlencode() {
    local string="${1}"
    local strlen=${#string}
    local encoded=""
    for (( pos=0; pos<strlen; pos++ )); do
        c=${string:$pos:1}
        case "$c" in
            [-_.~a-zA-Z0-9] ) o="${c}" ;;
            * ) printf -v o '%%%02x' "'$c" ;;
        esac
        encoded+="${o}"
    done
    echo "${encoded}"
}

# Log in only when it is required (for example to download favourites or nsfw content)
if [ "$PURITY" == 001 ] || [ "$PURITY" == 011 ] || [ "$PURITY" == 111 ] || [ "$TYPE" == 5 ] || [ "$TYPE" == 7 ]; then
    login "$USER" "$PASS"
fi

if [ "$WP_RANGE_STOP" -gt 0 ]; then
    # WP RANGE
    for (( count=WP_RANGE_START; count<=WP_RANGE_STOP; count++ )); do
        # Exact-match the ID so e.g. 180 does not shadow 1800
        if grep -qx "$count" /home/wallbase/download_list 2>/dev/null; then
            echo "File already downloaded!"
        else
            echo "$count" >> /home/wallbase/download_list
            nice -n -20 wget --no-dns-cache -4 --tries=2 --keep-session-cookies --load-cookies=cookies.txt --referer=wallbase.cc http://wallbase.cc/wallpaper/$count
            # Pull the image URL out of the saved page and fetch it
            egrep -o "http://wallpapers.*(png|jpg|gif)" "$count" | nice -n -20 wget --no-dns-cache -4 --tries=2 --keep-session-cookies --load-cookies=cookies.txt --referer=http://wallbase.cc/wallpaper/$count -i -
        fi
    done
else
    echo "Error in WP_RANGE_STOP, please check the variable"
fi

rm -f cookies.txt login do_login
```

This script is the 10th (out of 60) script in the forward series. There are another 60 scripts that count backwards, using this loop header:

```shell
for (( count=WP_RANGE_STOP; count>=WP_RANGE_START; count-- ));
```

All 120 scripts have blocks of 50k wallpapers to download. Also, I took a little long to post this because I wanted to slim it down even more and to work on the shared downloaded-files list. As you can see, all 120 scripts use the same file to check for downloaded files; I like this because I have two sets going against each other, so I don't want to download duplicates.

To execute this I used GNU Parallel working in non-GNU mode:

```shell
nice -n -20 parallel -j 1000 < list
```

Another issue I faced was counting the amount of downloaded wallpapers. To get a rough estimate I created this script (it's suited to my needs, so it might need some changes to use elsewhere):

```shell
normal=$(find /home/wallbase/dl -type f | wc -l)
reverse=$(find /home/wallbase/dlR -type f | wc -l)
num=$(expr "$reverse" + "$normal")
printf '%s\n' "$num"
```

All scripts have permissions set to 755. I also had to change IPs (I'm on a dedicated server and it has some failover IPs) because I noticed timeouts, drops, and other problems. Every 24-48 hours I restart the VM because it maxes its cache memory and slows down even more. Once I restart the process, it takes a few minutes to check all the downloaded wallpapers, and it maxes CPU usage :P

So far I've queried around 400k wallpapers and downloaded some 150k (most are 403 or 404; 403s are marked for deletion), occupying around 70 GB. If you have any questions let me know; if you want to post the counting script and my slimmer version, go ahead.
Also, I'm not sure about this, but I believe it would be best to check whether each download actually succeeded before moving on. We could do this with something like:

```shell
git clone https://github.com/PrinterLUA/FORGOTTENSERVER-ORTS #--quiet
success=$?
if [[ $success -eq 0 ]]; then
    # ... continue only on success
fi
```
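The same exit-status idea applies to the wget calls in the main loop: wget documents exit code 8 for server-issued error responses (which is what a 403/404 page produces), so that case can be skipped permanently while other failures could be retried. A sketch (`fetch_wall` is a hypothetical helper, not part of the script above):

```shell
#!/bin/bash
# Sketch: classify a wget result so 403/404 pages are skipped for
# good while transient network errors could be retried later.
fetch_wall() {
    wget -q --tries=2 "$1"
    case $? in
        0) echo "ok"    ;;  # page downloaded
        8) echo "skip"  ;;  # server said 403/404: marked gone
        *) echo "retry" ;;  # DNS/network trouble: try again later
    esac
}
```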
I updated the counting script to do a few more things:

```shell
normal=$(find /home/wallbase/dl -type f | wc -l)
reverse=$(find /home/wallbase/dlR -type f | wc -l)
queries=$(wc -l /home/wallbase/download_list | awk '{print $1}')
space=$(du -sh /home/wallbase/ | awk '{print $1}')
num=$(expr "$reverse" + "$normal")
printf "Wallpapers so far (not exact): $num\n"
printf "Queries so far: $queries\n"
printf "Used space: $space\n"
```
Since wallbase is on its deathbed, I want to download all the wallpapers. How can I do that with this script? Could I use WP_RANGE_START and WP_RANGE_STOP from 1 to 99999999 or something? (I'm assuming the script will skip pages that return 404/403 errors.)
Thanks,
@sevensins
@Numn
@macearl
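Regarding the 1-to-99999999 idea: in the script posted earlier in this thread, a 403/404 page simply contains no image URL, so the extraction step finds nothing and the loop moves on; a huge range is safe, just slow. A sketch of that extraction step in isolation (`extract_image_url` is a hypothetical wrapper around the egrep from the script):

```shell
#!/bin/bash
# Sketch: pull the full-size image URL out of a saved wallpaper page.
# On an error page there is no match, the output is empty, and the
# downstream `wget -i -` downloads nothing.
extract_image_url() {
    egrep -o "http://wallpapers.*(png|jpg|gif)" "$1"
}
```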