From ff292deb12c9def19ec3b9d624bc29f396eb2726 Mon Sep 17 00:00:00 2001
From: Nick White
Date: Sun, 7 Aug 2011 14:21:47 +0100
Subject: Update documentation, including add README

---
 README         | 31 +++++++++++++++++++++++++++++++
 TODO           | 23 ++++++++---------------
 getgfailed.sh  | 13 +++++++++++++
 getgmissing.sh |  5 +++++
 4 files changed, 57 insertions(+), 15 deletions(-)
 create mode 100644 README
 create mode 100755 getgfailed.sh

diff --git a/README b/README
new file mode 100644
index 0000000..d76c8a0
--- /dev/null
+++ b/README
@@ -0,0 +1,31 @@
+# getxbook
+
+getxbook is a collection of programs to download books from
+several websites.
+
+* getgbook - downloads from google books' "book preview"
+* getabook - downloads from amazon's "look inside the book"
+  (coming soon)
+* getbnbook - downloads from barnes and noble's "book viewer"
+  (coming soon)
+
+## why
+
+Online book websites commodify reading. They are engineered not
+around reading, but around surveillance. It is not merely the
+selection of book that is recorded, but exactly what is read, when,
+and for how long. Forever. And it is linked to all other information
+the website holds about you (which in the case of Google and Amazon
+is likely a great deal).
+
+Reading books is one of the holiest acts in our society. Limiting
+our access, and monitoring it closely is a grave act. It is
+censorship, whether done for profit or for any other reason. And
+it is dangerous.
+
+The getxbook programs download books anonymously. Your IP address
+will still be logged (use [torify or torsocks](https://www.torproject.org)
+to stop this), but your reading won't be automatically linked to
+other information websites have about you, as no existing cookies
+are used. Once the book is downloaded, you can read it as you
+please without any further prying.
diff --git a/TODO b/TODO
index 6aaf198..7023e06 100644
--- a/TODO
+++ b/TODO
@@ -1,13 +1,5 @@
-got a stack trace when a connection seemingly timed out (after around 30 successful calls to -p)
-
-getgmissing doesn't work brilliantly with preview books as it will always get 1st ~40 pages then get ip block. getgfailed will do a better job
-
-list all binaries in readme and what they do
-
 # other utils
 
-getgbooktxt (different program as it gets from html pages, which getgbook doesn't any more)
-
 getabook
 
 getbnbook
@@ -24,12 +16,13 @@ write some little tests
 
 ## getgbook
 
-have file extension be determined by file type, rather than assuming png
-
-think about whether default functionality should be dl all, rather than -a
+Note: looks like google allows around 3 page requests per cookie session, and exactly 31 per ip per [some time period > 18 hours]. If I knew the time period, could make a script that gets maybe 20 pages, waits for some time period, then continues.
 
-to be fast and efficient it's best to crank through all the json 1st, filling in an array of page structs as we go
-	this requires slightly fuller json support
-	could consider making a json reading module, ala confoo, to make ad-hoc memory structures from json
+got a stack trace when a connection seemingly timed out (after around 30 successful calls to -p). enable core dumping and re-run (note have done a small amount of hardening since, but bug is probably still there).
 
-Note: looks like google allows around 3 page requests per cookie session, and exactly 31 per ip per [some time period]. If I knew the time period, could make a script that gets all it can, gets a list of failures, waits, then tries failures, etc. Note these would also have to stop at some point; some pages just aren't available
+running it from scripts (getgfailed.sh and getgmissing.sh), refuses to Ctrl-C exit, and creates 2 processes, which may be killed independently. not related to torify
+	multiple processes seems to be no bother
+	ctrl-c seems to be the loop continuing rather than breaking on ctrl-c; e.g. pressing it enough times to end loop works.
+	due to ctrl-c on a program which is using a pipe continues the loop rather than breaking it. using || break works, but breaks real functionality in the scripts
+		see seq 5|while read i; do echo run $i; echo a|sleep 5||break; done vs seq 5|while read i; do echo run $i; echo a|sleep 5; done
+	trapping signals doesn't help; the trap is only reached on last iteration; e.g. when it will exit the script anyway
diff --git a/getgfailed.sh b/getgfailed.sh
new file mode 100755
index 0000000..9ecd9e3
--- /dev/null
+++ b/getgfailed.sh
@@ -0,0 +1,13 @@
+#!/bin/sh
+# See COPYING file for copyright and license details.
+#
+# Tries to download each page listed in a fail log (from a
+# previous run of getgbook -a bookid > faillog)
+
+test $# -ne 2 && echo "usage: $0 bookid faillog" && exit
+
+sort < "$2" | shuf | head -n 5 | while read i
+do
+	code=`echo $i|awk '{print $1}'`
+	echo $code | getgbook "$1"
+done
diff --git a/getgmissing.sh b/getgmissing.sh
index 5d6ee18..e8198d8 100755
--- a/getgmissing.sh
+++ b/getgmissing.sh
@@ -1,5 +1,10 @@
 #!/bin/sh
 # See COPYING file for copyright and license details.
+#
+# This gets any pages listed as available that have not been
+# downloaded. Note that at present this is not too useful, as
+# an IP block will be imposed after the first x pages each run,
+# just for checking availability.
 
 test $# -ne 1 && echo "usage: $0 bookid" && exit
 
-- 
cgit v1.2.3
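
The TODO note about fetching around 20 pages, waiting, then continuing could be sketched roughly as below. This is not part of the patch and only illustrates the idea: the function name `fetch_in_batches`, its two arguments, and the `FETCH` variable are all hypothetical, and the real wait period is unknown (the TODO only says it is longer than 18 hours).

```sh
#!/bin/sh
# Sketch only, under stated assumptions: reads page codes from stdin and
# hands them to a fetch command in batches, sleeping between batches.
# FETCH is a hypothetical knob; it defaults to cat so the sketch runs
# without getgbook installed. With getgbook it might be "getgbook bookid",
# since getgfailed.sh pipes each code to getgbook on stdin.
FETCH=${FETCH:-cat}

# usage: fetch_in_batches batchsize pauseseconds < codes
fetch_in_batches() {
	batch=$1
	pause=$2
	n=0
	while read -r code
	do
		# Pause before starting each new batch, but not before the first.
		if [ "$n" -gt 0 ] && [ $((n % batch)) -eq 0 ]; then
			sleep "$pause"
		fi
		# The fetch command reads the code from this inner pipe,
		# not from the loop's stdin.
		printf '%s\n' "$code" | $FETCH
		n=$((n + 1))
	done
}
```

For example, `awk '{print $1}' faillog | fetch_in_batches 20 86400` would pass 20 codes at a time to `$FETCH` with a day between batches. The Ctrl-C behaviour described in the TODO would presumably still apply to a loop like this, because it also reads from a pipe.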