author	Nick White <hg@njw.me.uk>	2011-08-07 14:21:47 +0100
committer	Nick White <hg@njw.me.uk>	2011-08-07 14:21:47 +0100
commit	ff292deb12c9def19ec3b9d624bc29f396eb2726 (patch)
tree	936aab50557453accb03cd83e102f83b66739b04
parent	101687cd7a85cb83dea95386ee6cdd6259c726c1 (diff)
Update documentation, including add README
-rw-r--r--	README	31
-rw-r--r--	TODO	23
-rwxr-xr-x	getgfailed.sh	13
-rwxr-xr-x	getgmissing.sh	5
4 files changed, 57 insertions, 15 deletions
diff --git a/README b/README
new file mode 100644
index 0000000..d76c8a0
--- /dev/null
+++ b/README
@@ -0,0 +1,31 @@
+# getxbook
+
+getxbook is a collection of programs to download books from
+several websites.
+
+* getgbook - downloads from google books' "book preview"
+* getabook - downloads from amazon's "look inside the book"
+ (coming soon)
+* getbnbook - downloads from barnes and noble's "book viewer"
+ (coming soon)
+
+## why
+
+Online book websites commodify reading. They are engineered not
+around reading, but around surveillance. It is not merely the
+selection of books that is recorded, but exactly what is read, when,
+and for how long. Forever. And it is linked to all other information
+the website holds about you (which in the case of Google and Amazon
+is likely a great deal).
+
+Reading books is one of the holiest acts in our society. Limiting
+our access and monitoring it closely is a grave act. It is
+censorship, whether done for profit or for any other reason. And
+it is dangerous.
+
+The getxbook programs download books anonymously. Your IP address
+will still be logged (use [torify or torsocks](https://www.torproject.org)
+to stop this), but your reading won't be automatically linked to
+other information websites have about you, as no existing cookies
+are used. Once the book is downloaded, you can read it as you
+please without any further prying.
diff --git a/TODO b/TODO
index 6aaf198..7023e06 100644
--- a/TODO
+++ b/TODO
@@ -1,13 +1,5 @@
-got a stack trace when a connection seemingly timed out (after around 30 successful calls to -p)
-
-getgmissing doesn't work brilliantly with preview books as it will always get 1st ~40 pages then get ip block. getgfailed will do a better job
-
-list all binaries in readme and what they do
-
# other utils
-getgbooktxt (different program as it gets from html pages, which getgbook doesn't any more)
-
getabook
getbnbook
@@ -24,12 +16,13 @@ write some little tests
## getgbook
-have file extension be determined by file type, rather than assuming png
-
-think about whether default functionality should be dl all, rather than -a
+Note: looks like google allows around 3 page requests per cookie session, and exactly 31 per ip per [some time period > 18 hours]. If I knew the time period, could make a script that gets maybe 20 pages, waits for some time period, then continues.
-to be fast and efficient it's best to crank through all the json 1st, filling in an array of page structs as we go
- this requires slightly fuller json support
- could consider making a json reading module, ala confoo, to make ad-hoc memory structures from json
+got a stack trace when a connection seemingly timed out (after around 30 successful calls to -p). enable core dumping and re-run (note have done a small amount of hardening since, but bug is probably still there).
-Note: looks like google allows around 3 page requests per cookie session, and exactly 31 per ip per [some time period]. If I knew the time period, could make a script that gets all it can, gets a list of failures, waits, then tries failures, etc. Note these would also have to stop at some point; some pages just aren't available
+when run from the scripts (getgfailed.sh and getgmissing.sh), it refuses to exit on Ctrl-C, and creates 2 processes, which may be killed independently. not related to torify
+	multiple processes seem to be no bother
+	the real problem is that the loop continues rather than breaking on ctrl-c; e.g. pressing it enough times to end the loop works.
+	this is because ctrl-c on a program which is using a pipe continues the loop rather than breaking it. using || break works, but breaks real functionality in the scripts
+	see seq 5|while read i; do echo run $i; echo a|sleep 5||break; done vs seq 5|while read i; do echo run $i; echo a|sleep 5; done
+	trapping signals doesn't help; the trap is only reached on the last iteration, i.e. when it would exit the script anyway
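The pipe-and-loop behaviour described in the TODO above can be reproduced without Ctrl-C: a command that fails inside a pipe does not end the loop unless `|| break` is added. A minimal sketch (the `false` stand-in and the three-iteration count are illustrative, not taken from getgbook):

```shell
# Without || break the loop runs all iterations even though the
# piped command fails every time; with || break it stops after one.
runs_without=$(seq 3 | while read i; do echo "run $i"; echo a | false; done | wc -l)
runs_with=$(seq 3 | while read i; do echo "run $i"; echo a | false || break; done | wc -l)
echo "without || break: $runs_without iterations"
echo "with || break: $runs_with iterations"
```

This is why, as noted above, `|| break` makes Ctrl-C work but also aborts the loop on any ordinary page failure, breaking the scripts' real functionality.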
diff --git a/getgfailed.sh b/getgfailed.sh
new file mode 100755
index 0000000..9ecd9e3
--- /dev/null
+++ b/getgfailed.sh
@@ -0,0 +1,13 @@
+#!/bin/sh
+# See COPYING file for copyright and license details.
+#
+# Tries to download each page listed in a fail log (from a
+# previous run of getgbook -a bookid > faillog)
+
+test $# -ne 2 && echo "usage: $0 bookid faillog" && exit
+
+shuf < $2 | head -n 5 | while read i
+do
+ code=`echo $i|awk '{print $1}'`
+ echo $code | getgbook $1
+done
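The awk stage in getgfailed.sh just takes the first whitespace-separated field of each faillog line as the page code. A standalone sketch of that pipeline with made-up input (the `PA…` codes are placeholders, not real getgbook output):

```shell
# Sample three fake faillog lines, pick 2 at random, and print the
# page code (first field) from each, as getgfailed.sh does.
printf 'PA1 404\nPA2 404\nPA3 404\n' | shuf | head -n 2 |
while read i
do
	code=$(echo "$i" | awk '{print $1}')
	echo "$code"
done
```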
diff --git a/getgmissing.sh b/getgmissing.sh
index 5d6ee18..e8198d8 100755
--- a/getgmissing.sh
+++ b/getgmissing.sh
@@ -1,5 +1,10 @@
#!/bin/sh
# See COPYING file for copyright and license details.
+#
+# This gets any pages listed as available that have not been
+# downloaded. Note that at present this is not too useful, as
+# an IP block will be imposed after the first x pages each run,
+# just for checking availability.
test $# -ne 1 && echo "usage: $0 bookid" && exit
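The comment above, together with the rate-limit note in the TODO, suggests fetching in batches and sleeping between them. A hedged sketch of that idea (`fetch_page` is a stand-in for `echo $code | getgbook $bookid`; the batch size, wait, and page codes are guesses, not measured limits):

```shell
#!/bin/sh
BATCH=20
WAIT=1	# in practice this would be >18 hours, per the TODO

# Stand-in for: echo "$1" | getgbook "$bookid"
fetch_page() {
	echo "fetched $1"
}

n=0
for code in PA1 PA2 PA3	# placeholder page codes
do
	fetch_page "$code"
	n=$((n + 1))
	if [ $((n % BATCH)) -eq 0 ]; then
		sleep "$WAIT"
	fi
done
```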