# other utils

getabook

getbnbook

use "" rather than "\0" in headermax

# other todos

use HTTP/1.1 with "Connection: close" header
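
A sketch of what the request could look like; mkreq() is a made-up name, not the real function in the http code, and the header set is only a guess at a minimal one:

	/* sketch only: mkreq() and its signature are assumptions, not the
	 * project's real request-building code */
	#include <stdio.h>

	/* HTTP/1.1 makes the Host header mandatory, and "Connection: close"
	 * asks the server to close the socket after responding, so the
	 * existing read-until-EOF handling still finds the end of the body */
	int mkreq(char *buf, size_t len, const char *host, const char *path)
	{
		int n = snprintf(buf, len,
		    "GET %s HTTP/1.1\r\n"
		    "Host: %s\r\n"
		    "Connection: close\r\n"
		    "\r\n", path, host);
		return (n < 0 || (size_t)n >= len) ? -1 : 0;
	}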

try supporting 3xx in get, if it can be done in a few lines
 by getting Location line, freeing buf, and returning a new
 iteration.
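
A rough sketch of that, assuming a low-level dorequest() that returns the raw response (status line and headers included) in *buf, plus a getheader() helper; both names are invented here, and it ignores redirects that point at a different host:

	/* sketch only: dorequest() and getheader() stand in for whatever the
	 * http code really uses; names and signatures are assumptions */
	#include <stdlib.h>
	#include <string.h>

	int dorequest(const char *host, const char *path, char **buf); /* assumed */
	int getheader(const char *buf, const char *name, char *val, size_t len); /* assumed */

	int get(const char *host, const char *path, char **buf)
	{
		char loc[1024];
		int n = dorequest(host, path, buf);

		/* on a 3xx status, grab the Location line, free the old buffer
		 * and go round once more with the new path; only one redirect
		 * is followed, to avoid loops */
		if (n > 0 && !strncmp(*buf, "HTTP/1.1 3", 10)
		    && getheader(*buf, "Location: ", loc, sizeof loc) == 0) {
			free(*buf);
			n = dorequest(host, loc, buf);
		}
		return n;
	}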

add https support to get
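
One possible way, if a library dependency is acceptable: OpenSSL's TLS client API (1.1 or later, which needs no explicit init call), layered over the existing TCP connect code. dial() below is an invented stand-in for that connect code, and most error handling is left out:

	/* sketch only: dial() is a stand-in for the existing TCP connect code */
	#include <openssl/ssl.h>
	#include <string.h>
	#include <unistd.h>

	int dial(const char *host, const char *port); /* assumed: returns a connected socket */

	int gethttps(const char *host, const char *req, char *buf, size_t len)
	{
		SSL_CTX *ctx;
		SSL *ssl;
		int fd, n, total = 0;

		if ((fd = dial(host, "443")) < 0)
			return -1;
		if (!(ctx = SSL_CTX_new(TLS_client_method())) || !(ssl = SSL_new(ctx)))
			return -1;

		SSL_set_fd(ssl, fd);
		SSL_set_tlsext_host_name(ssl, host);	/* SNI, which google's frontends expect */

		if (SSL_connect(ssl) == 1 && SSL_write(ssl, req, strlen(req)) > 0)
			while ((n = SSL_read(ssl, buf + total, len - total - 1)) > 0)
				total += n;
		buf[total] = '\0';

		SSL_shutdown(ssl);
		SSL_free(ssl);
		SSL_CTX_free(ctx);
		close(fd);
		return total;
	}

Needs linking with -lssl -lcrypto.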

write some little tests

create man pages

have websummary.sh print the date of release, e.g.
  getxbook 0.3 (sig) (2011-08-02)

## getgbook

Google will give you up to 5 cookies in immediate succession that return useful pages. After that it stops serving new pages to the IP, even with a fresh cookie, so the cookie is certainly not everything.

If one does something too naughty, all requests from the IP to books.google.com are blocked with a 403 'automated requests' error for 24 hours. What triggers this IP block is less clear. It certainly isn't just trying lots of pages with 5 cookies. It seems to be requesting 100 new cookies within a certain time period: 100 in 5 minutes seemed to do it, as did 100 in ~15 minutes.

So, if no more than 5 usable cookies can be had, and requesting many more than that causes an IP block, a strategy could be to not bother getting more than 5 cookies, and bail once the 5th starts failing. Of course, this doesn't address getting more pages, and it doesn't address knowing which pages are available either.

all pages available (includes page code & order, even when not available from the main click3 part, & sometimes title & height, though not url): curl 'http://books.google.com/books?id=h3DSQ0L10o8C&printsec=frontcover' | sed -e '/_OC_Run(/!d' -e 's/.*_OC_Run({"page"://g' -e 's/}].*//g'

TODO, THEN:
	at start (if in -p or -a mode), fill a Page struct for each page (don't hold url in the struct any more); a rough sketch of this is at the end of this file
	in -a, go through Page struct, if file exists, skip, otherwise get the url for the page (don't bother about re-getting order etc). this means that getgfailed and getgmissing can go away
	in -p, just go through Page struct and print each entry
	when 5 cookies have been exhausted, quit, saying no more cookies available for now (and recommending a time period to retry)
	have -a be default, and stdin be -

	so, usage should be
	 getgbook [-] bookid
	  if - is given, read page codes from stdin
	  otherwise, just download everything (skipping already
	  downloaded pages)
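
A rough sketch of the plan above; the struct fields and the getpagelist()/getpageurl()/gettofile() helpers are all invented names, standing in for the _OC_Run parsing, the per-page url request (which is where the cookie handling and the bail-out after 5 exhausted cookies would live) and the download-to-file code. -p mode would just print each entry of the same list.

	/* sketch only: every name here is an assumption about the eventual
	 * code, not its real interface */
	#include <stdio.h>

	#define STRMAX 1024
	#define MAXPAGES 9999

	typedef struct {
		char name[16];		/* page code, e.g. "PA7" */
		int num;		/* page order */
		char url[STRMAX];	/* only filled in when the page is fetched */
	} Page;

	int getpagelist(const char *bookid, Page *pages);	/* assumed: parses the _OC_Run data */
	int getpageurl(const char *bookid, Page *p);		/* assumed: fills p->url, needs a cookie */
	int gettofile(const char *url, const char *path);	/* assumed: downloads url to path */

	/* -a mode: walk the page list, skipping pages already on disk */
	int getall(const char *bookid)
	{
		static Page pages[MAXPAGES];
		char path[STRMAX];
		FILE *f;
		int i, n;

		n = getpagelist(bookid, pages);
		for (i = 0; i < n; i++) {
			snprintf(path, sizeof path, "%s.png", pages[i].name);
			if ((f = fopen(path, "r"))) {	/* already downloaded, skip */
				fclose(f);
				continue;
			}
			if (getpageurl(bookid, &pages[i]) == 0)
				gettofile(pages[i].url, path);
		}
		return 0;
	}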