From 3afe70f3cd0a19465ef9f8bbaf6a0961d9eb6d3a Mon Sep 17 00:00:00 2001
From: Nick White
Date: Wed, 17 Aug 2011 18:57:02 +0100
Subject: Started rewrite (not there yet)

---
 TODO | 2 ++
 1 file changed, 2 insertions(+)

(limited to 'TODO')

diff --git a/TODO b/TODO
index 558b8d8..8a1deb7 100644
--- a/TODO
+++ b/TODO
@@ -4,6 +4,8 @@ getabook
 
 getbnbook
 
+use "" rather than "\0" in headermax
+
 # other todos
 
 use HTTP/1.1 with "Connection: close" header
--
cgit v1.2.3


From be77fe85042dfcc4a943c4c979ba7b990d6a124f Mon Sep 17 00:00:00 2001
From: Nick White
Date: Sun, 21 Aug 2011 17:00:00 +0100
Subject: Tighten sscanf usage, add TODOs

---
 TODO | 24 ++++++++----------------
 1 file changed, 8 insertions(+), 16 deletions(-)

(limited to 'TODO')

diff --git a/TODO b/TODO
index 8a1deb7..4703148 100644
--- a/TODO
+++ b/TODO
@@ -8,6 +8,10 @@ use "" rather than "\0" in headermax
 
 # other todos
 
+use wide string functions when dealing with stuff returned over http; it's known utf8
+
+bug in get(): if the \r\n\r\n after http headers is cut off between recv buffers
+
 use HTTP/1.1 with "Connection: close" header
 
 try supporting 3xx in get, if it can be done in a few lines
@@ -25,23 +29,11 @@ have websummary.sh print the date of release, e.g.
 
 ## getgbook
 
+mkdir of bookid and save pages in there
+
 Google will give you up to 5 cookies which get useful pages in immediate succession. It will stop serving new pages to the ip, even with a fresh cookie. So the cookie is certainly not everything.
 
 If one does something too naughty, all requests from the ip to books.google.com are blocked with a 403 'automated requests' error for 24 hours. What causes this ip block is less clear. It certainly isn't after just trying lots of pages with 5 cookies. It seems to be after requesting 100 new cookies in a certain time period - 100 in 5 minutes seemed to do it, as did 100 in ~15 minutes.
 
-So, if no more than 5 useable cookies can be gotten, and many more than this cause an ip block, a strategy could be to not bother getting more than 5 cookies, and bail once the 5th starts failing. of course, this doesn't address getting more pages, and moreover it doesn't address knowing which pages are available.
-
-all pages available (includes page code & order (even when not available from main click3 part) (& title sometimes, & height), though not url): curl 'http://books.google.com/books?id=h3DSQ0L10o8C&printsec=frontcover' | sed -e '/OC_Run\(/!d' -e 's/.*_OC_Run\({"page"://g' -e 's/}].*//g'
-
-TODO, THEN:
-  at start (if in -p or -a mode), fill a Page struct (don't hold url in struct any more)
-  in -a, go through Page struct, if file exists, skip, otherwise get the url for the page (don't bother about re-getting order etc). this means that getgfailed and getgmissing can go away
-  in -p, just go through Page struct and print each entry
-  when 5 cookies have been exhausted, quit, saying no more cookies available for now (and recommending a time period to retry)
-  have -a be default, and stdin be -
-
-  so, usage should be
-    getgbook [-] bookid
-    if - is given, read page codes from stdin
-    otherwise, just download everything (skipping already
-    downloaded pages)
+NOTE!!: the method of getting all pages from book page does miss some; they aren't all listed
+* these pages can often be requested, though
--
cgit v1.2.3
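
One of the todos added in the patch above concerns get() missing the end of the HTTP headers when the "\r\n\r\n" terminator is split between recv buffers. As a rough illustration only (HDRMAX, readheaders and the buffer handling below are assumed names, not getxbook's actual code), the usual fix is to search the data accumulated so far rather than each chunk in isolation:

/* Hypothetical sketch, not getxbook's actual get(): find the end of the
 * HTTP headers even when "\r\n\r\n" straddles two recv() calls, by always
 * searching the accumulated buffer. buf must hold at least HDRMAX bytes. */
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>

#define HDRMAX 8192

/* returns the offset of the first byte after the header terminator,
 * or -1 on error or if the headers don't fit in buf */
ssize_t readheaders(int fd, char *buf)
{
	size_t len = 0;
	ssize_t n;
	char *end;

	while (len < HDRMAX - 1) {
		if ((n = recv(fd, buf + len, HDRMAX - 1 - len, 0)) <= 0)
			return -1;
		len += n;
		buf[len] = '\0';
		/* search everything received so far, so a terminator cut
		 * off between recv buffers is still found */
		if ((end = strstr(buf, "\r\n\r\n")))
			return (end - buf) + 4;
	}
	return -1;
}

The key point is that the search restarts from the beginning of the buffer each time (or a few bytes before the previous end), never from the start of the latest chunk; the same applies to any delimiter that can straddle a read boundary.
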
From 043da4609ae6f9e229f0f03d602f57908f66879a Mon Sep 17 00:00:00 2001
From: Nick White
Date: Sun, 21 Aug 2011 17:22:35 +0100
Subject: Fix reporting of no pages available

---
 TODO | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

(limited to 'TODO')

diff --git a/TODO b/TODO
index 4703148..4eb35e4 100644
--- a/TODO
+++ b/TODO
@@ -31,9 +31,14 @@ have websummary.sh print the date of release, e.g.
 
 mkdir of bookid and save pages in there
 
+add cmdline arguments for stdin parsing
+
+merge pageinfo branch
+
+### notes
+
 Google will give you up to 5 cookies which get useful pages in immediate succession. It will stop serving new pages to the ip, even with a fresh cookie. So the cookie is certainly not everything.
 
 If one does something too naughty, all requests from the ip to books.google.com are blocked with a 403 'automated requests' error for 24 hours. What causes this ip block is less clear. It certainly isn't after just trying lots of pages with 5 cookies. It seems to be after requesting 100 new cookies in a certain time period - 100 in 5 minutes seemed to do it, as did 100 in ~15 minutes.
 
-NOTE!!: the method of getting all pages from book page does miss some; they aren't all listed
-* these pages can often be requested, though
+The method of getting all pages from book webpage does miss some; they aren't all listed. These pages can often be requested, though.
--
cgit v1.2.3


From 6b059ae1888b0cf8d38c7fe9b4f5c10ec28ab7b6 Mon Sep 17 00:00:00 2001
From: Nick White
Date: Sun, 21 Aug 2011 21:14:24 +0100
Subject: Restructure getgbook code

---
 TODO | 6 +-----
 1 file changed, 1 insertion(+), 5 deletions(-)

(limited to 'TODO')

diff --git a/TODO b/TODO
index 4eb35e4..6b08e9f 100644
--- a/TODO
+++ b/TODO
@@ -31,14 +31,10 @@ have websummary.sh print the date of release, e.g.
 
 mkdir of bookid and save pages in there
 
-add cmdline arguments for stdin parsing
-
-merge pageinfo branch
-
 ### notes
 
 Google will give you up to 5 cookies which get useful pages in immediate succession. It will stop serving new pages to the ip, even with a fresh cookie. So the cookie is certainly not everything.
 
 If one does something too naughty, all requests from the ip to books.google.com are blocked with a 403 'automated requests' error for 24 hours. What causes this ip block is less clear. It certainly isn't after just trying lots of pages with 5 cookies. It seems to be after requesting 100 new cookies in a certain time period - 100 in 5 minutes seemed to do it, as did 100 in ~15 minutes.
 
-The method of getting all pages from book webpage does miss some; they aren't all listed. These pages can often be requested, though.
+The method of getting all pages from book webpage does miss some; they aren't all listed. These pages can often be requested, though, though at present getgbook can't, as if a page isn't in its initial structure it won't save the url, even if it's presented.
--
cgit v1.2.3
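
For the "mkdir of bookid and save pages in there" item that remains open in the last patch, a minimal sketch in C might look like the following; savepage, the 0755 mode, the path buffer size and the .png extension are assumptions for illustration, not getgbook's actual behaviour:

/* Hypothetical sketch of the "mkdir of bookid and save pages in there"
 * todo: create a per-book directory and write each page into it.
 * The names and file format here are illustrative, not getgbook's code. */
#include <errno.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>

int savepage(const char *bookid, const char *pgcode, const char *data, size_t len)
{
	char path[512];
	FILE *f;

	/* EEXIST is fine: the directory may be left over from an
	 * earlier, interrupted run */
	if (mkdir(bookid, 0755) == -1 && errno != EEXIST)
		return -1;

	snprintf(path, sizeof(path), "%s/%s.png", bookid, pgcode);
	if ((f = fopen(path, "wb")) == NULL)
		return -1;
	fwrite(data, 1, len, f);
	fclose(f);
	return 0;
}

Treating EEXIST as success keeps the call idempotent, so re-running the tool on a partly downloaded book simply fills in the pages that aren't on disk yet.
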