before 1.0: create bn tool, fix http bugs, be unicode safe, package for osx & windows
# getbnbook
# other todos
mention in getgbook man page that not all pages may be available in one run, but try later / from a different ip and it will try to fill in the gaps (can replace notes section here, too)
use the correct file extension depending on the image type (for google and amazon
the first page is a jpg, all the others are png)
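one way to pick the extension is to look at the first bytes of the image rather than trust the page position. a minimal sketch, assuming the image data is already in memory; imgext() is a made-up name, not something in the current source, and the "bin" fallback is an arbitrary choice:

```c
#include <stddef.h>
#include <string.h>

/* hypothetical helper: guess a file extension from the image's magic bytes.
 * returns "bin" if the data is neither a jpeg nor a png. */
const char *imgext(const unsigned char *buf, size_t len)
{
	/* the fixed 8 byte png signature */
	static const unsigned char png[8] = {0x89, 'P', 'N', 'G', '\r', '\n', 0x1a, '\n'};

	/* jpeg files start with ff d8 ff */
	if (len >= 3 && buf[0] == 0xff && buf[1] == 0xd8 && buf[2] == 0xff)
		return "jpg";
	if (len >= 8 && memcmp(buf, png, 8) == 0)
		return "png";
	return "bin";
}
```

this would be called on the start of the response body, before the output filename is chosen.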
use wide string functions when dealing with stuff returned over http; it's known utf8
http://triptico.com/docs/unicode.html#utf-8
http://www.cl.cam.ac.uk/~mgk25/unicode.html#c
this means c99, rather than plain ansi c. worth it.
alternative is to just use our own bit of utf-8 handling; we only need to know how many bytes to skip to get one char at a time, to find the next char, etc. it's not yet certain whether being unable to use strcmp etc would make this tricky enough to not be worthwhile. try it and see if it fits. note st has nice homemade utf8 support.
OR
use custom string functions where needed (prob only strstr needed), which work on utf8 specifically, and just skip the appropriate # of bytes if it's not an ascii char
BUT
see how things are done in plan9, as they're good there
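to get a feel for how small the "own bit of utf-8 handling" option could be, here is a rough sketch of the byte skipping. the function names (utf8len, utf8next, utf8count) are invented for illustration, and it assumes the input is mostly valid utf8; invalid lead bytes are just treated as one byte chars so a scan always moves forward:

```c
#include <stddef.h>

/* number of bytes in the utf8 sequence whose lead byte is c.
 * returns 1 for ascii and for invalid lead bytes. */
size_t utf8len(unsigned char c)
{
	if (c < 0x80) return 1;           /* 0xxxxxxx: ascii */
	if ((c & 0xe0) == 0xc0) return 2; /* 110xxxxx */
	if ((c & 0xf0) == 0xe0) return 3; /* 1110xxxx */
	if ((c & 0xf8) == 0xf0) return 4; /* 11110xxx */
	return 1;                         /* continuation or invalid byte */
}

/* pointer to the char after the one at s; never steps past the NUL,
 * even if the last sequence is truncated. */
const char *utf8next(const char *s)
{
	size_t n = utf8len((unsigned char)*s);

	while (n-- > 0 && *s != '\0')
		s++;
	return s;
}

/* count chars (not bytes) in a NUL terminated utf8 string */
size_t utf8count(const char *s)
{
	size_t n = 0;

	for (; *s != '\0'; s = utf8next(s))
		n++;
	return n;
}
```

worth noting when weighing the options: plain strstr already gives correct matches when both needle and haystack are valid utf8, since a lead byte can never be mistaken for a continuation byte. so custom functions may only be needed where chars are counted or stepped over.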
bug in get() & post(): if the \r\n\r\n after http headers is cut off between recv buffers
what happens if what we receive is not an http header? does recv loop forever, in a memory killing manner?
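a possible shape for a fix to both of the above, as a sketch only: search the accumulated buffer rather than just the newest chunk, and give up once the headers pass some size cap. struct hbuf, hbuf_add() and the max parameter are all hypothetical, not taken from the current get() or post():

```c
#include <stdlib.h>
#include <string.h>

/* hypothetical accumulator for bytes returned by successive recv calls */
struct hbuf {
	char *data;  /* accumulated bytes, grown with realloc */
	size_t len;  /* bytes used */
};

/* append a received chunk and look for the "\r\n\r\n" ending the headers.
 * returns the offset where the body starts, 0 if the terminator has not
 * been seen yet, or -1 on allocation failure or if more than max bytes
 * have arrived without a terminator being found. */
long hbuf_add(struct hbuf *h, const char *chunk, size_t n, size_t max)
{
	size_t start, i;
	char *p;

	if (n == 0)
		return 0;
	if (h->len + n > max)
		return -1; /* not an http header, or a silly one; stop growing */
	if ((p = realloc(h->data, h->len + n)) == NULL)
		return -1;
	memcpy(p + h->len, chunk, n);
	h->data = p;

	/* rescan the last 3 old bytes, so a terminator that was cut off
	 * between two chunks is still found */
	start = h->len > 3 ? h->len - 3 : 0;
	h->len += n;
	for (i = start; i + 4 <= h->len; i++)
		if (memcmp(h->data + i, "\r\n\r\n", 4) == 0)
			return (long)(i + 4);
	return 0;
}
```

the recv loop would call hbuf_add() with each chunk and bail out on -1; 0 is safe as the "not yet" value because a real body offset is always at least 4.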
package for osx
package for windows
have tcl as a starpack. have it always reference the executables in its directory, and we're golden.
http://www.digital-smarties.com/Tcl2002/tclkit.pdf
try supporting 3xx in get, if it can be done in a few lines
by getting Location line, freeing buf, and returning a new
iteration.
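the header parsing side of 3xx support could look something like this. it is a sketch under assumptions: getstatus() and getlocation() are invented names, the headers are assumed to be in a NUL terminated buffer, and only the exact spelling "Location:" is matched even though http header names are case insensitive:

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* return the status code from the status line, or 0 if malformed */
int getstatus(const char *headers)
{
	int code;

	if (sscanf(headers, "HTTP/%*d.%*d %d", &code) != 1)
		return 0;
	return code;
}

/* copy the value of the Location header into url.
 * returns 1 on success, 0 if the header is absent or won't fit.
 * assumption: only matches the exact case "Location:". */
int getlocation(const char *headers, char *url, size_t urllen)
{
	const char *p;
	size_t n;

	if ((p = strstr(headers, "\r\nLocation:")) == NULL)
		return 0;
	p += strlen("\r\nLocation:");
	while (*p == ' ' || *p == '\t')
		p++;
	n = strcspn(p, "\r\n"); /* value runs to the end of the line */
	if (n == 0 || n >= urllen)
		return 0;
	memcpy(url, p, n);
	url[n] = '\0';
	return 1;
}
```

get would then check for a status between 300 and 399, copy out the url, free buf and call itself again. a limit on the number of redirects followed would be wise, to avoid looping between two urls forever.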
add https support to get
write some little tests
would likely be rather tricky, but building for android
would be nice. how it would work would be modifying the
getgbook src slightly, redefining function calls to be
findable by the java, and then writing java stuffs to call
it. gui could either be done from the java directly, or from
xml; both are gross options. see:
http://developer.android.com/resources/tutorials/hello-world.html
http://marakana.com/forums/android/examples/49.html
### notes
Google will give you up to 5 cookies which get useful pages in immediate succession. After that it will stop serving new pages to the ip, even with a fresh cookie. So the cookie is certainly not everything.
If one does something too naughty, all requests from the ip to books.google.com are blocked with a 403 'automated requests' error for 24 hours. What causes this ip block is less clear. It certainly isn't triggered by just trying lots of pages with 5 cookies. It seems to be triggered by requesting 100 new cookies in a certain time period: 100 in 5 minutes seemed to do it, as did 100 in ~15 minutes.
The method of getting all pages from the book webpage does miss some; they aren't all listed. These pages can often still be requested, but at present getgbook can't do so: if a page isn't in its initial structure it won't save the url, even if it's presented.