# Getgbook
## TOS
Google's terms of service are ambiguous. On the one hand, they
forbid using anything but a browser to access their sites, which
is absurd and ruinous. On the other hand, they state that one
should abide by the rules of robots.txt, which are only relevant
to non-browser access. A reasonable interpretation is that
non-browsers are allowed to access Google's services as long as
they abide by robots.txt.
See section 5.3 of http://www.google.com/accounts/TOS.
## robots.txt
Their robots.txt allows certain book URLs, but disallows
others.
We use three types of URL:
http://books.google.com/books?id=<bookid>&printsec=frontcover
http://books.google.com/books?id=<bookid>&pg=<pgcode>&jscmd=click3
http://books.google.com/books?id=<bookid>&pg=<pgcode>&img=1&zoom=3&hl=en&<sig>
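
As a rough illustration, the three URL types could be assembled as below. This is a sketch only: the function names are hypothetical, and `sig` is assumed to be an opaque, ready-made query parameter obtained elsewhere.

```python
# Hypothetical helpers that fill in the three URL templates listed above.
BASE = "http://books.google.com/books"


def frontcover_url(bookid):
    """URL of the first type: ?id=<bookid>&printsec=frontcover"""
    return f"{BASE}?id={bookid}&printsec=frontcover"


def click3_url(bookid, pgcode):
    """URL of the second type: ?id=<bookid>&pg=<pgcode>&jscmd=click3"""
    return f"{BASE}?id={bookid}&pg={pgcode}&jscmd=click3"


def image_url(bookid, pgcode, sig):
    """URL of the third type; `sig` is passed through unchanged."""
    return f"{BASE}?id={bookid}&pg={pgcode}&img=1&zoom=3&hl=en&{sig}"


if __name__ == "__main__":
    # Placeholder values, not real identifiers.
    print(frontcover_url("BOOKID"))
    print(click3_url("BOOKID", "PGCODE"))
    print(image_url("BOOKID", "PGCODE", "sig=PLACEHOLDER"))
```
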
robots.txt disallows /books?*jscmd=* and /books?*pg=*. However,
Google considers Allow statements to overrule Disallow statements
if they are longer. And they happen to allow /books?*q=subject:*.
So we append that parameter to the URLs (it has no effect on them),
and we are obeying robots.txt.
Details on how Google interprets robots.txt are at
http://code.google.com/web/controlcrawlindex/docs/robots_txt.html
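
The longest-rule-wins behaviour described above can be sketched as follows. This is a simplified model, not Google's actual matcher: it handles only the `*` wildcard, it knows only the three rules quoted here, and it assumes that a tie in length goes to Allow. The function names and the `&q=subject:a` suffix are illustrative choices.

```python
import re

# Only the rules quoted in this section; the real robots.txt has many more.
RULES = [
    ("disallow", "/books?*jscmd=*"),
    ("disallow", "/books?*pg=*"),
    ("allow", "/books?*q=subject:*"),
]


def pattern_matches(pattern, path):
    """Prefix-match a robots.txt pattern against a path, with '*' as a wildcard."""
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.match(regex, path) is not None


def is_allowed(path, rules=RULES):
    """Apply the longest matching rule; no matching rule means allowed."""
    matching = [
        (len(pattern), kind == "allow")
        for kind, pattern in rules
        if pattern_matches(pattern, path)
    ]
    if not matching:
        return True
    # Tuples compare by length first, then Allow (True) beats Disallow (False).
    return max(matching)[1]


if __name__ == "__main__":
    path = "/books?id=BOOKID&pg=PGCODE&jscmd=click3"
    print(is_allowed(path))                   # matched only by Disallow rules
    print(is_allowed(path + "&q=subject:a"))  # the longer Allow rule also matches
```

The Allow pattern is 19 characters long, against 15 and 12 for the two Disallow patterns, which is why appending the parameter tips the decision.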