# Overview
While legal "agreements" (which consider themselves signed by
anyone using the service they cover) tend to be hostile, all of
the getxbook tools are permitted by robots.txt, the standard for
determining what an automated tool is allowed to do on a server.
# getgbook
## Terms of Service
Google's terms of service are ambiguous. On the one hand they
forbid using anything but a browser to access their sites.
This is absurd and ruinous. On the other hand, however, they
state that one should abide by the rules of robots.txt, which
are only relevant for non-browser access. A reasonable
interpretation would be that non-browsers are allowed to
access Google's services as long as they abide by robots.txt.
See the section "Using our Services" of
http://www.google.com/intl/en/policies/terms/.
## robots.txt
Their robots.txt allows certain book URLs, but disallows
others.
We use three types of URL:
http://books.google.com/books?id=<bookid>&printsec=frontcover
http://books.google.com/books?id=<bookid>&pg=<pgcode>&jscmd=click3
http://books.google.com/books?id=<bookid>&pg=<pgcode>&img=1&zoom=3&hl=en&<sig>
robots.txt disallows /books?*jscmd=* and /books?*pg=*. However,
Google consider an Allow rule to overrule a Disallow rule when the
Allow rule's path is the longer of the two, and they happen to
allow /books?*q=subject:*. So we append a q=subject: parameter to
the URLs (it has no effect on them), and we are obeying robots.txt.
Details on how Google interprets robots.txt are at
http://code.google.com/web/controlcrawlindex/docs/robots_txt.html
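
The precedence rule described above can be sketched in a few lines
of Python. This is only an illustration of the longest-match idea,
not Google's parser and not getgbook's code: the helper names, the
BOOKID and PA1 placeholders and the appended "&q=subject:a" value are
all assumptions made for the example.

```python
import re

def pattern_to_regex(pattern):
    # '*' matches any run of characters; a trailing '$' anchors the end.
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))

def is_allowed(path, rules):
    """Return True if path may be fetched under the given rules.

    rules is a list of (directive, pattern) pairs, where directive is
    "allow" or "disallow". The longest matching pattern wins; on a tie
    the allow rule wins. A path that matches nothing is allowed.
    """
    best_len = -1
    best_allow = True
    for directive, pattern in rules:
        if not pattern:
            continue  # an empty pattern restricts nothing
        if pattern_to_regex(pattern).match(path):
            allow = directive == "allow"
            if len(pattern) > best_len or (len(pattern) == best_len and allow):
                best_len = len(pattern)
                best_allow = allow
    return best_allow

# The three rules quoted in this section.
RULES = [
    ("disallow", "/books?*jscmd=*"),
    ("disallow", "/books?*pg=*"),
    ("allow", "/books?*q=subject:*"),
]

# Hypothetical request path, without and with the appended parameter.
plain = "/books?id=BOOKID&pg=PA1&jscmd=click3"
padded = plain + "&q=subject:a"

print(is_allowed(plain, RULES))
print(is_allowed(padded, RULES))
```

The Allow pattern is 19 characters long, against 15 and 12 for the
two Disallow patterns, so once the parameter is appended the Allow
rule is the longest match and takes precedence.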
# getabook
## Conditions of Use
With Amazon, massive overreach rules the day. In the "license and
site access" section of Amazon.com's Conditions of Use, they state
that downloading any part of their website except for page caching
is forbidden, and that using "robots, or similar data gathering
and extraction tools" is also not allowed.
http://www.amazon.com/gp/help/customer/display.html/ref=x?nodeId=508088
## robots.txt
Thankfully, the rules set out in Amazon's robots.txt tell a
different story to the conditions of use. Given that these
explicitly lay down the rules for automated downloading tools, it
seems reasonable to take them as representative of accepted policy.
Amazon's main robots.txt allows all of the request types we make.
## Curiosities
One other curious sentiment in the Conditions of Use is the clause
"we each waive any right to a jury trial." Amazon's is truly a
Brave New World.
# getbnbook
## Terms of Service
Again, we see the terms of service disallow "automated means to
access or index the Barnes & Noble.com Site or its systems, the
Content or any portion or derivative thereof for any purpose", as
well as any downloading other than page caching, from their
website. See section I, "Licenses and restrictions."
http://www.barnesandnoble.com/include/terms_of_use.asp
## robots.txt
Their robots.txt tells a different story, however. Again, it
seems reasonable to use it as the guideline for how getbnbook
may access the site.
Barnes and Noble's main robots.txt allows all of the request types
we make.
## Curiosities
As with Amazon, Barnes and Noble also happily proclaim in their
Terms of Service that "each of the parties hereby knowingly,
voluntarily and intentionally waives any right it may have to a
trial by jury." Dangerous, radical things, those juries.