Baen Books, CD Mirrors

Posted: Wed Aug 05, 2009 8:51 am
by sandstrahler1
No. 18, Eye of the Storm, is now available on baencd.thefifthimperium.com/

Re: Baen Books, CD Mirrors

Posted: Thu Aug 06, 2009 12:58 am
by andy_t_roo
a nifty command : (you need linux or cygwin for wget)

Edited by Admin: see Spec8472's post below before using this. Running the command below is potentially harmful to the server.

wget -r -N -np -l15 -k -A css,htm,html,jpeg,gif,jpg,bmp http://baencd.thefifthimperium.com
This recursively downloads all the HTML and pictures from the list of CDs. The first 9 CDs are blocked by robots.txt, as their contents are duplicated on the other CDs.
The total size of CDs 10-17 and the specials (Gutenberg, P04) retrieved this way is 488 MB. It only gets HTML files and pictures in the specified formats, skipping the audiobooks and the PDF, DOC, reader, etc. versions of the books.

-r -- recurse into links (-l has no effect without it)
-N -- check timestamps (so if you restart, it doesn't re-download files that haven't changed)
-np -- no parents: don't go to a higher directory (useful to grab just 1 CD without going up to the root folder and getting the whole set)
-l15 -- recurse 15 links deep. This needs to be set: the default of 5 only gets the title page, the CD frame, the CD listing, the link to the book, and the book formats listing. The extra steps are the frame for the book, the chapters listing, the chapters, and any pictures in the chapters.
-k -- convert links to the local directory structure (some of the links are absolute and won't work without this)
-A -- just get these file types
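
For anyone who wants to see the flags above in one place, here is a minimal sketch as a small shell function. It includes -r, which wget needs before -l takes effect, plus two throttling options (--wait and --limit-rate) that are not part of the original post; the values 1 and 200k are arbitrary assumptions, and the function name mirror is made up for illustration. By default it only prints the command instead of running it.

```shell
# Sketch only: prints the wget command unless DRY_RUN=0 is set.
mirror() {
  url=$1
  # Flags described in the post: recurse, timestamping, no parents,
  # depth 15, convert links, accept only the listed suffixes.
  set -- -r -N -np -l15 -k -A css,htm,html,jpeg,gif,jpg,bmp
  # Assumption, not in the post: wait 1 second between requests and
  # cap the download rate, to go easier on the server.
  set -- "$@" --wait=1 --limit-rate=200k "$url"
  if [ "${DRY_RUN:-1}" = 1 ]; then
    echo "wget $*"
  else
    wget "$@"
  fi
}
```

Calling mirror http://baencd.thefifthimperium.com shows what would be run; setting DRY_RUN=0 runs it for real, so read the admin note and Spec8472's reply first.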

A few of the links don't work, because they assume index.htm is the default file when a link doesn't name an .htm file. A modified html file that links to the CDs without the 2 intermediate clicks is at http://pastebin.ca/1519955 (it is set to expire 1 year from today).
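
One possible way to patch such links in a local copy is a sed filter that appends index.htm to any href ending in a slash. This is only a sketch under the assumption that the broken links are written as href="folder/"; the function name fix_dir_links is hypothetical, and the real pages may need a different pattern.

```shell
# Sketch only: reads HTML on stdin, writes patched HTML on stdout.
# Appends index.htm to every double-quoted href that ends in "/".
# Note: this also rewrites links to other sites that end in "/".
fix_dir_links() {
  sed 's#href="\([^"]*/\)"#href="\1index.htm"#g'
}
```

Usage would be something like fix_dir_links < page.htm > page.fixed.htm, where both file names are placeholders.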


edit:
In reply to the comments:
Spec8472, sorry for not thinking before posting. I would have thought that crawling just the html pages, when the content is static, would put less load on the server. Bandwidth-wise, just the html and pics come to 500 MB across 9 CDs, rather than the full ~5000 MB of all the extra content.
re: wget mirroring an entire site -- this site only contains the CDs, so if you want to mirror just the html components of the CDs, you have to work from the root folder. The wget only gets 1 folder for each CD, a folder called css, and a folder called images. I am aware that this could go too far, hence the inclusion of -np (useless in this specific case): if you start it from a folder, it won't go to parents or to other websites.

re: robots -- I removed the comment concerning it; it deliberately wasn't included in the actual command listed. I should give myself a slap on the back of the head for even mentioning it.

Re: Baen Books, CD Mirrors

Posted: Thu Aug 06, 2009 12:47 pm
by Spec8472
andy_t_roo wrote:a nifty command : (you need linux or cygwin for wget)
First: You can get wget for win32. Works the same.

Second: This is exceptionally bad behaviour. Requesting thousands of files and mirroring the entire site is bad enough; actively disabling robots.txt is even worse.

Before you go and do this, consider just downloading the .zip files for the entire CD and then extracting the zip file on your computer. This is less load on the server, as a rule, than crawling the site.
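
A rough sketch of the zip approach, for illustration: one request for the archive, then extraction on the local machine. The function name fetch_cd_zip and any URL passed to it are hypothetical; the actual archive links have to be taken from the site itself. By default it only prints the two commands.

```shell
# Sketch only: prints the commands unless DRY_RUN=0 is set.
fetch_cd_zip() {
  zip_url=$1                 # hypothetical archive URL, look it up on the site
  dest_dir=$2                # local folder to extract into
  zip_file=${zip_url##*/}    # file name is whatever follows the last "/"
  if [ "${DRY_RUN:-1}" = 1 ]; then
    echo "wget -N $zip_url"
    echo "unzip -q $zip_file -d $dest_dir"
  else
    # -N skips the download if the local copy is already up to date.
    wget -N "$zip_url" && unzip -q "$zip_file" -d "$dest_dir"
  fi
}
```

This needs both wget and unzip installed when run with DRY_RUN=0.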