Baen Books, CD Mirrors

General announcements about the forum and its related services.

Moderator: Sennadar Moderators

Locked
sandstrahler1
Novice
Posts: 10
Joined: Mon Mar 22, 2004 3:44 pm
Location: Switzerland

Baen Books, CD Mirrors

Post by sandstrahler1 »

CD No. 18, Eye of the Storm, is now available at baencd.thefifthimperium.com/
andy_t_roo
Sorcerer
Posts: 69
Joined: Sat Oct 18, 2008 1:37 pm

Re: Baen Books, CD Mirrors

Post by andy_t_roo »

A nifty command (you need Linux or Cygwin for wget):

Edited by Admin: see Spec8472's post below before using this. Running the command below is potentially harmful to the server.

wget -r -N -np -l15 -k -A css,htm,html,jpeg,gif,jpg,bmp http://baencd.thefifthimperium.com
This recursively downloads all the HTML and pictures from the list of CDs. The first 9 CDs are blocked (via robots.txt), as their contents are duplicated on the other CDs.
The total size of CDs 10-17 and the specials (gutenburg, P04) retrieved this way is 488 MB. It only gets HTML files and pictures in the specified formats, skipping the audiobooks and the PDF, DOC, reader, etc. versions of the books.

-r -- turn on recursion (without it, -l has no effect)
-N -- check timestamps (so if you restart, it doesn't re-download files that haven't changed)
-np -- no parents: don't go to a higher directory (useful to grab just 1 CD without going up to the root folder and getting the whole set)
-l15 -- recurse 15 links deep. This needs to be set: the default of 5 gets the title page, the CD frame, the CD listing, the link to the book, and the book formats listing. The extra steps are the frame for the book, the chapters listing, the chapters, and any pictures in the chapters.
-k -- convert links to the local directory structure (some of the links are absolute and won't work without this)
-A -- only keep files with these extensions
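
For anyone who still wants to crawl rather than grab the zip files, here is a rough sketch of a gentler variant of the same flags. It is not something anyone in this thread posted. It adds -r (wget only recurses when -r is given), pauses between requests and caps the download rate. The CD folder name is a made-up placeholder, and the wait and rate values are guesses, not anything the site admins have asked for. By default the script only prints the command; set RUN=1 to actually run it.

```shell
#!/bin/sh
# Sketch only: a slower, rate-limited variant of the crawl.
# Assumption: SOME-CD-FOLDER is a hypothetical placeholder for one CD's folder.
CD_URL="${CD_URL:-http://baencd.thefifthimperium.com/SOME-CD-FOLDER/}"

polite_crawl() {
    # -r             turn recursion on (-l only sets the depth)
    # --wait=2       pause about 2 seconds between requests (assumed value)
    # --random-wait  vary that pause so requests are not evenly spaced
    # --limit-rate   cap the bandwidth used (assumed value)
    set -- wget -r -N -np -l15 -k \
        --wait=2 --random-wait --limit-rate=200k \
        -A css,htm,html,jpeg,gif,jpg,bmp "$1"
    if [ "${RUN:-0}" = "1" ]; then
        "$@"          # really run wget
    else
        echo "$@"     # dry run: just show the command
    fi
}

polite_crawl "$CD_URL"
```

Starting from one CD's folder with -np keeps the crawl inside that CD instead of walking the whole site.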

A few of the links don't work because they assume that index.htm is the default file when a link doesn't name an .htm file. A modified HTML file that links to the CDs directly, without the 2 intermediate clicks, is at http://pastebin.ca/1519955 (it is set to expire 1 year from today).
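
If you would rather patch a local copy yourself, something like the following might work. It is only a sketch, and it assumes the broken links are relative hrefs ending in "/" (I have not checked that this matches every broken link on the CDs). The function name fix_dir_links is made up for the example.

```shell
#!/bin/sh
# Sketch only: append index.htm to relative links that end in "/".
# Assumption: the broken links look like href="some/folder/" (lowercase href,
# double quotes). Links written any other way are left untouched.
fix_dir_links() {
    # [^":]* cannot cross a ':' so absolute URLs (http://...) are not changed.
    sed -E 's#href="([^":]*/)"#href="\1index.htm"#g'
}

# Example on one line of HTML:
echo '<a href="books/">books</a>' | fix_dir_links
```

It reads HTML on standard input and writes the patched version to standard output, so you can try it on a copy of one page before touching anything else.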


Edit:
In reply to the comments:
Spec8472, sorry for not thinking before posting. I would have thought that crawling just the HTML pages, when the content is static, would put less load on the server. Bandwidth-wise, just the HTML and pictures come to about 500 MB across 9 CDs, rather than the full ~5000 MB with all the extra content.
Re: wget mirroring an entire site: this site only contains the CDs, so if you want to mirror just the HTML components of the CDs, you have to work from the root folder. The wget command only gets one folder for each CD, a folder called css, and a folder called images. I am aware that this could go too far, hence the inclusion of -np (useless in this specific case): if you start from a folder, it won't go to parent folders, and wget stays off other websites by default.

Re: robots: I removed the comment concerning it; it deliberately wasn't included in the actual command listed. I should give myself a slap on the back of the head for even mentioning it.
Last edited by andy_t_roo on Fri Aug 07, 2009 12:31 am, edited 2 times in total.
Spec8472
Weavespinner
Posts: 1536
Joined: Sun Apr 06, 2003 12:00 am

Re: Baen Books, CD Mirrors

Post by Spec8472 »

andy_t_roo wrote: a nifty command : (you need linux or cygwin for wget)
First: You can get wget for win32. Works the same.

Second: This is exceptionally bad behaviour. Requesting thousands of files and mirroring the entire site is bad enough; actively disabling robots.txt is even worse.

Before you go and do this, consider just downloading the .zip files for the entire CD and then extracting the zip file on your computer. This is less load on the server, as a rule, than crawling the site.
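
A minimal sketch of the zip route, for illustration only. The archive URL is a placeholder, since the real zip file names are not given in this thread, and the function name fetch_cd_zip is made up. It assumes wget and unzip are installed.

```shell
#!/bin/sh
# Sketch only: download one CD as a single zip and unpack it locally.
# Assumption: the caller supplies the real zip URL; none is listed in this thread.
fetch_cd_zip() {
    if [ "$#" -ne 2 ]; then
        echo "usage: fetch_cd_zip ZIP_URL DEST_DIR" >&2
        return 2
    fi
    zip_url=$1
    dest_dir=$2
    zip_file=$(basename "$zip_url")

    mkdir -p "$dest_dir" || return 1
    # One request for one file; -c resumes a partial download,
    # -P saves the file into dest_dir.
    wget -c -P "$dest_dir" "$zip_url" || return 1
    # -q quiet, -o overwrite existing files, -d extract into dest_dir
    unzip -q -o "$dest_dir/$zip_file" -d "$dest_dir"
}
```

You would call it as fetch_cd_zip "http://baencd.thefifthimperium.com/PATH-TO-CD.zip" ./cd, with PATH-TO-CD.zip replaced by whatever the real file is called. The server then handles a single request per CD instead of thousands.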