O’Reilly news

Fetching Web Pages, Parsing HTML, Writing Spiders, and More: O'Reilly Releases "Perl & LWP"

July 29, 2002

Sebastopol, CA--The Swiss Army Knife of programming languages, Perl turns up in diverse and sundry applications. Its flexibility makes it a favorite of coders, and with its multipurpose modules--like the tools and gadgets on a pocketknife--there are very few tasks to which Perl has not been applied. One of Perl's handiest and most practical tools is LWP (Library for WWW in Perl), a suite of modules for fetching and processing web pages. There is a wealth of information on the Web: news, weather, government info, shopping, discussion groups, product info, reviews, games, and other entertainment, and LWP can help automate access to all of it. In his book, "Perl & LWP" (O'Reilly, US $34.95), author Sean Burke shows how to use the LWP library and its related HTML tools to build web client applications that automate tasks on the Web.

LWP is the most frequently downloaded Perl distribution in all of CPAN (Comprehensive Perl Archive Network). It enables programmers to write "spiders" to automatically fetch web pages, extract information from HTML pages, submit forms, and write homegrown servers. With LWP, programmers can dispense with graphical web browsers such as Netscape Navigator and interact with web servers directly, making it ideal for repetitive tasks that would be cumbersome to perform with a browser.

"As people deal more and more with the Web, there are more tasks that we routinely carry out over the Web that could be automated using LWP or the HTML-parsing modules," says Burke. "For example, I'm a fan of C-SPAN2's weekend programming, Book TV, but sometimes they'll have an interesting author on at 5 a.m. on Saturday morning, when I definitely would not be awake and flipping channels. If I want to catch these things, I have to program my VCR in advance. However, that means I have to remember to look at Book TV's web site on Friday night, and remembering is not one of my strong points. So, I wrote a simple LWP program that emails me the web page from the Book TV web site, and then I scheduled crontab to run that program every Friday afternoon. So, what used to be a matter of often missing really good programs is now convenient: I get an email message every Friday night, skim it for interesting authors or subjects, and program the VCR accordingly."

"Perl & LWP" includes many step-by-step examples that show readers how to apply the various techniques to their own needs. Programs that extract information from the web sites of BBC News, AltaVista, ABEBooks.com, Weather Underground, and others are explained in detail. The book also covers:

  • Understanding LWP and its design
  • Fetching and analyzing web pages
  • Extracting information from HTML using regular expressions, tokens, and trees
  • Setting and inspecting HTTP headers and response codes
  • Accessing information that requires authentication or cookies
  • Extracting links
  • Cooperating with proxy caches
  • Writing web spiders (a.k.a. robots) in a safe fashion

Says Burke, "Readers will realize that they can make their life simpler by using what they've learned in this book to write a few little LWP programs to automate two or three of their most common tasks that involve the Web. That needn't be something like getting TV listings off the Web; it could be a program that checks the server status page on a dozen different servers and shows them all on a single page, for the convenience of the server administrator."

Perl programmers who want to automate and mine the Web can pick up this book and be immediately productive. Written by a contributor to LWP, with a foreword by one of LWP's creators, "Perl & LWP" is the authoritative guide to this powerful and popular toolkit.

Additional resources:

Perl & LWP
By Sean M. Burke
ISBN 0-596-00178-9, 242 pages, $34.95 (US), $54.95 (CAN)
1-800-998-9938; 1-707-827-7000

About O’Reilly

O’Reilly, the premier learning platform for technology professionals, offers the industry’s most extensive catalog of high-quality technical and professional skills development courses. From AI, programming, and cloud technologies to essential business skills such as leadership training and critical thinking, O’Reilly delivers highly trusted content from its network of renowned experts that meets a diverse array of learning needs, with over 5,000 role-based on-demand courses, nearly 200 live events each month, access to interactive sandboxes and labs, and more. For more information, visit www.oreilly.com.
