Pavuk


Last Update: August 16 2005

About :

Pavuk is a UNIX program used to mirror the contents of WWW documents or files. It transfers documents from HTTP, FTP, Gopher and optionally from HTTPS (HTTP over SSL) servers. Pavuk has an optional GUI based on the GTK2 widget set.

Pavuk is currently maintained by Dirk Stöcker.

Features :

  • recursive downloading based on links inside HTML documents
  • supports CSS and HTML 4.0
  • local document tree mirrors the structure of the original tree on the remote server
  • transformation of Gopher and FTP directories into HTML documents
  • HTML links translation from remote to local or local to remote
  • supports proxy servers (HTTP, FTP, SSL, HTTP gateway for FTP, HTTP gateway for Gopher, SOCKS 4/5)
  • supports authentication against HTTP servers and proxy HTTP servers
  • can provide detailed timing information about transfers
  • has many options to define the set of documents for transfer :
    • limit on server
    • limit on domain
    • limit on prefix
    • limit on suffix
    • limit on document tree level
    • limit on maximal and minimal size of file
    • limit on type of document (as yet only for documents transferred via HTTP or HTTPS)
    • matching patterns on URLs and document names
    • and many others
  • restarts a transfer (only when the server supports it) after a program interruption, dropped link, timeout or other error
  • stalled connections time out after a given period
  • can be run in different modes:
    • normal - simple recursion
    • sync - pavuk looks for newer versions of already downloaded documents/files
    • singlepage - download of single document with all inline objects (pictures, backgrounds, sounds, ...)
    • resumereget - looks for documents whose transfer was interrupted and tries to download the missing parts
    • singlereget - retries the transfer of a file until it is successfully downloaded
    • linkupdate - scans the local tree of documents and updates links inside HTML documents when linked documents have already been downloaded but the links do not yet reflect it
    • dontstore - used to fetch files to cache/proxy server
    • reminder - informs the user about changes on remote HTTP servers
  • can be run on a terminal or inside an X Window System window
  • X Window System interface based on the GTK2 toolkit
  • drag and drop of URLs with GTK2
  • fetching URLs from clipboard
  • has Native Language Support based on GNU gettext
  • asynchronous buffered DNS name resolving when running under the X Window System
  • so-called dirty FTP proxy support (using CONNECT request to HTTP proxy)
  • can be used as a full featured FTP mirroring tool (preserves modification time, permissions, symbolic links, ...)
  • optional maximum and minimum transfer speed limits
  • highly customizable algorithm for mapping URLs to local filenames
  • automatically loads copy from Netscape browser cache if enabled
  • can remove advertisement banners from HTML pages
  • HTTP/1.1 support
  • FTP over SSL
  • supports POST requests; the GTK UI also has a dialog for filling in HTML forms interactively
  • supports many formats of FTP directory listings (Unix BSD/SYSV, EPFL, Novell, VMS, MS DOS/Windows)
  • optional multithreading support
  • multiple HTTP proxies used in round-robin fashion
  • supports JavaScript via regular expression patterns
  • supports NTLM authorization
  • has JavaScript bindings to allow scripting of particular tasks
  • allows user to define custom FTP login procedures
  • etc.
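
The document selection limits listed above (server, domain, prefix, suffix, tree level, file size) can be pictured as a chain of simple checks applied to every candidate URL. The following Python sketch only illustrates that idea; it is not Pavuk's implementation, and the names `Limits` and `allowed` are hypothetical and do not correspond to Pavuk options.

```python
from dataclasses import dataclass
from typing import Optional, Tuple
from urllib.parse import urlsplit


@dataclass
class Limits:
    """Hypothetical container for selection rules (not a Pavuk data structure)."""
    servers: Tuple[str, ...] = ()    # allowed host names; empty means any
    domains: Tuple[str, ...] = ()    # allowed domains, e.g. "example.org"
    prefixes: Tuple[str, ...] = ()   # allowed path prefixes
    suffixes: Tuple[str, ...] = ()   # allowed path suffixes
    max_level: Optional[int] = None  # maximum depth in the document tree
    min_size: Optional[int] = None   # minimum size in bytes
    max_size: Optional[int] = None   # maximum size in bytes


def allowed(url: str, level: int, size: Optional[int], limits: Limits) -> bool:
    """Return True if the URL passes every configured limit."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    path = parts.path or "/"

    if limits.servers and host not in limits.servers:
        return False
    # A domain matches the host itself or any of its subdomains.
    if limits.domains and not any(
        host == d or host.endswith("." + d) for d in limits.domains
    ):
        return False
    if limits.prefixes and not path.startswith(limits.prefixes):
        return False
    if limits.suffixes and not path.endswith(limits.suffixes):
        return False
    if limits.max_level is not None and level > limits.max_level:
        return False
    # The size may be unknown before the transfer starts; skip the check then.
    if size is not None:
        if limits.min_size is not None and size < limits.min_size:
            return False
        if limits.max_size is not None and size > limits.max_size:
            return False
    return True
```

For example, with `Limits(domains=("example.org",), suffixes=(".html", ".css"), max_level=2)`, a page at `http://www.example.org/docs/index.html` found at level 1 would be accepted, while an image on the same host or any document on `www.example.com` would be skipped.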