Pavuk

SourceForge

Last Update: April 19 2005

 
 

Last Changes :

* ---------- released version 0.9.33 (2005-09-27)
* fixed 64bit problems (BUG #1226863)
* updated German locale, fixes done by Debian developers (Hey, please inform
  us about errors. Scanning the net and all distributions for possible fixes
  is not very helpful.)
* ---------- released version 0.9.34 (2006-01-09)
* security fixes
* some minor bug fixes
* reworked build system a lot, fixed RPM spec file
* now builds fine using most of the possibilities pavuk provides
* RPM builds on openSUSE build service for SUSE since version 9.3, Fedora
  since version 4 and Mandriva since version 2006
* RPM packages can be found here:
  http://software.opensuse.org/download/home:/dstoecker/
* ---------- released version 0.9.35 (2007-02-21)
* added -persistent/-nopersistent option

2007-april-30 [notes taken from old work back in 2005/2006 merged into pavuk mainstream source tree]

* bufio has seen a MAJOR overhaul. It is now capable of pushing text &
  binary data to the file system at unprecedented rates. This is done by
  adding a variable sized (and possibly large) memory cache, resulting in
  large size I/O operations. These perform very much faster than the regular
  RTL I/O calls. (tested on quad CPU UNIX Dell servers)

  the new bufio was required as I needed to log/track a huge amount of data
  in the shortest possible time / lowest possible CPU load.

* cookie handling has been fixed/augmented. pavuk can now have the initial
  cookie values that go with a certain web request preconfigured on the
  commandline. Also, several bugs in handling the cookies have been fixed.
  (tested on a wicked ASP.NET intranet site which 'assumed' the use of a
  special web client (a TV set top box) which would transmit its serial #
  as a client-side created(!) cookie to the web server. This site/client
  combo thus actually transmitted cookies which would first show up in a web
  _request_ instead of the usual: a server-side _response_.)

* several portability items have been changed (h_errno, ...) to make the
  code compile and work on the odd-flavored UNIX box. A native Win32 port is
  under way: it now works, including zlib and OpenSSL, though the latter has
  not been tested recently.

  Note that the changes may have broken GTK support, as I was not able to
  build the code with GTK on my UNIX boxes.

* socket I/O (IP traffic) has been fixed to properly cope with user breaks
  (a user hitting Ctrl+C). Several locations in the software where the
  unexpected signal would cause an infinite loop have been identified and
  fixed.

* added several lines of DEBUG_xxx to aid both developer and user in
  tracking down hard to diagnose issues inside pavuk while scanning a site.

* Accept-Encoding (more specifically: the handling of x-gzip/gzip/
  x-compress/compress encoding) has been changed to allow for better
  portability: data is expanded in-memory, without the need for an external
  'gzip' tool and/or OS-specific forks & pipes.

  (Win32 wouldn't know a fork if ever it saw one.)

* ALL stdio is now handled through the new bufio system. This not only
  improves performance when you've got -debug and -debuglevel dialed all the
  way up, but also corrected several spots where, depending on your C RTL,
  stdout/stderr traffic would arrive at different moments on your console
  (some of it was written through the FILE I/O, some through direct I/O,
  causing blurbs of output to pass one another along the way to the actual
  console).

* buffer overrun protection has been improved. Note also that every
  snprintf() and derivative thereof is now 'augmented' by an additional line
  of code which ensures that the last character in the buffer is guaranteed
  to be a NUL sentinel, thus ensuring that the buffer will always present
  data in correct C string format (NUL-terminated). (This is an old habit of
  mine as some C RTLs have proven to be kinda flaky on the subject of NUL
  sentinels when snprintf() et al are writing data up to the edge of their
  output buffers: some C RTLs 'forget' to put a NUL there under particular
  circumstances; some commercial Watcom compiler releases come to mind.)
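
  A minimal sketch of that habit. The helper name safe_snprintf() is
  hypothetical and not taken from the pavuk sources; it only shows the one
  extra line that forces the sentinel:

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>

/* Hypothetical helper, not pavuk's actual code: format into buf and force a
 * NUL sentinel into the last byte, whatever the C RTL's vsnprintf() did. */
static int safe_snprintf(char *buf, size_t size, const char *fmt, ...)
{
    va_list ap;
    int n;

    assert(buf != NULL && size > 0);
    va_start(ap, fmt);
    n = vsnprintf(buf, size, fmt, ap);
    va_end(ap);
    buf[size - 1] = '\0';   /* the 'additional line of code' */
    return n;
}
```

  On a conforming C99 RTL the extra line is redundant; it only matters on
  the flaky ones.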

* multithreading pavuk has been tested on a high perf MP UNIX box and it
  was like the documentation/notes state somewhere: unstable. The thread
  interlocking has now been fixed; one of the hardest to fix proved to be
  the lockup at the end of a pavuk run. The fix also includes the use of
  semaphores and some additional code changes to make the code thread safe;
  critical sections are now handled as such. This includes placing several
  non-threadsafe C RTL calls (e.g. ctime()) inside critical sections!

* auto-form-filling (the feature which led me to select pavuk over wget et
  al when I started the hammer/chunky project) has been fixed for those
  special pages where you have an empty form to submit: the site I had to
  test included such a form, which was submitted using javascript, but did
  not contain _any_ input fields (but cookies were expected to come with
  that request, thank you). Before, pavuk crashed on such a page. This has
  now been fixed.

* added a 'reindent' target to the makefile, using GNU indent to reformat
  the code. (When you're working several weeks on end in crunch time, you
  want to see some proper and consistent looking source code, even when you
  just made it a mess yourself...)

  Also extended the cleanup makefile target to help me in cleaning up any
  backup and/or temporary files created by vi and some log diagnostic
  scripts.

  [edit may/2007: wasn't this already in the makefiles before - see
  ChangeLog entry in 2003?]

* added several commandline parameter types, which allow you to instruct
  pavuk to use OS file handles or file names for logging activity, while you
  can now also specify whether a log file should be overwritten (default) or
  appended to (new feature) by adding another '@' prefix to the file path.

  TODO: document this properly.

* added hammer/chunky modes: several ways to scan a web site and then
  rescan it. The higher (later) hammer mode has been specifically written to
  use pavuk as a 'replay attack' based DoS tool for testing high performance
  web servers. (bufio was overhauled to allow us to log all I/O data +
  diagnostics to disc while hammering the server while the pavuk system
  _must_ perform better (= faster) than the web server when running both on
  equivalent hardware.)

* The native Win32 port has been overhauled (previous code was never
  released to the public) to make sure I did not have to look for
  OS-specific path elements _everywhere_ in the code (it was becoming a
  code-wise maintenance nightmare while fixing up/down all those 'absolute
  path' and 'path expansion' code sections to handle Win32 drive letters:
  root is '[A-Z]:[\\/]' instead of simply '/').

  This has been fixed by using the cygwin 'path hack' for the native Win32
  port too: root is '/cygdrive/[a-z]/' so it looks exactly like a UNIX path.

  Any places in the code which need to address the OS while passing an
  OS-specific path are now handled almost invisibly: all relevant C RTL calls
  (fopen/open/stat/lstat/symlink/link/unlink/rename/mkdir/rmdir/opendir) are
  now encapsulated in tl_[sysname] wrapper functions where these
  /cygdrive/[x]/ paths are converted back to native Win32 paths before the
  actual C RTL function is called. Also any debug/print statement, which is
  used to report a file path, is fixed to convert file paths to the native
  representation with a minimum of fuss: see the new tl_native() call for a
  description of how this was done. This code has not been tested in a UNIX/MP
  environment, but the design is such that this should not cause any trouble
  (pthread port for Win32 is in progress ATM).
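
  The conversion back to a native path can be sketched roughly as below.
  This is NOT the real tl_native() code: the function name, signature and
  buffer handling are assumptions made for illustration only.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Hypothetical sketch (NOT the real tl_native()): convert
 * "/cygdrive/c/foo/bar" to "c:\foo\bar"; other paths are copied unchanged.
 * Returns dst, or NULL when dst is too small. */
static char *cygdrive_to_native(const char *src, char *dst, size_t dstsize)
{
    static const char prefix[] = "/cygdrive/";
    const size_t plen = sizeof(prefix) - 1;
    size_t i, o = 0;

    assert(src != NULL && dst != NULL);
    if (strlen(src) + 1 > dstsize)  /* the result is never longer than src */
        return NULL;

    if (strncmp(src, prefix, plen) == 0 && isalpha((unsigned char)src[plen])
        && (src[plen + 1] == '/' || src[plen + 1] == '\0')) {
        dst[o++] = src[plen];       /* drive letter */
        dst[o++] = ':';
        i = plen + 1;
        if (src[i] == '\0')
            dst[o++] = '\\';        /* bare drive root: "c:\" */
        for (; src[i] != '\0'; i++)
            dst[o++] = (src[i] == '/') ? '\\' : src[i];
    } else {
        for (i = 0; src[i] != '\0'; i++)
            dst[o++] = src[i];
    }
    dst[o] = '\0';
    return dst;
}
```

  A wrapper around, say, fopen() would run the path through such a
  conversion right before calling the C RTL.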

* added -debug_level modes: all/trace/dev/bufio/cookie/htmlform. Also added
  a feature where you can now specify a set of debug levels and have some of
  those levels _removed_, e.g. 'all,!dev' will show anything _except_ 'dev'
  level debug output: note the new '!' prefix.
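
  How such a level set with '!' removals could be parsed, as a sketch only:
  the level names are the ones listed above, but the bit values, the enum
  and the function name are made up and do not mirror the pavuk sources.

```c
#include <assert.h>
#include <string.h>

/* Level names come from the changelog entry; the bit values are made up. */
enum { DL_TRACE = 1, DL_DEV = 2, DL_BUFIO = 4, DL_COOKIE = 8, DL_HTMLFORM = 16,
       DL_ALL = 31 };

static const struct { const char *name; unsigned bits; } dl_names[] = {
    { "all", DL_ALL }, { "trace", DL_TRACE }, { "dev", DL_DEV },
    { "bufio", DL_BUFIO }, { "cookie", DL_COOKIE }, { "htmlform", DL_HTMLFORM }
};

/* Parse e.g. "all,!dev" left to right. Returns 0 on success and stores the
 * mask in *out; returns -1 on an unknown or empty level name. */
static int parse_debug_levels(const char *spec, unsigned *out)
{
    unsigned mask = 0;

    assert(spec != NULL && out != NULL);
    while (*spec != '\0') {
        size_t len = strcspn(spec, ",");
        int negate = (len > 0 && spec[0] == '!');
        const char *name = spec + negate;
        size_t nlen = len - (size_t)negate;
        size_t i, n = sizeof(dl_names) / sizeof(dl_names[0]);

        for (i = 0; i < n; i++)
            if (strlen(dl_names[i].name) == nlen
                && strncmp(dl_names[i].name, name, nlen) == 0)
                break;
        if (i == n)
            return -1;
        if (negate)
            mask &= ~dl_names[i].bits;   /* '!' prefix: remove the level */
        else
            mask |= dl_names[i].bits;
        spec += len;
        if (*spec == ',')
            spec++;
    }
    *out = mask;
    return 0;
}
```

  Because the list is processed left to right, order matters in this
  sketch: '!dev,all' would end up with everything enabled again.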

* -debug_level output is now prefixed with its level in caps and square
  brackets, e.g. '[PROCE]' to aid in filtering the debug output (for
  instance by piping it through sed/grep).

* unified debug output handling in the code: -debug_levels are now only
  active when you specify -debug too.

* inflate_decode() and gzip_decode() have been fixed to suit a multithreaded
  environment. gzip_decode() now has an in-memory implementation, using the
  zlib library, for those systems which do not support UNIX pipes/forks.

* Fixed deflate/compress handling: the MJF Accept-Encoding deflate hack has
  been removed and the request header extended. (tested on a Wikipedia
  HTTP/1.1 compliant server)

  You may wish to permanently disable the code within


  in decode.c if you do not wish to depend on the external gzip tool any
  more.

* _all_ system header file #include's have been removed from the sources and
  integrated into config.h to allow for better portable source code.

  config.h.in and autoconf.am have been extended to include several more OS-
  dependent system call and header file checks.

  A separate native Win32 version of the header file is also provided (used
  by the MSVC2005 native Win32 build).

* several hardcoded buffer sizes in the software have been turned into
  #define's (still fixed at compile time, but now changeable in one spot).
  See for instance dinfo.c: 12 --> PAVUK_INFO_DIRNAME and
  1024-and-other-fixed-buf-sizes --> BUFIO_ADVISED_READLN_BUFSIZE

* fixed several cases where dangling (i.e. free()d but not NULL-ed) pointers
  caused havoc. Code has been quickly reviewed to locate and fix additional
  spots that did not yet cause pavuk to go 'crazy Ivan' (Hunt for the Red
  October, anyone? ;-) )
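
  The usual cure, as a sketch: free and NULL in one step. The macro and
  the struct below are hypothetical illustrations, not names from the
  pavuk sources.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical macro: free the block and NULL the pointer in one step, so a
 * later free() or NULL check on the same variable is harmless. */
#define FREE_AND_NULL(p) do { free(p); (p) = NULL; } while (0)

/* Made-up stand-in for a structure that owns a heap block. */
struct doc_sketch { char *contents; };

static void doc_sketch_release(struct doc_sketch *d)
{
    assert(d != NULL);
    FREE_AND_NULL(d->contents);   /* safe to call twice: free(NULL) is a no-op */
}
```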

* hardcoded lock filenames have been converted to #define's to allow these
  to be changed in a single spot (config.h), improving portability. e.g.:

    '._lock' --> PAVUK_LOCK_FILENAME

* UNIX-specific octal privs have been changed to their proper #define's to
  allow for maximum portability (Win32 doesn't know '0644' but can cope with

    S_IRUSR | S_IWUSR | S_IRGRP | S_IROTH

  though maybe in an odd way).

* fixed quite a few spots where an unidentified form encoding method would
  lead to _very_ unstable behaviour, including crashes/core dumps. Look for

    fi->method = FORM_M_UNKNOWN

  assignments and additional FORM_M_UNKNOWN checks.

* added -no_dns support for those who have to work in an environment with
  flaky or no DNS support (I had to as I was working on a box in a specially
  configured, partially walled-off DMZ zone while developing and testing
  pavuk against a web server.)

* fixed typos in the text as I came along them.

* the bufio overhaul also led to an overhaul of the -dumpxxx code,
  removing/fixing several spots in the code which caused incorrect/unstable
  behaviour. (e.g. code in doc.c)

* Fixed handling of compressed data for any text-based server response;
  pavuk now correctly handles any gzipped/deflated text, including, for
  instance, any 'text/javascript' content sent over the wire in compressed
  form (tested on a Wikipedia-based HTTP/1.1 compliant server).

* added -progress_mode: several choices in progress verbosity.

* added -no_disc_io: test a grab/scan without writing anything to disc.
  Mostly useful in combination with the earlier -hammer modes.

* fixed/updated HTTP error response handling in accordance with RFC2616 so I
  can better see what an HTTP/1.1 compliant target is reporting back to
  pavuk. (errcode.c et al)

* unified timing units to fix a few timing oddities: instead of minutes,
  etc. the code uses seconds everywhere (apart, of course, from the few
  locations where we use milliseconds ;-) )

  -timeout is now in milliseconds!

* Added -rtimeout and -wtimeout command line parameters.
  (unit: milliseconds)

* added -allow_persistent / -noallow_persistent commandline arguments to
  allow/disallow the use of HTTP/1.1 persistent connections.

* added -dumpcmd and -dumpdir commandline arguments.

* added -bad_content commandline argument for use with the hammer/chunky
  modes.

* added -report_url_on_err commandline argument: report the URL which was
  processed while the error occurred.

* added -test_id commandline argument: this is included in the timing report
  so reports can be better automatically processed / combined.

* added -page_sfx commandline argument to help pavuk identify what suffixes
  are to be considered web pages (useful for scanning ASP and ASP.NET sites
  which present unusual mime types with their pages).

* added -tlogfile4sum commandline argument: specify a log file where timing
  info is stored. Handy when pavuk is not only used to grab the info off a
  site but also scan & report site performance.

* added -encode commandline parameter as the counterpart of -noencode.

* added -nohtDig, -noquiet and -noverbose commandline parameters as
  counterparts of -htDig, -quiet and -verbose respectively.

* added filepath support to -dumpfd and -dump_urlfd: by specifying the
  option prefixed with a '@' character, pavuk will treat the option value as
  a filepath specification instead of an OS file handle and subsequently open
  the specified file internally. Note that adding yet another '@' character
  as a prefix signals pavuk to _append_ to the specified file, instead of
  _overwriting_ it.

  This is useful when you wish to have those dumps but are working in an
  environment where you cannot pass valid file handles through the
  commandline.
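
  The '@' / '@@' convention could be decoded along these lines. This is a
  sketch under assumptions: the type and function names are invented here
  and the real option parser in pavuk may well be organized differently.

```c
#include <assert.h>
#include <stdlib.h>

enum dump_kind { DUMP_FD, DUMP_FILE_OVERWRITE, DUMP_FILE_APPEND };

struct dump_target {
    enum dump_kind kind;
    int fd;             /* valid when kind == DUMP_FD */
    const char *path;   /* points into the option value otherwise */
};

/* Returns 0 on success, -1 when the value is neither a handle nor a path. */
static int parse_dump_target(const char *value, struct dump_target *t)
{
    char *end;
    long fd;

    assert(value != NULL && t != NULL);
    t->fd = -1;
    t->path = NULL;
    if (value[0] == '@') {
        int append = (value[1] == '@');     /* '@@path' means append */
        t->kind = append ? DUMP_FILE_APPEND : DUMP_FILE_OVERWRITE;
        t->path = value + 1 + append;
        return (*t->path != '\0') ? 0 : -1;
    }
    fd = strtol(value, &end, 10);           /* plain number: OS file handle */
    if (end == value || *end != '\0' || fd < 0)
        return -1;
    t->kind = DUMP_FD;
    t->fd = (int)fd;
    return 0;
}
```

  The caller would then open the path with "w" or "a" semantics, depending
  on the kind it got back.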

* added -dump_request and -nodump_request commandline arguments for use
  with -dumpfd: when -dump_request is specified, the log file will include
  a complete dump of each request sent to the server by pavuk. Thus you can
  produce a complete audit trail of the exchange.

* replaced the DUMP_URLLIST macros in stats.c by two functions. Code is a
  bit cleaner that way.

* fixed times.c which barfed on timestamps beyond 2037 (signed int wrap
  around for time_t).

* added assert() checks at several locations in the code to help track down
  unexpected behaviour which could lead to crashes (like it did till now).

* unified the proliferation of HEX2ASC-alike macros with and without off-by-
  one offsets inside. Now there's one macro for each of 'em in tools.h.

* changed the configure.in option to --disable-threads to keep the pattern
  consistent (--disable-xxx series of options in configure), but the default
  behaviour remains the same.

* configure.in: as --disable-debug removes any debug-_related_ features from
  the pavuk build, these options have been added: --disable-debugging will
  create a default build with all debugging removed from the compiled
  binaries. --disable-prof and --disable-gprof have been added to remove any
  profile info from the default compiled binaries.

* added checks in configure.in for socklen_t, pid_t and a bunch of system
  calls and header files that do not live in each environment.



2007-may-6

* included pthreads-Win32 based multithreading support in the native Win32
  build.

* included EXPERIMENTAL tre (regex) support in the native Win32 build.

* fixed several lurking bugs (buffer overruns, etc.) which only showed in a
  multithreaded environment.

* fixed locking bugs in the new bufio implementation.

* added Win32 memory leak + heap checking for the DEBUG build: many memory
  leaks have been tracked and fixed. (MSVC <crtdbg.h> based)

* fixed memory leak due to wrong scope in report_error() code.

* added DBGxxx macros to aid heap tracking for the debug build. See
  DBGdecl/DBGpass/DBGvars usage.

* removed a very nasty memleak in html_parser_get_url() which would leak at
  least 3 blocks for each rejected local anchor URL - and those come quite a
  few! Took me a day to track it down. :-(

* added filtering so gzipped/compressed files on the server are not
  decompressed unintentionally while the server supports
  Accept-Encoding: gzip or compress.

  ( doc_download_helper() in doc.c )



2007-may-11

* renamed function should_leave_persistent() to the more appropriately named
  should_keep_persistent()

* Updated 'chunky' source to the state of the latest pavuk CVS contents (as
  of today) as this code has not yet been merged into CVS itself.

* fixed bugs in -scenario handling, when scenario files produced by pavuk are
  re-used in the Win32 environment

* fixed bugs in path & file type commandline arguments for the native Win32
  port.

* fixed bug in retrying/resuming download for RFC2616 (HTTP/1.1) 'chunked'
  content download handling.

* merged -allow_persistent / -noallow_persistent commandline arguments with
  the equivalent -persistent/-nopersistent feature from the official pavuk
  CVS sources.

  Also improved the code a bit: added the 'Connection: close' header for
  requests over -nopersistent connections, so the server will close the
  connection for us.

* added the -ignore_chunk_bug commandline argument to allow pavuk to handle
  RFC2616 'chunked' downloads from buggy (IIS) web servers.

  ( See also:
  http://www.subbu.org/weblogs/main/2004/11/persistent_conn.html
  http://skrb.org/ietf/http_errata.html#chunk-size
  http://www.apps.ietf.org/rfc/rfc2616.html#sec-3.6.1
  http://www.jmarshall.com/easy/http/
  )



2007-may/june

* recompiled in 64-bit Linux (SuSe 10.2) and fixed a few items in the
  Makefile.am, configure.in and ac-config.h.in files. Also added the tests\
  and www\ directories to the distro.

* fixed a few 64-bit compile warnings; at least the test cases in tests\
  perform OK now on a 64-bit Linux system.

* updated the man page a bit; still a lot more to do. Where is that 'nroff
  for dummies' cheatsheet when you need it?  ;-(

* listed -use_http11 as 'on' by default now.

* moved MODE_MIRROR unescape code section up in url.c to line 1682 in
  url_get_local_name_real() as this code would otherwise have no effect at
  all in any environment where the '%' percent character is included in the
  FS_UNSAFE_CHARACTERS charset (for example: Win32).

* PARAM_DOUBLE default values are now fixed point values in 'long' integer
  format; the current values in the program (all 0.0) are clearly within
  range _and_ it 'saves' on compiler warnings quite a bit. (We've still some
  way to go before we get anywhere near an '[almost-]zero-warning' cross
  platform portable build: a few int to pointer and vice versa casts remain.)

* fixed bug in cfg_get_num_params() which would access uninitialized memory
  out there in NirvanaLand when a PARAM_UNSUPPORTED option was passed to
  pavuk.

* Fixed configure.in to include 'debug' build handling for KDevelop (which
  would pass '--enable-debug=full' to ./configure).

* updated the configure.in script to increase portability (opendir/closedir:
  dirent.h et al)

* included a few autoconf macros in the m4 directory for easier/proper
  portability support using autoconf et al.

* bugs fixed from BUGS list: multithreaded mode is not as stable as single
  threaded (fixed at least for the CLI version of pavuk; the GTK GUI version
  is in a rather bad shape)

* bugs fixed from BUGS list: signal handling / timeout does not really work
  (at least not in multithreaded downloads; after a SIGINT pavuk just
  hangs). This has also been fixed for the CLI version of pavuk at least.

* Win32 port now includes JavaScript support (using the statically linked
  Mozilla js library).

* fixed short option definitions in options.h: -tp / -tsp et al

* 'fixed' GUI for Javascript enabled builds (GTK2) - WARNING: it compiles
  now, but has NOT been tested, so expect bugs here!

* merged the 'chunky' code with the pavuk main source tree. Now 'chunky' is
  equivalent to building pavuk with './configure --enable-hammer'.

* set default from -leave_site to -dont_leave_site to prevent 'blown up' web
  crawls when this filter parameter has not been specified.

  This change includes a fix for the cfg/command line handling of pavuk for
  the conditions section (see condition.h + config.c) as pavuk assumed
  sizeof(long)==sizeof(int) in these code sections.

* Now the proper GPL license (GPL, not LGPL) is included in the file
  ./COPYING.



2007-sep

* fixed processing of zero byte length files (robots.txt at figleaf.com,
  etc.): no more crash/assertion failure due to NULLed docu->contents.

* fixed a few memleaks.

* added extra error checking for file rename operations as some issues were
  found with the Win32 build when using a SAMBA-shared filesystem for
  storing the spidered data/files. (It turned out that the same issues
  existed when using native (NTFS, FAT32) filesystems.)

* dialed down the number of default threads from 3 to 1 (see BUGS) to
  prevent a hail of (legitimate) rename error reports.

* added flock() implementation for Win32: when built with multithreading
  support, having no valid flock() implementation is very dangerous!

* changed configure.in to detect both flock() and fcntl() file locking
  mechanisms so pavuk will be able to support writing spidered content to
  network shares on both Win32 and UNIX systems: flock() does not support
  network share locks, fcntl() does, at least on the latest Linux kernels,
  see man flock(2)

* added error reporting/checking for undesirable use of invalid flock()
  implementation. (Useful when porting pavuk to other non-Unix platforms.)

* Fixed content/file size treatment code for items which are already
  available locally (i.e. pavuk finds the item at the remote has not changed
  from when the last time it fetched the item into local cache).

* Fixed the conditions for when to display certain informational messages:
  less screen clutter when not running in '-verbose' mode OR when running in
  '-progress' modes.

* Fixed several error/info messages in the code section for decompressing
  gzip/compress transmitted HTTP content.

* Fixed handling of gzip/compress transmitted content when retrieved from
  local store instead (when pavuk discovers that the file at the remote site
  has not changed since the last time it was fetched and stored on your
  local disc).

* Fixed a few memleaks.

* Changed the DBGvars/DBGpass/DBGargs macros used for tracing memory
  allocations in debug mode to make these macros look more like regular 'C'
  functions to 'demented' code formatters and analysis tools. The drawback
  is that these still look 'weird' in function prototypes, but that causes
  quite a few less errors/warnings than the old style.

* Fixed bugs in get_abs_file_path() directory detection and Win32 abs path
  processing.

  Also fixed code which produced double slashes in file paths on occasion,
  causing trouble on Win32 platforms. (Fix applied generally.)

* Fixed mk_native() allocated string management pool to support printf() et
  al where up to 3 mk_native() calls are made in the argument list. This is
  important to prevent spurious crashes in multithreaded mode when the worst
  case scenario for mk_native() applies: all threads are executing a
  printf()-style statement which has multiple calls to mk_native() in the
  argument list.

  Currently overdimensioned a bit as the actual code only has two
  simultaneous calls while the pool now is dimensioned to tolerate 3
  simultaneous calls per thread.

* No more _strfindnchr() and strfindnchr(): strfindnchr() - and its use -
  has now been fixed to match the (proper working) _strfindnchr().
  [fnmatch.c/tools.c et al]

* Fixed const-correctness of several functions.

* Added '-mime_type_file' commandline option to help pavuk support an up-to-
  date list of mime types and their filename extensions, using, for example,
  the UNIX mime.types(5) config file as a source of MIME type information.

  If the user does not specify the '-mime_type_file' option, the original
  built-in defaults will be used instead.

  This feature has been added to provide better support for the pavuk
  -fnrules %M macro: this macro now will use this configuration to produce a
  suitable filename extension for each MIME type: the first extension listed
  in the '-mime_type_file' config file for the given MIME type will be used
  as extension for the %M macro.
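
  Picking the first extension from a mime.types(5)-style line
  ("type/subtype ext1 ext2 ...") can be sketched as below. The function
  name is hypothetical and the exact, case-sensitive type comparison is a
  simplification of this sketch, not a statement about pavuk's parser.

```c
#include <assert.h>
#include <string.h>

/* Look up mime_type in one line of a mime.types(5)-style file. On a match,
 * copy the FIRST extension to ext and return 1; return 0 otherwise
 * (comment line, other type, no extension listed, or ext too small). */
static int first_ext_for_type(const char *line, const char *mime_type,
                              char *ext, size_t extsize)
{
    static const char ws[] = " \t\r\n";
    size_t tlen, elen;

    assert(line != NULL && mime_type != NULL && ext != NULL && extsize > 0);
    line += strspn(line, ws);
    if (*line == '#' || *line == '\0')
        return 0;
    tlen = strcspn(line, ws);               /* length of the type field */
    if (tlen != strlen(mime_type) || strncmp(line, mime_type, tlen) != 0)
        return 0;
    line += tlen;
    line += strspn(line, ws);
    elen = strcspn(line, ws);               /* length of the first extension */
    if (elen == 0 || elen >= extsize)
        return 0;
    memcpy(ext, line, elen);
    ext[elen] = '\0';
    return 1;
}
```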

* Changed the GTK GUI macros to become functions for ease of debugging. The
  added (tiny) call overhead won't be a performance hit anyway.

* Fixed -fnrules handling: the generated path is cleaned up before it is
  returned to pavuk for use.

  Cleanup actions:
  - duplicate '/' slashes are removed
  - filenames and directory names which end in a '.' dot, get the dot
    removed

* Added '%X' to the -fnrules formatted processing to allow reformatting of
  filenames using an optional mimetype-derived extension. This is useful
  when grabbing Wiki (MediaWiki et al) sites when you'd like to store the
  grabbed content using default mimetype-related filename extensions, so
  instead of storing a file like

    wiki/page/AboutThisSite

  that would transform into

    wiki/page/AboutThisSite.html

  while pages like

    wiki/static_page/contact.htm

  would remain as is.

  (Note: this might be considered shorthand for a -fnrules (...) expression
   which compares both %e and %E. The intent of %X, however, is to only
   allow %e extensions to pass which are 'valid' for the given MIME type and
   force the %E mimetype based extension for all other cases.)

  CAVEAT: %e/%E/%X/%Y will print the extension WITHOUT the leading '.' dot in
          both simple mode and extended LISP mode.

* Added '%Y', '%A' and '%B' to the -fnrules macros: '%Y' uses the MIME type
  preferred filename extension if the URL/filename doesn't have an extension
  yet (while the rather similar '%X' will OVERRIDE the existing extension if
  it is not listed with the specified MIME type).

  '%B' prints the 'basic MIME type', i.e. the MIME type without the ';'
  semicolon separated MIME attributes such as language, etc., while '%A' will
  print these extensions (if they were passed to us by the server).

  CAVEAT: %e/%E/%X/%Y will print the extension WITHOUT the leading '.' dot in
          both simple mode and extended LISP mode.

  All this allows for pavuk -fnrules commandline arguments like this:

    -fnrules F '*' '%h:%r/%d/%b%s.%Y'
    -mime_types_file ./mime.types
    -tr_chr_chr ':\\!&=?' '_'

  so we'll be able to grab a [Media]Wiki site while storing those pages as
  regular 'abc_php_xyz.html', instead of 'abc.php?xyz' page/filenames.

* Added -fnrules 'fnseq' operator to the extended rules: compares a
  wildcard pattern and a string a la fnmatch(3).

* Checked and updated manpage for the -fnrules operators (added 'ud' and
  'sp' operators to the manpage).

* Added -fnrules 'sn' operator to the extended rules as counterpart of 'ns'.
  'sn' uses strtol() to convert a string to a number, while 'ns' uses
  printf() to format a number to a string. (See the man page.)

* Updated the man page a bit regarding '-fnrules'.

* sanitized escape_str(); a quick code review led us to a lurking bug in
  uconfig.c@309, which has been fixed implicitly.

* Added/updated source code documentation: tools.c/tr.c source code comments.

* Added some sanity checks in the code (tools.c/tr.c/lfname.c)

* Added debug_level 'rules' to allow debugging of both simple and 'extended'
  -fnrules expressions and '-fnrules' URL F/R matching.

* Different boxes exhibit different mktime() behaviour, especially when
  handling out of range tm value sets. Besides, mktime() works in 'local
  time' while some parts of the code require a robust UTC mkgmtime() (not
  available on many boxes) --> ripped & introduced as tl_mkgmtime(). A local
  time-aware equivalent with excellent out-of-range handling is available as
  tl_mktime().
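
  For reference, a UTC conversion which does not depend on the local
  mktime() at all can be sketched with plain calendar arithmetic. This is
  NOT the code that was introduced as tl_mkgmtime(): the names are
  hypothetical, the result is a long long instead of a time_t, and only
  the month is required to be in range (day, hour, minute and second
  overflows simply carry).

```c
#include <assert.h>
#include <time.h>

/* Days since 1970-01-01 for a proleptic Gregorian date (m: 1..12). */
static long long days_from_civil(long long y, int m, int d)
{
    long long era, yoe, doy, doe;

    y -= (m <= 2);                  /* count years from March 1st */
    era = (y >= 0 ? y : y - 399) / 400;
    yoe = y - era * 400;                                   /* [0, 399] */
    doy = (153 * (m + (m > 2 ? -3 : 9)) + 2) / 5 + d - 1;  /* [0, 365] */
    doe = yoe * 365 + yoe / 4 - yoe / 100 + doy;           /* [0, 146096] */
    return era * 146097 + doe - 719468;
}

/* Hypothetical stand-in for a UTC mkgmtime(): no time zone, no DST.
 * Returns seconds since the epoch. */
static long long sketch_mkgmtime(const struct tm *tm)
{
    assert(tm != NULL);
    assert(tm->tm_mon >= 0 && tm->tm_mon <= 11);
    return days_from_civil(tm->tm_year + 1900LL, tm->tm_mon + 1, tm->tm_mday)
               * 86400LL
           + tm->tm_hour * 3600LL + tm->tm_min * 60LL + tm->tm_sec;
}
```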

* Added additional error handling around calls which try to parse time
  stamps using tl_mkgmtime() and tl_mktime() (times.c).

  Basically, now both HTTP and FTP benefit from the new code which should
  now process timestamps like the UTC timestamps they are, while 'out of UNIX
  time_t bounds' timestamps (beyond the range 1970..2038 A.D.) are handled
  in a more sane manner:

  - out of bounds timestamps are reported by pavuk

  - out of bounds timestamps are then 'sanitized', i.e. restricted to the
    1/1/1970..31/12/2037 date range, i.e. a timestamp beyond the horizon,
    like '1/4/2051' will be 'sanitized' (= restricted) to the upper bound:
    31/12/2037. The same goes for timestamps from antiquity like
    '11/3/1969' (the birthday of a certain person), which will be
    'sanitized' towards 1/1/1970.

* Split up DEBUG into developer related stuff, such as memory/heap checking,
  ASSERT/VERIFY, etc. and user related stuff (the -debug and -debug_level
  command line arguments): ./configure is now fitted with an extra
  parameter:

  --enable/disable-debug-features

  which will turn on/off -debug/-debug_level user level debugging support in
  pavuk, while the existing

  --enable/disable-debug

  adds/removes additional developer checks, such as heap allocated checks
  and ASSERT and VERIFY macros.

  In the code, -debug/-debug_level related code is located within the
  'HAVE_DEBUG_FEATURES' sections, while the developer debug/release builds
  are still related to the standard 'DEBUG' #define.

  This now results in three ./configure options that determine the (debug)
  feature set of your binary:

  --enable/disable-debugging --> compile a binary with source level debug
                                 info included and all optimizations
                                 DISabled for improved debugging (by using
                                 gdb or another debugger of your choice)

  --enable/disable-debug     --> include/exclude additional run time checks
                                 in your binary. Most important are the
                                 ASSERT and VERIFY pre/post-condition
                                 validation methods located throughout the
                                 code. The use of these is advised, though
                                 these may cause a performance hit.

  --enable/disable-debug-features
                             --> include/exclude user level
                                 -debug/-debug_level command line features,
                                 which help you as a pavuk user to 'debug'
                                 pavuk during the run. Using -debug, pavuk
                                 will be EXTREMELY verbose, which can be
                                 toned down by applying a -debug_level
                                 restriction filter. For example:

                                   -debug -debug_level all,!devel

                                 will be VERY verbose, but will NOT log any
                                 DEVEL level debug info, while:

                                   -debug -debug_level !all,rules

                                 will ONLY produce additional output for the
                                 RULES level, i.e. when pavuk processes
                                 -fnrules and/or JavaScript macros.

* Fixed crash when non-RFC compliant website was grabbed: see testcase 7a.

* Added targeted help: when options cannot be parsed correctly,
  short_usage() will try to help the user by printing the full help for the
  offending commandline option only. (Of course, I screwed up while using
  debug_level flag sets _again_ :-( [Ger])

* Some improvements for network connectivity error handling and reporting.
  (xvherror() added.) This is the result of some FTP tests with pavuk (tests
  8b).

* Don't yak about 'Checking "robots.txt"' anymore when doing an FTP grab when
  robots.txt is NOT applicable anyway.

* FTP: added crude 'autodetect/retry' mechanism for FTP servers which do not
  like NLST (==> response code 550) but report correct directory content for
  LIST (or vice versa). (ftp.c)

* FTP/HTTP: at debug level 'protoD' pavuk will now dump RAW data/content
  received from the server before preprocessing (i.e. converting to HTML or
  decompressing).

* Added command line option integer sizing support: byte sizes can now be
  specified in K, M or G. Other integer values can also be postfixed with K,
  M or G, but then these will be treated like the ISO values 1000, 1E6 and
  1E9.
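
  A minimal sketch of such suffix handling. parse_scaled() is a hypothetical
  helper, not a pavuk function, and the 1024-based factors for byte sizes are
  an assumption: the entry above only names the decimal values used for
  other integers.

```c
#include <stdlib.h>

/* Hypothetical helper, not part of pavuk: parse "<digits>[K|M|G]".
 * is_byte_size selects 1024-based factors (an assumption) instead of
 * the decimal factors 1000, 1E6 and 1E9 named in the entry above.
 * Returns 0 on success, -1 on malformed input. No overflow checking. */
int parse_scaled(const char *arg, int is_byte_size, long long *out)
{
    char *end;
    long long value = strtoll(arg, &end, 10);
    long long base = is_byte_size ? 1024 : 1000;
    long long factor = 1;

    if (end == arg)
        return -1;              /* no digits at all */

    switch (*end) {
    case 'K': factor = base; end++; break;
    case 'M': factor = base * base; end++; break;
    case 'G': factor = base * base * base; end++; break;
    default: break;
    }
    if (*end != '\0')
        return -1;              /* trailing garbage after the suffix */

    *out = value * factor;
    return 0;
}
```

  With these assumptions "10K" yields 10240 as a byte size and 10000 as a
  plain integer.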

* Additional memory leak fixes in case pavuk is fed an invalid commandline.

* NTLM support code: fixed a few glaring bugs.

* Added O_SHORT_LIVED to lock file open() flags for better Win32 behaviour.

* Fixed code to load the pavuk configuration settings from, in order of
  appearance:

   env:PAVUKRC_FILE
   ~/.pavukrc
   SYSCONFDIR/pavukrc

  which matches the description in the manual.
  (see also man page)
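
  The search order above could be enumerated roughly like this. This is a
  sketch only: rc_candidate() is a hypothetical name, resolving '~' through
  the HOME environment variable is an assumption, and the "/etc" fallback
  merely stands in for the SYSCONFDIR value that ./configure would supply.

```c
#include <stdio.h>
#include <stdlib.h>

#ifndef SYSCONFDIR
#define SYSCONFDIR "/etc"       /* assumption: normally set by ./configure */
#endif

/* Hypothetical helper: write candidate number `idx` (0..2) of the search
 * order into dst. Returns 0 on success, -1 when that candidate is
 * unavailable (variable unset, or the path does not fit in dst). */
int rc_candidate(int idx, char *dst, size_t dstlen)
{
    const char *v;
    int n;

    switch (idx) {
    case 0:                     /* env:PAVUKRC_FILE */
        v = getenv("PAVUKRC_FILE");
        if (v == NULL)
            return -1;
        n = snprintf(dst, dstlen, "%s", v);
        break;
    case 1:                     /* ~/.pavukrc */
        v = getenv("HOME");
        if (v == NULL)
            return -1;
        n = snprintf(dst, dstlen, "%s/.pavukrc", v);
        break;
    case 2:                     /* SYSCONFDIR/pavukrc */
        n = snprintf(dst, dstlen, "%s/pavukrc", SYSCONFDIR);
        break;
    default:
        return -1;
    }
    return (n >= 0 && (size_t)n < dstlen) ? 0 : -1;
}
```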



2008-jan

* Added 'js' flag to '-debug_level', which is used to dump a lot of detail
  about the pattern matching and transformation applied to JavaScript code
  using the '-js_pattern' and '-js_transform / -js_transform2' commandline
  options.

* Added sanity check for '-js_pattern' and '-js_transform[2]' regexes, which
  MUST contain a subexpression for them to 'work' as expected.

* removed re_pmatch_sub() and changed the code where it was used to work
  with the available re_pmatch_subs() call, which allows for more elaborate
  validation anyway. See htmlparser.c.

* Removed a regex handling bug in the -js_transform[2] code, which would
  crash pavuk when using regexes where the first subexpression might be
  empty.

  The crash is due to the fact that the regex parser would return indexes
  '-1' for these empty subexpression(s), resulting in out-of-bounds memory
  writes in the rewrite code. This in turn would nuke the heap, so after
  that it was only a matter of time for pavuk to fail dramatically.
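
  The guard against '-1' offsets can be sketched with the POSIX <regex.h>
  API, where a subexpression that did not take part in the match is reported
  with rm_so == -1. pavuk can be built against several regex libraries, so
  treat this as an illustration of the check, not as the actual fix;
  copy_submatch() and first_submatch() are hypothetical names.

```c
#include <regex.h>
#include <string.h>

/* Copy the text of subexpression `idx` into dst (size dstlen).
 * Returns the number of bytes copied; 0 when the subexpression did not
 * take part in the match (POSIX reports rm_so == -1 for that case). */
size_t copy_submatch(const char *subject, const regmatch_t *pm, size_t idx,
                     char *dst, size_t dstlen)
{
    size_t len;

    if (dstlen == 0)
        return 0;
    dst[0] = '\0';
    if (pm[idx].rm_so < 0 || pm[idx].rm_eo < pm[idx].rm_so)
        return 0;               /* unmatched: never index with -1 */

    len = (size_t)(pm[idx].rm_eo - pm[idx].rm_so);
    if (len >= dstlen)
        len = dstlen - 1;       /* truncate rather than overflow */
    memcpy(dst, subject + pm[idx].rm_so, len);
    dst[len] = '\0';
    return len;
}

/* Match `pattern` against `subject` and extract subexpression 1.
 * Returns -1 if there is no match or the pattern fails to compile. */
int first_submatch(const char *pattern, const char *subject,
                   char *dst, size_t dstlen)
{
    regex_t re;
    regmatch_t pm[2];
    int rc = -1;

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;
    if (regexec(&re, subject, 2, pm, 0) == 0)
        rc = (int)copy_submatch(subject, pm, 1, dst, dstlen);
    regfree(&re);
    return rc;
}
```

  For the pattern "(a)?b" and the subject "b" the match succeeds while the
  first subexpression stays unmatched; the helper then yields an empty
  string instead of writing through a negative offset.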


2008 feb 04

* Added DEBUG_MISC() lines to solve sourceforge.net issue: [ 1852885 ] to
  improve manipulation by locally stored files

* Included provisional fix (I don't have a working sample run to reproduce
  the issue (yet)) for sourceforge.net issue: [ 1852884 ] infinite loop on
  unexpected responses

* Cleaned up the mess that was -progress_mode.

* Cleaned up several DEBUG_xxx macro mistakes

* Added a little description to the 'hidden' -htDig commandline option,
  which can be used to dump the server-transmitted MIME headers for each
  URL, similar to the htdig tool.

* Added a bit of documentation for the -rollback option (which was
  undocumented)


2008 mar 20

* GNU gettext tools don't like '\r' in i18n strings --> fixed by changing
  the related printf() statements in src/doc.c

* started update of configure scripts to the latest autoconf/automake.

  Also reordered the NEWS file so it will work with the new, stricter

    ./bootstrap && ./configure && make distcheck

  distro test cycle.


2008 jul 10

* fixed ';' semicolon bug in http.c near line 2074 which caused incorrect
  decoding of the HTTP/1.x response code header.

* fixed gzip/compress/... content compression support (HTTP/1.1
  Accept-Encoding); the previous code was a valiant attempt to 'fix' the
  client side (pavuk) to cope with buggy web servers which send the wrong
  encoding type for already compressed files, but this would screw up
  particular responses by *well-behaving* web servers. Of course this would
  only happen in rare circumstances so it was kinda hard to track down.

  Documentation for -Enc/-noEnc has been updated to reflect this situation
  and the code now (hopefully properly) finally supports compressed data
  transmission for RFC2616-compliant web servers.

  If you find that your 'downloaded' compressed files are already
  /incorrectly/ DEcompressed by pavuk, this is NOT the fault of the client
  (pavuk) but evidence that your server is behaving inappropriately and the
  proper remedy for this is the use of the option '-noEnc' which turns this
  feature off so the server is not allowed to screw up in this way any more.

  Also made sure one can check if pavuk has been built with compression
  support by calling 'pavuk --version' and looking at the feature list.

* autoconf/configure script: using the highly undocumented v_cflags or other
  x_* variables as environment variables to hack the configure script (you
  could do that, especially with v_cflags) has been obsoleted while the
  configure and m4/* scripts have been upgraded to support autoconf
  2.62/automake 1.10 and use ONLY *documented* AC.*/etc. macros from now on.

  Note: thanks to the JavaScript library issues on SuSe10.2/AMD64 (older JS
        lib version and seemingly partial header install), I may have failed
        to eradicate all undocumented macros.

* Extra note about configure.in: bash, at least on SuSe10.2/64-bit, handles
  'if eval test ...' just ever so slightly different than 'if test ...',
  especially where it comes to 'test -n'. As these styles were mixed rather
  arbitrarily before, the 'if eval test ...' style has been completely
  removed from the configure script, as this would sometimes render quite
  unexpected (and incorrect!) results.

* fix_crlf.sh has been updated to ensure important Microsoft Visual Studio
  files are not damaged by having their CRLF sequences converted to UNIX LF
  line endings: this kind of thing will make MSVC spit you in the face and
  reject everything you try until you give it back those CRLF line endings
  in there. So much for XML as project file format and MSVC...

* extra fixes to ensure 'make distcheck' does not barf up a hairball. This
  includes enforcing the permanent inclusion of the 'po' subdirectory in the
  Makefile set for multilingual support.

* configure/Makefile(s): if you don't have one or more of the
  archiving/compression tools compress/lzma/gzip/tar/7z(7zip) installed on
  your system, we don't go belly up at config ~ nor at 'make dist' time
  anymore. This, of course, includes correct behaviour at 'make distcheck'
  time: only use/test those 'GNU standard' formats, which can be created on
  your box.

* Added the 'bootstrap' shell script, next to 'autogen.sh'. I know they
  serve the (almost) same purpose, but 'bootstrap' is far more sophisticated
  than autogen.sh and I didn't wish to overwrite 'autogen.sh'. Besides, IDEs
  on UNIX boxen expect either the one or the other (there's no single
  'standard' for this), so we might as well provide both.

  At a later time, we might probably point autogen.sh to bootstrap.

* Updated the mime.types MIME 'hint' file: currently, it's a mix of

  1) all properly registered MIME types ( http://www.iana.org/assignments/media-types/ )

  2) the mime.types file provided with the latest Apache/XAMPP

  3) my (Ger Hobbelt) additional file extension hints as used on my own
     servers. This is mostly about professional graphics ~ and modern
     'scene' audio/video container formats, such as Matroska. This only adds
     extensions for otherwise already existing MIME types.

* Updated the DocBook-based documentation for several options (-Enc/-noEnc, ...)

* 'pavuk --version' now also reports if ZLIB support is included in the
  binary. This is important for '-Enc'.

* Fixed the '-Enc' compressed transmission and HTTP header processing code
  to act properly with fully RFC2616-compliant web servers, discarding the
  old 'hack/fix' attempt to solve a non-compliant server issue at the
  client, as this would break things for fully compliant servers in the rare
  (but extremely annoying) use case:

  - pavuk with '-Enc' option

  - webserver is fully RFC2616 compliant

  - pavuk issues request for file in a .tar.Z or other gzip/compress
    compressed format, where the file on the server is only slightly
    compressed (fastest compression).

  - webserver will transmit file to pavuk, but due to pavuk reporting it is
    able to handle compressed transmission AND the server discovering that
    the content can be compressed quite some more than it already was, the
    file will be transmitted after a server-side just-in-time compression
    round.

  - pavuk receives the data. The old hacked code would NOT decompress the
    data. However it SHOULD because the server PROPERLY reported
    'Content-Encoding: gzip' to pavuk. End result: grabbed data which you
    cannot process nor trust to be in the same format as stored on the
    server as it all 'depends' on arbitrary conditions which you cannot
    control: is the web server able to compress the data before
    transmission? Is the web server configured to allow compression? Etc.

  This use case has now been fixed.

  The effect of BADLY behaving web servers (which send 'Content-Encoding:
  gzip' for any .Z, .z or .gz files: IIS x.x and other servers which are not
  configured to /properly/ handle files and MIME types) is described in the
  DocBook manual page now, including the fix for this (specify the '-noEnc'
  commandline option with pavuk).

* active FTP: timeout and stop/break handling slightly improved: now pavuk
  should always terminate under all circumstances while a break or stop has
  been signalled.

* Changed the default for '-url_strategy' from 'level' to 'leveli' to make
  pavuk behave more like your regular web browser (with a user clicking
  through web pages).

* Initial fix for NTLM support for 64-bit Windows. (Only lightly tested.)

  This includes converting that bit of code to support the C99 intNN_t types
  (where NN is one of 8, 16 or 32), while the configure script takes care of
  providing the proper types for not-fully-C99-compliant environments.

* The TRE regex package would barf up a hairball due to the incorrect header
  file being loaded. ./configure now recognizes TRE specifics a bit better
  and the code now loads the proper header file (<tre/regex.h> instead of
  <regex.h>). This is important on systems which have multiple, ever so
  slightly incompatible regex processing libraries installed.

* Improved diagnostics a little bit by adding reporting support for
  URL_PARENT_REWRITING, i.e. the situation where a parent page of a grabbed
  page is loaded for the sake of adjusting (rewriting) the URLs in its
  content.

* Fixed code so it would compile in full (-DDEBUG) debug mode on UNIX.

* autoconf/configure: ran into some weird issues due to inconsistent M4 []
  quoting: quite a few lines did without it. Turns out that this is a BIG
  No!No! as adding the AX_ADD_OPTION() macro turned this lurking mess into a
  true disaster.

  Fixed by applying [] quoting throughout. The only place where I didn't do
  it, is in the first and second args of AC_DEFINE() -- which should be used
  instead of AC_DEFINE_UNQUOTED when you don't need the latter's extra
  functionality anyway -- and the first arg of AC_DEFINE_UNQUOTED(). Any
  other spot where [] quotes are missing in the M4 macros and/or
  configure.in? Consider that a bug and please report so I can fix it.

* Finally got the configure system to recognize my JavaScript libraries and
  all. Tugged and tweaked a few items in the bindings to allow maximum
  flexibility for the JS code when it is used to filter URLs (e.g.
  JavaScript pavuk_url_cond_check() function).

* Updated jsbind.c to use latest SpiderMonkey 1.8.x (tested on Win32)

* Changed man/Makefile to ensure HTML is not recreated every 'make' run, but
  only when manpage changes. This should really copy the results from
  ./doc/, but that's for later...

* DocBook documentation: tweaked man page generation to mimic original
  manpage title exactly.

* DocBook documentation: updated '-version' info (important to see at
  run-time what abilities you've got with /your/ pavuk).

* Win32/MSVC: all project files have been updated to produce next to
  Win32/x86: Win64/AMD64 and Win64/Itanium binaries. These project files
  assume the existence of all optional libraries: OpenSSL, SpiderMonkey
  (JavaScript), zlib.

  Where to get those, preferred directory layout, etc. to be published, so
  others can build from source on Win32/64 too and get the same results.



2008 jul 20

* tweaked configure+makefiles so that a 'make dist' from CVS becomes
  possible: there were quite a few references to yet unpublishable files in
  my makefiles (Ger Hobbelt).

* config section: improved adherence to C standards: no more potentially
  dangerous mixed use of function and data pointers by typecasting function
  pointers into data pointers and vice versa.

  This has been resolved by an added layer of indirection, which makes it
  all very legal C again. It goes somewhat like this:

    function_pointer_type ptr = &function;
    data_pointer_type d = &ptr;

  then use (d[0])(...) to call the function.

  This contrasts with the old code:

    data_pointer_type d = (data_pointer_type)&function;

  and function invocation using:

    ((function_pointer_type)d)(...)
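
  A compilable sketch of the indirection described above. handler_fn,
  twice(), get_handler_slot() and call_handler() are illustrative names
  only; pavuk's real types and tables look different.

```c
/* Illustrative names only; pavuk's real types differ. */
typedef int (*handler_fn)(int);

static int twice(int x)
{
    return 2 * x;
}

/* The function pointer lives in an ordinary object... */
static handler_fn twice_slot = twice;

/* ...so its address is a data pointer, which may legally travel as
 * void * through generic table code. */
void *get_handler_slot(void)
{
    return &twice_slot;
}

/* Convert back to "pointer to function pointer", then call through it. */
int call_handler(void *slot, int arg)
{
    handler_fn *fp = (handler_fn *)slot;
    return (*fp)(arg);
}
```

  The cast in call_handler() converts between two object pointer types,
  which standard C permits; no function pointer is ever cast to or from a
  data pointer.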

* Added support for parsing 'hidden' CSS and JavaScript in HTML. The support
  is also extended to generally parse inside HTML comments PLUS Microsoft IE
  CC's (Conditional Comments): <!--[if...]><![endif]-->

    -read_css
    -read_cdata
    -read_msie_cc
    -read_comments

  These are all enabled by default; documentation has been updated for these
  as well.

* Fixed CSS and [Java]Script handling in the HTML tokenizer/parser, which
  was feeding the filters and URL extractors (htmlparser.c).

  Now the code can cope better with incorrectly formatted pages / files.

* Reordered the HTML tags in htmltags.c in a preparatory move to check the
  list for missing attributes (onXXX JavaScript items for one! several are
  missing) and HTML 3/4 tags. (htmltags.c)



2008 aug 13

* updated the -debug_level related code; DEBUG_DEVEL() and a few others now
  'automagically' report the sourcefile+lineno without the need to specify
  these explicitly + some DEVEL_*() calls have been shifted to other
  '-debug_level' levels (net, mtthr, htmlform, ...)

* completed the -debug_level tracing for multithreaded runs: now all
  semaphore accesses can be traced using -debug_level mtthr

* Major fix for bufio+socket code: no more lockup for pavuk due to delayed
  reception of response data (tl_selectr() would incorrectly lock
  indefinitely -- which proved to be a generic coding mistake in both
  tl_selectr() and tl_selectw()) PLUS better error condition handling in
  an attempt to improve handling of all sorts of 'spurious error conditions'
  which may occur when your network suffers from packet loss or other
  undesirable effects.

* -mode remind code fix for multithreaded use to make it match recurse and
  other modes better; not severely tested so YMMV! (The old code wouldn't
  work anyway, so it's an improvement anyhow).

* few code cleanups (#if 0 ... #endif)

* DocBook manual updated: now all return codes from pavuk are documented.

* minor code fixes for SSL/SFTP.

* updated configure and code to assist in compiling with both latest
  SpiderMonkey and older Mozilla JavaScript libraries (Win32/64 and UNIX
  respectively).

* Some unused error checks replaced by ASSERT() and some ASSERT()s replaced
  by error reports as those errors /can/ happen in actual use (though
  seldom).

* Fix for parsing malformed URLs (with multiple '#' and/or '?'): bookmarks
  and query string parts would not be stripped/detached correctly as the
  last '#'/'?' instead of the FIRST occurrence of '#'/'?' would be picked as
  a separation point.
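
  The difference comes down to searching from the front (strchr) instead of
  from the back (strrchr). A minimal sketch; split_url() is a hypothetical
  helper and not pavuk's URL parser, and treating a '?' that follows the
  first '#' as part of the bookmark is an assumption:

```c
#include <string.h>

/* Hypothetical helper, not pavuk's parser. Finds the FIRST '#' and the
 * FIRST '?' (strchr), never the last (strrchr). Either output is NULL
 * when that part is absent; a '?' located after the '#' is treated as
 * part of the bookmark (an assumption). */
void split_url(const char *url, const char **query, const char **bookmark)
{
    const char *hash = strchr(url, '#');
    const char *qmark = strchr(url, '?');

    if (hash != NULL && qmark != NULL && qmark > hash)
        qmark = NULL;
    *query = qmark;
    *bookmark = hash;
}
```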

* Ran the gettext files through pot/pox/po again. Lots of 'fuzzies'... These
  need to be fixed.

* EXPERIMENTAL: added preliminary code for extended JavaScript support:
  hooks to process HTML and CSS just like you can process embedded <SCRIPT>s
  now. The new hooks are still 'nulls', i.e. do not have any effect.

  This is a work in progress; it compiles & runs (tested on UNIX and Win32
  in multithreaded mode) but the new hooks still need to be implemented.

  The goal here is that all grabbed (parsable) content should be processable
  by custom JavaScript script functions AND when more than one URL is found,
  the JavaScript code should be allowed to add those extra URLs to the pavuk
  queue (using the new url.queue() JavaScript PavukUrl object method --
  currently a 'nil' member function as it still must be fully implemented).

* isatty() fixes which check for error conditions and do /not/ provide
  special 'console oriented' features when isatty(0) produces an error (may
  happen on Win32/UNIX).

* Checked and updated all header files (after I ran into a cyclic dependency
  when changing a bit of code): no .h files will #include "config.h"; all .c
  files /do/ #include "config.h" as the first header.

  System-dependent stuff (TRUE/FALSE definitions and a few other bits) has
  been moved to config.h (where it belongs IMO) and removed from tools.h

  This is a change required for the gzip fix [SF bug #2050527].

* Preliminary fix for CSS url grabbing and rewriting bug [SF bug #2050537].

  The new code will now try to keep these four styles of <url> formatting
  in CSS intact -- this is done so as to keep particular CSS browser hacks
  intact as much as possible:

    @import "<url>"
    @import url(<url>)
    @import url('<url>')
    @import url("<url>")

  and of course the use of 'url()' elsewhere in any CSS is treated like the
  examples above, i.e. NONE of these should be changed regarding <url>
  delimiters (quotes or parentheses) when rewritten by pavuk.

  The ONLY situation where pavuk will CHANGE the quotes is when a <url> is
  found to contain the delimiter quote itself: in that case the quotes are
  changed from ' to " and vice versa.
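
  The quote rule can be sketched as a tiny helper. css_url_quote() is a
  hypothetical name; a URL containing both quote kinds is not covered by the
  description above and is not handled here.

```c
#include <string.h>

/* Hypothetical helper: pick the quote character to emit around a
 * rewritten CSS <url>. `orig` is the quote found in the source (' or ").
 * The original quote is kept unless the URL itself contains it, in
 * which case the other quote is used. */
char css_url_quote(const char *url, char orig)
{
    if (strchr(url, orig) == NULL)
        return orig;
    return (orig == '\'') ? '"' : '\'';
}
```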


2008 aug 18

* minor fixes to the included mime.types file

* configure: added support/auto-detection for the GNU GDB extended debug
  output (-ggdb -g3) for when building a debug build.

* NTLM: fixed code for Win64 and other 64-bit platforms which do or do not
  support structure packing.

* documentation update: -[no]chunk_bug commandline argument finally
  documented (was in there already for a longer time; is a special fix for
  badly behaving IIS web servers which transmit data in 'chunked mode').

  Also upgraded the documentation for the -tr_str_str/tr_chr_chr options so
  one can finally read how to use [:print:] and other definitions in there
  for -tr_chr_chr and be able to determine up front what the bugger will do
  for you.

  For example:

  Why does -tr_chr_chr '[:hexnum:]' '0123456789abcdef' *not* do what you
  expect when the filename has any of the a..f characters? (Answer: they all
  become 'f' as [:hexnum:] actually expands to

    '0123456789ABCDEFabcdef'

  itself, so it is longer than the destination set and by definition any
  'overflow' will be replaced by the last character in the target set.)
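
  The overflow rule can be reproduced with a few lines of C. tr_chr() below
  is a hypothetical re-implementation of the mapping rule only, not pavuk's
  tr() code, and it takes the already expanded source set as input.

```c
#include <string.h>

/* Hypothetical re-implementation of the mapping rule described above,
 * not pavuk's tr() code: each character found at position i in `from`
 * becomes to[i]; positions beyond the end of `to` map to its last
 * character. `to` must not be empty. Translates `s` in place. */
void tr_chr(char *s, const char *from, const char *to)
{
    size_t to_len = strlen(to);

    for (; *s != '\0'; s++) {
        const char *hit = strchr(from, *s);
        if (hit != NULL) {
            size_t i = (size_t)(hit - from);
            *s = (i < to_len) ? to[i] : to[to_len - 1];
        }
    }
}
```

  With from = "0123456789ABCDEFabcdef" and to = "0123456789abcdef", the
  input "CAFE" becomes "cafe", but "cafe01" becomes "ffff01": the lowercase
  letters sit at positions 16..21 of the source set, beyond the 16
  characters of the target set.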

* HTML/CSS/JavaScript parent rewriting was sometimes flaky; this has been
  fixed by fixing several bits of antiquated code in pavuk: now all code
  sections are equally aware of URL_ISHTML, URL_ISSTYLE and/or URL_ISSCRIPT.

  Several functions have been adapted to mirror the new awareness:

  ext_is_html() has been enhanced and has been renamed to actually show its
  intended function: ext_is_parsable() -- which can be an HTML, CSS *or*
  JavaScript file! (Not only HTML can be parent of other URLs and need
  updating ('URL parent rewriting').)

  [ SF bug #2050537 ] CSS @import bad / HTML corrupted --> fixed

* On SuSe10.2/AMD64 glibc6 dumped core when running pavuk in full-out
  '-debug -debug_level all' (the latter is implicit when you use '-debug')
  mode. This was caused by glibc's printf() functions *sensibly* executing
  a strlen() operation on the data fed to one of several '%.*s' printf()
  formatting parameters, while those data series had NOT been NUL
  terminated.

  This would happen when debugging pavuk while fetching data from a gzip-
  enabled web server: the gzip/inflate code would NOT append a new NUL
  sentinel.

* Several other '%.*s' and '%s' related core dump spots in the DEBUG_XYZ()
  code which would dump downloaded content have been fixed by feeding the
  data through an enhanced asciidump function -- which will switch to HEX
  dumping when the content to be shown for scrutiny contains a large amount
  of non-ASCII data (> 10% is the current heuristic to switch over).
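
  The switch-over heuristic might look roughly like this. should_hex_dump()
  is a hypothetical name, and which bytes count as 'non-ASCII' in pavuk is
  an assumption here. The buffer is always passed with an explicit length,
  so nothing depends on a NUL terminator.

```c
#include <stddef.h>

/* Hypothetical sketch of the switch-over rule: returns 1 when more than
 * 10% of the bytes are neither printable ASCII nor common whitespace,
 * i.e. when a hex dump is the better choice; 0 otherwise. What exactly
 * counts as "non-ASCII" in pavuk is an assumption here. */
int should_hex_dump(const unsigned char *buf, size_t len)
{
    size_t i, odd = 0;

    for (i = 0; i < len; i++) {
        unsigned char c = buf[i];
        int plain = (c >= 0x20 && c < 0x7f) ||
                    c == '\n' || c == '\r' || c == '\t';
        if (!plain)
            odd++;
    }
    return odd * 10 > len;      /* more than 10%, without floating point */
}
```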

* glibc6 on SuSe10.2/AMD64 would also dump core when being fed a 110K string
  to a printf '%s' statement. This has been fixed by always limiting the
  amount of content to be displayed when debug-printing downloaded data
  (various '-debug_level's)

* gzip/inflate would fail to perform on 'non-parsable' content, i.e. plain
  text files downloaded from a gzip-enabled web server. This has been fixed.

  CAVEAT: The current gzip/inflate code does not deliver when it is fed very
          large files. Hence, when downloading VMware images and/or multi-GB
          ISO files, a workaround is to specify -noEnc. This will be fixed
          at a later date.

  [SF bug #2050527] nonparsed files saved in (wrong) compressed when using
                    HTTP --> fixed

* Parent rewriting would try to treat all parents as HTML, which is VERY
  wrong when the actual parent is a CSS stylesheet or a JavaScript script
  file. Fixed.

* unified variable names for 'struct doc' variables: it is *QUITE*
  irritating to lose your display of 'docu' contents just because this call
  uses 'docp' for the same (or 'html_doc') while trying to track down
  lurking parent rewriting and file URL parsing bugs.

  Updated all sourcefiles to the use of varname 'docu' for the current
  document. 'docp' and 'html_doc' have been renamed.

* two bugfixes for the tr() code: (1) when using X-Y character ranges, the
  size estimator would allocate far too little space. This has been fixed.
  (2) the documentation says it well: you cannot include a NUL in a tr()
  character set. In one case (a range at the start of the spec like this:
  '-z') the code would actually attempt to insert such a NUL anyhow, causing
  a subtle bug. Fixed. And a minor code cleanup.

* fixed argument quoting for external app invocation, which is particularly
  important for Windows machines: they treat '-quoting quite differently from
  "-quoting. Fixed by using "-quotes instead of the original '-quotes.

* -enable_js is now turned ON by default - just like the documentation
  already said.

  KNOWN ISSUE: empty lines in JavaScript code and files get stripped by
               pavuk on rewriting; this will be fixed at a later date.

* fix in mime.types file for CVS file extension + added mime types for 
  Microsoft Office 2007

* fixed heap corruption in ainterface.c when calling append_starting_url()
  when url has been specified in the extended '-request' format, including
  a predefined local filename. (Would dump core on some systems.)

* moved the url2diag and info2diag functions from recurse.c to where they should
  have been: url.c -- to resolve a cyclic dependency.

* fixed up the '-request' format url parser/decoder url_parse() call: several
  types of input specification error would be silently rejected (now pavuk
  prints a suitable error message to tell the user what [s]he did wrong and what
  was expected) + a few tugs & tweaks to fix behavior for parsing extended 
  URL specifications (including cookies, predefined local filenames, etc.) and
  an extra '-debug' (level: URL) line to help you diagnose how the '-request's
  have been parsed/decoded.

* now you can use the extended '-request' URL format anywhere on the
  commandline and/or your pavuk configuration files -- as long as you keep
  it within quotes on the commandline of course, e.g.

   pavuk "URL:http://example.com/ LFNAME:example.html"

* fix: config files generated by pavuk now properly select the 'short format'
  (URL:....) instead of the 'long url spec format' (Request:....): previously
  pavuk would lose information about web forms, cookies, local filenames, etc.
  for some types of requested url.

* quickfix for issue reported on the mailing list regarding JavaScript
  interface functions causing the build to fail - which happened when no
  JavaScript library could be found.

  NOTE: on Linux, the JS libraries and headerfiles seem to get installed in
        various places. The current ./configure script looks for the
          jsapi.h
        header file in the directory
          /usr/include/js
        unless you specify the '--with-js-includes=<dir>' option when running
        ./configure.

        The same goes for the js library itself: the current configure script
        looks for either libjs or libmozjs in any of these directories:
          /usr/lib64/thunderbird
          /usr/lib64/firefox
          /usr/lib64
          /usr/lib/thunderbird
          /usr/lib/firefox
          /usr/lib
        unless you specify the ./configure --with-js-libraries=<dir> option
        to point to your specific libjs.a / libmozjs.a

* added an advanced example of use to the pavuk DocBook documentation
  which will end up in the manpage (where it's a bit too much, but then
  at least the users have an extended example of actual use) -- example
  shows how to grab the up-to-date content from a MediaWiki-based web 
  site.

* added S/M/H/D unit support for the time argument decoder function

* Updated the manual regarding:

  - all missing 'hammer mode' options

  - the missing -rtimeout and -wtimeout options

  - checked first few options in options.h and made sure those were all
    documented. (This is a work in progress...)

* All timeouts are now in milliseconds, except the -max_time one, which is
  in minutes.

  All timeout arguments (except -max_time) now recognize the alternative
  units for specifying time: s/m/h/d/S/M/H/D: second, minute, hour, day.

  When no unit has been specified, the unit 'milliseconds' is assumed.
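
  A minimal sketch of such a time argument decoder. parse_timeout_ms() is a
  hypothetical name and not pavuk's decoder; rejecting negative values is an
  assumption.

```c
#include <stdlib.h>

/* Hypothetical decoder, not pavuk's: "<digits>[s|m|h|d]" (either case)
 * to milliseconds; a bare number is taken as milliseconds.
 * Returns 0 on success, -1 on malformed input. No overflow checking. */
int parse_timeout_ms(const char *arg, long long *ms)
{
    char *end;
    long long value = strtoll(arg, &end, 10);
    long long factor = 1;

    if (end == arg || value < 0)
        return -1;

    switch (*end) {
    case 's': case 'S': factor = 1000LL; end++; break;
    case 'm': case 'M': factor = 60 * 1000LL; end++; break;
    case 'h': case 'H': factor = 60 * 60 * 1000LL; end++; break;
    case 'd': case 'D': factor = 24 * 60 * 60 * 1000LL; end++; break;
    default: break;
    }
    if (*end != '\0')
        return -1;              /* unknown unit or trailing garbage */

    *ms = value * factor;
    return 0;
}
```

  Under these assumptions "500" means 500 milliseconds, "2s" means 2000 and
  "3M" means 180000.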

* Fix for bug report #2158794: now all DEBUG_*() functions are called 
  using the proper number of arguments.

  The code has been further enhanced for all printf()-like functions 
  (such as the DEBUG_*() and x*printf() functions) to enable GCC and MSVC
  to check the format specification strings and parameter count and 
  type (GCC).
  
  This led to the discovery of a multitude of errors, which have been 
  fixed (wrong integer sizes, etc.).
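
  For GCC this kind of checking is typically switched on with the 'format'
  function attribute. The sketch below shows the general technique only:
  PRINTF_LIKE and debug_log() are illustrative names, not pavuk's macros,
  and the MSVC side is not shown.

```c
#include <stdarg.h>
#include <stdio.h>

/* GCC and Clang check the format string against the arguments when a
 * function carries this attribute; other compilers see an empty macro.
 * PRINTF_LIKE and debug_log are illustrative names, not pavuk's. */
#if defined(__GNUC__)
#define PRINTF_LIKE(fmt_idx, arg_idx) \
    __attribute__((format(printf, fmt_idx, arg_idx)))
#else
#define PRINTF_LIKE(fmt_idx, arg_idx)
#endif

/* Argument 3 is the format string, the variadic part starts at 4. */
int debug_log(char *dst, size_t dstlen, const char *fmt, ...)
    PRINTF_LIKE(3, 4);

int debug_log(char *dst, size_t dstlen, const char *fmt, ...)
{
    va_list ap;
    int n;

    va_start(ap, fmt);
    n = vsnprintf(dst, dstlen, fmt, ap);
    va_end(ap);
    return n;
}
```

  With the attribute in place, a call such as debug_log(buf, sizeof buf,
  "%s", 42) draws a -Wformat warning at compile time instead of crashing at
  run time.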

* Preliminary code move to allow downloading extremely large entities
  (larger than 2GB) such as DVD ISO images: this has been done by more
  judicious use of the size_t and ssize_t types instead of simply 'int'.
  
  On 64-bit platforms, size_t/ssize_t can handle 64-bit sizes, while
  'int' cannot (as GCC still uses 32-bit ints on most common hardware
  64-bit architectures (Intel, ...)).  Further effort will need to be
  spent to adapt the system (and OpenSSL) calls to enable the complete
  datapath for >2GB entity sizes (at least when compiled on 64-bit).

* Small documentation fix: regex overview of characterset changed in DocBook
  source so it appears as a simple list, instead of just one long paragraph
  full of concatenated items --> improved readability.

* const-ified the source code and fixed a few comment typos and a
  lurking bug in FTP (found thanks to constification): filename
  for directory index urls could be damaged in particular circumstances.

* fixed makefiles for environments without any DocBook tools. Also fixed
  configure script to help detect the absence of mandatory DocBook template
  files. Plus added DocBook produce to the distro as we cannot expect everyone
  to have the DocBook tools; nevertheless, everybody /should/ receive a full
  set of documentation.

* Bugfix in GET_NUMLIST(): now original numlist is properly removed (would only
  be noticeable before when specifying multiple port numbers).

* memleak fix for _free_httphdr(): now also the httphdr struct itself gets 
  free()d.

* Fixed lockups in debug logging code when running in '-x' GUI mode; overhauled the
  'recursive invocation' detection code within, which is mandatory to prevent
  recursive calls to debug/log functions to blow up the stack and dump core while
  running in ultra verbose debug/diag mode (-debug -debug_level all). This is the
  second part of the fix for bug #2184196.

* Bugfix for #2023089: new code is introduced for '-lmax' depth level checks:
  the 'depth' (a.k.a. 'level') will always be taken from the non-inline parent URL
  which has the lowest level.

  This should fix situations where 'inline' URLs have 'inline' *parent* URLs, such
  as style sheets, which are referenced by non-inline URLs (HTML files).

  Seeking out the lowest level non-inline parent should also take care of situations
  where multiple HTML files at different levels themselves, all (directly!) reference the same
  stylesheet/inline URL.

* Attempt at fixing a GUI semaphore lockup, caused by LOCK_CFG_URLSTACK being used
  for different purposes (was a quick hack once to create a 'critical section' there)
  in recurse.c @ 1129. Same hack, but now we use LOCK_GHBN which should cause much less trouble
  there.

* Bit of code cleanup.

* Code review checks to see if URLT_FTPS and URLT_GOPHER are used consistently where
  you'd expect them. As you would URLT_HTTPS, next to URLT_HTTP.

* Code review checks and fixes to prevent spurious damage to url->parent structures:
  now the access to this element is critical-sectioned /everywhere/ using LOCK_URL(u); existed
  in 95% of the places already, now all code has been checked.

* Several fixes for multithreaded GTK GUI use. Most important thing which
  was missing: a call to gtk_threads_init().

* JavaScript: updated HTML tag/attribute tables to recognize all
  onXYZ=... JavaScript event attributes in HTML + added the full
  set of attributes to the url pattern class/object which is
  available in pavuk's own JavaScript extension.
