Pavuk

Last Update: January 27 2006

 
 

FAQ:

These are Frequently Asked Questions about using pavuk. There are only a few entries at the moment. If you run into a problem and find a solution, we would greatly appreciate it if you wrote a FAQ entry and sent it to the developers. This helps us reduce the load of responding to help requests.


 
When I'm downloading documents with special characters like ?&* in their names, the stored document tree is not browsable. I want to convert these characters to some others.
This is possible with the -tr_chr_chr option. For example, use
-tr_chr_chr '?&*' _
and every ?, & and * character becomes a _ character. To make this the default behavior, add this line to your ~/.pavukrc file:
TrChrToChr: "?&*" "_"
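To preview what such a mapping does to a name, the same substitution can be approximated with the standard tr utility. This is only an illustration of the effect, not how pavuk implements -tr_chr_chr; the function name and the sample name are hypothetical.

```shell
# Hypothetical helper: replace ?, & and * with _ in the way that
# -tr_chr_chr '?&*' _ is described above (illustration only).
sanitize_name() {
    printf '%s\n' "$1" | tr '?&*' '___'
}

sanitize_name 'index.php?id=1&mode=*'
```

Running it on a few real URLs from the site is a quick way to check that the chosen replacement character does not make two different names collide.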
 
 
Some sites have dynamic session numbers like PHPSESSION. How do I download them without reloading the whole site on each call?
The argument -fnrules F '*' '(rmpar %o "PHPSESSION")' rewrites the stored filename and removes the dynamic parameter PHPSESSION from the name. If the website uses a different parameter, adapt the command accordingly. It is also possible to remove multiple parameters from a dynamic website. In this case, nest the rmpar calls like this: -fnrules F '*' '(rmpar (rmpar (rmpar %o "id") "mode") "PHPSESSION")'.
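Nesting rmpar calls by hand gets error-prone with more than two or three parameters. The sketch below builds the nested expression from a list of parameter names; build_rmpar is a hypothetical helper name, and the script only prints the rule text rather than running pavuk.

```shell
# Hypothetical helper: wrap %o in one (rmpar ... "name") call per argument.
# The function body runs in a subshell so its variables stay local.
build_rmpar() (
    expr='%o'
    for param in "$@"; do
        expr="(rmpar $expr \"$param\")"
    done
    printf '%s\n' "$expr"
)

build_rmpar id mode PHPSESSION
```

The printed expression can then be passed as the last argument of the rule, for example -fnrules F '*' "$(build_rmpar id mode PHPSESSION)".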
 
 
I'm using a firewall for Internet access. Can pavuk go through it?
Yes. You can use proxies for HTTP, HTTPS, FTP and Gopher.
For an HTTP proxy, use -http_proxy host:port.
For an HTTPS proxy, use -ssl_proxy host:port. Pavuk requires an HTTP proxy with the CONNECT request enabled.
For a Gopher proxy, use -gopher_proxy host:port. Pavuk can optionally use an HTTP gateway for accessing Gopher servers (option -gopher_httpgw), or an HTTP proxy with the CONNECT request enabled.
For an FTP proxy, use -ftp_proxy host:port. Pavuk can use three different methods for going through the firewall: an HTTP gateway for FTP (option -ftp_httpgw), a native FTP proxy, or an HTTP proxy with the CONNECT request enabled (option -ftp_dirtyproxy).
If your firewall supports SOCKS 4 or SOCKS 5 proxies, you can compile pavuk with support for them. You only need the development libraries for these protocols at compile time.
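As a rough sketch of how these options combine, the script below prints (but does not run) a command line that sends HTTP, HTTPS and FTP traffic through one HTTP proxy, using the HTTP gateway method for FTP. The proxy address, the URL and the function name are hypothetical, and it assumes your proxy really does serve all three protocols.

```shell
# Hypothetical proxy address; replace it with your own host:port.
PROXY=proxy.example.com:8080

# Print a pavuk command line that routes HTTP, HTTPS and FTP requests
# through the same HTTP proxy (FTP via the HTTP gateway method).
proxy_command() {
    printf 'pavuk -http_proxy %s -ssl_proxy %s -ftp_proxy %s -ftp_httpgw %s\n' \
        "$PROXY" "$PROXY" "$PROXY" "$1"
}

proxy_command http://www.example.com/
```

Printing the command first makes it easy to review the options before running it against the real firewall.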
 
 
I'm using FWTK as a firewall, but I can't download any files through its FTP proxy.
The FTP proxy included in FWTK doesn't support passive data transfers. Use the -ftp_active option to switch FTP data connections to active mode.
 
 
I have different scenarios which I want to execute automatically. Is it possible to serialize scenario execution with pavuk?
You can use the shell or any scripting language to write a short script for this. Here is an example for sh or bash:
for scn in *.scn; do pavuk -scndir . -scenario "$scn"; done
 
 
There are files beginning with .in_ in the directories. What are they for?
These are temporary files used while a file is being downloaded. When the transfer of a file fails, they contain the part transferred so far, which is used for the next reget (if possible). They are also used for locking documents.
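To see which downloads were left unfinished, you can list these files with find. This is a minimal sketch that assumes the .in_ name prefix mentioned in the question; list_partial is a hypothetical helper name. It only lists files and deletes nothing, because removing them would discard the data needed for a reget and could interfere with locking while pavuk is running.

```shell
# Hypothetical helper: list leftover .in_ files below a directory
# (defaults to the current directory). Nothing is deleted.
list_partial() {
    find "${1:-.}" -type f -name '.in_*'
}

list_partial .
```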
 
 
I want to always start pavuk with the GUI. Is there any way to set this in the ~/.pavukrc file?
No, this cannot be set in ~/.pavukrc, but you can use the aliasing mechanism of your shell. For example:
csh: alias xpavuk 'pavuk -X'
bash: alias xpavuk='pavuk -X'
 
 
Is there any way to close or restart the X server without breaking pavuk when I'm running pavuk with the GUI?
Yes, it is possible, with a lot of limitations. First, pavuk must be running as a background job (start it with pavuk -X & or pavuk -X -bg, or stop it with CTRL-Z from the shell and then put it into the background with the bg shell command). Then you can use the "Go Bg" button, which removes all pavuk windows from the screen as soon as it is safe (the transfer of the current document must finish) and then closes the connection to the X server.
 
 
Does pavuk preserve symbolic links on FTP servers?
Yes, but you have to use the -ftplist option to enable this feature.
 
 
How can I download a complex site to a single directory without subdirectories?
Use either of the following sets of options:
a) -store_info -fnrules F '*' '/directory/%n'
b) -store_info -base_level 1000 -cdir /home/my/directory
The -store_info option is optional with version 0.9pl20 and higher, but it is required if you want to synchronise the mirror in the future (see the manual for a description).
 
 
How do I force pavuk not to build the whole directory hierarchy for the local document tree?
There are two different ways to do this:
1) You can use the -base_level option to cut some levels from the hierarchy. For example, if you are downloading http://www.site.tld/manual/automake/automake_toc.html and you want to store it only in the automake directory, use -base_level 3.
2) You can also use the -fnrules option to do this job. For example, you can put all downloaded files into a single directory by using -fnrules 'F' '*' '/directory/%n'.
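The effect of cutting levels can be pictured by dropping leading components from the path under which the document would otherwise be stored. The sketch below does this with cut. It assumes the unmodified local path has the form protocol/host/path; the exact layout pavuk uses may differ (for example, the host directory may include a port), and cut_levels is a hypothetical helper, not part of pavuk.

```shell
# Hypothetical helper: drop the first N components of a relative path,
# mimicking what cutting N levels from the hierarchy means.
cut_levels() {
    printf '%s\n' "$2" | cut -d/ -f"$(( $1 + 1 ))"-
}

cut_levels 3 http/www.site.tld/manual/automake/automake_toc.html
```

Under that assumption, cutting three levels leaves automake/automake_toc.html, which matches the example in the first method.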
 
 
In sync mode I'm using the -remove_old option, but pavuk doesn't remove documents which have disappeared from the remote server.
This is not a bug. Pavuk needs to know which directory contains your mirror in order to find the files which belong to it. So you have to use the -subdir option together with -remove_old to specify that directory.
For example, if you are mirroring http://www.pavuk.org/ to the directory /home/my/mirror, use the command
pavuk -mode sync http://www.pavuk.org/ -dont_leave_dir -remove_old -cdir /home/my/mirror/ -subdir /home/my/mirror/http/www.pavuk.org/
and the removal of old documents will work.
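When scripting several mirrors, the -subdir value can be derived from the -cdir directory and the URL. This sketch assumes the local tree follows the cdir/protocol/host/ layout shown in the example; options such as -fnrules or -base_level change that layout, so check the actual directory on disk. mirror_subdir is a hypothetical helper name.

```shell
# Hypothetical helper: derive the -subdir path from a -cdir directory and a URL,
# assuming the local tree is laid out as <cdir>/<protocol>/<host>/.
# The function body runs in a subshell so its variables stay local.
mirror_subdir() (
    cdir=${1%/}            # strip one trailing slash, if present
    url=$2
    scheme=${url%%://*}    # text before "://"
    rest=${url#*://}       # text after "://"
    host=${rest%%/*}       # host part, up to the first "/"
    printf '%s/%s/%s/\n' "$cdir" "$scheme" "$host"
)

mirror_subdir /home/my/mirror/ http://www.pavuk.org/
```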
 
 
Pavuk tells me stat: no such file or directory but all the files seem to be in the local document tree, just where they belong. What's going on?
This happens when you're deleting temporary files with an external program or script via the -post_cmd switch and then try to rewrite links that are embedded in new incoming documents. By issuing the above mentioned error message, pavuk tells you that something's wrong (i.e., the file that is being referenced in an incoming document is no longer in the local document tree) but it's not crucial as pavuk rewrites the link to the remote destination nevertheless.
 
 
I am trying to mirror a website, but locally cached files are not being removed even though the -remove_old option is set.
Make sure you are using the latest version of pavuk. If you are using mirror mode, be sure that the -remove_old, -cdir, and -subdir options are set properly. If you are using -fnrules to alter the directory mapping, you must also set the -store_info option.
 

If you have questions which are not answered here or in the other documents, ask at the pavuk mailing list.