*==========================================================*
|            -ChangeLog file for HarvestMan-               |
|                                                          |
| URL: http://harvestman.freezope.org/files/Changelog.txt  |
*==========================================================*

Version: 1.4 final (Bug fixes + Minor features)
Release Date: Dec 17 2004

Changes from version 1.3.9-1
============================

Features
========

1. Added an asynchronous url server which listens
on port 3081 (by default). The url server can be
optionally enabled to gather and send urls instead
of using only a Queue. This can be faster, since the
url server combines the asyncore module of Python with
queues, which is faster than using queues alone.

To enable this feature, set the config variable
network.urlserver to 1.

2. Modified caching algorithm to store the data
of the downloaded files in the cache file. Hence
if someone accidentally deletes the downloaded files,
HarvestMan can recreate the files from the cache file,
without actually downloading them, if they are up to date.

3. Queue architecture modified. The data queue has
been replaced with a links queue. Instead of pushing
web page data into a queue, fetchers process it and
push the new urls to a queue. Crawlers get the urls,
walk through them and post the newly created url
objects into the url queue or send them to the url
server. This saves memory on the queues.

4. Added an option for controlling file download
based on maximum file size. The maximum size by default
for a single file is 1 MB.

5. Added an option for dumping a url tree which shows
parent-child dependencies of the urls generated. This
can be either a text file or an html file. 

6. Added an advertisement/banner filter to the rules
module. If enabled, this skips urls related to ad
banners or graphics.

7. New controller thread to manage file and time limits
on downloads.
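
The cache behaviour described in feature 2 can be
sketched roughly as follows. This is only an
illustration, not HarvestMan's actual code: the
function names (save_cache, restore_missing), the
cache layout (a pickled dictionary mapping a filename
to its timestamp and data) and the use of a
last-modified value as the "up to date" check are all
assumptions made for this example. It is written for
Python 3.

```python
import os
import pickle


def save_cache(cache_path, entries):
    """Write the cache file.

    'entries' maps a local filename to a dictionary with the keys
    'last_modified' and 'data' (hypothetical layout).
    """
    with open(cache_path, 'wb') as f:
        pickle.dump(entries, f)


def restore_missing(cache_path, server_timestamps):
    """Recreate deleted files from the cache when they are up to date.

    'server_timestamps' maps a filename to the timestamp currently
    reported by the server (assumed to be known by the caller).
    Returns the list of files that were recreated.
    """
    with open(cache_path, 'rb') as f:
        entries = pickle.load(f)

    restored = []
    for filename, entry in entries.items():
        if os.path.exists(filename):
            continue  # file is still on disk, nothing to do
        if server_timestamps.get(filename) != entry['last_modified']:
            continue  # cached copy is stale, a real download is needed
        with open(filename, 'wb') as out:
            out.write(entry['data'])
        restored.append(filename)
    return restored
```

Storing the file data in the cache makes the cache
file larger, but it lets a deleted file be rebuilt
without any network access as long as the cached
copy is still current.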

Fixes
=====
1. This release fixes a huge bug in HarvestMan, i.e.
that of hanging threads. The threading architecture
is modified to introduce local buffers. Threads
do a non-blocking push on the queue as opposed to
a blocking push in all previous versions. If they
cannot push the data (Queue full) after 5 attempts,
they store the data in a local buffer. In the next
loop of the threads, they try to push the buffer data
before creating any new objects to push (by crawling
pages/parsing html files). This ensures that the
threads don't block continuously on the queue, which
led to deadlocks and time outs.

2. Increased the idling time of threads to reduce CPU
   load.

3. Fixed a bug with correctly identifying WWW urls.
4. Fixed a bug that incorrectly modifies urls
   with spaces between words.
5. Fixed many bugs with get_relative_filename method.
6. Fixed bugs with generating urls. Trailing spaces
   and/or newlines need to be removed from path
   components.
7. Added a method to correctly identify the type of
   a url based on its mimetype.
8. Fixed bugs in robot protocol checking method.
   Many optimizations are also added to quickly
   process urls. A robot object cache (dictionary)
   and a url object whitelist have been added to
   reduce processing time. Also html files need
   to be processed.
9. Fixed bugs in url filter checking method.
10. Fixed bugs in the order of checking rules
    in violates_basic_rules method.
11. Fixed bug in creating regular expression for
    filtering based on file extension.
12. Many bug fixes in localise_file_links method.
13. Fixed a bug in correctly generating the
    regular expression for old url.
14. Fixed a bug in localising file names. All
    web page files are correctly localised now.
15. Fixed a bug in updating files from project
    cache.
16. Bugfixes in urltracker module.
17. Fixed a bug where the program sometimes exits
    just after downloading the first url.
18. Fixed bug with parsing <base href="..."> 
    link.
19. Fixed an error in handling an empty url.
    The correct error message is printed now.
20. Fixed bugs with logging errors.
    The error log stream is disabled and 
    the configuration option removed.
21. Fix to allow special characters in project base
    directory (such as ~ for home directory on
    Unix systems).
22. Fixed bug in function that opens robots.txt
    urls.
23. Removed some useless arguments from some
    functions.
24. Fixed bug with url object in connect(...) 
    function.
25. Fixes to make slow mode work.
26. Modified to use methods of cPickle module instead
    of pickle module in utils.py (cPickle is faster).
27. Use our own strptime module since this function
    is not available on all Python versions on Windows.
28. Fixes in locale setting on Windows platform.
29. Log file for each project is now generated in the
    project directory as '<projectname>.log'. This is
    not a configurable option anymore.
30. The verification of downloaded files by checksumming
    is disabled. This is not a configurable option
    anymore.
31. The renaming algorithm is disabled since it is not
    general purpose.
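
The buffered, non-blocking push described in fix 1
can be sketched as follows. This is a minimal
illustration under stated assumptions, not the code
used in HarvestMan: the class name BufferedPusher and
its methods are hypothetical, the attempts are made
back to back without the pause a real thread would
probably insert, and the Python 3 'queue' module is
used in place of the Python 2 'Queue' module.

```python
import queue
from collections import deque

MAX_ATTEMPTS = 5  # number of tries before falling back to the buffer


class BufferedPusher:
    """Pushes items to a bounded queue without ever blocking on it."""

    def __init__(self, q):
        self.q = q
        self.buffer = deque()  # local buffer for items that did not fit

    def _try_push(self, item):
        # Non-blocking push, retried a fixed number of times.
        for _ in range(MAX_ATTEMPTS):
            try:
                self.q.put_nowait(item)
                return True
            except queue.Full:
                continue
        return False

    def flush(self):
        # Push buffered items first, oldest first.
        # Returns True if the buffer is empty afterwards.
        while self.buffer:
            if not self._try_push(self.buffer[0]):
                return False
            self.buffer.popleft()
        return True

    def push(self, item):
        # Buffered data goes out before any new item, so ordering
        # is preserved. If anything fails, keep the item locally.
        if not self.flush() or not self._try_push(item):
            self.buffer.append(item)
```

A worker thread would call flush() at the start of
each loop and only crawl or parse new pages once it
returns True, so a full queue delays the thread
instead of hanging it.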

Other Changes
=============

1. License of program changed to GNU GPL.
2. The genconfig.py script is more interactive now,
   displaying the options selected.
3. Language encoding specified at the top of all Python
   files.
4. A script to check Python dependencies, namely
   'check_dep.py', has been added.
5. Installation made easier on Linux and Unix-like systems.
   A script named 'install' does the job for you.
6. The 'genutils' directory is renamed to 'tools'.

