Hashing in DC++
wolfmother
post Mar 28 2010, 08:12 PM
Post #21
Blame him.

Posts: 171
Joined: 14-April 07
Wolfmother's big DC++ hashing writeup
Background - What is hashing?
Hashing is a way of uniquely identifying the contents of any file. After processing the entirety of the file's contents, DC++ will spit out a small piece of text that can be used as a "fingerprint" for that file. This allows a lot of neat things, like verifying the integrity of a file or finding other computers on the network that also have that file. However, it is not without its downsides.
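
To give a rough idea of what that fingerprint is, here's a small Python sketch. This is not DC++'s actual code; it uses SHA-1 from Python's standard library as a stand-in, since the Tiger algorithm DC++ uses isn't included there:

import hashlib

def file_fingerprint(path, chunk_size=1024 * 1024):
    # Hash the entire file in chunks so huge files don't have to fit in RAM.
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Two files with identical contents produce identical fingerprints,
# no matter what they're named or where they're stored.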

The primary problem is that it takes a long time to process large numbers of files, often up to days for large shares. The other is that since hashing is very important on the internet, the developers of DC++ intentionally removed compatibility with non-hashing clients in newer versions. This becomes a problem, as someone often comes to a LAN without having hashed, or without knowing that they should, and cannot share their files with people using the newer versions until they have spent a few days hashing (by which time the LAN is often over).

About TTH
DC++ uses a method of hashing called TTH - Tiger Tree Hashing. Tiger is the name of the hashing algorithm used; it's optimized for high performance on desktop systems, and while it doesn't have as large a margin of security as many other popular algorithms, it has not been broken and there is no reason to believe it'll be broken in the near future. Instead of hashing the whole file in one go, DC++ hashes small (64 KiB) pieces of the file and then hashes the hashes together until it comes up with a final result. This takes about 10% longer than just hashing the file outright, but gives the advantage that it's possible to identify an incomplete part of a file.
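
Roughly, the "tree" part works like this. This is a simplified Python sketch of the idea, not DC++'s actual TTH implementation (the real thing uses Tiger and some extra framing, so the digests here won't match what DC++ produces), but it shows how piece hashes get combined into one final hash:

import hashlib

LEAF_SIZE = 64 * 1024  # the 64 KiB pieces mentioned above

def _h(data):
    # Stand-in hash; DC++ uses Tiger, which isn't in Python's standard library.
    return hashlib.sha1(data).digest()

def tree_hash(path):
    # Step 1: hash every 64 KiB piece of the file on its own.
    leaves = []
    with open(path, "rb") as f:
        while True:
            piece = f.read(LEAF_SIZE)
            if not piece:
                break
            leaves.append(_h(piece))
    if not leaves:
        leaves = [_h(b"")]  # an empty file still gets a hash

    # Step 2: hash pairs of hashes together, level by level, until one root remains.
    level = leaves
    while len(level) > 1:
        next_level = []
        for i in range(0, len(level), 2):
            if i + 1 < len(level):
                next_level.append(_h(level[i] + level[i + 1]))
            else:
                next_level.append(level[i])  # an odd hash is carried up unchanged
        level = next_level
    return level[0].hex()

Because every 64 KiB piece has its own hash in that tree, a client can check an individual piece of a half-finished download without having to re-read the whole file.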

However, keeping track of all of those hashes (there can be tens of thousands for a high-definition video file) uses a substantial amount of memory, and every hash lookup request means searching through them. These hashes are saved as HashData.dat, which will generally grow at a rate of about 50-60 MB per terabyte shared. It must be loaded into memory when DC++ is launched and stay there, and a lot of important operations like searching, downloading or uploading require it to be accessed frequently. This can hamper performance somewhat on systems with slower CPUs or small amounts of RAM.

How long will it take?
The amount of time it takes to hash files depends on the amount of data to be shared and the hardware your system is running on. The read speed (from the hard disk) depends on the speed of your hard disk(s), hard disk controller(s), and motherboard. The hashing speed depends on the speed of your CPU and RAM. A rule of thumb I use for my server is about 6 hours per terabyte, but it can vary wildly depending on your setup.
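
If you want a back-of-the-envelope estimate for your own share, here's a tiny Python helper; the default throughput figure is just my 6-hours-per-terabyte rule of thumb expressed as MB/s, not an official number:

def estimated_hash_hours(share_tb, effective_mb_per_s=46):
    # 6 hours per terabyte works out to roughly 46 MB/s of sustained
    # read-plus-hash throughput; swap in your own disk/CPU figure.
    seconds = (share_tb * 1e12) / (effective_mb_per_s * 1e6)
    return seconds / 3600

# Example: a 4 TB share at that rate is about 24 hours of hashing.
# print(round(estimated_hash_hours(4)))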

DC++ has been designed to hash on a single core at a time, so it is not uncommon to see one core being used 100% while hashing in the Task Manager. This is normal and you can't really improve performance much more if this is what you're seeing.

Why it's good on the internet
Hashing has a lot of advantages on the internet:
  • Fake files are much less of a problem.
  • You can download from multiple users at once without the risk of one of their copies being corrupt (even if they have different filenames).
  • You can resume incomplete downloads, particularly from different users.


Why LANs are very different to the internet
  • Fake/dodgy files are extremely uncommon at a LAN (compared to the total share size) and you can generally download a different copy in seconds anyway.
  • Bandwidth is much higher, so there is nowhere near as much unique data on the network. There are probably only a handful of unique copies of a given file but a few dozen mirrors of each with their filenames intact, so you can just Match Queue and it'll probably work better than a hash request.
  • At any given time, a substantial number of people will be hitting the network's transfer limit*. Unless the person you're downloading from has a crapton of slots, you'll be hitting that bottleneck long before they max out their link.
  • Total sharesizes and data throughput are much, much higher, meaning that those tiny little CPU and memory hits that hashing causes suddenly become much, much bigger.
  • Bandwidth is so high that verifying and resuming incomplete files is usually a waste of time; it's about as quick to just download the file again from scratch.
  • And of course, there's the processing period at the start.


That said, those advantages are not completely moot at a LAN; they're simply no longer worth the downsides. DC's approach to hashing is quite ham-fisted and not suited to the environment at all.

*This is part of how the network is laid out; each table only gets a few gigabits to the core.
Version compatibilities
There are two types of file list: hashed and non-hashed (they're actually different formats). However, actually choosing a version of DC++ is not that simple: there are some versions that can read one but not the other, and there is no version which will generate both. If someone wants to download a hashed file list from you and your version only generates a non-hashed file list, they won't be able to see your files and thus can't download from you. However, it's possible to be running a version which will read their hashed file list even if it doesn't generate one, in which case you can download from them.

DC++ is open source, so there are a lot of variations of it, and variations of those variations, available. Without getting into the complexities of ApexDC, StrongDC, IceDC and so on, what really matters is the version of DC++ they're based on. Here's a list of the major versions of DC++ and what each can read and generate:

  • Before .306 - generates only non-hashed, reads only non-hashed.
  • .306 - generates only non-hashed, reads only non-hashed.
  • .401 - generates only non-hashed, reads both hashed and non-hashed. Has a bug involving large file sizes.
  • .673 - generates only hashed, reads both hashed and non-hashed.
  • .674 - generates only hashed, reads both hashed and non-hashed. Has a CPU usage bug.
  • After .674 - generates only hashed, reads only hashed.


Also, there's a version called LANDC++ which I believe is based on a fairly recent version of IceDC++. It generates a file list in the same format as a hashed file list, but without the hashes. Since no version of DC++ will read such a list, it's not compatible with any of the above versions. Avoid.


Also
No, hash requests do not put any extra load on the network. Also you should have a minimum of one upload slot per hard disk plus an extra; any less and you'll make baby jesus cry.


--------------------
http://www.prolapsoft.com - Wolfmother's AV & Net Software
R4N
post Mar 28 2010, 08:50 PM
Post #22
Hueg.

Posts: 634
Joined: 4-June 07
From: In a packing crate till next respawn.
I think that's enough of an explanation to keep everyone informed :P

case closed


--------------------

Voodoo: i5 3570k, Asrock Z77, 16GB DDR3, Ati 280x 3GB, 120GB 520 SSD + 2x 2TB, Lian-Li PC A04, Win8.1.
Banshee: N2820 Intel NUC, 4GB DDR3l, 60GB Kingston V300 SSD, 1080P Projector :3
Rush: HP Microserver, 8GB DDR3, 4x 2TB
