
Re: Meta: a usenet server just for sci.math

Path: i2pn2.org!i2pn.org!usenet.blueworldhosting.com!diablo1.usenet.blueworldhosting.com!feeder.usenetexpress.com!tr3.iad1.usenetexpress.com!69.80.99.27.MISMATCH!Xl.tags.giganews.com!local-2.nntp.ord.giganews.com!news.giganews.com.POSTED!not-for-mail
NNTP-Posting-Date: Thu, 28 Mar 2024 04:05:43 +0000
Subject: Re: Meta: a usenet server just for sci.math
Newsgroups: sci.math,news.software.nntp
References: <8f7c0783-39dd-4f48-99bf-f1cf53b17dd9@googlegroups.com> <1b50e6d3-2e7c-41eb-9324-e91925024f90o@googlegroups.com> <31663ae2-a6a2-44b8-9aa3-9f0d16d24d79o@googlegroups.com> <6eedc16b-2c82-4aaf-a338-92aba2360ba2n@googlegroups.com> <51605ff6-f18f-48c5-8e83-0397632556aen@googlegroups.com> <b0c4589a-f222-457e-95b3-437c0721c2a2n@googlegroups.com> <5a48e832-3573-4c33-b9cb-d112f01b733bn@googlegroups.com> <8wWdnVqZk54j3Fj4nZ2dnZfqnPGdnZ2d@giganews.com> <MY-cnRuWkPoIhFr4nZ2dnZfqnPSdnZ2d@giganews.com> <NqqdnbEz-KTJTlr4nZ2dnZfqnPudnZ2d@giganews.com> <FqOcnYWdRfEI2lT4nZ2dnZfqn_SdnZ2d@giganews.com> <NVudnVAqkJ0Sk1D4nZ2dnZfqn_idnZ2d@giganews.com> <RuKdnfj4NM2rlkz4nZ2dnZfqn_qdnZ2d@giganews.com> <HfCdnROSvfir-E_4nZ2dnZfqnPWdnZ2d@giganews.com> <FLicnRkOg7SrWU_4nZ2dnZfqnPadnZ2d@giganews.com> <v7ecnUsYY7bW40j4nZ2dnZfqnPudnZ2d@giganews.com> <q7-dnR2O9OsAAH74nZ2dnZfqnPhg4p2d@giganews.com> <Hp-cnUAirtFtx2P4nZ2dnZfqnPednZ2d@giganews.com> <MDKdnRJpQ_Q87Z77nZ2dnZfqn_idnZ2d@giganews.com>
From: ross.a.finlayson@gmail.com (Ross Finlayson)
Date: Wed, 27 Mar 2024 21:05:44 -0700
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Thunderbird/38.6.0
MIME-Version: 1.0
In-Reply-To: <MDKdnRJpQ_Q87Z77nZ2dnZfqn_idnZ2d@giganews.com>
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit
Message-ID: <-bOdnWSSIMUKcZn7nZ2dnZfqnPednZ2d@giganews.com>
Lines: 384
X-Usenet-Provider: http://www.giganews.com
X-Trace: sv3-pAKAd5/5Rre6kRuAb5zLHpn22eNfdiEfLPGMAo+642n0DnvTUWpVz5FHwaOgi2Xa7r6vtm991ivLpHH!3n43tmkD1ujVgTcHbatTi4nyiuy/xApgFln8lyB/jt7Jk+fTaR0EC0FZlgNbHSZbSC9UvEFPJ7GM
X-Complaints-To: abuse@giganews.com
X-DMCA-Notifications: http://www.giganews.com/info/dmca.html
X-Abuse-and-DMCA-Info: Please be sure to forward a copy of ALL headers
X-Abuse-and-DMCA-Info: Otherwise we will be unable to process your complaint properly
X-Postfilter: 1.3.40

On 03/26/2024 06:04 PM, Ross Finlayson wrote:
> arithmetic hash searches
>
> take a hashcode, split it up
>
> invert each arithmetically, find intersection in 64 bits
>
> fill in those
>
> detect misses when the bits don't intersect the search
>
> when all hits, then "refine", next double range,
>
> compose those naturally by union
>
> when definite misses excluded then go find matching partition
>
> arithmetic partition hash
>
> So, the idea is, that, each message ID, has applied a uniform
> hash, then that it fills a range, of so many bits.
>
> Then, its hash is split into smaller chunks the same 1/2/3/4
> of the paths, then those are considered a fixed-point fraction,
> of the bits set of the word width, plus one.
>
> Then, sort of pyramidally, is that in increasing words, or doubling,
> is that a bunch of those together, mark those words,
> uniformly in the range.
>
> For example 0b00001111, would mark 0b00001000, then
> 0b0000000010000000, and so on, for detecting whether
> the hash code's integer value, is in the range 15/16 - 16/16.
>
> The idea is that the ranges this way compose with binary OR,
> then that a given integer, then that the integer, can be
> detected to be out of the range, if its bit is zero, and then
> otherwise that it may or may not be in the range.
>
> 0b00001111 number N1
> 0b00001000 range R1
> 0b00000111 number N2
> 0b00000100 range R2
>
> 0b00001100 union range UR = R1 | R2 | ....
>
>
> missing(N) {
> return ((UR & RN) == 0);
> }
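A minimal C sketch of that marking and miss test, under one reading of the description (the chunk taken as a fixed-point fraction of the range); the 16-bit chunk and 64-bit range word are illustrative choices, not fixed by the scheme:

```c
#include <assert.h>
#include <stdint.h>

/* An R-bit range covers a hash chunk's value space uniformly; each
 * chunk marks the bit of the sub-range it falls in, and a partition
 * ORs all its marks together into a union range UR. */

#define CHUNK_BITS 16                   /* width of one hash chunk */

/* the single range bit of chunk v in an R-bit range, R a power of two */
static uint64_t range_bit(uint16_t v, unsigned R) {
    return (uint64_t)1 << (((uint32_t)v * R) >> CHUNK_BITS);
}

/* a partition's range stamp: the OR (union) of all its chunks' bits */
static uint64_t stamp(const uint16_t *chunks, int n, unsigned R) {
    uint64_t ur = 0;
    for (int i = 0; i < n; i++)
        ur |= range_bit(chunks[i], R);
    return ur;
}

/* definite miss when the lookup chunk's bit doesn't intersect the union */
static int missing(uint64_t ur, uint16_t v, unsigned R) {
    return (ur & range_bit(v, R)) == 0;
}
```

A miss is definite; a hit is only "may or may not be in the range", exactly as above.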
>
>
> This sort of helps where, in a usual hash map, determining
> that an item doesn't exist, is worst case, while the usual
> finding the item that exists is log 2, then that usually its value
> is associated with that, besides.
>
> Then, when there are lots of partitions, and they're about
> uniform, it's expected the message ID to be found in only
> one of the partitions, is that the partitions can be organized
> according to their axes of partitions, composing the ranges
> together, then that search walks down those, until it's either
> a definite miss, or an ambiguous hit, then to search among
> those.
>
> It seems then for each partition (group x date), then those
> can be composed together (group x month, group x year,
> groups x year, all), so that looking to find the group x date
> where a message ID is, results that it's a constant-time
> operation to check each of those, and the data structure
> is not very large, with regards to computing the integers'
> offset in each larger range, either giving up when it's
> an unambiguous miss or fully searching when it's an
> ambiguous hit.
>
> This is where, the binary-tree that searches in log 2 n,
> worst-case, where it's balanced and uniform, though
> it's not to be excluded that a usual hashmap implementation
> is linear in hash collisions, is for excluding partitions,
> in about constant time and space given that it's just a
> function of the number of partitions and the eventual
> size of the pyramidal range, that instead of having a
> binary tree with space n^2, the front of it has size L r
> for L the levels of the partition pyramid and r the size
> of the range stamp.
>
> Then, searching in the partitions, seems it essentially
> results, that there's an ordering of the message IDs,
> so there's the "message IDs" file, either fixed-length-records
> or with an index file with fixed-length-records or otherwise
> for reading out the groups' messages, then another one
> with the message ID's sorted, figuring there's a natural
> enough binary search of those with value identity, or bsearch
> after qsort, as it were.
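That "bsearch after qsort" over the sorted message-ID list might look like this, with hypothetical IDs:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* the partition's message IDs, sorted once, then binary-searched
 * by value identity */
static int cmp_mid(const void *a, const void *b) {
    return strcmp(*(const char *const *)a, *(const char *const *)b);
}

static const char *find_mid(const char **mids, size_t n, const char *key) {
    const char **hit = bsearch(&key, mids, n, sizeof *mids, cmp_mid);
    return hit ? *hit : NULL;
}
```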
>
> So, the idea is that there's a big grid of group X date archives,
> each one of those a zip file, with being sort of contrived the
> zip files, so that each entry is self-contained, and it sort of
> results that concatenating them results another. So
> anyways, the idea then is for each of those, for each of
> their message IDs, to compute its four integers, W_i,
> then allocate a range, and zero it, then saturate each
> bit, in each range for each integer. So, that's like, say,
> for fitting the range into 4K, for each partition, with
> there being 2^8 of those in a megabyte, or that many
> partitions (512), or about a megabyte in space for each
> partition, but really where these are just variables,
> because it's opportunistic, and the ranges can start
> with just 32 or 64 bits figuring that most partitions
> are sparse, also, in this case, though usually it would
> be expected they are half-full.
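A sketch of computing the four integers W_i from a message ID's digest and saturating the partition's range words; the 16-byte digest, 64-bit ranges, and use of the top 6 bits are all illustrative assumptions:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* split a 16-byte digest of the Message ID into the four integers
 * W_i, then saturate one bit per W_i in the partition's four ranges */
static void mark_ranges(const unsigned char digest[16], uint64_t range[4]) {
    for (int i = 0; i < 4; i++) {
        uint32_t w;
        memcpy(&w, digest + 4 * i, 4);          /* W_i, native byte order */
        range[i] |= (uint64_t)1 << (w >> 26);   /* top 6 bits pick 1 of 64 */
    }
}
```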
>
> There are as many of these ranges as the hash is split
> into numbers, is the idea.
>
> Then the idea is that these ranges are pyramidal in the
> sense, that when doing lookup for the ID, is starting
> from the top of the pyramid, projecting the hash number
> into the range bit string, with one bit for each sub-range,
> so it's branchless, and'ing the number bits and the partition
> range together, and if any of the hash splits isn't in the
> range, a branch, dropping the partition pyramid, else,
> descending into the partition pyramid.
>
> (Code without branches can go a lot faster than
> code with lots of branches, if/then.)
>
> At each level of the pyramid, it's figured that only one
> of the partitions will not be excluded, except for hash
> collisions, then if it's a base level to commence bsearch,
> else to drop the other partition pyramids, and continue
> with the reduced set of ranges in RAM, and the projected
> bits of the ID's hash integer.
>
> The ranges don't even really have to be constant if it's
> so that there's a limit so they're under a constant, then
> according to uniformity they only have so many, eg,
> just projecting out their 1's, so the partition pyramid
> digging sort of always finds one or more partitions
> with possible matches, those being hash collisions or
> messages duplicated across groups, and mostly finds
> those with exclusions, so that it results reducing, for
> example that empty groups are dropped right off
> though not being skipped, while full groups then
> get into needing more than constant space and
> constant time to search.
>
> Of course if all the partitions miss then it's
> also a fast exit that none have the ID.
>
> So, this, "partition pyramid hash filter", with basically,
> "constant and configurable space and time", basically
> has that because Message Id's will only exist in one or
> a few partitions, and for a single group and not across
> about all groups, exactly one, and the hash is uniform, so
> that hash collisions are low, and the partitions aren't
> overfilled, so that hash collisions are low, then it sort
> of results all the un-used partitions at rest, don't fill
> up in n^2 space the log 2 n hash-map search. Then,
> they could, if there was spare space, and it made sense
> that in the write-once-read-many world it was somehow
> many instead of never, a usual case, or, just using a
> list of sorted message Id's in the partition and bsearch,
> this can map the file without loading its contents in
> space, except as ephemerally, or the usual disk controller's
> mmap space, or "ready-time" and "ephemeral-space".
>
> In this sort of way there's no resident RAM for the partitions
> except each one with a fixed-size arithmetic hash stamp,
> while lookups have a fixed or constant cost, plus then
> also a much smaller usual log 2 time / n^2 space trade-off,
> while memory-mapping active files automatically caches.
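That "ready-time / ephemeral-space" access can be sketched with a plain POSIX memory map; the sorted-ID filename is hypothetical:

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* map a partition's sorted message-ID file read-only, so searching
 * it costs page-cache traffic rather than resident RAM; returns
 * NULL on failure, setting *len on success */
static const char *map_file(const char *path, size_t *len) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return NULL;
    struct stat st;
    if (fstat(fd, &st) < 0) { close(fd); return NULL; }
    void *base = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                      /* the mapping survives the close */
    if (base == MAP_FAILED) return NULL;
    *len = (size_t)st.st_size;
    return (const char *)base;
}
```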
>
>
> So, the idea is to combine the BFF backing file format
> and LFF library file format ideas, with that the group x date
> partitions make the for archive and active partitions,
> then to have constant-time/constant-space partition
> pyramid arithmetic hash range for lookup, then
> ready-time/ephemeral-space lookup in partitions,
> then that the maintenance of the pyramid tree,
> happens with dropping partitions, while just
> accumulating with adding partitions.
>
> Yeah, I know that a usual idea is just to make a hash map
> after an associative array with log 2 n lookup in n^2 space,
> that maintenance is in adding and removing items,
> here the idea is to have partitions above items,
> and sort of naturally to result "on startup, find
> the current partitions, compose their partition pyramid,
> then run usually constant-time/constant-space in that
> then ready-time/ephemeral-space under that,
> maintenance free", then that as active partitions
> being written roll over to archive partitions being
> finished, then they just get added to the pyramid
> and their ranges or'ed up into the pyramid.
>
> Hmm... 32K or 2^15 groups, 16K or 2^14 days, or
> about 40 years of Usenet in partitions, 2^29,
> about 2^8 per megabyte or about 2^20 or one
> gigabyte RAM, or, just a file, then memory-mapping
> the partition pyramid file, figuring again that
> most partitions are not resident in RAM,
> this seems a sort of good simple idea to
> implement lookup by Message ID over 2^30 many.
>
> I mean if "text Usenet for all time is about a billion messages",
> it seems around that size.
>
>


So, trying to figure out if this "arithmetic hash range
pyramidal partition" data structure is actually sort of
reasonable, gets into that it involves finding a balance
in what's otherwise a very well-understood trade-off,
in terms of the cost of a lookup, over time, and then
especially as whether an algorithm is "scale-able",
that even a slightly lesser algorithm might be better
if it results "scale-able", especially if it breaks down
to a very, very minimal set of resources, in time,
and in various organizations of space, or distance,
which everybody knows as CPU, RAM, and DISK,
in terms of time, those of lookups per second,
and particularly where parallelizable as with
regards to both linear speed-up and also immutable
data structures, or, clustering. ("Scale.")

Then it's probably so that the ranges are pretty small,
because they double, and whether it's best just to
have an overall single range, or, refinements of it,
according to a "factor", a "factor" that represents
how likely it is that hashes don't collide in the range,
or that they do.

This is a different way of looking at hash collisions,
besides that two objects have the same hash,
just that they're in the same partition of the range
their integer value, for fixed-length uniform hashes.

I.e., a hash collision proper would always be a
redundant or order-dependent dig-up, of a sort,
where the idea is that the lookup first results
searching the pyramid plan for possibles, then
digging up each of those and checking for match.

The idea that group x date sort of has that those
are about on the same order is a thing, then about
the idea that "category" and "year" are similarly
about so,

Big8 x year
group x date

it's very contrived to have those be on the same
order, in terms of otherwise partitioning, or about
what it results that "partitions are organized so that
their partitions are tuples and the tuples are about
on the same order, so it goes, thus that uniformity
of hashes, results being equi-distributed in those,
so that it results the factor is good and that arithmetic
hash ranges filter out most of the partitions, and,
especially that there aren't many false-positive dig-up
partitions".

It's sort of contrived, but then it does sort of make
it so that also other search concerns like "only these
groups or only these years anyways", naturally get
dropped out at the partition layer, and, right in the
front of the lookup algorithm.

It's pretty much expected though that there would
be non-zero false-positive dig-ups, where here a dig-up
is that the arithmetic hash range matched, but it's
actually a different Message ID's hash in the range,
and not the lookup value(s).

Right, so just re-capping here a bit, the idea is that
there are groups, and dates, and for each is a zip file,
which is a collection of files in a single file-system entry,
with about random access to each entry of the zip file,
each compressed, and the entries include Messages,
by their Message ID's, then that the entries are
maybe in sub-directories, that reflect components
of the Message ID's hash, where a hash, is a fixed-length
value, like 64 bytes or 128 bytes, or a power of two
and usually an even power of two thus a multiple of four,
thus that a 64 byte hash has (2^8)^64, or 2^512, many possible
values, then that a range, of length R bits, has R many
partitions, in terms of the hash size and the range size,
whether the factor is low enough, that most partitions
will naturally be absent most ranges, because hashes
can only be computed from Message ID's, not by their
partitions or other information like the group or date.

So, if there are 2^30 or a billion messages, then a
32 bit hash, would have a fair expectation that
unused values would be not dense, then for
what gets into "birthday problem" or otherwise
how "Dirichlet principle" makes for how often
are hash collisions, for how often are range collisions,
either making redundant dig-ups, in the way this
sort of algorithm services look-ups.
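The range-collision rate can be estimated directly: after k uniform insertions into an R-bit range, the expected set fraction, which is also the chance a random lookup's bit falsely matches the stamp (a spurious dig-up), is 1 - (1 - 1/R)^k. A small sketch, with illustrative sizes:

```c
#include <assert.h>

/* expected fraction of an R-bit range stamp set after k uniform
 * insertions: 1 - (1 - 1/R)^k, computed iteratively */
static double fill_factor(double R, int k) {
    double miss = 1.0;
    while (k-- > 0)
        miss *= 1.0 - 1.0 / R;
    return 1.0 - miss;
}
```

For 32 IDs in a 64-bit range that comes to about 0.4, near the "half-full" figure above; a 4K-byte (32768-bit) range stays sparse at a thousand IDs.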

The 32 bits is quite a bit less than 64 * 8, though,
about whether it would also result, that, splitting
that into subdirectories, results different organizations
here about "tuned to Usenet-scale and organization",
vis-a-vis, "everybody's email" or something like that.
That said, it shouldn't just fall apart if the size or
count blows up, though it might be expected then
that various sorts of partitioning would keep the partition
tuple orders square, or on the same orders.

The md5 is widely available as "md5sum"; it's 128 bits,
and its output is 32 hexadecimal characters.

https://en.wikipedia.org/wiki/MD5
https://en.wikipedia.org/wiki/Partition_(database)
https://en.wikipedia.org/wiki/Hash_function#Uniformity

Otherwise the only goal of the hash is to be uniform,
and also to have "avalanche criterion", so that near Message-Id's
will still be expected to have different hashes, as it's not
necessarily expected that they're the same group and
date, though that would be a thing, yet Message ID's
should be considered opaque and not seated together.

Then MD5 is about the most usual hash utility laying
around, if not SHA-1, or SHA-256. Hmm..., in the
interests of digital preservation is "the tools for
any algorithms should also be around forever",
one of those things.

So anyways, then each group x date has its Message ID's,
each of those has its hash, each of those fits in a range,
indicating one bit in the range where it is, then those are
OR'd together to result a bit-mask of the range, then
that a lookup can check its hash's bit against the range,
and dig-up the partition if it's in, or, skip the partition
if it's not, with the idea that the range is big enough
and the resulting group x date is small enough, that
the "pyramidal partition", is mostly sparse, at the lower
levels, that it's mostly "look-arounds" until finally the
"dig-ups", in the leaf nodes of the pyramidal partitions.

I.e., the dig-ups will eventually include spurious or
redundant false-positives, that the algorithm will
access the leaf partitions at uniform random.

The "pyramidal" then also get into both the empties,
like rec.calm with zero posts ten years running,
or alt.spew which any given day exceeds zip files
or results a lot of "zip format, but the variously
packaged, not-recompressed binaries", the various
other use cases than mostly at-rest and never-read
archival purposes. The idea of the "arithmetic hash
range pyramidal partition" is that mostly the
leaf partitions are quite small and sparse, and
mostly the leveling of the pyramid into year/month/date
and big8/middle/group, as it were, winnows those
down in what's a constant-rate constant-space scan
on the immutable data structure of the partition pyramid.

Yeah, I know, "numbers", here though the idea is
that about 30K groups at around 18K days = 50 years
makes about 30 * 20 * million or less than a billion
files the zip files, which would all fit on a volume
that supports up to four billion-many files, or an
object-store, then with regards to that most of
those would be quite small or even empty,
then with regards to "building the pyramid",
the levels big8/middle/group X year/month/date,
the data structure of the hashes marking the ranges,
then those themselves resulting a file, which are
basically the entire contents of allocated RAM,
or for that matter a memory-mapped file, with
the idea that everything else is ephemeral RAM.
