Path: i2pn2.org!i2pn.org!news.neodome.net!npeer.as286.net!npeer-ng0.as286.net!peer03.ams1!peer.ams1.xlned.com!news.xlned.com!peer02.iad!feed-me.highwinds-media.com!news.highwinds-media.com!Xl.tags.giganews.com!local-1.nntp.ord.giganews.com!news.giganews.com.POSTED!not-for-mail
NNTP-Posting-Date: Sun, 14 Apr 2024 15:36:01 +0000
Subject: Re: Meta: a usenet server just for sci.math
Newsgroups: sci.math,news.software.nntp
References: <8f7c0783-39dd-4f48-99bf-f1cf53b17dd9@googlegroups.com>
<31663ae2-a6a2-44b8-9aa3-9f0d16d24d79o@googlegroups.com>
<6eedc16b-2c82-4aaf-a338-92aba2360ba2n@googlegroups.com>
<51605ff6-f18f-48c5-8e83-0397632556aen@googlegroups.com>
<b0c4589a-f222-457e-95b3-437c0721c2a2n@googlegroups.com>
<5a48e832-3573-4c33-b9cb-d112f01b733bn@googlegroups.com>
<8wWdnVqZk54j3Fj4nZ2dnZfqnPGdnZ2d@giganews.com>
<MY-cnRuWkPoIhFr4nZ2dnZfqnPSdnZ2d@giganews.com>
<NqqdnbEz-KTJTlr4nZ2dnZfqnPudnZ2d@giganews.com>
<FqOcnYWdRfEI2lT4nZ2dnZfqn_SdnZ2d@giganews.com>
<NVudnVAqkJ0Sk1D4nZ2dnZfqn_idnZ2d@giganews.com>
<RuKdnfj4NM2rlkz4nZ2dnZfqn_qdnZ2d@giganews.com>
<HfCdnROSvfir-E_4nZ2dnZfqnPWdnZ2d@giganews.com>
<FLicnRkOg7SrWU_4nZ2dnZfqnPadnZ2d@giganews.com>
<v7ecnUsYY7bW40j4nZ2dnZfqnPudnZ2d@giganews.com>
<q7-dnR2O9OsAAH74nZ2dnZfqnPhg4p2d@giganews.com>
<Hp-cnUAirtFtx2P4nZ2dnZfqnPednZ2d@giganews.com>
<MDKdnRJpQ_Q87Z77nZ2dnZfqn_idnZ2d@giganews.com>
<-bOdnWSSIMUKcZn7nZ2dnZfqnPednZ2d@giganews.com>
From: ross.a.finlayson@gmail.com (Ross Finlayson)
Date: Sun, 14 Apr 2024 08:36:01 -0700
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101
Thunderbird/38.6.0
MIME-Version: 1.0
In-Reply-To: <-bOdnWSSIMUKcZn7nZ2dnZfqnPednZ2d@giganews.com>
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit
Message-ID: <CoCdnYJuP9p8aob7nZ2dnZfqnPudnZ2d@giganews.com>
Lines: 491
X-Usenet-Provider: http://www.giganews.com
X-Trace: sv3-6mHXvHmnW6U/9eGrqZwQJr3pOgIvIT+avlJef9I5VvhB508QqJCamlHrTnxNC+vhhpGrV7/+pGdVkW3!2MjXGieKpmcQKLKNJkMKXLkORXp1BzzX+txZC9NwIOIltEksaf+Ay1+t17EdbFAyIRaGaW4kE8Q=
X-Complaints-To: abuse@giganews.com
X-DMCA-Notifications: http://www.giganews.com/info/dmca.html
X-Abuse-and-DMCA-Info: Please be sure to forward a copy of ALL headers
X-Abuse-and-DMCA-Info: Otherwise we will be unable to process your complaint properly
X-Postfilter: 1.3.40
X-Received-Bytes: 23467

On 03/27/2024 09:05 PM, Ross Finlayson wrote:
> On 03/26/2024 06:04 PM, Ross Finlayson wrote:
>> arithmetic hash searches
>>
>> take a hashcode, split it up
>>
>> invert each arithmetically, find intersection in 64 bits
>>
>> fill in those
>>
>> detect misses when the bits don't intersect the search
>>
>> when all hits, then "refine", next double range,
>>
>> compose those naturally by union
>>
>> when definite misses excluded then go find matching partition
>>
>> arithmetic partition hash
>>
>> So, the idea is, that, each message ID, has applied a uniform
>> hash, then that it fills a range, of so many bits.
>>
>> Then, its hash is split into smaller chunks the same 1/2/3/4
>> of the paths, then those are considered a fixed-point fraction,
>> of the bits set of the word width, plus one.
>>
>> Then, sort of pyramidally, is that in increasing words, or doubling,
>> is that a bunch of those together, mark those words,
>> uniformly in the range.
>>
>> For example 0b00001111, would mark 0b00001000, then
>> 0b0000000010000000, and so on, for detecting whether
>> the hash code's integer value, is in the range 15/16 - 16/16.
>>
>> The idea is that the ranges this way compose with binary OR,
>> then that a given integer, then that the integer, can be
>> detected to be out of the range, if its bit is zero, and then
>> otherwise that it may or may not be in the range.
>>
>> 0b00001111 number N1
>> 0b00001000 range R1
>> 0b00000111 number N2
>> 0b00000100 range R2
>>
>> 0b00001100 union range UR = R1 | R2 | ....
>>
>>
>> missing(N) {
>>   // definite miss: N's range bit RN doesn't intersect the union range UR
>>   return (UR & RN) == 0;
>> }
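
(A minimal sketch of that stamp and miss test in Java, assuming a
linear fixed-point mapping of each hash chunk to one bit of the
range; chunkBits, rangeBits and the method names are illustrative,
not taken from the above.)

// bit i of the range covers [i/rangeBits, (i+1)/rangeBits) of the chunk's space
static long rangeBit(int chunk, int chunkBits, int rangeBits) {
    // fixed-point fraction of the chunk's value, scaled to the range width
    int bit = (int) (((long) chunk * rangeBits) >>> chunkBits);
    return 1L << bit;
}

// union range UR = R1 | R2 | ... over every chunk seen in the partition
static long unionRange(int[] chunks, int chunkBits, int rangeBits) {
    long ur = 0L;
    for (int c : chunks) ur |= rangeBit(c, chunkBits, rangeBits);
    return ur;
}

// definite miss when the lookup chunk's bit doesn't intersect the union range
static boolean missing(long ur, int chunk, int chunkBits, int rangeBits) {
    return (ur & rangeBit(chunk, chunkBits, rangeBits)) == 0;
}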
>>
>>
>> This sort of helps where, in a usual hash map, determining
>> that an item doesn't exist, is worst case, while the usual
>> finding the item that exists is log 2, then that usually its value
>> is associated with that, besides.
>>
>> Then, when there are lots of partitions, and they're about
>> uniform, it's expected the message ID to be found in only
>> one of the partitions, is that the partitions can be organized
>> according to their axes of partitions, composing the ranges
>> together, then that search walks down those, until it's either
>> a definite miss, or an ambiguous hit, then to search among
>> those.
>>
>> It seems then for each partition (group x date), then those
>> can be composed together (group x month, group x year,
>> groups x year, all), so that looking to find the group x date
>> where a message ID is, results that it's a constant-time
>> operation to check each of those, and the data structure
>> is not very large, with regards to computing the integers'
>> offset in each larger range, either giving up when it's
>> an unambiguous miss or fully searching when it's an
>> ambiguous hit.
>>
>> This is where, the binary-tree that searches in log 2 n,
>> worst-case, where it's balanced and uniform, though
>> it's not to be excluded that a usual hashmap implementation
>> is linear in hash collisions, is for excluding partitions,
>> in about constant time and space given that it's just a
>> function of the number of partitions and the eventual
>> size of the pyramidal range, that instead of having a
>> binary tree with space n^2, the front of it has size L r
>> for L the levels of the partition pyramid and r the size
>> of the range stamp.
>>
>> Then, searching in the partitions, seems it essentially
>> results, that there's an ordering of the message IDs,
>> so there's the "message IDs" file, either fixed-length-records
>> or with an index file with fixed-length-records or otherwise
>> for reading out the groups' messages, then another one
>> with the message ID's sorted, figuring there's a natural
>> enough binary search of those with value identity, or bsearch
>> after qsort, as it were.
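
(That last step really is just bsearch after qsort; a minimal
sketch, assuming the sorted message-ids are simply in memory as an
array, though a fixed-length-record file reads the same way.)

import java.util.Arrays;

// ids must already be sorted in natural order
static int find(String[] sortedIds, String messageId) {
    int at = Arrays.binarySearch(sortedIds, messageId);
    return at >= 0 ? at : -1;  // record index on a hit, -1 on a miss
}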
>>
>> So, the idea is that there's a big grid of group X date archives,
>> each one of those a zip file, with being sort of contrived the
>> zip files, so that each entry is self-contained, and it sort of
>> results that concatenating them results another. So
>> anyways, the idea then is for each of those, for each of
>> their message IDs, to compute its four integers, W_i,
>> then allocate a range, and zero it, then saturate each
>> bit, in each range for each integer. So, that's like, say,
>> for fitting the range into 4K, for each partition, with
>> there being 2^8 of those in a megabyte, or that many
>> partitions (512), or about a megabyte in space for each
>> partition, but really where these are just variables,
>> because it's opportunistic, and the ranges can start
>> with just 32 or 64 bits figuring that most partitions
>> are sparse, also, in this case, though usually it would
>> be expected they are half-full.
>>
>> There are as many of these ranges as the hash is split
>> into numbers, is the idea.
>>
>> Then the idea is that these ranges are pyramidal in the
>> sense, that when doing lookup for the ID, is starting
>> from the top of the pyramid, projecting the hash number
>> into the range bit string, with one bit for each sub-range,
>> so it's branchless, and'ing the number bits and the partition
>> range together, and if any of the hash splits isn't in the
>> range, a branch, dropping the partition pyramid, else,
>> descending into the partition pyramid.
>>
>> (Code without branches can go a lot faster than
>> code with lots of branches, if/then.)
>>
>> At each level of the pyramid, it's figured that only one
>> of the partitions will not be excluded, except for hash
>> collisions, then if it's a base level to commence bsearch,
>> else to drop the other partition pyramids, and continue
>> with the reduced set of ranges in RAM, and the projected
>> bits of the ID's hash integer.
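
(A sketch of that descent in Java, assuming each pyramid node keeps
one or'd range word per hash piece plus its child partitions; the
Node shape here is just illustrative.)

import java.util.ArrayList;
import java.util.List;

final class Node {
    long[] range = new long[4];              // or'd range bits, one word per hash piece
    List<Node> children = new ArrayList<>(); // empty at the base level (leaf partitions)
}

// and the projected lookup bits against this node's range: drop on any zero, else descend
static void descend(Node node, long[] lookupBits, List<Node> hits) {
    for (int i = 0; i < lookupBits.length; i++)
        if ((node.range[i] & lookupBits[i]) == 0) return;  // definite miss, drop this pyramid
    if (node.children.isEmpty()) hits.add(node);           // leaf partition: candidate to dig
    else for (Node child : node.children) descend(child, lookupBits, hits);
}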
>>
>> The ranges don't even really have to be constant if it's
>> so that there's a limit so they're under a constant, then
>> according to uniformity they only have so many, eg,
>> just projecting out their 1's, so the partition pyramid
>> digging sort of always finds one or more partitions
>> with possible matches, those being hash collisions or
>> messages duplicated across groups, and mostly finds
>> those with exclusions, so that it results reducing, for
>> example that empty groups are dropped right off
>> though not being skipped, while full groups then
>> get into needing more than constant space and
>> constant time to search.
>>
>> Of course if all the partitions miss then it's
>> also a fast exit that none have the ID.
>>
>> So, this, "partition pyramid hash filter", with basically,
>> "constant and configurable space and time", basically
>> has that because Message Id's will only exist in one or
>> a few partitions, and for a single group and not across
>> about all groups, exactly one, and the hash is uniform, so
>> that hash collisions are low, and the partitions aren't
>> overfilled, so that hash collisions are low, then it sort
>> of results all the un-used partitions at rest, don't fill
>> up in n^2 space the log 2 n hash-map search. Then,
>> they could, if there was spare space, and it made sense
>> that in the write-once-read-many world it was somehow
>> many instead of never, a usual case, or, just using a
>> list of sorted message Id's in the partition and bsearch,
>> this can map the file without loading its contents in
>> space, except as ephemerally, or the usual disk controller's
>> mmap space, or "ready-time" and "ephemeral-space".
>>
>> In this sort of way there's no resident RAM for the partitions
>> except each one with a fixed-size arithmetic hash stamp,
>> while lookups have a fixed or constant cost, plus then
>> also a much smaller usual log 2 time / n^2 space trade-off,
>> while memory-mapping active files automatically caches.
>>
>>
>> So, the idea is to combine the BFF backing file format
>> and LFF library file format ideas, with that the group x date
>> partitions make the for archive and active partitions,
>> then to have constant-time/constant-space partition
>> pyramid arithmetic hash range for lookup, then
>> ready-time/ephemeral-space lookup in partitions,
>> then that the maintenance of the pyramid tree,
>> happens with dropping partitions, while just
>> accumulating with adding partitions.
>>
>> Yeah, I know that a usual idea is just to make a hash map
>> after an associative array with log 2 n lookup in n^2 space,
>> that maintenance is in adding and removing items,
>> here the idea is to have partitions above items,
>> and sort of naturally to result "on startup, find
>> the current partitions, compose their partition pyramid,
>> then run usually constant-time/constant-space in that
>> then ready-time/ephemeral-space under that,
>> maintenance free", then that as active partitions
>> being written roll over to archive partitions being
>> finished, then they just get added to the pyramid
>> and their ranges or'ed up into the pyramid.
>>
>> Hmm... 32K or 2^15 groups, 16K or 2^14 days, or
>> about 40 years of Usenet in partitions, 2^29,
>> about 2^8 per megabyte or about 2^20 or one
>> gigabyte RAM, or, just a file, then memory-mapping
>> the partition pyramid file, figuring again that
>> most partitions are not resident in RAM,
>> this seems a sort of good simple idea to
>> implement lookup by Message ID over 2^30 many.
>>
>> I mean if "text Usenet for all time is about a billion messages",
>> it seems around that size.
>>
>>
>
>
>
> So, trying to figure out if this "arithmetic hash range
> pyramidal partition" data structure is actually sort of
> reasonable, gets into that it involves finding a balance
> in what's otherwise a very well-understood trade-off,
> in terms of the cost of a lookup, over time, and then
> especially as whether an algorithm is "scale-able",
> that even a slightly lesser algorithm might be better
> if it results "scale-able", especially if it breaks down
> to a very, very minimal set of resources, in time,
> and in various organizations of space, or distance,
> which everybody knows as CPU, RAM, and DISK,
> in terms of time, those of lookups per second,
> and particularly where parallelizable as with
> regards to both linear speed-up and also immutable
> data structures, or, clustering. ("Scale.")
>
>
> Then it's probably so that the ranges are pretty small,
> because they double, and whether it's best just to
> have an overall single range, or, refinements of it,
> according to a "factor", a "factor" that represents
> how likely it is that hashes don't collide in the range,
> or that they do.
>
> This is a different way of looking at hash collisions,
> besides that two objects have the same hash,
> just that they're in the same partition of the range of
> their integer value, for fixed-length uniform hashes.
>
> I.e., a hash collision proper would always be a
> redundant or order-dependent dig-up, of a sort,
> where the idea is that the lookup first results
> searching the pyramid plan for possibles, then
> digging up each of those and checking for match.
>
> The idea that group x date sort of has that those
> are about on the same order is a thing, then about
> the idea that "category" and "year" are similarly
> about so,
>
> Big8 x year
> group x date
>
> it's very contrived to have those be on the same
> order, in terms of otherwise partitioning, or about
> what it results that "partitions are organized so that
> their partitions are tuples and the tuples are about
> on the same order, so it goes, thus that uniformity
> of hashes, results being equi-distributed in those,
> so that it results the factor is good and that arithmetic
> hash ranges filter out most of the partitions, and,
> especially that there aren't many false-positive dig-up
> partitions.
>
> It's sort of contrived, but then it does sort of make
> it so that also other search concerns like "only these
> groups or only these years anyways", naturally get
> dropped out at the partition layer, and, right in the
> front of the lookup algorithm.
>
> It's pretty much expected though that there would
> be non-zero false-positive dig-ups, where here a dig-up
> is that the arithmetic hash range matched, but it's
> actually a different Message ID's hash in the range,
> and not the lookup value(s).
>
> Right, so just re-capping here a bit, the idea is that
> there are groups, and dates, and for each is a zip file,
> which is a collection of files in a file-system entry file
> with about random access on the zip file each entry,
> and compressed, and the entries include Messages,
> by their Message ID's, then that the entries are
> maybe in sub-directories, that reflect components
> of the Message ID's hash, where a hash, is a fixed-length
> value, like 64 bytes or 128 bytes, or a power of two
> and usually an even power of two thus a multiple of four,
> thus that a 64 byte hash has 2^(64*8) many possible
> values, then that a range, of length R bits, has R many
> partitions, in terms of the hash size and the range size,
> whether the factor is low enough, that most partitions
> will naturally be absent most ranges, because hashes
> can only be computed from Message ID's, not by their
> partitions or other information like the group or date.
>
> So, if there are 2^30 or a billion messages, then a
> 32 bit hash, would have a fair expectation that
> unused values would be not dense, then for
> what gets into "birthday problem" or otherwise
> how "Dirichlet principle" makes for how often
> are hash collisions, for how often are range collisions,
> either making redundant dig-ups, in the way this
> sort of algorithm services look-ups.
>
> The 32 bits is quite a bit less than 64 * 8, though,
> about whether it would also result, that, splitting
> that into subdirectories, results different organizations
> here about "tuned to Usenet-scale and organization",
> vis-a-vis, "everybody's email" or something like that.
> That said, it shouldn't just fall apart if the size or
> count blows up, though it might be expected then that
> various sorts of partitioning would keep the partition
> tuple orders square, or on the same orders.
>
>
> The md5 is widely available, "md5sum", it's 128 bits,
> its output is hexadecimal characters, 32-many.
>
> https://en.wikipedia.org/wiki/MD5
> https://en.wikipedia.org/wiki/Partition_(database)
> https://en.wikipedia.org/wiki/Hash_function#Uniformity
>
> Otherwise the only goal of the hash is to be uniform,
> and also to have "avalanche criterion", so that near Message-Id's
> will still be expected to have different hashes, as it's not
> necessarily expected that they're the same group and
> date, though that would be a thing, yet Message ID's
> should be considered opaque and not seated together.
>
> Then MD5 is about the most usual hash utility laying
> around, if not SHA-1, or SHA-256. Hmm..., in the
> interests of digital preservation is "the tools for
> any algorithms should also be around forever",
> one of those things.
>
> So anyways, then each group x date has its Message ID's,
> each of those has its hash, each of those fits in a range,
> indicating one bit in the range where it is, then those are
> OR'd together to result a bit-mask of the range, then
> that a lookup can check its hash's bit against the range,
> and dig-up the partition if it's in, or, skip the partition
> if it's not, with the idea that the range is big enough
> and the resulting group x date is small enough, that
> the "pyramidal partition", is mostly sparse, at the lower
> levels, that it's mostly "look-arounds" until finally the
> "dig-ups", in the leaf nodes of the pyramidal partitions.
>
> I.e., the dig-ups will eventually include spurious or
> redundant false-positives, that the algorithm will
> access the leaf partitions at uniform random.
>
> The "pyramidal" then also get into both the empties,
> like rec.calm with zero posts ten years running,
> or alt.spew which any given day exceeds zip files
> or results a lot of "zip format, but the variously
> packaged, not-recompressed binaries", the various
> other use cases than mostly at-rest and never-read
> archival purposes. The idea of the "arithmetic hash
> range pyramidal partition" is that mostly the
> leaf partitions are quite small and sparse, and
> mostly the leveling of the pyramid into year/month/date
> and big8/middle/group, as it were, winnows those
> down in what's a constant-rate constant-space scan
> on the immutable data structure of the partition pyramid.
>
> Yeah, I know, "numbers", here though the idea is
> that about 30K groups at around 18K days = 50 years
> makes about 30 * 20 * million or less than a billion
> files the zip files, which would all fit on a volume
> that supports up to four billion-many files, or an
> object-store, then with regards to that most of
> those would be quite small or even empty,
> then with regards to "building the pyramid",
> the levels big8/middle/group X year/month/date,
> the data structure of the hashes marking the ranges,
> then those themselves resulting a file, which are
> basically the entire contents of allocated RAM,
> or for that matter a memory-mapped file, with
> the idea that everything else is ephemeral RAM.
>
>
>

Wondering about the pyramidal partition arithmetic range hash
some more, figuring out how to make it so that
the group x date grid of buckets has a reasonably
well-defined run-time, while using a minimal amount
of memory, or a tunable amount trading memory for
performance, for a well-defined resource budget that's
constant and fully re-entrant with regards to parallel lookups.

The idea is to implement the lookup by message-id,
where messages are in buckets or partitions basically
according to group x date,

a.b.c/yyyy/mmdd/0.zip
a.b.c/yyyy/mmdd/0.pyr

with the idea of working up so that the groups,
on the order of 30K or so, and days, on the order
of 15K or so, have that mostly the posts are
pretty sparse over all the groups and dates,
with the idea that absence and presence in
the file-system or object-store serve the usual
sorts of lookups: that search hits would be associated
with a message-id, then to look it up in any group
it was posted to, then across those or concomitantly,
with the idea that cross-posts exist as duplicate data
in each partition.

a/b.c/yyyy/mmdd

yyyy/mmdd/a/b.c

The idea is that yyyy is on the order of 40 or 50,
while mmdd is 365, with the idea of having "0000"
for example as a placeholder for otherwise dateless
posts sort of found in the order, and that 'a' is about
on the order of 30 or 40, all beyond the Big 8, then
that after finding matches in those, which would
be expected to be pretty dense in those, where
the message-id is hashed, then split into four pieces,
each of those a smaller uniform hash, then
its value in the range simply |'s into the range
bits, then diving into the next level of the pyramid,
and those that match, and those that match, and
so on, serially yet parallelizably, until finding the
group's date files to dig, then actually looking
into the file of message-ids.

a/b.c/yyyy/mmdd/0.zip
a/b.c/yyyy/mmdd/0.pyr
a/b.c/yyyy/mmdd/0.ids

a/b.c/yyyy/mmdd.pyr
a/b.c/yyyy.pyr
a/b.c.pyr
a/pyr

yyyy/mmdd/a/b.c.pyr
yyyy/mmdd/a.pyr
yyyy/mmdd.pyr
yyyy.pyr
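
Given that layout, a minimal sketch of the lookup walk in Java,
assuming each .pyr holds the or'd range words for everything
beneath its path, one 64-bit word per hash piece for brevity (the
4KiB-per-piece ranges below read the same way with longer buffers),
and naming the stamps after the mmdd.pyr / yyyy.pyr pattern above:

import java.nio.ByteBuffer;
import java.nio.file.*;
import java.util.*;

// a definite miss anywhere means the whole subtree under this .pyr can be dropped
static boolean admits(Path pyr, long[] lookupBits) throws Exception {
    ByteBuffer buf = ByteBuffer.wrap(Files.readAllBytes(pyr));
    for (long bits : lookupBits)
        if ((buf.getLong() & bits) == 0) return false;  // definite miss
    return true;                                        // ambiguous hit: keep descending
}

// walk a/ -> a/b.c/ -> a/b.c/yyyy/ -> a/b.c/yyyy/mmdd/, checking each level's stamp,
// collecting the group x date directories whose 0.ids are worth an actual dig
static List<Path> candidates(Path root, long[] lookupBits) throws Exception {
    List<Path> out = new ArrayList<>();
    Deque<Path> work = new ArrayDeque<>();
    work.push(root);
    while (!work.isEmpty()) {
        Path dir = work.pop();
        if (Files.exists(dir.resolve("0.ids"))) { out.add(dir); continue; }  // leaf partition
        try (DirectoryStream<Path> kids = Files.newDirectoryStream(dir)) {
            for (Path kid : kids) {
                if (!Files.isDirectory(kid)) continue;
                Path pyr = Paths.get(kid.toString() + ".pyr");  // sibling stamp
                if (!Files.exists(pyr) || admits(pyr, lookupBits)) work.push(kid);
            }
        }
    }
    return out;
}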

One can see here that "building the pyramid" is
pretty simple, it's a depth-first sort of traversal
to just "or" together the lower level's .pyr files,
then usually for the active or recent besides the
archival or older, those just being checked
since usually lookups are for recent. The maintenance
or re-building of the pyramid has a basic invalidation
routine, where lastModifiedTime is reliable, or
for example a signature or even just a checksum,
and anyways rebuilding the data structure's
file backing is just a filesystem operation of a usual sort.
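
A sketch of that rebuild in Java, again with one 64-bit word per
hash piece for brevity, using lastModifiedTime as the invalidation
check; the shape of the .pyr files is an assumption carried over
from the lookup sketch:

import java.nio.ByteBuffer;
import java.nio.file.*;

// or a child stamp into the running parent stamp (or'ing is idempotent,
// so seeing the same stamp twice along the way is harmless)
static void orInto(long[] parent, Path childPyr) throws Exception {
    ByteBuffer buf = ByteBuffer.wrap(Files.readAllBytes(childPyr));
    for (int i = 0; i < parent.length; i++) parent[i] |= buf.getLong();
}

// depth-first: rebuild dir's stamp at out from its children's stamps,
// re-descending only where a child looks newer than its existing stamp
static void rebuild(Path dir, Path out) throws Exception {
    long[] stamp = new long[4];
    try (DirectoryStream<Path> kids = Files.newDirectoryStream(dir)) {
        for (Path kid : kids) {
            if (Files.isDirectory(kid)) {
                Path kidPyr = Paths.get(kid.toString() + ".pyr");
                if (!Files.exists(kidPyr) || Files.getLastModifiedTime(kid)
                        .compareTo(Files.getLastModifiedTime(kidPyr)) > 0)
                    rebuild(kid, kidPyr);  // stale or missing: redo the child first
                orInto(stamp, kidPyr);
            } else if (kid.getFileName().toString().endsWith(".pyr")) {
                orInto(stamp, kid);        // leaf-level stamps, e.g. mmdd/0.pyr
            }
        }
    }
    ByteBuffer buf = ByteBuffer.allocate(stamp.length * 8);
    for (long w : stamp) buf.putLong(w);
    Files.write(out, buf.array());
}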

Then, with like a 16KiB or so range, that's basically
about 4KiB for each of the 4 hashes, so any hash-miss
results a drop, then that's about 16 kibibits,
about as above a usual or default hash for
the message-ids, where it's also designed that
/h1/h2/h3/h4/message-id results a file-system
depth that keeps the directory size within usual
limits of filesystems and archival package files,
of all the files, apiece.
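
(As for h1/h2/h3/h4, one plausible, purely assumed naming is a
leading hex character from each of the four pieces of the md5,
which keeps any one directory's fan-out at 16; the message-id
itself would presumably be sanitized for the filesystem.)

import java.security.MessageDigest;

// e.g. "d41d8cd9..." -> "d/9/e/e/<message-id>", one hex character
// from each 32-bit piece of the md5
static String entryPath(String messageId) throws Exception {
    byte[] md5 = MessageDigest.getInstance("MD5").digest(messageId.getBytes("UTF-8"));
    StringBuilder hex = new StringBuilder();
    for (byte b : md5) hex.append(String.format("%02x", b));
    return hex.charAt(0) + "/" + hex.charAt(8) + "/" + hex.charAt(16) + "/"
            + hex.charAt(24) + "/" + messageId;
}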

Then, a megabyte of RAM or so, 2^20, then with
regards to 2^10 2^4, is about 2^6 = 64 of those
per megabyte.

30K groups x 15K days ~ 450M group days, hmm, ...,
not planning on fitting that into RAM.

2 groups x 18262 days, ~36K partitions, that should fit,
or, 32768 = 2^15, say, divided by 2^6 per megabyte is
about 2^9 or 512 megabytes RAM, hmm..., figuring lookups
scanning that at about 1 GHz over 512 MiB take
about half a second, ....

The idea is that messages by group-number are just
according to the partitions adding up counts,
then lookup by message-id is that whatever
search results return a message-id for hits,
then there's some reasonable lookup by message-id.
