Bashreduce - parallel sorting and other tricks used in cloud computing

Post by s243a »

I was looking for a tool that would let me try map-reduce type algorithms without having to install Hadoop. The reason is that I don't want to depend on Java. On a small system Java is extra bloat, and there can be issues with Java on specific platforms such as termux or 32-bit systems. The newest versions of Java aren't released by Oracle for 32-bit systems and consequently require an OpenJDK equivalent, and on Android systems (e.g. termux) root (or emulation) is likely required for Java to work.

Anyway, bashreduce is one project that lets you experiment with map-reduce without having to install Java.

https://github.com/erikfrey/bashreduce/blob/master/br

Bashreduce is mostly written in bash but does have a C tool for faster partitioning of text files. The C tool might be optional.

It looks like the API might be different from Hadoop's, so if you want the same API you could add a wrapper function.
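For instance, a small wrapper script could accept Hadoop-streaming-style options and translate them into a call to br. The sketch below is only an illustration of the idea: the name brwrap is made up, and the -r/-o options are how I read br's usage, so check its README before relying on them.

Code: Select all

#!/bin/bash
# brwrap: hypothetical wrapper giving br a Hadoop-streaming-like interface.
# Assumes br accepts -r (reduce command) and -o (output file) and reads stdin
# by default; verify against the usage text in erikfrey/bashreduce.
mapper=cat; reducer=cat; input=/dev/stdin; output=/dev/stdout
while [ $# -gt 0 ]; do
  case "$1" in
    -mapper)  mapper=$2;  shift 2 ;;
    -reducer) reducer=$2; shift 2 ;;
    -input)   input=$2;   shift 2 ;;
    -output)  output=$2;  shift 2 ;;
    *) echo "unknown option: $1" >&2; exit 1 ;;
  esac
done
# Run the map step locally, then let br partition the data and apply the reducer.
cat "$input" | eval "$mapper" | br -r "$reducer" -o "$output"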


Re: Bashreduce - parallel sorting and other tricks used in cloud computing

Post by s243a »

Erik Frey's project looks interesting, due both to the networking techniques used and to its likely efficiency advantages. However, many applications don't need that degree of parallel computing. Even a single processor core can do a lot of computation these days, and the sorting algorithm has nice scaling qualities from a computational-efficiency standpoint.

I think one can slow a bash script down a lot by invoking many sub-processes for small tasks. I also think there are big efficiency gains when the typical file size a given process handles is on the order of the file system's block size.

The above is a bit speculative on my part, but I notice that md5sum is very fast when applied to a whole directory of small files. For example, one might type "md5sum $(ls -1 *.files)", "md5sum $(ls -1f)", or "md5sum $(ls -1 *.list; ls -1 *.md5sums)", as I do in some draft code.
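As a rough way to see the sub-process overhead for yourself (this is just an illustrative test, not from my draft code), compare hashing each small file in its own md5sum process with hashing them all in a single invocation:

Code: Select all

# one md5sum process per file: a new sub-process is forked for every small file
time for f in *.list; do md5sum "$f"; done > /dev/null

# one md5sum process for the whole directory: the startup cost is paid only once
time md5sum *.list > /dev/null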

When working with the amount of metadata used by a typical package manager, the sort and join commands can be very fast without needing parallel processing. I've been playing around with this more batch-style processing for package-manager syncing and didn't need a bashreduce-type algorithm for computational speed.
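As a small illustration of that batch style (the file names here are hypothetical), two sorted metadata tables can be merged on the package name with join:

Code: Select all

# pkg_versions.txt:  pkg|version      (hypothetical)
# pkg_md5sums.txt:   pkg|md5sum       (hypothetical)
sort -t'|' -k1,1 pkg_versions.txt > versions.sorted
sort -t'|' -k1,1 pkg_md5sums.txt  > md5sums.sorted
# join the two sorted tables on field 1 (the package name)
join -t'|' -1 1 -2 1 versions.sorted md5sums.sorted > pkg_combined.txt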

However, a bashreduce-like algorithm could be useful for combining partial datasets to produce a complete dataset. This is similar to a join, except that the data isn't fully segregated into separate tables free of null entries. Here is my basic idea:

1. Say we have a bunch of lines with a common key; then for each line with that key we do the following:

Code: Select all

    # Split one instance of the key into its fields (field separator is OFS).
    split(line, fields, OFS)
    # For each field, keep the first non-empty value seen for this key.
    if ( length(pkg) == 0 ){
      pkg=fields[1]
    }
    if ( length(arch) == 0 ){
      arch=fields[2]
      sub(/^[:]/, "", arch)   # strip the leading ":" from the arch field
    }
    if ( length(ver) == 0 ){
      ver=fields[3]
    }
    if ( length(pkgfile) == 0 ){
      pkgfile=fields[4]
    }
    if ( length(dir_name) == 0 ){
      dir_name=fields[5]
    }
    if ( length(filelist) == 0 ){
      filelist=fields[6]
    }
    if ( length(md5sum) == 0 ){
      md5sum=fields[7]
    }

Then, once we have looped through all the data with a common key, we print the fields:

Code: Select all

print pkg, arch, ver, pkgfile, dir_name, filelist, md5sum 

More complete code can be found on pastebin. I think this technique might fall under a generalized version of map-reduce, or at least it is more general than the project I'm following, which is by sorhus:
https://github.com/sorhus/bash-reduce
http://sorhus.github.io/2016/03/27/bash-reduce-part-1/

Sorhus's bash-reduce implementation is easier to understand than Erik Frey's version but is likely less efficient (this is just speculation). However, as noted above, Sorhus's version is more than sufficient for the current application I'm looking at.

Anyway, a big part of understanding how Sorhus's implementation works is understanding how data with a common key is grouped together. This is done by a script called shuffle.awk, which unfortunately has a misleading name.

Usually map-reduce type algorithms are used for numeric computations such as word count. For instance, each word might be mapped to (word, 1), where word is the key and 1 is the count of a single instance of that key. To get the total count for a word, you sum up the separate counts of all instances that are the same (i.e. have the same key). The shuffle algorithm compresses the representation by using a single line for a given key and then writing the value of each word instance as a separate field.
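As a concrete sketch of that standard word-count case (written from scratch here, not taken from either project's code):

Code: Select all

# map: emit "word 1" for every word in the input
map() {
  tr -s '[:space:]' '\n' | awk 'NF { print $0, 1 }'
}

# shuffle: put all of the counts for the same word on one line: "word 1 1 1 ..."
shuffle() {
  sort -k1,1 | awk '$1 != key { if (key != "") print line; key = $1; line = $0; next }
                              { line = line " " $2 }
                    END       { if (key != "") print line }'
}

# reduce: sum the counts that follow each word
reduce() {
  awk '{ s = 0; for (i = 2; i <= NF; i++) s += $i; print $1, s }'
}

echo "the cat sat on the mat" | map | shuffle | reduce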

However, in a non-standard map-reduce-like implementation, we might consider having multiple fields for each key, with each field representing a different type of information. In this case we might need two distinct separators: one to separate the fields associated with a given instance of a key, and a different one to distinguish separate instances of that key. To accommodate this variation on the map-reduce genre of ideas I modified the so-called "shuffle.awk" function.

I also added the ability to repeat the key in the output of shuffle.awk for each instance of the key, even though this is redundant, since shuffle outputs all items that have the same key on a single line.
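To make the idea concrete, here is a stripped-down sketch of such a shuffle step. The real script on pastebin differs in details and its exact output format is simplified here; the variable names shuffle_FS, shuffle_OFS and Keep_Key match the ones passed with -v in the bash-reduce call further down.

Code: Select all

# shuffle_sketch.awk -- simplified illustration of the modified shuffle step.
# Input: mapper output lines of the form  key<shuffle_FS>field2<shuffle_FS>...
# Output: one line per key, with all instances of that key joined by shuffle_OFS.
BEGIN {
    if (shuffle_FS  == "") shuffle_FS  = "|"     # separates fields inside one instance
    if (shuffle_OFS == "") shuffle_OFS = "@@@@"  # separates instances sharing a key
    FS = shuffle_FS
}
{
    key = $1
    # Rebuild the instance, optionally keeping the key as its first field.
    start = (Keep_Key == "true") ? 1 : 2
    inst = $start
    for (i = start + 1; i <= NF; i++)
        inst = inst shuffle_FS $i
    if (key in grouped)
        grouped[key] = grouped[key] shuffle_OFS inst
    else
        grouped[key] = inst
}
END {
    for (key in grouped)
        print key, grouped[key]
}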

The shuffle step happens after the map and before the reduce part of the algorithm. The shuffle step doesn't add much value, as one could alternatively watch for the key to change value in order to know when one is working with a different set of items that all share a key. However, it does simplify the reducer somewhat and also makes it easier to parallelize the reduce part of the algorithm.
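For comparison, the key-change approach would read the sorted mapper output directly and notice when the key changes, roughly like this (a sketch only; the flush step is where the real reduce logic would go):

Code: Select all

# keychange_sketch.awk -- alternative to a shuffle step: read mapper output that
# has been sorted on the key and detect where the key changes.
BEGIN { FS = "|" }
function flush() {
    # placeholder reduce step: just report how many lines shared the key;
    # a real reducer would merge the buffered lines as in the snippet above
    if (n) print prev, n
    n = 0
}
$1 != prev {            # the key changed, so the previous group is complete
    flush()
    prev = $1
}
{ lines[++n] = $0 }     # accumulate the lines that share the current key
END { flush() }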

If you want to see the output of the shuffle algorithm, you can use an identity function as the reducer. Alternatively, you might want to use a grep-like function as the reducer to show only lines that have multiple instances of a single key. For a package-manager application this could identify when multiple versions of the same package are installed.
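Both of those debugging reducers are tiny. Assuming the shuffle output format described above (instances of a key joined by shuffle_OFS), they boil down to something like:

Code: Select all

# identity.awk -- pass the shuffled lines through unchanged, to inspect them
{ print }

# duplicates.awk -- grep-like reducer: only print lines whose value part contains
# shuffle_OFS, i.e. keys that occur more than once (for the package-manager case,
# a package recorded more than once).  Run with -v shuffle_OFS="@@@@" as before.
index($0, shuffle_OFS) > 0 { print }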

I should wrap up by showing how I call the bash-reduce-like type of algorithm. This can be seen at the end of the following code block:

Code: Select all

  dpkg)
    pfx=3
    cd "$a_dir/info"; dir_bname=$(basename "$a_dir")
    # Hash the per-package file lists in dpkg's info directory, rewrite each
    # "hash  filename" line into the pipe-separated fields noted below, and
    # sort on the package-name key.
    md5sum $(ls -1 *.list; ls -1 *.md5sums) "" \
       | sed -r 's#^([^[:space:]]+)([[:space:]]+)([^[:space:]]+.*)([.])([^.]*)$#\3||dpkg|\3\4|\1|'"$bname"'#g' \
       | sed -r 's#([^:]+)(:?[^\|]*\|)(.*)#\1|||||\2\3#g' \
       | sort > "${outfile}"_"$pfx"_md5 #f1=pkg, f2=:arch, f3=ver, f4=pkgfile, f5=dir_name, f6=filelist, f7=md5sum
    cd ..
    # Map dpkg's "status" file into the same pipe-separated format.
    cat status | awk -v db_dir=dpkg "$AWK_fn_map_deb" > "${outfile}"_"$pfx"_db
    cd $CWD/..
    # Concatenate the two partial datasets ...
    cat "${outfile}"_"$pfx"_md5  "${outfile}"_"$pfx"_db > "${outfile}"_"$pfx"_data
    cd $CWD/..
    # ... and merge the records that share a package key with bash-reduce.
    ./bash-reduce -v shuffle_OFS="@@@@" -v shuffle_FS="|" -v Keep_Key=true ./dpkg_mappers/identity.awk ./dpkg_reducers/LB_reducer.awk "${outfile}"_"$pfx"_data > "${outfile}"_"$pfx"

What I want to do here is combine the info I can glean from the files that list the contents of each package with the metadata given in the "status" file produced by dpkg. The actual call to bash-reduce is here:

Code: Select all

    ./bash-reduce -v shuffle_OFS="@@@@" -v shuffle_FS="|" -v Keep_Key=true ./dpkg_mappers/identity.awk ./dpkg_reducers/LB_reducer.awk "${outfile}"_"$pfx"_data > "${outfile}"_"$pfx"

Notice that I'm passing some input variables to bash-reduce: "shuffle_OFS", which separates unique instances of items that have the same key, and "shuffle_FS", which separates the fields within a single instance of a key. The md5sum of the file that lists the contents of a package might serve as a flag to see whether this file list exists for the package. I could also use it as a check to see whether there is a similar list of files in the Puppy package manager. If the Puppy package manager has the same file with a different hash, I could maybe try to put the two files in the same form so that a single file could be referenced by two hard links.
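A rough sketch of that last idea (the paths are hypothetical, and the step that puts both lists into the same form is left out):

Code: Select all

# If the dpkg file list and the corresponding Puppy file list hash to the same
# value once they are in the same form, keep one copy and hard-link the other
# to it (both paths must be on the same filesystem for a hard link to work).
dpkg_list="/var/lib/dpkg/info/${pkg}.list"   # hypothetical paths
puppy_list="/root/.packages/${pkg}.files"
if [ "$(md5sum < "$dpkg_list")" = "$(md5sum < "$puppy_list")" ]; then
    ln -f "$dpkg_list" "$puppy_list"         # replace the duplicate with a hard link
fi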

One should also note that in this last example some of the processing is done without the bash-reduce algorithm, and as I noted above this is sufficiently fast for my application.
