TribeFlow source code
Contains the TribeFlow (previously node-sherlock) source code.
The Python dependencies are: numpy, scipy, cython, pandas, mpi4py, and plac.
You will also need to install and set up OpenMP and MPI.
Easy way: Install Anaconda Python and set it up as your default environment.
Hard way: Use pip or your package manager to install the dependencies:
pip install numpy
pip install scipy
pip install cython
pip install pandas
pip install mpi4py
pip install plac
Use your package manager (apt on Ubuntu, Homebrew on a Mac) to install OpenMP and MPI. These are the setups I tested with, but TribeFlow should work in any other environment.
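To check that the Python dependencies are in place, a quick sanity script (a minimal sketch, not part of the repository) is:

# check_deps.py -- minimal sanity check (sketch) that the Python
# dependencies listed above import cleanly
import importlib

for mod in ['numpy', 'scipy', 'cython', 'pandas', 'mpi4py', 'plac']:
    try:
        importlib.import_module(mod)
        print('%s OK' % mod)
    except ImportError as exc:
        print('%s MISSING: %s' % (mod, exc))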
To build, simply type:
make
Either use
python setup.py install
to install the package, or just use it from the package folder using the run_script.sh command.
How to parse datasets: use the scripts/trace_converter.py script. It has built-in help.
For command line help:
$ python scripts/trace_converter.py -h
$ python main.py -h
Running with MPI:
$ mpiexec -np 4 python main.py [OPTIONS]
Running TribeFlow from other Python code: check the api_singlecore_example.py file.
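If you only need to drive the documented command line from Python, a thin wrapper over main.py could look like the sketch below. The helper name run_tribeflow is hypothetical, and it assumes the second positional argument is the number of latent environments (as in the examples in this README); api_singlecore_example.py remains the reference for calling the sampler in-process.

# run_tribeflow.py -- hypothetical helper (not part of the repository)
# that drives the documented main.py command line from Python
import subprocess

def run_tribeflow(trace, num_environments, model_out, extra_args=()):
    # Mirrors: python main.py trace.dat 100 output.h5 [OPTIONS]
    cmd = ['python', 'main.py', trace, str(num_environments), model_out]
    cmd.extend(extra_args)
    subprocess.check_call(cmd)

if __name__ == '__main__':
    run_tribeflow('trace.dat', 100, 'output.h5', ['--leaveout', '0.3'])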
Converting the Trace
Let's assume we have a trace like the Last.FM trace from Oscar Celma. In this example, each line is of the form:
userid \t timestamp \t musicbrainz-artist-id \t artist-name \t
musicbrainz-track-id \t track-name
For instance:
user_000001 2009-05-01T09:17:36Z c74ee320-1daa-43e6-89ee-f71070ee9e8f
Impossible Beings 952f360d-d678-40b2-8a64-18b4fa4c5f8 Dois Pólos
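Since the fields are tab-separated, a raw line can be pulled apart in Python as follows (a tiny sketch; the <...> placeholders stand in for the MusicBrainz identifiers and track name):

# split one raw Last.FM line into its tab-separated fields (sketch)
line = 'user_000001\t2009-05-01T09:17:36Z\t<artist-id>\tImpossible Beings\t<track-id>\t<track-name>'
user, tstamp, artist_id, artist, track_id, track = line.rstrip('\n').split('\t')
print('%s listened to %s at %s' % (user, artist, tstamp))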
First, we want to convert this file to our input format. We do this with the
scripts/trace_converter.py
script. Let's have a look at the options from
this script:
$ python scripts/trace_converter.py -h
usage: trace_converter.py [-h] [-d DELIMITER] [-l LOOPS] [-r SORT] [-f FMT]
[-s SCALE] [-k SKIP_HEADER] [-m MEM_SIZE]
original_trace tstamp_column hypernode_column
obj_node_column
positional arguments:
original_trace The name of the original trace
tstamp_column The column of the time stamp
hypernode_column The column of the hypernode
obj_node_column The column of the object node
optional arguments:
-h, --help show this help message and exit
-d DELIMITER, --delimiter DELIMITER
The delimiter
-l LOOPS, --loops LOOPS
Consider loops
-r SORT, --sort SORT Sort the trace
-f FMT, --fmt FMT The format of the date in the trace
-s SCALE, --scale SCALE
Scale the time by this value
-k SKIP_HEADER, --skip_header SKIP_HEADER
Skip these first k lines
-m MEM_SIZE, --mem_size MEM_SIZE
Memory Size (the markov order is m - 1)
The positional (mandatory) arguments are the original trace file, the timestamp column, the hypernode (e.g., user) column, and the object node column.
We can convert the file with the following line:
python scripts/trace_converter.py scripts/test_parser.dat 1 0 2 -d$'\t' \
-f'%Y-%m-%dT%H:%M:%SZ' > trace.dat
Here, we are saying that column 1 contains the timestamps, column 0 the users, and column 2 the objects (artist ids). The delimiter -d is a tab, and the timestamp format -f is '%Y-%m-%dT%H:%M:%SZ'.
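The -f value is a standard strptime format string, so you can check that it matches the trace's timestamps directly in Python:

# verify that the -f format string matches the trace's timestamps
from datetime import datetime

stamp = datetime.strptime('2009-05-01T09:17:36Z', '%Y-%m-%dT%H:%M:%SZ')
print(stamp)  # -> 2009-05-01 09:17:36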
Adding memory
Use the -m argument to increase the burst size (the B parameter in the paper).
python scripts/trace_converter.py scripts/test_parser.dat 1 0 2 -d$'\t' \
-f'%Y-%m-%dT%H:%M:%SZ' -m 3 > trace.dat
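Conceptually, a memory of size m keeps a sliding window over the last m objects of each user's sequence, which yields a Markov order of m - 1. A simplified sketch of that windowing (not the converter's actual code):

# sliding-window sketch: group a user's object sequence into
# overlapping windows of size m (memory), as -m conceptually does
def windows(objects, m):
    return [objects[i:i + m] for i in range(len(objects) - m + 1)]

print(windows(['a', 'b', 'c', 'd'], 3))
# -> [['a', 'b', 'c'], ['b', 'c', 'd']]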
Learning the Model
The example below runs TribeFlow with the same options used for every result in the paper, using 20 cores:
$ mpiexec -np 20 python main.py trace.dat 100 output.h5 \
--kernel eccdf --residency_priors 1 99 \
--leaveout 0.3 --num_iter 2000 --num_batches 20
Explaining the parameters: trace.dat is the converted trace, 100 is the number of latent environments to learn, and output.h5 is the file where the learned model is saved. --kernel eccdf selects the inter-event time kernel and --residency_priors 1 99 sets its priors, --leaveout 0.3 holds out the last 30% of the trace for testing, --num_iter 2000 sets the number of sampler iterations, and --num_batches 20 splits the work across the MPI workers.
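The output is an HDF5 file. Assuming main.py writes it as a pandas HDF5 store (pandas is a dependency; reading it this way also needs PyTables), you can inspect what the learned model contains with a few lines of Python:

# inspect the learned model file (sketch; assumes a pandas HDF5 store)
import pandas as pd

with pd.HDFStore('output.h5', mode='r') as store:
    print(store.keys())  # the tables/matrices saved by main.py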
Below is the list of datasets explored in the paper. We also curated links to various other timestamped datasets that can be exploited by TribeFlow and future efforts.
Datasets used in the paper:
List of other, some more recent, datasets that can be explored by TribeFlow: basically, anything with users (playlists, actors, etc. also work), objects, and timestamps.
In the example folder we have some sub-sampled datasets that can be used to better understand the method.