Repository contents:

- data_stream_files/
- label_dictionaries/
- raw_data_csv_files/
- README.md
The raw data used to create the streams can be found in the raw_data_csv_files/ directory, in the form of CSV files with metadata columns. The data for the Bavarian Forest is found in all_data_MD.csv. The 'series_code' column gives every series of images (sometimes referred to as sequences) a unique identifier. The camera traps take five images at a time, check for movement, and then take another five images until no more movement is detected, which means that the lengths of the series differ. The 'broken' column was created by checking which images are truncated or otherwise corrupted.

When we received the data from the biologists, they had only labeled every fifth image of a sequence, and only those series containing at least one image with an animal. We used this information to extrapolate the labels to entire sequences. The 'in_csv_file' column indicates whether an image was contained in the original file. The 'orig_label' column is wrong and should be ignored; use the 'species' column for class names.

The 'MD_category' column contains the classification of the most confident bounding box of the MegaDetector. The MegaDetector only distinguishes between animals (1), humans (2) and vehicles (3); images where the MegaDetector did not detect anything are labeled with -1 in that column. The 'MD_max_conf' column contains the highest confidence among the detected bounding boxes, while 'MD_all_conf' contains a list of all confidences in string format, which also indicates the number of bounding boxes detected.

The all_data_Caltech.csv file was created from the original metadata JSON files and contains similar columns to all_data_MD.csv. The all_data_BB.csv file contains the metadata for the subset of the Brandenburg dataset that can be found in /AMMOD_data/camera_traps/Brandenburg/crops_tilo.
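As a minimal sketch (assuming standard comma-separated files and that 'MD_all_conf' is a Python-style list literal), the metadata could be inspected with pandas along these lines:

```python
import ast

import pandas as pd

# Paths are relative to the repository root.
df = pd.read_csv("raw_data_csv_files/all_data_MD.csv")

# Use the 'species' column (not 'orig_label') as the class name.
print(df["species"].value_counts())

# Every series/sequence of images shares one 'series_code'.
print("number of series:", df["series_code"].nunique())

# Images where the MegaDetector found nothing are marked with -1.
detected = df[df["MD_category"] != -1].copy()

# 'MD_all_conf' stores all bounding-box confidences as one string; parsing it
# with ast.literal_eval assumes a Python-style list literal (an assumption).
detected["n_boxes"] = detected["MD_all_conf"].apply(ast.literal_eval).apply(len)
print(detected["n_boxes"].describe())
```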
The label_dictionaries/ folder contains pickled dictionary files that map an integer label to a human-readable species name. The Bavarian Forest dictionary used in all recent experiments is 'BIRDS_11_Species.pkl'. The earlier version, from before two more classes were filtered out for being too small, is found in 'BIRDS.pkl'. The word 'BIRDS' comes up in these file names because all the different fine-grained bird labels in the original CSV file we received from the biologists were joined to form a larger 'birds' class. 'BB.pkl' is the label dictionary for the Brandenburg dataset, while the dictionary for the Caltech camera trap dataset is found in CALTECH_LABEL_DICT.pkl.
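A minimal sketch of loading one of these dictionaries, assuming each pickle holds a plain dict mapping int to str:

```python
import pickle

# Dictionary used in the recent Bavarian Forest experiments (11 classes).
with open("label_dictionaries/BIRDS_11_Species.pkl", "rb") as f:
    label_dict = pickle.load(f)

# Map integer labels to human-readable species names.
for int_label, species_name in sorted(label_dict.items()):
    print(int_label, species_name)
```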
The streams in the data_stream_files/ directory were created with the data_stream_creation notebook found in /scripts/jupyter_notebooks/. The folder BW_stream_files contains the files for the Bavarian Forest data, Bayerischer Wald (BW), while Caltech_stream_files contains the data for the Caltech data. In each of these folders there are files corresponding to each of the five cross-validation splits; the split membership is indicated by the 'cvi_' prefix at the beginning of each file name. For each split, the data are divided into training, test and validation sets (roughly two-thirds for training, with the remainder split between test and validation). The data is split series- and class-wise: all images from one series are placed in exactly one of the three groups train/test/validation, and the series for each class were split into five sets used to create the cross-validation splits, so the share of images from each class is roughly the same across the train/test/validation data. Because the series differ in length, this only holds approximately.

For each split there is a train_stream, test_stream and validation_stream file. The train_stream files contain a list of sublists whose elements are tuples with the following fields: (img_path, int_label, seq_id). Each sublist is also called an experience, a term introduced by Avalanche, and is either 128 or 384 tuples long. The stream with experience size 384 contains exactly the same data as the corresponding stream with experience size 128; the experiences were simply joined together, so the test and validation data remain the same. The train streams are ordered by date and series code, meaning the first experience contains the series with the earliest dates in the dataset. In order to fit the fixed experience size, a single sequence may have been split across two consecutive experiences. The test and validation files contain a single list of tuples of the same shape.

The test/validation files with either 'summer' or 'winter' in the file name are the subsets of the full test/validation files corresponding to a specific season; we defined the two months before and the three months after the summer solstice as summer. Each of the described files also comes in a '_crop_' version, where the image path in the tuples differs from the non-crop version. In our experiments only one cropped image, the one with the highest MegaDetector confidence, was extracted and used. Preliminary experiments showed the benefit of using the cropped images in terms of classification accuracy, which is why we used only the crops in the experiments. The 'exp_season_split_dict.pkl' files are dictionaries with the keys 'summer' and 'winter', whose values are lists indicating which experiences of the stream with experience size 128 belong to which season. This has to be taken into account when using the data with experience size 384, where size-128 experiences were joined together (384 = 3 × 128).
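A hedged sketch of how the stream files could be loaded; the exact file names below are assumptions (only the 'cvi_' prefix, the folder names and 'exp_season_split_dict.pkl' appear above), and the index mapping to the size-384 stream assumes that consecutive groups of three size-128 experiences were joined:

```python
import pickle

# Hypothetical file names for split 0; the real names follow the 'cvi_' prefix
# convention inside data_stream_files/BW_stream_files/.
with open("data_stream_files/BW_stream_files/cv0_train_stream.pkl", "rb") as f:
    train_stream = pickle.load(f)  # list of experiences

# Each experience is a list of (img_path, int_label, seq_id) tuples
# of length 128 (or 384 in the joined version).
print("experiences:", len(train_stream), "first size:", len(train_stream[0]))
img_path, int_label, seq_id = train_stream[0][0]

# The season dictionary refers to experience indices of the size-128 stream.
with open("data_stream_files/BW_stream_files/cv0_exp_season_split_dict.pkl", "rb") as f:
    season_dict = pickle.load(f)
summer_exps_128 = season_dict["summer"]

# Assumed mapping to the size-384 stream: three consecutive size-128
# experiences merged into one, so indices divide by three.
summer_exps_384 = sorted({i // 3 for i in summer_exps_128})
```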