Task 1: Crowd Counting

Deep-learning-based computer vision algorithms have surpassed human-level performance on many CV tasks, such as object recognition and face verification. However, computer vision relies on the valid information in the input image or video, and an algorithm's performance is fundamentally constrained by the quality of the source image or video. With the emergence of gigapixel-level images and video, the corresponding computer vision tasks remain unsolved because of the extremely high resolution, large scale, and huge data volume produced by gigapixel cameras.

This task is intended to evaluate the ability of algorithms to estimate the crowd density map in a complex scenario. Participants will use our Gigapixel Video Dataset, a new resource for computer vision challenges that offers both high spatial resolution and a wide field of view (FOV).

Dataset Download:

The Gigapixel Video Dataset 0.1alpha will be used for this task. It consists of 65 representative images taken from the train station and Shanghai marathon sequences. The images are saved in JPEG format and contain more than 200K heads. We will release more labeled data in the future.

  • Marathon
      • Video length: 9000 frames
      • Frame size: 27000x15000
      • Frame rate: 30 Hz
  • Station
      • Video length: 9000 frames
      • Frame size: 26000x14000
      • Frame rate: 30 Hz

Invalid Area

Because of the limited resolution, even a human sometimes cannot count the exact number of people in distant regions. We have therefore delineated invalid areas, which are considered unrecognizable to human annotators and have no groundtruth labels.

Dataset              Image Size       Invalid Area
shanghai_marathon    26908 × 15024    1 ≤ x ≤ 26908, 1 ≤ y ≤ 6670
train_station        26558 × 14828    1 ≤ x ≤ 26558, 1 ≤ y ≤ 5130

Note: The top left pixel is set to the origin of the coordinates (x = 1, y = 1).
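The invalid-area rule above can be sketched in a few lines of Python. This is only an illustration: the names `INVALID_Y_MAX`, `in_invalid_area`, and `filter_valid_heads` are hypothetical and not part of any released toolkit, and the sketch assumes that points falling inside the invalid area should simply be ignored. The bounds are copied from the table, and coordinates are 1-indexed as stated in the note.

```python
# Largest y value (1-indexed) that still lies inside the invalid area,
# copied from the table. In both images the invalid area spans the full
# width, so only the y coordinate decides whether a point is invalid.
INVALID_Y_MAX = {
    "shanghai_marathon": 6670,
    "train_station": 5130,
}


def in_invalid_area(x, y, dataset):
    """Return True if the 1-indexed point (x, y) lies in the invalid area.

    x is accepted for readability but unused, because the invalid area
    covers every column of the image.
    """
    return 1 <= y <= INVALID_Y_MAX[dataset]


def filter_valid_heads(heads, dataset):
    """Keep only the (x, y) points that lie outside the invalid area."""
    return [(x, y) for (x, y) in heads if not in_invalid_area(x, y, dataset)]
```

When indexing a 0-based image array, a 1-indexed pixel (x, y) corresponds to row y - 1 and column x - 1.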


The groundtruth labels are saved in .mat and .txt files.

The first two lines indicate the total number of people in the image. After that, each line represents a head position. The first number is the x coordinate; the second number is the y coordinate. The top left pixel is (1,1).

<x1 y1>
<x2 y2>
...
<xN yN>
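A minimal sketch of reading the .txt labels in Python, based only on the description above. The function name `parse_head_annotations` and its `header_lines` parameter are hypothetical. The sketch assumes that the leading count lines can be skipped, that each remaining line holds two whitespace-separated numbers, and that the angle brackets in the layout are placeholders rather than literal characters. Adjust `header_lines` if the actual files differ.

```python
def parse_head_annotations(text, header_lines=2):
    """Parse the contents of a groundtruth .txt file into (x, y) tuples.

    header_lines is the number of leading lines that carry the total
    count (assumed to be 2, following the description); they are skipped.
    Coordinates stay 1-indexed, with the top-left pixel at (1, 1).
    """
    heads = []
    for line in text.splitlines()[header_lines:]:
        parts = line.split()
        if not parts:
            continue  # tolerate blank lines
        heads.append((float(parts[0]), float(parts[1])))
    return heads
```

For example, `parse_head_annotations(open(path).read())` returns a list of head positions, where `path` points to one of the label files.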


We provide the results for a baseline proposal method on the test subset.……baseline details……


All Gigapixel Video Dataset data on this page is copyrighted by Smart Imaging Laboratory, Tsinghua-Berkeley Shenzhen Institute, and published under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 License. This means that you must attribute the work in the manner specified by the authors; you may not use this work for commercial purposes; and if you alter, transform, or build upon this work, you may distribute the resulting work only under the same license.


If you use our datasets in your research, we would be happy if you cite us!
For all of the datasets, please cite:

title={Multiscale gigapixel video: A cross resolution image matching and warping approach},
author={Yuan, Xiaoyun and Fang, Lu and Dai, Qionghai and Brady, David J and Liu, Yebin},
booktitle={Computational Photography (ICCP), 2017 IEEE International Conference on},