technology | PixelClock

Posts Tagged ‘technology’

Fast offline people counting from the sky with AI

Posted: July 31, 2025 in AI
Tags: artificial-intelligence, computer-vision, machine-learning, technology, yolo

Some time ago, I was challenged with a project for automatically monitoring the occupancy of a natural area.

The idea was to use a drone equipped with a camera to shoot a video from the air. The video would later be processed with computer vision algorithms and deep neural network models to obtain the number of people present at the site at the time of recording.

It involved a linear area several kilometers long, which the drone would survey from end to end, with the camera capturing the full width of the path as it moved along.

Picture of a natural area. Photo by viveelaltopalancia.blogspot.com

The expected result was a video with bounding boxes overlaid on each detected person, along with a global counter showing the total number of people seen up to that point.

Once the video was processed, the detected occupancy would be reported to the authorities, who would compare it with manual counts to detect anomalies caused by large variations.

To make things more interesting, we couldn’t count on reliable, high-bandwidth internet access in the area to be surveyed, so the only viable option was to process the video completely offline. Also, deploying the solution in the cloud would be more complex and require the usual monitoring and management.

So, we’re talking about a multi-object tracker. This is usually done in two steps:

1. Detection

In the detection phase, we run inference on a video frame using a deep CNN. The result is a set of bounding boxes, each with its class (type of detected object) and a confidence score ranging from 0 to 1.

Image of an office with bounding boxes over different objects — Photo by MTheller

2. Tracking

In the tracking phase, we assign a unique identifier to each bounding box and maintain it across consecutive frames. We analyze the content of the bounding box in the previous frame and compare it with the current one to determine if it’s the same object. When a new person appears who hasn’t been detected previously, we increment the global people counter. This is a way to avoid counting the same person once per frame. A good tracker should also be able to handle temporary occlusions. For example, if a person walks behind a tree and is hidden for a moment, the tracker must recognize them as the same person when they become visible again and reassign the same ID. This is called re-identification.
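To make the ID assignment and counting concrete, here is a minimal sketch in Python. It is a deliberately simplified stand-in, not the tracker used in the project: it matches boxes greedily by overlap (IoU) only, drops any track that is not matched in the current frame, and therefore does no re-identification after occlusions. The class name `IouTracker` and the `min_iou` threshold are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


@dataclass
class IouTracker:
    min_iou: float = 0.3  # hypothetical matching threshold
    tracks: Dict[int, Box] = field(default_factory=dict)
    total_seen: int = 0   # global people counter

    def update(self, detections: List[Box]) -> Dict[int, Box]:
        """Assign an ID to every detection of the current frame."""
        unmatched = dict(self.tracks)
        current: Dict[int, Box] = {}
        for det in detections:
            # Pick the best-overlapping track that is still unmatched.
            best_id, best_iou = None, self.min_iou
            for track_id, box in unmatched.items():
                score = iou(det, box)
                if score >= best_iou:
                    best_id, best_iou = track_id, score
            if best_id is None:
                # Nobody matches: this is a new person, so count them once.
                self.total_seen += 1
                best_id = self.total_seen
            else:
                del unmatched[best_id]
            current[best_id] = det
        self.tracks = current  # unmatched tracks are simply forgotten here
        return current
```

A person who moves a few pixels between frames keeps the same ID, so the counter only grows when a box appears that overlaps no known track.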

GIF animation of a multi-object tracker. Image from GeekAlexis’s GitHub page

Tracked people with their corresponding unique IDs

Hardware

To run this process on-site, we purchased a gaming laptop that was modest by the standards of the time: an HP Victus with 16 GB of RAM, a 512 GB SSD, and an Nvidia RTX 3050 graphics card.

To capture the footage, a DJI Mavic Air 2 drone was used, recording at 4K resolution (3840×2160).

Software

The operating system is Windows. I would have preferred to install a Linux distro, but my experience with laptops using dual GPU setups is a bit hit-and-miss. I use Linux as my daily driver on one of those machines—I know what I’m talking about 😄. And I didn’t have much time for development, so I prioritized having good driver support, even if it meant doing some acrobatics to get the whole framework working.

Preparation

In previous years, the company had already counted people on beaches with a custom solution developed several years earlier, based on a YOLOv4 detector and the Deep SORT algorithm.

Still image of a video where people counting is showcased — Frame produced by the preexisting solution. Notice the unique numeric IDs on top of each bounding box. Blue boxes are new detections (no ID yet), while red boxes are detections tracked over time.

YOLO is a family of detection algorithms known for their speed, which comes from making a single pass through the neural network (You Only Look Once).

The bounding boxes generated by YOLO in the current frame were fed to Deep SORT for tracking.

Deep SORT extends SORT (Simple Online and Realtime Tracking) with a deep association metric. It relies on deep learning models and computationally complex algorithms for bounding box association. A CNN extracts appearance descriptors that encode the appearance of the detected objects. A Kalman filter is also used to predict the state of the object in the current frame based on its last known state, accounting for the object’s motion dynamics.
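The Kalman prediction step can be illustrated with a small constant-velocity filter for a single coordinate (say, the x position of a box center). This is only a sketch under stated assumptions: Deep SORT tracks a larger state covering the whole box, and the noise values, the initial covariance, and the class name `ConstantVelocityKalman` are chosen here purely for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ConstantVelocityKalman:
    """Kalman filter for one coordinate with state [position, velocity]."""
    position: float
    velocity: float = 0.0
    process_noise: float = 0.01     # assumed value
    measurement_noise: float = 1.0  # assumed value
    # 2x2 state covariance, starting with high uncertainty (assumed value).
    cov: List[List[float]] = field(
        default_factory=lambda: [[10.0, 0.0], [0.0, 10.0]]
    )

    def predict(self, dt: float = 1.0) -> float:
        """Project the state dt frames ahead and return the predicted position."""
        self.position += self.velocity * dt
        (a, b), (c, d) = self.cov
        # P = F P F^T + Q with F = [[1, dt], [0, 1]] and Q = process_noise * I
        self.cov = [
            [a + dt * (b + c) + dt * dt * d + self.process_noise, b + dt * d],
            [c + dt * d, d + self.process_noise],
        ]
        return self.position

    def update(self, measured_position: float) -> None:
        """Correct the prediction with a measured position (H = [1, 0])."""
        (a, b), (c, d) = self.cov
        innovation = measured_position - self.position
        s = a + self.measurement_noise       # innovation covariance
        k_pos, k_vel = a / s, c / s          # Kalman gain
        self.position += k_pos * innovation
        self.velocity += k_vel * innovation
        # P = (I - K H) P
        self.cov = [
            [(1 - k_pos) * a, (1 - k_pos) * b],
            [c - k_vel * a, d - k_vel * b],
        ]
```

Calling `predict()` once per frame and `update()` whenever a detection is matched lets the filter learn the object’s velocity, so it can keep estimating where the box should be in frames where the detection is missing.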

The process ran in the cloud using servers with Nvidia A100 cards. An RTSP stream was transmitted live, stored on the server, and added to a job queue.

It worked but had some shortcomings:

There was no way to resume a stream if it was cut off, which happened frequently on the beach due to high crowds and cell saturation. An incomplete video was processed anyway, yielding a partial result.
It was very heavy: It could take 30 minutes from the end of a transmission to having the result ready. This was to be expected because of Deep SORT’s high complexity.
It got expensive very quickly: In order to save money, GCP instances had to be manually brought down and up every day. Arrive one minute late, and they’re out of GPUs for you to rent.
GPUs were sometimes pulled offline for some reason, and the instances had to be restarted manually.

When the preexisting solution was developed, a custom YOLOv4 model was trained for it, using hundreds of real images taken by the drone at beaches and manually labeled. Since the model worked well, and there was no time to train and tune a new one, I decided to reuse it but change the framework and tracker, as the bottleneck was the tracking stage, which could take over 1 s per frame.

I researched ways to lighten the tracking workload to make it feasible to run offline on a laptop and found several interesting proposals with one thing in common: not running tracking on every frame, but on one out of every N frames.

Mvmed is a real-time online tracker for objects in MPEG-4 and H.264 compressed videos. The interesting part here is that the motion vectors stored in the P and B frames of an H.264 stream are averaged inside each bounding box to interpolate their positions and sizes for the frames between tracker executions.
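The averaging idea can be sketched in a few lines of Python. This is an illustration under assumptions rather than Mvmed’s actual code: motion vectors are assumed to be already decoded into `(x, y, dx, dy)` tuples giving a block center and its forward displacement in pixels, and the sketch only translates the box, leaving its size unchanged. The function name `shift_box` is hypothetical.

```python
from typing import Iterable, Tuple

Box = Tuple[float, float, float, float]           # (x1, y1, x2, y2)
MotionVector = Tuple[float, float, float, float]  # (x, y, dx, dy)


def shift_box(box: Box, vectors: Iterable[MotionVector]) -> Box:
    """Move a box by the mean displacement of the motion vectors inside it."""
    x1, y1, x2, y2 = box
    inside = [
        (dx, dy)
        for x, y, dx, dy in vectors
        if x1 <= x <= x2 and y1 <= y <= y2
    ]
    if not inside:
        return box  # no motion information: keep the last known position
    mean_dx = sum(dx for dx, _ in inside) / len(inside)
    mean_dy = sum(dy for _, dy in inside) / len(inside)
    return (x1 + mean_dx, y1 + mean_dy, x2 + mean_dx, y2 + mean_dy)
```

Because the vectors are already present in the compressed stream, this kind of update costs almost nothing compared with running a detector or an appearance model on the frame.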

As interesting as this approach seemed to me, getting the Docker container to build was far from straightforward because of broken and outdated packages. It turned out to be a rabbit hole I couldn’t afford to go down.

Also, Mvmed doesn’t use YOLOv4 for detection, so I would have had to change that part—which I wouldn’t have minded if the project compiled out-of-the-box.

FastMOT is another project, somewhat outdated by current standards, that takes the same approach but uses a KLT tracker (optical flow) to fill in the gaps between tracking steps efficiently.

Detection is done with YOLOv4.
Tracking is run once every N frames using Deep SORT with OSNet re-identification. It also includes camera motion compensation.
Accuracy on the MOT20 training set is 77.9% when run every 5 frames.
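The frame-skipping scheme shared by these projects boils down to a simple schedule: run the expensive step on one frame out of every N and a cheap step on the rest. The sketch below shows only that schedule. The callables `full_step` and `light_step` are hypothetical placeholders for the heavy detection and association work and for the lightweight gap-filling, and the function is not taken from FastMOT.

```python
from typing import Callable, Iterable, List, TypeVar

Frame = TypeVar("Frame")


def run_with_frame_skip(
    frames: Iterable[Frame],
    full_step: Callable[[Frame], None],
    light_step: Callable[[Frame], None],
    skip: int = 5,
) -> List[str]:
    """Run full_step on every skip-th frame and light_step on the others."""
    if skip < 1:
        raise ValueError("skip must be at least 1")
    schedule: List[str] = []
    for index, frame in enumerate(frames):
        if index % skip == 0:
            full_step(frame)   # expensive: detector plus association
            schedule.append("full")
        else:
            light_step(frame)  # cheap: propagate existing boxes
            schedule.append("light")
    return schedule
```

With `skip=5`, only one frame in five pays the full cost, which is what makes real-time processing on a laptop GPU plausible.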

So at that time, I decided to move forward with FastMOT.

After updating quite a few obsolete dependencies, fixing many build errors, and compiling OpenCV with support for the RTX 3050 compute architecture (compute_86), I converted the detection model to ONNX and then to the TensorRT format FastMOT expects.

And I finally ran it on a test flight.

Static frame of a video with detection bounding boxes — Frame of a processed video showing only bounding boxes without ids

I modified the tracker code to add two counters and display them on each frame. One counter shows the number of distinct people counted so far, and the other shows the number of people detected in the current frame.
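The logic behind the two counters is small enough to sketch. Assuming the tracker exposes the set of track IDs visible in each frame (an assumption about the interface, not FastMOT’s actual API), both numbers fall out of a running set. The function name `count_people` is hypothetical.

```python
from typing import Iterable, Iterator, Set, Tuple


def count_people(ids_per_frame: Iterable[Set[int]]) -> Iterator[Tuple[int, int]]:
    """Yield (distinct people so far, people in this frame) for each frame."""
    seen: Set[int] = set()
    for ids in ids_per_frame:
        seen.update(ids)
        yield len(seen), len(ids)
```

Each yielded pair can then be drawn on the corresponding frame as text.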

I tweaked the line and font sizes a bit, and this was the result.

GIF animation showing the results — Scaled down gif version of the processed video (10 fps)

Deployment

To run it on Windows, I used WSL2. Fortunately, support for GPU virtualization had recently been added, so I installed the Nvidia Container Toolkit.

I deployed the Docker container.

And to allow a user to run it easily, I programmed a simple Python UI using tkinter.

The UI runs locally (not in WSL) and lets users select a file and process it. When processing starts, the Docker container is launched, this time inside WSL, with the proper parameters. The X11 windows are redirected, and a window shows the video being processed in real time.

docker run --gpus all --rm -it -v "$(pwd)":/usr/src/app/FastMOT -v "$3":/tmp/data -v /tmp/.X11-unix:/tmp/.X11-unix -e DISPLAY=unix$DISPLAY -e TZ="$(cat /etc/timezone)" fastmot:latest python ./app.py -i "/tmp/data/$1" -o "/tmp/data/$2" --mot -g -v
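On the Windows side, the UI has to translate the selected file into arguments that make sense inside WSL. The sketch below shows one way this could look, and it is an assumption rather than the original code: it supposes the `docker run` line above lives in a shell script (here given the hypothetical name `run_fastmot.sh`) that takes the input file name, the output file name, and the data directory as `$1`, `$2`, and `$3`. It relies on the WSL convention of mounting Windows drives under `/mnt/<drive letter>`.

```python
from pathlib import PureWindowsPath
from typing import List


def to_wsl_path(windows_path: str) -> str:
    """Convert a Windows path such as C:\\videos\\a.mp4 to /mnt/c/videos/a.mp4."""
    path = PureWindowsPath(windows_path)
    drive = path.drive.rstrip(":").lower()
    return "/mnt/" + drive + "/" + "/".join(path.parts[1:])


def build_command(input_file: str, output_name: str) -> List[str]:
    """Build the wsl.exe invocation for a video selected in the UI."""
    source = PureWindowsPath(input_file)
    return [
        "wsl",
        "bash",
        "./run_fastmot.sh",              # hypothetical wrapper script name
        source.name,                     # $1: input file name
        output_name,                     # $2: output file name
        to_wsl_path(str(source.parent)), # $3: data directory, as seen from WSL
    ]
```

A tkinter button callback could obtain `input_file` from `tkinter.filedialog.askopenfilename()` and pass the resulting list to `subprocess.run()`.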

Finally, a desktop shortcut pointing to the Python UI app completes the setup.

It ran consistently at 30+ fps on the Victus laptop with a frame skip of N = 5, with good re-identification results.
The confidence threshold for rejecting detections was set to 0.3 (objects detected with < 30% confidence are not passed on to the tracker).
The tracker works fairly well even when the YOLO model struggles with detections in consecutive frames.
The model is primarily trained on pedestrians captured from above with a drone. As you can see, it struggles a bit with swimmers.
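The confidence threshold mentioned above amounts to a one-line filter between the detector and the tracker. Here is a minimal sketch. The `Detection` structure and the function name `keep_confident` are assumptions for illustration and do not mirror FastMOT’s internal types.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Detection:
    box: Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels
    label: str                              # detected class, e.g. "person"
    confidence: float                       # score between 0 and 1


def keep_confident(detections: List[Detection], threshold: float = 0.3) -> List[Detection]:
    """Drop detections whose confidence is below the threshold."""
    return [d for d in detections if d.confidence >= threshold]
```

Lowering the threshold passes more uncertain boxes to the tracker (fewer missed people, more false positives), while raising it does the opposite.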

Final thoughts

The implemented solution met expectations and seemed to provide the right balance between precision and computational cost for running on edge hardware.

There are many other interesting architectures for dense crowd estimation, such as CSRNet, that I’d like to take for a spin.

Melbits Pod: Merging Digital and Physical Play for Children

Posted: September 13, 2024 in Hardware projects, Melbits POD
Tags: arduino, bluetooth, electronics, embedded, firmware, hardware, iot, melbits, nordic, pixies, pod, project, sensors, smart, technology, toy

One of the things I enjoy the most is embedded development. I always say that creating something with your own hands, which you can also program and watch as the gadget comes to life, is an incredible feeling. Eight years ago, I embarked on what was possibly my most ambitious project in this area.

It all started way back in 2016 as a business idea to create something fun that would also have a positive impact on society.

The motivation

In this screen-filled world where children are increasingly immersed at an early age, and where parents often use them as a sort of “digital pacifier” to give themselves a moment of respite, we who are still big kids perceive that the art and joy of playing with physical toys in the real world is being lost. Audiovisual media floods all our senses, leaving little to no room for imagination.

But don’t get me wrong, I am a big advocate of technology because it enables us to do incredible things and allows for lifestyles never seen before. In recent decades, we’ve experienced a technological explosion unlike anything in human history. Robotics, telecommunications, electronics, artificial intelligence… New technologies arise and become obsolete in just a matter of months. It’s hard to stay up to date, but it’s even harder for society to absorb this new world of possibilities moving at breakneck speed. So, in many cases, something that could be highly beneficial if used well becomes the opposite because it’s misused. And it’s misused because development has been so fast that we haven’t had time as a society to build a culture and healthy habits around the digital world.

From a reflection like this, an idea emerges.

The idea

In short: an electronic toy with sensors that you have to play with in order to progress in a digital game.

Melbits are small digital pixies that start their lives as seeds. As they grow and develop, they turn into puppies or adult Melbits. But not everything is happiness in the world of Melbits, because viruses, representing everything bad in the digital world, are lurking and can infect Melbits, thus creating new viruses.

The user will obtain Melbit seeds in the digital game (smartphone or tablet), which must then be transferred to the real world by loading them into the toy, which acts as a sort of incubator. Next, they’ll receive an incubation recipe, which might include moving the toy, letting it stand still, giving it some sunshine, or maybe keeping it in a dark, cold place.

This is where the digital part ends and the physical part begins. From this moment on, the user must play with the toy as instructed if they want the Melbit to develop correctly inside. Otherwise, a virus will appear, and there will be consequences.

Little girl playing with the Melbits Pod

The incubation process can last from seconds to hours, helping to cultivate skills like patience and perseverance.

Once incubation is complete, the toy notifies the user, and they can transfer the results back to the tablet to see how well or poorly they did.

Girl lifting up the Melbits Pod revealing a new Melbit inside

Hardware specs

Right from the beginning, it was clear that the toy needed to have sensors, some way to provide feedback to the user, and, if possible, no buttons, as well as the ability to communicate with a smartphone or tablet. However, the specific requirements evolved throughout the development process as we shaped the user experience and experimented with different solutions to find the one that offered the best cost-benefit ratio.

Here’s how the final specifications turned out:

Bluetooth Low Energy
4 high-brightness orange LEDs
RGB LED
ERM vibration motor
Temperature sensor
2 photodiodes
Accelerometer
Amplified speaker
Rechargeable lithium battery
USB port
Hidden multifunction button (not typically used)
ARM Cortex-M4 microcontroller at 64 MHz
192 KB of Flash memory
24 KB of RAM
Internal hard ABS casing with screws, enclosed in a soft vinyl outer casing

Firmware Specs

Encrypted bootloader with OTA capability
Game logic
Music player with sine wave, triangular wave, noise, and PCM channels
Extra channel to control the vibration motor
Extra channels to control the LEDs
Several embedded melodies and effects
Adjustable output volume
Battery charge via USB connection and SoC monitoring with feedback
Sensor reading and updating
Motion pattern recognition
Automatic sleep and wake-up mode without buttons
“Box mode”
Storage memory
Customizable settings (vibration, LED brightness, speaker volume)
Streaming sensor readings via BLE
Diagnostic functions for manufacturing
Magic Link! (more on this later)
Proprietary encrypted BLE protocol controlling all functions
All of this within 192 KB of Flash and 24 KB of RAM!

Software Specs

The Melbits Pod app was developed in parallel by another team. It’s a Unity3D project which I adapted to support Bluetooth LE and async communications with the POD.

Cross-platform iOS/Android app
3D graphics with skeletal models
Multiple props, accessories, and costumes
2D touch UI
User guide tutorial
Melbit family tree
Persistent user profile
UI for viewing and changing toy settings
Tutorial voiceovers
Music and sounds by Aries!
Analytics
Activities with Melbits (playing, feeding, etc.)
Flow for loading and unloading a Melbit to/from the POD
Bluetooth Low Energy with proprietary encrypted protocol
Automatic toy firmware updates via OTA
Front camera usage to take a selfie with your Melbit
Augmented reality using the rear camera

Pod Simulator

Desktop app created during development to parallelize app development before hardware was available

In addition to all of this, I studied European and international regulations, as they are very strict with toys.

In closing

It was undoubtedly a great and exciting project. Join me in the following articles, where I explain how I developed it prototype after prototype, all the way to the final product, as the technical director of Melbot Studios, how I traveled to China to resolve questions with the manufacturer, and how it was finally mass-produced and went on sale after a successful Kickstarter campaign!
