• Fast way to iterate through video frames with Python and OpenCV

    Recently, I was working on a program to sample N frames from a source video, and then assign a score to each frame (from 0 to 1) in terms of thumbnail-worthiness. Before long, it became apparent that decoding video frames was one performance bottleneck. In this post, I would look into different ways of reading video frames with OpenCV and then speeding it up with multithreading.

  • Implement shapenet face landmark detection in Tensorflow

    In my previous post on building face landmark detection model, the Shapenet paper was implemented in Pytorch. With Pytorch, however, to run the model on mobile requires converting it to Caffe. Though there is tool to take care of that, some operations are not supported and in the case of Shapenet, it was not something I know how to fix yet. Turn out it was simpler to just re-implement Shapenet in Tensorflow and then convert it to Tensorflow Lite.

  • Use MobileNetV2 as feature extractor in Tensorflow

    Applying machine learning in image processing tasks sometimes feel like toying with Lego blocks. One base block to extract feature vectors from images, another block to classify… Popular choices of feature extractors are MobileNet, ResNet, Inception. And as with any other engineering problem, choosing a feature extractor is about considering trade-offs between speed, accuracy, and size. For my current task of dealing with ML on mobile devices, MobileNetV2 seem to be a good fit as it is fast, quantization friendly and does not sacrifice too much of accuracy. Tensorflow provides a reference implementation of MobileNetV2 that makes using it much easier.

  • Train a face dectector using TensorFlow object detection API.

    About 3 years ago, putting together a face detection camera application for mobile devices was more involving a task. I remember a colleague sitting next to me back then tinkering with OpenCV and dlib to produce a demo with the right trade-off between size, speed and accuracy. As with every engineering problem, there is no one-size-fit-all solution. A on-device face detector may choose to reduce the size of input images to quicken detection, though lower resolution results in lower accuracy. Fast forward to the moment, it has never been as easier to customize your own face dection model thanks to folks at Google who open source their Tensorflow object dection api. Besides, platforms like Colab provide hobbists with free access to ML training-capable machines.

  • Building face landmark detection model using Pytorch

    Having used dlib for face landmark detection task, implementing my own neural network to achieve similar goal can be potentially fun and help the learning process. There is this recently released paper that outlines the approach of using machine learning in setting parameters used in traditional statistical models. The author is nice enough to release his source code, which can be a great starting point. So I forked from there, changed code to remove some bulky dependencies, and sort of re-writing it to better fit my mental model and in the process understand it better.

  • Allocate objects on memory buffer for performance gain

    I wrote about the cost of memory allocation in a recent post. Given a fixed amount of memory needed, reserve a large chunk in one go is cheaper than grabing smaller chunks one at a time. I did not realize that Cpp has the facility to take advantage of that until reading through the code of folly::IOBuf.

  • Echo server with libevent

    Network programming is one area where non-blocking IO can be used to achieve higher performance. A typical server needs to handle a few hundreds to a few thousands connections at a time. With the thread-pool based blocking model, when a new connection is established, a server’s thread serving that connection will trigger kernel system call to read data from socket file descriptor, be blocked until data are available. Thus, to handle say 200 connections concurrently, the sever needs to spawn 200 threads.

  • The underestimated cost of memory allocation

    In languages like Java and C++, memory allocation is explicit and obvious. Programming with arrays in those languages mean thinking about size in advance before allocation. If there is a need for flexible size list, the standard library is also explicit about whether the list is backed by an array or linked list, so that programmers are mindful about operation complexity.

  • Implicit property getter can be harmful

    Property as a programming language’s feature has been around for a while. I first got to use it while developing a multi-tenant cloud-based point of sale application on .NET platform. The idea is to avoid the verbosity of calling getter/setter methods by invoking them behind the scene whenever a field is accessed/assigned.

  • Basic FBThrift example

    Facebook re-opensourced their fork of Apache Thrift some 4 years ago. Yet there is relatively little documentation and independent comparison to see if there is any performance bonus to gain by moving from Apache Thrift to FBThrift. The two are no drop-in replacement, thus, replacing one with another requires effort. This post first looks at setting up and running FBThrift. Complete code example used in this post can be found here.

  • Floating point binary representation in javascript

    Inspired by this post which explains how computer stores floating point number, Here is a bit of javascript code that print out a float32 number in binary format. Firstly, the main idea is, floating point number is represented similar to scientific representation of numbers using E notation.

  • Derivative of loss function in softmax classification

    Though frameworks like Tensorflow, Pytorch has done the heavy lifting of implementing gradient descent, it helps to understand the nuts and bolts of how it works. After all, neural network is pretty much a series of derivative functions. In this blog post, let’s look at getting gradient of the lost function used in multi-class logistic regression.

  • Perspectives in designing ML cost function

    For many different machine learning problems, finding a solution involves these similar steps.

  • How dk.brics.automaton regex library works

    This brics regex library is by far the fastest when comparing with openJDK java.util.regex and Let looks at what lie under the hood. dk.brics.automaton is a Finite automata library with application in Regex. The idea is similar to google re2j, which is to construct a DFA from regex string and matching an input string means advancing from one state to another. (google re2j is surprisingly the slowest in my test case. Which is probably due to my particular regex input, or some bug with Java port. I have not examined it yet)

  • Misconceptions when scale applications

    Once in a while, engineering managers and software engineers would triumphantly tell me to use a particular piece of technology because it is scalable or good for concurrency. Unfortunately, scaling an application means trading off different factors affecting an app’s performance and no single technology comes as silver bullet to solve them all. Below are some of the misconceptions I often hear.

  • How regex is implemented

    Recently at work, I need to take a deeper look at how optimization is done by different regex libraries and how they combine regex patterns. This document is examining 2 regex implementations: java.util.regex of OpenJDK and re2 of Google.

  • How java load classes

    For a while, I have delegated the task of managing java run command to IDE. Eclipse, Netbean and IntelliJ and all seem to do a decent job of masking away complexity of supplying java with JVM parameters, classpath, debugging options… Compared to other languages, the java run command can get horrendously long, not very suitable for handcrafting. Recently, when working on some sort of command generating at work and playing around with java command line arguments, I encountered some minor gotchas.

  • Shortcomings of web technologies in building a robust desktop application.

    As computation power and data are shifting to the edge, web applications are more and more like installed desktop applications with (limited) file system access, threading, offline storages… Frameworks such as Electron, which allow browser-based application to be packaged as standalone app, are gaining popularity. However, with great power comes great responsibility. Javascript, html, css have always been used to build webpages, which is a stateless and forgiving environment. A broken webpage can be remedied by hitting refresh button. The web is not used to be fast, users’ expectation for it is lower than for an installed application.

  • Java byte literal for value greater than 0x80

    In porting a piece of code from C++ to Java, I encountered statement like this:

  • C++ '&' operator

    Some note on C++ references

  • Concurrency in Java context

    Concurrency is an unavoidable fact in web development if the page ever gets pass more than one user (which is pretty much any service out there). But concurrency also poses a problem to data consistency. This post is a back-to-the-basics summary of techniques I’m aware of.

  • Running Odoo 8 on Pypy

    Pypy has not reached the point of compatibility that one could simply switch the interpreter. Some libraries need to be replaced, or installed with latest development code. Still, one can get the Odoo web running with a few changes.

  • Tornado Non-blocking Smtp Client

    Recently, I was developing a tornado-based web application at work. The idea of using Tornado is to base the whole web application on Tornado’s single-threaded IOloop, which enables the application to handle higher load compared to using other multi-threaded model (given the same hardware). Since there is no additional thread spawned to handle concurrent requests, no memory overhead. And besides, that help avoids context-switching cost when the program control is passed between threads.

  • Fibonacci Tail Recursion

    (Documenting my progress with Haskell. little by little)