Joie is a mobile app that provides something everyone wants: happiness! By promoting an ethos of mindfulness, Joie allows users to see, over time, the places, people, and times of day that make them happiest.

Building the Perfect iOS Camera

At the heart of Joie is the experience of capturing a moment through the camera lens on your phone. We knew immediately how important this experience would be to the success or failure of our app. In this post I want to give you some insight into what it took to create an iOS camera; it was much harder than expected :)

Before I can dive into the details, which will cover everything from AV Foundation to spatial sampling, I should begin by describing our technology stack. It takes a lot of moving parts to build an application like Joie: multiple mobile operating systems, a robust backend, and everything in between. I wanted to ensure that we could not only scale but also reuse as much code as possible as we grow, so I decided to go all in with C# and .NET. We were accepted into Microsoft's BizSpark program, which lets us lean on their cloud services; the idea being that once we hit growing pains we simply tweak a slider and keep marching ahead. For the front end, I decided to take on the challenge of learning Xamarin. Xamarin has a long history that began with Mono, the OG of cross-platform development. It later branched off and became a paid product, gaining traction over the years and proving itself a mature platform that can keep up with constant API changes from both Google and Apple while remaining performant. Xamarin lets me and the team develop in a single unified programming language (C#) while fully leveraging native features. As of this year (2016), Xamarin is free to use and has been acquired by Microsoft, allowing for much tighter integration.

My biggest challenge with this project was ramping up quickly enough in both Objective-C and Xamarin. There was an immediate payoff, but things quickly stalled. None of the Xamarin camera components did a good enough job or allowed enough customization of the views and the photo/video capturing experience that Joie would need. Over the iterations I faced everything from black screens due to hijacked OpenGL contexts, to rotation inconsistencies between orientations, to one of my biggest challenges: an auto focus that was clearly sub-par compared to other camera applications like Instagram and Snapchat. After three iterations it became clear to me that this was going to be a marathon, not a sprint. We sacrificed the original (and overly aggressive) schedule and hit the textbooks.

Understanding the Camera

Before we can dive into the iOS camera APIs, it's important to understand how pictures are taken: how, through the manipulation of light, we're able to preserve a still frame.

Let's begin at the lens. Just a decade ago most photographs were taken on traditional film, and some still are today. The film was fed into a physical camera very carefully so as not to expose it to light; exposing the film to light through the lens would forever ingrain that light onto the film. Film was categorized by how sensitive it was to light, in numbers like 100, 200, 400, 800, and so on; these numbers are more commonly known as ISO values. The higher the number, the more sensitive the film was to light, meaning it absorbed light more quickly and would eventually wash out to a white frame. The opposite was true for lower numbers, which were less sensitive to light and better suited for scenes with an overabundance of light, such as a snowy mountain on a sunny day.

Shutter speed is another variable involved in photography; it determines how long the shutter stays open, allowing light to hit the frame. The slower the shutter speed, the more light you let in. Shutter speed also allows photographers to capture light as it moves through the world, resulting in futuristic-looking "streaks" or "beams."

As you can now guess, it takes a lot of information about the real world to know how to best adjust these settings for taking a picture. Is it light or dark? Is the object still or moving? 

Finally, there's the lens's focus. The best way to think about camera focus is to imagine a line on the floor running from you (or the camera lens) into the distance. A camera lens, unlike the pupil in your eye, has to pick one spot along this imaginary line that will be in sharp detail, while the rest gradually becomes more blurry.

The Story of Sampling

So far we've only discussed photography from the analog perspective. Our camera phones don't take film, and with that comes the challenge of deciding not only how much light to capture and where to focus, but also what kind of light and color to capture. What kind, you ask? To understand this better we need to start thinking of color as what it really is: a wave. A very, very smooth wave with an infinite number of gradients, and our computers aren't very good at storing infinite numbers. We need to pick a precise range. How many shades of blue are we interested in? How many shades of light? Wait ... isn't light just a shade of color? Not exactly. When you see a digital photo, you see an end product that is often the blending of both color (the material of the object) and the light itself.

Because it's impossible for digital to represent all of analog without a high storage cost, as you'll see later, we use something called sampling. There are primarily two sampling methods. Temporal sampling captures variations in a signal over time; this is most commonly used in audio. The second type, spatial sampling, captures luminance (light) and chrominance (color). If our end goal is to render a single pixel captured from the real world, we need to know how much information we can represent. This is often called color depth, or bit depth: how many bits (a single bit is a 1 or 0) we use to represent a color. For example, a 1-bit color depth means we can only capture black and white, resulting in monochrome images. Notice that even a 1-bit color depth can still produce gray gradients when it's combined with the luminance information we captured separately. A 4-bit color depth (2^4 = 16) can represent 16 different colors. Today what we regard as true color uses 24 bits across the three RGB channels, allowing for a mind-numbing 16,777,216 colors; still nowhere near what our eye can capture. Deep color, stored with 48 bits, is the highest grade I've seen, and it's certainly overkill for Joie.

If we were to stop here, the size of a single raw image would be: pixels in width x pixels in height x bit depth. So let's imagine we want to take a full-resolution still: 4032 x 3024 x 24 = 292,626,432 bits, or roughly 36.6 MB. Now you might say, "So what? I have gigabytes on my phone." That's true, but what about video, where we multiply that number by frames per second and the total number of seconds of footage? Before you know it, we're looking at gigabytes (60 fps x 15 seconds comes to roughly 33 GB). That's huge! This is where compression comes into play.
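
To make that arithmetic concrete, here's a small, self-contained C# sketch of the same math. The resolution, bit depth, frame rate, and clip length are just the example numbers from this post, not values queried from any device:

    // Rough raw-size math for the example above: one 4032 x 3024 frame at
    // 24 bits per pixel, then a 15-second clip at 60 fps, all uncompressed.
    using System;

    class RawSizeMath
    {
        static void Main()
        {
            const long width = 4032, height = 3024;
            const long bitsPerPixel = 24; // "true color": 8 bits each for R, G and B

            long bitsPerFrame = width * height * bitsPerPixel;          // 292,626,432 bits
            double megabytesPerFrame = bitsPerFrame / 8.0 / 1000000.0;  // ~36.6 MB

            const int framesPerSecond = 60;
            const int seconds = 15;
            double gigabytesPerClip = megabytesPerFrame * framesPerSecond * seconds / 1000.0;

            Console.WriteLine("One raw frame: {0:N0} bits (~{1:F1} MB)", bitsPerFrame, megabytesPerFrame);
            Console.WriteLine("15 s at 60 fps: ~{0:F1} GB uncompressed", gigabytesPerClip);
        }
    }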

Video data is typically encoded using a color model called YCbCr (also known as YUV). Y is essentially the brightness information and CbCr (the "UV") is the color information. Very smart people discovered that the human eye is very sensitive to light but not nearly as sensitive to color variations. This is where chroma subsampling comes into play. It's a very clever way to compress pixel information by averaging the color of neighboring pixels. It's typical today to use four-pixel quadrants, so we represent chroma subsampling as a ratio J:a:b, where J is the number of pixels in the reference block (usually 4, as mentioned), a is the number of chrominance samples stored for every J pixels in the first row, and b is the number of additional chrominance samples stored for every J pixels in the second row. I don't expect this to be entirely intuitive, but the sample graphic should help visualize what these ratios look like.
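
If you prefer arithmetic to diagrams, here's a tiny C# illustration of my own (not part of any iOS API) that converts a J:a:b ratio into average bits per pixel, assuming 8-bit Y, Cb and Cr samples and the usual two-row reference block:

    // Average bits per pixel for a J:a:b subsampling scheme with 8-bit samples.
    using System;

    class ChromaSubsampling
    {
        static double BitsPerPixel(int j, int a, int b)
        {
            const int bitsPerSample = 8;
            // The block covers 2 rows of J pixels: 2*J luma samples, plus
            // (a + b) chroma positions that each carry a Cb and a Cr sample.
            double totalBits = bitsPerSample * (2 * j + 2 * (a + b));
            return totalBits / (2 * j);
        }

        static void Main()
        {
            Console.WriteLine("4:4:4 -> {0} bpp", BitsPerPixel(4, 4, 4)); // 24, no subsampling
            Console.WriteLine("4:2:2 -> {0} bpp", BitsPerPixel(4, 2, 2)); // 16
            Console.WriteLine("4:2:0 -> {0} bpp", BitsPerPixel(4, 2, 0)); // 12, what the iPhone captures
        }
    }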

iPhone cameras capture only in 4:2:0. This loss of color information becomes problematic only when doing things like chroma keying or color correction in post-production, where it can result in noise artifacts.

The Nightmare that is iOS

iOS lives on the backbone of multiple APIs that are used throughout the Mac ecosystem. These include Core Media (classes prefixed with CM), Core Video (classes prefixed with CV), AV Foundation (classes prefixed with AV), and UIKit (classes prefixed with UI). The stack looks something like this.

Because Joie interfaces with the hardware at a fairly low level and needs to leverage OpenGL, converting data between seemingly identical formats becomes a common and sometimes painful task. For example, let's say we want to display a captured image on screen. We have to create a UIImage. Seems simple, but we also need a CGImage, which represents the bitmap data, and a CIImage to keep around for GPU optimizations--or if we want to leverage the iOS facial recognition API. Oh, and along the way we might need to do some conversions to and from sample buffers. I ended up spending days not only understanding the structure but also dealing with the conversions between Xamarin and Objective-C's implementations. I also had to do terrible things like creating Handles and IntPtrs in C#.
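
As a rough sketch of what one of those conversion chains looks like in Xamarin.iOS C# (binding names can vary slightly between versions, so treat this as illustrative rather than the exact code we ship), here's the trip from a CMSampleBuffer all the way to a UIImage:

    using CoreImage;
    using CoreMedia;
    using CoreVideo;
    using UIKit;

    static class SampleBufferConversions
    {
        // Reuse the CIContext; creating one per frame is expensive.
        static readonly CIContext ciContext = CIContext.FromOptions(null);

        public static UIImage ToUIImage(CMSampleBuffer sampleBuffer)
        {
            // The raw frame handed to us by the capture pipeline.
            using (var pixelBuffer = sampleBuffer.GetImageBuffer() as CVPixelBuffer)
            {
                if (pixelBuffer == null)
                    return null;

                // CIImage wraps the pixel buffer; the CIContext renders it into a
                // CGImage (the bitmap), which UIKit can finally display as a UIImage.
                using (var ciImage = CIImage.FromImageBuffer(pixelBuffer))
                using (var cgImage = ciContext.CreateCGImage(ciImage, ciImage.Extent))
                {
                    return UIImage.FromImage(cgImage);
                }
            }
        }
    }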

Out of the woods

Once you understand the overall multimedia stack provided by iOS, it makes the most sense to stay within the AV Foundation layer, which is broken up into the following categories:

  • Audio Playback and Recording (AVAudioPlayer | AVAudioRecorder)
  • Media Inspection (AVMetadataItem)
  • Video Playback (AVPlayer | AVPlayerItem)
  • Media Capture (AVCaptureSession)
  • Media Editing and Media Processing (AVAssetReader | AVAssetWriter)

At the center of Media Capture is the AVCaptureSession. This object acts as a central hub that pipes information from the camera to a potential destination, although it's not quite that straightforward. We can have multiple inputs (think multiple cameras), and we can have multiple outputs (a memory location, a file, an OpenGL context, a data stream ...).

The cool thing is that we can change this routing on the fly by reconfiguring the AVCaptureSession. None of this is optional, and it results in a lot of boilerplate code: locking and unlocking the AVCaptureSession's configuration, checking for supported features and functionality, and gracefully handling errors. This is also where you specify and configure your capture quality; at the most basic level we use an AVCaptureSessionPreset to optimize for the situation.
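
Here's a minimal sketch, in Xamarin.iOS C#, of what that session setup tends to look like. The photo preset and the single still-image output are just example choices, and real code needs far more error handling than this:

    using AVFoundation;
    using Foundation;

    public class CameraSessionSetup
    {
        public AVCaptureSession Session { get; private set; }

        public void Configure()
        {
            Session = new AVCaptureSession();
            Session.BeginConfiguration();

            // Only ask for a preset the session actually supports.
            if (Session.CanSetSessionPreset(AVCaptureSession.PresetPhoto))
                Session.SessionPreset = AVCaptureSession.PresetPhoto;

            // Input: the default back camera.
            var camera = AVCaptureDevice.DefaultDeviceWithMediaType(AVMediaType.Video);
            NSError error;
            var input = AVCaptureDeviceInput.FromDevice(camera, out error);
            if (input != null && Session.CanAddInput(input))
                Session.AddInput(input);

            // Output: still images for now; more outputs are added the same way.
            var stillOutput = new AVCaptureStillImageOutput();
            if (Session.CanAddOutput(stillOutput))
                Session.AddOutput(stillOutput);

            Session.CommitConfiguration();
            Session.StartRunning();
        }
    }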

Next we configure the AVCaptureDevice, which represents an input stream from the physical hardware. This is where we have control over things like focus, exposure, white balance, flash, and the camera's torch.
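
For instance, here's a hedged sketch of the configuration dance involved: lock the device, check what it supports, set continuous auto focus and auto exposure, then unlock. The helper name and the defaults chosen here are my own illustration, not a prescribed recipe:

    using AVFoundation;
    using CoreGraphics;
    using Foundation;

    static class CaptureDeviceConfig
    {
        public static void ApplyDefaults(AVCaptureDevice device, CGPoint pointOfInterest)
        {
            NSError error;
            if (!device.LockForConfiguration(out error))
                return; // Couldn't get exclusive access to the hardware.

            // Focus point coordinates are normalized: (0,0) top-left, (1,1) bottom-right.
            if (device.FocusPointOfInterestSupported)
                device.FocusPointOfInterest = pointOfInterest;

            if (device.IsFocusModeSupported(AVCaptureFocusMode.ContinuousAutoFocus))
                device.FocusMode = AVCaptureFocusMode.ContinuousAutoFocus;

            if (device.IsExposureModeSupported(AVCaptureExposureMode.ContinuousAutoExposure))
                device.ExposureMode = AVCaptureExposureMode.ContinuousAutoExposure;

            // The front camera, for example, has no torch at all.
            if (device.HasTorch && device.IsTorchModeSupported(AVCaptureTorchMode.Auto))
                device.TorchMode = AVCaptureTorchMode.Auto;

            device.UnlockForConfiguration();
        }
    }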

If you're trying to capture both still images and video, you will need separate outputs attached to the session, and recording sound requires a second AVCaptureDevice for the microphone. The front and back cameras are also separate devices.

The AVCaptureOutput is an abstract class with various implementations such as AVCaptureStillImageOutput and AVCaptureMovieFileOutput. For OpenGL processing we'll also need AVCaptureVideoDataOutput, which provides a raw stream of frames.
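
A rough sketch of wiring up that raw stream in Xamarin.iOS: a delegate subclass receives each frame as a CMSampleBuffer on a dispatch queue. The exact method for attaching the delegate has shifted a bit between Xamarin releases, so take the names here as approximate:

    using AVFoundation;
    using CoreFoundation;
    using CoreMedia;

    public class FrameReceiver : AVCaptureVideoDataOutputSampleBufferDelegate
    {
        public override void DidOutputSampleBuffer(
            AVCaptureOutput captureOutput, CMSampleBuffer sampleBuffer, AVCaptureConnection connection)
        {
            // Hand the frame to OpenGL or convert it here. Always dispose the
            // buffer, or the capture pipeline runs out of buffers and stalls.
            sampleBuffer.Dispose();
        }
    }

    public static class VideoDataOutputSetup
    {
        public static AVCaptureVideoDataOutput Attach(AVCaptureSession session)
        {
            var output = new AVCaptureVideoDataOutput
            {
                // Drop late frames rather than letting the preview lag behind.
                AlwaysDiscardsLateVideoFrames = true
            };

            output.SetSampleBufferDelegate(new FrameReceiver(), new DispatchQueue("joie.video.frames"));

            if (session.CanAddOutput(output))
                session.AddOutput(output);

            return output;
        }
    }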

One of my early approaches that gave me a lot of trouble was using a single output, because the constant context switching would result in a black screen or unreliable image quality. I admit this was in large part due to my lack of understanding of the platform at the time.

A much better approach is to use an AVCaptureVideoPreviewLayer to display what the camera sees. This is achieved by creating a new UIView and overriding its Layer with an AVCaptureVideoPreviewLayer.
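
Here's one way the preview view can be sketched in Xamarin.iOS. This variant simply adds the AVCaptureVideoPreviewLayer as a sublayer instead of swapping out the view's backing layer class, which keeps the C# side a little simpler; the class name is just for illustration:

    using AVFoundation;
    using CoreGraphics;
    using UIKit;

    public class PreviewView : UIView
    {
        readonly AVCaptureVideoPreviewLayer previewLayer;

        public PreviewView(CGRect frame, AVCaptureSession session) : base(frame)
        {
            previewLayer = new AVCaptureVideoPreviewLayer(session) { Frame = Bounds };
            Layer.AddSublayer(previewLayer);
        }

        public override void LayoutSubviews()
        {
            base.LayoutSubviews();
            // Keep the preview glued to the view as it rotates or resizes.
            previewLayer.Frame = Bounds;
        }
    }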

Finally, instead of using a single UIView for the camera, we have three: a CameraView, a PreviewView that hosts the AVCaptureVideoPreviewLayer, and an OverlayView used to handle touch input and draw gestures. This is of course a work in progress, but I feel like we're making headway.

What's next

A very big part of what makes the Joie app is the gesture mechanic for taking pictures: instead of tapping a button, you draw a smile or a frown. Thankfully this is a much easier undertaking, and we've already had some success with it. We're working through user interface challenges such as gesture length and timing, and working with video. Finally, we'll be looking into creating an OpenGL layer that traces your finger, leaving behind a temporary impression of paint.

If you have any questions, want to get involved, or are interested in supporting this project, please join our mailing list or reach out to us directly. Thank you!

Here are some pictures taken with the new Joie camera:
