One of my longer-running side projects is designing and building a homebrew computer of my own design. This is definitely a "labor of love" sort of project, as opposed to a project I'm likely to actually complete, but designing and redesigning parts of it continues to keep me amused for now.

This week I started to revamp the video output portion. Previously I'd prototyped an eight-color, 640x480 text-only VGA output using an FPGA and some resistors as a DAC. That was a fun learning exercise, but I'd always intended to have graphical output of some sort and so that result didn't really satisfy me.

Furthermore, I've learned more about Verilog and digital logic design in the year or so since I was last working on this, so I took this as an opportunity for a do-over. My goal for this first round is to generate a 720p test pattern that my monitors will render, delivered over HDMI because VGA is annoyingly analog and DVI is becoming hard to find on newer monitors.

How Video Output Works

When I was first learning how to produce video output from an FPGA, I was amused (though in retrospect not surprised) to find that the fundamental ideas behind video signalling go all the way back to analog television, and are designed around the mechanics of cathode ray tube televisions.

Before the invention of other display technologies such as plasma, TFT LCD, OLED, etc, cathode ray tubes were the video display technology of choice for several decades. A cathode ray tube display produces a video image by firing an electron beam at a glass screen coated in phosphorescent material, using electromagnets to deflect the electron beam at different parts of the screen.

For full-screen video images, the electron beam is deflected in a predictable way called a raster scan: it starts at the top-left of the screen, and then is moved quickly from left to right while changing the intensity. Once it reaches the right edge of the screen, the electron beam is deflected even faster back to the left side and just below where it initially started, to scan across the second line. This continues until the entire screen has been scanned with horizontal lines from top to bottom, at which point the beam is moved quickly back to the top left to begin the next frame.

Because the screen is phosphorescent, the activated parts of the screen remain illuminated for a short time after activation, and persistence of vision causes the viewer's eye to perceive each frame as a complete image even though it is being constructed gradually from horizontal sweeps across the screen.

During the time when the electron beam is moving horizontally along a line, the electron beam is modulated to different intensities to produce different brightness levels. For a color television, there are separate electron beams for red, green, and blue and the screen is coated with three different phosphors that appear as those three colors respectively. The same principle applies, but each of the three beams is modulated separately to produce different mixtures of those primary colors.

It also takes some time for the electron beams to move from right to left to begin a new row and to move from bottom to top to begin a new frame. To avoid creating smears across the screen during these periods, the electron beams are disabled. The times when the beam is disabled to move quickly to a new location are called blanking periods.

The blanking periods themselves have three sub-periods. The main central sub-period is called the sync pulse, which is marked by the transmission of a synchronization signal that tells the display either that a row has ended or an entire frame has ended. Before the sync pulse is the (oddly-named) front porch sub-period, which is included primarily to allow older analog televisions time to reach a stable voltage after the picture signals stop for blanking. After the sync pulse and before the next active display period is the (equally-oddly-named) back porch sub-period, which older analog video systems used to re-calibrate the voltage level for black before starting the next line of video.

Video signals, whether transmitted over cables or over radio waves, followed along closely with the movement of the electron beams. The video source produced the active display signal at the same time the electron beam was scanning across a horizontal line, and produced the sync signals at the appropriate times to force the electron beam to deflect back to the start of the line or frame.

Analog television standards like NTSC and PAL call for specific lengths of time for transmitting the active display portion of a line, the blanking periods and their sub-periods, and for how many lines make up a single frame (or "field"). Modern computer video transmission techniques build on these same principles: the lengths of time spent in each of the periods and the specific encoding of the data has changed, but modern video signal standards all still follow the same basic structure as analog television displayed on a cathode ray tube.

Analog video is defined in terms of the amount of time spent in each period. Computer systems and digital televisions use digital video, which introduces the idea of discrete pixels and the pixel clock. ("Clock" here is in the digital electronics sense of the term: a signal that alternates between high and low at some consistent frequency, with one pixel associated with each clock period.)

Whereas in theory analog video has infinite horizontal resolution (as long as you have enough signalling bandwidth and you can modulate your electron beam fast enough), digital video standards are defined in terms of how quickly a single pixel is transmitted, and then in turn how many pixels make up one line and how many lines make up one frame.

Conceptually then, we can think of the video signal as being a rectangular array of pixels, where most of the pixels are part of the active display, but some of the pixels are in blanking periods and thus considered to be fully black.

A video mode is commonly expressed as the width and height of the visible pixel array, but each common video mode also has a specific number of pixels or lines assigned to each of its blanking periods, which modern displays will use to recognize which video mode is intended by the transmitter. A device generating video must ensure that the pixel clock frequency and the number of pixels in each period matches a standard mode in order to ensure that the result can be understood by standard displays.

My previous video output experiment produced a 640 by 480 pixel display. That resolution has an effective total size of 800 by 525 pixels if we include the blank pixels from its blanking periods. The full timing specification for a 640 by 480 pixel video signal is:

  • Pixel clock at 25.175 MHz. (approx. 39.7 nanoseconds per pixel)

  • 640 active pixels, 16 front porch pixels, 96 sync pixels, and 48 back porch pixels per line.

  • 480 active lines, 10 front porch lines, two sync lines, and 33 back porch lines per frame.
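
These figures can be cross-checked with a little arithmetic: dividing the pixel clock by the total frame size (including the blanking periods) gives the refresh rate. A quick sketch in Python:

```python
# Cross-check the 640x480 VGA timing quoted above: the pixel clock
# divided by the total frame size (including blanking) gives the
# refresh rate.
total_w = 640 + 16 + 96 + 48   # active + front porch + sync + back porch
total_h = 480 + 10 + 2 + 33
assert (total_w, total_h) == (800, 525)

refresh_hz = 25_175_000 / (total_w * total_h)
print(f"{refresh_hz:.2f} Hz")  # about 59.94 Hz, the usual VGA refresh rate
```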

640 by 480 is a lowest-common-denominator mode that dates back to the release of VGA graphics in 1987 and is well supported by most displays. Later graphics cards were able to produce video signals with more visible pixels per frame, such as 800 by 600 visible pixels with the following timing specification:

  • Pixel clock at 40.000 MHz. (approx. 25.0 nanoseconds per pixel)

  • 800 active pixels, 40 front porch pixels, 128 sync pixels, and 88 back porch pixels per line.

  • 600 active lines, one front porch line, four sync lines, and 23 back porch lines per frame.

There is an upper limit to how long we can spend transmitting a single frame before the display appears to flicker, defeating the persistence-of-vision effect. In modern displays, 60 frames per second is generally considered to be the standard frame rate. In order to include more pixels in a frame, we must therefore shorten the length of time spent transmitting each pixel.

One significant limiting factor for the pixel density of displays has been how quickly the technology of the day can reliably transmit pixel data. 1080p high definition video has a pixel clock at 148.5 MHz, which is only about 6.7 nanoseconds per pixel. A 3840 by 2160 "4K" image needs a pixel clock of 533.250 MHz, giving only 1.8 nanoseconds per pixel.
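
The nanosecond figures above follow directly from the clock frequencies:

```python
# Time available to transmit one pixel at various pixel clock rates.
for name, clock_hz in [("1080p", 148_500_000), ("2160p", 533_250_000)]:
    print(f"{name}: {1e9 / clock_hz:.2f} ns per pixel")
# 1080p: 6.73 ns per pixel
# 2160p: 1.88 ns per pixel
```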

High-frequency signalling is challenging to get right, particularly with technology and manufacturing techniques available to me for hobby use. For that reason, I decided to compromise by choosing 720p, which still looks reasonably nice on a modern widescreen HD display but only requires a 74.25 MHz pixel clock. The full timing specifications for 720p video are:

  • Pixel clock at 74.25 MHz. (approx 13.5 nanoseconds per pixel)

  • 1280 active pixels, 110 front porch pixels, 40 sync pixels, and 220 back porch pixels per line.

  • 720 active lines, five front porch lines, five sync lines, and 20 back porch lines per frame.
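
As with the earlier modes, these numbers can be verified against the 60 Hz target frame rate:

```python
# Verify the 720p timing: 1650 total pixels per line and 750 total lines
# per frame at 74.25 MHz works out to exactly 60 frames per second.
total_w = 1280 + 110 + 40 + 220   # = 1650
total_h = 720 + 5 + 5 + 20        # = 750
print(74_250_000 / (total_w * total_h))  # 60.0
```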

Display Interfaces

There are many different standards for transmitting video data between a video source (such as a computer) and a display. The main ones found on equipment available for purchase at the time I write this are:

  • VGA: A hybrid analog and digital signalling format, where the active picture data per line is transmitted as analog while the horizontal and vertical sync are sent separately as digital signals.

    Because VGA picture data is analog, in principle it has infinite horizontal resolution. However, in practice the voltage changes produced by the source will be derived from some sort of digital signal, and modern displays have a fixed number of discrete pixels anyway, so VGA in modern use ends up just being a lossy way to transmit a digital signal.

    VGA is also becoming harder to find on modern equipment, and so any system built around VGA output is likely to require external conversion circuitry in future once VGA displays are no longer available.

  • DVI: The first widely-deployed video interface standard to support all-digital signalling. Picture data and the sync signals are carried over three TMDS pairs, ostensibly one pair per color channel, but these pairs are also overloaded to carry the sync signals when needed.

    A DVI cable can also carry an analog VGA signal if needed, and so in systems where a standard VGA connector is unavailable it may be possible to still connect a VGA source using a simple physical adapter, but DVI itself is becoming harder to find on newer display equipment.

  • HDMI: The current standard for interfaces between devices in home theater setups. HDMI was introduced with the same digital signalling protocol as DVI but extended to carry both audio and video at once. As a result, common valid DVI signals are also electrically compatible with HDMI, and so the physical HDMI connector is increasingly replacing the DVI connector on modern displays.

    HDMI is the prevailing connector standard on television equipment currently, and it's commonly available on computer displays too.

  • DisplayPort: A modern standard intended to replace all of the above for computer display use-cases. A DisplayPort cable can in principle carry audio, video, USB, and other forms of data.

    The DisplayPort standard includes a protocol for a display source to detect that it is connected to an HDMI display and switch automatically to HDMI signalling, but the converse is not true: connecting an HDMI source to a DisplayPort display requires an active adapter, which is more expensive.

For my purposes — producing only 720p video — VGA would be more than sufficient if it weren't for its obsolescence. DisplayPort is certainly overkill, so I settled on DVI output over an HDMI connector.

DVI and HDMI transmit digital color data over the three signalling pairs in a serial fashion, sending ten data bits per pixel clock period. That means that the data clock is effectively ten times the pixel clock, or 742.5 MHz for my desired 720p video signal.
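
The relationship is simple but worth spelling out, since it's what pushes the serial clock out of reach:

```python
# Each TMDS pair carries ten serialized bits per pixel clock period,
# so the serial bit clock is ten times the pixel clock.
pixel_clock_hz = 74_250_000            # 720p pixel clock
tmds_bit_clock_hz = pixel_clock_hz * 10
print(tmds_bit_clock_hz / 1e6)         # 742.5 (MHz)
```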

That frequency is outside the capabilities of the low-cost Lattice ICE40 FPGAs I've been using for these experiments, and indeed that's why I'd elected to produce only VGA output in my first iteration. Fortunately, I've since learned about a specialized chip, the TFP410, which accepts a parallel RGB video signal as input and produces an equivalent DVI signal as its output.

Parallel signalling is not one of the standards I explored above, because it's not a practical signalling standard for cables between equipment. However, it's commonly used internally within devices due to its relative simplicity: each of the red, green, and blue data values is transmitted over parallel data lines. For 8-bits-per-channel video, that's eight separate digital traces on the printed circuit board per channel for a total of 24 traces just to send the color data, and then four more traces for the pixel clock, active display signal, horizontal sync signal, and vertical sync signal.

This is a space vs. time tradeoff: by routing 28 separate signals we can transmit the same data in less time, or in the same amount of time at a lower frequency.

As a further compromise — keeping in mind that eventually the video picture will need to be stored in RAM and my homebrew computer system is unlikely to have lots of it — I decided to use fewer than eight bits per channel, accepting a reduction in color depth.

12-bit Color

In modern systems, each display pixel is generally represented by 24 bits of color data, giving eight bits for red, eight bits for green, and eight bits for blue. That gives 256 distinct shades of each channel, for over 16 million distinct color combinations.

My homebrew computer design has a 16-bit data bus, so I've generally been leaning toward design decisions that allow working with only 16-bit numbers. With that in mind, I decided to adopt a color model with only four bits per channel, for a total of 12 bits per pixel.

That gives 16 shades of each channel, or 4096 distinct colors. Four bits is one hexadecimal digit, so this color system is effectively what you get when you use the shorthand HTML color syntax #f00 instead of the longer #ff0000 form.
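
The analogy with HTML shorthand is exact: expanding a 4-bit channel back to 8 bits by duplicating the nibble — the same rule browsers apply to #f00 — maps 0x0 to 0x00 and 0xf to 0xff, preserving full black and full white. A quick sketch:

```python
def expand_channel(c4):
    """Expand a 4-bit channel value (0-15) to 8 bits by duplicating the nibble."""
    return (c4 << 4) | c4

assert expand_channel(0x0) == 0x00
assert expand_channel(0xF) == 0xFF

# "#f00" expands to "#ff0000", matching the HTML shorthand rule:
print("#" + "".join(f"{expand_channel(c):02x}" for c in (0xF, 0x0, 0x0)))
```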

This limited color space would not be acceptable in any modern system, but for my hobbyist purposes it's more than sufficient. I've been aiming my design specifications at "approximately Amiga 500" levels, and the Amiga OCS chipset was itself capable of displaying the same 4096 distinct colors, albeit using a more complex encoding technique (HAM mode) to represent the image in only six bits per pixel compared to my design's direct 12-bit-per-pixel encoding.

For my initial work here, I'm focusing primarily on the video signalling and not including any video memory to store pictures, so the main advantage of 12-bit color at this stage is that I only need four wires per channel between the FPGA and the TFP410 DVI encoder chip. Along with the other control signals, that's a total of 16 distinct digital signals to connect between the two chips.

Bringing it all together

With all of the above design decisions in mind, I sketched out the high-level system I was aiming to build:

An external 12MHz oscillator connects to the ICE40 FPGA I'm using for this prototype. That FPGA has an on-board PLL block that can multiply that clock to approximately 74.25 MHz to serve as the pixel clock for 720p output.

The main custom logic in the FPGA is split into two parts: the timing generator and the picture generator.

The timing generator is responsible for reacting to the pixel clock and categorizing each pixel as either visible picture, front porch, sync, or back porch. It produces a signal "active" which is set high when the current pixel is in the visible picture. It also produces signals "hsync" and "vsync" which are high when the current pixel is in either the horizontal or vertical sync period respectively. The front porch and back porch pixels are represented by all three of those signals being low.

In order to correctly categorize the pixels, the timing generator must count how many pixels it has seen horizontally and how many lines it has seen vertically. When the "active" signal is high, it also outputs those horizontal and vertical counters as the "x" and "y" signals, which allow the picture generator to know which pixel within the active pixel region is current.

The Verilog source code for the timing generator is as follows:

module video_timing
(
    input wire reset,
    input wire clk,

    output reg [15:0] x,
    output reg [15:0] y,
    output reg hsync,
    output reg vsync,
    output reg active
);

    // State constants for our two timing state machines (one horizontal, one vertical)
    `define VIDEO_SYNC       2'd0
    `define VIDEO_BACKPORCH  2'd1
    `define VIDEO_ACTIVE     2'd2
    `define VIDEO_FRONTPORCH 2'd3

    // These settings are for 720p, assuming clk is running at 74.25 MHz
    `define VIDEO_H_SYNC_PIXELS   16'd40
    `define VIDEO_H_BP_PIXELS     16'd220
    `define VIDEO_H_ACTIVE_PIXELS 16'd1280
    `define VIDEO_H_FP_PIXELS     16'd110
    `define VIDEO_H_SYNC_ACTIVE   1'b1
    `define VIDEO_V_SYNC_LINES    16'd5
    `define VIDEO_V_BP_LINES      16'd20
    `define VIDEO_V_ACTIVE_LINES  16'd720
    `define VIDEO_V_FP_LINES      16'd5
    `define VIDEO_V_SYNC_ACTIVE   1'b1

    reg [1:0] state_h;
    reg [15:0] count_h; // 1-based so we will stop when count_h is the total pixels for the current state
    reg inc_v = 1'b0;
    reg [1:0] state_v;
    reg [15:0] count_v; // 1-based so we will stop when count_v is the total lines for the current state

    // Change outputs on clock.
    // (These update one clock step behind everything else below, but that's
    //  okay because the lengths of all the periods are still correct.)
    always @(posedge clk) begin
        if (reset == 1'b1) begin
            hsync  <= ~`VIDEO_H_SYNC_ACTIVE;
            vsync  <= ~`VIDEO_V_SYNC_ACTIVE;
            active <= 1'b0;
            x      <= 16'd0;
            y      <= 16'd0;
        end else begin
            hsync  <= (state_h == `VIDEO_SYNC) ^ (~`VIDEO_H_SYNC_ACTIVE);
            vsync  <= (state_v == `VIDEO_SYNC) ^ (~`VIDEO_V_SYNC_ACTIVE);
            active <= (state_h == `VIDEO_ACTIVE) && (state_v == `VIDEO_ACTIVE);
            x      <= count_h - 1;
            y      <= count_v - 1;
         end
    end

    // Horizontal state machine
    always @(posedge clk) begin
        if (reset == 1'b1) begin
            count_h <= 16'b1;
            state_h <= `VIDEO_FRONTPORCH;
        end else begin
            inc_v <= 0;
            count_h <= count_h + 16'd1;

            case (state_h)
                `VIDEO_SYNC: begin
                    if (count_h == `VIDEO_H_SYNC_PIXELS) begin
                        state_h <= `VIDEO_BACKPORCH;
                        count_h <= 16'b1;
                    end
                end
                `VIDEO_BACKPORCH: begin
                    if (count_h == `VIDEO_H_BP_PIXELS) begin
                        state_h <= `VIDEO_ACTIVE;
                        count_h <= 16'b1;
                    end
                end
                `VIDEO_ACTIVE: begin
                    if (count_h == `VIDEO_H_ACTIVE_PIXELS) begin
                        state_h <= `VIDEO_FRONTPORCH;
                        count_h <= 16'b1;
                    end
                end
                `VIDEO_FRONTPORCH: begin
                    if (count_h == `VIDEO_H_FP_PIXELS) begin
                        state_h <= `VIDEO_SYNC;
                        count_h <= 16'b1;
                        inc_v <= 1;
                    end
                end
            endcase
        end
    end

    // Vertical state machine
    always @(posedge clk) begin
        if (reset == 1'b1) begin
            count_v <= 16'b1;
            state_v <= `VIDEO_FRONTPORCH;
        end else begin
            if (inc_v) begin
                count_v <= count_v + 16'd1;
                case (state_v)
                    `VIDEO_SYNC: begin
                        if (count_v == `VIDEO_V_SYNC_LINES) begin
                            state_v <= `VIDEO_BACKPORCH;
                            count_v <= 16'b1;
                        end
                    end
                    `VIDEO_BACKPORCH: begin
                        if (count_v == `VIDEO_V_BP_LINES) begin
                            state_v <= `VIDEO_ACTIVE;
                            count_v <= 16'b1;
                        end
                    end
                    `VIDEO_ACTIVE: begin
                        if (count_v == `VIDEO_V_ACTIVE_LINES) begin
                            state_v <= `VIDEO_FRONTPORCH;
                            count_v <= 16'b1;
                        end
                    end
                    `VIDEO_FRONTPORCH: begin
                        if (count_v == `VIDEO_V_FP_LINES) begin
                            state_v <= `VIDEO_SYNC;
                            count_v <= 16'b1;
                        end
                    end
                endcase
            end
        end
    end

endmodule

The timing generator is mainly just a pair of finite state machines, where the pixel and line counters trigger advancement from one state to the next. The horizontal state machine counts pixels, and also signals when it reaches the end of a line. The vertical state machine then counts lines.

Each time the pixel clock goes high, we update our outputs based on the current state of each of the state machines. The statement always @(posedge clk) means "on every positive edge of the clk signal", or "whenever clk goes high". The <= statements within each of the blocks describe how to update the values of the registers (declared as reg) in response to that event, considering the current values of the input signals.
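
The behavior of those two state machines can also be sketched in software form. The following Python model (illustrative only, not generated from the Verilog) classifies each position within the 1650 by 750 total frame by its period, and confirms that exactly 1280 by 720 positions are active:

```python
# Period lengths in state-machine order: sync, back porch, active, front porch.
H_PERIODS = [("sync", 40), ("backporch", 220), ("active", 1280), ("frontporch", 110)]
V_PERIODS = [("sync", 5), ("backporch", 20), ("active", 720), ("frontporch", 5)]

def classify(count, periods):
    """Return which period a 0-based pixel/line counter falls into."""
    for name, length in periods:
        if count < length:
            return name
        count -= length
    raise ValueError("counter out of range")

active_h = sum(classify(h, H_PERIODS) == "active" for h in range(1650))
active_v = sum(classify(v, V_PERIODS) == "active" for v in range(750))
print(active_h, active_v, active_h * active_v)  # 1280 720 921600
```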

The role of the picture generator module is to determine the color of each of the pixels in the active display region. For each pixel clock, it receives the active signal to tell it whether it should produce a color at all, and 16-bit integer values for the x and y coordinates of the current pixel.

This is separated from the timing generator because while the timing signals are fixed by the particular display mode we're using (720p in this case), there are many different ways we could represent the visible contents of the screen.

The most common solution in modern systems is to store a framebuffer in a RAM chip, where each pixel in the active pixel region is represented by some value stored in RAM and the picture generator simply reads that value and copies it to its color outputs.

In older systems (from the 8-bit and 16-bit eras) there often wasn't sufficient memory to store a distinct value for each pixel on the screen. Instead, the screen would be composed of fixed-size tiles selected from a limited collection, with the same tile image data potentially appearing several times on the screen, and colors would be selected from a limited palette in order to reduce the number of bits required per pixel. The picture generator in that case would need to deal with all of this indirection: figuring out which tile covers the current pixel, which color index from that tile is present for the current pixel, and which real color from the palette that color index corresponds to.
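
A sketch of that indirection in Python (the tile size, data layout, and names here are hypothetical, not taken from any particular machine):

```python
# Hypothetical tile-based pixel lookup: 8x8 tiles, 4-bit color indices,
# and a small palette of real 12-bit colors. All sizes are illustrative.
TILE_W = TILE_H = 8
SCREEN_TILES_W = 160  # 1280 visible pixels / 8 pixels per tile

def pixel_color(x, y, tile_map, tile_data, palette):
    # Which tile covers this pixel?
    tile_col, tile_row = x // TILE_W, y // TILE_H
    tile_id = tile_map[tile_row * SCREEN_TILES_W + tile_col]
    # Which texel within that tile is the current pixel?
    tx, ty = x % TILE_W, y % TILE_H
    color_index = tile_data[tile_id][ty * TILE_W + tx]
    # Which real color does that index name?
    return palette[color_index]  # a 12-bit RGB value

# A one-tile example: the whole screen shows tile 0, which is a solid
# fill of color index 1, and palette entry 1 is full red.
tile_map = [0] * (SCREEN_TILES_W * 90)   # 90 tile rows = 720 / 8
tile_data = [[1] * (TILE_W * TILE_H)]
palette = [0x000, 0xF00]
print(hex(pixel_color(3, 5, tile_map, tile_data, palette)))  # 0xf00
```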

Since my goal for this initial exercise was just to generate the 720p timing signal, and since I don't yet have any external RAM connected to this FPGA anyway, my first iteration of picture generator is just a test pattern produced mathematically from the pixel coordinates:

module video_test_pattern
(
    input wire clk,
    input wire [15:0] x,
    input wire [15:0] y,
    input wire active,

    output reg [3:0] r,
    output reg [3:0] g,
    output reg [3:0] b
);

    always @(posedge clk) begin
        r <= 4'b0000;
        g <= 4'b0000;
        b <= 4'b0000;

        if (active) begin
            if (y < 100) begin
                r <= x[5:2];
            end else if (y < 200) begin
                g <= x[5:2];
            end else if (y < 300) begin
                b <= x[5:2];
            end else if (y < 400) begin
                r <= x[5:2];
                g <= x[5:2];
            end else if (y < 500) begin
                r <= x[5:2];
                b <= x[5:2];
            end else if (y < 600) begin
                g <= x[5:2];
                b <= x[5:2];
            end else if (y < 700) begin
                r <= x[5:2];
                g <= x[5:2];
                b <= x[5:2];
            end else begin
                if (x[7:0] < 64) begin
                    r <= x[3:0];
                end else if (x[7:0] < 128) begin
                    g <= x[3:0];
                end else if (x[7:0] < 192) begin
                    b <= x[3:0];
                end else begin
                    r <= x[3:0];
                    g <= x[3:0];
                    b <= x[3:0];
                end
            end
        end
    end

endmodule

In this case, the color of each pixel is selected based on combinatorial logic with the x and y coordinates. This is not a useful output for a computer system, but is sufficient to show that the timing is working correctly, that the 12-bit color data is being produced as expected, and that the DVI encoder chip is successfully transcoding all of this data.

After connecting these two Verilog modules together and mapping the appropriate signals to the output pins connected to the TFP410 chip, the small monitor I use for testing on my workbench showed the pattern as expected:

Although my camera and subsequent image scaling created some unfortunate moiré patterns, the display itself shows crisp edges and gradients.

With only 16 shades of each color channel it's impossible to create long, smooth gradients without stepping or dithering, but for the simple, flat geometric patterns I'm ultimately intending to show (old-school bevelled buttons, text boxes, window decorations, etc) this should be more than sufficient.

That's all I have for now! When I next have time to work on this I'm hoping to implement a single hardware sprite to use as a mouse cursor (because the CPU I'm planning to use will not have a fast enough memory bus to do that well in software) and then connect some static RAM to produce a framebuffer. My current development board doesn't have any RAM chips, so I'll have some PCB design and manufacturing to do before I can dive into that.