RC1 NEON programming

This course explains how to use NEON SIMD instructions to speed up multimedia algorithms.


Objectives
- This course is designed for programmers who want to run multimedia algorithms on the NEON Single Instruction Multiple Data (SIMD) execution units.
- Each instruction family is detailed, first at assembly level and then at C level using the macros (intrinsics) provided in the arm_neon.h file.
- Several tricky uses of the processing instructions are presented.
- Vector and vector-element load / store instructions are studied, and guidelines for organizing data in memory are provided to minimize the number of memory accesses.
- The underlying cache operation and the preload mechanisms (preload instructions and hardware prefetch) are detailed to explain how processing can be pipelined.
- The course shows how typical DSP algorithms such as FIR and FFT can be vectorized and then optimized for execution on the NEON unit.
- This course has already been delivered to companies developing audio applications for mobile phones.

- THIS COURSE IS OFFERED EITHER AS AN INSTRUCTOR-LED COURSE OR AS E-LEARNING.

- ACSYS has developed an optimized NEON-based FFT coded in assembly language.
  Performance for 1024 complex single-precision floating-point samples:
  - 106,000 core clock cycles on Cortex-A9
  - 90,000 core clock cycles on Cortex-A8
  For further information, contact guillaume.peron@ac6.fr
Labs are run under RVDS

A more detailed course description is available on request at info@ac6-training.com
Prerequisites
- Knowledge of the ARMv4T / ARMv5TE instruction set.

Outline
DAY 1
CORTEX-A8 AND CORTEX-A9(MP) ARCHITECTURE
- Data path: how data are loaded from external memory and copied into the level 1 and, where present, level 2 caches
- Programmer’s model
- Coherency issues when data are shared by several cores; purpose of the SCU implemented in Cortex-A9
- Cortex-A8 and Cortex-A9 instruction pipelines, branch predictors
INTRODUCTION TO NEON/VFPv3
- Clarifying the resources shared by NEON and VFP
- Register bank, Q registers, D registers
- Data types
- Vector vs scalar
- Related system registers
- Alignment issues
- Enabling NEON/VFP
NEON INSTRUCTION SYNTAX
- Instructions producing wider / narrower results
- Instruction modifiers
- Selecting the shape
- Selecting the operand / result type
- Syntax flexibility
- Declaring initialized vectors in C language
- Using unions with vectors and arrays of vectors to simplify debugging
- Casting vectors
LOAD / STORE INSTRUCTIONS
- Addressing modes
- Vector load / store
- Vector load / store multiple
- Element and structure load / store instructions
  - Multiple single elements
  - Single element to 1 lane
  - Single element to all lanes
- Optimizing the ordering of data in memory to take advantage of 2-, 3- and 4-element structures
  - Example: managing audio samples
- Processor acceleration mechanisms: store merging buffers
  - Practical lab: using load with de-interleaving instructions to store all right-channel samples into one vector and all left-channel samples into another vector
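
As an illustration of the de-interleaving idea in the lab above, here is a minimal portable C sketch. It is a scalar model, not NEON code: on NEON, a 2-element structure load (VLD2) performs this split for several frames in a single instruction. The function name `deinterleave_stereo` and the assumed memory layout (left sample first, then right) are hypothetical choices made for this example only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model of a de-interleaving load.
   Assumed layout (an assumption of this sketch): L0 R0 L1 R1 ...
   Each frame is split so that left and right samples end up in
   separate contiguous arrays, ready for lane-wise processing. */
void deinterleave_stereo(const int16_t *interleaved, size_t frames,
                         int16_t *left, int16_t *right)
{
    assert(interleaved && left && right);
    for (size_t i = 0; i < frames; ++i) {
        left[i]  = interleaved[2 * i];      /* even positions: left  */
        right[i] = interleaved[2 * i + 1];  /* odd positions:  right */
    }
}
```

Once the channels are separated, the same operation can be applied to every element of a vector without shuffling data first, which is why the memory layout matters.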
DAY 2
DATA TRANSFER INSTRUCTIONS
- Move
- Swap
- Table lookup
- Vector transpose
- Vector zip / unzip
- Data transfer between NEON and integer unit
  - Practical lab: clarifying narrow and long instructions, building a vector from bytes selected from a pair of vectors
LOGICAL AND BITFIELD INSTRUCTIONS
- Logical AND, Bit Clear, OR, XOR
- Operations with immediate values
- Bitwise insert instructions, avoiding branches
- Count leading zeros, ones, signs
- Normalizing floating point numbers when VFP is not implemented
- Scalar duplicate
- Extract
- Shift with possible rounding and saturation
- Bitfield reverse
  - Practical lab: transposing a matrix, shifting a large bitmap using vector instructions
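
To make the "avoiding branches" item above concrete, the following portable C sketch models a bitwise select on one 32-bit lane at a time. It is a scalar illustration only, under the assumption that a comparison produces an all-ones or all-zeros mask per lane, as NEON compare instructions do, and that a select instruction such as VBSL then picks bits from one of two operands. The names `bit_select` and `lanes_max_u32` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise select: take each bit from a where mask is 1, from b where it is 0. */
static uint32_t bit_select(uint32_t mask, uint32_t a, uint32_t b)
{
    return (mask & a) | (~mask & b);
}

/* Per-lane maximum of two unsigned arrays without an if/else per element. */
void lanes_max_u32(const uint32_t *a, const uint32_t *b,
                   uint32_t *out, size_t n)
{
    assert(a && b && out);
    for (size_t i = 0; i < n; ++i) {
        /* The comparison yields 0 or 1; subtracting it from 0 turns that
           into an all-zeros or all-ones mask for the lane. */
        uint32_t mask = (uint32_t)0 - (uint32_t)(a[i] > b[i]);
        out[i] = bit_select(mask, a[i], b[i]);
    }
}
```

The same mask-then-select pattern applies to any per-element condition, so every lane follows the same instruction stream regardless of the data.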
ARITHMETIC INSTRUCTIONS
- Add, modulo vs saturated arithmetic
- Halving / doubling the result
- Rounding
- Subtract
- Multiply
- Multiply accumulate / Multiply subtract
- Absolute value
- Min / Max
- Converting floating-point numbers into fixed-point numbers
- Converting fixed-point numbers into floating-point numbers
- Reciprocal estimate, reciprocal square root estimate, Newton-Raphson algorithm
- Pairwise instructions
- Element comparison
  - Practical lab: implementing a complex multiply accumulate with NEON
  - Practical lab: converting fixed-point elements into single precision floating point values and adding the resulting elements
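
The reciprocal estimate and Newton-Raphson items above can be sketched in portable C as follows. This is a scalar model under stated assumptions: the initial estimate is passed in by the caller (on NEON it would come from a reciprocal estimate instruction such as VRECPE), and each refinement step applies x' = x * (2 - d * x), the factor (2 - d * x) being what the VRECPS step instruction computes. The function names `recip_step` and `recip_refine` are hypothetical.

```c
#include <assert.h>

/* One Newton-Raphson step towards 1/d, starting from estimate x:
   x' = x * (2 - d * x). The error is roughly squared at each step. */
float recip_step(float d, float x)
{
    return x * (2.0f - d * x);
}

/* Refine a caller-supplied estimate of 1/d with a fixed number of steps. */
float recip_refine(float d, float estimate, int steps)
{
    assert(steps >= 0);
    float x = estimate;
    for (int i = 0; i < steps; ++i)
        x = recip_step(d, x);
    return x;
}
```

For example, starting from the rough estimate 0.2 for 1/4, successive steps give 0.24, 0.2496 and then a value within about 1e-6 of 0.25. The number of steps needed depends on the accuracy of the starting estimate and on the precision the algorithm requires.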
NEON CODING EXAMPLES
- FIR filter
  - Converting the scalar algorithm into a vector algorithm
  - Finding the NEON instructions to encode the vector algorithm
  - Optimizing the code
  - Using the performance monitor to tune the algorithm
- FFT (DFT)
  - Converting the scalar algorithm into a vector algorithm, understanding how circle properties can be used to process 4 angles concurrently
  - Finding the NEON instructions to encode the vector algorithm
  - Optimizing the code
  - Using the performance monitor to tune the algorithm
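
As a rough illustration of the first FIR step listed above (converting the scalar algorithm into a vector algorithm), the portable C sketch below restructures a scalar FIR so that four outputs are computed together, each in its own accumulator "lane". This is not the course's implementation and contains no NEON code. It assumes the coefficients are stored in the order in which they are applied, y[n] = sum of h[k] * x[n + k], and that the output count is a multiple of 4. The names `fir_scalar` and `fir_by_4` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Reference scalar FIR: y[n] = sum over k of h[k] * x[n + k].
   x must hold at least n_out + taps - 1 samples. */
void fir_scalar(const float *x, const float *h, size_t taps,
                float *y, size_t n_out)
{
    for (size_t n = 0; n < n_out; ++n) {
        float acc = 0.0f;
        for (size_t k = 0; k < taps; ++k)
            acc += h[k] * x[n + k];
        y[n] = acc;
    }
}

/* Vector-shaped variant: four outputs per outer iteration.
   acc[0..3] play the role of the four lanes of a quad register;
   each coefficient is applied to four consecutive input samples,
   which maps naturally onto a multiply accumulate by a scalar. */
void fir_by_4(const float *x, const float *h, size_t taps,
              float *y, size_t n_out)
{
    assert(n_out % 4 == 0);
    for (size_t n = 0; n < n_out; n += 4) {
        float acc[4] = {0.0f, 0.0f, 0.0f, 0.0f};
        for (size_t k = 0; k < taps; ++k) {
            float coeff = h[k];  /* one coefficient shared by all lanes */
            for (int lane = 0; lane < 4; ++lane)
                acc[lane] += coeff * x[n + k + lane];
        }
        for (int lane = 0; lane < 4; ++lane)
            y[n + lane] = acc[lane];
    }
}
```

Both functions accumulate each output in the same order, so they produce identical results; the second form only exposes the lane-level parallelism that vector instructions can exploit.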