RC1 NEON programming

This course explains how to use NEON SIMD instructions to accelerate multimedia algorithms.

Objectives
- This course is designed for programmers who want to run multimedia algorithms on the NEON Single Instruction, Multiple Data (SIMD) execution units.
- Each instruction family is detailed, first at assembly level and then at C level using the macros provided in the arm_neon.h file.
- Several non-obvious uses of the processing instructions are presented.
- Vector and vector element load / store instructions are studied, and guidelines for organizing data in memory are provided to minimize the number of memory accesses.
- The underlying cache operation and the preload mechanisms (preload instruction and hardware prefetch) are detailed to explain how processing can be pipelined.
- The course shows how typical DSP algorithms such as FIR and FFT can be vectorized and then optimized for execution on the NEON unit.
- This course has already been delivered to companies developing audio applications for mobile phones.

- THIS COURSE IS OFFERED EITHER AS AN INSTRUCTOR-LED COURSE OR AS E-LEARNING.

- ACSYS has developed an optimized NEON-based FFT written in assembly language.
  - Performance for 1024 complex single-precision floating-point samples:
    - 106,000 core clock cycles on Cortex-A9
    - 90,000 core clock cycles on Cortex-A8
  - For more information, contact guillaume.peron@ac6.fr
Labs are run under RVDS

A more detailed course description is available on request at info@ac6-training.com

Prerequisites
- Knowledge of the ARMv4T / ARMv5TE instruction set.

Plan
DAY 1
CORTEX-A8 AND CORTEX-A9(MP) ARCHITECTURE
- Data path: studying how data is loaded from external memory and copied into the level 1 and possibly level 2 caches
- Programmer’s model
- Highlighting coherency issues when data is shared by several cores; purpose of the SCU implemented in Cortex-A9
- Cortex-A8 and Cortex-A9 instruction pipelines, branch predictors
INTRODUCTION TO NEON/VFPv3
- Clarifying the resources shared by NEON and VFP
- Register bank, Q registers, D registers
- Data types
- Vector vs scalar
- Related system registers
- Alignment issues
- Enabling NEON/VFP
NEON INSTRUCTION SYNTAX
- Instructions producing wider / narrower results
- Instruction modifiers
- Selecting the shape
- Selecting the operand / result type
- Syntax flexibility
- Declaring initialized vectors in C language
- Using unions with vectors and arrays of vectors to simplify debugging
- Casting vectors
LOAD / STORE INSTRUCTIONS
- Addressing modes
- Vector load / store
- Vector load / store multiple
- Element and structure load / store instructions
  - Multiple single elements
  - Single element to 1 lane
  - Single element to all lanes
- Optimizing the ordering of data in memory to take advantage of 2-, 3- and 4-element structures
  - Example: managing audio samples
- Processor acceleration mechanisms: store merging buffers
  - Practical lab: using de-interleaving load instructions to place all right-channel samples in one vector and all left-channel samples in another
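
The de-interleaving load used in the lab above can be sketched as follows. This is a minimal illustration, not the course's lab solution: the function name `split_stereo_s16`, the left-first sample order and the multiple-of-8 frame count are assumptions made for the example. On targets without NEON the code falls back to a scalar loop that produces the same result.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0
#endif

/* Hypothetical helper: split interleaved stereo frames (L0 R0 L1 R1 ...)
 * into two planar buffers. Assumes the left sample comes first and that
 * `frames` (number of left/right pairs) is a multiple of 8. */
void split_stereo_s16(const int16_t *interleaved, int16_t *left,
                      int16_t *right, size_t frames)
{
    assert(frames % 8 == 0);
    for (size_t i = 0; i < frames; i += 8) {
#if HAVE_NEON
        /* One 2-element structure load reads 16 samples and de-interleaves
         * them: val[0] gets the even-indexed elements, val[1] the odd ones. */
        int16x8x2_t lr = vld2q_s16(interleaved + 2 * i);
        vst1q_s16(left + i, lr.val[0]);
        vst1q_s16(right + i, lr.val[1]);
#else
        /* Portable fallback: same result, one frame at a time. */
        for (size_t k = 0; k < 8; ++k) {
            left[i + k] = interleaved[2 * (i + k)];
            right[i + k] = interleaved[2 * (i + k) + 1];
        }
#endif
    }
}
```

A single structure load replaces a plain load followed by separate unzip steps, which is why the ordering of samples in memory matters for the number of instructions needed.
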
DAY 2
DATA TRANSFER INSTRUCTIONS
- Move
- Swap
- Table lookup
- Vector transpose
- Vector zip / unzip
- Data transfer between NEON and integer unit
  - Practical lab: clarifying narrow and long instructions, building a vector from bytes selected from a pair of vectors
LOGICAL AND BITFIELD INSTRUCTIONS
- Logical AND, Bit Clear, OR, XOR
- Operations with immediate values
- Bitwise insert instructions, avoiding branches
- Count leading zeros, ones, signs
- Normalizing floating point numbers when VFP is not implemented
- Scalar duplicate
- Extract
- Shift with possible rounding and saturation
- Bitfield reverse
  - Practical lab: transposing a matrix, shifting a large bitmap using vector instructions
ARITHMETIC INSTRUCTIONS
- Add, modulo vs saturated arithmetic
- Halving / Doubling the result
- Rounding
- Subtract
- Multiply
- Multiply accumulate / Multiply subtract
- Absolute value
- Min / Max
- Converting floating-point numbers into fixed-point numbers
- Converting fixed-point numbers into floating-point numbers
- Reciprocal estimate, reciprocal square root estimate, Newton-Raphson algorithm
- Pairwise instructions
- Element comparison
  - Practical lab: implementing a complex multiply accumulate with NEON
  - Practical lab: converting fixed-point elements into single-precision floating-point values and adding the resulting elements
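
As an illustration of the complex multiply accumulate lab, the sketch below computes acc[i] += a[i] * b[i] for complex numbers stored as interleaved (re, im) single-precision pairs. The function name `cmac_f32`, the interleaved layout and the multiple-of-4 element count are assumptions for this example; it is one possible formulation, not the course's reference code. A scalar fallback is used when NEON is not available.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0
#endif

/* Hypothetical helper: acc[i] += a[i] * b[i] for n complex values stored
 * as interleaved (re, im) floats. Assumes n is a multiple of 4. */
void cmac_f32(float *acc, const float *a, const float *b, size_t n)
{
    assert(n % 4 == 0);
    for (size_t i = 0; i < n; i += 4) {
#if HAVE_NEON
        /* De-interleaving loads: val[0] = real parts, val[1] = imaginary. */
        float32x4x2_t va = vld2q_f32(a + 2 * i);
        float32x4x2_t vb = vld2q_f32(b + 2 * i);
        float32x4x2_t vacc = vld2q_f32(acc + 2 * i);
        /* re += a.re * b.re - a.im * b.im */
        vacc.val[0] = vmlaq_f32(vacc.val[0], va.val[0], vb.val[0]);
        vacc.val[0] = vmlsq_f32(vacc.val[0], va.val[1], vb.val[1]);
        /* im += a.re * b.im + a.im * b.re */
        vacc.val[1] = vmlaq_f32(vacc.val[1], va.val[0], vb.val[1]);
        vacc.val[1] = vmlaq_f32(vacc.val[1], va.val[1], vb.val[0]);
        /* Re-interleave on store. */
        vst2q_f32(acc + 2 * i, vacc);
#else
        /* Portable fallback: one complex value at a time. */
        for (size_t k = i; k < i + 4; ++k) {
            float ar = a[2 * k], ai = a[2 * k + 1];
            float br = b[2 * k], bi = b[2 * k + 1];
            acc[2 * k] += ar * br - ai * bi;
            acc[2 * k + 1] += ar * bi + ai * br;
        }
#endif
    }
}
```

Keeping real and imaginary parts in separate vectors means every lane performs the same operation, so four complex products need only multiply accumulate and multiply subtract instructions with no lane shuffling.
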
NEON CODING EXAMPLES
- FIR filter
  - Converting the scalar algorithm into a vector algorithm
  - Finding the NEON instructions to encode the vector algorithm
  - Optimizing the code
  - Using the performance monitor to tune the algorithm
- FFT (DFT)
  - Converting the scalar algorithm into a vector algorithm, understanding how circle properties can be used to process 4 angles concurrently
  - Finding the NEON instructions to encode the vector algorithm
  - Optimizing the code
  - Using the performance monitor to tune the algorithm
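
To make the first FIR step concrete, here is a minimal sketch of one way to turn the scalar algorithm into a vector algorithm: four consecutive outputs are computed at once, and each coefficient is multiplied by a vector of four neighbouring input samples. The function name `fir_f32`, the form y[n] = sum over k of h[k] * x[n + k] (coefficients assumed to be stored in the order they are applied), and the multiple-of-4 output count are assumptions for this example. It is an unoptimized starting point, not the tuned implementation discussed in the course, and it falls back to scalar code without NEON.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0
#endif

/* Hypothetical helper: y[n] = sum_{k < taps} h[k] * x[n + k].
 * x must hold n_out + taps - 1 samples; n_out must be a multiple of 4. */
void fir_f32(const float *x, const float *h, float *y,
             size_t n_out, size_t taps)
{
    assert(n_out % 4 == 0);
    for (size_t n = 0; n < n_out; n += 4) {
#if HAVE_NEON
        /* Each lane accumulates one of four consecutive outputs. */
        float32x4_t acc = vdupq_n_f32(0.0f);
        for (size_t k = 0; k < taps; ++k) {
            /* Multiply four input samples by the scalar coefficient h[k]. */
            acc = vmlaq_n_f32(acc, vld1q_f32(x + n + k), h[k]);
        }
        vst1q_f32(y + n, acc);
#else
        /* Portable fallback mirroring the four-lane accumulation. */
        float acc[4] = {0.0f, 0.0f, 0.0f, 0.0f};
        for (size_t k = 0; k < taps; ++k) {
            for (size_t lane = 0; lane < 4; ++lane) {
                acc[lane] += h[k] * x[n + k + lane];
            }
        }
        for (size_t lane = 0; lane < 4; ++lane) {
            y[n + lane] = acc[lane];
        }
#endif
    }
}
```

This version reloads overlapping input vectors for every tap. Reducing those memory accesses, for example by reusing registers across taps, is the kind of optimization the later steps in the outline refer to.
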