This file lists itemized GMP development tasks. Not all the tasks listed here are suitable for volunteers, but many of them are. Please see the file "projects", or http://www.swox.com/gmp/projects.html, for more sizeable projects.
_mpz_realloc with a small (1 limb) size.
mpz_XXX(a,a,a).
For large exponents (approaching the range of mp_exp_t), the precision of
__mp_bases[base].chars_per_bit_exactly is insufficient and
mpf_get_str aborts. Detect and compensate.
Fix mpz_get_si to work properly for the MIPS N32 ABI (and other
machines that use long long for storing limbs).
mpz_*_ui division routines: perhaps make them return the
real remainder instead? That would change the return type to signed long int.
mpf_eq is not always correct, when one operand is
1000000000... and the other operand is 0111111111..., i.e., extremely
close. There is a special case in mpf_sub for this
situation; put similar code in mpf_eq.
mpn_divrem cannot throw away the quotient when XSIZE !=
0. Document or fix.
__builtin_constant_p is unavailable? Same problem with MacOS
X.
dump_abort in
mp?/tests/*.c.
mpz_get_si returns 0x80000000 for -0x100000000.
count_leading_zeros if
the bit is set. This is an optimization on all machines, and significant
on machines with slow count_leading_zeros.
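If the intent is to test the most significant bit first and call count_leading_zeros only when it is clear, a sketch could look like this (the wrapper name is invented):

  #define COUNT_LEADING_ZEROS_MAYBE(count, x)                       \
    do {                                                            \
      if ((x) & ((mp_limb_t) 1 << (BITS_PER_MP_LIMB - 1)))          \
        (count) = 0;            /* high bit set, no call needed */  \
      else                                                          \
        count_leading_zeros (count, x);                             \
    } while (0)
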
count_trailing_zeros is used
on more or less uniformly distributed numbers. For some CPUs
count_trailing_zeros is slow and it's probably worth
handling the frequently occurring 0 to 2 trailing zeros cases specially.
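For example, a sketch that special-cases 0 to 2 trailing zeros before falling back on count_trailing_zeros (the wrapper name is invented):

  #define COUNT_TRAILING_ZEROS_LOWBITS(count, x)   \
    do {                                           \
      if ((x) & 1)                                 \
        (count) = 0;                               \
      else if ((x) & 2)                            \
        (count) = 1;                               \
      else if ((x) & 4)                            \
        (count) = 2;                               \
      else                                         \
        count_trailing_zeros (count, x);           \
    } while (0)
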
Change code that uses udiv_qrnnd for inverting limbs to
instead use mpn_invert_limb.
Make umul_ppmm use floating-point for generating the
most significant limb (if BITS_PER_MP_LIMB <= 52 bits).
(Peter Montgomery has some ideas on this subject.)
umul_ppmm code in longlong.h: Add partial
products with fewer operations.
Implement mpn_get_str and mpn_set_str running in
the sub-O(n^2) range, using some divide-and-conquer approach, preferably
without using division.
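A rough sketch of the divide-and-conquer splitting, written here with the public mpz interface for readability (the real code would work on limb vectors, reuse the computed powers, and preferably replace the division; the name dc_get_str and the basecase threshold are invented):

  #include <gmp.h>
  #include <string.h>

  /* Write exactly ndigits digits of n (0 <= n < base^ndigits) into buf,
     zero-padded, with no terminating NUL. */
  static void
  dc_get_str (char *buf, size_t ndigits, const mpz_t n, int base)
  {
    if (ndigits <= 16)                  /* basecase threshold, arbitrary here */
      {
        char tmp[20];
        size_t len;
        mpz_get_str (tmp, base, n);
        len = strlen (tmp);
        memset (buf, '0', ndigits - len);
        memcpy (buf + (ndigits - len), tmp, len);
      }
    else
      {
        size_t lo = ndigits / 2;
        mpz_t pow, q, r;
        mpz_init (pow);
        mpz_init (q);
        mpz_init (r);
        mpz_ui_pow_ui (pow, base, lo);              /* base^(ndigits/2) */
        mpz_tdiv_qr (q, r, n, pow);                 /* n = q * base^lo + r */
        dc_get_str (buf, ndigits - lo, q, base);            /* high digits */
        dc_get_str (buf + (ndigits - lo), lo, r, base);     /* low digits */
        mpz_clear (pow);
        mpz_clear (q);
        mpz_clear (r);
      }
  }
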
mpn_get_str to mpf/get_str. (Talk to Torbjörn about this.)
mpz_size,
mpz_set_ui, mpz_set_q, mpz_clear,
mpz_init, mpz_get_ui, mpz_scan0,
mpz_scan1, mpz_getlimbn,
mpz_init_set_ui, mpz_perfect_square_p,
mpz_popcount, mpf_size,
mpf_get_prec, mpf_set_prec_raw,
mpf_set_ui, mpf_init, mpf_init2,
mpf_clear, mpf_set_si.
mpn_addmul_1, mpn_submul_1, and
mpn_mul_1 for the 21264. On 21264, they should run at 4, 3,
and 3 cycles/limb respectively, if the code is unrolled properly. (Ask
Torbjörn for his xm.s and xam.s skeleton files.)
mpn_addmul_1, mpn_submul_1, and
mpn_mul_1 for the 21164. This should use both integer
multiplies and floating-point multiplies. For the floating-point
operations, the single-limb multiplier should be split into three 21-bit
chunks.
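A rough illustration of the splitting step, assuming 64-bit limbs (the chunk layout shown is just one possibility):

  #include <gmp.h>

  /* Split the invariant 64-bit multiplier v into three chunks of
     21 + 21 + 22 bits, each exactly representable as a double:
     v = d[0] + d[1]*2^21 + d[2]*2^42.  Partial products must then be
     kept at or below 53 bits to remain exact in double precision. */
  static void
  split_multiplier (double d[3], mp_limb_t v)
  {
    d[0] = (double) (v & 0x1FFFFF);           /* bits  0..20 */
    d[1] = (double) ((v >> 21) & 0x1FFFFF);   /* bits 21..41 */
    d[2] = (double) (v >> 42);                /* bits 42..63 */
  }
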
mpn_add_n and
mpn_sub_n. The current sparc64 code uses MOVcc
instructions, which take about 6 cycles on UltraSPARC. The correct
approach is probably to use conditional branching. That should lead to
loops that run at 4 cycles/limb. (Torbjörn has code that just needs to be
finished.)
mpn_addmul_1,
mpn_submul_1, and mpn_mul_1. Should use
floating-point operations, and split the invariant single-limb multiplier
into 21-bit chunks. Should give about 18 cycles/limb, but the pipeline
will become very deep. (Torbjörn has C code that is useful as a starting
point.)
mpn_lshift and mpn_rshift.
Should give 2 cycles/limb. (Torbjörn has code that just needs to be
finished.)
mpn_addmul_1, mpn_submul_1, and
mpn_mul_1. The current development code runs at 11
cycles/limb, which is already very good. But it should be possible to
saturate the cache, which will happen at 7.5 cycles/limb.
umul_ppmm. Important in particular for
mpn_sqr_basecase.
mpn_mul_basecase and mpn_sqr_basecase
for important machines.
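For reference, a plain C restatement of what mpn_mul_basecase computes, in terms of the public mpn calls (not the suggested assembler code):

  #include <gmp.h>

  /* wp[0 .. un+vn-1] = up[0 .. un-1] * vp[0 .. vn-1]; requires
     un >= vn >= 1 and no overlap between wp and the inputs. */
  static void
  ref_mul_basecase (mp_limb_t *wp, const mp_limb_t *up, mp_size_t un,
                    const mp_limb_t *vp, mp_size_t vn)
  {
    mp_size_t i;
    wp[un] = mpn_mul_1 (wp, up, un, vp[0]);                 /* first row */
    for (i = 1; i < vn; i++)
      wp[un + i] = mpn_addmul_1 (wp + i, up, un, vp[i]);    /* accumulate rows */
  }
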
mpn_lshift/mpn_rshift.
Will bring time from 1.75 to 1.25 cycles/limb.
mpn_lshift for shifts by 1. (See Pentium code.)
count_leading_zeros.
udiv_qrnnd. (Ask Torbjörn for the file
test-udiv-preinv.c as a starting point.)
mpn_add_n and mpn_sub_n.
It should just require 3 cycles/limb, but the current code propagates
carry poorly. The trick is to add carry-in later than we do now,
decreasing the number of operations used to generate carry-out from 4
to 3.
mpn_lshift.
The pipeline is now extremely deep, perhaps unnecessarily deep. Also, r5
is unused. (Ask Torbjörn for a copy of the current code.)
mpn_rshift based on new mpn_lshift.
mpn_add_n and mpn_sub_n. Should
run at just 3.25 cycles/limb. (Ask for xxx-add_n.s as a starting point.)
mpn_mul_basecase and
mpn_sqr_basecase. This should use a "vertical multiplication
method" to avoid carry propagation, splitting one of the operands into
11-bit chunks.
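A minimal illustration of the column-wise idea (types, chunk storage, and the helper name are illustrative only, not how a vector implementation would be organized):

  #include <stdint.h>

  /* a[] and b[] hold 11-bit chunks.  Column sums are accumulated in
     64-bit words without propagating carries (each product is < 2^22,
     so a column can absorb a very large number of addends); carries
     are resolved in one final pass. */
  static void
  vertical_mul (uint64_t *r, const uint32_t *a, int an,
                const uint32_t *b, int bn)
  {
    int i, j, k;
    for (k = 0; k < an + bn; k++)
      r[k] = 0;
    for (i = 0; i < an; i++)
      for (j = 0; j < bn; j++)
        r[i + j] += (uint64_t) a[i] * b[j];     /* no carry handling here */
    for (k = 0; k < an + bn - 1; k++)
      {
        r[k + 1] += r[k] >> 11;                 /* propagate once at the end */
        r[k] &= 0x7FF;                          /* keep 11 bits per chunk */
      }
  }
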
mpn_mul_basecase and
mpn_sqr_basecase. The same comments apply here as for the
Fujitsu VPP versions of these functions.
count_leading_zeros for 64-bit machines:
if ((x >> (W_TYPE_SIZE - W_TYPE_SIZE/2)) == 0) { x <<= W_TYPE_SIZE/2; cnt += W_TYPE_SIZE/2; }
if ((x >> (W_TYPE_SIZE - W_TYPE_SIZE/4)) == 0) { x <<= W_TYPE_SIZE/4; cnt += W_TYPE_SIZE/4; }
...
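Spelled out for a 64-bit limb (a sketch; like count_leading_zeros itself it assumes x != 0, and a real macro would be written in terms of W_TYPE_SIZE rather than hard-coding 64):

  #include <stdint.h>

  static int
  clz64 (uint64_t x)            /* assumes x != 0 */
  {
    int cnt = 0;
    if ((x >> 32) == 0) { x <<= 32; cnt += 32; }
    if ((x >> 48) == 0) { x <<= 16; cnt += 16; }
    if ((x >> 56) == 0) { x <<= 8;  cnt += 8;  }
    if ((x >> 60) == 0) { x <<= 4;  cnt += 4;  }
    if ((x >> 62) == 0) { x <<= 2;  cnt += 2;  }
    if ((x >> 63) == 0) { cnt += 1; }
    return cnt;
  }
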
mpz_get_nth_ui. Return the nth word (not necessarily the nth limb).
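A sketch of what such a function might do, expressed with existing calls (the name and argument conventions are invented):

  #include <gmp.h>

  /* Return word n (unsigned long sized) of |op|; word 0 is the least
     significant.  An unsigned long need not be the same size as a limb. */
  static unsigned long
  mpz_get_nth_ui_sketch (const mpz_t op, unsigned long n)
  {
    mpz_t t;
    unsigned long w;
    mpz_init (t);
    mpz_tdiv_q_2exp (t, op, n * 8 * sizeof (unsigned long));  /* drop n low words */
    w = mpz_get_ui (t);                                       /* low word of |t| */
    mpz_clear (t);
    return w;
  }
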
mpz_crr (Chinese Remainder Reconstruction).
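For two moduli the reconstruction is straightforward with existing functions; a sketch (the function name and argument conventions are invented), assuming gcd(m1,m2) = 1:

  #include <gmp.h>

  /* Set x to the unique value in [0, m1*m2) with x == r1 (mod m1)
     and x == r2 (mod m2).  Assumes m1 and m2 are coprime. */
  static void
  crr2 (mpz_t x, const mpz_t r1, const mpz_t m1, const mpz_t r2, const mpz_t m2)
  {
    mpz_t t, u;
    mpz_init (t);
    mpz_init (u);
    mpz_invert (t, m1, m2);       /* t = m1^-1 mod m2 */
    mpz_sub (u, r2, r1);
    mpz_mul (u, u, t);
    mpz_mod (u, u, m2);           /* u = (r2 - r1) * m1^-1 mod m2 */
    mpz_mul (u, u, m1);
    mpz_add (u, u, r1);           /* u == r1 (mod m1), u == r2 (mod m2) */
    mpz_mul (t, m1, m2);
    mpz_mod (x, u, t);            /* reduce into [0, m1*m2) */
    mpz_clear (t);
    mpz_clear (u);
  }
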
mpq_set_f for assignment from mpf_t
(cf. mpq_set_d).
Make mpz_init (and mpq_init) do lazy
allocation. Set ALLOC(var) to 0, and have
mpz_realloc special-handle that case. Update functions that
rely on a single limb (like mpz_set_ui,
mpz_[tfc]div_r_ui, and others).
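A sketch of the lazy initialization (field access via the gmp-impl.h macros; whether PTR is left null or pointed at a shared dummy limb is an open choice):

  static void
  mpz_init_lazy (mpz_ptr x)     /* uses the ALLOC/SIZ/PTR macros from gmp-impl.h */
  {
    ALLOC (x) = 0;              /* nothing allocated yet; mpz_realloc must cope */
    SIZ (x) = 0;                /* value is zero */
    PTR (x) = NULL;             /* or point at a shared dummy limb */
  }
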
Implement mpf_out_raw and mpf_inp_raw. Make sure
format is portable between 32-bit and 64-bit machines, and between
little-endian and big-endian machines.
gmp_errno.
gmp_fprintf, gmp_sprintf, and
gmp_snprintf.
mpq input and output functions.
Extend mpz_jacobi to a full range of inputs.
Kevin Ryde is working on this.
Make mpz_div and mpz_divmod use rounding
analogous to mpz_mod. Document this, and list it as an
incompatibility.
Use malloc
separately for each TMP_ALLOC block, so a redzoning
malloc debugger could be used during development.
Check that unsigned long int is used for
bit counts/ranges, and that mp_size_t is used for limb counts.
The documentation for mpz_inp_str (etc) doesn't say when it stops reading digits.
Please send comments about this page to
tege@swox.com. Copyright (C) 1999, 2000 Torbjörn Granlund.