User talk:Ferrandm


Hypotheses

The hypotheses used to simplify the Navier-Stokes equations are:

  • The total energy balance, following Poinsot and Veynante, is described below, if:

The term can be neglected compared with the heat release due to combustion.

This release is defined by the following equation, where the quantities involved are the specific heat of reaction, the standard enthalpy of formation and the reaction rate.

is the power produced by volume forces on species k.

is the heat source term, which is zero here since there is no electric spark, laser or radiative flux.

is the mass fraction of the species k.

The tensor is composed of the stress tensor and the pressure tensor:

Thanks to these hypotheses, the total energy balance becomes:


where the remaining term corresponds to the heat release due to combustion.


  • The term is not used. It would make it possible to take other forces into account, such as electromagnetic forces or weight.
  • is either a fuel or an oxidizer source term.

if it is a fuel source term, or if it is an oxidizer source term.

and are respectively the molar masses of the fuel and of the oxidizer.

  • can be determined by the Arrhenius law :

  • The heat fluxes due to mass-fraction gradients (Dufour effect) and the species diffusion due to temperature gradients (Soret effect) are neglected.
  • The thermal conductivity is given by:

Where is the Prandtl number :

This number compares momentum diffusion with heat diffusion.

  • The Lewis number for species k, , is:

Where is the species diffusion coefficient.

We will take the Lewis number equal to 1 in order to simplify the physics of the premixed flame.

  • The Schmidt number, , is:

  • From the relation between these last three numbers, we can deduce that:

and

Therefore, the diffusion coefficient can be defined as follows (these relations are summarized at the end of this section):

  • We use Fick's law, which is a simplified diffusion law:

is the particle flux density.

C is the particle density.

Thus the diffusion coefficients of the various species are characterized by the Lewis number.

  • We consider that the gas is a perfect, viscous, reacting, diatomic gas. Since the gas is viscous, the compressibility effects are not dominant, which implies that the bulk viscosity can be neglected.
  • The gas is reacting, so the mixture of species is not isothermal and the species must be followed individually. This implies that the heat capacities depend on the temperature and on the composition.
  • For a diatomic gas, the heat capacities can be defined as follows:

  • The gas obeys the perfect gas law:

  • Combustion is an irreversible transformation whose entropy creation is compensated by the entropy given up by the system to its surroundings through heat transfer, so the transformation can be considered isentropic. Consequently, Laplace's law of thermodynamics applies:

  • According to the power law, the dynamic viscosity depends only on the temperature:

  • Under the hypothesis of a Newtonian fluid, the stress tensor is:

The Kronecker symbol equals 1 if i = j, and 0 otherwise.

  • The acoustic Reynolds number is defined as:

c is the speed of sound (340 m/s), L is the characteristic length and the kinematic viscosity of air is 1.45e-5 m²/s.
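The formulas of this section appear as images on the original page. As a reminder, and as an assumption about the exact notation used, the standard definitions of the dimensionless numbers introduced above, and the diffusion coefficient deduced from them, are:

\[
Pr = \frac{\mu\, c_p}{\lambda}, \qquad Le_k = \frac{\lambda}{\rho\, c_p\, D_k}, \qquad Sc_k = \frac{\nu}{D_k} = Pr\, Le_k, \qquad D_k = \frac{\mu}{\rho\, Sc_k} = \frac{\lambda}{\rho\, c_p\, Le_k}
\]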

Closure equations

     


Parameters

Makefile

The makefile is used to compile the routines and subroutines.

To compile, type the command "make" in the terminal.

The default target of this file is "all::".

Since various computers can be used, the compilation has to be conditional: several compilation configurations are available depending on the cluster considered (Antares, Jade, ...):

  • FC indicates which compiler is used. For instance, if the language is Fortran 90 with MPI, it will be written FC = mpif90.
  • Part of the code is written in C (the file Vtk_bin_write.c), so the compilation options for the C language have to be defined with CFLAGS.
  • FFLAGS defines the optimization options for the Fortran compiler. For instance:
FFLAGS = -O3 -ipo -p.

"O3" is an optimization option of level 3. From this level on, the semantics of the routine can be modified. The optimization is more aggressive concerning memory accesses and loop management. The compilation time is longer than for a lower level and the code size is also bigger.

"ipo" is an interprocedural optimization option. It allows the inlining of subroutines located in separate source files.

"p" is a profiling option.

The object files common to the calculation executable and those common to the post-processing executable have to be defined.

For example :

all:: Mezo3D df6_pospro
COMMON_OBJS += communs.o
COMMON_OBJS += lectus.o

Then Allegro files are compiled and then an executable is created :

Mezo3D : $(ALLEGRO_OBJS)
$(QUIET_LINK)$(FC) $(FFLAGS) -o $@ $(filter %.o,$^)

The same thing is done for the postprocessing files :

df6_pospro : $(POSTPRO_OBJS)
$(QUIET_LINK)$(FC) $(FFLAGS) -o $@ $(filter %.o,$^)

A clean target is available. It deletes the files with extensions .o, .mod and .d. The executables, the solution files and the gmon.out files are also deleted.

It is possible to modify this target if necessary.

Executable

Fort.10.sh

Fort.10.sh is a shell script whose purpose is to ease the creation of the initial directories for the submission of new jobs.

As originally designed, it works together with the file launch-run.sh. In launch-run.sh, the variables NX, NY, NZ (which define the mesh size per processor) and the variables NPX, NPY, NPZ (which define the number of processors in each direction) are defined and exported to fort.10.sh. In this way, several calculations can be launched by changing only a few parameters.

Initialization example :

sh launch-run.sh -g

The option -g calls the function generate(). This function creates a directory named by the variable "dir". In this directory, two other directories are created: a directory "IN" that will contain the input files, and a directory "OUT" that will contain the output files. Then the template of the executable "NAME" is redirected to a file named fort.10, located in the directory "IN". This step is performed as follows:

sh $Name.template > $dir/IN/fort.10

Once the file fort.10 is created, the function copies the executable that runs the case, the executable that post-processes the results, and the file fort.11 into the directory "IN". Then the job template is redirected to the file named "job".

Vtk_bin_write.c

This file concerns the distribution of processors and the choice of the cluster.

Calculations can be launched either on ATON or on VARGAS. The value 1 must be set for the computer that is going to be used. For instance, if ATON is used, the value entered is 1 for ATON and 0 for VARGAS:

#define ATON 1
#define VARGAS 0

Conversely, if VARGAS is used, the value entered is 1 for VARGAS and 0 for ATON:

#define ATON 0
#define VARGAS 1

Be careful: some options may have to be modified depending on the cluster.

The distribution of processors can be defined as a function of:

  • Coordinates.
  • Vectors: in that case it depends on the dimension of the problem, NDIM.
  • A scalar: this function defines the id number of the processor.

Job

The job file is used when the calculation is run on a computer which is not local.

The extension of this file is .ll because it is a LoadLeveler file.

A job for Antares is created as follows:

  • Job name: makes it easier to locate the calculations.
@ job_name = NAME
  • Standard output file:
@ output = $(job_name).o$(jobid)
  • Error output file:
@ error = $(job_name).e$(jobid)
  • Job type:
@ job_type = MPICH
  • Total number of requested processes:
@ total_tasks = NP
  • Maximum elapsed time for the job:
@ wall_clock_limit = 24:00:00
  • Maximum memory per MPI process: if a process exceeds the specified limit, the calculation is killed. Conversely, if this parameter is overestimated, the waiting time in the queue will be longer. It is preferable to run calculations with a data limit lower than or equal to 3 GB. The memory actually needed per process should be as close as possible to the requested memory.
@ data_limit = 550mb
  • Initial directory to send: this directory contains all the data for the calculation.
@ cri_initialdir = PWD/dir/IN
  • Final directory for the results: the computer sends the results to this directory.
@ cri_finaldir = PWD/dir/OUT
  • User mail: an e-mail is sent if the elapsed time limit is exceeded.
@ notify_user = name@adress.fr
  • Temporary (buffer) working directory:
cd $LOCAL_WORK_DIR

The directory tree must be defined below this line. Generally, it corresponds to the tree of the final directory.

The result files are moved to a directory named Spool before being copied to the final directory.

Launch-run

Test case

Weak Scaling

It is important to remember that the results of the test depend on the case and on the computer used. To run a test case of the code on the Antares computer, it is enough to fill in these files: job.ll.template and launch-run.sh.

Launch-run.sh

Once an optimal range for the block size has been determined with the strong scaling, the optimal block size can be obtained with the weak scaling. Tests are thus run for block sizes of 40x40x40, 32x32x32 and 20x20x20 on various combinations of cores. You can choose your own block sizes.

Distribution of cores:

Total core number   Cores number (x)   Cores number (y)   Cores number (z)
1                   1                  1                  1
2                   1                  1                  2
2                   1                  2                  1
2                   2                  1                  1
4                   1                  1                  4
4                   2                  2                  1
4                   2                  1                  2
8                   1                  1                  8
8                   2                  2                  2
8                   4                  2                  1
8                   2                  4                  1
8                   2                  1                  4
16                  2                  1                  8
16                  1                  2                  8
32                  2                  2                  8
32                  4                  1                  8
32                  1                  4                  8
64                  4                  2                  8
64                  2                  4                  8
128                 8                  2                  8
128                 2                  8                  8
256                 8                  4                  8
256                 4                  8                  8
512                 8                  8                  8

Example of a test line :

NX=32 NY=32 NZ=32 NPX=8 NPY=8 NPZ=8 ./fort.10.sh $@

This means that the case is tested with a block size of 32x32x32 per core on 8x8x8 (512) cores, i.e. a total mesh of 256x256x256 points.

Job.ll.template

  • Modify your e-mail address in order to receive information on the calculation progress.
  • If necessary, modify the parameters "wall_clock_limit" and "data_limit" to move up in the waiting queue or to increase the memory allocated per process.

Jobs submission

  • Compile files :
make
  • Create files for every test case :
sh launch-run.sh -g
  • Submit jobs to a cluster :
sh launch-run.sh -s

The terminal asks whether the job "XXXXX" should be submitted; just enter "y". If an error occurs saying that the file is not executable, remember to make it executable with the command:

chmod 777 "filename"

Data Processing

Once the jobs have been submitted and the calculation has run, the computer sends the result files to the final directory defined in the job template. The run times can be found in the log files.

Efficiency calculation

The objective is to get as close as possible to the theoretical efficiency, which is 1. To obtain the efficiency of the studied test case, the following equation is used:

  • Tref : run time on the reference number of cores. The reference number of cores is defined as:

  • Tn : run time on n cores.
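The formula itself appears as an image on the original page. Assuming the usual weak-scaling convention (constant load per core, so the ideal run time stays constant), a consistent definition is:

\[
E(n) = \frac{T_{ref}}{T_n}
\]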

(Figure: efficiency as a function of the number of cores.)

The efficiency is plotted as a function of the number of cores. The principle of weak scaling is to keep a constant load per core, so the total size of the mesh increases with the number of cores (for instance, a 40x40x40 block per core on 8x8x8 cores gives a total mesh of 320x320x320 points). This explains why the efficiency does not increase and stabilizes beyond a certain number of cores.

Speed-up calculation

The theoretical speed-up is calculated as:

  • n : number of cores used.
  • nref : reference number of cores.

The objective is to obtain a speed-up that follows the theoretical speed-up curve. The equation used to determine the measured speed-up is the following:
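Both formulas appear as images on the original page. With the notations above, a pair of definitions consistent with the values given further down (an assumption, not taken verbatim from the source) is:

\[
S_{th}(n) = \frac{n}{n_{ref}}, \qquad S(n) = \frac{n}{n_{ref}}\,\frac{T_{ref}}{T_n} = S_{th}(n)\, E(n)
\]

For instance, with nref = 8, the 40x40x40 case gives 0.55 x 512/8 = 35.2, close to the speed-up of 35.13 reported in the table below.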

(Figure: speed-up as a function of the number of cores.)

Determination of the optimal block size

An optimal zone for the block size is determined with the strong scaling. Weak-scaling tests are then run on the various block sizes within this range. It is then enough to look at the speed-up and efficiency curves and to determine the block size that gives the best efficiency together with the best speed-up.

Example :

After analysing the strong scaling, it was deduced that the optimal block size lies between 20x20x20 and 40x40x40.

Block size             20x20x20   32x32x32   40x40x40
Efficiency             0.30       0.49       0.55
Speed-up (512 cores)   19.38      31.47      35.13

(Figures: speed-up (acceleration) curves for block sizes 20x20x20, 32x32x32 and 40x40x40.)

In our case, the optimal block size is 40x40x40, since it gives a better efficiency and a better speed-up for a higher number of points than the other block sizes.

Profiling

Profiling is a way to optimize the performance of the code. It consists in determining the parts that use the most resources.

With the command "gprof", it is possible to create an execution report. The executable that runs the case must be compiled with the option -pg, then executed in order to produce the report file, gmon.out. To read this file, the following command is typed:

gprof NAME gmon.out > profilage.txt

A text file named profilage.txt is created. It retrieves the data from the file gmon.out in readable form. NAME corresponds to the name of the executable that runs the case.

The profiling report is divided into two parts:

  • Call-graph profile: list of the routines in decreasing order of consumed CPU time, together with the routines called from them.
  • Flat profile: list of the routines in decreasing order of consumed CPU time and of the number of calls.

Analysis of the flat profile

Description
  • % time : percentage of the total CPU time consumed by the routine.
  • Cumulative seconds : running sum of the CPU time.
  • Self seconds : CPU time spent in the routine itself.
  • Calls : number of calls to the routine.
  • Self ms/call : average CPU time consumed per call to the routine (exclusive time).
  • Total ms/call : total time (exclusive + inclusive) consumed per call to the routine.
Analysis

After testing a calculation on various combinations of processors, it clearly appears that the most time-consuming routines are:

  • der_sub_MOD_interpolate1_on_slice
  • der_sub_MOD_derivative2_on_slice
  • all_solver_MOD_momentum
  • der_sub_MOD_interpolate2_on_slice

For instance, for a calculation on 4x1x1 processors, the flat profile is the following:

(Figure: flat profile for a calculation on 4x1x1 processors.)

It indicates that the subroutine derivative2 on slice represents 20% of the total CPU time consumed. From the cumulative seconds, we can deduce that the total CPU time consumed equals 0.25 second. The CPU time spent in the subroutine derivative2 on slice itself is about 0.05 second. This subroutine is invoked 545 times for 10 iterations. The average time spent in this subroutine and its children per call is 0.09 ms.

Remark: if the calculation is run on 2x2x1 processors instead of 4x1x1, the number of calls to the derivative and interpolate periodic subroutines is divided by 2, while the number of calls to the derivative and interpolate on slice subroutines increases.

Analysis of the call-graph profile

This part is very interesting because it shows which routine calls which other routines, and how much time is spent in them.

The routines are indexed in decreasing order of CPU time consumption.

This report details the percentage of the total time that was spent in a function and its children. The sum of these percentages is not equal to 100%. For instance, the % run time of the main is 100% because every other routine runs under it, while the routine solving the time stepping has a run time of 97.3%. The report also gives the amount of time that was propagated from one function into another; its meaning depends on the viewpoint (as explained when the command gprof is run):

  • Current function (the function which has the index number at left) :
    • Self : this information corresponds to the total amount of time spent in this function.
    • Children : it represents the total amount of time propagated into this function by its children.
  • Function's parents (functions which call the current function, above the current function):
    • Self : it is the amount of time that was propagated from the current function into this parent.
    • Children : it is the amount of time that was propagated from the current function's children into this parent.
  • Function's children (functions which are called by the current function, below the current function) :
    • Self : it is the amount of time that was propagated directly from the child into the current function.
    • Children : it is the amount of time that was propagated from the child's children to the current function.

Remark : if the function's parents are not determined, then the term "spontaneous" is used.

The number of calls from one routine to another is also available.


(Figure: extract of the call-graph profile.)

For instance, if we consider the subroutine der_sub_MOD_echange [5], we can see that the parent functions are der_MOD_interpolate2 [16], der_MOD_derivative1 [15], der_MOD_derivative2 [11], der_MOD_interpolate1 [8].

This subroutine is called 1014 times for 10 iterations by the routines der_MOD_interpolate2 (93 calls), der_MOD_derivative1 (174 calls), der_MOD_derivative2 (327 calls) and der_MOD_interpolate1 (420 calls). The time spent in this function is about 0.01 second.

The children of this subroutine are the interpolate and derivative subroutines, in their periodic and slice versions.

The total amount of time propagated into this subroutine by its children is 0.13 second.

We can see that the subroutine echange calls the subroutine derivative2 on slice 545 times. The latter is called only by the subroutine echange, since 545/545 is indicated.

The subroutine echange accounts for 56% of the total run time.

Optimization of communication time

The run time consists of the calculation time and the communication time. Profiling also aims at minimizing this communication time, which can be decomposed into latency, overhead and transfer time.

The overhead is the time needed to prepare a message.

How to optimize the communication time?

The first step is to overlap the communications with the calculations.

Then, it is necessary to avoid copying messages into buffer memory areas.

It is also necessary to minimize the overhead induced by repeated calls to the communication subroutines.

Finally, it is better to send a single big message rather than several small messages.
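As an illustration of the last point only (the routine name, buffer layout and sizes are hypothetical and not taken from Allegro), several small arrays can be packed into a single buffer and sent in one message instead of three:

! hypothetical helper: pack three small arrays into one buffer and send them
! in a single message instead of three separate sends
subroutine send_packed(a, b, c, n, dest, comm)
  use mpi
  implicit none
  integer, intent(in) :: n, dest, comm
  double precision, intent(in) :: a(n), b(n), c(n)
  double precision :: buffer(3*n)
  integer :: ierr

  buffer(1:n)       = a            ! copy the three arrays contiguously
  buffer(n+1:2*n)   = b
  buffer(2*n+1:3*n) = c
  call MPI_SSEND(buffer, 3*n, MPI_DOUBLE_PRECISION, dest, 0, comm, ierr)
end subroutine send_packed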

How to know the execution time of a routine?

This information is given by the MPI function MPI_WTIME().

This function returns the elapsed wall-clock time at the moment it is called. Therefore, if it is called at the beginning and at the end of a routine, the run time of that routine can be deduced.
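A minimal Fortran sketch of this timing technique (the dummy loop only stands in for the routine to be timed; it is not part of Allegro):

program timing_example
  use mpi
  implicit none
  integer :: ierr, i
  double precision :: t_start, t_end, s

  call MPI_INIT(ierr)
  t_start = MPI_WTIME()              ! wall-clock time before the timed section
  s = 0.d0
  do i = 1, 1000000                  ! dummy work standing in for a routine
     s = s + dble(i)
  end do
  t_end = MPI_WTIME()                ! wall-clock time after the timed section
  print *, 'Run time (s):', t_end - t_start, ' (checksum:', s, ')'
  call MPI_FINALIZE(ierr)
end program timing_example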

Which mode is preferable for sending and receiving messages?

Standard send: if MPI makes a buffered copy of the message, the send completes when this copy is finished. Otherwise, the send completes when the reception of the message is finished.

Blocking synchronous send: this is the mode chosen in Allegro. It uses the subroutines MPI_SSEND() and MPI_RECV().

The blocking synchronous send avoids buffered copies of messages, and in this way it also avoids overhead.

Thanks to it, a factor 2 is gained with regard to the standard mode.

Non-blocking synchronous send: this mode uses the subroutines MPI_ISSEND() and MPI_IRECV().

The non-blocking mode allows the communications to be overlapped with the calculations. The send of a message completes when the reception has been posted and the reading of the message is finished.

The send is coupled with the reception, so a factor 3 is gained with regard to the standard send.

For this mode to work correctly, it is necessary to add MPI_WAIT() after MPI_ISSEND() and MPI_IRECV(). This synchronizes the process until the request completes.
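A minimal Fortran sketch of this non-blocking pattern (the ring exchange, buffer size and message tag are illustrative assumptions, not the actual Allegro communication scheme):

program nonblocking_exchange
  use mpi
  implicit none
  integer, parameter :: n = 100
  integer :: ierr, rank, nprocs, partner, req_send, req_recv
  integer :: status_send(MPI_STATUS_SIZE), status_recv(MPI_STATUS_SIZE)
  double precision :: sendbuf(n), recvbuf(n)

  call MPI_INIT(ierr)
  call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
  call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)

  sendbuf = dble(rank)
  partner = mod(rank + 1, nprocs)    ! simple ring neighbour, for illustration only

  ! post the non-blocking synchronous send and the non-blocking receive
  call MPI_ISSEND(sendbuf, n, MPI_DOUBLE_PRECISION, partner, 0, MPI_COMM_WORLD, req_send, ierr)
  call MPI_IRECV(recvbuf, n, MPI_DOUBLE_PRECISION, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, req_recv, ierr)

  ! ... computations that do not touch sendbuf or recvbuf can be overlapped here ...

  ! MPI_WAIT synchronizes the process until each request completes
  call MPI_WAIT(req_send, status_send, ierr)
  call MPI_WAIT(req_recv, status_recv, ierr)

  call MPI_FINALIZE(ierr)
end program nonblocking_exchange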

Model of calculation

Time discretization

A third-order Runge-Kutta method is used. It improves the precision of the time integration and the stability.

The equations used in H-Allegro are defined below:

is the matrix of raw conservative data at the ith step and at time t.

is the same matrix as before at time .

and are Np x Nvar matrices, with Np the number of points and Nvar the number of variables.
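The update formulas themselves appear as images on the original page. As an illustration only (the exact coefficients used in H-Allegro may differ), a classical third-order Runge-Kutta scheme applied to the Np x Nvar matrix Q of conservative variables, with R(Q) the discretized right-hand side and the time step Delta t, reads:

\[
Q^{(1)} = Q^{n} + \Delta t\, R(Q^{n}), \qquad Q^{(2)} = \tfrac{3}{4} Q^{n} + \tfrac{1}{4}\left(Q^{(1)} + \Delta t\, R(Q^{(1)})\right), \qquad Q^{n+1} = \tfrac{1}{3} Q^{n} + \tfrac{2}{3}\left(Q^{(2)} + \Delta t\, R(Q^{(2)})\right)
\]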

Spatial discretization

All_init.f90

This file allows the user to initialize the problem, that is to say to give the initial conditions. For example a pressure peak can be defined, the density or the temperature can be fixed or not, and so on.

In order to describe a function (for instance a pressure function) on all the points of the scalar grid S, two loops are needed to vary the indexes I and J in the directions x and y (in a 2D case; in a 3D case a third loop over the index K is added). These indexes give the node coordinates. Since the indexes are local to the processors and not to the physical problem, they cannot be used directly to code a case. That is why the variables XP2(I,1) and XP2(J,2) are used: I indicates the position in the direction x, J indicates the position in the direction y, and 1 and 2 indicate the first and second directions. A hypothetical sketch of this loop structure is given at the end of this section.

The velocity is described by WK1 and WK2 if the case is two-dimensional, plus WK3 if the case is 3D. In our test case with a pressure peak, these components are taken as zero.

The conservative variable Q(I,J,K,5), which corresponds to the energy (the index 5 indicates the energy), is always defined as:


The conservative variable Q(I,J,K,4) corresponds to the density.

The routine interpolate1 is called in order to transfer a scalar from the grid S to the grid of one momentum component. Thanks to this routine, the velocities are expressed on the V1, V2 and V3 grids.
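As a purely illustrative sketch of this loop structure (the Gaussian pressure peak, the grid size and all constants are assumptions; only the I/J indexes, the XP2(I,1)/XP2(J,2) coordinates and the Q(I,J,K,4)/Q(I,J,K,5) conventions come from the description above), a 2D initialization could look like:

program init_pressure_peak_sketch
  implicit none
  ! hypothetical sizes and constants, for illustration only
  integer, parameter :: NX = 64, NY = 64
  double precision, parameter :: P0 = 1.0d5, DeltaP = 1.0d4
  double precision, parameter :: x0 = 0.5d0, y0 = 0.5d0, r0 = 0.1d0
  integer :: I, J
  double precision :: XP2(NX, 2)       ! node coordinates per direction
  double precision :: Pres(NX, NY)     ! pressure field with a central peak
  double precision :: x, y

  ! fill the coordinates with a uniform grid on [0,1] (illustrative)
  do I = 1, NX
     XP2(I, 1) = dble(I - 1) / dble(NX - 1)
     XP2(I, 2) = dble(I - 1) / dble(NY - 1)
  end do

  ! two nested loops over the indexes I and J, as described in the text
  do J = 1, NY
     do I = 1, NX
        x = XP2(I, 1)                  ! position in the direction x
        y = XP2(J, 2)                  ! position in the direction y
        Pres(I, J) = P0 + DeltaP * exp(-((x - x0)**2 + (y - y0)**2) / r0**2)
        ! in the real code the density would be stored in Q(I,J,K,4) and the
        ! total energy, built from the pressure and the (here zero) velocity
        ! components WK1 and WK2, in Q(I,J,K,5)
     end do
  end do

  print *, 'Pressure near the peak location:', Pres(NX/2, NY/2)
end program init_pressure_peak_sketch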

Meso-combustor case

This case consists of fuel-air mixing in a three-dimensional meso-combustor.

  • all_chemestry.f90 : this file adjusts the flame speed according to the local equivalence ratio for the laminar cases. If the case does not use laminar chemistry, then the function XK_CH4_Air(Phi) must be deleted:
WK1(IP) = EXP(TaM/Tem(IP)+Ta_over_Tb)*(Q(IP,IF)*Q(IP,IO)/Q(IP,4))*XK_CH4_Air(Phi)
  • all_nscbcmezopropre.f90 : Navier-Stokes characteristic boundary conditions on the faces.

Neumann conditions for the wall:

Three points are used to describe the domain: P, PP and PPP. P is on the face, PP and PPP are in the volume. For these points we act on Y and not on , so it is necessary to divide by .

  • communs.f90 : coordinates of the centers of the fuel and air inlets; this file contains the variables common to several files.
  • all_init_propre.f90 : this file is used to start a calculation. Twall and Twall2 make it possible to create a fresh-gas core surrounded by warm gas. A linear regression is used to describe the temperature close to the wall if . The fuel and air inlets are defined (radius, fluid velocities, temperature, initial mass fractions).
  • all_init_restart.f90 : this file is used to resume a calculation from a save file. If you want to run the case from a save file, enter the value 0 for the variable I_fresh_run (cf. fort.10) and indicate the number of the chosen save file. If you want to run the case from a save file and modify some properties (for instance the wall temperature), enter the value 2 for the variable I_fresh_run. In this file, the fuel and air inlets are not defined. Be careful: two interpolations (one before and one after the index loops) are performed instead of only one. This preserves the mass when a calculation is restarted.
  • all_solver.f90 : this file is used to solve the case (error treatment, continuity, convection and so on). A viscosity patch must be added.
  • fort.11 : list of variables which are post-processed.
  • allegro.f90 : initializes the calculation, allocates variables, manages the restart of a calculation, the number of iterations and so on.

All_post.f90

  • Parameters : variables definition (, , , XNuF_MF, XNuO_MO, YF_O, YO_O).
  • Dimensions and allocations.
  • Init Grid Derivative :
    • Average calculation : compteur = Xend - Xstart
    • Y and id writing.
    • Vtk files creation (Ichoice = 1, paraview).
    • Gnu files creation (Ichoice = 2, gnuplot).
    • Slices creation, vtk files (Ichoice = 3, paraview). Be careful, this option is only available for a three-dimensional case.
    • Calculated fields writing.
  • Lectus post :
    • Reading of the number of quantities to be calculated (cf. fort.10, fort.11).
    • Counting of the number of lines.
    • Reading of the names of the variables to be calculated.
    • Reading of the names of the files to be read.
    • Post-processing data reading (Prefix post, Ichoice, Xstart).
  • Init :
    • Velocity field initialization on grid Vi.
    • Velocity gradient initialization on grid S or Vij.
    • Velocity divergence initialization on grid S.
    • Temperature and pressure initialization on grid S.
    • Temperature gradient initialization on grid Vj.
    • Viscosity initialization on grid S. Be careful, it is the dynamic viscosity, .
  • Getvect :
    • Calculation of WQ(IP,Icmp), with Icmp varying from 1 to Ncmp :

if REDIM = 0 , else

  • Allocate Fields :
    • Nfields calculation in function of the variables :

, or , or

    • This subroutine defines the variables which can be post-processed. Check that the variables correspond to what you want. The available variables are : Pres, Temp, Rho, Rho U, Rho E, Y, U, We, Mach, Mach T, Mach P, Vel, SL, Grad T, Grad Z, Grad Y1, Div U, Rot U, e, S, FP, Z, C, Visc, Dilut, Q criteria, tak, c, Epsi.
  • Compute Fields :
    • Calculation of curls, vectors, scalars.
    • Calculation of WQ(IP,Ifield).
  • Read Fields :
    • Vtk files construction.
  • Vorticity :
    • The curl calculation in function of Ndim : RotX, RotY, RotZ.
  • Critere Q :
  • Takeno (cf : Thesis of Marianne Sjostrand)
    • Takeno index varies from -1 to 1. It is defined as :
  • Grad Grad :
    • The index of Takeno is defined as :
  • Grad Croise :
    • The index of Takeno is defined as :
  • Project :
    • Calculation of projection :
  • Dilution :
    • Calculation of the coefficient D :
  • Dissipation :
    • Calculation with the instantaneous velocity field.