Marconi 100 - CINECA
- Model: IBM Power AC922 (Whiterspoon)
- Racks: 55 total (49 compute)
- Nodes: 980
- Processors: 2x16 cores IBM POWER9 AC922 at 2.6(3.1) GHz
- Accelerators: 4 x NVIDIA Volta V100 GPUs/node, Nvlink 2.0, 16GB
- Cores: 32 cores/node, Hyperthreading x4
- RAM: 256 GB/node (242 usable)
- Peak Performance: about 32 Pflop/s, 32 TFlops per node
- Internal Network: Mellanox IB EDR DragonFly++ 100Gb/s
- Disk Space: 8PB raw GPFS storage
Metrics
This Section is a brief description of some of the metrics collected by ExaMon from the Marconi100 cluster. It is intended only as an example and is therefore not exhaustive. The Marconi, Galileo and Galileo 100 clusters have similar metrics.
IPMI
The following table describes the metrics collected by the ipmi_pub plugin.
Metric Name | Description | Unit |
pX_coreY_temp | Temperature of core n. Y in the CPU socket n. X. X=0..1, Y=0..23 | °C |
dimmX_temp | Temperature of DIMM module n. X. X=0..15 | °C |
gpuX_core_temp | Temperature of the core for the GPU id X. X=0,1,3,4 | °C |
gpuX_mem_temp | Temperature of the memory for the GPU id X. X=0,1,3,4 | °C |
fanX_Y | Speed of the Fan Y in module X. X=0..3, Y=0,1 | RPM |
pX_vdd_temp | Temperature of the voltage regulator for the CPU socket n. X. X=0..1 | °C |
fan_disk_power | Power consumption of the disk fan | W |
pX_io_power | Power consumption for the I/O subsystem for the CPU socket n. X. X=0..1 | W |
pX_mem_power | Power consumption for the memory subsystem for the CPU socket n. X. X=0..1 | W |
pX_power | Power consumption for the CPU socket n. X. X=0..1 | W |
psX_input_power | Power consumption at the input of power supply n. X. X=0..1 | W |
total_power | Total node power consumption | W |
psX_input_voltag | Voltage at the input of power supply n. X. X=0..1 | V |
psX_output_volta | Voltage at the output of power supply n. X. X=0..1 | V |
psX_output_curre | Current at the output of power supply n. X. X=0..1 | A |
pcie | Temperature at the PCIExpress slots | °C |
ambient | Temperature at the node inlet | °C |
Ganglia
The following table describes the metrics collected by the ganglia_pub plugin. The data are extracted from a Ganglia^([6]) instance that CINECA runs on Marconi100.
Metric name | Type | Unit | Description |
gexec | core | gexec available | |
cpu_aidle | cpu | % | Percent of time since boot idle CPU |
cpu_idle | cpu | % | Percentage of time that the CPU or CPUs were idle and the system did not have an outstanding disk I/O request |
cpu_nice | cpu | % | Percentage of CPU utilization that occurred while executing at the user level with nice priority |
cpu_speed | cpu | MHz | CPU Speed in terms of MHz |
cpu_steal | cpu | % | cpu_steal |
cpu_system | cpu | % | Percentage of CPU utilization that occurred while executing at the system level |
cpu_user | cpu | % | Percentage of CPU utilization that occurred while executing at the user level |
cpu_wio | cpu | % | Percentage of time that the CPU or CPUs were idle during which the system had an outstanding disk I/O request |
cpu_num | |||
disk_free | disk | GB | Total free disk space |
disk_total | disk | GB | Total available disk space |
part_max_used | disk | % | Maximum percent used for all partitions |
load_fifteen | load | Fifteen minute load average | |
load_five | load | Five minute load average | |
load_one | load | One minute load average | |
mem_buffers | memory | KB | Amount of buffered memory |
mem_cached | memory | KB | Amount of cached memory |
mem_free | memory | KB | Amount of available memory |
mem_shared | memory | KB | Amount of shared memory |
mem_total | memory | KB | Total amount of memory displayed in KBs |
swap_free | memory | KB | Amount of available swap memory |
swap_total | memory | KB | Total amount of swap space displayed in KBs |
bytes_in | network | bytes/sec | Number of bytes in per second |
bytes_out | network | bytes/sec | Number of bytes out per second |
pkts_in | network | packets/sec | Packets in per second |
pkts_out | network | packets/sec | Packets out per second |
proc_run | process | Total number of running processes | |
proc_total | process | Total number of processes | |
boottime | system | s | The last time that the system was started |
machine_type | system | System architecture | |
os_name | system | Operating system name | |
os_release | system | Operating system release date | |
cpu_ctxt | cpu | ctxs/sec | Context Switches |
cpu_intr | cpu | % | cpu_intr |
cpu_sintr | cpu | % | cpu_sintr |
multicpu_idle0 | cpu | % | Percentage of CPU utilization that occurred while executing at the idle level |
procs_blocked | cpu | processes | Processes blocked |
procs_created | cpu | proc/sec | Number of processes and threads created |
disk_free_absolute_developers | disk | GB | Disk space available (GB) on /developers |
disk_free_percent_developers | disk | % | Disk space available (%) on /developers |
diskstat_sda_io_time | diskstat | s | The time in seconds spent in I/O operations |
diskstat_sda_percent_io_time | diskstat | percent | The percent of disk time spent on I/O operations |
diskstat_sda_read_bytes_per_sec | diskstat | bytes/sec | The number of bytes read per second |
diskstat_sda_reads_merged | diskstat | reads | The number of reads merged. Reads which are adjacent to each other may be merged for efficiency. Multiple reads may become one before it is handed to the disk, and it will be counted (and queued) as only one I/O. |
diskstat_sda_reads | diskstat | reads | The number of reads completed |
diskstat_sda_read_time | diskstat | s | The time in seconds spent reading |
diskstat_sda_weighted_io_time | diskstat | s | The weighted time in seconds spend in I/O operations. This measures each I/O start, I/O completion, I/O merge, or read of these stats by the number of I/O operations in progress times the number of seconds spent doing I/O. |
diskstat_sda_write_bytes_per_sec | diskstat | bytes/sec | The number of bytes written per second |
diskstat_sda_writes_merged | diskstat | writes | The number of writes merged. Writes which are adjacent to each other may be merged for efficiency. Multiple writes may become one before it is handed to the disk, and it will be counted (and queued) as only one I/O. |
diskstat_sda_writes | diskstat | writes | The number of writes completed |
diskstat_sda_write_time | diskstat | s | The time in seconds spent writing |
ipmi_ambient_temp | ipmi | C | IPMI data |
ipmi_avg_power | ipmi | Watts | IPMI data |
ipmi_cpu1_temp | ipmi | C | IPMI data |
ipmi_cpu2_temp | ipmi | C | IPMI data |
ipmi_gpu_outlet_temp | ipmi | C | IPMI data |
ipmi_hdd_inlet_temp | ipmi | C | IPMI data |
ipmi_pch_temp | ipmi | C | IPMI data |
ipmi_pci_riser_1_temp | ipmi | C | IPMI data |
ipmi_pci_riser_2_temp | ipmi | C | IPMI data |
ipmi_pib_ambient_temp | ipmi | C | IPMI data |
mem_anonpages | memory | Bytes | AnonPages |
mem_dirty | memory | Bytes | The total amount of memory waiting to be written back to the disk. |
mem_hardware_corrupted | memory | Bytes | HardwareCorrupted |
mem_mapped | memory | Bytes | Mapped |
mem_writeback | memory | Bytes | The total amount of memory actively being written back to the disk. |
vm_pgmajfault | memory_vm | ops/s | pgmajfault |
vm_pgpgin | memory_vm | ops/s | pgpgin |
vm_pgpgout | memory_vm | ops/s | pgpgout |
vm_vmeff | memory_vm | pct | VM efficiency |
rx_bytes_eth0 | network | bytes/sec | received bytes per sec |
rx_drops_eth0 | network | pkts/sec | receive packets dropped per sec |
rx_errs_eth0 | network | pkts/sec | received error packets per sec |
rx_pkts_eth0 | network | pkts/sec | received packets per sec |
tx_bytes_eth0 | network | bytes/sec | transmitted bytes per sec |
tx_drops_eth0 | network | pkts/sec | transmitted dropped packets per sec |
tx_errs_eth0 | network | pkts/sec | transmitted error packets per sec |
tx_pkts_eth0 | network | pkts/sec | transmitted packets per sec |
procstat_gmond_cpu | procstat | percent | The total percent CPU utilization |
procstat_gmond_mem | procstat | B | The total memory utilization |
softirq_blockiopoll | softirq | ops/s | Soft Interrupts |
softirq_block | softirq | ops/s | Soft Interrupts |
softirq_hi | softirq | ops/s | Soft Interrupts |
softirq_hrtimer | softirq | ops/s | Soft Interrupts |
softirq_netrx | softirq | ops/s | Soft Interrupts |
softirq_nettx | softirq | ops/s | Soft Interrupts |
softirq_rcu | softirq | ops/s | Soft Interrupts |
softirq_sched | softirq | ops/s | Soft Interrupts |
softirq_tasklet | softirq | ops/s | Soft Interrupts |
softirq_timer | softirq | ops/s | Soft Interrupts |
entropy_avail | ssl | bits | Entropy Available |
tcpext_listendrops | tcpext | count/s | listendrops |
tcpext_tcploss_percentage | tcpext | pct | TCP percentage loss, tcploss / insegs + outsegs |
tcp_attemptfails | tcp | count/s | attempt fails |
tcp_insegs | tcp | count/s | insegs |
tcp_outsegs | tcp | count/s | outsegs |
tcp_retrans_percentage | tcp | pct | TCP retrans percentage, retranssegs / insegs + outsegs |
udp_indatagrams | udp | count/s | indatagrams |
udp_inerrors | udp | count/s | inerrors |
udp_outdatagrams | udp | count/s | outdatagrams |
multicpu_idle16 | cpu | % | Percentage of CPU utilization that occurred while executing at the idle level |
multicpu_steal16 | cpu | % | Percentage of CPU preempted by the hypervisor |
multicpu_system16 | cpu | % | Percentage of CPU utilization that occurred while executing at the system level |
multicpu_user16 | cpu | % | Percentage of CPU utilization that occurred while executing at the user level |
multicpu_wio16 | cpu | % | Percentage of CPU utilization that occurred while executing at the wio level |
diskstat_sdb_io_time | diskstat | s | The time in seconds spent in I/O operations |
diskstat_sdb_percent_io_time | diskstat | percent | The percent of disk time spent on I/O operations |
diskstat_sdb_read_bytes_per_sec | diskstat | bytes/sec | The number of bytes read per second |
diskstat_sdb_reads_merged | diskstat | reads | The number of reads merged. Reads which are adjacent to each other may be merged for efficiency. Multiple reads may become one before it is handed to the disk, and it will be counted (and queued) as only one I/O. |
diskstat_sdb_reads | diskstat | reads | The number of reads completed |
diskstat_sdb_read_time | diskstat | s | The time in seconds spent reading |
diskstat_sdb_weighted_io_time | diskstat | s | The weighted time in seconds spend in I/O operations. This measures each I/O start, I/O completion, I/O merge, or read of these stats by the number of I/O operations in progress times the number of seconds spent doing I/O. |
diskstat_sdb_write_bytes_per_sec | diskstat | bytes/sec | The number of bytes written per second |
diskstat_sdb_writes_merged | diskstat | writes | The number of writes merged. Writes which are adjacent to each other may be merged for efficiency. Multiple writes may become one before it is handed to the disk, and it will be counted (and queued) as only one I/O. |
diskstat_sdb_writes | diskstat | writes | The number of writes completed |
diskstat_sdb_write_time | diskstat | s | The time in seconds spent writing |
GpuX_dec_utilization | gpu | % | X=0,..,3 |
GpuX_enc_utilization | gpu | % | X=0,..,3 |
GpuX_enforced_power_limit | gpu | Watts | X=0,..,3 |
GpuX_gpu_temp | gpu | Celsius | X=0,..,3 |
GpuX_low_util_violation | gpu | X=0,..,3 | |
GpuX_mem_copy_utilization | gpu | % | X=0,..,3 |
GpuX_mem_util_samples | gpu | X=0,..,3 | |
GpuX_memory_clock | gpu | Mhz | X=0,..,3 |
GpuX_memory_temp | gpu | Celsius | X=0,..,3 |
GpuX_power_management_limit | gpu | Watts | X=0,..,3 |
GpuX_power_usage | gpu | Watts | X=0,..,3 |
GpuX_pstate | gpu | X=0,..,3 | |
GpuX_reliability_violation | gpu | X=0,..,3 | |
GpuX_sm_clock | gpu | Mhz | X=0,..,3 |
Nagios
This is a description of the metrics collected by the ExaMon "nagios_pub" plugin. The data reflect those monitored by the Nagios^([7]) tool that currently runs in the CINECA clusters. Specifically, the plugin interfaces with a Nagios extension developed by CINECA called "Hnagios"^([8]). Although the monitored services and metrics are similar between all clusters, here we will specifically discuss those of Marconi100.
Metrics
Currently, this plugin collects three metrics
name | |
0 | hostscheduleddowtimecomments |
1 | plugin_output |
2 | state |
Hostscheduleddowtimecomments
This metric is obtained from the "Hnagios" output and reports comments made by system administrators about the maintenance status of the specific monitored resource
name | tag key | tag values | |
0 | hostscheduleddowtimecomments | node | [ems02, login03, login08, master01, master02, ... |
1 | hostscheduleddowtimecomments | slot | [01, 02, 03, 04, 05, 06, 07, 08, 09, 10, 11, 1... |
2 | hostscheduleddowtimecomments | description | [afs::blocked_conn::status, afs::bosserver::st... |
3 | hostscheduleddowtimecomments | plugin | [nagios_pub] |
4 | hostscheduleddowtimecomments | chnl | [data] |
5 | hostscheduleddowtimecomments | host_group | [compute, compute,cincompute, efgwcompute, efg... |
6 | hostscheduleddowtimecomments | cluster | [galileo, marconi, marconi100] |
7 | hostscheduleddowtimecomments | state | [0, 1, 2, 3] |
8 | hostscheduleddowtimecomments | nagiosdrained | [0, 1] |
9 | hostscheduleddowtimecomments | org | [cineca] |
10 | hostscheduleddowtimecomments | state_type | [0, 1] |
11 | hostscheduleddowtimecomments | rack | [205, 206, 207, 208, 209, 210, 211, 212, 213, ... |
Plugin_output
This metric collects the outbound messages from Nagios agents responsible for monitoring services.
name | tag key | tag values | |
0 | plugin_output | node | [ems02, ethcore01-mgt, ethcore02-mgt, gss03, g... |
1 | plugin_output | slot | [01, 02, 03, 04, 05, 06, 07, 08, 09, 10, 11, 1... |
2 | plugin_output | description | [EFGW_cluster::status::availability, EFGW_clus... |
3 | plugin_output | plugin | [nagios_pub] |
4 | plugin_output | chnl | [data] |
5 | plugin_output | host_group | [compute, compute,cincompute, containers, cumu... |
6 | plugin_output | cluster | [galileo, marconi, marconi100] |
7 | plugin_output | state | [0, 1, 2, 3] |
8 | plugin_output | nagiosdrained | [0, 1] |
9 | plugin_output | org | [cineca] |
10 | plugin_output | state_type | [0, 1] |
11 | plugin_output | rack | [202, 205, 206, 207, 208, 209, 210, 211, 212, ... |
State
This metric collects the equivalent numerical value of the actual state of the service monitored by Nagios.
name | tag key | tag values | |
0 | state | node | [ems02, ethcore01-mgt, ethcore02-mgt, gss03, g... |
1 | state | slot | [01, 02, 03, 04, 05, 06, 07, 08, 09, 10, 11, 1... |
2 | state | description | [EFGW_cluster::status::availability, EFGW_clus... |
3 | state | plugin | [nagios_pub] |
4 | state | chnl | [data] |
5 | state | host_group | [compute, compute,cincompute, containers, cumu... |
6 | state | cluster | [galileo, marconi, marconi100] |
7 | state | nagiosdrained | [0, 1] |
8 | state | org | [cineca] |
9 | state | state_type | [0, 1] |
10 | state | rack | [202, 205, 206, 207, 208, 209, 210, 211, 212, ... |
Resources monitored in Marconi100
The name and type of the services/resources monitored by Nagios and corresponding to the metrics just described above are collected in the "description" tag.
Nagios checks for Marconi100
In the following table is collected a brief description of the services monitored by Nagios in the Marconi100 cluster.
Service/resource | Description |
alive::ping | Ping command output |
backup::local::status | Backup service |
batchs::... | Batch scheduler services |
bmc::events | Events from the node BMC |
cluster::... | Cluster availability |
container::... | Status of the container system |
dev::... | Node devices |
file::integrity | Files integrity |
filesys::... | Filesystem elements |
galera::... | Status of the database components |
globus::... | Status of the FTP system |
memory::phys::total | Physical memory size |
monitoring::health | Monitoring subsystem |
net::ib::status | Infiniband |
nfs::rpc::status | NFS |
nvidia::... | GPUs |
service::... | Misc. services |
ssh::... | SSH server |
sys::... | Misc. systems (GPFS,...) |
Nagios state encoding
This table describes the numerical encoding of the state metric values and the state_type tag, as defined by Nagios.
[TABLE]
Nvidia
The following table describes the metrics collected by the nvidia_pub plugin.
PLEASE NOTE This plugin has collected data only for a short period (January/February 2020) and is currently not enabled due to CINECA policy.
Metric name | Description | Unit |
clock.sm | Current frequency of SM (Streaming Multiprocessor) clock. | MHz |
clocks.gr | Current frequency of graphics (shader) clock. | MHz |
clocks.mem | Current frequency of memory clock. | MHz |
clocks_throttle_reasons.active | Bitmask of active clock throttle reasons. See nvml.h for more details | |
power.draw | The last measured power draw for the entire board, in watts. Only available if power management is supported. This reading is accurate to within +/- 5 watts. | W |
temperature.gpu | Core GPU temperature. in degrees C. | °C |
Slurm
Currently the job scheduler data is collected as per-job data in plain Cassandra tables.
This is a description of the data currently stored (where available) for each executed job:
Table fields | Description |
account | charge to specified account |
accrue_time | time job is eligible for running |
admin_comment | administrator's arbitrary comment |
alloc_node | local node and system id making the resource allocation |
alloc_sid | local sid making resource alloc |
array_job_id | job_id of a job array or 0 if N/A |
array_max_tasks | Maximum number of running tasks |
array_task_id | task_id of a job array |
array_task_str | string expression of task IDs in this record |
assoc_id | association id for job |
batch_features | features required for batch script's node |
batch_flag | 1 if batch: queued job with script |
batch_host | name of host running batch script |
billable_tres | billable TRES cache. updated upon resize |
bitflags | Various job flags |
boards_per_node | boards per node required by job |
burst_buffer | burst buffer specifications |
burst_buffer_state | burst buffer state info |
command | command to be executed, built from submitted job's argv and NULL for salloc command |
comment | arbitrary comment |
contiguous | 1 if job requires contiguous nodes |
core_spec | specialized core count |
cores_per_socket | cores per socket required by job |
cpu_freq_gov | cpu frequency governor |
cpu_freq_max | Maximum cpu frequency |
cpu_freq_min | Minimum cpu frequency |
cpus_alloc_layout | map: list of cpu allocated per node |
cpus_allocated | map: number of cpu allocated per node |
cpus_per_task | number of processors required for each task |
cpus_per_tres | semicolon delimited list of TRES=# values |
dependency | synchronize job execution with other jobs |
derived_ec | highest exit code of all job steps |
eligible_time | time job is eligible for running |
end_time | time of termination, actual or expected |
exc_nodes | comma separated list of excluded nodes |
exit_code | exit code for job (status from wait call) |
features | comma separated list of required features |
group_id | group job submitted as |
job_id | job ID |
job_state | state of the job, see enum job_states |
last_sched_eval | last time job was evaluated for scheduling |
licenses | licenses required by the job |
max_cpus | maximum number of cpus usable by job |
max_nodes | maximum number of nodes usable by job |
mem_per_cpu | boolean |
mem_per_node | boolean |
mem_per_tres | semicolon delimited list of TRES=# values |
min_memory_cpu | minimum real memory required per allocated CPU |
min_memory_node | minimum real memory required per node |
name | name of the job |
network | network specification |
nice | requested priority change |
nodes | list of nodes allocated to job |
ntasks_per_board | number of tasks to invoke on each board |
ntasks_per_core | number of tasks to invoke on each core |
ntasks_per_core_str | number of tasks to invoke on each core as string |
ntasks_per_node | number of tasks to invoke on each node |
ntasks_per_socket | number of tasks to invoke on each socket |
ntasks_per_socket_str | number of tasks to invoke on each socket as string |
num_cpus | minimum number of cpus required by job |
num_nodes | minimum number of nodes required by job |
partition | name of assigned partition |
pn_min_cpus | minimum # CPUs per node, default=0 |
pn_min_memory | minimum real memory per node, default=0 |
pn_min_tmp_disk | minimum tmp disk per node, default=0 |
power_flags | power management flags, see SLURM_POWERFLAGS |
pre_sus_time | time job ran prior to last suspend |
preempt_time | preemption signal time |
priority | relative priority of the job, 0=held, 1=required nodes DOWN/DRAINED |
profile | Level of acct_gather_profile |
qos | Quality of Service |
reboot | node reboot requested before start |
req_nodes | comma separated list of required nodes |
req_switch | Minimum number of switches |
requeue | enable or disable job requeue option |
resize_time | time of latest size change |
restart_cnt | count of job restarts |
resv_name | reservation name |
run_time | job run time (seconds) |
run_time_str | job run time (seconds) as string |
sched_nodes | list of nodes scheduled to be used for job |
shared | 1 if job can share nodes with other jobs |
show_flags | conveys level of details requested |
sockets_per_board | sockets per board required by job |
sockets_per_node | sockets per node required by job |
start_time | time execution begins, actual or expected |
state_reason | reason job still pending or failed, see slurm.h:enum job_state_reason |
std_err | pathname of job's stderr file |
std_in | pathname of job's stdin file |
std_out | pathname of job's stdout file |
submit_time | time of job submission |
suspend_time | time job last suspended or resumed |
system_comment | slurmctld's arbitrary comment |
threads_per_core | threads per core required by job |
time_limit | maximum run time in minutes or INFINITE |
time_limit_str | maximum run time in minutes or INFINITE as string |
time_min | minimum run time in minutes or INFINITE |
tres_alloc_str | tres used in the job as string |
tres_bind | Task to TRES binding directives |
tres_freq | TRES frequency directives |
tres_per_job | semicolon delimited list of TRES=# values |
tres_per_node | semicolon delimited list of TRES=# values |
tres_per_socket | semicolon delimited list of TRES=# values |
tres_per_task | semicolon delimited list of TRES=# values |
tres_req_str | tres requested in the job as string |
user_id | user the job runs as |
wait4switch | Maximum time to wait for minimum switches |
wckey | wckey for job |
work_dir | pathname of working directory |