Marconi 100 - CINECA

  • Model: IBM Power AC922 (Witherspoon)
  • Racks: 55 total (49 compute)
  • Nodes: 980
  • Processors: 2 x 16-core IBM POWER9 at 2.6 GHz (3.1 GHz peak)
  • Accelerators: 4 x NVIDIA Volta V100 GPUs per node (16 GB each), NVLink 2.0
  • Cores: 32 cores/node, Hyperthreading x4
  • RAM: 256 GB/node (242 usable)
  • Peak Performance: about 32 PFlop/s (32 TFlop/s per node)
  • Internal Network: Mellanox IB EDR DragonFly++ 100Gb/s
  • Disk Space: 8PB raw GPFS storage
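
As a quick consistency check, the quoted peak performance follows directly from the node count and the per-node figure:

```python
# Sanity-check the quoted peak performance against the node specs.
nodes = 980
tflops_per_node = 32.0  # 2x POWER9 + 4x V100 per node

peak_pflops = nodes * tflops_per_node / 1000.0
print(f"{peak_pflops:.2f} PFlop/s")  # 31.36 PFlop/s, i.e. "about 32"
```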

Metrics

This section briefly describes some of the metrics that ExaMon collects from the Marconi100 cluster. It is intended only as an example and is therefore not exhaustive. The Marconi, Galileo and Galileo 100 clusters expose similar metrics.

IPMI

The following table describes the metrics collected by the ipmi_pub plugin.

| Metric name | Description | Unit |
|---|---|---|
| pX_coreY_temp | Temperature of core n. Y in CPU socket n. X. X=0..1, Y=0..23 | °C |
| dimmX_temp | Temperature of DIMM module n. X. X=0..15 | °C |
| gpuX_core_temp | Core temperature of the GPU with id X. X=0,1,3,4 | °C |
| gpuX_mem_temp | Memory temperature of the GPU with id X. X=0,1,3,4 | °C |
| fanX_Y | Speed of fan Y in module X. X=0..3, Y=0..1 | RPM |
| pX_vdd_temp | Temperature of the voltage regulator for CPU socket n. X. X=0..1 | °C |
| fan_disk_power | Power consumption of the disk fan | W |
| pX_io_power | Power consumption of the I/O subsystem for CPU socket n. X. X=0..1 | W |
| pX_mem_power | Power consumption of the memory subsystem for CPU socket n. X. X=0..1 | W |
| pX_power | Power consumption of CPU socket n. X. X=0..1 | W |
| psX_input_power | Power consumption at the input of power supply n. X. X=0..1 | W |
| total_power | Total node power consumption | W |
| psX_input_voltage | Voltage at the input of power supply n. X. X=0..1 | V |
| psX_output_voltage | Voltage at the output of power supply n. X. X=0..1 | V |
| psX_output_current | Current at the output of power supply n. X. X=0..1 | A |
| pcie | Temperature at the PCI Express slots | °C |
| ambient | Temperature at the node inlet | °C |
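
The X/Y placeholders in the table expand into one metric per physical component. The following sketch, using only the index ranges stated above, enumerates the per-component metric names:

```python
# Expand the IPMI metric-name patterns from the table above into
# concrete metric names, one per monitored component.
def expand_ipmi_names():
    names = []
    for x in range(2):                       # CPU sockets X = 0..1
        for y in range(24):                  # cores Y = 0..23
            names.append(f"p{x}_core{y}_temp")
        for m in ("vdd_temp", "io_power", "mem_power", "power"):
            names.append(f"p{x}_{m}")
    for x in range(16):                      # DIMM modules X = 0..15
        names.append(f"dimm{x}_temp")
    for x in (0, 1, 3, 4):                   # GPU ids (2 is unused)
        names.append(f"gpu{x}_core_temp")
        names.append(f"gpu{x}_mem_temp")
    return names

print(len(expand_ipmi_names()))  # 80 per-socket, DIMM and GPU metrics
```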

Ganglia

The following table describes the metrics collected by the ganglia_pub plugin. The data are extracted from a Ganglia^([6]) instance that CINECA runs on Marconi100.

| Metric name | Type | Unit | Description |
|---|---|---|---|
| gexec | core | | gexec available |
| cpu_aidle | cpu | % | Percentage of time since boot that the CPU has been idle |
| cpu_idle | cpu | % | Percentage of time that the CPU or CPUs were idle and the system did not have an outstanding disk I/O request |
| cpu_nice | cpu | % | Percentage of CPU utilization that occurred while executing at the user level with nice priority |
| cpu_speed | cpu | MHz | CPU speed in MHz |
| cpu_steal | cpu | % | Percentage of CPU time stolen by the hypervisor |
| cpu_system | cpu | % | Percentage of CPU utilization that occurred while executing at the system level |
| cpu_user | cpu | % | Percentage of CPU utilization that occurred while executing at the user level |
| cpu_wio | cpu | % | Percentage of time that the CPU or CPUs were idle during which the system had an outstanding disk I/O request |
| cpu_num | cpu | CPUs | Total number of CPUs |
| disk_free | disk | GB | Total free disk space |
| disk_total | disk | GB | Total available disk space |
| part_max_used | disk | % | Maximum percent used for all partitions |
| load_fifteen | load | | Fifteen-minute load average |
| load_five | load | | Five-minute load average |
| load_one | load | | One-minute load average |
| mem_buffers | memory | KB | Amount of buffered memory |
| mem_cached | memory | KB | Amount of cached memory |
| mem_free | memory | KB | Amount of available memory |
| mem_shared | memory | KB | Amount of shared memory |
| mem_total | memory | KB | Total amount of memory |
| swap_free | memory | KB | Amount of available swap memory |
| swap_total | memory | KB | Total amount of swap space |
| bytes_in | network | bytes/sec | Number of bytes received per second |
| bytes_out | network | bytes/sec | Number of bytes sent per second |
| pkts_in | network | packets/sec | Packets received per second |
| pkts_out | network | packets/sec | Packets sent per second |
| proc_run | process | | Total number of running processes |
| proc_total | process | | Total number of processes |
| boottime | system | s | The last time that the system was started |
| machine_type | system | | System architecture |
| os_name | system | | Operating system name |
| os_release | system | | Operating system release date |
| cpu_ctxt | cpu | ctxs/sec | Context switches per second |
| cpu_intr | cpu | % | Percentage of CPU time spent servicing interrupts |
| cpu_sintr | cpu | % | Percentage of CPU time spent servicing soft interrupts |
| multicpu_idle0 | cpu | % | Percentage of CPU utilization that occurred while executing at the idle level |
| procs_blocked | cpu | processes | Processes blocked |
| procs_created | cpu | proc/sec | Number of processes and threads created |
| disk_free_absolute_developers | disk | GB | Disk space available (GB) on /developers |
| disk_free_percent_developers | disk | % | Disk space available (%) on /developers |
| diskstat_sda_io_time | diskstat | s | The time in seconds spent in I/O operations |
| diskstat_sda_percent_io_time | diskstat | % | The percent of disk time spent on I/O operations |
| diskstat_sda_read_bytes_per_sec | diskstat | bytes/sec | The number of bytes read per second |
| diskstat_sda_reads_merged | diskstat | reads | The number of reads merged. Adjacent reads may be merged for efficiency: multiple reads may become one before being handed to the disk, and are then counted (and queued) as a single I/O. |
| diskstat_sda_reads | diskstat | reads | The number of reads completed |
| diskstat_sda_read_time | diskstat | s | The time in seconds spent reading |
| diskstat_sda_weighted_io_time | diskstat | s | The weighted time in seconds spent in I/O operations: incremented at each I/O start, I/O completion, I/O merge, or read of these stats by the number of I/O operations in progress times the time spent doing I/O. |
| diskstat_sda_write_bytes_per_sec | diskstat | bytes/sec | The number of bytes written per second |
| diskstat_sda_writes_merged | diskstat | writes | The number of writes merged. Adjacent writes may be merged for efficiency: multiple writes may become one before being handed to the disk, and are then counted (and queued) as a single I/O. |
| diskstat_sda_writes | diskstat | writes | The number of writes completed |
| diskstat_sda_write_time | diskstat | s | The time in seconds spent writing |
| ipmi_ambient_temp | ipmi | °C | IPMI data |
| ipmi_avg_power | ipmi | W | IPMI data |
| ipmi_cpu1_temp | ipmi | °C | IPMI data |
| ipmi_cpu2_temp | ipmi | °C | IPMI data |
| ipmi_gpu_outlet_temp | ipmi | °C | IPMI data |
| ipmi_hdd_inlet_temp | ipmi | °C | IPMI data |
| ipmi_pch_temp | ipmi | °C | IPMI data |
| ipmi_pci_riser_1_temp | ipmi | °C | IPMI data |
| ipmi_pci_riser_2_temp | ipmi | °C | IPMI data |
| ipmi_pib_ambient_temp | ipmi | °C | IPMI data |
| mem_anonpages | memory | bytes | AnonPages |
| mem_dirty | memory | bytes | The total amount of memory waiting to be written back to disk |
| mem_hardware_corrupted | memory | bytes | HardwareCorrupted |
| mem_mapped | memory | bytes | Mapped |
| mem_writeback | memory | bytes | The total amount of memory actively being written back to disk |
| vm_pgmajfault | memory_vm | ops/s | pgmajfault |
| vm_pgpgin | memory_vm | ops/s | pgpgin |
| vm_pgpgout | memory_vm | ops/s | pgpgout |
| vm_vmeff | memory_vm | % | VM efficiency |
| rx_bytes_eth0 | network | bytes/sec | Received bytes per second |
| rx_drops_eth0 | network | pkts/sec | Received packets dropped per second |
| rx_errs_eth0 | network | pkts/sec | Received error packets per second |
| rx_pkts_eth0 | network | pkts/sec | Received packets per second |
| tx_bytes_eth0 | network | bytes/sec | Transmitted bytes per second |
| tx_drops_eth0 | network | pkts/sec | Transmitted packets dropped per second |
| tx_errs_eth0 | network | pkts/sec | Transmitted error packets per second |
| tx_pkts_eth0 | network | pkts/sec | Transmitted packets per second |
| procstat_gmond_cpu | procstat | % | The total percent CPU utilization |
| procstat_gmond_mem | procstat | bytes | The total memory utilization |
| softirq_blockiopoll | softirq | ops/s | Soft interrupts |
| softirq_block | softirq | ops/s | Soft interrupts |
| softirq_hi | softirq | ops/s | Soft interrupts |
| softirq_hrtimer | softirq | ops/s | Soft interrupts |
| softirq_netrx | softirq | ops/s | Soft interrupts |
| softirq_nettx | softirq | ops/s | Soft interrupts |
| softirq_rcu | softirq | ops/s | Soft interrupts |
| softirq_sched | softirq | ops/s | Soft interrupts |
| softirq_tasklet | softirq | ops/s | Soft interrupts |
| softirq_timer | softirq | ops/s | Soft interrupts |
| entropy_avail | ssl | bits | Entropy available |
| tcpext_listendrops | tcpext | count/s | listendrops |
| tcpext_tcploss_percentage | tcpext | % | TCP loss percentage: tcploss / (insegs + outsegs) |
| tcp_attemptfails | tcp | count/s | attempt fails |
| tcp_insegs | tcp | count/s | insegs |
| tcp_outsegs | tcp | count/s | outsegs |
| tcp_retrans_percentage | tcp | % | TCP retransmission percentage: retranssegs / (insegs + outsegs) |
| udp_indatagrams | udp | count/s | indatagrams |
| udp_inerrors | udp | count/s | inerrors |
| udp_outdatagrams | udp | count/s | outdatagrams |
| multicpu_idle16 | cpu | % | Percentage of CPU utilization that occurred while executing at the idle level |
| multicpu_steal16 | cpu | % | Percentage of CPU preempted by the hypervisor |
| multicpu_system16 | cpu | % | Percentage of CPU utilization that occurred while executing at the system level |
| multicpu_user16 | cpu | % | Percentage of CPU utilization that occurred while executing at the user level |
| multicpu_wio16 | cpu | % | Percentage of CPU utilization that occurred while executing at the wio level |
| diskstat_sdb_io_time | diskstat | s | The time in seconds spent in I/O operations |
| diskstat_sdb_percent_io_time | diskstat | % | The percent of disk time spent on I/O operations |
| diskstat_sdb_read_bytes_per_sec | diskstat | bytes/sec | The number of bytes read per second |
| diskstat_sdb_reads_merged | diskstat | reads | The number of reads merged. Adjacent reads may be merged for efficiency: multiple reads may become one before being handed to the disk, and are then counted (and queued) as a single I/O. |
| diskstat_sdb_reads | diskstat | reads | The number of reads completed |
| diskstat_sdb_read_time | diskstat | s | The time in seconds spent reading |
| diskstat_sdb_weighted_io_time | diskstat | s | The weighted time in seconds spent in I/O operations: incremented at each I/O start, I/O completion, I/O merge, or read of these stats by the number of I/O operations in progress times the time spent doing I/O. |
| diskstat_sdb_write_bytes_per_sec | diskstat | bytes/sec | The number of bytes written per second |
| diskstat_sdb_writes_merged | diskstat | writes | The number of writes merged. Adjacent writes may be merged for efficiency: multiple writes may become one before being handed to the disk, and are then counted (and queued) as a single I/O. |
| diskstat_sdb_writes | diskstat | writes | The number of writes completed |
| diskstat_sdb_write_time | diskstat | s | The time in seconds spent writing |
| GpuX_dec_utilization | gpu | % | X=0..3 |
| GpuX_enc_utilization | gpu | % | X=0..3 |
| GpuX_enforced_power_limit | gpu | W | X=0..3 |
| GpuX_gpu_temp | gpu | °C | X=0..3 |
| GpuX_low_util_violation | gpu | | X=0..3 |
| GpuX_mem_copy_utilization | gpu | % | X=0..3 |
| GpuX_mem_util_samples | gpu | | X=0..3 |
| GpuX_memory_clock | gpu | MHz | X=0..3 |
| GpuX_memory_temp | gpu | °C | X=0..3 |
| GpuX_power_management_limit | gpu | W | X=0..3 |
| GpuX_power_usage | gpu | W | X=0..3 |
| GpuX_pstate | gpu | | X=0..3 |
| GpuX_reliability_violation | gpu | | X=0..3 |
| GpuX_sm_clock | gpu | MHz | X=0..3 |
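
As an example of how the per-state CPU percentages combine, a node-level "busy" percentage can be derived from cpu_idle and cpu_wio (the sample values below are illustrative, not real data):

```python
# Derive an overall "busy" percentage from the Ganglia per-state CPU
# metrics. The sample values below are illustrative, not real data.
sample = {
    "cpu_user": 61.2, "cpu_system": 5.1, "cpu_nice": 0.0,
    "cpu_wio": 1.4, "cpu_steal": 0.0, "cpu_idle": 32.3,
}

# Idle and I/O-wait time are subtracted: what remains is time the
# CPUs actually spent executing user or kernel code.
busy = 100.0 - sample["cpu_idle"] - sample["cpu_wio"]
print(f"busy: {busy:.1f}%")
```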

Nagios

This is a description of the metrics collected by the ExaMon "nagios_pub" plugin. The data reflect those monitored by the Nagios^([7]) tool currently running on the CINECA clusters. Specifically, the plugin interfaces with a Nagios extension developed by CINECA called "Hnagios"^([8]). Although the monitored services and metrics are similar across all clusters, here we specifically discuss those of Marconi100.

Metrics

Currently, this plugin collects three metrics:

  • hostscheduleddowtimecomments
  • plugin_output
  • state

Hostscheduleddowtimecomments

This metric is obtained from the "Hnagios" output and reports comments made by system administrators about the maintenance status of the monitored resource.

| name | tag key | tag values |
|---|---|---|
| hostscheduleddowtimecomments | node | [ems02, login03, login08, master01, master02, ... |
| hostscheduleddowtimecomments | slot | [01, 02, 03, 04, 05, 06, 07, 08, 09, 10, 11, 1... |
| hostscheduleddowtimecomments | description | [afs::blocked_conn::status, afs::bosserver::st... |
| hostscheduleddowtimecomments | plugin | [nagios_pub] |
| hostscheduleddowtimecomments | chnl | [data] |
| hostscheduleddowtimecomments | host_group | [compute, compute,cincompute, efgwcompute, efg... |
| hostscheduleddowtimecomments | cluster | [galileo, marconi, marconi100] |
| hostscheduleddowtimecomments | state | [0, 1, 2, 3] |
| hostscheduleddowtimecomments | nagiosdrained | [0, 1] |
| hostscheduleddowtimecomments | org | [cineca] |
| hostscheduleddowtimecomments | state_type | [0, 1] |
| hostscheduleddowtimecomments | rack | [205, 206, 207, 208, 209, 210, 211, 212, 213, ... |

Plugin_output

This metric collects the output messages of the Nagios agents responsible for monitoring services.

| name | tag key | tag values |
|---|---|---|
| plugin_output | node | [ems02, ethcore01-mgt, ethcore02-mgt, gss03, g... |
| plugin_output | slot | [01, 02, 03, 04, 05, 06, 07, 08, 09, 10, 11, 1... |
| plugin_output | description | [EFGW_cluster::status::availability, EFGW_clus... |
| plugin_output | plugin | [nagios_pub] |
| plugin_output | chnl | [data] |
| plugin_output | host_group | [compute, compute,cincompute, containers, cumu... |
| plugin_output | cluster | [galileo, marconi, marconi100] |
| plugin_output | state | [0, 1, 2, 3] |
| plugin_output | nagiosdrained | [0, 1] |
| plugin_output | org | [cineca] |
| plugin_output | state_type | [0, 1] |
| plugin_output | rack | [202, 205, 206, 207, 208, 209, 210, 211, 212, ... |

State

This metric collects the numerical value corresponding to the current state of the service monitored by Nagios.

| name | tag key | tag values |
|---|---|---|
| state | node | [ems02, ethcore01-mgt, ethcore02-mgt, gss03, g... |
| state | slot | [01, 02, 03, 04, 05, 06, 07, 08, 09, 10, 11, 1... |
| state | description | [EFGW_cluster::status::availability, EFGW_clus... |
| state | plugin | [nagios_pub] |
| state | chnl | [data] |
| state | host_group | [compute, compute,cincompute, containers, cumu... |
| state | cluster | [galileo, marconi, marconi100] |
| state | nagiosdrained | [0, 1] |
| state | org | [cineca] |
| state | state_type | [0, 1] |
| state | rack | [202, 205, 206, 207, 208, 209, 210, 211, 212, ... |

Resources monitored in Marconi100

The names and types of the services/resources monitored by Nagios, corresponding to the metrics described above, are collected in the "description" tag.

Nagios checks for Marconi100

The following table briefly describes the services monitored by Nagios on the Marconi100 cluster.

| Service/resource | Description |
|---|---|
| alive::ping | Ping command output |
| backup::local::status | Backup service |
| batchs::... | Batch scheduler services |
| bmc::events | Events from the node BMC |
| cluster::... | Cluster availability |
| container::... | Status of the container system |
| dev::... | Node devices |
| file::integrity | File integrity |
| filesys::... | Filesystem elements |
| galera::... | Status of the database components |
| globus::... | Status of the FTP system |
| memory::phys::total | Physical memory size |
| monitoring::health | Monitoring subsystem |
| net::ib::status | InfiniBand |
| nfs::rpc::status | NFS |
| nvidia::... | GPUs |
| service::... | Misc. services |
| ssh::... | SSH server |
| sys::... | Misc. systems (GPFS, ...) |
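
The service names in the "description" tag form a "::"-separated namespace, with the subsystem first and the check last. A minimal parsing sketch:

```python
# Split a Nagios "description" tag value into its namespace components.
# Example values are taken from the tag tables above.
def parse_description(desc: str):
    parts = desc.split("::")
    return {"subsystem": parts[0], "check": parts[-1], "path": parts}

d = parse_description("net::ib::status")
print(d["subsystem"], d["check"])  # net status
```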

Nagios state encoding

This table describes the numerical encoding of the state metric values and the state_type tag, as defined by Nagios.

| state value | Meaning |
|---|---|
| 0 | OK |
| 1 | WARNING |
| 2 | CRITICAL |
| 3 | UNKNOWN |

| state_type value | Meaning |
|---|---|
| 0 | SOFT |
| 1 | HARD |
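
Assuming CINECA keeps the default Nagios encoding (service states 0-3, state types SOFT/HARD), the numeric values can be mapped back to labels:

```python
# Standard Nagios service-state and state-type encodings.
SERVICE_STATE = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}
STATE_TYPE = {0: "SOFT", 1: "HARD"}

def decode(state: int, state_type: int) -> str:
    return f"{SERVICE_STATE[state]} ({STATE_TYPE[state_type]})"

print(decode(2, 1))  # CRITICAL (HARD)
```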

Nvidia

The following table describes the metrics collected by the nvidia_pub plugin.

PLEASE NOTE: This plugin collected data only for a short period (January/February 2020) and is currently not enabled due to CINECA policy.

| Metric name | Description | Unit |
|---|---|---|
| clocks.sm | Current frequency of the SM (Streaming Multiprocessor) clock | MHz |
| clocks.gr | Current frequency of the graphics (shader) clock | MHz |
| clocks.mem | Current frequency of the memory clock | MHz |
| clocks_throttle_reasons.active | Bitmask of active clock throttle reasons. See nvml.h for more details | bitmask |
| power.draw | The last measured power draw for the entire board. Only available if power management is supported; accurate to within ±5 W | W |
| temperature.gpu | Core GPU temperature | °C |
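
The clocks_throttle_reasons.active bitmask can be decoded with the nvmlClocksThrottleReason* constants from nvml.h; the subset of bit values below reflects the NVML headers, but should be verified against the nvml.h version in use:

```python
# Decode a clocks_throttle_reasons.active bitmask into reason names.
# Bit values follow the nvmlClocksThrottleReason* constants in nvml.h.
THROTTLE_REASONS = {
    0x1:  "GpuIdle",
    0x2:  "ApplicationsClocksSetting",
    0x4:  "SwPowerCap",
    0x8:  "HwSlowdown",
    0x10: "SyncBoost",
}

def decode_throttle(mask: int):
    return [name for bit, name in THROTTLE_REASONS.items() if mask & bit]

print(decode_throttle(0x5))  # ['GpuIdle', 'SwPowerCap']
```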

Slurm

Currently, the job scheduler data are collected as per-job records in plain Cassandra tables.

This is a description of the data currently stored (where available) for each executed job:

| Table field | Description |
|---|---|
| account | charge to specified account |
| accrue_time | time job is eligible for running |
| admin_comment | administrator's arbitrary comment |
| alloc_node | local node and system id making the resource allocation |
| alloc_sid | local sid making resource alloc |
| array_job_id | job_id of a job array or 0 if N/A |
| array_max_tasks | maximum number of running tasks |
| array_task_id | task_id of a job array |
| array_task_str | string expression of task IDs in this record |
| assoc_id | association id for job |
| batch_features | features required for batch script's node |
| batch_flag | 1 if batch: queued job with script |
| batch_host | name of host running batch script |
| billable_tres | billable TRES cache, updated upon resize |
| bitflags | various job flags |
| boards_per_node | boards per node required by job |
| burst_buffer | burst buffer specifications |
| burst_buffer_state | burst buffer state info |
| command | command to be executed, built from the submitted job's argv; NULL for salloc |
| comment | arbitrary comment |
| contiguous | 1 if job requires contiguous nodes |
| core_spec | specialized core count |
| cores_per_socket | cores per socket required by job |
| cpu_freq_gov | cpu frequency governor |
| cpu_freq_max | maximum cpu frequency |
| cpu_freq_min | minimum cpu frequency |
| cpus_alloc_layout | map: list of cpus allocated per node |
| cpus_allocated | map: number of cpus allocated per node |
| cpus_per_task | number of processors required for each task |
| cpus_per_tres | semicolon delimited list of TRES=# values |
| dependency | synchronize job execution with other jobs |
| derived_ec | highest exit code of all job steps |
| eligible_time | time job is eligible for running |
| end_time | time of termination, actual or expected |
| exc_nodes | comma separated list of excluded nodes |
| exit_code | exit code for job (status from wait call) |
| features | comma separated list of required features |
| group_id | group job submitted as |
| job_id | job ID |
| job_state | state of the job, see enum job_states |
| last_sched_eval | last time job was evaluated for scheduling |
| licenses | licenses required by the job |
| max_cpus | maximum number of cpus usable by job |
| max_nodes | maximum number of nodes usable by job |
| mem_per_cpu | boolean |
| mem_per_node | boolean |
| mem_per_tres | semicolon delimited list of TRES=# values |
| min_memory_cpu | minimum real memory required per allocated CPU |
| min_memory_node | minimum real memory required per node |
| name | name of the job |
| network | network specification |
| nice | requested priority change |
| nodes | list of nodes allocated to job |
| ntasks_per_board | number of tasks to invoke on each board |
| ntasks_per_core | number of tasks to invoke on each core |
| ntasks_per_core_str | number of tasks to invoke on each core, as string |
| ntasks_per_node | number of tasks to invoke on each node |
| ntasks_per_socket | number of tasks to invoke on each socket |
| ntasks_per_socket_str | number of tasks to invoke on each socket, as string |
| num_cpus | minimum number of cpus required by job |
| num_nodes | minimum number of nodes required by job |
| partition | name of assigned partition |
| pn_min_cpus | minimum # CPUs per node, default=0 |
| pn_min_memory | minimum real memory per node, default=0 |
| pn_min_tmp_disk | minimum tmp disk per node, default=0 |
| power_flags | power management flags, see SLURM_POWERFLAGS |
| pre_sus_time | time job ran prior to last suspend |
| preempt_time | preemption signal time |
| priority | relative priority of the job, 0=held, 1=required nodes DOWN/DRAINED |
| profile | level of acct_gather_profile |
| qos | Quality of Service |
| reboot | node reboot requested before start |
| req_nodes | comma separated list of required nodes |
| req_switch | minimum number of switches |
| requeue | enable or disable job requeue option |
| resize_time | time of latest size change |
| restart_cnt | count of job restarts |
| resv_name | reservation name |
| run_time | job run time (seconds) |
| run_time_str | job run time (seconds), as string |
| sched_nodes | list of nodes scheduled to be used for job |
| shared | 1 if job can share nodes with other jobs |
| show_flags | conveys level of details requested |
| sockets_per_board | sockets per board required by job |
| sockets_per_node | sockets per node required by job |
| start_time | time execution begins, actual or expected |
| state_reason | reason job still pending or failed, see slurm.h: enum job_state_reason |
| std_err | pathname of job's stderr file |
| std_in | pathname of job's stdin file |
| std_out | pathname of job's stdout file |
| submit_time | time of job submission |
| suspend_time | time job last suspended or resumed |
| system_comment | slurmctld's arbitrary comment |
| threads_per_core | threads per core required by job |
| time_limit | maximum run time in minutes or INFINITE |
| time_limit_str | maximum run time in minutes or INFINITE, as string |
| time_min | minimum run time in minutes or INFINITE |
| tres_alloc_str | TRES used in the job, as string |
| tres_bind | task to TRES binding directives |
| tres_freq | TRES frequency directives |
| tres_per_job | semicolon delimited list of TRES=# values |
| tres_per_node | semicolon delimited list of TRES=# values |
| tres_per_socket | semicolon delimited list of TRES=# values |
| tres_per_task | semicolon delimited list of TRES=# values |
| tres_req_str | TRES requested in the job, as string |
| user_id | user the job runs as |
| wait4switch | maximum time to wait for minimum switches |
| wckey | wckey for job |
| work_dir | pathname of working directory |
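
As a sketch of how these per-job fields combine into derived quantities (the record below is fabricated for illustration, not real job data):

```python
# Compute derived quantities from a Slurm per-job record.
# The record below is illustrative, not real job data.
job = {
    "job_id": 123456,
    "submit_time": 1_600_000_000,   # epoch seconds
    "start_time": 1_600_000_600,
    "end_time":   1_600_004_200,
    "num_nodes": 4,
    "num_cpus": 128,
}

queue_wait_s = job["start_time"] - job["submit_time"]   # time spent pending
run_time_s = job["end_time"] - job["start_time"]        # matches run_time
cpu_seconds = run_time_s * job["num_cpus"]              # consumed CPU time

print(queue_wait_s, run_time_s, cpu_seconds)  # 600 3600 460800
```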