Taking on the challenge of COVID-19: The birth of six-dimensional interconnect technology for the supercomputer Fugaku (part 1)

[The Power to Create the Future Vol. 15] Yuichiro Ajima Principal Architect, System Development Division, Platform Development Unit Fujitsu Limited

by Fujitsu Blog Editor
Fujitsu
November 13, 2020


Countries around the world are now searching for the best possible measures to control the spread of novel coronavirus (COVID-19) by applying expertise from various fields. In Japan, the supercomputer Fugaku, which has been jointly developed by RIKEN and Fujitsu, has started partial operation ahead of schedule, and there are high expectations for its contribution to society.

One of Fugaku’s most defining supporting technologies is six-dimensional interconnect technology, which is a further evolution of the technology of its predecessor, the K computer.

So, how was this technology, which is unique to Japan, created and developed into the world's highest-end supercomputer?

We interviewed Yuichiro Ajima of Fujitsu, the architect who led the development of the K computer and Fugaku and who received the Medal with Purple Ribbon in the spring of 2020.

Developing a Japanese supercomputer to help society

Ajima: I think that supercomputer simulation is essential for the research and development of cutting-edge science and technology. Nowadays, many supercomputers around the world are being used by researchers to develop COVID-19 countermeasures. Originally, Fugaku was scheduled to start operation in FY2021, but the decision was made to launch partial operation ahead of schedule to find solutions for this pandemic. (For details, see part 2.)

The K computer is not owned by a specific research institute or university. It is made publicly available as a shared national facility to those who have passed the screening. If users disclose research results achieved using the K computer, no usage fee is charged. This policy will also apply to Fugaku, which is scheduled to start full operation in FY2021. In this respect, the K computer and Fugaku are special, and there are no other supercomputers like them in the world. They offer both the ability to address very large problems and wide accessibility to a variety of users, and it is six-dimensional interconnect technology (intra-system network technology) that made both possible.

Challenges in developing an unprecedented ultra-high performance machine

Ajima: When development of the K computer started in 2004, it was not yet called "K," and the Peta-scale Computing (*1) Research Center was established within Fujitsu Laboratories. At that time, Japan and the US were racing to achieve peta-scale computing by 2010. In November 2004, Japan's Earth Simulator supercomputer yielded the world's top position to IBM's Blue Gene/L (according to the TOP500 supercomputer rankings). In the US, both IBM's Blue Gene/L and Cray's Red Storm connected more than 10,000 processors with three-dimensional interconnect technology (in which each processor is connected to the processors adjacent to it, in each of the six cardinal directions: top, bottom, left, right, front, and back).

(*1) Refers to a supercomputing technology with a computational performance of 1 petaflops (quadrillion operations per second) or more.

However, these three-dimensional interconnected supercomputers had shortcomings. IBM used a method of connecting separated partitions to each other via a dedicated switch. In this system, a failed processor is isolated from the network on a per-partition basis, so if one processor fails, the other processors in its partition are cut off as well even though they have not failed, and system availability is significantly reduced.

Cray prioritized availability; however, communication collisions frequently occurred on the detouring communication path, which degraded communication performance. Despite the fact that Japanese technology at the time could connect only thousands of processors, we started considering the possibility of developing a new technology to supersede this three-dimensional technology.

Pursuing ultimate usability

Ajima: I joined the development project in 2005, when I was working on hardware architecture at Fujitsu Laboratories. In February 2006, I joined a training camp in Izu Kogen with people from Fujitsu's division. We split into groups of two or three people, brainstormed, and gave presentations, through which we determined the direction in which to expand three-dimensional interconnection. Around March 2006, based on these training camp discussions, we solidified the basic idea of a hierarchically expanded torus network that users could treat as a virtual 3D network.

Virtual 3D network technology allows users to use networks that are more complex than 3D networks as if they were 3D. To realize a supercomputer with 10 petaflops performance (10 quadrillion operations per second), more than 80,000 processors must be connected, and the number of communication paths must be increased accordingly. When 3D interconnection is not enough and expansion becomes necessary, from a network point of view, it is natural to add a hierarchy or to increase the number of paths by making it a four-dimensional or five-dimensional interconnection. However, developing software that makes full use of such expanded, complex communication paths may impose a heavy burden on users. We therefore arrived at the breakthrough idea of providing users with highly usable virtual 3D networks instead of exposing the complex networks directly.

The next stage, six-dimensional interconnection

Ajima: Until 2006, Fujitsu Laboratories was the main player in conceptual design at Fujitsu. In 2007, the Research Center was transferred to the business division, where the Next Generation Technical Computing Unit was organized. As a result of a review conducted by the new organization, it was decided to enhance performance by increasing the amount of resources.

Until then, we had adopted a slightly complex hierarchical network to reduce resource usage, but we switched to a non-hierarchical system by increasing the amount of resources. Based on this, we examined the development of virtual 3D, which ultimately led to our creating the current six-dimensional interconnect system.

In six-dimensional interconnection, groups of 12 processors are interconnected in 3D, and the groups themselves are also interconnected to the other groups in 3D. The 3D size of a group is fixed to 2×3×2, and the number of groups can be increased in 3D. I think it is fascinating to realize 6D by combining two different types of 3D.

The final conceptual model of six-dimensional interconnect technology: a network that interconnects groups (big spheres), each consisting of 12 processors (small spheres), in 3D. Processors in a given group are also connected by a small 3D grid. The coordinates of the processors are represented in 6D, which is made up of the 3D between groups and the 3D in each group.
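The 6D coordinate scheme described above can be illustrated with a short Python sketch. It is only an illustration under stated assumptions, not Fujitsu's implementation: the axis names (x, y, z for the position of a group, a, b, c for the position inside a group), the row-major numbering, and the function names are all hypothetical. Only the fixed 2×3×2 group shape and the 12 processors per group come from the article.

```python
# Intra-group extents from the article: every group is a fixed 2x3x2 block.
GROUP_SHAPE = (2, 3, 2)
GROUP_SIZE = GROUP_SHAPE[0] * GROUP_SHAPE[1] * GROUP_SHAPE[2]  # 12 processors


def processor_count(groups_shape):
    """Total processors for a 3D grid of groups (groups_shape is hypothetical)."""
    gx, gy, gz = groups_shape
    return gx * gy * gz * GROUP_SIZE


def to_rank(coord, groups_shape):
    """Flatten a 6D coordinate (x, y, z, a, b, c) into one index.

    The row-major ordering used here is an assumption made for illustration.
    """
    shape = tuple(groups_shape) + GROUP_SHAPE
    rank = 0
    for value, extent in zip(coord, shape):
        if not 0 <= value < extent:
            raise ValueError(f"coordinate {value} outside extent {extent}")
        rank = rank * extent + value
    return rank


def from_rank(rank, groups_shape):
    """Inverse of to_rank: recover the 6D coordinate from a flat index."""
    shape = tuple(groups_shape) + GROUP_SHAPE
    coord = []
    for extent in reversed(shape):
        rank, value = divmod(rank, extent)
        coord.append(value)
    return tuple(reversed(coord))
```

For example, with a hypothetical 20×20×20 grid of groups, processor_count((20, 20, 20)) gives 96,000 processors, which shows how growing only the number of groups passes the 80,000-processor mark while the group shape stays fixed.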

I think that this increase in resources and performance is the reason why the K computer came out on top in various benchmarks and continued to top the world in the communication-intensive benchmark Graph500 until its 2019 retirement, all the while maintaining a good reputation with its users.

Shaping ideas based on a common understanding

Ajima: The biggest obstacle was the shift from conceptual design to the detailed design phase leading up to the manufacturing stage. In conceptual design at Fujitsu Laboratories, I played a key role in the development of the architecture. My job was primarily to write computer code and design computer circuits. However, in detailed design, the number of members increased to hundreds. The hardware logic design/verification team and the software OS/device driver/communication library team took the lead, and I provided specifications to and supported each team. I prepared and distributed documents and explained them at meetings. After I moved to Fujitsu's development division, I understood for the first time that, when many people are building something together, specifications matter most.

Codename "Tofu"

Ajima: Both the hierarchical system and the six-dimensional interconnection have been patented. When we first applied for a patent for the hierarchical system in 2006, we used the development codename "ToFu interconnect." Then, when we applied for a patent for the six-dimensional interconnection in 2007, we changed the uppercase "F" to lower case for "Tofu interconnect." The name comes from "tofu" (bean curd), a familiar food in Japan. I came up with the name because the system is water-cooled and can be partitioned flexibly. The name "ToFu" represents "torus" (*2) and "full connection," while "Tofu" represents "torus and fusion".

(*2) A torus refers to a multidimensional network interconnection in which each dimension has a loop structure.
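The loop structure in this definition can be sketched in Python. The sketch assumes that a processor's neighbors are the nodes one step away along each dimension, with the index wrapping around at the edges; the function name and the tuple representation are hypothetical and chosen only for illustration.

```python
def torus_neighbors(coord, shape):
    """Return the neighbors of coord in a torus with the given extents.

    Each dimension wraps around (the loop structure), so stepping past the
    last index returns to index 0 and stepping below 0 returns to the end.
    """
    neighbors = []
    for axis, extent in enumerate(shape):
        for step in (-1, 1):
            moved = list(coord)
            moved[axis] = (coord[axis] + step) % extent  # wrap-around link
            candidate = tuple(moved)
            # Skip self-links (extent 1) and duplicates (extent 2, where
            # both directions reach the same node).
            if candidate != coord and candidate not in neighbors:
                neighbors.append(candidate)
    return neighbors
```

In a 3D torus with extents (4, 4, 4), the corner node (0, 0, 0) still has six neighbors, including (3, 0, 0) reached through the wrap-around link, whereas in a plain grid without loops it would have only three.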

If asked "What is the epoch-defining feature of six-dimensional interconnect technology?", I would answer that it has enabled the connection of more than 100,000 processors and the efficient concurrent execution of parallel programs of various sizes.

The network can be flexibly partitioned from any location with a group as a unit. In the event of a processor failure, this technology can reduce data traffic congestion by excluding the failed location and can minimize the size of the field replacement unit, which improves performance and availability. I think these attributes have made it possible to realize a supercomputer that not only solves very large problems but is also used by researchers in a variety of fields.

In part 2, we ask Ajima for his thoughts on the development team, Fugaku’s unveiled capabilities, and other topics.

Profile
Yuichiro Ajima

Principal Architect (Interconnect Architecture)
System Development Division, Platform Development Unit
Fujitsu Limited

1997: Graduated from the Department of Electrical Engineering, The University of Tokyo.
2002: Completed the Doctoral Program in Information Engineering, Graduate School of Engineering, The University of Tokyo, PhD (Engineering).
2002: Joined Fujitsu Laboratories.
2007: Transferred to a division at Fujitsu Limited and developed interconnects for the K computer and Fugaku.
2012: Won the Ichimura Prize in Industry for Distinguished Achievement.
2014: Won the Imperial Invention Prize, the National Commendation for Invention.
2017: Won the Prize for Science and Technology, Commendation for Science and Technology by the Minister of Education, Culture, Sports, Science and Technology.
2020 (Spring): Awarded the Medal with Purple Ribbon.