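The previous post described how the rank 0 machine generates the ncclUniqueId and initializes its bootstrap network and communication network. This post looks at how the bootstrap connections among all nodes are established.
The Id is broadcast to all nodes via MPI, and every node then calls ncclCommInitRank. The other nodes also initialize their bootstrap and communication networks at this point, and execution then reaches ncclCommInitRankSync.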
ncclResult_t ncclCommInitRankSync(ncclComm_t* newcomm, int nranks, ncclUniqueId commId, int myrank, int cudaDev) {
  ncclResult_t res;

  CUDACHECK(cudaSetDevice(cudaDev));
  NCCLCHECKGOTO(commAlloc(newcomm, nranks, myrank), res, cleanup);
  NCCLCHECKGOTO(initTransportsRank(*newcomm, &commId), res, cleanup);
  NCCLCHECKGOTO(devCommSetup(*newcomm), res, cleanup);

  INFO(NCCL_INIT,"comm %p rank %d nranks %d cudaDev %d busId %x - Init COMPLETE", *newcomm, myrank, nranks, (*newcomm)->cudaDev, (*newcomm)->busId);

  return ncclSuccess;
cleanup:
  if ((*newcomm) && (*newcomm)->bootstrap) bootstrapAbort((*newcomm)->bootstrap);
  *newcomm = NULL;
  return res;
}
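ncclComm_t is a pointer to ncclComm. ncclComm is a catch-all structure that holds all the context needed for communication; its fields are introduced as they come up. commAlloc allocates newcomm and initializes it, for example recording which GPU is in use and what its PCIe bus ID is. Execution then continues in initTransportsRank.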
static ncclResult_t initTransportsRank(struct ncclComm* comm, ncclUniqueId* commId) {
  // We use 3 AllGathers
  // 1. { peerInfo, comm }
  // 2. ConnectTransport[nranks], ConnectValue[nranks]
  // 3. { nThreads, nrings, compCap, prev[MAXCHANNELS], next[MAXCHANNELS] }

  int rank = comm->rank;
  int nranks = comm->nRanks;
  uint64_t commHash = getHash(commId->internal, NCCL_UNIQUE_ID_BYTES);
  TRACE(NCCL_INIT, "comm %p, commHash %lx, rank %d nranks %d - BEGIN", comm, commHash, rank, nranks);
  NCCLCHECK(bootstrapInit(commId, rank, nranks, &comm->bootstrap));

  // AllGather1 - begin
  struct {
    struct ncclPeerInfo peerInfo;
    struct ncclComm* comm;
  } *allGather1Data;

  NCCLCHECK(ncclCalloc(&allGather1Data, nranks));
  allGather1Data[rank].comm = comm;
  struct ncclPeerInfo* myInfo = &allGather1Data[rank].peerInfo;
  NCCLCHECK(fillInfo(comm, myInfo, commHash));
  NCCLCHECK(bootstrapAllGather(comm->bootstrap, allGather1Data, sizeof(*allGather1Data)));

  NCCLCHECK(ncclCalloc(&comm->peerInfo, nranks+1)); // Extra rank to represent CollNet root
  for (int i = 0; i < nranks; i++) {
    memcpy(comm->peerInfo+i, &allGather1Data[i].peerInfo, sizeof(struct ncclPeerInfo));
    if ((i != rank) && (comm->peerInfo[i].hostHash == myInfo->hostHash) && (comm->peerInfo[i].busId == myInfo->busId)) {
      WARN("Duplicate GPU detected : rank %d and rank %d both on CUDA device %x", rank, i, myInfo->busId);
      return ncclInvalidUsage;
    }
  }
ncclResult_t bootstrapInit(ncclUniqueId * id, int rank, int nranks, void** commState) {
  ncclNetHandle_t* netHandle = (ncclNetHandle_t*) id;
  bool idFromEnv = getenv("NCCL_COMM_ID") != NULL;
  struct extState* state;
  NCCLCHECK(ncclCalloc(&state, 1));
  state->rank = rank;
  state->nranks = nranks;
  *commState = state;

  TRACE(NCCL_INIT, "rank %d nranks %d", rank, nranks);

  struct extInfo info = { 0 };
  info.rank = rank;
  info.nranks = nranks;
  void *tmpSendComm, *tmpRecvComm;
  // Pass the remote address to listen via info
  if (idFromEnv) {
    memcpy(&info.extHandleListen, netHandle, sizeof(ncclNetHandle_t));
    memcpy(&info.extHandleListenRoot, netHandle, sizeof(ncclNetHandle_t));
  }
  // listen will return the local address via info (specify interface type 'findSubnetIf')
  state->dev = idFromEnv ? findSubnetIf : 0;
  void* extBstrapListenCommRoot;
  NCCLCHECK(bootstrapNetListen(state->dev, &info.extHandleListen, &state->extBstrapListenComm));
  NCCLCHECK(bootstrapNetListen(state->dev, &info.extHandleListenRoot, &extBstrapListenCommRoot));

  // stagger connection times to avoid an overload of the root at very high rank counts
  if (nranks > 128) {
    long msec = rank;
    struct timespec tv;
    tv.tv_sec = msec / 1000;
    tv.tv_nsec = 1000000 * (msec % 1000);
    TRACE(NCCL_INIT, "rank %d delaying connection to root by %ld msec", rank, msec);
    (void) nanosleep(&tv, NULL);
  }

  // send info on my listening socket to root
  NCCLCHECK(bootstrapNetConnect(state->dev, netHandle, &tmpSendComm));
  NCCLCHECK(bootstrapNetSend(tmpSendComm, &info, sizeof(info)));
  NCCLCHECK(bootstrapNetCloseSend(tmpSendComm));

  // get info on my "next" rank in the bootstrap ring from root
}
commState here is ncclComm's bootstrap field, and its type is extState.
struct extState {
  void* extBstrapListenComm;
  void* extBstrapRingRecvComm;
  void* extBstrapRingSendComm;
  ncclNetHandle_t* peerBstrapHandles;
  struct unexConn* unexpectedConnections;
  int rank;
  int nranks;
  int dev;
};
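Here extBstrapRingSendComm is the socket connection from the current node to its next node, extBstrapRingRecvComm is the socket connection between the current node and its prev node, and extBstrapListenComm is the current node's listening socket. peerBstrapHandles holds the ip and port of every rank (the address each rank's extBstrapListenComm listens on). dev defaults to 0 and selects which local IP address to use.
bootstrapNetListen is then called twice to create two bootstrap comms, whose addresses are written to extHandleListen and extHandleListenRoot. As described in the previous post, a bootstrap comm simply stores an fd. Two comms are created because extHandleListen is the bootstrap connection that ranks actually use among themselves, while extHandleListenRoot is the connection used for communication between the rank 0 node (acting as root) and all other ranks.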
static ncclResult_t bootstrapNetListen(int dev, ncclNetHandle_t* netHandle, void** listenComm)
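bootstrapNetListen was covered in the previous post. It takes the dev-th IP address of the current machine, calls listen to obtain a listening fd, writes the ip and port into netHandle, and writes the resulting bootstrap comm into listenComm.
rank, nranks, extHandleListen and extHandleListenRoot are then written into extInfo.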
struct extInfo {
  int rank;
  int nranks;
  ncclNetHandle_t extHandleListenRoot;
  ncclNetHandle_t extHandleListen;
};
netHandle is the ncclUniqueId, that is, rank 0's ip and port. bootstrapNetConnect then creates a bootstrap send comm. By analogy with bootstrapNetListen, bootstrapNetConnect opens a socket connection to netHandle and stores the socket in sendComm; dev is not used here.
static ncclResult_t bootstrapNetConnect(int dev, ncclNetHandle_t* netHandle, void** sendComm)
bootstrapNetSend then sends extInfo out, that is, to rank 0:
static ncclResult_t bootstrapNetSend(void* sendComm, void* data, int size) {
  struct bootstrapNetComm* comm = (struct bootstrapNetComm*)sendComm;
  NCCLCHECK(socketSend(comm->fd, &size, sizeof(int)));
  NCCLCHECK(socketSend(comm->fd, data, size));
  return ncclSuccess;
}
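socketSend simply calls the send interface to transmit the data. Note that bootstrapNetSend sends the size first and the payload second, so every bootstrap message is length-prefixed.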
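The length-prefixed framing used by bootstrapNetSend can be sketched outside NCCL as follows. This is a minimal illustration, not NCCL's code: send_all, recv_all, framed_send, framed_recv and framing_roundtrip are hypothetical names, the receiver's strict size check is an assumption about how the other side might validate the header, and a local socketpair stands in for the TCP connection to rank 0.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: keep calling send() until all bytes are written. */
int send_all(int fd, const void* buf, size_t len) {
  const char* p = (const char*)buf;
  while (len > 0) {
    ssize_t n = send(fd, p, len, 0);
    if (n <= 0) return -1;
    p += n;
    len -= (size_t)n;
  }
  return 0;
}

/* Hypothetical helper: keep calling recv() until len bytes have arrived. */
int recv_all(int fd, void* buf, size_t len) {
  char* p = (char*)buf;
  while (len > 0) {
    ssize_t n = recv(fd, p, len, 0);
    if (n <= 0) return -1;
    p += n;
    len -= (size_t)n;
  }
  return 0;
}

/* Send an int length header followed by the payload. */
int framed_send(int fd, const void* data, int size) {
  assert(size >= 0);
  if (send_all(fd, &size, sizeof(int)) != 0) return -1;
  return send_all(fd, data, (size_t)size);
}

/* Read the header, compare it with the expected size, then read the payload.
 * Rejecting any mismatch is an assumption made for this sketch. */
int framed_recv(int fd, void* data, int size) {
  int incoming = 0;
  if (recv_all(fd, &incoming, sizeof(int)) != 0) return -1;
  if (incoming != size) return -1;
  return recv_all(fd, data, (size_t)size);
}

/* Round-trip one message over a local socket pair; returns 0 on success. */
int framing_roundtrip(const char* msg, char* out, int size) {
  int fds[2];
  if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds) != 0) return -1;
  int rc = framed_send(fds[0], msg, size);
  if (rc == 0) rc = framed_recv(fds[1], out, size);
  close(fds[0]);
  close(fds[1]);
  return rc;
}
```

Calling framing_roundtrip("bootstrap", out, 10) with a 10-byte or larger out buffer copies the string, including its terminator, through the socket pair. The header lets the receiver detect a size disagreement before it reads any payload.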
What does rank 0 do once it receives the data? Recall that the rank 0 node calls ncclGetUniqueId to generate the ncclUniqueId, and that at the end of bootstrapCreateRoot it starts a thread running bootstrapRoot.
static void *bootstrapRoot(void* listenComm) {
  struct extInfo info;
  ncclNetHandle_t *rankHandles = NULL;
  ncclNetHandle_t *rankHandlesRoot = NULL; // for initial rank <-> root information exchange
  ncclNetHandle_t zero = { 0 }; // for sanity checking
  void* tmpComm;
  ncclResult_t res;
  setFilesLimit();

  TRACE(NCCL_INIT, "BEGIN");
  /* Receive addresses from all ranks */
  int nranks = 0, c = 0;
  do {
    NCCLCHECKGOTO(bootstrapNetAccept(listenComm, &tmpComm), res, out);
    NCCLCHECKGOTO(bootstrapNetRecv(tmpComm, &info, sizeof(info)), res, out);
    NCCLCHECKGOTO(bootstrapNetCloseRecv(tmpComm), res, out);

    if (c == 0) {
      nranks = info.nranks;
      NCCLCHECKGOTO(ncclCalloc(&rankHandles, nranks), res, out);
      NCCLCHECKGOTO(ncclCalloc(&rankHandlesRoot, nranks), res, out);
    }

    if (nranks != info.nranks) {
      WARN("Bootstrap Root : mismatch in rank count from procs %d : %d", nranks, info.nranks);
      goto out;
    }

    if (memcmp(&zero, &rankHandlesRoot[info.rank], sizeof(ncclNetHandle_t)) != 0) {
      WARN("Bootstrap Root : rank %d of %d ranks has already checked in", info.rank, nranks);
      goto out;
    }

    // Save the connection handle for that rank
    memcpy(rankHandlesRoot+info.rank, info.extHandleListenRoot, sizeof(ncclNetHandle_t));
    memcpy(rankHandles+info.rank, info.extHandleListen, sizeof(ncclNetHandle_t));

    ++c;
    TRACE(NCCL_INIT, "Received connect from rank %d total %d/%d", info.rank, c, nranks);
  } while (c < nranks);
  TRACE(NCCL_INIT, "COLLECTED ALL %d HANDLES", nranks);

  // Send the connect handle for the next rank in the AllGather ring
  for (int r=0; r<nranks; ++r) {
    int next = (r+1) % nranks;
    void *tmpSendComm;
    NCCLCHECKGOTO(bootstrapNetConnect(0, rankHandlesRoot+r, &tmpSendComm), res, out);
    NCCLCHECKGOTO(bootstrapNetSend(tmpSendComm, rankHandles+next, sizeof(ncclNetHandle_t)), res, out);
    NCCLCHECKGOTO(bootstrapNetCloseSend(tmpSendComm), res, out);
  }
  TRACE(NCCL_INIT, "SENT OUT ALL %d HANDLES", nranks);

out:
  bootstrapNetCloseListen(listenComm);
  if (rankHandles) free(rankHandles);
  if (rankHandlesRoot) free(rankHandlesRoot);

  TRACE(NCCL_INIT, "DONE");
  return NULL;
}
listenComm is the listening fd created by rank 0 in the previous post. bootstrapNetAccept takes a new connection from listenComm and creates a recv comm from the new connection's fd.
static ncclResult_t bootstrapNetAccept(void* listenComm, void** recvComm)
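bootstrapNetRecv then reads the data from tmpComm, which is the extInfo sent by another rank, and the root saves that rank's extHandleListen and extHandleListenRoot. Once every rank has checked in, rank 0 knows the ip and port of all ranks.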
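The check-in bookkeeping in bootstrapRoot (size the table from the first message, reject a rank-count mismatch, and detect a duplicate by comparing the slot against zero) can be sketched in isolation. The names handle_t, checkin_t, checkin_add and checkin_done are hypothetical, HANDLE_BYTES is an assumed size standing in for sizeof(ncclNetHandle_t), and the range check on rank is an extra guard added for this sketch rather than something taken from the code above.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define HANDLE_BYTES 64  /* assumed size, stands in for sizeof(ncclNetHandle_t) */

typedef struct { char bytes[HANDLE_BYTES]; } handle_t;  /* hypothetical stand-in */

typedef struct {
  int nranks;       /* fixed by the first check-in */
  int count;        /* number of ranks that have checked in */
  handle_t* table;  /* one slot per rank; all-zero means "not seen yet" */
} checkin_t;

/* Record one rank's handle. Returns 0 on success, -1 if the check-in is rejected. */
int checkin_add(checkin_t* c, int rank, int nranks, const handle_t* h) {
  static const handle_t zero = {{0}};
  assert(c != NULL && h != NULL);
  if (nranks <= 0) return -1;
  if (c->table == NULL) {  /* first message decides the table size */
    c->nranks = nranks;
    c->table = (handle_t*)calloc((size_t)nranks, sizeof(handle_t));
    if (c->table == NULL) return -1;
  }
  if (nranks != c->nranks) return -1;            /* rank count mismatch */
  if (rank < 0 || rank >= c->nranks) return -1;  /* extra guard for this sketch */
  if (memcmp(&zero, &c->table[rank], sizeof(handle_t)) != 0) return -1;  /* duplicate */
  c->table[rank] = *h;
  c->count++;
  return 0;
}

/* True once every rank has checked in exactly once. */
int checkin_done(const checkin_t* c) {
  return c->count > 0 && c->count == c->nranks;
}
```

The zero comparison works only because a real handle is never all zeros, which is the same assumption the sanity check in bootstrapRoot relies on.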
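After collecting the info from all ranks, the root starts building the ring: it sends the extHandleListen of node (r+1) % nranks to node r, that is, it sends node r the nethandle of node r's next node. This shows why each node creates two listen comms: rank 0 uses extHandleListenRoot to reach each node, while the nodes communicate with one another through extHandleListen.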
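The ring assignment can be illustrated with plain strings in place of handles. This is a sketch under assumptions: assign_ring_next and ring_prev are hypothetical names, ADDR_LEN is an arbitrary buffer size, and the "ip:port" strings are made-up sample data standing in for the contents of ncclNetHandle_t.

```c
#include <assert.h>
#include <string.h>

#define ADDR_LEN 32  /* assumed buffer size for an "ip:port" string */

/* For each rank r, copy the listen address of rank (r+1) % nranks into
 * next_addr[r]; this is the message the root would send to rank r. */
void assign_ring_next(int nranks, char listen_addr[][ADDR_LEN], char next_addr[][ADDR_LEN]) {
  assert(nranks > 0);
  for (int r = 0; r < nranks; ++r) {
    int next = (r + 1) % nranks;
    memcpy(next_addr[r], listen_addr[next], ADDR_LEN);
  }
}

/* The rank that will connect to r, i.e. r's prev in the ring. */
int ring_prev(int r, int nranks) {
  return (r - 1 + nranks) % nranks;
}
```

With three ranks, rank 0 is told rank 1's address, rank 1 is told rank 2's, and rank 2 wraps around to rank 0, so every rank has exactly one next and one prev.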
ncclResult_t bootstrapInit(ncclUniqueId * id, int rank, int nranks, void** commState) {
  // get info on my "next" rank in the bootstrap ring from root
  ncclNetHandle_t extHandleNext;
  NCCLCHECK(bootstrapNetAccept(extBstrapListenCommRoot, &tmpRecvComm));
  NCCLCHECK(bootstrapNetRecv(tmpRecvComm, &extHandleNext, sizeof(extHandleNext)));
  NCCLCHECK(bootstrapNetCloseRecv(tmpRecvComm));
  NCCLCHECK(bootstrapNetCloseListen(extBstrapListenCommRoot));

  NCCLCHECK(bootstrapNetConnect(state->dev, &extHandleNext, &state->extBstrapRingSendComm));
  // Accept the connect request from the previous rank in the AllGather ring
  NCCLCHECK(bootstrapNetAccept(state->extBstrapListenComm, &state->extBstrapRingRecvComm));

  // AllGather all listen handlers
  NCCLCHECK(ncclCalloc(&state->peerBstrapHandles, nranks));
  memcpy(state->peerBstrapHandles+rank, info.extHandleListen, sizeof(ncclNetHandle_t));
  NCCLCHECK(bootstrapAllGather(state, state->peerBstrapHandles, sizeof(ncclNetHandle_t)));

  TRACE(NCCL_INIT, "rank %d nranks %d - DONE", rank, nranks);

  return ncclSuccess;
}
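Next, every rank accepts a new connection on extHandleListenRoot to create tmpRecvComm and receives the ip and port of its next rank. It then connects to next, storing the resulting bootstrap comm in state->extBstrapRingSendComm, and accepts the connection from prev, storing that bootstrap comm in state->extBstrapRingRecvComm. At this point the bootstrap network connections are fully established: the ranks form a ring in which each rank sends to its next and receives from its prev.
The last step is to gather the ip and port of every rank. Each rank first copies its own nethandle into its own slot (index rank) of peerBstrapHandles.
It then runs bootstrapAllGather: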
ncclResult_t bootstrapAllGather(void* commState, void* allData, int size) {
  struct extState* state = (struct extState*)commState;
  char* data = (char*)allData;
  int rank = state->rank;
  int nranks = state->nranks;

  TRACE(NCCL_INIT, "rank %d nranks %d size %d", rank, nranks, size);

  /* Simple ring based AllGather
   * At each step i receive data from (rank-i-1) from left
   * and send previous step's data from (rank-i) to right
   */
  for (int i=0; i<nranks-1; i++) {
    size_t rslice = (rank - i - 1 + nranks) % nranks;
    size_t sslice = (rank - i + nranks) % nranks;

    // Send slice to the right
    NCCLCHECK(bootstrapNetSend(state->extBstrapRingSendComm, data+sslice*size, size));
    // Recv slice from the left
    NCCLCHECK(bootstrapNetRecv(state->extBstrapRingRecvComm, data+rslice*size, size));
  }

  TRACE(NCCL_INIT, "rank %d nranks %d size %d - DONE", rank, nranks, size);
  return ncclSuccess;
}
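In step i, each rank sends slice (rank - i) to its next rank over extBstrapRingSendComm and receives slice (rank - i - 1) from its prev rank over extBstrapRingRecvComm, with slice indices taken modulo nranks. After nranks-1 steps every rank holds every slice.
Step one: each rank sends its own slice to next and receives slice rank-1 from prev.
Step two: each rank forwards the slice it received in step one and receives slice rank-2 from prev.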
At this point every rank has the ip and port of all ranks.
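The ring AllGather can be checked with a small in-process simulation that replaces the socket send and recv with a memcpy from the prev rank's buffer. This is a sketch, not NCCL code: ring_allgather_sim, MAX_RANKS and SLICE are hypothetical, and SLICE is an arbitrary small size standing in for sizeof(ncclNetHandle_t). The slice arithmetic is the same as in bootstrapAllGather.

```c
#include <assert.h>
#include <string.h>

#define MAX_RANKS 8
#define SLICE 4  /* bytes per rank; stands in for sizeof(ncclNetHandle_t) */

/* bufs[r] plays the role of rank r's allData buffer.
 * On entry only slice r of bufs[r] is filled in. */
void ring_allgather_sim(int nranks, char bufs[][MAX_RANKS * SLICE]) {
  assert(nranks >= 1 && nranks <= MAX_RANKS);
  for (int i = 0; i < nranks - 1; i++) {
    for (int r = 0; r < nranks; r++) {
      int prev = (r - 1 + nranks) % nranks;
      int rslice = (r - i - 1 + nranks) % nranks;  /* slice rank r receives */
      int sslice = (prev - i + nranks) % nranks;   /* slice prev sends */
      assert(rslice == sslice);  /* both ends agree on which slice moves */
      /* "send from prev, recv at r" collapses to a copy in this simulation */
      memcpy(bufs[r] + rslice * SLICE, bufs[prev] + sslice * SLICE, SLICE);
    }
  }
}
```

Processing the ranks one after another within a step is safe here because in step i the prev rank writes slice (prev - i - 1) but is read at slice (prev - i), so a rank never reads a slot that was modified in the same step.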
To summarize, this post covered how the bootstrap ring network connections are established and stored in ncclComm.