Disfluencies in face-to-face versus video-mediated communication
There is some research suggesting that the use of filled pauses (FPs) differs between face-to-face and telephone conversations (cf. the high rate of FPs in Switchboard versus the low rate in the Santa Barbara Corpus). In particular, speakers tend to use more filled pauses on the telephone. The hypothesized reason is that, because the visual channel is lost, speakers are more apt to manage conversational turns through additional vocal elements, and thus might use filled pauses more often to "hold" their conversational turn. (I leave aside the debatable question of whether speakers actually use filled pauses in that way.)
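To make the notion of an FP "rate" concrete, here is a minimal sketch of how one might compute filled pauses per 100 words from a plain-text transcript. The function name, the tokenization, and the choice of "uh" and "um" as the FP forms are my own assumptions for illustration; real corpora such as Switchboard have their own transcription conventions that would need to be handled properly.

```python
import re

# Assumed FP spellings for illustration only; corpus conventions differ.
FILLED_PAUSES = {"uh", "um"}

def filled_pause_rate(transcript, fp_forms=FILLED_PAUSES):
    """Return filled pauses per 100 word tokens (FPs count as tokens)."""
    # Lowercase and keep runs of letters, apostrophes, and hyphens as tokens,
    # so a backchannel like "uh-huh" stays one token and is not counted as "uh".
    tokens = re.findall(r"[a-z'-]+", transcript.lower())
    if not tokens:
        return 0.0
    fp_count = sum(1 for token in tokens if token in fp_forms)
    return 100.0 * fp_count / len(tokens)

print(filled_pause_rate("so um I think uh we should go"))  # 2 of 8 tokens -> 25.0
```

Comparing this rate across corpora is only meaningful if the transcripts mark filled pauses consistently, which is itself a non-trivial assumption.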
Now that video conferencing is becoming a ubiquitous form of communication, a good question to ask is whether the use of disfluencies in video-mediated communication differs from that in regular face-to-face communication.
A first hypothesis could be simply "no", on the reasoning that video-mediated communication is effectively the same as face-to-face communication: what differences there are -- such as less physical freedom (one must remain in camera view and near the microphone) -- don't affect disfluencies. But that hypothesis is probably too simplistic. The factor I think most likely to change disfluency patterns is the temporal delay, along with the occasional audiovisual breakups, that users experience.
At a bare minimum, if the temporal delay is long enough, interlocutors cannot be sure who has the current conversational turn. One starts to speak, only to realize that one's partner started speaking at the same time; because of the delay, the partner seems to "interrupt" a few moments into one's own speech. This can easily lead one to break off one's own speech in ways that may introduce several different types of disfluency. The distribution of these types could be quite different from that in face-to-face conversation, where simultaneous initiation of speech is usually resolved immediately. Negotiating control of the conversation after already speaking half a sentence or so is very different.
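The timing argument above can be sketched with a toy model. Assume (my simplification, not a measured result) that both speakers experience the same one-way delay and that a speaker only starts talking if the partner's speech has not yet reached them. The function below, whose name and parameters are hypothetical, returns how long speaker A has already been talking when B's onset finally arrives.

```python
def speech_before_hearing_partner(onset_a, onset_b, one_way_delay):
    """Seconds A has spoken when B's onset reaches A, or None if no collision.

    Toy model: all times are in seconds, and the one-way delay is assumed
    to be symmetric and constant, which real connections are not.
    """
    arrival_of_b_at_a = onset_b + one_way_delay
    arrival_of_a_at_b = onset_a + one_way_delay
    # If either speaker hears the other before their own planned onset,
    # the overlap is avoided (they simply would not have started).
    if arrival_of_b_at_a <= onset_a or arrival_of_a_at_b <= onset_b:
        return None
    return arrival_of_b_at_a - onset_a

# B starts 0.25 s after A. With a short delay, B hears A in time and holds back.
print(speech_before_hearing_partner(0.0, 0.25, 0.125))  # None
# With a longer delay, A is 0.75 s into the utterance before B "interrupts".
print(speech_before_hearing_partner(0.0, 0.25, 0.5))    # 0.75
```

In this model a collision occurs exactly when the two onsets are closer together than the one-way delay, so a longer delay both widens the window for simultaneous starts and pushes the apparent interruption deeper into the utterance.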
Note also that this delay can vary within a single session. In conventional telephony, once both parties recognize an existing delay, they can deal with it appropriately (by allowing a longer silence before uptake). In Internet telephony, the delay seems much more variable: one moment it may feel almost like real-time conversation, the next there is a noticeable delay, which may then disappear some time later. Thus, interlocutors are constantly renegotiating how to manage turns efficiently. This must lead to a non-trivial shift in the occurrence, and thus the distribution, of disfluencies.
But a further question is whether there is a difference in the quality of some disfluencies. That is, are filled pauses acoustically different -- for example, longer, higher-pitched, or produced with lower or greater intensity? To the extent that a speaker's voice changes in video-mediated communication, I would expect a change. For example, if one habitually speaks louder and slower when using Zoom, then I would suppose that filled pauses would be more intense and longer in duration than in that speaker's normal speech. If we could control for that in a data comparison, perhaps we would find no difference otherwise.
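One simple way to "control for that" would be to express each filled-pause measure relative to the same speaker's surrounding speech in the same condition. The sketch below is only one possible normalization, and the function name, inputs, and numbers are invented for illustration: it divides mean FP duration by mean syllable duration, and subtracts mean speech intensity from mean FP intensity (averaging dB values directly, which is a rough approximation).

```python
from statistics import mean

def relative_fp_measures(fp_durations, fp_intensities_db,
                         syllable_durations, speech_intensities_db):
    """Return (duration ratio, intensity difference in dB) for one condition.

    Durations are in seconds. All inputs are assumed to come from the same
    speaker in the same communicative condition.
    """
    duration_ratio = mean(fp_durations) / mean(syllable_durations)
    intensity_diff_db = mean(fp_intensities_db) - mean(speech_intensities_db)
    return duration_ratio, intensity_diff_db

# Invented numbers: on video the speaker is slower and louder overall.
face_to_face = relative_fp_measures([0.38, 0.42], [62.0, 64.0],
                                    [0.19, 0.21], [65.0, 67.0])
video = relative_fp_measures([0.48, 0.52], [66.0, 68.0],
                             [0.24, 0.26], [69.0, 71.0])
print(face_to_face, video)
```

With these made-up values the raw FP durations and intensities are higher on video, but the relative measures come out essentially equal across conditions, which is the "no difference otherwise" outcome described above.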
Could there be other, more subtle acoustic differences, such as in pitch, spectral tilt, or shimmer? I suppose it's possible, together with basic articulatory differences.
Finally, would there be any expected differences in the syntactic or discourse-related distribution of filled pauses? I would predict not. That is, whatever cognitive states lead a speaker to produce a filled pause in the middle of the speech production process would be expected to occur in any communicative situation (holding other contextual factors constant). Thus, I would suppose that the impetus to articulate a filled pause will still arise in the same sorts of syntactic and discourse-related positions.